
Series Decomposition

1. Decomposing Non-Seasonal Time Series

Moving Average (MA)

① SMA (Simple Moving Average)
The simple moving average takes the plain arithmetic mean of the previous n values of the series:
SMA_n = (x_1 + x_2 + … + x_n) / n

② WMA (Weighted Moving Average)
Weighted moving average. The basic idea is to strengthen the influence of recent data and weaken that of older data on the current estimate, so the smoothed value tracks the most recent trend more closely.
With w_i denoting the weight of period i, the weighted moving average is computed as:
WMA_n = w_1·x_1 + w_2·x_2 + … + w_n·x_n

R APIs for moving averages (in the TTR package):
install.packages("TTR")
SMA(ts, n=10)

• ts: the time-series data
• n: the length of the averaging window; defaults to 10

WMA(ts, n=10, wts=1:n)

• wts: the vector of weights; defaults to 1:n
# install.packages('TTR')
library(TTR)

data <- read.csv("data1.csv", fileEncoding = "UTF8")

# Plot the raw series, then overlay a 3-period simple moving average
plot(data$公司A, type = 'l')
data$SMA <- SMA(data$公司A, n = 3)
lines(data$SMA)

# Plot the raw series again, then overlay a 3-period weighted moving average
plot(data$公司A, type = 'l')
data$WMA <- WMA(data$公司A, n = 3, wts = 1:3)
lines(data$WMA)


2. Decomposing Seasonal Time Series

If a time series shows similar behavior after every n time intervals, the series is said to be periodic with period n.
Such a series decomposes into three parts:
① a trend component
② a seasonal component
③ an irregular component
R APIs for decomposing seasonal time series
Determining the period of the series

• freq <- spec.pgram(ts, taper=0, log='no', plot=FALSE)
• start <- which(freq$spec == max(freq$spec))  # position where the dominant cycle starts
• frequency <- 1/freq$freq[which(freq$spec == max(freq$spec))]  # the period length

Decomposing the series:
decompose(ts)

    data <- read.csv("data2.csv", fileEncoding = "UTF8")
    
    freq <- spec.pgram(data$总销量, taper=0, log='no', plot=FALSE);
    
    start <- which(freq$spec==max(freq$spec))
    frequency <- 1/freq$freq[which(freq$spec==max(freq$spec))]
    
    
    data$均值 <- data$总销量/data$分店数
    
    freq <- spec.pgram(data$均值, taper=0, log='no', plot=FALSE);
    
    start <- which(freq$spec==max(freq$spec))
    frequency <- 1/freq$freq[which(freq$spec==max(freq$spec))]
    
    plot(data$均值, type='l')
    
    meanTS <- ts(
      data$均值[start:length(data$均值)], 
      frequency=frequency
    ) 
    ts.plot(meanTS)
    
    meanTSdecompose <- decompose(meanTS)
    plot(meanTSdecompose)
    
    #趋势分解
    meanTSdecompose$trend
    #季节性分解数据
    meanTSdecompose$seasonal
    #随机部分
    meanTSdecompose$random
    

  • Seasonal time-series modeling and forecasting, 孟玲清, 王晓雨. A time series is statistical data formed by arranging the quantitative indicators of social, economic or natural phenomena in time order; it has several constituent factors, and each factor's effect on the system …
  • R time series: decomposing a seasonal time series


1. Seasonal time series

A seasonal series contains: a long-term trend (Trend), a seasonal component (Seasonal), a cyclical component (Circle) and a random component (Random).

Here we decompose it with the additive model X = T + S + C + R.

     

Before decomposing a time series, the series should be tested (to be covered in a later post).

2. The decompose() function

Performs the decomposition described above.

3. The decomposition procedure in R

3.1 Reading and visualizing the data

> # Example: monthly births in New York City, Jan 1946 to Dec 1959

    > births <-scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")

    Read 168 items

    > birthstimeseries <- ts(births, frequency=12, start=c(1946,1))

    > plot(birthstimeseries)

[Figure: monthly birth counts]

     

The plot shows that the birth counts have clear seasonality (summer peaks, winter troughs) and periodicity, together with a pronounced trend; the amplitude of fluctuation within each cycle is small and does not change with the trend, and the random component also shows no obvious change over time.

3.2 Decomposing the time series

Decompose using the additive model:

    >birthcomponents <- decompose(birthstimeseries)

    > plot(birthcomponents)

[Figure: decomposition plot]

3.3 Removing the seasonal component

Seasonality (among other components) can be removed; here we subtract the seasonal component:

> birthstimeseriesseasonallyadjusted <- birthstimeseries - birthcomponents$seasonal

> plot(birthstimeseriesseasonallyadjusted)

[Figure: monthly birth counts, seasonally adjusted]

  • Methods and applications of seasonal time-series modeling in SAS

Seasonal Time-Series Data Analysis

Why Exploratory Data Analysis?

You might have heard that before proceeding with a machine learning problem it is good to do an end-to-end analysis of the data by carrying out a proper exploratory data analysis. A common question that pops into people's heads after hearing this: why EDA?

· What is it that makes EDA so important?

· How do we do proper EDA and get insights from the data?

· What is the right way to begin exploratory data analysis?

So, let us see how we can perform exploratory data analysis and get useful insights from our data. For performing EDA I will take the dataset from Kaggle's M5 Forecasting Accuracy competition.

Understanding the Problem Statement:

Before you begin EDA, it is important to understand the problem statement. EDA depends on what you are trying to solve or find. If you don't align your EDA with the problem you are solving, it will just be plain plotting of meaningless graphs. So, let us understand the problem statement for this data.

Problem Statement:

We have hierarchical data for Walmart store products in different categories from three states, namely California, Wisconsin and Texas. Given this data we need to predict the sales of each product for 28 days. The training data consists of the individual sales of each product for 1914 days. Using this training data we need to make predictions for the following days.

We have the following files provided as part of the competition:

1. calendar.csv — Contains information about the dates on which the products are sold.
2. sales_train_validation.csv — Contains the historical daily unit sales data per product and store [d_1 — d_1913].
3. sample_submission.csv — The correct format for submissions. Reference the Evaluation tab for more info.
4. sell_prices.csv — Contains information about the price of the products sold per store and date.
5. sales_train_evaluation.csv — Includes sales [d_1 — d_1941] (labels used for the Public leaderboard).

Using this dataset we need to make the sales prediction for the next 28 days.

Analyzing Dataframes:

Now that you have understood the problem statement well, the first thing to do in EDA is to analyze the dataframes and understand the features present in our dataset.

As mentioned earlier, we have 5 different CSV files for this data. Hence, to begin, we will first print the head of each dataframe to get an intuition for the features and the dataset.

Here, I am using Python's pandas library for reading the data and printing the first few rows. View the first few rows and write down your observations:

Calendar Data:

First few rows:
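A minimal sketch of this loading step (the original snippet was an image; the file names are assumed to follow the competition download):

import pandas as pd

# Load the competition files (paths assumed)
calendar = pd.read_csv('calendar.csv')
sales = pd.read_csv('sales_train_validation.csv')
prices = pd.read_csv('sell_prices.csv')

# Print the first few rows of each dataframe
print(calendar.head())
print(sales.head())
print(prices.head())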

Value Counts Plot:

To get a visual idea about our data we will plot the value counts for each category of the calendar dataframe. For this we will use the Seaborn library.

[Figures: value counts for each day of the week, each month, each year, each event name, and each event type (type_1 and type_2)]
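The plotting code itself was shown as an image; a sketch, assuming the calendar dataframe from the loading step above and the M5 column names:

import seaborn as sns
import matplotlib.pyplot as plt

# One bar chart of value counts per categorical calendar column
for col in ['weekday', 'month', 'year', 'event_name_1', 'event_type_1', 'event_type_2']:
    plt.figure(figsize=(10, 4))
    sns.countplot(x=col, data=calendar)
    plt.title('Value counts for ' + col)
    plt.xticks(rotation=45)
    plt.show()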

Observations from the Calendar Dataframe:

1. We have the date, weekday, month, year and event of every day for which we have forecast information.

2. We also see many NaN values in our data, especially in the event fields; for days without an event we have a missing-value placeholder.

3. We have data for all weekdays with equal counts. Hence, it is safe to say we do not have any missing entries here.

4. We have a higher count of values for the months of March, April and May. For the last quarter, the count is low.

5. We have data from 2011 to 2016, although we don't have data for all days of 2016. This explains the higher count of values for the first few months.

6. We also have a list of events that might be useful in analyzing trends and patterns in our data.

7. We have more data for cultural events than for religious events.

Hence, by just plotting a few basic graphs we were able to grab some useful information about our dataset that we didn't know earlier. That is amazing indeed. So, let us try the same for the other CSV files we have.

Sales Validation Dataset:

First few rows:

Next, we will explore the validation dataset provided to us:

[Figure: first five rows of the validation data]

Value counts plot:

[Figures: value counts for each store, each state, each category, and each department]
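The same count-plot approach sketched above for the calendar works for the categorical columns of the sales dataframe (column names follow the M5 schema):

# Value counts for the categorical columns of the sales dataframe
for col in ['store_id', 'state_id', 'cat_id', 'dept_id']:
    plt.figure(figsize=(10, 4))
    sns.countplot(x=col, data=sales)
    plt.title('Value counts for ' + col)
    plt.xticks(rotation=45)
    plt.show()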

Observations from the Sales Data:

1. We have data for three different categories: Household, Foods and Hobbies.

2. We have data for three different states: California, Wisconsin and Texas. Of these three states, the maximum sales come from California.

3. Sales for the Foods category are the highest.

Sell Price Data:

First few rows:

[Figure: first 5 rows of the sell price data]

Observations:

1. Here we have the sell_price of each item.

2. We have already seen the item_id and store_id plots earlier.

Asking Questions of your Data:

So far we have seen the basic EDA plots. The plots above gave us a brief overview of the data we have. For the next phase we need to find answers to the questions we have about our data. This depends on our problem statement.

For example:

In our data we need to forecast the sales of each product for the next 28 days. Hence, we need to know whether there are any patterns in the sales before those 28 days, because if there are, the sales are likely to follow the same pattern for the next 28 days too.

So, here goes our first question:

What is the sales distribution in the past?

To find out, let us randomly select a few products and look at their sales distribution over the 1914 days given in our validation data:

[Figures: sales plots for FOODS_3_0900_CA_3_validation, HOUSEHOLD_2_348_CA_1_validation and FOODS_3_325_TX_3_validation]
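A sketch of how one of these per-product plots could be produced, assuming the sales dataframe loaded earlier (the plot_product helper name is ours):

# Plot the daily sales of one product across the d_1 ... d_N columns
def plot_product(item_id):
    day_cols = [c for c in sales.columns if c.startswith('d_')]
    series = sales.loc[sales['id'] == item_id, day_cols].values.flatten()
    plt.figure(figsize=(14, 4))
    plt.plot(series)
    plt.title('Daily sales of ' + item_id)
    plt.xlabel('day')
    plt.ylabel('units sold')
    plt.show()

plot_product('FOODS_3_0900_CA_3_validation')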

Observations:

1. The plots are very random and it is difficult to find a pattern.

2. For FOODS_3_0900_CA_3_validation we see that on day 1 the sales were high, after which they were nil for some time. They then reached a high again and have been fluctuating up and down since. The sudden fall after day 1 might be because the product went out of stock.

3. For HOUSEHOLD_2_348_CA_1_validation we see that the sales plot is extremely random, with a lot of noise. On some days sales are high and on others they drop considerably.

4. For FOODS_3_325_TX_3_validation we see absolutely no sales for the first 500 days. This means that for the first 500 days the product was not in stock. After that, the sales peak roughly every 200 days. Hence, for this food product we see a seasonal dependency.

Hence, by just plotting a few random sales graphs we were able to draw some important insights from our dataset. These insights will also help us choose the right model for the training process.

What is the Sales Pattern on a Weekly, Monthly and Yearly Basis?

We saw earlier that there are seasonal trends in our data. So next let us break down the time variables and look at the weekly, monthly and yearly sales patterns:

[Figure: weekly average sales of HOUSEHOLD_1_118_CA_3_validation]

For this particular product, HOUSEHOLD_1_118_CA_3_validation, we can see that sales drop after Tuesday and hit their minimum on Saturday.

[Figure: monthly average sales of HOUSEHOLD_1_118_CA_3_validation]

The monthly sales drop in the middle of the year, reaching a minimum in the 7th month, July.

[Figure: yearly average sales of HOUSEHOLD_1_118_CA_3_validation]

From the yearly graph we can see that sales dropped to zero from 2013 to 2014. This means the product might have been replaced by a new product version or simply removed from this store. From this plot it is safe to say that for the days we have to predict, the sales should still be zero.
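A sketch of how such weekly/monthly/yearly averages could be computed, by transposing the wide sales table and joining the calendar (the average_sales helper is ours, not the article's exact code):

# Average sales by weekday / month / year for one product
def average_sales(item_id):
    day_cols = [c for c in sales.columns if c.startswith('d_')]
    row = sales.loc[sales['id'] == item_id, day_cols].T
    row.columns = ['sales']
    row['d'] = row.index
    merged = row.merge(calendar[['d', 'weekday', 'month', 'year']], on='d')
    for col in ['weekday', 'month', 'year']:
        merged.groupby(col)['sales'].mean().plot(
            kind='bar', figsize=(8, 3),
            title='Average sales by ' + col)
        plt.show()

average_sales('HOUSEHOLD_1_118_CA_3_validation')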

What is the Sales Distribution in Each Category?

We have sales data belonging to three different categories. Hence, it is worth checking whether the sales of a product depend on the category it belongs to, which we will do now:

[Figure: sales distribution for each category]

We see that sales are highest for Foods. Also, the sales curve for Foods does not overlap at all with the other two categories. This shows that on any given day the sales of Foods exceed those of Household and Hobbies.
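One way such a category comparison could be drawn, assuming the sales dataframe from before (a density-plot sketch, not the article's exact code):

# Distribution of daily total sales per category
day_cols = [c for c in sales.columns if c.startswith('d_')]
for cat in sales['cat_id'].unique():
    daily_totals = sales.loc[sales['cat_id'] == cat, day_cols].sum(axis=0)
    sns.kdeplot(daily_totals, label=cat)
plt.legend()
plt.title('Daily total sales per category')
plt.show()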

What is the Sales Distribution for Each State?

Besides category, we also know the state each sale belongs to. So let us analyze whether there is a state for which the sales follow a different pattern:

[Figure: sales distribution for each state]

What is the Sales Distribution for Products in the Hobbies Category on a Weekly, Monthly and Yearly Basis?

Now let us look at the sales of randomly selected products from the Hobbies category and see whether their weekly, monthly or yearly averages follow a pattern:

[Figures: weekly, monthly and yearly average sales for several randomly selected Hobbies products]

Observations

From the plots above we see that in mid-week, usually on the 4th and 5th days (Tuesday and Wednesday), sales drop, especially for the states WI and TX.

Let us analyze the results for individual states to see this more clearly, since we see different sales patterns for different states. And this brings us to our next question:

What is the Sales Distribution for Products in the Hobbies Category on a Weekly, Monthly and Yearly Basis for a Particular State?

[Figures: weekly, monthly and yearly average sales for randomly selected Hobbies products in the state of Wisconsin; a sketch of the filtering step follows below]
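The selection step could look like this, reusing the average_sales helper sketched earlier (the original code was an image; this is an assumed reconstruction):

# Restrict to Hobbies products sold in Wisconsin and sample a few at random
hobbies_wi = sales[(sales['cat_id'] == 'HOBBIES') & (sales['state_id'] == 'WI')]
for item_id in hobbies_wi['id'].sample(3, random_state=42):
    average_sales(item_id)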

Observations:

1. From the above plots, we can see that in the state of Wisconsin, for most of the products, sales decrease considerably in mid-week.

2. This also gives us a little sense of the lifestyle of people in Wisconsin: people here do not shop much during days 3-4, Monday and Tuesday. This probably might be because these are the busiest days of the week.

3. From the monthly averages we can see that sales often experienced a dip in the first quarter.

4. For the product HOBBIES_1_369_WI_2_validation, we see that the sales data is nil until 2014. This shows that the product was introduced after that year, and the weekly and monthly patterns we see for it date from after 2014.

What is the Sales Distribution for Products in the Foods Category on a Weekly, Monthly and Yearly Basis?

Analyzing Hobbies on its own gave us some useful insights. Let us try the same for the Foods category:

[Figures: weekly, monthly and yearly average sales for several Foods products]

Observations:

1. From the plots above we can say that, for food items, purchases are higher early in the week than on the last two days.

2. This might be because people habitually buy food supplies at the start of the week and then keep them for the entire week. The curves show us this behavior.

What is the Sales Distribution for Products in the Household Category on a Weekly, Monthly and Yearly Basis?

[Figures: weekly, monthly and yearly average sales for several Household products]

Observations:

1. From the plots above we can say that, for Household items, purchases dip on Monday and Tuesday.

2. At the start of the week people are busy with office work and hardly go shopping. This is the pattern we see here.

Is there a way to see the sales of products more clearly without losing information?

We saw the sales distribution plots for individual products earlier. They were quite cluttered, and we couldn't see the patterns clearly. Hence, you might be wondering whether there is a way to do so. And the good news is: yes, there is.

Here denoising comes into the picture. We will denoise our dataset and look at the distribution again. We will cover two common denoising techniques: wavelet denoising and moving averages.

Wavelet Denoising:

From the sales plots of individual products we saw that the sales change rapidly, because the sales of a product on a given day depend on multiple factors. So let us try denoising our data and see whether we can find anything interesting.

The basic idea behind wavelet denoising, or wavelet thresholding, is that the wavelet transform leads to a sparse representation of many real-world signals and images. What this means is that the wavelet transform concentrates signal and image features in a few large-magnitude wavelet coefficients. Wavelet coefficients that are small in value are typically noise, and you can "shrink" those coefficients or remove them without affecting the signal or image quality. After thresholding the coefficients, you reconstruct the data using the inverse wavelet transform.

For wavelet denoising, we require the pywt library.

Here we will use wavelet denoising. For deciding the denoising threshold we will use the Mean Absolute Deviation.

[Figures: sales series before and after wavelet denoising]
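A sketch of such a denoiser with pywt (the wavelet choice and the MAD-based universal threshold are our assumptions, not the article's exact code):

import numpy as np
import pywt

def wavelet_denoise(series, wavelet='db4'):
    # Decompose the signal into approximation and detail coefficients
    coeffs = pywt.wavedec(series, wavelet, mode='per')
    # Robust noise estimate from the finest detail coefficients (MAD / 0.6745),
    # turned into the universal threshold sigma * sqrt(2 * log n)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    uthresh = sigma * np.sqrt(2 * np.log(len(series)))
    # Soft-threshold every detail level, then invert the transform
    coeffs[1:] = [pywt.threshold(c, value=uthresh, mode='soft') for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet, mode='per')

# series: 1-D array of daily sales for one product (see the extraction sketch earlier)
denoised = wavelet_denoise(np.asarray(series, dtype=float))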

Observations:

We are able to see the pattern more clearly after denoising the data. The series shows the same pattern every 500 days, which we were not able to see before denoising.

Moving Average Denoising:

Let us now try a simple smoothing technique. In this technique, we take a fixed window size and move it along our time-series data, calculating the average. We also take a stride value so as to space the intervals accordingly. For example, let's say we take a window size of 20 and a stride of 5. Then our first point will be the mean of day 1 to day 20, the next the mean of day 5 to day 25, then day 10 to day 30, and so on.

So, let us try this average smoothing on our dataset and see whether we find any kind of patterns here.

[Figure: moving-average smoothed sales series]
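A direct sketch of that windowed average, with the window and stride values described above (not the article's exact code):

def moving_average(series, window=20, stride=5):
    # Mean over a sliding window, sampled every `stride` steps
    return np.array([
        series[i:i + window].mean()
        for i in range(0, len(series) - window + 1, stride)
    ])

# series: 1-D array of daily sales for one product, as before
smoothed = moving_average(np.asarray(series, dtype=float))
plt.plot(smoothed)
plt.title('Moving-average smoothed sales')
plt.show()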

Observations:

We see that average smoothing does remove some noise, but it is not as effective as the wavelet denoising.

Do the sales vary overall for each state?

Now, from a broader perspective, let us see whether the sales vary for each state:

[Figures: average sales pattern for each store, and a box plot of the sales distribution of each store]
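One way these store-level views could be produced (a sketch over the sales dataframe, not the article's exact code):

# Average daily sales per store: line plot over time, plus a box plot
day_cols = [c for c in sales.columns if c.startswith('d_')]
store_daily = sales.groupby('store_id')[day_cols].mean().T

store_daily.plot(figsize=(14, 5), title='Average daily sales per store')
plt.show()

store_daily.boxplot(figsize=(14, 5))
plt.title('Distribution of average daily sales per store')
plt.show()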

Observations:

1. From the above plot we can see that the sales for store CA_3 lie above the sales of all other stores. The same applies to CA_4, where sales are lowest. For the other stores the patterns are distinguishable to some extent.

2. One thing we observe is that all these patterns follow a similar trend that repeats itself after some time. Also, the sales reach higher values as the graph progresses.

3. As we saw in the line plot, the box plot also shows non-overlapping sales patterns for CA_3 and CA_4.

4. There is no overlap between the California stores, quite independently of the fact that they all belong to the same state. This shows high variance within the state of California.

5. For Texas, the stores TX_1 and TX_3 have quite similar patterns and intersect a couple of times. But TX_2 lies above them, with maximum sales and more disparity compared to the other two. In the later part, we see that TX_3 is growing rapidly and approaching TX_2. Hence, from this, we can conclude that sales for TX_3 are increasing at the fastest pace.

Conclusion:

Hence, by plotting just a few simple graphs we were able to get to know our dataset quite well. It is just a matter of which questions you want to ask of the data; the plots will give you all the answers.

I hope this has given you an idea of how to do simple EDA. You can find the complete code in my GitHub repository.

1. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course

2. https://www.kaggle.com/tarunpaparaju/m5-competition-eda-models/output

3. https://mobidev.biz/blog/machine-learning-methods-demand-forecasting-retail

4. https://www.mygreatlearning.com/blog/how-machine-learning-is-used-in-sales-forecasting/

5. https://medium.com/@chunduri11/deep-learning-part-1-fast-ai-rossman-notebook-7787bfbc309f

6. https://www.kaggle.com/anshuls235/time-series-forecasting-eda-fe-modelling

7. https://eng.uber.com/neural-networks/

8. https://www.kaggle.com/mayer79/m5-forecast-keras-with-categorical-embeddings-v2

Translated from: https://medium.com/analytics-vidhya/how-to-guide-on-exploratory-data-analysis-for-time-series-data-34250ff1d04f


0 Steps of a SARIMAX time-series analysis

1. Process the time-series data with pandas

2. Test the series for stationarity

3. Make the series stationary

4. Determine the p, d, q values of order

5. Determine the four values of seasonal_order

6. Apply the SARIMAX model to forecast the series

Compared with ARIMA, SARIMAX only adds the determination of the seasonal_order parameter, but that is also the most time-consuming step here.

1 Converting the data into a time series

First, import all the required packages in one go:

import pandas as pd
import datetime
import matplotlib.pyplot as plt
from pylab import mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
import seaborn as sns
import statsmodels.tsa.stattools as ts
import statsmodels.api as sm
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.stats.diagnostic import unitroot_adf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import itertools
import warnings
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
# Read the data and use the date column as the index
data = pd.read_csv('factor.csv')
data.index = pd.to_datetime(data['date'])
data.drop(['date'], axis=1, inplace=True)
data = data.result
data.head()
# Quick look at the series
data.plot(figsize=(12, 8))
plt.legend(bbox_to_anchor=(1.25, 0.5))
plt.title('result')
sns.despine()
plt.show()

     

2 Testing the series for stationarity

# Stationarity test: only a stationary series is suitable for time-series modeling
def judge_stationarity(data_sanya_one):
    dftest = ts.adfuller(data_sanya_one)
    print(dftest)
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
    stationarity = 1
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
        if dftest[0] > value:
            stationarity = 0
    print(dfoutput)
    print("Stationary (1/0): %d" % stationarity)
    return stationarity
stationarity = judge_stationarity(data)

3 Making the series stationary

# If the series is not stationary, take the first difference
if stationarity == 0:
    data_diff = data.diff()
    data_diff = data_diff.dropna()
    plt.figure()
    plt.plot(data_diff)
    plt.title('First-order difference')
    plt.show()

# Test for stationarity again
stationarity = judge_stationarity(data_diff)

4 Run a seasonal decomposition to see whether there is seasonality

# Seasonal decomposition (note: the freq argument was renamed period in newer statsmodels)
decomposition = seasonal_decompose(data, freq=28)
trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

plt.figure(figsize=[15, 7])
decomposition.plot()
print("test: p={}".format(ts.adfuller(seasonal)[1]))

# Stationarity test of the residual component (drop the NaN edges first)
stationarity = judge_stationarity(residual.dropna())

5 Determining the order parameters p and q. Either of the two approaches below works, but I still find the plots hard to read, so I use the second, brute-force approach (somewhat more time-consuming).

# Use ACF and PACF plots to determine p and q
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def draw_acf_pacf(ts, lags):
    f = plt.figure(facecolor='white')
    ax1 = f.add_subplot(211)
    plot_acf(ts, ax=ax1, lags=lags)  # lags: number of lags to display, e.g. 30
    ax2 = f.add_subplot(212)
    plot_pacf(ts, ax=ax2, lags=lags)
    plt.subplots_adjust(hspace=0.5)
    plt.show()
draw_acf_pacf(data_diff, 30)  # the differenced series from step 3
# Grid-search p and q by BIC
warnings.filterwarnings("ignore")  # ignore warning messages
from statsmodels.tsa.arima_model import ARIMA

pmax = int(5)    # the order rarely needs to exceed length/10
qmax = int(5)
bic_matrix = []
for p in range(pmax + 1):
    temp = []
    for q in range(qmax + 1):
        try:
            temp.append(ARIMA(data, (p, 1, q)).fit().bic)
        except:
            temp.append(None)
    bic_matrix.append(temp)  # append once per p, outside the inner loop

bic_matrix = pd.DataFrame(bic_matrix)  # convert to a DataFrame
p, q = bic_matrix.stack().idxmin()  # flatten with stack, then locate the minimum with idxmin
print('p and q with the smallest BIC: %s,%s' % (p, q))  # here: p=0, q=1

6 Choosing seasonal_order by grid search

# Grid search over seasonal_order; so far pdq=(0,1,1) with seasonal_order=(2,2,1,52)
# works best on this data (RMSE = 202.4582)
def get_ARIMA_params(data, pdq, m=12):
    p = d = q = range(0, 3)
    seasonal_pdq = [(x[0], x[1], x[2], m) for x in list(itertools.product(p, d, q))]
    score_aic = 1000000.0
    warnings.filterwarnings("ignore")  # ignore warning messages
    for param_seasonal in seasonal_pdq:
        mod = sm.tsa.statespace.SARIMAX(data,
                                        order=pdq,
                                        seasonal_order=param_seasonal,
                                        enforce_stationarity=False,
                                        enforce_invertibility=False)
        results = mod.fit()
        print('x{}12 - AIC:{}'.format(param_seasonal, results.aic))
        if results.aic < score_aic:
            score_aic = results.aic
            params = param_seasonal, results.aic
    param_seasonal, best_aic = params  # unpack into a local variable, not results.aic
    print('Best: x{}12 - AIC:{}'.format(param_seasonal, best_aic))
pdq = [0, 1, 1]
get_ARIMA_params(data, pdq, m=52)

The key point above is how to set m, which I default to 12; m is the seasonal period. In the reference code, monthly data used m = 12. My data here is weekly, so the seasonal period should be 52: a year has 12 months and 52 weeks, so this logic should hold.

7 Fitting the model with the chosen parameters

    mod = sm.tsa.statespace.SARIMAX(data,
                                    order=(0, 1, 1),
                                    seasonal_order=(2, 1, 2, 52),
                                    enforce_stationarity=False,
                                    enforce_invertibility=False)
    results = mod.fit()
    print(results.summary().tables[1])
    results.plot_diagnostics(figsize=(15, 12))
    plt.show()

Note that the model here is fitted on the original data, not the differenced data: since d = 1 is set in the order parameter, the fit differences the series once automatically and un-differences the results when predicting.

8 Plot the predicted vs. actual values and compute the RMSE as the evaluation metric

predict_ts = results.predict(typ='levels')  # typ='levels' predicts the original levels; without it the differences are predicted
myts = data[predict_ts.index]  # keep only the records that have predictions

predict_ts.plot(color='blue', label='Predict', figsize=(12, 8))

myts.plot(color='red', label='Original', figsize=(12, 8))

plt.legend(loc='best')
plt.title('RMSE: %.4f' % np.sqrt(sum((predict_ts - myts) ** 2) / myts.size))
plt.show()

     

9 Plotting the out-of-sample forecast

    steps = 20
    start_time = myts.index[-1]
    forecast_ts = results.forecast(steps)
     
    fore = pd.DataFrame()
    fore['date'] = pd.date_range(start=start_time ,periods=steps, freq='7D')
    fore['result'] = pd.DataFrame(forecast_ts.values)
    fore.index = pd.to_datetime(fore['date'])
     
    predict_ts['2019/1/18':].plot(color='blue', label='Predict',figsize=(12,8))
    myts['2019/1/18':].plot(color='red', label='Original',figsize=(12,8))
    fore.result.plot(color='black', label='forecast',figsize=(12,8))
    
    plt.legend(loc='best')
    plt.show()

The function pd.date_range is designed for generating datetime indexes: start = the start time, end = the end time, periods = the number of index points, and freq='7D' means one index point every 7 days; it can also be '7W' for seven weeks, 'M' for one month, and so on.

Since the forecast values come back without a datetime index, only sequence numbers, I generate a datetime index for them here and merge it into the dataframe, so they can be shown on the plot together with the other values.

     

The forecast looks decent in the end, so save it to a file:

    fore.to_csv('forecast_20steps.csv')

     

  • Concepts. Time Series: a time series is a sequence of observations taken at uniform time intervals ... By seasonality, time series are divided into seasonal and non-seasonal time series. Non-seasonal time series: a trend part and an irregular part; seasonal time …
  • Time-series forecasting: non-seasonal ARIMA and seasonal SARIMA

    We will first introduce and discuss the concepts of autocorrelation, stationarity and seasonality, and then move on to applying one of the most commonly used time-series forecasting methods, ARIMA. Introduction: time series offer the opportunity to predict future values; based on previous values, a time series can be used to forecast the economy, the weather, and …
  • Decomposing a time series (seasonal data)

    A seasonal time series contains three parts: a trend component, a seasonal component and an irregular component. Decomposing the time series means splitting it into these three parts and estimating them. For the trend and seasonal parts of time series that can be described with an additive model, …
  • This time we begin decomposing seasonal time series. A seasonal time series contains three parts: a trend component, a seasonal component and an irregular component. Decomposing the time series means splitting it into these three parts and then estimating them. For time series that can be described with an additive model …
  • Decomposing a time series means splitting it into its constituent components ... A non-seasonal time series contains a trend part and an irregular part. Decomposing the series means trying to split it into these components, that is, estimating the trend part and the irregular part …
  • Objects for manipulating sequential and seasonal time series. Handles sequential time series based on time points and durations; both can be regularly or irregularly spaced (overlapping durations are allowed). Dates and times use POSIX* formats only. The following classes are provided: POSIXcti, POSIXctp, …
  • For forecasting seasonal time series, the "seasonal coefficient method" can be used to predict the trend of the series. In time-series problems, "season" does not only mean the four seasons of the year; series with quarterly or otherwise periodic behavior, such as monthly data, can also be forecast with the seasonal coefficient method. Its steps are as follows …
  • Contents: 1 Overview; 2 The seasonal component of a time series; 3 Benefits for machine learning; 4 Types of seasonality; 4 Removing seasonality; 5 Daily minimum temperature data; 6 Differencing; 6.1 Differencing daily data; 6.2 Differencing monthly averages; 7 Correction through modeling. 1 Overview: time-series data may contain some seasonal …
  • Seasonal ARIMA: time-series forecasting

    SARIMAX (seasonal autoregressive integrated moving average with exogenous regressors) is a common time-series forecasting method; it can be split into a trend part and a seasonal part, and each part can in turn be split into autoregressive, differencing and smoothing parts. Trend-stationary …
  • Testing the series for stationarity: we can see that the series as a whole does not meet the stationarity requirement. To turn it into a stationary series, there are several methods: deflation by CPI, logarithmic transform, first difference, seasonal difference …
  • Time series: removing seasonal factors

    A time-series dataset can contain a seasonal component: a cycle that repeats over time, such as monthly or yearly. This repeating cycle can obscure the signal we want to model when forecasting, and in turn can also provide a strong signal to our predictive models. A strong seasonal component can be seen …
