Background: LSTMs vs. CNNs
An LSTM (long short-term memory network) is a type of recurrent neural network that can account for sequential dependencies in a time series.
Given that correlations exist between observations in a given time series (a phenomenon known as autocorrelation), a standard neural network would treat all observations as independent, which is erroneous and would generate misleading results.
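As a quick check, the autocorrelation of a series can be inspected before modelling. Here is a minimal sketch, assuming the daily observations are held in a 1-D NumPy array called series (statsmodels is used purely for illustration):

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf

# Plot autocorrelation at lags 1-30; bars outside the shaded band are significant
plot_acf(series, lags=30)
plt.show()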
A convolutional neural network applies a process known as convolution to determine the relationship between two functions: given two functions f and g, the convolution integral expresses how the shape of one function is modified by the other. Such networks are traditionally used for image classification, and they do not account for sequential dependencies in the way that a recurrent neural network can.
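To make the idea concrete, here is the discrete analogue of that integral using NumPy (the arrays are purely illustrative):

import numpy as np

f = np.array([1, 2, 3, 4, 5], dtype=float)
g = np.array([0.25, 0.5, 0.25])        # a simple smoothing kernel
# np.convolve slides the flipped kernel g across f and sums the overlaps
print(np.convolve(f, g, mode="same"))  # [1.  2.  3.  4.  3.5]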
However, the main advantage of CNNs for forecasting time series lies in dilated convolutions: filters that skip a set number of cells between the inputs they sample. Widening this spacing allows the network to capture relationships between observations that sit further apart in the time series.
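A small sketch of a dilated convolution in Keras (the layer sizes are arbitrary): with dilation_rate=2, a kernel of size 3 at time step t samples the inputs at t, t-2 and t-4, covering more history with the same number of weights.

import numpy as np
from tensorflow import keras

layer = keras.layers.Conv1D(filters=1, kernel_size=3, dilation_rate=2,
                            padding="causal")
x = np.arange(10, dtype="float32").reshape(1, 10, 1)  # (batch, time, channels)
print(layer(x).shape)  # (1, 10, 1)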
For this reason, LSTM and CNN layers are often combined when forecasting a time series. This allows the LSTM layer to account for sequential dependencies in the time series, while the CNN layer further informs this process through the use of dilated convolutions.
With that being said, standalone CNNs are increasingly being used for time series forecasting, and a stack of several Conv1D layers can actually produce quite impressive results, rivalling those of a model that uses both CNN and LSTM layers.
How is this possible? Let’s find out!
The below example was designed using a CNN template from the Intro to TensorFlow for Deep Learning course from Udacity — this particular topic is found in Lesson 8: Time Series Forecasting by Aurélien Géron.
Our Time Series Problem
The below analysis is based on data from Antonio, Almeida and Nunes (2019): Hotel booking demand datasets.
Imagine this scenario: a hotel is having difficulty forecasting booking cancellations on a day-to-day basis, which makes it hard to forecast revenue and to allocate hotel rooms efficiently.
The hotel would like to solve this problem by building a time series model that can forecast the fluctuations in daily hotel cancellations with reasonably high accuracy.
Here is a time series plot of the fluctuations in daily hotel booking cancellations:
Model Configuration
The neural network is structured as follows:
Here are the important model parameters that must be accounted for.
Kernel Size
The kernel size is set to 3, meaning that each output is calculated based on the previous three time steps.
Here is a rough illustration:
Setting the correct kernel size is a matter of experimentation, as a low kernel size risks poor model performance, while a high kernel size risks overfitting.
As can be seen from the diagram, three input time steps are taken and used to generate a separate output.
In this instance, causal padding is used to ensure that the output sequence has the same length as the input sequence. In other words, the network "pads" time steps on the left side of the series so that future values on the right side are not used in generating the forecast; using them would produce false results and lead us to overestimate the accuracy of our model.
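The effect of causal padding can be verified on a toy input (the layer sizes are arbitrary):

import numpy as np
from tensorflow import keras

x = np.arange(6, dtype="float32").reshape(1, 6, 1)
# Causal padding pads (kernel_size - 1) zeros on the LEFT only, so each output
# at time t depends on inputs at t, t-1 and t-2, never on future values
causal = keras.layers.Conv1D(1, kernel_size=3, padding="causal")
valid = keras.layers.Conv1D(1, kernel_size=3, padding="valid")
print(causal(x).shape)  # (1, 6, 1): same length as the input
print(valid(x).shape)   # (1, 4, 1): no padding shortens the sequence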
The stride length is set to one, which means that the filter slides forward by one time step at a time when forecasting future values.
However, this could be set higher. For instance, setting the stride length to two would mean that the output sequence would be approximately half the length of the input sequence.
A long stride means the model may discard valuable data when generating the forecast, but increasing the stride can be useful for capturing longer-term trends and smoothing out noise in the series.
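Again on a toy input, a stride of two roughly halves the output length (padding="same" is used here so that only the stride affects the length):

import numpy as np
from tensorflow import keras

x = np.arange(10, dtype="float32").reshape(1, 10, 1)
stride_2 = keras.layers.Conv1D(1, kernel_size=3, strides=2, padding="same")
print(stride_2(x).shape)  # (1, 5, 1): ten input steps become five outputs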
Here is the model configuration (the filter and unit counts follow the course template):
model = keras.models.Sequential([
    keras.layers.Conv1D(filters=32, kernel_size=3, strides=1, padding="causal",
                        activation="relu", input_shape=[None, 1]),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.Dense(1),
    keras.layers.Lambda(lambda x: x * 200)  # rescale outputs to the series' range
])
lr_schedule = keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10**(epoch / 20))  # sweep the learning rate upward
optimizer = keras.optimizers.SGD(lr=1e-8, momentum=0.9)
model.compile(loss=keras.losses.Huber(), optimizer=optimizer, metrics=["mae"])
history = model.fit(train_set, epochs=100, callbacks=[lr_schedule])  # epoch count assumed
Firstly, let’s make forecasts using the above model on different window sizes.
It is important that the window size is large enough to account for the volatility across time steps.
Suppose we start with a window size of 5.
window_size = 5
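For reference, the train_set passed to model.fit above can be built with the tf.data API. This is a sketch following the course notebook; the helper name, x_train and the default batch and shuffle sizes are assumptions:

import tensorflow as tf

def seq2seq_window_dataset(series, window_size, batch_size=32,
                           shuffle_buffer=1000):
    # Slice the series into overlapping windows of window_size + 1 steps,
    # pairing each window with the same window shifted one step ahead
    series = tf.expand_dims(series, axis=-1)
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size + 1))
    ds = ds.shuffle(shuffle_buffer)
    ds = ds.map(lambda w: (w[:-1], w[1:]))
    return ds.batch(batch_size).prefetch(1)

train_set = seq2seq_window_dataset(x_train, window_size=5)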
The training loss across the learning rate sweep is as follows:

plt.semilogx(history.history["lr"], history.history["loss"])
plt.axis([1e-8, 1e-4, 0, 30])
Here is a visual of the forecasts versus actual daily cancellation values:
rnn_forecast = model_forecast(model, series[:, np.newaxis], window_size)
rnn_forecast = rnn_forecast[split_time - window_size:-1, -1, 0]
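The model_forecast helper is not shown above; here is a sketch along the lines of the course notebook (the batch size is an assumption):

import tensorflow as tf

def model_forecast(model, series, window_size):
    # Slide a window of length window_size across the series
    # and predict on each window in batches
    ds = tf.data.Dataset.from_tensor_slices(series)
    ds = ds.window(window_size, shift=1, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(window_size))
    ds = ds.batch(32).prefetch(1)
    return model.predict(ds)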
The mean absolute error is calculated:
>>> keras.metrics.mean_absolute_error(x_valid, rnn_forecast).numpy()
With a mean absolute error of 19.89 across the validation set, the model's accuracy is reasonable. However, the diagram above shows that the model falls short when forecasting more extreme values.
window_size = 30
What if the window size was increased to 30?
The mean absolute error decreases slightly:
>>> keras.metrics.mean_absolute_error(x_valid, rnn_forecast).numpy()
As mentioned, the stride length can be set higher if we wish to smooth out the forecast, with the caveat that such a forecast (the output sequence) will have fewer data points than the input sequence.
Forecasting Without an LSTM Layer
Unlike an LSTM, a CNN is not recurrent, which means that it does not retain memory of previous time series patterns. Instead, it can only learn from the data fed to the model at a particular time step.
However, by stacking several Conv1D layers together, it is in fact possible for a convolutional neural network to effectively learn long-term dependencies in the time series.
This can be done using a WaveNet architecture. Essentially, this means that the model defines every layer as a 1D convolutional layer with a stride length of 1 and a kernel size of 2. The second convolutional layer uses a dilation rate of 2, which means that every second input timestep in the series is skipped. The third layer uses a dilation rate of 4, the fourth layer uses a dilation rate of 8, and so on.
The reason for this is that it allows the lower layers to learn short-term patterns in the time series, while the higher layers learn longer-term patterns.
The WaveNet model is defined as follows (again, the filter counts follow the course template):
model = keras.models.Sequential()
model.add(keras.layers.InputLayer(input_shape=[None, 1]))
for dilation_rate in (1, 2, 4, 8, 16, 32):
    model.add(keras.layers.Conv1D(filters=32, kernel_size=2, strides=1,
                                  padding="causal", activation="relu",
                                  dilation_rate=dilation_rate))
model.add(keras.layers.Conv1D(filters=1, kernel_size=1))  # project to one output channel
optimizer = keras.optimizers.Adam(lr=3e-4)
model.compile(loss=keras.losses.Huber(), optimizer=optimizer, metrics=["mae"])
model_checkpoint = keras.callbacks.ModelCheckpoint("my_checkpoint.h5",
                                                   save_best_only=True)
early_stopping = keras.callbacks.EarlyStopping(patience=50)
history = model.fit(train_set, epochs=500,
                    validation_data=valid_set,  # valid_set built like train_set from the validation split
                    callbacks=[early_stopping, model_checkpoint])
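A quick sanity check on this architecture: with a kernel size of 2, each layer extends the receptive field by its dilation rate, so the six layers above together cover

# 1 + (kernel_size - 1) * (1 + 2 + 4 + 8 + 16 + 32) time steps
receptive_field = 1 + sum((2 - 1) * d for d in (1, 2, 4, 8, 16, 32))
print(receptive_field)  # 64

which is exactly the window size chosen below.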
A window size of 64 is used in training the model. In this instance, we are using a larger window size than was used with the CNN-LSTM model, in order to ensure that the CNN model picks up longer-term dependencies.
Note that early stopping is used when training the neural network. The purpose of this is to ensure that the neural network halts training at the point where further training would result in overfitting. Determining this manually is quite an arbitrary process, so early stopping can greatly assist with this.
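Once early stopping halts training, the best weights saved by the ModelCheckpoint callback can be restored before forecasting (assuming the my_checkpoint.h5 filename used above):

# Roll back to the epoch with the best validation loss
model = keras.models.load_model("my_checkpoint.h5")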
Let’s now generate forecasts using the standalone CNN model that we just built.
cnn_forecast = model_forecast(model, series[..., np.newaxis], window_size)
cnn_forecast = cnn_forecast[split_time - window_size:-1, -1, 0]
Here is a plot of the forecasted vs. actual data.
The mean absolute error came in slightly higher at 7.49.
Note that for both models, the Huber loss was used as the loss function. This type of loss tends to be more robust to outliers, in that it is quadratic for smaller errors and linear for larger ones.
This type of loss is suitable for this scenario, as we can see that some outliers are present in the data. Using MSE (mean squared error) would overly inflate the forecast error yielded by the model, whereas MAE on its own would likely underestimate the size of the error by not taking such outliers into account. The use of a Huber loss function allows for a happy medium.
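For reference, here is a minimal NumPy sketch of the Huber loss (the threshold delta = 1.0 is the conventional default, not necessarily what was used in training here):

import numpy as np

def huber(error, delta=1.0):
    # Quadratic for |error| <= delta, linear beyond: outliers are
    # penalised less harshly than under MSE
    abs_err = np.abs(error)
    return np.where(abs_err <= delta,
                    0.5 * error ** 2,
                    delta * (abs_err - 0.5 * delta))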
>>> keras.metrics.mean_absolute_error(x_valid, cnn_forecast).numpy()
Even with a slightly higher MAE, the CNN model has performed quite well in forecasting daily hotel cancellations, without having to be combined with an LSTM layer in order to learn long-term dependencies.
In this example, we have seen:
- The similarities and differences between CNNs and LSTMs in forecasting time series
- How dilated convolutions assist CNNs in forecasting time series
- Modification of kernel size, padding and strides in forecasting a time series with CNN
- Use of a WaveNet architecture to conduct a time series forecast using stand-alone CNN layers
In particular, we saw how, through the use of dilation, a standalone CNN can produce results comparable to those of a CNN-LSTM model.
Many thanks for your time, and any questions, suggestions or feedback are greatly appreciated.
As mentioned, this topic is also covered in the Intro to TensorFlow for Deep Learning course from Udacity; I highly recommend the chapter on Time Series Forecasting for further detail.
You can also find the full Jupyter Notebook that I used for running this example on hotel cancellations here.
The original Jupyter Notebook (Copyright 2018, The TensorFlow Authors) can also be found here.