
    Time Series Pattern Recognition

    · 1. Introduction
    · 2. Exploratory Data Analysis
      ∘ 2.1 Pattern Changes
      ∘ 2.2 Correlation Between Features
    · 3. Anomaly Detection and Pattern Recognition
      ∘ 3.1 Point Anomaly Detection (System Fault)
      ∘ 3.2 Collective Anomaly Detection (External Event)
      ∘ 3.3 Clustering and Pattern Recognition (External Event)
    · 4. Conclusion

    Note: The detailed project report and the datasets used in this post can be found on my GitHub page.

    1. Introduction

    This project was assigned to me by a client. No non-disclosure agreement is required, and the project does not contain any sensitive information, so I decided to make it public as part of my personal data science portfolio while anonymizing the client's information.

    In the project, two datasets are provided, each consisting of one week of sensor readings, to accomplish the following four tasks:

    1. Find anomalies in the dataset to automatically flag events

    2. Categorize anomalies as "system fault" or "external event"

    3. Provide any other useful conclusions from the patterns in the dataset

    4. Visualize the inter-dependencies of the features in the dataset

    In this report, I briefly walk through the steps I used for data analysis, the visualization of feature correlations, and the machine learning techniques used to automatically flag "system faults" and "external events", along with my findings from the data.

    2. Exploratory Data Analysis

    My code and results in this section can be found here.

    The dataset comes with two CSV files, both of which can be accessed from my GitHub page. I first import them and concatenate them into a single Pandas dataframe in Python. Some rearrangements are made to drop every column except the 11 features we are interested in (a loading sketch follows the list below):

    • Ozone
    • Hydrogen Sulfide
    • Total VOCs
    • Carbon Dioxide
    • PM 1
    • PM 2.5
    • PM 10
    • Temperature (Internal & External)
    • Humidity (Internal & External)
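
    As a quick illustration, here is a minimal sketch of this import step. The file names ("week1.csv", "week2.csv") and the "Timestamp" column label are assumptions, since the post does not show the raw schema:

    import pandas as pd

    FEATURES = [
        "Ozone", "Hydrogen Sulfide", "Total VOCs", "Carbon Dioxide",
        "PM 1", "PM 2.5", "PM 10",
        "Temperature (Internal)", "Temperature (External)",
        "Humidity (Internal)", "Humidity (External)",
    ]

    # Read both weekly CSV files, concatenate them, and keep only the
    # 11 feature columns, indexed by timestamp.
    frames = [pd.read_csv(f, parse_dates=["Timestamp"]) for f in ("week1.csv", "week2.csv")]
    df = pd.concat(frames, ignore_index=True).sort_values("Timestamp")
    df = df.set_index("Timestamp")[FEATURES]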

    The timestamps span from May 26 to June 9, 2020 (14 whole days in total) in the EDT (GMT-4) time zone. By subtracting consecutive timestamps, the intervals between readings are found to vary from 7 seconds to 3552 seconds. The top 5 most frequent intervals are listed in Table 1; most are close to 59 or 60 seconds, so it can be concluded that the sensor reads roughly every minute. However, if no deliberate interference is involved, the inconsistency of the reading intervals might be worth looking into, since it could cause trouble in later time series analysis.

    Table 1: Top 5 Time Intervals of the Sensor Measurements
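
    The interval statistics behind Table 1 can be reproduced with a short sketch like this (assuming the df built above):

    # Seconds between consecutive readings, and the five most common gaps.
    intervals = df.index.to_series().diff().dt.total_seconds().dropna()
    print(intervals.min(), intervals.max())   # ranges from 7 s to 3552 s
    print(intervals.value_counts().head(5))   # top 5 most frequent intervals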

    The time series for each feature are on different scales, so they are normalized for better visualization and machine learning efficiency. They are then plotted and visually inspected to discover any interesting patterns.
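
    The post does not say which scaler was used; min-max normalization to [0, 1] is one common, assumed choice:

    # Scale every feature column to the [0, 1] range.
    df_norm = (df - df.min()) / (df.max() - df.min())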

    2.1 Pattern Changes

    Some of the features seem to share similar pattern changes at specific time points. Three of the most significant (Temperature External, Humidity External, and Ozone) are shown in Figure 1. It can be clearly seen that the areas highlighted in pink tend to have flat signals, while the unhighlighted areas are sinusoidal.

    Figure 1: Readings of Temperature (External), Humidity (External), and Ozone. They experience similar pattern changes at the same time points.

    Since the outdoor temperature normally peaks at noon and drops at night, I started to wonder whether different test environments were involved during this 14-day period. To test the idea, Toronto weather data was queried from Canada Weather Stats [1]. The recorded temperature and relative humidity are overlaid on the external temperature and humidity in this dataset, as shown in Figure 2. The actual temperature and humidity fluctuate in a sinusoidal fashion, and most parts of the sensor readings correlate well with the weather data, while the areas highlighted in pink remain relatively invariant. I was not given any information about the environments in which the measurements were taken, but from the plot it can reasonably be inferred that the device was relocated between indoor and outdoor environments during the 14-day period. This is also tested later with the automatic anomaly detection in Section 3.3.

    Figure 2: Temperature (External) and Humidity (External) overlaid with Toronto weather data. The pink highlighted areas remain relatively invariant compared to the others.

    2.2 Correlation Between Features

    Correlation is a technique for investigating the relationship between two quantitative, continuous variables in order to represent their inter-dependency. Among the various correlation techniques, Pearson's correlation is the most common; it measures the strength of the association between two variables. Its coefficient ranges from -1 to 1, where 1 represents the strongest positive correlation, -1 the strongest negative correlation, and 0 no correlation. The correlation coefficients between each pair of features are calculated and plotted as a heatmap, shown in Table 2. The scatter matrix of selected features is also plotted and attached in the Appendix.
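
    A heatmap like Table 2 can be produced in a few lines; seaborn is an assumed choice here, as the post does not name its plotting library:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # 11 x 11 matrix of Pearson coefficients, drawn as a heatmap.
    corr = df_norm.corr(method="pearson")
    sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, fmt=".2f")
    plt.show()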

    The first thing to notice is that PM 1, PM 2.5, and PM 10 are highly correlated with each other, meaning they always fluctuate in the same fashion. Ozone is negatively correlated with Carbon Dioxide and positively correlated with Temperature (Internal) and Temperature (External). On the other hand, it is surprising not to find any significant correlation between Temperature (Internal) and Temperature (External), possibly due to superior thermal insulation of the instrument; however, since no relevant domain knowledge is provided, no conclusion can be drawn on the plausibility of this finding. Besides Ozone, Temperature (Internal) is also negatively correlated with Carbon Dioxide, Hydrogen Sulfide, and the three particulate matter measures. In contrast, Temperature (External) correlates positively with Humidity (Internal) and the three particulate matter measures, and negatively with Humidity (External), consistent with the time series plots in Figure 1.

    Table 2: Heatmap of Pearson's Correlation Coefficients between Features.

    3. Anomaly Detection and Pattern Recognition

    In this section, various anomaly detection methods are examined on the dataset. The data come without labels, so no knowledge or classification rule is provided to distinguish between "system faults", "external events", and anything else, and no details of the instrument or the experiments are provided either. Therefore, my results in this section might deviate from expectations, but I try my best to make assumptions, define the problems, and then solve them based on my personal experience. The section consists of three parts: Point Anomaly Detection, Collective Anomaly Detection, and Clustering.

    3.1 Point Anomaly Detection (System Fault)

    Point anomalies, or global outliers, are data points that lie entirely outside the scope of the usual signals, without any support from close neighbors. They are usually caused by human or system error and need to be removed during data cleaning for better performance in predictive modeling. Assuming that "system faults" are equivalent to such point anomalies, several features in this dataset are worth examining, such as the examples shown in Figure 3.

    Figure 3: Time series signals before (left) and after (right) automatic flagging. The point anomalies (outliers) are marked in red. From top to bottom: Humidity (Internal), Total VOCs, and Carbon Dioxide.

    Here, Humidity (Internal), Total VOCs, and Carbon Dioxide each represent a distinct level of complexity for point anomaly detection. In the first, three outliers sit at the level of 0, so a simple boolean filter can flag them. In the second, the outliers deviate significantly from the signal we are interested in, so a linear threshold can separate them out. Both cases are easy to implement because they can be handled by purely experience-based rules (see the sketch below). In the third case, however, a linear threshold cannot separate out the outliers: even though they deviate from their neighbors, their values may be no larger than the usual signal at other time points.
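
    The first two cases amount to one-line filters; the 0 level and the 0.9 cutoff on the normalized data are illustrative assumptions:

    # Case 1: readings stuck at zero (boolean filter).
    stuck_at_zero = df_norm["Humidity (Internal)"] == 0
    # Case 2: a fixed linear threshold on the normalized signal.
    above_threshold = df_norm["Total VOCs"] > 0.9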

    For such cases, there are many possible approaches. One simple way is to compute the rolling mean or median at each time point and test whether the actual value falls within a prediction interval obtained by adding and subtracting a fixed fluctuation range around that center line. Since we are dealing with outliers, the rolling median is more robust, so it is the one used for this dataset.

    Figure 4: Zoomed-in time series plot of Carbon Dioxide, with its rolling median and prediction interval

    Figure 4 demonstrates the approach even more clearly: equal-distance prediction intervals are set around the rolling median. Here, the rolling window is set to 5, which means that for each data point we take its four closest neighbors and compute the median as the center of the prediction. A prediction interval of ±0.17 is then padded around the center, and any points outside it are considered outliers. A minimal pandas sketch follows.
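
    This is a sketch of the rolling-median rule with the stated window of 5 and ±0.17 band, applied to the normalized Carbon Dioxide series:

    # Centered window of 5 = the point itself plus its 4 nearest neighbors.
    series = df_norm["Carbon Dioxide"]
    center = series.rolling(window=5, center=True).median()
    outliers = (series - center).abs() > 0.17   # outside the ±0.17 band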

    Detecting point anomalies with this approach is straightforward and efficient. However, it has deficiencies and may not be reliable enough for more complex data. The model has two parameters, the rolling window size and the prediction interval size, both defined manually through experimentation with the given data. In going from case 2 to case 3 of Figure 3, we essentially gave the classification boundary between signal and outliers the ability to adjust itself over time. The bandwidth, however, is fixed, so the method becomes less useful when the definition of a point anomaly itself changes with time. For example, what counts as a super cheap flight ticket might be totally different on a normal weekday versus during the holiday season.

    That is where machine learning comes into play: a model can learn from the data and adjust the parameters by itself as time changes, so it knows whether a data point should be classified as a point anomaly at a specific time. However, I cannot construct such a supervised model for this task, since the premise is to have labeled data, and ours is not labeled. I would still suggest going down this path in the future, because such a model can produce accurate results no matter how complex the system or the data.

    Even though supervised learning would not work for this task, there are unsupervised techniques that can be useful, such as clustering, discussed further in subsection 3.3. Clustering can group unlabeled data using similarity measures such as the distance between data points in the vector space; point anomalies can then be distinguished by selecting the points far away from the cluster centers. However, to use clustering for point anomaly detection in this dataset, we have to assume that events are defined over the scope of multiple features, instead of treating every single time series separately. More details are discussed in the following two subsections, 3.2 and 3.3.

    3.2 Collective Anomaly Detection (External Event)

    If we define a "system fault" as a point anomaly, then there are two directions to go with "external events". One is to define them as collective anomalies appearing in individual time series signals. The idea of a collective anomaly is the opposite of a point anomaly: point anomalies are discontinuous values that deviate greatly from the usual signal, while collective anomalies are usually continuous but take unexpected values, such as a significant increase or decrease at some time points. In this subsection, each of the 11 features is treated separately as a single time series, and the task is to find the abrupt changes that happen within each of them.

    For such a problem, one typical approach is to adapt time series forecasting: fit a model to a preceding time period, predict the value that follows, and compare the actual value to see whether it falls inside the prediction interval. This is very similar to the rolling median method of the previous subsection, except that only previous time points are used, instead of neighbors from both directions.

    There are different options for the model as well. A traditional time series forecasting model like SARIMA is a good candidate, but it may not be complex enough to accommodate the "patterns" mentioned in Sections 2.1 and 3.3. Another option, now quite widely used, is to train a supervised regression model on the time series.

    The idea is simple: features are extracted from the time series using a sliding window, as seen in Table 3 and Figure 5. The sliding window size (blue) is set equal to the desired number of features k. Then, for each data point (orange) in the time series, the features are the values from its lag 1 to lag k. As a result, a time series with N samples can be transformed into a table of N-k observations and k features. Next, following the concept of "forward chaining", each point is predicted by a regression model trained on the observations that come before it. In addition to the main regression model, two quantile regressors are trained at different significance levels to predict the upper and lower bounds of the prediction interval, which tell us whether the actual value lies above or below the interval band. A sketch of the extraction step follows Figure 5.

    Table 3: Feature extraction algorithm through the sliding window
    Figure 5: Concepts of the sliding window and forward chaining [2]. The features and targets are extracted from the time series using a sliding window. The training process is based on "forward chaining".
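
    A minimal sketch of the sliding-window extraction; the lag count k=24 is an assumed value, not one stated in the post:

    import numpy as np

    def make_lag_features(y, k):
        # Row r holds y[r] .. y[r+k-1] as features for target y[r+k],
        # turning N samples into an (N-k, k) matrix plus N-k targets.
        X = np.column_stack([y[i:len(y) - k + i] for i in range(k)])
        return X, y[k:]

    # Hourly-resampled Ozone series, as used in Figure 6 below.
    y = df_norm["Ozone"].resample("60min").mean().dropna().to_numpy()
    X, target = make_lag_features(y, k=24)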

    This method is applied to the given dataset, and an example result is shown in Figure 6. The Ozone time series is resampled hourly for faster training, and the extracted features are fed into three Gradient Boosting Regressor models (one main model and two quantile regressors) using Scikit-Learn. The significance levels are chosen so that the prediction band represents a 90% confidence interval (shown in green in Figure 6, top). The actual values are then compared with the prediction interval and flagged red (unexpected increase) or blue (unexpected decrease) in Figure 6, bottom. A sketch of the three models follows the figure.

    Figure 6: Flagging results for the Ozone time series (hourly sampled) using the quantile regressor method. Top: data and prediction interval; bottom: data with flags showing values above or below the prediction interval.
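
    A hedged sketch of the three-model setup: one point forecaster plus 5%/95% quantile models for a 90% band. For brevity it fits each model once on all rows, whereas the post's forward chaining would refit on past windows only; hyperparameters are placeholders:

    from sklearn.ensemble import GradientBoostingRegressor

    mid = GradientBoostingRegressor().fit(X, target)   # point forecast
    lo = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, target)
    hi = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, target)

    lower, upper = lo.predict(X), hi.predict(X)
    flag_up = target > upper     # unexpected increase (red in Figure 6)
    flag_down = target < lower   # unexpected decrease (blue in Figure 6)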

    The results may not be super impressive yet, because more work still needs to be done on regression model selection and hyperparameter fine-tuning. However, the method is already showing its capability of flagging the abrupt increases and decreases among all these spikes. One great thing about machine learning models is that they learn and evolve by themselves as they are fed data. From Figure 6 (bottom), the last three hills (after about 240 hours) have fewer flagged points than the earlier ones. This is not only because their magnitudes are smaller, but also because the model is learning from previous experience and starting to adapt to the idea that it is now in "mountains" where periodic fluctuations should be expected. It is therefore not hard to conclude that model performance can keep improving as more data instances are fed in.

    Beyond this quantile regression model, deep learning models such as LSTMs might achieve better performance. Long Short-Term Memory (LSTM) is a specialized recurrent neural network (RNN) that is one of the state-of-the-art choices for sequence modeling, thanks to its special design of feedback connections. However, setting up and fine-tuning such a network takes far more time and effort than this project allows, so it is not included in this report.

    Again, as mentioned in previous sections, the provided data come without labels showing which data points are considered anomalies. This does cause difficulties in the collective anomaly detection task discussed in this subsection by limiting the model choices and performance. In the future, if some labels are provided, this becomes a semi-supervised or supervised learning problem, and better results will be easier to achieve.

    3.3 Clustering and Pattern Recognition (External Event)

    As mentioned in subsection 3.2, recognizing "external events" can be approached from two directions: one is to treat every time series separately and monitor any unexpected changes in the sensor signals; the other is to assume the events affect multiple features at the same time, so that we can distinguish them by looking at the unique characteristics shown across different features. In the latter case, labeled data would make this a common classification problem, but even without labels we can still approach it with clustering.

    Clustering is an unsupervised machine learning technique that finds similarities between data points according to their characteristics and groups similar data objects into clusters. It can be used as a stand-alone tool to gain insight into the data distribution, or as a preprocessing step for other algorithms. There are many distinct clustering methods; here I use two of the most common ones, K-Means and DBSCAN.

    K-Means is a partitioning method: it randomly partitions objects into non-empty subsets and keeps assigning objects and adjusting the centroids until it reaches a local minimum of the sum of squared distances between each object and its centroid. Density-Based Spatial Clustering of Applications with Noise (DBSCAN), on the other hand, is a density-based method in which a cluster is defined as a maximal set of density-connected points [3]. A sketch of both follows.
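
    Both clusterings are one-liners in Scikit-Learn; k=2 and the DBSCAN eps/min_samples values here are assumptions for illustration:

    from sklearn.cluster import KMeans, DBSCAN

    Xc = df_norm.dropna().to_numpy()
    km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xc)
    db_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(Xc)   # -1 marks noise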

    Figure 7: Clustering results projected onto the first 2 principal components. Left: K-Means; right: DBSCAN
    Figure 8: Clustering results projected onto the first 3 principal components. Left: K-Means; right: DBSCAN
    Figure 9: Clustering results on the Temperature (External) time series. Left: K-Means; right: DBSCAN

    Principal Component Analysis (PCA) is a dimensionality reduction technique that creates new, uncorrelated variables in order to increase interpretability while minimizing information loss. In this project, after applying the K-Means and DBSCAN algorithms to the normalized data, PCA is performed and the clustering results are plotted in both 2D (Figure 7) and 3D (Figure 8) using the first 2 and 3 principal components. In addition, to view the clustering results from another perspective, labeled time series plots are made; the Temperature (External) plot is shown in Figure 9 as an example. A minimal projection sketch follows.
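
    This is the 2-component projection behind Figure 7; the 3-component version for Figure 8 is analogous:

    from sklearn.decomposition import PCA
    import matplotlib.pyplot as plt

    # Project the 11-dimensional points onto the first two principal
    # components and color them by cluster label.
    pcs = PCA(n_components=2).fit_transform(Xc)
    plt.scatter(pcs[:, 0], pcs[:, 1], c=km_labels, s=2)
    plt.xlabel("PC 1"); plt.ylabel("PC 2")
    plt.show()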

    From the plots, it can clearly be seen that both methods are able to distinguish the indoor/outdoor pattern changes mentioned in Section 2.1. The main difference is that the partition-based K-Means method is more sensitive to the magnitude changes caused by day/night alternation. Many variables in the dataset undergo such obvious, simultaneous sinusoidal changes, including Temperature (External and Internal), Humidity (External and Internal), and Ozone, and K-Means tends to treat the peaks and valleys differently. Density-based DBSCAN, by contrast, cares less about magnitude differences and pays more attention to the density distribution; it therefore clusters the whole sinusoidal part as one mass cloud, as seen in Figures 7 and 8.

    It is not possible to say which clustering method is better at this stage, because they are distinctive enough to serve different interests. If we are more interested in treating the high and low portions of the sinusoidal signals differently, we should use K-Means; if we only want to distinguish between indoor and outdoor mode, DBSCAN is better. Moreover, since this is an unsupervised learning task, there is no way to quantify model performance other than visualizing the results and judging from experience. In the future, if some labeled data is provided, this can be turned into a semi-supervised learning task, and more intuition can be gained for model selection.

    4. Conclusion

    In this post, I briefly walked through the approaches and findings of the exploratory data analysis and the correlation analysis, as well as the construction of three distinct modeling pipelines used for point anomaly detection, collective anomaly detection, and clustering.

    In the exploratory data analysis section, the sensor reading intervals were found to vary severely. Even though most of them are around one minute, the inconsistency is still worth looking into, since it might lower the efficiency and performance of analytical tasks. The measuring environment was found to change over the course of the time series, which was later confirmed both by alignment with the actual Toronto weather data and by the clustering results. In addition, the correlations between features were studied and exemplified. A few puzzles were raised, such as the strange relationship between Temperature (Internal) and Temperature (External), which needs to be studied through experiments or through the device itself.

    In the anomaly detection section, since "system fault" and "external event" are not clearly defined, I split the project into three different tasks. Point anomalies are defined as severely deviated, discontinuous data points, and the rolling median method successfully automates the process of labeling them. Collective anomalies, on the other hand, are defined as deviated collections of data points, usually seen as abrupt increases or decreases; this task is accomplished by extracting features from the time series data and then training regression models. Clustering is also performed on the dataset using K-Means and DBSCAN, both of which play to their strengths and successfully cluster the data by leveraging its similar and dissimilar characteristics.

    All of the anomaly detection models introduced in this project are only prototypes, without extensive model selection and fine-tuning. Each of them has great potential to evolve into a better form given more effort and more knowledge of the data. For point anomalies, there are many more machine-learning-based outlier detection techniques, such as isolation forests and local outlier factors, that accommodate more complex data forms. For collective anomalies, state-of-the-art LSTMs are worth the effort, especially for time series data and sequence modeling. For clustering, there are many other families of methods, such as hierarchical and grid-based clustering, capable of similarly strong performance.

    Of course, these future directions are advised under the premise of having no labeled data. If experienced engineers or scientists can offer their insight into which types of data count as "system faults" or "external events", more exciting progress will surely be made by transforming the tasks into semi-supervised or supervised learning problems, where many more tools are available to choose from.

    Translated from: https://towardsdatascience.com/time-series-pattern-recognition-with-air-quality-sensor-data-4b94710bb290

  • Time Series Pattern Recognition with Deep Learning



    Idea: convert the time series into an image-like format, then use a CNN for recognition.


    Application of deep learning to traffic classification:

    http://wenku.baidu.com/link?url=5MHAlbaAFX8g9uCnWyR6RGEsWdka7suaR3bFLanZ-ZOlsPgmEvErHApDslC1HI73zrmTLAyEQQGMzCVU799gyS4opj-pjeXyP4KYsuQRZnC


    http://danielnouri.org/notes/2014/01/10/using-deep-learning-to-listen-for-whales/

    Using deep learning to listen for whales

    January 10, 2014 | categories: Python, Biology, Programming, Bioacoustics, Machine Learning

    Since recent breakthroughs in the field of speech recognition and computer vision, neural networks have gotten a lot of attention again. Particularly impressive were Krizhevsky et al.'s seminal results at the ILSVRC 2012 workshop, which showed that neural nets are able to outperform conventional image recognition systems by a large margin; results that shook up the entire field. [1]

    Krizhevsky's winning model is a convolutional neural network (convnet), which is a type of neural net that exploits spatial correlations in 2-d input. Convnets can have hundreds of thousands of neurons (activation units) and millions of connections between them, many more than could be learned effectively previously. This is possible because convnets share weights between connections, and thus vastly reduce the number of parameters that need to be learned; they essentially learn a number of layers of convolution matrices that they apply to their input in order to find high-level, discriminative features.

    http://danielnouri.org/media/deep-learning-whales-krizhevsky-lsvrc-2012-predictions.jpg

    Figure 1: Example predictions of ILSVRC 2012 winner; eight images with their true label and the net's top five predictions below. (source)

    Many papers have since followed up on Krizhevsky's work and some were able to improve upon the original results. But while most attention went into the problem of using convnets to do image recognition, in this article I will describe how I was able to successfully apply convnets to a rather different domain, namely that of underwater bioacoustics, where sounds of different animal species are detected and classified.

    My work on this topic began with last year's Kaggle Whale Detection Challenge, which asked competitors to classify two-second audio recordings, some of which had a certain call of a specific whale on them, and others didn't. The whale in question was the North Atlantic Right Whale (NARW), which is a whale species that's sadly nearly extinct, with less than 400 individuals estimated to still exist. Believing that this could be a very interesting and meaningful way to test my freshly acquired knowledge around convolutional neural networks, I entered the challenge early, and was able to reach a pretty remarkable Area Under Curve (AUC) score of roughly 97% after only two days into the competition. [2]

    http://danielnouri.org/media/deep-learning-whales-leaderboard.png

    Figure 2: The Kaggle leaderboard after two days into the competition.

    The trick to my early success was that I framed the problem of finding the whale sound patterns as an image recognition problem, by turning each of the two-second sound clips into a spectrogram. Spectrograms are essentially 2-d arrays with amplitude as a function of time and frequency. This allowed me to use standard convnet architectures quite similar to those Krizhevsky had used when working with the CIFAR-10 image dataset, with one of the few differences in architecture stemming from the fact that CIFAR-10 uses RGB images as input, while my spectrograms have one real-number value per pixel, not unlike gray-scale images.

    http://danielnouri.org/media/deep-learning-whales-spectrogram.jpg

    Figure 3: Spectrogram containing a right whale up-call.
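
    For illustration, a minimal sketch (not the author's exact pipeline) of turning a short clip into a spectrogram array with SciPy; the 2 kHz sample rate and window sizes are assumptions:

    import numpy as np
    from scipy.signal import spectrogram

    def clip_to_spectrogram(samples, rate=2000):
        # One real value per "pixel": log-scaled power over frequency x time.
        f, t, sxx = spectrogram(samples, fs=rate, nperseg=256, noverlap=192)
        return np.log1p(sxx)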

    Spurred by my success, I registered for the International Workshop on Detection, Classification, Localization, and Density Estimation (DCLDE) of Marine Mammals using Passive Acoustics in St Andrews. A world-wide community of scientists meets every two years at this workshop to discuss the latest developments in using passive acoustics (listening for sounds) to detect and track marine mammals.

    That there was such a breadth of research around this topic was entirely new to me, and it was fascinating to learn about it. Another thing that I've since learned is that this breadth is sadly dwarfed by the massive amounts of underwater noise that humans produce today, through shipping, oil exploration, and military sonar. And this noise severely affects the lives of animals for which "listening is as important as seeing is for humans – they communicate, locate food, and navigate using sound."

    The talk that I gave at the DCLDE 2013 workshop was well received. In it, I elaborated how my method relied on little to no problem-specific human engineering, and therefore could be easily adapted to detect and classify all sorts of marine mammal sounds, not just right whale up-calls.

    At DCLDE, the execution speed of detection algorithms was frequently quoted as being x times faster than real-time, with x often being a fairly low number around 1 to 10. My GPU-powered implementation turned out to be on the faster side here: on my workstation, it detects and classifies sounds 700x faster than real-time, which means it runs detections on one year of audio recordings in roughly twelve hours, using only a single NVIDIA GTX 580 graphics card.

    In terms of accuracy, it was somewhat hard to get an idea of which of the algorithms presented really worked better than the others. There were two reasons for this: the inconsistent use of reliable metrics such as AUC and of cross-validation, and a lack of standard datasets that everyone could test their algorithms against. [3]

    However, it should be mentioned that good datasets are a bit tricky to come by in this field. The nature of hydrophone recordings is that the signal you're listening for could be generated a few meters away, or many kilometers, and therefore be very faint. Plus, recordings often contain a lot of ambient noise coming from cargo ships, offshore drilling, hydrophone cable flutter, and the like. The effect is that it's often hard even for a human expert to tell whether the particular sound they're listening to is a vocalization of the mammal they're looking for, or just noise. Thus, analysts will often label segments as unsure, and two analysts will sometimes even give conflicting labels to the same sound.

    width="420" height="315" src="http://www.youtube-nocookie.com/embed/1WFnX4zO9GU?rel=0" frameborder="0" allowfullscreen="" style="box-sizing: border-box;">

    Four NARW up-calls that are easy to detect.
    (Here's a much messier example. And some more fascinating recordings of marine mammals.)

    This leads to a situation where people tend to ignore noisy sounds altogether, since if you consider them, predictions become difficult to verify manually, and good training examples harder to collect. But more importantly, when you ignore sounds with a bad signal-to-noise ratio (SNR), your algorithms will have an easier time learning the right patterns, too, and they will make fewer mistakes. As it turns out, noise is often more of a problem for algorithms than it is for human specialists.

    The approach of ignoring sounds with a bad SNR seems fine until you're in a situation where you've put a lot of effort into collecting recordings, they turn out to be unusually noisy, and adjusting your model's detection threshold yields either way too many false positive detections or too many missed calls.

    One of the very nice people I met at DCLDE was Holger Klinck from Oregon State University. He wanted to try out my convnet with one of his lab's "very messy" recordings. Some material that his group at OSU had collected at five sites near Iceland and Greenland in 2007 and 2008 had unusually high levels of noise in it, and their detection algorithms had maybe worked less than optimally there.

    http://danielnouri.org/media/deep-learning-whales-osu-iceland-detections.png

    Figure 4: "Locations of passive acoustic moorings near Iceland and southern Greenland (black spots), and the number of right whale upcalls detected per day in late 2007 at the five sites." Taken from [4]. Note the very low number of calls detected at the CE and SE sites.

    I was rather amazed when a few weeks later I had a hard disk from OSU in my hands containing in total many years of hydrophone recordings from two sites near Iceland and four locations on the Scotian Shelf. I dusted off the model that I had used for the Kaggle Whale Detection Challenge and quite confidently started running detections on the recordings. Which is when I was in for a surprise: the predictions my shiny 97% model made were all really lousy! Very many obvious non-whale noises were detected wrongly. How was it possible?

    To solve this puzzle, I had to understand that the Kaggle Whale Detection Challenge's train and test datasets had a strong selection bias in them. The tens of thousands of examples that I had used to train my model for the challenge were unrepresentative of all the whale sounds and, particularly, of the many similar-sounding non-whale sounds out there. That's because the Kaggle challenge's examples were collected by use of a two-stage pipeline, where an automated detector would first pick out likely candidates from the recording, and only after that would a human analyst label them as true or false. I realized that what we were building in the Kaggle challenge was a classifier that worked well only if it had a certain detector running in front of it that would take care of the initial pass of detection. My neural net had thus never seen during training anything like the sounds that it mistook for whale calls now.

    If I wanted my convnet to be usable by itself, on continuous audio recordings, and independently of this other detector, I would have to train it with a more balanced training set. And so I ditched most of the training examples I had, and started out with only a few hundred, and trained a new model with them. As was expected, training with only few examples left me with a pretty weak model that would overfit and make lots of obvious mistakes. But this allowed me to pick up the worst mistakes, label them correctly, and feed them back into the system as training examples. And then repeat that. (A process that Olivier Grisel later told me amounts to active learning.)

    Many (quite enjoyable) hours of listening to underwater sounds later, I had collected some 2000 training examples this way, some of which were already pretty tricky to verify. And luckily, the newly trained model started to make pretty good predictions. When I sent my results back to Holger, he said that, yes, the patterns I'd found were very similar to those that his group had found for the Scotian Shelf sites!

    http://danielnouri.org/media/deep-learning-whales-my-scotian-shelf-detections.png

    Figure 5: Number of right whale up-call detections per hour at two sites on the Scotian Shelf, detected by the convnet. The numbers and seasonal pattern match with what Mellinger et al. reported in [5].

    The OSU team had used a three-stage detection process to produce their numbers. Humans verified in phases two (broadly) and three (in more detail) the detections that the algorithm came up with in phase one. Whereas my detection results came straight out of the algorithm.

    A case-by-case comparison still needs to happen, but the similarities of the overall call patterns suggest that the convnet reaches comparable performance, without the need for human analysts to be part of the detection pipeline, making it potentially much more time-efficient to use in practice.

    What's even more exciting is that the neural net was able to find right whale up-calls at the problematic SE site near Iceland, where previously no up-calls could be detected due to high noise levels.

    http://danielnouri.org/media/deep-learning-whales-my-iceland-detections.png

    Figure 6: NARW up-call detections per day at sites SW and SE near Iceland, detected by the convnet. The patterns at the SW site match roughly with what was reported in [4], while no calls could be identified previously at the SE site (cf. Figure 4).

    Another thing we're currently looking into is whether or not the relatively small but constant number of calls that the convnet detected during the winter season are real, or whether they're false positives. Right whales are not known to hang around so high up north during that time of the year, so proving that would constitute significant news for people studying the migration routes of these whales.

    (Comments also on Hacker News.)

    [1] For a more detailed history and recent developments around neural nets, see this article in Nature: "Computer science: The learning machines".
    [2] See the mention of my results in this Wired article: "Wanted: Right Whale Caller ID".
    [3] For a comparison of machine learning algorithms in use, see: Mellinger DK, et al. 2007. An overview of fixed passive acoustic observation methods for cetaceans. Oceanography 20:36–45.
    [4] Mellinger DK, et al. 2011. Confirmation of right whales near a historic whaling ground east of Southern Greenland. Biol. Lett. 7:411–413.
    [5] Mellinger DK, et al. 2007. Seasonal occurrence of North Atlantic right whale (Eubalaena glacialis) vocalizations at two sites on the Scotian Shelf. Mar. Mamm. Sci. 23:856–867.

  • Character Sequence Pattern Recognition

    Character Sequence Pattern Recognition

    Time limit: 1   Memory limit: 32 MB

    Problem Description

    Write an algorithm that determines whether a character sequence has the form 'sequence1&sequence2', where sequence2 is the reverse of sequence1, 0 < sequence length < 1000, and all characters are lowercase letters. Output YES or NO.

    Input

    A single line containing a character sequence

    Output

    YES or NO

    Sample Input

    hello&ollhe

    Sample Output

    NO

    Problem link: http://acm.hnucm.edu.cn/JudgeOnline/problem.php?id=1311

    Worth writing up — plenty of traps to step in.

    On the surface an easy problem, but it has many pitfalls, which is why the acceptance rate is so low.

    Pitfalls:

    1. Check that the input string consists only of lowercase letters and '&'.

    2. Check that the input string contains exactly one '&'.

    3. Check that the input length is odd; an even length can never satisfy the reversed-sequence requirement.

    4. The two subsequences must have equal length, so the middle character of the input must be '&'.

    Once all of these checks pass, the characters can finally be compared pairwise to test for reversed order.

    AC Code: 

    #include <cstdio>
    #include <cstring>
    using namespace std;

    int main(){
        char str[4005];
        while(scanf("%s", str) != EOF){
            int len = strlen(str);
            int sum = 0;       // number of '&' characters seen
            bool flag = true;  // stays true while all checks pass
            // Every character must be a lowercase letter or '&'.
            for(int i = 0; i < len; i++){
                if(str[i] == '&'){
                    sum++;
                }
                else if(str[i] < 'a' || str[i] > 'z'){
                    flag = false;
                    break;
                }
            }
            // The length must be odd, the middle character must be '&',
            // and there must be exactly one '&' in the whole string.
            if(len % 2 == 0 || str[len / 2] != '&' || sum != 1){
                flag = false;
            }
            if(!flag){
                printf("NO\n");
                continue;
            }
            // sequence2 is the reverse of sequence1 exactly when the
            // string is symmetric around the central '&'.
            for(int i = 0; i < len / 2; i++){
                if(str[i] != str[len - 1 - i]){
                    flag = false;
                    break;
                }
            }
            if(flag) printf("YES\n");
            else printf("NO\n");
        }
        return 0;
    }

     

  • Research on human motion behavior pattern recognition based on time series analysis; activity recognition; time series analysis; wavelet threshold denoising; long short-term memory; hidden Markov model
  • Scattered thoughts on pattern recognition and time series



    Artificial intelligence has achieved great success in simulating human logical thinking, in many respects far surpassing humans, yet simulating human imagistic thinking has remained a struggle. In image recognition, for example, a small child can recognize a face within seconds, while a computer spends a long time computing the dimensions of each part of that face. Humans clearly do not do that when recognizing a face; the computer's recognition process obviously takes a different route.
    Qian Xuesen pointed out that we should collaborate with psychologists, but unfortunately much of psychology still offers no way to connect with computing. Much of natural science and the humanities (one could also say logical thinking and imagistic thinking) seem unable to find a way to combine; each goes its own way, with no real interdisciplinary crossover.

    In forecasting weather, earthquakes, the stock market, and lotteries, neural networks (ANNs) and support vector machines (SVMs) have achieved a series of remarkable results, yet their accuracy still falls far short of expectations. Is the reason that these models lack fitting power or generalization ability? I used to think so. I even suspected that some problems cannot be solved with time series methods at all — for the lottery, analyzing historical data may not help predict the future, because it is completely random and history has no influence on the future.
    But now I am more inclined to say that the problem lies in feature extraction, or in other words in situation description, in knowledge representation. Even the lottery is not truly random but a kind of pattern recognition: if one could accurately describe the initial behavior of the lottery machine, one could derive the winning numbers.
    [Revision]: Lottery prediction cannot use time series methods, because the next winning number has no connection to historical numbers; it depends only on the current state of the drawing machine. Stocks are similar: they depend mainly on current political and economic factors and have little to do with historical data. So stock and lottery prediction cannot be approached with time series thinking, and today's commercial software in this area has clearly walked into a dead end (authors of such software, please hold your rotten eggs — I only hope people will try multiple approaches).
    Using neural networks or SVMs for these predictions is infeasible, unless the input data describe the state of the drawing machine (or, for stocks, the current political and economic situation), and obtaining such data is obviously next to impossible.

    In predicting dice, many debates have taken place. Natural scientists argue that if one could measure the direction, force, and spin with which a die is thrown, plus air humidity, wind direction, and every other factor, one could compute the outcome exactly. So even dice are not a random process — this is determinism, represented by Clarke and others. Does that mean, then, that probability and randomness are mere expedients, used temporarily to handle problems we cannot fully grasp?

    The butterfly effect is the same: because the process is too complex to measure every physical quantity at every step, it cannot be predicted, and so it has been mystified.

    After examining weather, earthquakes, stock markets, lotteries, dice, and the butterfly effect — complex problems that still trouble the scientific community — one finding emerges (and drawing even a small conclusion from the survey is what matters): solving these problems is not merely a matter of faster CPUs or of building tools more powerful than neural networks; knowledge representation is the breakthrough point. Academician Lu Ruqian's proposal to rank "knowware" alongside hardware and software is probably a far-sighted proposal born precisely of recognizing the importance of knowledge representation.

    The problems above represent the hard cases of time series prediction, and what they share is that describing them is extremely difficult. Take weather: the volume of data collected by satellites, radar, high-altitude balloons, and other means is immense, yet even such a huge volume cannot guarantee a complete description of the atmosphere and the Earth. Today's largest computers are all devoted to meteorological and seismic data analysis precisely because there is too much data and no way to grasp the key points — we do not know which data matter more, or whether each datum can be assigned a weight. The result is that weather and earthquake forecasts have never been accurate, and the various prediction programs have likewise met their Waterloo in the stock market. These are among the hardest problems today, examples of complex giant systems. I wonder whether Academician Dai Ruwei of the Chinese Academy of Sciences has made any new breakthroughs in the study of complex giant systems, including human society and the virtual network society.

    Complexity, as the newest frontier of science, is the field that has produced the most Turing Award winners. I said earlier that the difficulty of the problems above is one of knowledge representation; put more fundamentally, I think it is a complexity problem. They are NP-hard: overly complex descriptions make their solution exponentially complex, and connectionist methods like ANNs and SVMs are equally helpless in the face of such enormous data. Data mining has become another possible breakthrough point.
    展开全文
  • Pattern recognition methods for voiceprint recognition (2011/01/05): for pattern recognition, there are several major families of methods. Template matching methods use dynamic time warping (DTW) to align training and test feature sequences, mainly for fixed-phrase applications (usually text-dependent tasks); nearest-neighbor methods keep all training features...
  • To effectively support the recognition of multiple similar deformations, the concept of the fluctuation pattern (FP) is proposed: the fluctuation pattern preserves the trend-change information of the original series, and the longest-common-subsequence algorithm is used to compute the similarity between fluctuation patterns, eliminating the effects of amplitude scaling, amplitude drift, and linear drift on similarity mining, ...
  • The compression algorithm identifies the "important" points of a time series and discards the others. It runs in linear time, requires constant memory, and gives good results on a wide variety of time series. We use the important points not only for compression but also for indexing time series databases, supporting efficient search for patterns...
  • R: time series

    1. Overview of time series: recording the development of random events in time order constitutes a time series. Observing and studying a time series, finding the laws of its development, and forecasting its future course is time series analysis. Basic steps of time series modeling: explain the modeling...
  • Wavelet analysis of time series

    Wavelet analysis theory has been widely applied in many nonlinear scientific fields such as signal processing, image compression, pattern recognition, numerical analysis, and atmospheric science. In time series research, wavelet analysis is mainly used for denoising and filtering time series, computing information-content coefficients and fractal dimensions, and detecting change points...
  • A survey of time series segmentation methods (Sun Wenyuan, Su Xiaolong): with the development of computing techniques such as knowledge discovery in databases (KDD) and pattern recognition, time series feature large data volumes, high dimensionality, and fast updates, so time series segmentation becomes indispensable...
  • When should a 1D CNN be applied? CNNs are good at identifying simple patterns in data and then using those simple patterns in higher layers... 1D CNNs work well for time series analysis of sensor data (such as gyroscope or accelerometer data), and equally well for analyzing signals with fixed...
  • Background: a "time series" is a sequence of statistics or observations of some indicator ordered by time... "Anomaly detection" means identifying behavior patterns that are abnormal and differ from history — for example, a machine that has always been idle suddenly spiking to 100% CPU usage, or a system whose request count drops during hours that should be business-busy...
  • With this dataset, we aim to find personal-attribute fingerprints in time series of sensor data, i.e., attribute-specific patterns that can be used to infer a data subject's gender or personality beyond their activity. The time series correspond to the subjects' walking activity (code 3). There are 12 features. Some notes...
  • 6.27 Beyond pattern recognition: until recently, the standout achievements of artificial neural networks were mostly in tasks such as recognizing patterns in static images. But another class of neural networks is becoming well known, particularly for handling events that unfold over time: recurrent neural networks (Recurrent ...)
  • Normalization methods for time series

    Before model training in machine learning and pattern recognition, data usually need preprocessing, and choosing which preprocessing method to use in which situation still gives many practitioners headaches. My two previous posts summarized some commonly used data normalization methods: data preprocessing...
  • What time series models are for

    What characteristic do all forecasting problems share? Pattern recognition. A pattern is the regularity (in other words, the information) hidden within random and disordered phenomena...
  • Python data analysis: time series analysis

    Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, EEG, control engineering, astronomy, communications engineering, and any applied science or engineering field that mainly involves measurements over time. Time series analysis comprises methods for analyzing time series data...
  • Neural networks can perform several typical tasks: clustering, fitting, classification (pattern recognition), and time series prediction. Classification is arguably one of the most common application scenarios, and earlier articles also used a classification task as the introductory example for neural networks. One often wants...
  • This was an assignment from a pattern recognition course; I spent some time on it and found the classifier's results barely acceptable, so I am sharing it here. About CpG islands: a genome is a sequence composed of the four nucleotides A, T, G, and C. A single strand of the human genome consists of about 3 billion...
  • In the clustering- and genetic-algorithm-based time series segmentation method proposed by Vincent S. Tseng et al., the definition of the fitness function is flawed. This paper improves it: normalization removes the influence of subsequence amplitude on the distance computation, and an inter-class distance is introduced so that the between-class differences of the segmentation result (between patterns)...
  • To address the shortcomings of Euclidean distance in describing segment trends, and the discretization of the distance values between corresponding segments under various pattern distances, a time series similarity measure based on morphological similarity distance is proposed. Recognition and clustering experiments on standard datasets show that the method is feasible and effective.
  • Taking the Xinzhouyao coal mine in Datong as an example, the seasonal ARIMA model and the threshold autoregressive model from time series analysis are used to predict the energy released by future microseismic events, and the strengths, weaknesses, and applicable conditions of the two methods are compared; a characteristic function of the variance of microseismic energy is constructed, and based on it a rock-burst...
  • 11. Time series analysis

    1. Pattern recognition 2. Parameter estimation 3. Model checking and prediction. Reposted from: https://www.cnblogs.com/Firesun/p/10896557.html
  • Anomalous event detection in time series

    When analyzing and processing data, people often encounter small amounts of data that are inconsistent with the general pattern, or that differ from most samples. We call such data anomalous, and handling anomalous data is valuable in certain fields, for example network security...
  • What practical use do time series models have?

    Pattern recognition. A pattern is the regularity (in other words, the information) hidden within random and disordered phenomena, yet the regularities we take for granted sometimes lead us around by the nose. For example, you read in a book that supermarket customers who buy beer usually have a higher probability of buying diapers...
  • (8) How do convolutional neural networks handle one-dimensional time series data? Overview: many articles focus on two-dimensional convolutional neural networks (2D CNNs), especially for image recognition, while one-dimensional convolutional neural networks (1D CNNs) are covered only to a certain extent, for example in natural language processing (NLP)...
