Mining Variable-Length Motifs in Large-Scale Time Series Data



With the widespread use of sensor networks, large-scale time series have become ubiquitous in both industrial processes and research applications. How to mine useful information from such time series data and make decisions based on it has become a popular topic in various research fields, including medicine, meteorology, biology, and astronomy. The task of detecting repeated patterns in time series, known as motif discovery, has received a great deal of attention in recent years. The discovered motifs play an essential role in many time series data mining tasks, such as data visualization, classification, and clustering. Despite the significant advances in motif discovery research over the past decade, detecting motifs in large-scale time series remains a challenging problem. Moreover, in some downstream tasks that use motifs, having motifs of different lengths is crucial, because variable-length patterns can naturally co-exist in a time series and represent different aspects of the data. The task of finding patterns of different lengths is called variable-length motif discovery. Compared with finding motifs of a single length, the search space is dramatically larger, and the process is therefore even more time-consuming. As a result, most variable-length motif discovery algorithms focus on small to medium-sized datasets (e.g., datasets containing approximately one hundred thousand sample points). At the same time, large volumes of time series data are generated in daily life. The lack of a reliable variable-length motif discovery algorithm for large-scale time series has become one of the most critical obstacles to realizing the full potential of motifs as a useful time series primitive. Motivated by this challenge, in this dissertation we introduce a series of time- and space-efficient approximate algorithms for detecting variable-length motifs.
The proposed methods enable motif discovery in large-scale time series, which ultimately benefits a wide range of downstream research tasks. Specifically, we introduce three algorithms to tackle the following challenging tasks in variable-length motif discovery for large-scale time series:

Task I: mining motifs in time series at a scale of over one hundred million points
Task II: mining motifs with significantly different length scales
Task III: mining co-evolving subdimensional motifs

For Task I, we introduce a grammar-induction-based motif discovery framework named DP-Sequitur. DP-Sequitur is designed for detecting motifs with a relatively small length difference in large-scale time series data. For Task II, we introduce an algorithm named Hierarchy based Motif Enumeration (HIME). HIME is designed for detecting variable-length motifs with a large length range in million-scale time series. Finally, for Task III, we introduce an algorithm named Collaborative Hierarchy based Motif Enumeration (CHIME). CHIME addresses a previously almost untouched problem: finding co-evolving subdimensional patterns of different lengths in multivariate time series. We demonstrate that all of the proposed algorithms can efficiently detect meaningful variable-length motifs in various large-scale, real-world time series. Ultimately, the proposed algorithms can benefit various downstream tasks such as data visualization, classification, clustering, anomaly detection, and rule discovery.