A Framework for Finding Patterns in Mixed and Streaming Data




Pattern mining is an umbrella term for data mining algorithms that aim to find relationships between attributes, such as association rules, frequent sets, contrast patterns, and emerging patterns. While pattern mining is a well-researched topic in data mining, with applications in diverse disciplines, some open problems have not been addressed by existing work. We summarize the problems encountered and addressed in this dissertation. In many real-world applications, such as manufacturing, data contain both continuous and categorical attributes. We propose novel methodologies to find patterns in datasets with such mixed attributes. More specifically, our algorithm dynamically discretizes the continuous attributes in an itemset in a supervised fashion. We propose a top-down recursive approach to find intervals for continuous attributes that result in statistically significant patterns. As opposed to a global discretization scheme, where each attribute is discretized exactly once, our approach allows local discretization: any continuous attribute can be discretized in different ways depending on the consequent. This makes it possible to capture different inter-variable relationships. We evaluate our algorithm on several synthetic and real datasets, including the Intel manufacturing data that motivated this research. The experimental results and analysis indicate that our algorithm is capable of finding more meaningful rules for multivariate data than existing algorithms. In many real-world scenarios, the data also arrive as a stream, and the goal is to find and maintain the most current representation or model of the data. General challenges for streaming data include high arrival rates, detecting and handling concept drift, and updating the model in reasonable time. Because the data arrive quickly, we propose a weighted average method using a sliding window to update patterns.
Our updating strategy detects concept drift, detects anomalous patterns, and provides a consistent view of the data. To handle large volumes of data, we propose scalable solutions using pruning and parallelization. Most pattern mining algorithms rely on pruning to be computationally feasible. This works well if the data fit in main memory. In a parallel environment, however, it may not be possible to share pruning information among nodes. We propose a way to divide the data among nodes that maximizes pruning. Building on the methods we developed for finding association rules and frequent sets, we developed algorithms to find contrast patterns in large, streaming, and mixed data. By incorporating feedback from users, we improve the quality of the patterns discovered and shown to the user.
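
To make the sliding-window weighted average concrete, the following is a minimal sketch of one way such an update could look. It is not the dissertation's algorithm: the abstract does not give the weighting scheme, so the class name `WindowedSupport`, the batch-based window, and the geometric `decay` weights are all assumptions made for illustration.

```python
from collections import deque


class WindowedSupport:
    """Tracks one pattern's support over the most recent batches (illustrative only)."""

    def __init__(self, max_batches=4, decay=0.5):
        # Hypothetical parameter: each older batch weighs `decay` times the next newer one.
        self.decay = decay
        # The deque drops the oldest batch automatically once the window is full.
        self.supports = deque(maxlen=max_batches)

    def add_batch(self, transactions, pattern):
        """Record the fraction of transactions in this batch that contain the pattern."""
        pattern = frozenset(pattern)
        hits = sum(1 for t in transactions if pattern <= set(t))
        self.supports.append(hits / len(transactions) if transactions else 0.0)

    def weighted_support(self):
        """Weighted average over the window; the newest batch has weight 1."""
        if not self.supports:
            return 0.0
        n = len(self.supports)
        weights = [self.decay ** (n - 1 - i) for i in range(n)]
        return sum(w * s for w, s in zip(weights, self.supports)) / sum(weights)
```

Under these assumptions, a large gap between the newest batch's support and the weighted average could serve as a simple signal that a pattern is changing, although the drift and anomaly tests actually used in the dissertation are not specified here.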