Weighted Clustering Ensembles




Al-Razgan, Muna Saleh




Clustering is a popular approach to exploratory data analysis and mining. However, clustering faces difficult challenges due to its ill-posed nature. First, it is well known that off-the-shelf clustering methods may discover different patterns in a given set of data, because each clustering algorithm has its own bias resulting from the optimization of different criteria. Second, there is no ground truth against which the clustering result can be validated. High dimensional data also pose a difficult challenge to the clustering process. Various clustering algorithms can handle data with low dimensionality, but as the dimensionality of the data increases, these algorithms tend to break down. In this dissertation, we introduce novel clustering ensemble techniques and novel semi-supervised approaches to address these problems. Clustering ensembles offer a solution to challenges inherent to clustering arising from its ill-posed nature: they can provide more robust and stable solutions by making use of the consensus across multiple clustering results, and they can average out the emergent spurious structures which arise due to the various biases of each participating algorithm, and due to the variance induced by different data samples. We introduce and analyze three new consensus functions for ensembles of subspace clusterings. The ultimate goal of our consensus functions is to provide hard partitions of the data, and weight vectors which convey information regarding the subspaces within which the individual clusters exist. We demonstrate the effectiveness of our three techniques by running experiments with several real datasets, including high dimensional text data, and investigate the issue of diversity and accuracy in our ensemble techniques. We also study scenarios in which limited knowledge on the data (in terms of pairwise constraints) is available from the user.
We develop a methodology to embed such constraints into the ensemble components, so that the desired structure emerges via the consensus clustering. We introduce a mechanism which leverages the ensemble framework to bootstrap informative constraints directly from the data and from the various clusterings, without intervention from the user. We demonstrate the effectiveness of our proposed techniques with experiments using real datasets and other state-of-the-art semi-supervised techniques.
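To make the idea of "consensus across multiple clustering results" concrete, the following is a minimal sketch of a generic co-association consensus. It is not one of the three consensus functions introduced in the dissertation, and it ignores subspace weight vectors entirely. The function names `co_association` and `consensus_partition`, the `threshold` parameter, and the toy label lists are all illustrative assumptions.

```python
from itertools import combinations


def co_association(labelings):
    """Fraction of base clusterings in which each pair of points shares a cluster."""
    n = len(labelings[0])
    counts = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1.0
    m = len(labelings)
    return [[c / m for c in row] for row in counts]


def consensus_partition(labelings, threshold=0.5):
    """Link points co-clustered in more than `threshold` of the runs and
    return the connected components as a hard partition (assumed rule)."""
    coassoc = co_association(labelings)
    n = len(coassoc)
    parent = list(range(n))  # union-find forest over the points

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if coassoc[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Three hypothetical base clusterings of six points. The second uses
# permuted label names; the third misassigns the point at index 2.
runs = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
consensus = consensus_partition(runs)
```

Because the co-association counts depend only on whether two points share a label, the sketch is insensitive to how each base clustering names its clusters, and the single misassignment in the third run is outvoted by the other two.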



Clustering, Ensembles, Subspace Clustering, Consensus functions, Accuracy, Text data