Nonparametric Bayesian Models for Unsupervised Learning

Date

2011-05-25

Authors

Wang, Pu

Abstract

Unsupervised learning is an important topic in machine learning. In particular, clustering is an unsupervised learning problem that arises in a variety of applications for data analysis and mining. Unfortunately, clustering is an ill-posed problem and, as such, a challenging one: no ground-truth that can be used to validate clustering results is available. Two issues arise as a consequence. First, various clustering algorithms embed their own bias resulting from different optimization criteria. As a result, each algorithm may discover different patterns in a given dataset. The second issue concerns the setting of parameters. In clustering, parameter setting controls the characterization of individual clusters, and the total number of clusters in the data. Clustering ensembles have been proposed to address the issue of different biases induced by various algorithms. Clustering ensembles combine different clustering results, and can provide solutions that are robust against spurious elements in the data. Although clustering ensembles provide a significant advance, they do not satisfactorily address the model selection and parameter tuning problems. Bayesian approaches have been applied to clustering to address the parameter tuning and model selection issues. Bayesian methods provide a principled way to address these problems by assuming prior distributions on model parameters. Prior distributions assign low probabilities to parameter values which are unlikely. Therefore they serve as regularizers for model parameters, and can help avoid over-fitting. In addition, the marginal likelihood is used by Bayesian approaches as the criterion for model selection. Although Bayesian methods provide a principled way to perform parameter tuning and model selection, the key question "How many clusters?" is still open. This is a fundamental question for model selection. 
A special class of Bayesian methods, nonparametric Bayesian approaches, has been proposed to address this important model selection issue. Unlike parametric Bayesian models, for which the number of parameters is finite and fixed, nonparametric Bayesian models allow the number of parameters to grow with the number of observations. After observing the data, nonparametric Bayesian models fit the data with finite-dimensional parameters. An additional issue with clustering is high dimensionality. High-dimensional data pose a difficult challenge to the clustering process. A common scenario with high-dimensional data is that clusters may exist in different subspaces composed of different combinations of features (dimensions). In other words, data points in a cluster may be similar to each other along a subset of dimensions, but not in all dimensions. Subspace clustering techniques, a.k.a. co-clustering or bi-clustering, have been proposed to address the dimensionality issue (here, I use the term co-clustering). Like clustering, co-clustering also suffers from an ill-posed nature and the lack of ground-truth to validate the results. Although attempts have been made in the literature to address individually the major issues related to clustering, no previous work has addressed them jointly. In my dissertation I propose a unified framework that addresses all three issues at the same time. I designed a nonparametric Bayesian clustering ensemble (NBCE) approach, which assumes that multiple observed clustering results are generated from an unknown consensus clustering. The underlying distribution is assumed to be a mixture distribution with a nonparametric Bayesian prior, i.e., a Dirichlet Process. The number of mixture components, a.k.a. the number of consensus clusters, is learned automatically. By combining the ensemble methodology and nonparametric Bayesian modeling, NBCE addresses both the ill-posed nature and the parameter setting/model selection issues of clustering. 
Furthermore, NBCE outperforms individual clustering methods, since it can escape local optima by combining multiple clustering results. I also designed a nonparametric Bayesian co-clustering ensemble (NBCCE) technique. NBCCE inherits the advantages of NBCE, and in addition it is effective with high-dimensional data. As such, NBCCE provides a unified framework to address all three aforementioned issues. NBCCE assumes that multiple observed co-clustering results are generated from an unknown consensus co-clustering. The underlying distribution is assumed to be a mixture with a nonparametric Bayesian prior. I developed two models to generate co-clusters in terms of row- and column-clusters. In the first model, row- and column-clusters are assumed to be independent, and NBCCE assumes two independent Dirichlet Process priors on the hidden consensus co-clustering, one for rows and one for columns. The second model captures the dependence between row- and column-clusters by assuming a Mondrian Process prior on the hidden consensus co-clustering. Combined with Mondrian priors, NBCCE provides more flexibility to fit the data. I have performed an extensive evaluation on relational data and protein-molecule interaction data. The empirical evaluation demonstrates the effectiveness of NBCE and NBCCE and their advantages over traditional clustering and co-clustering methods.
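
The abstract's point that a Dirichlet Process prior lets the number of clusters grow with the number of observations can be illustrated with the Chinese Restaurant Process, a standard sequential view of the partition a Dirichlet Process induces. The sketch below is only an illustration of that prior and is not the NBCE or NBCCE model from the dissertation; the function name sample_crp, the concentration parameter alpha, and the seed argument are assumptions introduced here.

```python
import random


def sample_crp(n_points, alpha, seed=0):
    """Draw cluster labels for n_points from a Chinese Restaurant Process.

    alpha (> 0) is the concentration parameter: larger values make new
    clusters more likely. Hypothetical helper, for illustration only.
    """
    rng = random.Random(seed)
    assignments = []  # assignments[i] = cluster label of point i
    counts = []       # counts[k] = number of points currently in cluster k
    for _ in range(n_points):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    # The number of occupied clusters is not fixed in advance; it is a
    # random quantity that tends to increase slowly as more points arrive.
    for n in (10, 100, 1000):
        _, counts = sample_crp(n, alpha=1.0)
        print(n, "points ->", len(counts), "clusters")
```

Under this prior the number of clusters is never specified by the user; it emerges from the data size and alpha, which is the property the abstract relies on when it says the number of consensus clusters is learned automatically.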

Keywords

Unsupervised Learning, Clustering, Bayesian Nonparametrics, Clustering Ensembles
