Exemplar-driven Learning for Data Clustering




Mani, Priya




Clustering is a fundamental machine learning problem that seeks to discover groups in data based on a notion of similarity. The problem is ill-posed because what counts as an optimal clustering depends on the application at hand. The solution obtained depends on the characteristics of the data as well as on the design choices of the algorithm. Clustering high-dimensional data poses additional challenges because estimates of distance and density become unreliable. High-dimensional data often reside in subspaces, which entails discovering the dimensions relevant to each cluster. Another challenge in high-dimensional spaces is the emergence of the hubness phenomenon, whereby a few data points, known as hubs, appear frequently as nearest neighbors of other points. While certain hubs capture useful clustering properties of the data, others can negatively influence neighborhood computation and clustering results. As such, not all data points are beneficial for clustering, and more accurate and reliable clustering solutions can be obtained by leveraging an informative subset of the data. In my dissertation, I propose data-driven and adaptive selection strategies that leverage exemplars to guide the optimization of unsupervised learning algorithms for data clustering. I introduce a new geometric characterization of hubs to guide the discovery of subspace clusters, and a hubness-driven algorithm to find subspace clusters in high-dimensional data. Furthermore, I leverage selective neighborhoods to approximate the data manifold and to regularize non-negative matrix factorization for data clustering. As a result, I design an unsupervised manifold-regularized matrix factorization algorithm that jointly learns a sparse set of representatives and their neighbor affinities, along with the data factorization. I further propose a fast and effective approximation of my approach by relaxing the selectivity constraints on the data.
Finally, data exemplars can be leveraged to learn unsupervised deep representations. To this end, I use hubs to regularize a variational auto-encoder and to learn a discriminative embedding for unsupervised downstream tasks. I introduce an unsupervised and data-driven regularization of the latent space using a mixture of hub-based priors and a hub-based contrastive loss. I evaluate the quality of data clustering and generative modeling within the learned latent embedding, and achieve competitive performance with respect to state-of-the-art methods on benchmark data.
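
The hubness phenomenon described above can be made concrete with a small sketch. The code below is only an illustration and not the dissertation's method: it counts how often each point appears in the k-nearest-neighbor lists of the other points (often called the k-occurrence count), so that points with unusually high counts can be read as hubs. The function name `k_occurrence`, the Euclidean distance, the synthetic Gaussian data, and the choice `k=10` are all assumptions made for this example.

```python
import numpy as np


def k_occurrence(X, k=10):
    """Count how often each point appears in other points' k-NN lists.

    Hypothetical helper for illustration; uses Euclidean distance.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # A point is never counted as its own neighbor.
    np.fill_diagonal(d2, np.inf)
    # Indices of the k nearest neighbors of each point (one row per point).
    knn = np.argsort(d2, axis=1)[:, :k]
    # Number of neighbor lists in which each point occurs.
    return np.bincount(knn.ravel(), minlength=n)


# Synthetic high-dimensional data (an assumption, not the benchmark data).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 100))
counts = k_occurrence(X, k=10)

# Points with the largest counts are hub candidates.
hub_candidates = np.argsort(counts)[::-1][:5]
print("largest k-occurrence counts:", counts[hub_candidates])
print("mean k-occurrence:", counts.mean())
```

Every point contributes exactly k entries to the neighbor lists, so the mean count is always k; hubness shows up as a skewed distribution in which a few points have counts far above that mean.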



Computer science, Exemplars, Hubness phenomenon, Matrix factorization, Selective Regularization, Subspace Clustering, Variational auto-encoders