Data Compression in Statistical Learning by Means of Quantized Random Projection






This dissertation explores advancements in random projection, a dimensionality reduction method that reduces the number of variables in a dataset via multiplication with a smaller, randomly generated matrix. Compared to other dimensionality reduction methods, random projection offers very fast operation and a small memory footprint, at the cost of potentially greater information loss. This dissertation investigates the effects of quantization on random projection, particularly in conjunction with machine learning algorithms. We demonstrate that in many settings the amount of information loss can be managed so that it is of minimal practical consequence. Beyond establishing an effective method of data compression for machine learning, this dissertation provides formulae and code that allow future researchers, or anyone who seeks to reduce their data under certain circumstances, to determine an optimal compression scheme. We also believe that the structure of our proofs can be generalized to suit a variety of machine learning algorithms or quantization methods.

The first two chapters of this dissertation introduce the topic and provide background information, respectively. Chapter 2 goes into significant algebraic detail to provide context for understanding the following chapters. Chapter 3 investigates quantization, comparing two different quantization methods both individually and in combination with random projection; it focuses on proving bounds on mean squared error. Chapter 4 focuses on random projection in conjunction with spectral clustering, an algorithm whose strengths align well with those of random projection; again the focus is on analyzing bounds. Finally, Chapter 5 explores these algorithms through experiments. Two simulation experiments further examine the technical details of spectral clustering and random projection, before we close with real-data experiments that compare random projection to principal components analysis.
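The core pipeline the abstract describes can be illustrated in a few lines. The sketch below is a minimal, hypothetical illustration, not the dissertation's actual code: it projects data through a Gaussian random matrix scaled by 1/sqrt(k) (one common choice; the dissertation may use other distributions), then applies a simple uniform scalar quantizer as a stand-in for the quantization schemes studied in Chapter 3. The dimensions, step size, and seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Original data: n points in d dimensions.
n, d, k = 100, 1000, 50
X = rng.standard_normal((n, d))

# Random projection: multiply by a d x k Gaussian matrix, scaled by
# 1/sqrt(k) so that squared distances are preserved in expectation
# (the Johnson-Lindenstrauss setting).
R = rng.standard_normal((d, k)) / np.sqrt(k)
X_proj = X @ R  # n x k, a much smaller representation

# Uniform scalar quantization of the projected data: round each
# coordinate to a grid of width `step` (an illustrative stand-in for
# the quantization methods compared in Chapter 3).
step = 0.5
X_quant = step * np.round(X_proj / step)

# Distances between points should survive the projection approximately.
orig_dist = np.linalg.norm(X[0] - X[1])
proj_dist = np.linalg.norm(X_proj[0] - X_proj[1])
ratio = proj_dist / orig_dist  # close to 1 with high probability
```

Note the trade-off the abstract refers to: shrinking k makes the compressed representation smaller and the projection faster, but widens the spread of `ratio` around 1, and coarser quantization (larger `step`) adds a further, bounded distortion of at most `step / 2` per coordinate.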