Efficient Data Splitting Methods for Machine Learning Model Fitting




Betrouni, Redouane




In this PhD dissertation, I develop a new sampling method, which I name PCA-Systematic sampling, as an improved form of stratified systematic sampling for optimally splitting data into training and testing subsets. This procedure helps machine learning algorithms avoid the classical mistake of overfitting. Although it may be slightly more computationally expensive, it compensates for this apparent weakness by giving a better estimate of test error and improving prediction accuracy. The dissertation provides computational and theoretical evidence for the benefits of the proposed sampling design over traditional approaches. Examples and mathematical evidence show how traditional splitting methods, such as partitioning data by simple random sampling, can distort the relationship between important covariates and the variable of interest in the test dataset, and as a consequence lead to either poor model construction or poor assessment of model fit. I also create a sampling utility score index as a data quality control tool for assessing data splits or sampling designs. I derive and study the mathematical properties of this index and conduct a sensitivity analysis to investigate how it behaves under different sampling design scenarios. Finally, this dissertation contributes to the fields of survey sampling and predictive modeling by implementing the newly developed methodology on three distinct publicly available datasets. I show how the PCA-Systematic sampling design can be applied to real survey data such as the Annual Survey of Public Employment and Payroll (ASPEP) and the American Housing Survey (AHS), and I provide evidence of improved estimates in comparison with traditional systematic sampling.
My novel PCA-Stratified-Systematic sampling method outperforms current state-of-the-art sampling methods on the classification problem for Fisher's Iris data.
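
The abstract does not spell out the PCA-Systematic procedure, so the following is only a minimal sketch of one plausible reading, not the dissertation's actual algorithm. It assumes that records are ordered by their score on the first principal component and that every k-th record, from a random start, is assigned to the test set. The function name `pca_systematic_split` and its parameters are hypothetical and introduced here for illustration only.

```python
import numpy as np

def pca_systematic_split(X, test_fraction=0.2, seed=0):
    """Hypothetical sketch (an assumption, not the author's method):
    order rows by first principal component score, then take every
    k-th row from a random start as the test set."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    k = int(round(1.0 / test_fraction))  # systematic sampling interval

    # Centre and standardise the columns, guarding against zero variance.
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0)
    std[std == 0] = 1.0
    Z = Xc / std

    # The first right singular vector is the first principal direction.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ vt[0]

    # Sort rows by score, then sample systematically along that ordering.
    order = np.argsort(scores, kind="stable")
    rng = np.random.default_rng(seed)
    start = rng.integers(k)  # random start in [0, k)
    test_idx = np.sort(order[start::k])
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    return train_idx, test_idx

# Usage on synthetic data (150 rows, 4 columns, purely illustrative).
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(150, 4))
train_idx, test_idx = pca_systematic_split(X_demo, test_fraction=0.2, seed=1)
print(len(train_idx), len(test_idx))
```

Under this assumed design, the test rows are spread evenly along the dominant axis of variation in the covariates, which is one way a split could avoid the distortion attributed above to simple random sampling.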



Statistics, Computer science, Mathematics, Data Splitting, Holdout Method, Machine learning, PCA Systematic Sampling, Principal Component Analysis, Sampling Utility