A Study of Administrative Data Representation for Machine Learning




Asadzadehzanjani, Negin




Administrative data, including medical claims, are frequently used to train machine learning models that predict patient outcomes. Despite many efforts to use administrative codes (medical codes) from claims data, little systematic work has been done to understand how these codes should be represented before model construction. Traditionally, the presence or absence of codes representing diagnoses or procedures over a fixed period, typically one year, is used (Binary Representation). More recently, some studies have incorporated temporal information into the data representation, for example by counting occurrences, calculating the time since diagnosis, or using multiple time windows. However, these methods do not capture temporal information comprehensively: much of it, such as the exact time at which an event occurred and the exact sequence of events, is lost. This dissertation presents the development and investigation of two additional methods of administrative data representation, Temporal Min-Max and Trajectory Representation, which are specific to diagnoses extracted from claims data and are applied before machine learning algorithms. It then presents a large-scale experimental evaluation that compares these methods with the traditional Binary Representation on four classification problems: one-year mortality, high utilization of medical services, chronic kidney disease, and congestive heart failure. The results show that the optimal data representation is problem-dependent, so the representation parameters must be optimized as part of the modeling process.
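
As a rough illustration of the baseline representations mentioned in the abstract (presence/absence, counting, and time from diagnosis over a fixed window), the following minimal Python sketch builds all three from a toy claims list. It is not the dissertation's implementation: the claims records, the diagnosis codes, the `represent` function, the `index_date` and `window_days` parameters, and the choice to measure time from the earliest in-window occurrence are all assumptions made for the example. Temporal Min-Max and Trajectory Representation are not sketched because the abstract does not define them.

```python
from datetime import date

# Hypothetical toy claims: (patient_id, diagnosis_code, service_date).
claims = [
    ("p1", "I50", date(2019, 3, 1)),
    ("p1", "I50", date(2019, 9, 15)),
    ("p1", "N18", date(2019, 11, 2)),
    ("p2", "N18", date(2018, 6, 30)),   # falls outside the one-year window
    ("p2", "E11", date(2019, 12, 20)),
]


def represent(claims, index_date, window_days=365):
    """Build per-patient binary, count, and days-since-first features.

    Only claims dated within `window_days` before `index_date` are used.
    """
    codes = sorted({code for _, code, _ in claims})
    features = {}
    for patient, code, when in claims:
        # Every patient gets a full row, even if all claims are out of window.
        row = features.setdefault(patient, {
            "binary": dict.fromkeys(codes, 0),
            "count": dict.fromkeys(codes, 0),
            "days_since_first": dict.fromkeys(codes, None),
        })
        age = (index_date - when).days
        if not 0 <= age < window_days:
            continue  # claim is outside the observation window
        row["binary"][code] = 1          # presence/absence
        row["count"][code] += 1          # number of occurrences
        prev = row["days_since_first"][code]
        # Keep the largest age, i.e. the earliest occurrence in the window.
        row["days_since_first"][code] = age if prev is None else max(prev, age)
    return features


features = represent(claims, index_date=date(2020, 1, 1))
```

In this toy data, the binary view records only that patient `p1` has code `I50`, the count view records that it appears twice, and the time-based view records how long before the index date it first appeared. That progression illustrates the kind of temporal detail a purely binary representation discards.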



Public health, Artificial intelligence, Health sciences, Data Preprocessing, Health Informatics, Medical Claims, Supervised Learning, Temporal Machine Learning