Incorporating Knowledge from Authoritative Medical Ontologies in Causal Bayesian Networks Learned from Observational Patient Data






Causal modeling of observational patient data infers causal relationships among symptoms and diseases and is a key focus in epidemiology. Clinicians in epidemiology and data scientists in health policy often ask questions related to causality, such as “which symptom came first?”, “which symptoms caused the other symptoms?”, and “how effective is the treatment?”. Many methods and algorithms learn causal models from patient data in an attempt to answer these questions. Directed networks learned from observational patient datasets are used to infer causation among comorbid diseases and symptoms. Complex diseases progress through multiple stages and involve multiple interrelated symptoms which develop over time. When clinicians understand the progression of a disease, treatment can be prescribed both to relieve the symptoms and to stop the disease from fully developing. An accurate understanding of the causal relationships among symptoms gives clinicians and patients a variety of options for the treatment of complex diseases. A Bayesian Network (BN) is a popular framework for causal studies and for representing causal relationships among multiple variables. Causal relationships and their associated conditional probabilities can be represented in the structure of a BN as nodes and edges, creating a Causal Bayesian Network (CBN). This framework enables us to reason under uncertainty and to model and measure relationships among symptoms. Furthermore, a CBN provides a means to visualize the relationships and interactions among comorbid symptoms. A patient dataset is typically analyzed by applying a pre-existing algorithm to the data and then examining the results. Certain algorithms perform better than others in terms of efficiency, stability, and simplicity.
However, using a learned BN to infer causality requires an understanding of the underlying causal model, which relies on information external to the data and the algorithm. BNs learned from data do not consider or capture: 1) prior knowledge or expertise from epidemiologists or other authoritative sources, 2) any causal mechanisms which may be known about the disease, and 3) any contextual evidence or confounding variables not captured during the data collection process. A common source of prior knowledge or expertise is a time-based epidemiological sequence of disease progression observed in patients over time. In situations where the data is cross-sectional and a longitudinal sequence among the symptom variables cannot be determined from the data alone, another method is required to find a sequence or ordering of the symptoms. Authoritative medical ontologies (AMOs) are designed to standardize and enable knowledge sharing in specific disease domains. AMOs can provide causal mechanisms and causal knowledge regarding disease progression to inform causal models. Given a disease domain, AMOs contain non-temporal ordered-variable pairs which can be used to orient the structure of a CBN learned by an algorithm. Additionally, the prior expertise that exists in these ordered-variable pairs provides context for the diseases and symptoms. This dissertation establishes the use of AMOs as sources of prior knowledge for learning CBNs, which increases congruence between ontological knowledge and the dataset. Three AMOs are used to collect prior knowledge: 1) the Medical Dictionary for Regulatory Activities Terminology (MedDRA), 2) the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM), and 3) the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT).
The knowledge from these three AMOs is used to orient the CBNs learned from three datasets: 1) Sequenced Treatment Alternatives to Relieve Depression (STAR*D), 2) the Icahn School of Medicine’s Asthma Mobile Health Study, and 3) the National Alzheimer's Coordinating Center’s (NACC) Uniform Data Set version 3. The use of AMOs as sources of prior knowledge is established by means of a methodology to extract ontological knowledge and apply it to a CBN. We selected the Max-Min Hill-Climbing (MMHC) algorithm for our experiments after testing the predictive accuracy of six algorithms (Grow-Shrink, Hill-Climbing, Tabu, Max-Min Hill-Climbing, Restricted Maximization, and Hybrid HPC) on our three datasets. The methodology contains four steps. First, a Baseline CBN is created using only the dataset and the MMHC algorithm. Second, data variables are mapped to AMOs, and causal mechanisms and potentially causal relationships are recorded in the form of ordered-variable pairs. These ordered-variable pairs exist in the form of 1) codification of symptoms and references among codes, 2) explicit causal relationship types, 3) qualitative knowledge of the disease progression in AMO browsers, and 4) subsumption relationships. Subsumption relationships are often associative but produce relevant contextual evidence for symptoms. Third, the MMHC algorithm is modified to orient the CBN structure by whitelisting or blacklisting ordered-variable pairs. In the experiments, we test both whitelisting and blacklisting on STAR*D, but for the NACC and Icahn datasets we focus on blacklisting and leave arc learning to the algorithm and the data. Orienting the structure in this way forces the algorithm to consider relationships which are explicit in an AMO. The pairs are selected from a single ontology or as a collection representing knowledge across several ontologies.
Specifically, two collections representing a “smart” selection of the pairs are utilized: a collection of pairs which exist in at least two AMOs, and a collection of pairs which exist in one AMO and are verified using clinical information in ICD-10-CM’s browser. Finally, the Baseline and Modified CBNs are compared using the following metrics: 1) k-fold cross-validated AUC (Area Under the Receiver Operating Characteristic Curve), 2) predictive accuracy for a single node of interest (Remission for STAR*D, Dementia for NACC, and COPD for Icahn), and 3) goodness-of-fit using cross-validation and log-likelihood loss (also known as negative entropy). By comparing arcs in the Baseline and Modified CBNs, new relationships in the Modified CBN can be substantiated in the existing epidemiological literature. With the incorporation of causal knowledge from AMOs as a blacklist, the resulting Modified CBNs have significantly changed structures. For STAR*D, 12/20 arcs are changed in the Modified CBN (agreement of 63%) when compared to the Baseline. Cross-validated AUC shows that the Modified CBN for STAR*D performed better on average than the Baseline, with an average of 0.8622 vs. 0.8545. The Modified CBN also outperformed the Baseline in predicting remission via Citalopram, with a lower predictive error of 0.3941 vs. 0.4048. For the NACC Alzheimer’s study, the Modified CBN had an arc agreement of 42/48 arcs (87%) with the Baseline. The NACC Modified CBN had a higher average AUC of 0.9592 vs. 0.9514 for the Baseline. The Modified CBN also outperformed the Baseline in predicting Dementia, with a lower predictive error of 0.1156 vs. 0.1234. For the Icahn Asthma data, the Modified model was in complete agreement with the Baseline. The Baseline model had an average AUC of 0.6953. This CBN was used to predict COPD, with an expected loss of 0.06353.
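The arc comparison between a Baseline and a Modified CBN can be illustrated with a minimal Python sketch. It assumes that "agreement" means the fraction of Baseline arcs that reappear with the same direction in the Modified network; the abstract does not state the exact definition used, and the function name and the toy arcs below are hypothetical and are not taken from the STAR*D, NACC, or Icahn results.

```python
def arc_agreement(baseline_arcs, modified_arcs):
    """Compare two directed arc lists given as (parent, child) tuples.

    Assumption: agreement = shared directed arcs / number of Baseline arcs.
    """
    baseline = set(baseline_arcs)
    modified = set(modified_arcs)

    # Arcs present in both networks with the same orientation.
    shared = baseline & modified
    # Baseline arcs whose direction was flipped in the Modified network.
    flipped = {(a, b) for (a, b) in baseline - modified if (b, a) in modified}

    agreement = len(shared) / len(baseline) if baseline else 1.0
    return {"shared": shared, "flipped": flipped, "agreement": agreement}


# Toy, hypothetical networks used only to exercise the function.
baseline = [("Insomnia", "Fatigue"), ("Fatigue", "Remission"),
            ("Anxiety", "Insomnia"), ("Appetite", "Fatigue")]
modified = [("Insomnia", "Fatigue"), ("Fatigue", "Remission"),
            ("Insomnia", "Anxiety"), ("Anxiety", "Appetite")]

result = arc_agreement(baseline, modified)
print(result["agreement"], sorted(result["flipped"]))
```

Arcs that are flipped or newly introduced in the Modified network are the natural candidates to check against the epidemiological literature.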
By incorporating causal mechanisms and causal knowledge from medical ontologies in CBNs, we are able to learn Modified networks which are significantly different from Baselines learned from an algorithm and data alone. Despite their different network structures, the Modified CBNs perform better on average than the Baselines in terms of predictive accuracy.
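The selection of ordered-variable pairs and their use as a blacklist, as described above, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the dissertation's implementation: it assumes each pair is stored as (cause, effect), that the blacklist forbids the reverse of every selected pair, and that a structure-learning algorithm accepts a set of forbidden arcs. All function names, variable names, and pairs are hypothetical.

```python
def pairs_in_at_least_two(pairs_by_amo):
    """Keep ordered pairs that appear in two or more ontologies."""
    counts = {}
    for pairs in pairs_by_amo.values():
        for pair in set(pairs):  # count each ontology at most once per pair
            counts[pair] = counts.get(pair, 0) + 1
    return {pair for pair, n in counts.items() if n >= 2}


def blacklist_from_pairs(ordered_pairs):
    """Assumption: forbid the reversed arc (effect -> cause) of each pair."""
    return {(effect, cause) for (cause, effect) in ordered_pairs}


def filter_candidate_arcs(candidates, blacklist):
    """Drop candidate arcs that the blacklist forbids."""
    return [arc for arc in candidates if arc not in blacklist]


# Hypothetical ordered-variable pairs recorded from three ontologies.
pairs_by_amo = {
    "MedDRA": [("Insomnia", "Fatigue"), ("Anxiety", "Insomnia")],
    "ICD-10-CM": [("Insomnia", "Fatigue")],
    "SNOMED CT": [("Anxiety", "Insomnia"), ("Fatigue", "Appetite")],
}

selected = pairs_in_at_least_two(pairs_by_amo)
blacklist = blacklist_from_pairs(selected)

# Candidate arcs an algorithm might propose from the data alone.
candidates = [("Fatigue", "Insomnia"), ("Insomnia", "Fatigue"),
              ("Appetite", "Fatigue")]
allowed = filter_candidate_arcs(candidates, blacklist)
print(sorted(blacklist), allowed)
```

A blacklist only rules out orientations that contradict the ontologies and leaves the remaining arcs to be learned from the data, whereas a whitelist would instead force the selected pairs into the network.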