Employing Model-Based Systems Engineering to Develop a Data Science Framework for Knowledge Capture, Re-use, and Integration






Data science is an umbrella term for a wide, interdisciplinary collection of technologies and methods that organizations use to extract valuable insights from data across a broad range of application domains. It can be viewed as the process of extracting insight from data and includes tasks such as identifying what data to collect, determining how to collect or process it, and translating insights into actionable items. As the amount of available data grows and the questions that need to be answered become more complex, data science is emerging as a new and important discipline. To meet this increasing demand, the industry is working to industrialize and democratize data science so that organizations and users can build and deploy machine learning models without the extensive skills and knowledge needed to develop these models on their own. Extensive literature and other resources also exist that enable knowledge sharing for data science projects. However, because standardized knowledge management methods are lacking, data scientists currently rely on time-consuming, manual processes in which either a new solution is created from scratch or previous solutions are searched for and modified to fit the new problem. The objective of this work is to improve the initial solution search and selection process for data science projects, and to enable interoperability and reuse of existing solutions from different disciplines in a single integrated workflow, by creating a model-based knowledge repository of data science projects. In this dissertation, I present a data science project management framework that adopts a Model-Based Systems Engineering (MBSE) approach to capture knowledge of process-oriented data science solutions, assist in generating potential solutions for new problem instances, and facilitate interoperability of modules developed in multiple programming languages.
I use this framework to systematically document and formalize the relationships among problem statements, requirements, and solutions for a data science case study that addresses test and evaluation limitations of an insider threat detection system that lacks ground truth. Finally, I use the digital knowledge base of the case study to generate suggested workflows for new problems and evaluate the resulting solutions to assess the effectiveness of the framework.