Surprise Discovery in Scientific Databases: A Framework for Data Intensive Science Utilizing the Power of Citizen Science

Vedachalam, Arun

dc.contributor.advisor	Borne, Kirk
dc.contributor.author	Vedachalam, Arun
dc.creator	Vedachalam, Arun
dc.date	2016-05-15
dc.date.accessioned	2017-01-26T22:07:51Z
dc.date.available	2017-01-26T22:07:51Z
dc.description.abstract	The ability to collect and analyze massive amounts of data is rapidly transforming science, industry, and everyday life. Too often in the real world, information from multiple sources, such as humans, experts, and agents, needs to be integrated to support a scientific discovery. This holds true for modern sky surveys in astronomy, whose common theme is that they produce from hundreds of terabytes (TB) up to 100 (or more) petabytes (PB), both in the image data archive and in the object catalogs. For example, the LSST will produce a 20-40 PB catalog database. Such large sky surveys have enormous potential to enable countless astronomical discoveries. The discoveries will span the full spectrum of statistics: from rare object types to complete statistical and astrophysical specifications of many classes of objects. The challenges faced by this data-driven approach often revolve around two major issues: 1) the lack of expert labels in the database and 2) the lack of sufficient knowledge in the database for identifying the known expert labels. In this dissertation, we first discuss a novel approach to finding interesting objects (novelty/surprise/anomaly detection) that enables scientists to discover the most interesting scientific knowledge hidden within large, high-dimensional datasets. We then move on to utilizing the power of citizen science in identifying features, where the goal is to determine indicators based solely on those automated pipeline-generated attributes in the astronomical database that correlate most strongly with the patterns identified through visual inspection of galaxies by the Galaxy Zoo volunteers. We further expand this capability to latent variable models, in which the hidden/latent variables extracted from the citizen science data help bridge the gap between the human-generated classifications and the features not captured by the astronomy data pipeline.
Proper utilization of these latent variables helped unearth new classes or, in some cases, the most representative/interesting samples that were previously unknown to astronomers. These interesting objects act as a training set for machine learning algorithms and can be used to build automated models to classify galaxies from future sky surveys such as LSST.
dc.identifier.uri	https://hdl.handle.net/1920/10514
dc.language.iso	en_US
dc.rights	Copyright 2016 Arun Vedachalam
dc.title	Surprise Discovery in Scientific Databases: A Framework for Data Intensive Science Utilizing the Power of Citizen Science
dc.type	Dissertation
thesis.degree.discipline	Computational Sciences and Informatics
thesis.degree.grantor	George Mason University
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy in Computational Sciences and Informatics

Files

Original bundle

Name: Vedachalam_gmu_0883E_11187.pdf
Size: 2.19 MB
Format: Adobe Portable Document Format
Description: Vedachalam-etd

Collections

College of Engineering and Computing