Surprise Discovery in Scientific Databases: A Framework for Data Intensive Science Utilizing the Power of Citizen Science

dc.contributor.advisor: Borne, Kirk
dc.contributor.author: Vedachalam, Arun
dc.creator: Vedachalam, Arun
dc.date: 2016-05-15
dc.date.accessioned: 2017-01-26T22:07:51Z
dc.date.available: 2017-01-26T22:07:51Z
dc.description.abstract: The ability to collect and analyze massive amounts of data is rapidly transforming science, industry, and everyday life. Too often in the real world, information from multiple sources, such as humans, experts, and agents, needs to be integrated to support scientific discovery. This holds true for modern sky surveys in astronomy, where the common theme is that they produce hundreds of terabytes (TB) up to 100 (or more) petabytes (PB), both in the image data archive and in the object catalogs. For example, the LSST will produce a 20-40 PB catalog database. Such large sky surveys have enormous potential to enable countless astronomical discoveries. The discoveries will span the full spectrum of statistics: from rare object types to complete statistical and astrophysical specifications of many classes of objects. The challenges faced by this data-driven approach often revolve around two major issues: 1) the lack of expert labels in the database and 2) the lack of sufficient knowledge in the database for identifying the known expert labels. In this dissertation, we first discuss a novel approach to finding interesting objects (novelty/surprise/anomaly detection) that enables scientists to discover the most interesting scientific knowledge hidden within large, high-dimensional datasets. We then move on to utilizing the power of citizen science to identify features, where the goal is to determine indicators based solely on those automated pipeline-generated attributes in the astronomical database that correlate most strongly with the patterns identified through visual inspection of galaxies by the Galaxy Zoo volunteers. We further extend this capability to latent variable models, in which the hidden/latent variables extracted from the citizen science data help bridge the gap between the human-generated classifications and the features not captured by the astronomy data pipeline.
Proper utilization of these latent variables helped unearth new classes or, in some cases, the most representative/interesting samples that were previously unknown to astronomers. These interesting objects act as a training set for machine learning algorithms and can be used to build automated models to classify galaxies from future sky surveys such as LSST.
dc.identifier.uri: https://hdl.handle.net/1920/10514
dc.language.iso: en_US
dc.rights: Copyright 2016 Arun Vedachalam
dc.title: Surprise Discovery in Scientific Databases: A Framework for Data Intensive Science Utilizing the Power of Citizen Science
dc.type: Dissertation
thesis.degree.discipline: Computational Sciences and Informatics
thesis.degree.grantor: George Mason University
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy in Computational Sciences and Informatics

Files

Original bundle
Name: Vedachalam_gmu_0883E_11187.pdf
Size: 2.19 MB
Format: Adobe Portable Document Format
Description: Vedachalam-etd
License bundle
Name: license.txt
Size: 2.52 KB
Format: Item-specific license agreed upon to submission
Description: