Mason Archival Repository Service

Surprise Discovery in Scientific Databases: A Framework for Data Intensive Science Utilizing the Power of Citizen Science


dc.contributor.advisor Borne, Kirk
dc.contributor.author Vedachalam, Arun
dc.creator Vedachalam, Arun
dc.date 2016-05-15
dc.date.accessioned 2017-01-26T22:07:51Z
dc.date.available 2017-01-26T22:07:51Z
dc.identifier.uri https://hdl.handle.net/1920/10514
dc.description.abstract The ability to collect and analyze massive amounts of data is rapidly transforming science, industry and everyday life. Too often in the real world, information from multiple sources, such as humans, experts and agents, needs to be integrated to support any scientific discovery. This holds true for modern sky surveys in astronomy, where the common theme is that they produce hundreds of terabytes (TB) up to 100 (or more) petabytes (PB), both in the image data archive and in the object catalogs. For example, the LSST will produce a 20-40 PB catalog database. Such large sky surveys have enormous potential to enable countless astronomical discoveries. The discoveries will span the full spectrum of statistics: from rare object types to complete statistical and astrophysical specifications of many classes of objects. The challenges faced by this data-driven approach often revolve around two major issues: 1) the lack of expert labels in the database and 2) the lack of sufficient knowledge in the database for identifying the known expert labels. In this dissertation, we first discuss a novel approach to finding interesting objects (novelty/surprise/anomaly detection) that enables scientists to discover the most interesting scientific knowledge hidden within large, high-dimensional datasets. We then move on to utilizing the power of citizen science in identifying features, where the goal is to determine indicators based solely on those automated pipeline-generated attributes in the astronomical database that correlate most strongly with the patterns identified through visual inspection of galaxies by the Galaxy Zoo volunteers. We further expand this capability to latent variable models, where the hidden/latent variables extracted from the citizen science data help bridge the gap between the human-generated classifications and the features not captured by the astronomy data pipeline.
Proper utilization of these latent variables helped unearth new classes or, in some cases, the most representative or interesting samples that were previously unknown to astronomers. These interesting objects act as a training set for machine learning algorithms and can be used to build automated models to classify galaxies from future sky surveys such as LSST.
dc.language.iso en_US en_US
dc.rights Copyright 2016 Arun Vedachalam en_US
dc.title Surprise Discovery in Scientific Databases: A Framework for Data Intensive Science Utilizing the Power of Citizen Science en_US
dc.type Dissertation en_US
thesis.degree.name Doctor of Philosophy in Computational Sciences and Informatics en_US
thesis.degree.level Doctoral en_US
thesis.degree.discipline Computational Sciences and Informatics en_US
thesis.degree.grantor George Mason University en_US

