dc.description.abstract |
In the field of geospatial data discovery, two goals must be met to bridge the gap
between data providers and data consumers: (1) a machine agent or search engine must
be able to identify the distributed data sources that data providers own on the Internet,
and (2) the machine agent must also incorporate human intelligence to find the most
suitable data sources required by data consumers.
To achieve these goals, search algorithms are applied in the data discovery
process so that a machine can automatically retrieve the needed information.
However, most search algorithms focus on discovering general webpages rather
than accounting for the characteristics of data sources in a specific domain, such as
hydrology. This leads to poor performance when a search engine handles
domain-specific queries.
This dissertation presents a number of techniques that address the fundamental
questions in geospatial data discovery: how can relevant geospatial data dispersed
widely across the Web be automatically discovered and collected? Once this information
is found, how can it be encoded from a human-readable format into a
machine-understandable format? And how can a machine incorporate human intelligence
to answer various search questions?
This dissertation starts by developing an active crawler for automatic geospatial data
discovery. Traditional data discovery methods include using a general search engine,
such as Google, or accessing a geospatial Web catalogue, such as Geospatial One Stop
(GOS). However, Google aims to answer generic queries by treating all keywords evenly,
without considering the special characteristics of geospatial data. Relying solely on
Google, the needed services remain hidden in a long list of search results. The drawback
of geospatial Web catalogues is the assumption that all data providers register their
services in the catalogues, which is clearly not the case. In addition, the lack of timely
updates leaves a considerable number of dead links in the catalogues. This dissertation
proposes an accumulative term-frequency-based conditional probability model and
develops a corresponding crawler to solve these problems and discover geospatial data
more efficiently.
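As a minimal sketch of the idea, the snippet below scores candidate links by accumulating Laplace-smoothed estimates of the conditional probability that each domain term leads to a geospatial service. The term list (GEO_TERMS), the AccumulativeScorer class, and the scoring heuristic are illustrative assumptions, not the dissertation's exact model.

```python
# Hypothetical sketch of an accumulative term-frequency-based conditional
# probability scorer for a focused geospatial crawler.
from collections import Counter
import re

# Assumed seed vocabulary of geospatial service indicators.
GEO_TERMS = {"wms", "wfs", "getcapabilities", "ogc", "geospatial", "layer"}

def term_frequencies(text: str) -> Counter:
    """Count occurrences of domain terms in page or anchor text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in GEO_TERMS)

class AccumulativeScorer:
    """Accumulates term statistics over visited pages and scores new
    links by the estimated probability that they lead to a service."""

    def __init__(self) -> None:
        self.hits = Counter()  # term counts on pages that yielded services
        self.seen = Counter()  # term counts on all visited pages

    def update(self, page_text: str, found_service: bool) -> None:
        tf = term_frequencies(page_text)
        self.seen.update(tf)
        if found_service:
            self.hits.update(tf)

    def score(self, anchor_text: str) -> float:
        """Accumulate, per term, the Laplace-smoothed estimate of
        P(service | term), weighted by term frequency in the anchor."""
        return sum(
            n * (self.hits[t] + 1) / (self.seen[t] + 2)
            for t, n in term_frequencies(anchor_text).items()
        )
```

In a focused crawler of this kind, unvisited links would sit in a priority queue ordered by this score, so pages likely to expose geospatial services are fetched first.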
This dissertation then examines the problem of building a domain Knowledge Base
(KB) for modeling data and knowledge from multiple sources. Current approaches
reported in the literature use a controlled vocabulary, which does not encode enough
logical relationships between spatial objects to enable semantic reasoning. To overcome
this drawback, this dissertation proposes a new conceptual model to abstract, map, and
model geospatial knowledge for the hydrology domain. A Web-based tool is designed
and developed so that users with different backgrounds can collaboratively populate the
KB according to the proposed conceptual model. In addition, a semantic reasoning
procedure is implemented to locate all suitable data candidates, enhancing the
performance of the geospatial search engine.
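As an assumed illustration of such reasoning, the sketch below expands a user's query term over a toy hydrology KB using synonym and transitive subclass relations; the triples, relation names, and helpers (descendants, expand_query) are invented for the example and do not reflect the dissertation's actual KB schema.

```python
# Toy rule-based semantic expansion over a small, invented hydrology KB.
SUBCLASS = {              # child -> parent ("a stream is a watercourse")
    "stream": "watercourse",
    "river": "watercourse",
    "watercourse": "water_body",
    "lake": "water_body",
}
SYNONYMS = {"creek": "stream", "brook": "stream"}

def descendants(concept: str) -> set[str]:
    """All concepts whose subclass chain reaches `concept` (transitive)."""
    result = set()
    for child, parent in SUBCLASS.items():
        if parent == concept:
            result |= {child} | descendants(child)
    return result

def expand_query(term: str) -> set[str]:
    """Map a user term to every KB concept that can satisfy it."""
    concept = SYNONYMS.get(term, term)
    return {concept} | descendants(concept)

# A query for "water_body" now also retrieves lakes, rivers, and streams:
print(expand_query("water_body"))
# {'water_body', 'lake', 'watercourse', 'river', 'stream'}  (order may vary)
```

The point of the reasoning step is exactly this closure: a dataset registered as a "stream" can still be returned for a broader "water_body" query.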
To provide data consumers with the best resources, the search engine should be
capable of automatically judging the similarities among spatial objects, as human
beings do. Traditional statistical methods count the co-occurrences or shared information
of objects to measure their similarity. However, human recognition of similarity is
sometimes too complex to be simulated by simple mathematical equations. Given this
reality, a neural-network-based feature matching model is proposed in this dissertation to
realize automatic similarity measurement based on the KB populated as described
above.
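The sketch below shows, under stated assumptions, what such a model could look like: a small feedforward network maps a feature-match vector between two spatial objects (derived from the KB) to a similarity score in [0, 1]. The feature set, the architecture, and the random untrained weights are placeholders; in practice the weights would be learned from human similarity judgments.

```python
# Assumed sketch of a neural-network-based feature matching model:
# a feature-match vector between two spatial objects is mapped to a
# similarity score in [0, 1]. Weights here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Example match features: name overlap, type match, spatial overlap, ...
N_FEATURES = 4
W1 = rng.normal(scale=0.5, size=(N_FEATURES, 8))  # input -> hidden
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))           # hidden -> output
b2 = np.zeros(1)

def similarity(match_vector: np.ndarray) -> float:
    """One forward pass: tanh hidden layer, sigmoid output."""
    h = np.tanh(match_vector @ W1 + b1)
    z = (h @ W2 + b2).item()          # scalar logit
    return 1.0 / (1.0 + np.exp(-z))

# E.g., two objects sharing name and type but with little spatial overlap:
print(similarity(np.array([0.9, 1.0, 0.2, 0.0])))
```

A learned mapping of this kind can capture non-linear interactions between match features that a fixed co-occurrence formula cannot.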
Finally, this dissertation introduces two research projects, the USGS Arctic Spatial
Data Infrastructure and the ESIP Semantic Web Testbed, to demonstrate how the
proposed methodologies are applied in domain applications to solve real-world problems.