Automated Data Discovery, Reasoning and Ranking in Support of Building an Intelligent Geospatial Search Engine




Li, Wenwen




In the field of geospatial data discovery, two goals must be met to bridge the gap between data providers and data consumers: (1) a machine agent or search engine must be able to identify the distributed data sources owned by data providers on the Internet, and (2) the machine agent must also incorporate human intelligence to find the data sources most suitable for data consumers. To achieve these goals, search algorithms are applied in the data discovery process so that a machine can automatically retrieve the needed information. However, most search algorithms focus on discovering general webpages rather than considering the characteristics of data sources in a specific domain, such as hydrology. As a result, search engines perform poorly when handling domain-specific queries.

This dissertation presents a number of techniques that address the fundamental questions in geospatial data discovery: How can relevant geospatial data dispersed widely on the Web be automatically discovered and collected? Once this information is found, how can it be encoded from a human-readable format into a machine-understandable format? And how can the machine incorporate human intelligence to answer various search questions?

This dissertation starts by developing an active crawler for automatic geospatial data discovery. Traditional data discovery methods include using general search engines, such as Google, or accessing geospatial Web catalogues, such as Geospatial One Stop (GOS). However, Google aims to answer generic queries by treating all keywords evenly, without considering the special characteristics of geospatial data. A user relying solely on Google will find the needed services buried in a long list of search results. The drawback of geospatial Web catalogues is that they assume all data providers register their services in the catalogues, an assumption that evidently does not hold. In addition, the lack of timely updates leaves a considerable number of dead links in the catalogues. This dissertation proposes an accumulative term-frequency-based conditional probability model and develops a corresponding crawler to solve these problems and discover geospatial data more efficiently.

This dissertation then examines the problem of building a domain Knowledge Base (KB) for modeling data and knowledge from multiple sources. Current approaches reported in the literature use a controlled vocabulary, which does not encode enough logical relationships between spatial objects to enable semantic reasoning. To overcome this drawback, this dissertation proposes a new conceptual model to abstract, map, and model geospatial knowledge for the hydrology domain. A Web-based tool is designed and developed so that users with different backgrounds can collaboratively populate the KB according to the proposed conceptual model. In addition, a semantic reasoning procedure is implemented to locate all suitable data candidates and thereby enhance the performance of the geospatial search engine.

To provide data consumers with the best resource, the search engine should be capable of automatically judging the similarities among spatial objects, as human beings do. Traditional statistical methods measure the similarity of objects by counting their co-occurrences or shared information. However, human recognition of similarity is sometimes too complex to be simulated by simple mathematical equations. Given this reality, this dissertation proposes a neural-network-based feature matching model that measures similarity automatically, drawing on the KB populated as described above.

Finally, this dissertation introduces two research projects, the USGS Arctic Spatial Data Infrastructure and the ESIP Semantic Web Testbed, to demonstrate how the proposed methodologies are applied in domain applications to solve real-world problems.



Semantic, Geospatial Web Service, Neural network, Crawler, Semantic similarity, Ontology