Query Consolidation: Interpreting Queries Sent to Independent Heterogenous Databases




Acar, Aybar C.

Journal Title

Journal ISSN

Volume Title



This dissertation introduces the problem of query consolidation, which seeks to interpret a set of disparate queries submitted to independent databases with a single “global” query. The problem has multiple applications, from improving virtual database design, to aiding users in information retrieval, to protecting against inference of sensitive data from a seemingly innocuous set of apparently unrelated queries. The problem exhibits attractive duality with the much-researched problem of query decomposition, which has been addressed intensively in the context of multidatabase environments: How to decompose a query submitted to a virtual database into a set of local queries that are evaluated in individual databases. The new problem is set in the architecture of a canonical multidatabase system, using it in the “reverse” direction. The reversal is built on the assumption of conjunctive queries and source descriptions. A rational and efficient query decomposition strategy is also assumed, and this decomposition is reversed to arrive at the original query by analyzing the decomposed components The process incorporates several steps where a number of solutions must be considered, due to the fact that query decomposition is not injective. Initially, the problem of finding the most likely join plan between component queries is investigated. This is accomplished by leveraging the referential constraints available in the underlying multidatabase, or by approximating these constraints from the data when not available. This approximation is done using the information theoretic concept of conditional entropy. Furthermore, the most likely join plans are enhanced by the expansion of their projections and adding precision to their selection constraints by estimating the selection constraints that would be applied to these consolidations offline. Additionally, the extraction of a set of queries related to the same retrieval task from an ongoing sequence of incoming queries is investigated. A conditional random field model is trained to segment and label incoming query sequences. Finally, the candidate consolidations are re-encapsulated with a genetic programming approach to find simpler intentional descriptions that are extensionally equivalent to discover the original intent of the query. The dissertation explains and discusses all of the above operations and validates the methods developed with experimentation on synthesized and real-world data. The results are highly encouraging and verify that the accuracy, time performance, and scalability of the methods would make it possible to exploit query consolidation in production environments.



Databases, Information Integration, Query Processing, Machine learning