DataMapX: A tool for cross-mapping entities and attributes between bioinformatics databases

Date

2008-06-30T14:54:29Z

Authors

Kanchinadam, Krishna M.

Abstract

Bioinformatics databases are both syntactically and semantically heterogeneous; the inconsistency of their models reflects the individual interpretations that scientists place on the underlying, highly networked relationships. Data presentation often interferes with the investigator's ability to identify elements that partially or wholly map to the same attribute or entity. For example, a plethora of databases offer public interfaces through which researchers make available subsets of data about genes, with characteristics such as the source genome, locus position, allele variant positions, and expression levels, but the names, identifiers, units, and chromosomal locations often differ. Because neither the presentation formats nor the nomenclatures are standardized, merging the data can be very complicated, often requiring multiple reformatting steps. At the same time, new experiments often demand a recombination of data from many sources, requiring that the investigator resolve data type and naming inconsistencies, and often change relationships as well. Although some databases have open source schemas and data, reformatting the data remains a large task. Presented here is a tool that facilitates cross-mapping data when the goal is to populate a second database with a specified subset of information from a source database. The tool is generic, so we provide use cases both to demonstrate the need and to serve as tutorials that guide users through the application. We focus on combining data from arrays that are used to measure either gene expression or genotype information. Each array type has a different interface for reporting the location and composition of probes (sometimes conforming to a subset of community standards, such as the MIAME standard for expression arrays).
Gene location, sequence variants, and probe locations with respect to those attributes are cross-mapped, giving insight into how probes used to assess gene expression overlap with known SNP genotype information and SNP chip probe information. The overall goal of this project was to provide a data integration tool by which a researcher can: (1) access a variety of databases, (2) provide the correct nomenclature mapping, and (3) incorporate the information into a common resource that allows data from different experiments and experimental platforms to be correctly combined and statistical tests then applied.
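
The kind of cross-mapping described above can be illustrated with a minimal sketch. It is not DataMapX's implementation: the `Interval` class, the `normalize_chrom` and `snps_in_probe` helpers, the sample identifiers, and the assumption of 1-based inclusive coordinates are all hypothetical, chosen only to show how a nomenclature mismatch (here, `chr7` versus `7`) might be resolved before checking whether an expression probe overlaps a known SNP position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A probe location; coordinates are assumed 1-based and inclusive."""
    chrom: str
    start: int
    end: int


def normalize_chrom(name: str) -> str:
    """Map differing chromosome spellings ('chr7', 'Chr7', '7') to one form ('7')."""
    name = name.strip()
    if name.lower().startswith("chr"):
        name = name[3:]
    return name.upper()


def snps_in_probe(probe: Interval, snps: dict) -> list:
    """Return IDs of SNPs whose position falls inside the probe interval.

    `snps` maps a SNP ID to a (chromosome, position) pair taken from a
    second, differently formatted source.
    """
    return [
        snp_id
        for snp_id, (chrom, pos) in snps.items()
        if normalize_chrom(chrom) == probe.chrom and probe.start <= pos <= probe.end
    ]


# Hypothetical records from two sources that spell chromosomes differently.
probe = Interval(normalize_chrom("chr7"), 1000, 1024)
snps = {
    "snp_a": ("7", 1010),     # same chromosome, inside the probe
    "snp_b": ("chr7", 2000),  # same chromosome, outside the probe
    "snp_c": ("chrX", 1010),  # different chromosome
}
hits = snps_in_probe(probe, snps)
print(hits)
```

In a real integration, the same normalization step would also have to cover identifiers, units, and genome build, since the abstract notes that all of these can differ between sources.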

Keywords

Bioinformatics, Data, Query, Map, Export, CSV
