Multiple Kernel Learning for Gene Prioritization, Clustering, and Functional Enrichment Analysis

Date

2014-05

Authors

Millis, David Howard

Abstract

Gene prioritization is the process of ranking a list of candidate genes such that the genes that are most likely involved in a biological process of interest receive the highest rankings. In a supervised learning approach to gene prioritization, candidate genes are ranked in terms of their degree of similarity to genes that have already been shown to be involved in the process of interest. Gene prioritization thus can be cast as a classification task, in which a training set of genes and data associated with those genes is used to train a classifier to assign rankings to unknown genes, based on their degree of similarity to the training genes. This thesis describes the use of kernel methods, and particularly a method known as multiple kernel learning, for combining information from multiple data sources for purposes of gene prioritization. Multiple kernel learning facilitates the incorporation of heterogeneous data types into the assessment of similarity among genes. In addition, the rows of the kernel matrix can be repurposed as feature vectors. We apply clustering methods to these vectors to partition the gene list into related groups. We then perform functional enrichment analysis on the gene clusters to identify biological functions that are significantly represented in each gene cluster. We thus are able to use a single data structure, namely a kernel matrix representing similarities among genes based on multiple information sources, as the basis for three common types of bioinformatics analysis: gene prioritization, gene clustering, and functional annotation analysis of gene lists. This research contributes to the exploration of methods for extracting useful biological insights from the continually expanding knowledge base of biological data.
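The workflow described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation, and it rests on several assumptions: the toy data are hypothetical, the kernel weights are fixed by hand rather than learned (true multiple kernel learning optimizes them), and candidates are scored by mean similarity to the known genes rather than by a trained support vector machine. All function and variable names are illustrative only.

```python
import math

def linear_kernel(X):
    """Gram matrix of dot products between rows of X."""
    return [[sum(a * b for a, b in zip(x, y)) for y in X] for x in X]

def rbf_kernel(X, gamma=1.0):
    """Gram matrix of Gaussian (RBF) similarities between rows of X."""
    return [[math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
             for y in X] for x in X]

def normalize(K):
    """Cosine-normalize a kernel so every diagonal entry is 1."""
    n = len(K)
    return [[K[i][j] / math.sqrt(K[i][i] * K[j][j]) for j in range(n)]
            for i in range(n)]

def combine(kernels, weights):
    """Weighted sum of kernel matrices (weights fixed here, not learned)."""
    n = len(kernels[0])
    return [[sum(w * K[i][j] for w, K in zip(weights, kernels))
             for j in range(n)] for i in range(n)]

def prioritize(K, known, candidates):
    """Rank candidates by mean combined similarity to the known genes."""
    score = {c: sum(K[c][t] for t in known) / len(known) for c in candidates}
    return sorted(candidates, key=lambda c: score[c], reverse=True)

# Hypothetical toy data: 5 genes, each described by two data sources.
expression = [[1.0, 0.9], [0.9, 1.0], [1.0, 1.0], [0.1, 0.2], [0.2, 0.1]]
annotation = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0]]

# One kernel per data source, combined into a single similarity matrix.
K = combine([normalize(linear_kernel(expression)), rbf_kernel(annotation)],
            weights=[0.5, 0.5])

# Genes 0 and 1 play the role of known genes; genes 2-4 are candidates.
ranking = prioritize(K, known=[0, 1], candidates=[2, 3, 4])

# Each row of K can be reused as a feature vector for a clustering step.
feature_vectors = [list(row) for row in K]
```

Because each data source contributes its own kernel, heterogeneous data types only need a suitable similarity function to be included; the combined matrix `K` is then the single structure that feeds ranking and, through its rows, clustering.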

Keywords

Bioinformatics, Functional enrichment, Gene clustering, Gene prioritization, Multiple kernel learning, Support vector machines
