An Adenocarcinoma Case Study of the BaFL Protocol: Biological Probe Filtering for Robust Microarray Analysis




Thompson, Kevin




Microarrays are high-throughput data measurement technologies; those that assay gene expression levels allow investigators to simultaneously estimate the levels of thousands of cellular transcripts present in a sample at the time of collection. Many sources of variation have plagued microarray analysis, leading to apparent inconsistencies between experimental results derived from independent platforms. A rigorous, robust set of methods that addresses all of the currently known sources of variability, and applies the corresponding filters consistently across large data sets, has been implemented in the Biologically applied Filter Level (BaFL) protocol. This protocol eliminates all probes for which the underlying sequence information is missing, since without it the probe characteristics, including the identity of the measured transcript region, are impossible to derive. The remaining probes are processed through biophysical software to determine their Gibbs free energy, as a measure of stability in solution. This measure is used to eliminate overly stable probes, which would be less accessible for measuring the desired transcript region. The filtering process also enforces a range of acceptable signal intensity measurements, determined by scanner characteristics. Measurements outside this linear range violate the linear relationship between transcript concentration and signal intensity. Probes covering single nucleotide polymorphisms are identified and removed. The Ensembl database is queried to identify probes that measure a single, specific gene transcript region; all other probes are excluded. The final step enforces a rule that a minimum of four probes is retained per ProbeSet, so that any given statistical estimator of concentration has an adequate basis. Samples are subject to many technical steps, so tests for outliers are implemented that compare each sample's representative probe intensities and probe numbers against the population mean.
Samples that fall more than ±2 standard deviations from the average probe number or average probe intensity are removed. ProbeSet constituents at this stage may not be identical across all samples, with differences arising from the linear range filter step. By performing an intersection operation on the remaining probes across all samples, still enforcing a minimum of four probes per ProbeSet, a final, common ProbeSet dataset is derived, which is used as the basis of all further comparisons and analyses. The suggested data models demonstrate improved performance across three classification algorithms, and remarkable latent structure can be seen across the data models. When Bonferroni correction is applied and the intersecting genes are identified, a final candidate gene list of 30 ProbeSets results. By including on/off genes in the list, an additional ProbeSet is identified. These 31 candidate genes demonstrate notable connectivity in their GO and KEGG associations. Literature review of the genes establishes that these associations arise from properties specific to angiogenesis and tumorigenesis. A multiclass dataset of non-small cell lung cancer samples was constructed, and information gain was calculated from the k-means clustering efficiency. A candidate list of 18 genes is shown to possess an information gain greater than or equal to 0.8. The literature review of these 18 genes provides evidence that abnormal cytokinesis may underlie tumorigenesis for both cancer sub-types. The squamous cell carcinomas, in particular, appear to be suffering from the production of radical oxidative species. Currently, most microarray analyses implement one of a small number of published probe cleansing algorithms. Occasional efforts have been made to accommodate one of the confounding factors of the probe-transcript interaction, but no method is as inclusive as that presented in this work.
Further, no work exists that demonstrates how removing any one factor improves subsequent performance with the existing algorithms. Great effort has been taken here to show, using two independent lung adenocarcinoma datasets, that analysis of the resulting datasets leads to greatly improved consistency in inter-experimental comparisons relative to the pre-eminent probe cleansing methodologies, RMA and dChip.
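The sample-level steps described above (removal of samples beyond ±2 standard deviations, then intersection of surviving probes with a four-probe minimum per ProbeSet) can be illustrated with a minimal Python sketch. This is not the BaFL implementation: the nested-dictionary layout, the function names, the use of the population standard deviation, and the use of the mean intensity as the "representative" intensity are all assumptions made for illustration only.

```python
from statistics import mean, pstdev

MIN_PROBES = 4  # minimum probes per ProbeSet, as stated in the abstract


def remove_outlier_samples(samples, n_sd=2.0):
    """Drop samples whose probe count or mean probe intensity lies more than
    n_sd standard deviations from the population mean.

    samples: {sample_id: {probeset_id: {probe_id: intensity}}}
    (hypothetical layout, chosen only for this sketch)
    """
    counts = {s: sum(len(probes) for probes in ps.values())
              for s, ps in samples.items()}
    intensities = {s: mean(v for probes in ps.values() for v in probes.values())
                   for s, ps in samples.items()}

    def within(values):
        # Population SD is an assumption; the source does not say which is used.
        mu, sd = mean(values.values()), pstdev(values.values())
        return {s for s, v in values.items() if abs(v - mu) <= n_sd * sd}

    keep = within(counts) & within(intensities)
    return {s: ps for s, ps in samples.items() if s in keep}


def common_probesets(samples):
    """Intersect probes across all samples and keep only ProbeSets that
    still have at least MIN_PROBES probes in common."""
    sample_list = list(samples.values())
    if not sample_list:
        return {}
    result = {}
    shared_sets = set.intersection(*(set(ps) for ps in sample_list))
    for ps_id in shared_sets:
        shared = set.intersection(*(set(ps[ps_id]) for ps in sample_list))
        if len(shared) >= MIN_PROBES:
            result[ps_id] = sorted(shared)
    return result
```

Applied in sequence, `common_probesets(remove_outlier_samples(samples))` yields one probe list per retained ProbeSet that is identical for every remaining sample, which is the property the abstract relies on for all downstream comparisons.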



Microarray, Lung cancer, Data Cleansing, Cancer classification, Machine learning, Data Mining