Abstract:
Microarrays are high-throughput data measurement technologies; those that assay gene
expression levels allow investigators to simultaneously estimate the levels of thousands of
cellular transcripts present in a sample at the time of collection. Many sources of variation have
plagued microarray analysis, leading to apparent inconsistencies between experimental results
derived from independent platforms. A rigorous, robust set of methods for identifying all of the
currently known sources of variability, and for filtering on them consistently across large data sets, has
been implemented in the Biologically applied Filter Level (BaFL) protocol. This protocol
eliminates all probes for which the underlying sequence information is missing, because without
it the probe characteristics, including the identity of the measured transcript region, are
impossible to derive. The remaining probes are processed through biophysical software to
determine their Gibbs free energy as a measure of solution stability. This measure is used to
eliminate overly stable probes, which would be less accessible for measuring the desired
transcript region. The filtering process also enforces a range of acceptable signal intensity
measurements, which is set by scanner characteristics. Measurements outside the linear range
violate the linear relationship between transcript concentration and signal intensity.
Probes that cover single nucleotide polymorphisms are identified and removed. The
Ensembl database is queried to identify probes that measure a single, specific gene transcript
region; all other probes are excluded. The final step is to enforce a rule that a minimum of
four probes is retained per ProbeSet, so that any given statistical estimator of concentration has an adequate
basis. Samples are subject to many technical steps, so tests for outliers are implemented that
include comparisons of representative probe intensities and probe numbers against the
population mean. Samples that fall more than 2 standard deviations from the population mean in probe number or
probe intensity are removed. ProbeSet constituents at this stage may not be identical across all
samples, with differences arising from the linear range filter step. By performing an intersection
operation of the remaining probes across all samples, still enforcing a minimum of four probes
per ProbeSet, a final, common ProbeSet dataset is derived, which is used as the basis of all further
comparisons and analyses.
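
The later stages of the filtering described above (linear range filter, sample outlier test, and cross-sample intersection) can be illustrated with a minimal Python sketch. This is not the BaFL implementation: the data layout (a dict of sample name to ProbeSet id to probe id to intensity), the function names, and the intensity bounds passed by the caller are all assumptions made for illustration. Only the minimum of four probes and the 2 standard deviation cutoff come from the text. The sequence, free energy, SNP, and Ensembl filters depend on external resources and are omitted.

```python
from statistics import mean, stdev

MIN_PROBES = 4  # minimum probes per ProbeSet, as stated in the abstract


def filter_linear_range(sample, low, high):
    """Keep probes whose intensity lies in [low, high] for one sample.

    `sample` maps ProbeSet id -> {probe id -> intensity}. The bounds are
    supplied by the caller; the abstract does not give numeric values.
    """
    return {
        probeset: {p: v for p, v in probes.items() if low <= v <= high}
        for probeset, probes in sample.items()
    }


def drop_outlier_samples(samples, n_sd=2.0):
    """Drop samples whose probe count or mean intensity lies more than
    n_sd sample standard deviations from the population mean."""
    if len(samples) < 2:
        return dict(samples)  # a standard deviation needs at least two samples

    def summarize(sample):
        values = [v for probes in sample.values() for v in probes.values()]
        return len(values), (mean(values) if values else 0.0)

    summary = {name: summarize(s) for name, s in samples.items()}
    counts = [c for c, _ in summary.values()]
    intensities = [i for _, i in summary.values()]

    def within(x, xs):
        sd = stdev(xs)
        return sd == 0 or abs(x - mean(xs)) <= n_sd * sd

    return {
        name: s for name, s in samples.items()
        if within(summary[name][0], counts) and within(summary[name][1], intensities)
    }


def common_probesets(samples):
    """Intersect surviving probes across all samples and keep only
    ProbeSets that still have at least MIN_PROBES shared probes."""
    names = list(samples)
    if not names:
        return {}
    result = {}
    for probeset, probes in samples[names[0]].items():
        shared = set(probes)
        for name in names[1:]:
            shared &= set(samples[name].get(probeset, {}))
        if len(shared) >= MIN_PROBES:
            result[probeset] = sorted(shared)
    return result
```

In this sketch a ProbeSet that loses a different probe in each sample can fall below four shared probes after the intersection even though every individual sample passed, which is why the minimum is enforced again at the final step.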
The suggested data models demonstrated improved performance across three classification
algorithms, and remarkable latent structure can be seen across the data models. When Bonferroni
correction is applied and the intersecting genes are identified, a final candidate gene list of 30
ProbeSets results. By including on/off genes in the list, an additional ProbeSet is identified.
These 31 candidate genes demonstrate notable connectivity in their GO and KEGG associations.
Literature review of the genes establishes that these associations arise from properties specific to
angiogenesis and tumorigenesis. A multiclass dataset of non-small cell lung cancer samples was
constructed, and information gain was calculated from the k-means clustering efficiency. A candidate
list of 18 genes is shown to possess an information gain greater than or equal to 0.8. The
literature review of these 18 genes provides evidence that abnormal cytokinesis may underlie
tumorigenesis for both cancer sub-types. The squamous cell carcinomas, in particular, appear to
be suffering from the production of radical oxidative species.
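
The two selection criteria mentioned above can be sketched as follows. The Bonferroni threshold (alpha divided by the number of tests) is standard. The information gain shown here, the entropy of the class labels minus the size-weighted entropy within each cluster, is one common definition and is an assumption: the abstract does not state the exact formula or scale used. The function names, the default alpha, and the input layout are likewise hypothetical.

```python
from collections import Counter
from math import log2


def bonferroni_significant(pvalues, alpha=0.05):
    """Return the ids whose p-value passes the Bonferroni threshold alpha / m,
    where m is the number of tests. `pvalues` maps id -> p-value."""
    m = len(pvalues)
    if m == 0:
        return set()
    threshold = alpha / m
    return {gene for gene, p in pvalues.items() if p <= threshold}


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(classes, clusters):
    """Assumed definition: class entropy minus the size-weighted entropy
    of the class labels inside each cluster (e.g. k-means assignments)."""
    n = len(classes)
    by_cluster = {}
    for cls, clu in zip(classes, clusters):
        by_cluster.setdefault(clu, []).append(cls)
    conditional = sum(len(g) / n * entropy(g) for g in by_cluster.values())
    return entropy(classes) - conditional
```

Under these assumptions, the genes common to two data models would be obtained by intersecting the sets returned by `bonferroni_significant` for each model, and a gene whose clustering separates the classes perfectly would reach the full class entropy as its information gain.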
Currently, most microarray analyses implement one of a small number of published probe
cleansing algorithms. Occasional efforts to accommodate one of the confounding factors of the
probe-transcript interaction have been made, but no method is as inclusive as that presented in
this work. Further, no work exists that demonstrates the effect of removing a given factor
on subsequent performance with the existing algorithms. Great effort has been taken here to
show that analysis of the resulting datasets leads to greatly improved consistency in inter-experimental
comparisons, using two independent lung adenocarcinoma datasets, in comparison
to the pre-eminent probe cleansing methodologies, RMA and dChip.