Comparative Analysis of Denoising and Clustering Methods in Microbiome Analysis



Taxonomic profiling of microbial communities is one of the most crucial, and at the same time most challenging, steps in microbiome data analysis. The main challenge is to achieve the best possible biological precision while completing the community profiling within a reasonable amount of time and computational cost. Traditionally, this profiling was accomplished by clustering sequence reads into Operational Taxonomic Units (OTUs) using UCLUST at a specific percent sequence similarity threshold (typically 97%), which may result in discarding some correct biological sequences that are treated as singletons. Resolution therefore needs to be raised to a 100% similarity threshold, which requires error-correction methods to remove next-generation sequencing errors. Recently, some novel bioinformatics methods, namely DADA2, Deblur, and UNOISE, have been developed that focus on sequence denoising (error-correction) strategies and attempt to identify all correct biological sequences in the reads. Although their outputs are named differently, as Amplicon Sequence Variants (ASVs), sub-OTUs (sOTUs), and zero-radius OTUs (zOTUs) by DADA2, Deblur, and UNOISE, respectively, they all essentially mean the same thing, i.e. the unique sequence variants (SVs), which each analytical pipeline generates in different numbers. These SVs can be used directly as representative sequences instead of those obtained by clustering sequence reads into OTUs. Since the aforementioned methods have been released only recently, just a few third-party comparisons between them have been done. Therefore, there is a need for a thorough comparative analysis of these new denoising methods, both against each other and against some common clustering methods, as well as of their respective SVs and OTUs.
In my research, I have focused mainly on three sequence analysis programs: QIIME1, QIIME2, and USEARCH. Their previous versions (QIIME version 1.9 and USEARCH v9) provided some native clustering methods, and their current versions (QIIME2 and USEARCH v10 onwards) offer additional choices of denoising methods. In addition, I have investigated some popular online sequence analysis tools, such as BLAST, RDP-11, and the state-of-the-art One Codex, and I have compared their performance with that of offline sequence analysis, i.e. the denoising and clustering methods. I have also proposed a novel approach that uses RDP classifier based bootstrap analysis to improve the confidence of taxonomic profiling.
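
The difference between threshold-based OTU clustering and exact sequence variants described above can be illustrated with a toy sketch. It is only an illustration under strong assumptions: reads are equal-length and pre-aligned, identity is a simple per-position match fraction, and the greedy centroid procedure is a simplified stand-in, not the actual UCLUST algorithm. The dereplication step only collapses identical reads; it does not perform the error modelling that DADA2, Deblur, or UNOISE apply. All function names here (`identity`, `dereplicate`, `greedy_cluster`, `mutate`) are hypothetical.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("toy identity assumes equal-length, pre-aligned reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dereplicate(reads):
    """Collapse reads into unique sequences with abundances (100% identity)."""
    return Counter(reads)


def greedy_cluster(reads, threshold=0.97):
    """Toy greedy centroid clustering (assumption: not the real UCLUST).

    Unique sequences are visited from most to least abundant; each one joins
    the first centroid it matches at >= threshold, else it seeds a new cluster.
    """
    clusters = {}  # centroid sequence -> total abundance
    for seq, count in dereplicate(reads).most_common():
        for centroid in clusters:
            if identity(seq, centroid) >= threshold:
                clusters[centroid] += count
                break
        else:
            clusters[seq] = count
    return clusters


def mutate(seq: str, positions) -> str:
    """Substitute a different base at each given position (test data helper)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    chars = list(seq)
    for p in positions:
        chars[p] = swap[chars[p]]
    return "".join(chars)


# Hypothetical 100 bp reads: one abundant sequence, a 99%-identical variant,
# and a 95%-identical variant.
base = "ACGT" * 25
close_variant = mutate(base, [10])
distant_variant = mutate(base, [0, 20, 40, 60, 80])
reads = [base] * 10 + [close_variant] * 4 + [distant_variant] * 3

unique_variants = dereplicate(reads)
otus = greedy_cluster(reads, threshold=0.97)

print("unique sequences:", len(unique_variants))
print("OTUs at 97%:", len(otus))
```

In this toy data set the 99%-identical variant is absorbed into the abundant sequence's OTU at the 97% threshold, whereas it is kept as a separate sequence when reads are compared at 100% identity. Whether such a variant is real biology or a sequencing error is exactly what the denoising methods attempt to decide.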