Abstract:
Metabarcoding has provided ecologists with a method for studying various environments using marker genes. A pivotal step in a metabarcoding study is the analysis of data produced by Massively Parallel High Throughput Sequencing (MPHTS) technologies. It is established that the denoising and clustering algorithms used in different data analysis pipelines produce different diversity estimates for the same data. This study investigated the variation in the diversity estimates produced by different pipelines, which were combinations of four denoising algorithms and four clustering techniques, on four different datasets. The results show that the variation in the diversity estimates depends mainly on the denoising algorithm, the complexity of the data, the specific gene, and the number of sequences in the dataset. The study also found a significant similarity between the results produced by the non-flowgram clustering techniques, and that the UPARSE clustering technique performs consistently in combination with different denoising algorithms. The UPARSE pipelines also produced the fewest OTUs of all pipelines across all datasets. In addition, none of the pipelines considered in this study produced diversity indexes that were all similar to the true diversity, which was estimated from Sanger-sequenced data. The study provides recommendations for choosing a pipeline for a particular study and concludes that metabarcoding studies should use at least two different pipelines for the analysis of data.