Abstract:
Background: The first step in understanding ecological community diversity and
dynamics is quantifying community membership. An increasingly common method
for doing so is through metagenomics. Because of the rapidly increasing popularity
of this approach, a large number of computational tools and pipelines are available
for analysing metagenomic data. However, the majority of these tools have been
designed and benchmarked using highly accurate short-read data (i.e. Illumina), with
few studies benchmarking classification accuracy for long, error-prone reads (PacBio
or Oxford Nanopore). In addition, few tools have been benchmarked for non-microbial communities.
Results: Here we compare simulated long reads from Oxford Nanopore and Pacific
Biosciences (PacBio) with high-accuracy Illumina read sets to systematically investigate
the effects of sequence length and taxon type on classification accuracy for
metagenomic data from both microbial and non-microbial communities. We show that,
in general, classification accuracy is far lower for non-microbial communities, even
at low taxonomic resolution (e.g. family rather than genus). We then show that, for two
popular taxonomic classifiers, long reads can significantly increase classification
accuracy, and that this effect is most pronounced for non-microbial communities.
Conclusions: This work provides insight into the expected accuracy of metagenomic
analyses for different taxonomic groups, and establishes the point at which read
length becomes more important than error rate for assigning the correct taxon.