Abstract:
Forensic scientists have used short tandem repeats (STRs) to analyse DNA for more than twenty years. This analysis has relied on electrophoresis, most recently capillary electrophoresis, to determine the lengths of the repeat sequences detected at the forensic STR markers tested. Forensic scientists are now researching the use of massively parallel sequencing (MPS) technology for forensic testing. This research has generated, and continues to generate, data on the sequence of each STR marker, on additional SNP markers for determining physical characteristics, identity and ancestry, and on whole mitochondrial genomes. These advances offer forensic scientists the ability to provide more information about forensic samples in the future. Before this can happen, suitable bioinformatics pipelines need to be developed and the sequencing data analysed so that they can be fully understood and interpreted. In this study, a bioinformatics pipeline was assembled for the analysis of STR data generated on massively parallel sequencers. The pipeline is composed of readily available tools and new tools developed for this study. The pipeline was verified by comparing the STR profiles it generated with the results from capillary electrophoresis. The pipeline was then used to explore whether STR data generated on the Illumina MiSeq and Ion Torrent Personal Genome Machine (Ion PGM) sequencers produce different results. The results showed that DNA profiles generated from data sequenced on the Ion PGM contained many more incorrectly called loci than those generated from MiSeq data, because of high-coverage sequencing errors. These errors produce indel artefact sequences, which confound allele calling. Furthermore, specific locations within the STRs were observed to be more prone to sequencing errors.
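
To illustrate the verification step, the following is a minimal sketch of a locus-by-locus concordance check between a pipeline-derived STR profile and a capillary electrophoresis (CE) profile. The function name `compare_profiles`, the dictionary layout and the example genotypes are assumptions made for illustration only; they are not taken from the pipeline or the data in this study.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Profile = Dict[str, Iterable[str]]  # locus name -> allele designations


def normalise(alleles: Iterable[str]) -> FrozenSet[str]:
    """Treat a genotype as an unordered set, so ("15", "15") equals ("15",)."""
    return frozenset(str(a) for a in alleles)


def compare_profiles(
    pipeline: Profile, ce: Profile
) -> Tuple[List[str], List[str], List[str]]:
    """Compare a pipeline profile with a CE reference profile.

    Returns (concordant, discordant, missing) lists of locus names,
    where "missing" means the locus is in the CE profile but was not
    called by the pipeline.
    """
    concordant, discordant, missing = [], [], []
    for locus in sorted(ce):
        if locus not in pipeline:
            missing.append(locus)
        elif normalise(pipeline[locus]) == normalise(ce[locus]):
            concordant.append(locus)
        else:
            discordant.append(locus)
    return concordant, discordant, missing


# Hypothetical genotypes, invented for this example.
pipeline_profile = {
    "TH01": ["6", "9.3"],
    "D3S1358": ["15", "15"],
    "vWA": ["17", "18"],
}
ce_profile = {
    "TH01": ["9.3", "6"],
    "D3S1358": ["15"],
    "vWA": ["17", "19"],
    "FGA": ["21", "24"],
}

concordant, discordant, missing = compare_profiles(pipeline_profile, ce_profile)
print("concordant:", concordant)
print("discordant:", discordant)
print("missing:", missing)
```

Genotypes are compared as sets so that allele order and the way a homozygote is reported do not affect the result. A length-based comparison of this kind cannot distinguish alleles that share a length but differ in sequence, which is the additional information MPS provides.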