Abstract

Metagenomic sequencing comprises molecular biology techniques in which nucleic acids are extracted from a sample and prepared for high-throughput sequencing. During high-throughput sequencing, the nucleotide order of a large number of nucleic acid fragments is determined. Since nucleic acids contain the construction plan of an organism, their sequence is specific to that organism. Comparing the output of metagenomic sequencing, the sequencing reads, to a labeled reference sequence database allows the organisms in the sample to be identified.
This method can be applied as an untargeted diagnostic test to find the infection-causing organism without requiring a prior suspicion of a specific pathogen. However, the diagnosis is typically restricted to known and characterized organisms whose genome sequence is present in the reference database. As a result, after the sequencing reads from a sample have been classified by comparison to the reference database, some reads always remain for which no suitable match can be found. Characterizing the source of these unclassified sequencing reads can be crucial for understanding the method and its weaknesses.
In this work, the unclassified sequencing reads of 283 throat swab samples that underwent metagenomic sequencing are systematically investigated. The results suggest that unclassified reads can partly be explained by lower sequencing read quality.
Uncharacterized organisms, or organisms with large genetic diversity (if not included in the reference database), offer a further explanation for the remaining portion of unclassified reads. A method to detect outlier samples with a high number of unclassified reads is presented. This allows quick filtering for interesting samples in which an uncharacterized organism might be present.
To account for genetic diversity, a semi-supervised approach is introduced and applied to the dataset to partly overcome this issue and to increase the fraction of sequencing reads with a class label.
The experience gained from this work will help to understand and deal with unclassified reads in future studies, including those on samples from other body sites.