Chapter 6 Appendix

Appendix A: Snakemake workflow

The Snakefile, the contents of which are listed below, contain the code for the rules which are described in chapter 3.3 and which is also published on GitHub (sschmutz/thesis-workflow).

# All samples which should be analysed have to be listed here.
# Alternatively they could also be read from a file.
SAMPLES = [
    "1000576042-AR-DNA",
    "1000576042-AR-RNA",
    "1000580287-AR-DNA",
    "1000580287-AR-RNA"
    ]

# The THRESHOLD variable defines the threshold (in percent) which
# is applied to label the contigs using the rule "infer_class_labels"
THRESHOLD = 5


# a pseudo-rule which lists all files that should be created
rule all:
    input:
        quality_profiling = expand(
            "data/quality_metrics/{sample}_{type}.json",
            sample = SAMPLES,
            type = [
                "classified-reads",
                "unclassified-reads",
                "unclassified-in-contig-reads",
                "unclassified-not-in-contig-reads"
                ]
            ),
        assembly_stats_sample = expand(
            "data/metagenome_assembly/{sample}_final.contigs.lst",
            sample = SAMPLES
            ),
        undetermined_class_label = expand(
          "data/undetermined_class_label/{sample}_{threshold}.csv",
          sample = SAMPLES,
          threshold = THRESHOLD
          )


rule get_quality_filtered_sequences:
    input:
        raw_reads = \
            "data/sequencing_files/{sample}.fastq.gz",
        unclassified_reads = \
            "data/sequencing_files/{sample}_unclassified-reads.fastq.gz"
    output:
        intermediate_reads = \
            temp("data/sequencing_files/{sample}_intermediate.fastq"),
        good_reads = \
            temp("data/sequencing_files/{sample}_good.fastq"),
        bad_reads = \
            temp("data/sequencing_files/{sample}_bad.fastq"),
        good_reads_list = \
            temp("data/sequencing_files/{sample}_good.lst"),
        unclassified_reads_list = \
            temp("data/sequencing_files/{sample}_unclassified.lst"),
        good_classified_reads_list = \
            temp("data/sequencing_files/{sample}_good-classified.lst"),
        classified_reads = \
            "data/sequencing_files/{sample}_classified-reads.fastq.gz"
    run:
        # trim and filter for sequences > 75nt
        shell("seqtk trimfq {input.raw_reads} | "
               "seqtk seq -L 75 - > {output.intermediate_reads}")

        # filter low quality sequences and sequences with low entropy
        shell("prinseq-lite.pl "
              "-fastq data/sequencing_files/{wildcards.sample}_intermediate.fastq "
              "-lc_method entropy -lc_threshold 70 -min_qual_mean 20 "
              "-out_good data/sequencing_files/{wildcards.sample}_good "
              "-out_bad data/sequencing_files/{wildcards.sample}_bad")

        # get sequence ids from all good and unclssified reads
        shell('grep "@M0" data/sequencing_files/{wildcards.sample}_good.fastq | '
              'sort | sed "s/@//" > {output.good_reads_list}')
        shell('zgrep "@M0" {input.unclassified_reads} | sort | '
              'sed "s/@//" > {output.unclassified_reads_list}')

        # get only classified sequences (difference between filtered and undetermined)
        shell("comm -23 data/sequencing_files/{wildcards.sample}_good.lst "
              "data/sequencing_files/{wildcards.sample}_unclassified.lst "
              "> {output.good_classified_reads_list}")
        shell("seqkit grep -n -f "
              "data/sequencing_files/{wildcards.sample}_good-classified.lst "
              "data/sequencing_files/{wildcards.sample}_good.fastq | gzip "
              "> {output.classified_reads}")


rule metagenome_assembly:
    input:
        classified_reads = \
            "data/sequencing_files/{sample}_classified-reads.fastq.gz",
        unclassified_reads = \
            "data/sequencing_files/{sample}_unclassified-reads.fastq.gz"
    output:
        assembly_folder = temp(directory("data/metagenome_assembly/{sample}")),
        assembly = "data/metagenome_assembly/{sample}_final.contigs.fasta.gz",
        assembly_stats_sample = "data/metagenome_assembly/{sample}_final.contigs.lst"
    run:
        # run megahit with the default parameter, specify memory (-m) and
        # thread (-t) usage
        shell("megahit -r {input.classified_reads},{input.unclassified_reads} "
              "-m 0.5 -t 4 -o {output.assembly_folder}")

        # get the sequence headers of the final.contigs which contain
        # assembly statistics information (contig name, flag, multi and length)
        shell('grep ">" data/metagenome_assembly/{wildcards.sample}/final.contigs.fa | '
              'sed "s/>//" > {output.assembly_stats_sample}')

        # move, rename and compress the final.contigs
        shell("cp data/metagenome_assembly/{wildcards.sample}/final.contigs.fa "
              "data/metagenome_assembly/{wildcards.sample}_final.contigs.fasta")
        shell("gzip data/metagenome_assembly/{wildcards.sample}_final.contigs.fasta")


rule read_mapping:
    input:
        unclassified_reads = \
            "data/sequencing_files/{sample}_unclassified-reads.fastq.gz",
        classified_reads = \
            "data/sequencing_files/{sample}_classified-reads.fastq.gz",
        assembly_fasta = \
            "data/metagenome_assembly/{sample}_final.contigs.fasta.gz"
    output:
        "data/metagenome_assembly_read_mapping/{sample}_aln.tsv.gz"
    shell:
        "minimap2 -ax sr {input.assembly_fasta} "
        "<(cat {input.unclassified_reads} {input.classified_reads}) | "
        "samtools view | cut -f 1,3 | gzip > {output}"


rule get_virmet_labels:
    input:
        reference_human = \
            "data/virmet_dbs/GRCh38.fasta",
        cram_human = \
            "data/virmet_output/{sample}/"\
            "good_humanGRCh38.cram",
        reference_bacterial_1 = \
            "data/virmet_dbs/bact1.fasta",
        cram_bacterial_1 = \
            "data/virmet_output/{sample}/"\
            "good_humanGRCh38_bact1.cram",
        reference_bacterial_2 = \
            "data/virmet_dbs/bact2.fasta",
        cram_bacterial_2 = \
            "data/virmet_output/{sample}/"\
            "good_humanGRCh38_bact1_bact2.cram",
        reference_bacterial_3 = \
            "data/virmet_dbs/bact3.fasta",
        cram_bacterial_3 = \
            "data/virmet_output/{sample}/"\
            "good_humanGRCh38_bact1_bact2_bact3.cram",
        reference_bacterial_4 = \
            "data/virmet_dbs/bact4.fasta",
        cram_bacterial_4 = \
            "data/virmet_output/{sample}/"\
            "good_humanGRCh38_bact1_bact2_bact3_bact4.cram",
        reference_bacterial_5 = \
            "data/virmet_dbs/bact5.fasta",
        cram_bacterial_5 = \
            "data/virmet_output/{sample}/"\
            "good_humanGRCh38_bact1_bact2_bact3_bact4_bact5.cram",
        reference_fungal = \
            "data/virmet_dbs/fungi1.fasta",
        cram_fungal = \
            "data/virmet_output/{sample}/"\
            "good_humanGRCh38_bact1_bact2_bact3_bact4_bact5_fungi1.cram",
        fastq_viral = \
            "data/virmet_output/{sample}/"\
            "viral_reads.fastq.gz"
    output:
        list_human = "data/classification/{sample}_human.lst.gz",
        list_bacterial_1 = "data/classification/{sample}_bacterial_1.lst.gz",
        list_bacterial_2 = "data/classification/{sample}_bacterial_2.lst.gz",
        list_bacterial_3 = "data/classification/{sample}_bacterial_3.lst.gz",
        list_bacterial_4 = "data/classification/{sample}_bacterial_4.lst.gz",
        list_bacterial_5 = "data/classification/{sample}_bacterial_5.lst.gz",
        list_fungal = "data/classification/{sample}_fungal.lst.gz",
        list_viral = "data/classification/{sample}_viral.lst.gz"
    run:
        # get human sequence ids
        shell('samtools fastq --reference {input.reference_human} {input.cram_human} | '
              'grep "@M0" | sed "s/@//" | gzip > {output.list_human}')

        # get bacterial sequence ids
        shell('samtools fastq --reference {input.reference_bacterial_1} '
              '{input.cram_bacterial_1} | '
              'zgrep "@M0" | sed "s/@//" | gzip > {output.list_bacterial_1}')
        shell('samtools fastq --reference {input.reference_bacterial_2}  '
              '{input.cram_bacterial_2} | '
              'zgrep "@M0" | sed "s/@//" | gzip > {output.list_bacterial_2}')
        shell('samtools fastq --reference {input.reference_bacterial_3}  '
              '{input.cram_bacterial_3} | '
              'zgrep "@M0" | sed "s/@//" | gzip > {output.list_bacterial_3}')
        shell('samtools fastq --reference {input.reference_bacterial_4}  '
              '{input.cram_bacterial_4} | '
              'zgrep "@M0" | sed "s/@//" | gzip > {output.list_bacterial_4}')
        shell('samtools fastq --reference {input.reference_bacterial_5}  '
              '{input.cram_bacterial_5} | '
              'zgrep "@M0" | sed "s/@//" | gzip > {output.list_bacterial_5}')

        # get fungal sequence ids
        shell('samtools fastq --reference {input.reference_fungal}  '
              '{input.cram_fungal} | '
              'zgrep "@M0" | sed "s/@//" | gzip > {output.list_fungal}')

        # get viral sequence ids
        shell('zgrep "@M0" {input.fastq_viral} | sed "s/@//" |  '
              'gzip > {output.list_viral}')


rule infer_class_labels:
    input:
        metagenome_assembly_read_mapping = \
            "data/metagenome_assembly_read_mapping/{sample}_aln.tsv.gz",
        list_human = \
            "data/classification/{sample}_human.lst.gz",
        list_bacterial_1 = \
            "data/classification/{sample}_bacterial_1.lst.gz",
        list_bacterial_2 = \
            "data/classification/{sample}_bacterial_2.lst.gz",
        list_bacterial_3 = \
            "data/classification/{sample}_bacterial_3.lst.gz",
        list_bacterial_4 = \
            "data/classification/{sample}_bacterial_4.lst.gz",
        list_bacterial_5 = \
            "data/classification/{sample}_bacterial_5.lst.gz",
        list_fungal = \
            "data/classification/{sample}_fungal.lst.gz",
        list_viral = \
            "data/classification/{sample}_viral.lst.gz"
    params:
        # the threshold value is in percent
        threshold = "{threshold}"
    output:
        "data/undetermined_class_label/{sample}_{threshold}.csv",
        temp("data/classification/{sample}_unclassified-in-contig_{threshold}.lst"),
        temp("data/classification/{sample}_unclassified-not-in-contig_{threshold}.lst")
    script:
        "scripts/infer_class_labels.R"


rule get_quality_metrics:
    input:
        classified_reads = \
            "data/sequencing_files/{sample}_classified-reads.fastq.gz",
        unclassified_reads = \
            "data/sequencing_files/{sample}_unclassified-reads.fastq.gz",
        unclassified_in_contig_reads_list = expand(
            "data/classification/{{sample}}_unclassified-in-contig_{threshold}.lst",
            threshold = THRESHOLD
            ),
        unclassified_not_in_contig_reads_list = expand(
            "data/classification/{{sample}}_unclassified-not-in-contig_{threshold}.lst", 
            threshold = THRESHOLD
            )
    output:
        classified_reads_quality = \
            "data/quality_metrics/{sample}_classified-reads.json",
        unclassified_reads_quality = \
            "data/quality_metrics/{sample}_unclassified-reads.json",
        unclassified_in_contig_reads = \
            "data/sequencing_files/{sample}_unclassified-in-contig.fastq.gz",
        unclassified_in_contig_reads_quality = \
            "data/quality_metrics/{sample}_unclassified-in-contig-reads.json",
        unclassified_not_in_contig_reads = \
            "data/sequencing_files/{sample}_unclassified-not-in-contig.fastq.gz",
        unclassified_not_in_contig_reads_quality = \
            "data/quality_metrics/{sample}_unclassified-not-in-contig-reads.json"
    run:
        # get quality data of classified and unclassified reads
        shell("fastp -i {input.classified_reads} "
              "--overrepresentation_analysis "
              "--low_complexity_filter "
              "-j {output.classified_reads_quality} -h /dev/null")
        shell("fastp -i {input.unclassified_reads} "
              "--overrepresentation_analysis "
              "--low_complexity_filter "
              "-j {output.unclassified_reads_quality} -h /dev/null")

        # extract sequences listed in unclassified_in_contig and
        # unclassified_not_in_contig lists
        shell("seqkit grep -f {input.unclassified_in_contig_reads_list} "
              "{input.unclassified_reads} | "
              "gzip > {output.unclassified_in_contig_reads}")
        shell("seqkit grep -f {input.unclassified_not_in_contig_reads_list} "
              "{input.unclassified_reads} | "
              "gzip > {output.unclassified_not_in_contig_reads}")

        # get quality data of unclassified-in-contig and
        # unclassified-not-in-contig reads
        shell("fastp -i {output.unclassified_in_contig_reads} "
              "--overrepresentation_analysis "
              "--low_complexity_filter "
              "-j {output.unclassified_in_contig_reads_quality} -h /dev/null")
        shell("fastp -i {output.unclassified_not_in_contig_reads} "
              "--overrepresentation_analysis "
              "--low_complexity_filter "
              "-j {output.unclassified_not_in_contig_reads_quality} -h /dev/null")

The contents of the script infer_class_labels.R which is called from the Snakefile is listed below.

library(readr)
library(tidyr)
library(dplyr)
library(stringr)
library(purrr)
library(here)


# function definitions ----------------------------------------------------

read_virmet_classes <- function(sample_name_short, virmet_class){
  # this function reads VirMet classification infos (read ids) of one sample and class
  # enables to use map in the next function to do this for all classes of one sample
  
  class_label_path <-
    here("data",
         "classification",
         paste0(sample_name_short, "_", virmet_class, ".lst.gz"))
  
  # only returns first part of class label ("bacterial_1" gives "bacterial")
  virmet_class_short <-
    str_split(virmet_class, pattern = "_")[[1]][1]
  
  virmet_class_labels <-
    read_csv(class_label_path, col_names = c("read_name")) %>%
    mutate(virmet_class = virmet_class_short)
  
  return(virmet_class_labels)
}


write_undetermined_class_label <- function(sample_name_short, threshold){
  
  alignment_path <-
    here("data",
         "metagenome_assembly_read_mapping",
         paste0(sample_name_short, "_aln.tsv.gz"))
  
  # caution, some reads might have mapped to more than one contig
  # and are listed multiple times
  alignment <-
    read_delim(alignment_path,
               delim = "\t",
               col_names = c("read_name", "contig_name"), na = "*") %>%
    # remove multiple rows when a single reads was mapped multiple times
    # to the same contig
    unique() %>%
    separate(contig_name, into = c(NA, "contig_number"), convert = TRUE)
  

  virmet_classes <-
    c("human",
      "bacterial_1", "bacterial_2", "bacterial_3", "bacterial_4", "bacterial_5",
      "fungal",
      "viral")
  
  virmet_class_labels <-
    map_dfr(virmet_classes, ~read_virmet_classes(sample_name_short, .x)) %>%
    # viral read names contain an additional part, remove that
    separate(read_name, into = "read_name", sep = " ", extra = "drop")
  
  alignment_undetermined_reads <-
    anti_join(alignment, virmet_class_labels, by = "read_name")
  
  combined <-
    full_join(alignment, virmet_class_labels, by = "read_name") %>%
    mutate(virmet_class = replace_na(virmet_class, "undetermined"))
  
  contig_labels <-
    combined %>%
    filter(!is.na(contig_number)) %>%
    group_by(contig_number, virmet_class) %>%
    count() %>%
    group_by(contig_number) %>%
    mutate(n_total = sum(n)) %>%
    ungroup() %>%
    mutate(n_fraction = n/n_total) %>%
    # select top label (undetermined excluded) which is above threshold
    # to label the contigs
    filter(virmet_class != "undetermined",
           n_fraction > (threshold/100)) %>%
    group_by(contig_number) %>%
    slice_max(order_by = n_fraction, n = 1, with_ties = FALSE) %>%
    ungroup()
  
  
  unclassified_in_contig <-
    alignment_undetermined_reads %>%
    filter(!is.na(contig_number)) %>%
    left_join(contig_labels, by = "contig_number")
  
  unclassified_in_contig_list <-
    alignment_undetermined_reads %>%
    filter(!is.na(contig_number)) %>%
    pull(read_name) %>%
    unique()
  
  
  n_unclassified_not_in_contig <-
    alignment_undetermined_reads %>%
    filter(is.na(contig_number)) %>%
    nrow()
  
  unclassified_not_in_contig_list <-
    alignment_undetermined_reads %>%
    filter(is.na(contig_number)) %>%
    pull(read_name) %>%
    unique()
  
  classes <-
    tibble(class =
             c("human",
               "bacterial",
               "fungal",
               "viral",
               "unclassified_in_contig",
               "unclassified_not_in_contig"))
    
  class_order <- classes$class
  
  n_summary <-
    unclassified_in_contig %>%
    count(virmet_class) %>%
    rename(class = virmet_class) %>%
    mutate(class = replace_na(class, "unclassified_in_contig")) %>%
    add_row(class = "unclassified_not_in_contig", n = n_unclassified_not_in_contig) %>%
    full_join(classes, by = "class") %>%
    mutate(n = replace_na(n, 0),
           class = factor(class, levels = class_order),
           sample = sample_name_short,
           treshold_percent = threshold) %>%
    arrange(class)

  
  write_csv(n_summary,
            file =
              here("data",
                   "undetermined_class_label",
                   paste0(sample_name_short, "_", threshold, ".csv")))
  
  write(unclassified_in_contig_list,
        file =
          here("data",
               "classification",
               paste0(sample_name_short,
                      "_unclassified-in-contig_",
                      threshold,
                      ".lst")))
  
  write(unclassified_not_in_contig_list,
        file =
          here("data",
               "classification",
               paste0(sample_name_short,
                      "_unclassified-not-in-contig_",
                      threshold,
                      ".lst")))
  
}


# call functions ----------------------------------------------------------

list_human <- snakemake@input[["list_human"]]

# extract the sample name by looking at everything which is in between
# "data/classification/" and "_human.lst.gz"
sample_name_short <-
  str_extract(list_human, pattern = "(?<=data/classification/)(.*)(?=_human.lst.gz)")

threshold <- as.numeric(snakemake@params[["threshold"]])

write_undetermined_class_label(sample_name_short, threshold)

References

1. Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J. & Segata, N. Shotgun metagenomics, from sampling to analysis. Nature Biotechnology 35, 833–844 (2017).

2. Chiu, C. Y. & Miller, S. A. Clinical metagenomics. Nature Reviews Genetics 20, 341–355 (2019).

3. Gardy, J. L. & Loman, N. J. Towards a genomics-informed, real-time, global pathogen surveillance system. Nature Reviews Genetics 19, 9–20 (2018).

4. Hadfield, J. et al. Nextstrain: Real-time tracking of pathogen evolution. Bioinformatics 34, 4121–4123 (2018).

5. Kufner, V. et al. Two Years of Viral Metagenomics in a Tertiary Diagnostics Unit: Evaluation of the First 105 Cases. Genes 10, 661 (2019).

6. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).

7. Dutilh, B. E. et al. A highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes. Nature Communications 5, (2014).

8. Abbas, A. A. et al. Redondoviridae, a family of small, circular DNA viruses of the human oro-respiratory tract that are associated with periodontitis and critical illness. Cell host & microbe 25, 719–729.e4 (2019).

9. Wu, F. et al. A new coronavirus associated with human respiratory disease in China. Nature 579, 265–269 (2020).

10. Sheridan, C. Coronavirus and the race to distribute reliable diagnostics. Nature Biotechnology 38, 382–384 (2020).

11. Ma, X. et al. Analysis of error profiles in deep next-generation sequencing data. Genome Biology 20, 50 (2019).

12. Bioinformatics, e. Trimming adapter sequences - is it necessary?

13. Quality Scores for Next-Generation Sequencing. 2.

14. Ewing, B. & Green, P. Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities. Genome Research 8, 186–194 (1998).

15. Chen, S., Zhou, Y., Chen, Y. & Gu, J. Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).

16. RefSeq: NCBI Reference Sequence Database.

17. Gleizes, A. et al. Virosaurus A Reference to Explore and Capture Virus Genetic Diversity. Viruses 12, (2020).

18. Biek, R., Pybus, O. G., Lloyd-Smith, J. O. & Didelot, X. Measurably evolving pathogens in the genomic era. Trends in ecology & evolution 30, 306–313 (2015).

19. Peck, K. M. & Lauring, A. S. Complexities of Viral Mutation Rates. Journal of Virology 92, (2018).

20. Pérez-Cobas, A. E., Gomez-Valero, L. & Buchrieser, C. Metagenomic approaches in microbial ecology: An update on whole-genome and marker gene sequencing analyses. Microbial Genomics 6, (2020).

21. Ayling, M., Clark, M. D. & Leggett, R. M. New approaches for metagenome assembly with short reads. Briefings in Bioinformatics 21, 584–594 (2020).

22. Seemann, T. Prokka: Rapid prokaryotic genome annotation. Bioinformatics (Oxford, England) 30, 2068–2069 (2014).

23. Seemann, T. Tseemann/abricate. (2021).

24. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics (Oxford, England) 31, 1674–1676 (2015).

25. Mölder, F. et al. Sustainable data analysis with Snakemake. F1000Research 10, 33 (2021).

26. Hsi-Yang Fritz, M., Leinonen, R., Cochrane, G. & Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research 21, 734–740 (2011).

27. Danecek, P. et al. Twelve years of SAMtools and BCFtools. GigaScience 10, giab008 (2021).

28. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

29. Schmieder, R. & Edwards, R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 27, 863–864 (2011).

30. Rossum, G. van. Python tutorial. (1995).

31. R Core Team. R: A language and environment for statistical computing. (R Foundation for Statistical Computing, 2020).

32. Robinson, D., Hayes, A. & Couch, S. Broom: Convert statistical objects into tidy tibbles. (2021).

33. Fox, J. & Weisberg, S. An R companion to applied regression. (Sage, 2019).

34. Kuhn, M., Jackson, S. & Cimentada, J. Corrr: Correlations in r. (2020).

35. Schloerke, B. et al. GGally: Extension to ’ggplot2’. (2021).

36. Clarke, E. & Sherrill-Mix, S. Ggbeeswarm: Categorical scatter (violin point) plots. (2017).

37. Slowikowski, K. Ggrepel: Automatically position non-overlapping text labels with ’ggplot2’. (2021).

38. Wilke, C. O. Ggtext: Improved text rendering support for ’ggplot2’. (2020).

39. Hester, J. Glue: Interpreted string literals. (2020).

40. Müller, K. Here: A simpler way to find your files. (2020).

41. Zhu, H. KableExtra: Construct complex table with ’kable’ and pipe syntax. (2021).

42. Xie, Y. Knitr: A general-purpose package for dynamic report generation in r. (2021).

43. Edwards, S. M. Lemon: Freshing up your ’ggplot2’ plots. (2020).

44. Grolemund, G. & Wickham, H. Dates and times made easy with lubridate. Journal of Statistical Software 40, 1–25 (2011).

45. Pedersen, T. L. Patchwork: The composer of plots. (2020).

46. Wickham, H. & Seidel, D. Scales: Scale functions for visualization. (2020).

47. Lüdecke, D., Ben-Shachar, M. S., Waggoner, P. & Makowski, D. See: Visualisation toolbox for ’easystats’ and extra geoms, themes and color palettes for ’ggplot2’. CRAN (2020) doi:10.5281/zenodo.3952153.

48. Stanley, J. & Arendt, C. Tidyjson: Tidy complex ’json’. (2020).

49. Kuhn, M. & Wickham, H. Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. (2020).

50. Wickham, H. et al. Welcome to the tidyverse. Journal of Open Source Software 4, 1686 (2019).

51. Torres-Manzanera, E. Xkcd: Plotting ggplot2 graphics in an xkcd style. (2018).

52. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLOS ONE 11, e0163962 (2016).

53. Li, H. Lh3/seqtk. (2021).

54. Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological) 57, 289–300 (1995).

55. Kang, D. D. et al. MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, (2019).

56. Ebinger, A., Fischer, S. & Höper, D. A theoretical and generalized approach for the assessment of the sample-specific limit of detection for clinical metagenomics. Computational and Structural Biotechnology Journal 19, 732–742 (2021).

57. Nurk, S. et al. The complete sequence of a human genome. bioRxiv 2021.05.26.445798 (2021) doi:10.1101/2021.05.26.445798.