Actionable Diagnostics & Clinical Reasoning: Signal Analysis

Interpreting metagenomic outputs for a neuroleptospirosis case

Author

Minh-Quan Ton-Ngoc, Hao Chung The

Published

May 15, 2026

Transition to analysis: The sections below provide the full reproducible debrief behind the simulation. They show how the local files were inspected, why only one specimen survives strict filtering, and why high unclassified signal forces a conservative clinical interpretation.

Code
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(scales)
library(patchwork)
library(phyloseq)
library(vegan)
library(knitr)

data_dir <- file.path("..", "materials", "clinical-reasoning")
readbased_dir <- file.path(data_dir, "Readbased_Analysis")

theme_set(
  theme_bw(base_size = 12) +
    theme(
      panel.grid.minor = element_blank(),
      plot.title.position = "plot",
      legend.position = "right"
    )
)

fig_dir <- "figures"
dir.create(fig_dir, showWarnings = FALSE)

save_fig <- function(plot, filename, width = 9, height = 5.5, dpi = 300, save_pdf = TRUE) {
  png_path <- file.path(fig_dir, paste0(filename, ".png"))
  ggsave(png_path, plot = plot, width = width, height = height, dpi = dpi, bg = "white")
  if (isTRUE(save_pdf)) {
    pdf_path <- file.path(fig_dir, paste0(filename, ".pdf"))
    ggsave(pdf_path, plot = plot, width = width, height = height, bg = "white")
  }
}

sample_role_levels <- c(
  "Untreated CSF",
  "DNase-treated CSF",
  "Patient serum",
  "Negative-control serum"
)

sample_role_palette <- c(
  "Untreated CSF" = "#1b9e77",
  "DNase-treated CSF" = "#7570b3",
  "Patient serum" = "#d95f02",
  "Negative-control serum" = "#666666"
)

signal_palette <- c(
  "Unclassified" = "#bdbdbd",
  "Leptospira santarosai" = "#1b9e77",
  "Eukaryotic/host-associated" = "#80b1d3",
  "Other classified" = "#fdb462"
)

rank_cols <- c("domain", "phylum", "class", "order", "family", "genus", "species")

read_csv_char <- function(path) {
  read_csv(path, show_col_types = FALSE, col_types = cols(.default = col_character()))
}

read_tsv_char <- function(path) {
  read_tsv(path, show_col_types = FALSE, col_types = cols(.default = col_character()))
}

extract_last_taxon <- function(x) {
  ifelse(grepl(";", x), sub("^.*;", "", x), x)
}

clean_taxon <- function(x) {
  x <- gsub("^[a-z]__", "", x)
  x[x == ""] <- NA_character_
  x
}

parse_lineage <- function(df) {
  df %>%
    separate(
      lineage,
      into = rank_cols,
      sep = ";",
      fill = "right",
      extra = "drop",
      remove = FALSE
    ) %>%
    mutate(across(all_of(rank_cols), clean_taxon))
}

1 Clinical context and adaptation

The original NEJM paper described a 14-year-old boy with severe combined immunodeficiency who had a prolonged, difficult-to-diagnose meningoencephalitic illness. Conventional testing remained unrevealing, and unbiased sequencing of CSF eventually identified Leptospira, leading to targeted therapy and clinical recovery.

For this workshop, that story has been adapted into a Vietnam-facing hospital teaching case. The educational aim is not only to say “the pathogen was found,” but to help healthcare workers reason through a more realistic diagnostic pathway:

  • broad CNS infection differential at admission,
  • staged release of clinical clues,
  • sequencing as a tool for narrowing uncertainty,
  • careful interpretation of which samples are and are not trustworthy,
  • and final integration of sequencing with clinical management.

This notebook now implements that staged pedagogy directly as a self-guided exercise, then continues into the reproducible analysis.

2 Available input files

Code
path_inventory <- tibble::tribble(
  ~object, ~path, ~type,
  "Sample metadata", file.path(data_dir, "nejm_sampleinformation.csv"), "file",
  "Workshop taxonomy export (with unclassified)", file.path(data_dir, "nejm_with_unclassified.csv"), "file",
  "Workshop taxonomy export (without unclassified)", file.path(data_dir, "nejm_without_unclassified.csv"), "file",
  "Workshop phyloseq export", file.path(data_dir, "nejm.phyloseq"), "dir",
  "Workshop AMR folder", file.path(data_dir, "nejm.amr"), "dir",
  "Previous Day 4 lesson notebook", "vietnamese-healthy-gut-microbiome.qmd", "file",
  "Underlying read-based outputs", readbased_dir, "dir"
) %>%
  mutate(
    exists = if_else(type == "dir", dir.exists(path), file.exists(path))
  )

kable(path_inventory)
object path type exists
Sample metadata ../materials/clinical-reasoning/nejm_sampleinformation.csv file TRUE
Workshop taxonomy export (with unclassified) ../materials/clinical-reasoning/nejm_with_unclassified.csv file TRUE
Workshop taxonomy export (without unclassified) ../materials/clinical-reasoning/nejm_without_unclassified.csv file TRUE
Workshop phyloseq export ../materials/clinical-reasoning/nejm.phyloseq dir TRUE
Workshop AMR folder ../materials/clinical-reasoning/nejm.amr dir TRUE
Previous Day 4 lesson notebook vietnamese-healthy-gut-microbiome.qmd file TRUE
Underlying read-based outputs ../materials/clinical-reasoning/Readbased_Analysis dir TRUE

All required context files were available when this notebook was assembled, and all main analysis inputs are present locally.

3 Load metadata and inspect the proposed sample set

Code
metadata <- read_csv(file.path(data_dir, "nejm_sampleinformation.csv"), show_col_types = FALSE) %>%
  mutate(
    sample_name = run_accession,
    sample_role = case_when(
      grepl("untreated CSF", experiment_title, ignore.case = TRUE) ~ "Untreated CSF",
      grepl("DNase-treated CSF", experiment_title, ignore.case = TRUE) ~ "DNase-treated CSF",
      grepl("control", experiment_title, ignore.case = TRUE) ~ "Negative-control serum",
      grepl("serum", experiment_title, ignore.case = TRUE) ~ "Patient serum",
      TRUE ~ sample_name
    ),
    sample_role = factor(sample_role, levels = sample_role_levels),
    patient_status = if_else(grepl("control", experiment_title, ignore.case = TRUE), "Control", "Patient"),
    approx_total_reads = if_else(library_layout == "PAIRED", 2L * read_count, read_count),
    clinical_use = case_when(
      sample_role == "Untreated CSF" ~ "Primary diagnostic CNS specimen",
      sample_role == "DNase-treated CSF" ~ "Comparator CNS specimen after DNase treatment",
      sample_role == "Patient serum" ~ "Systemic comparator specimen",
      sample_role == "Negative-control serum" ~ "Negative control for background/context",
      TRUE ~ "Unspecified"
    )
  ) %>%
  arrange(sample_role)

sample_labels <- setNames(
  paste0(as.character(metadata$sample_role), "\n", metadata$sample_name),
  metadata$sample_name
)

metadata %>%
  select(sample_name, sample_role, patient_status, read_count, approx_total_reads, clinical_use, experiment_title)
# A tibble: 4 × 7
  sample_name sample_role           patient_status read_count approx_total_reads
  <chr>       <fct>                 <chr>               <dbl>              <dbl>
1 SRR1145846  Untreated CSF         Patient            128119             256238
2 SRR1145844  DNase-treated CSF     Patient           1306348            2612696
3 SRR1145845  Patient serum         Patient           2813757            5627514
4 SRR1145847  Negative-control ser… Control             44597              89194
# ℹ 2 more variables: clinical_use <chr>, experiment_title <chr>

Initial metadata interpretation

  • The local re-analysis clearly targets four runs:
    • untreated CSF,
    • DNase-treated CSF,
    • patient serum,
    • and a negative-control serum.
  • This matches the logic of the original paper, where CSF was the key specimen and serum/control provided context.
  • The local read_count values are notably smaller than the multi-million read counts reported in the paper, especially for the untreated CSF and control serum. Whether these reflect SRA spots, processed subsets, or downsampled inputs, they are an important clue that the recreated workflow is not operating on exactly the same data volume as the paper.

4 Part A: quick interpretation from direct final results

This first interpretation layer uses the direct final results from:

../materials/clinical-reasoning/Readbased_Analysis/Sourmash-YACHT/final_results/final_results

This is the easier, more obvious layer of interpretation. Before running the code, predict what you expect to see:

  • Will all four samples produce a confident pathogen-level result?
  • Which specimen should be most diagnostically useful for CNS infection?
  • Would serum be expected to show the same pathogen signal?

For a CNS infection, CSF is usually more informative than serum. Also remember that a negative control is not expected to produce a disease-causing pathogen signal.

4.1 Inspect the direct final result files

Code
direct_final_dir <- file.path(
  readbased_dir, "Sourmash-YACHT",
  "final_results", "final_results"
)

direct_final_files <- tibble(
  file = list.files(direct_final_dir, full.names = FALSE),
  path = file.path(direct_final_dir, file),
  size_bytes = file.info(path)$size
)

kable(direct_final_files)
file path size_bytes
merged_sourmash_yacht.csv ../materials/clinical-reasoning/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv 1987
merged_sourmash_yacht_removeuncl_mincoverage0.05.csv ../materials/clinical-reasoning/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht_removeuncl_mincoverage0.05.csv 334

This code follows the same practical style as the previous Vietnamese healthy gut microbiome notebook: first check what files exist, then load the smallest interpretable summary table, then only move to deeper analysis if the first result is incomplete or confusing.

Code
direct_with <- read_csv_char(file.path(direct_final_dir, "merged_sourmash_yacht.csv")) %>%
  mutate(
    sample_name = WGS_ID,
    relative_abundance = as.numeric(relative_abundance),
    f_orig_query = as.numeric(f_orig_query),
    total_weighted_hashes = as.numeric(total_weighted_hashes),
    organism_name = coalesce(organism_name, "unclassified")
  ) %>%
  left_join(metadata %>% select(sample_name, sample_role), by = "sample_name")

direct_remove_unclassified <- read_csv_char(file.path(direct_final_dir, "merged_sourmash_yacht_removeuncl_mincoverage0.05.csv")) %>%
  mutate(
    sample_name = WGS_ID,
    organism_name = coalesce(organism_name, "No retained organism")
  ) %>%
  left_join(metadata %>% select(sample_name, sample_role), by = "sample_name")

kable(
  direct_with %>%
    select(sample_name, sample_role, organism_name, lineage, relative_abundance, f_orig_query, total_weighted_hashes, file_name),
  digits = 4
)
sample_name sample_role organism_name lineage relative_abundance f_orig_query total_weighted_hashes file_name
SRR1145844 DNase-treated CSF GCA_001321855.1 SAR11 cluster bacterium PRT-SC02 NA NA NA NA min_coverage0.01
SRR1145845 Patient serum GCA_001321855.1 SAR11 cluster bacterium PRT-SC02 NA NA NA NA min_coverage0.01
SRR1145847 Negative-control serum GCA_001321855.1 SAR11 cluster bacterium PRT-SC02 NA NA NA NA min_coverage0.01
SRR1145846 Untreated CSF GCF_000313175.2 Leptospira santarosai serovar Shermani str. LT 821 d__Bacteria;p__Spirochaetota;c__Leptospiria;o__Leptospirales;f__Leptospiraceae;g__Leptospira;s__Leptospira santarosai 0.1167 0.3787 1765 min_coverage0.05 | min_coverage0.01
SRR1145846 Untreated CSF unclassified NA 0.8833 NA NA NA

4.1.1 Learner checkpoint

Before opening the interpretation, answer:

  1. How many samples show a meaningful retained organism-level result?
  2. Which organism is detected?
  3. Does this match the pathogen described in the original paper?

The direct final result is very simple:

  • Only one sample yields a meaningful pathogen-level result.
  • The informative sample is the untreated CSF (SRR1145846).
  • The retained organism is Leptospira santarosai.
  • This matches the final pathogen interpretation of the original paper.

This is the clearest direct result layer. It gives learners the main diagnostic message quickly, before they explore why the other samples are weak or non-informative.

Code
kable(
  direct_remove_unclassified %>%
    select(sample_name, sample_role, organism_name, file_name),
  digits = 4
)
sample_name sample_role organism_name file_name
SRR1145846 Untreated CSF GCF_000313175.2 Leptospira santarosai serovar Shermani str. LT 821 min_coverage0.05 | min_coverage0.01

4.2 Brief AMR inspection at the direct-result stage

AMR analysis is available from the processed rgi-bwt outputs, but the key question is whether it changes this clinical diagnosis.

Code
amr_files <- list.files(file.path(data_dir, "nejm.amr"), pattern = "gene_mapping_data[.]txt$", full.names = TRUE)

read_amr_brief <- function(path) {
  read_tsv_char(path) %>%
    mutate(
      sample_name = sub("[.]gene_mapping_data[.]txt$", "", basename(path)),
      `Completely Mapped Reads` = as.numeric(`Completely Mapped Reads`),
      `Average Percent Coverage` = as.numeric(`Average Percent Coverage`)
    )
}

amr_brief <- bind_rows(lapply(amr_files, read_amr_brief)) %>%
  left_join(metadata %>% select(sample_name, sample_role), by = "sample_name")

amr_brief_summary <- amr_brief %>%
  group_by(sample_name, sample_role) %>%
  summarise(
    n_amr_hits = n(),
    total_completely_mapped_reads = sum(`Completely Mapped Reads`, na.rm = TRUE),
    max_average_percent_coverage = max(`Average Percent Coverage`, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(sample_role)

kable(amr_brief_summary, digits = 2)
sample_name sample_role n_amr_hits total_completely_mapped_reads max_average_percent_coverage
SRR1145846 Untreated CSF 1 2 23.51
SRR1145844 DNase-treated CSF 3 312 43.29
SRR1145845 Patient serum 2 32 2.79
SRR1145847 Negative-control serum 3 4 16.84

The AMR output does not materially change the diagnosis in this case.

The AMR calls are sparse, low-coverage, and not central to recognizing neuroleptospirosis. They are useful as a demonstration that clinical metagenomic pipelines can return AMR-related outputs, but here they are mainly illustrative rather than clinically decisive.

5 Part B: deeper interpretation using processed analysis tables

The quick result above is enough to identify the most likely pathogen. The deeper analysis below asks harder questions:

  • Why does only one sample survive the strict retained-hit layer?
  • Why are unclassified reads so prominent?
  • Why might the recreated workflow look different from the original paper?
  • When should a sample be interpreted as non-informative rather than forced into a diagnosis?

Before opening interpretations in this section, try to predict what the table or figure should show.

The important comparison is not only “which taxon is present?” It is also “how much of each sample is unresolved, and does the taxon survive conservative filtering?”

5.1 Inspect the supplied workshop exports

Code
strict_with <- read_csv_char(file.path(data_dir, "nejm_with_unclassified.csv")) %>%
  mutate(
    sample_name = as.character(sample_name),
    unweighted_fraction = as.numeric(unweighted_fraction),
    f_weighted_at_rank = as.numeric(f_weighted_at_rank),
    bp_match_at_rank = as.numeric(bp_match_at_rank),
    total_weighted_hashes = as.numeric(total_weighted_hashes),
    source = "supplied_with_unclassified"
  )

strict_without <- read_csv_char(file.path(data_dir, "nejm_without_unclassified.csv")) %>%
  mutate(
    sample_name = as.character(sample_name),
    unweighted_fraction = as.numeric(unweighted_fraction),
    f_weighted_at_rank = as.numeric(f_weighted_at_rank),
    bp_match_at_rank = as.numeric(bp_match_at_rank),
    total_weighted_hashes = as.numeric(total_weighted_hashes),
    source = "supplied_without_unclassified"
  )

phy_sample_data <- read_csv_char(file.path(data_dir, "nejm.phyloseq", "sample_data.csv"))
phy_otu_export <- read_csv_char(file.path(data_dir, "nejm.phyloseq", "otu_table.csv"))
phy_tax_export <- read_csv_char(file.path(data_dir, "nejm.phyloseq", "tax_table.csv"))

export_inventory <- tibble::tribble(
  ~source, ~n_rows, ~n_unique_samples, ~sample_ids,
  "Metadata", nrow(metadata), n_distinct(metadata$sample_name), paste(metadata$sample_name, collapse = ", "),
  "Supplied with_unclassified", nrow(strict_with), n_distinct(strict_with$sample_name), paste(unique(strict_with$sample_name), collapse = ", "),
  "Supplied without_unclassified", nrow(strict_without), n_distinct(strict_without$sample_name), paste(unique(strict_without$sample_name), collapse = ", "),
  "Supplied phyloseq sample_data", nrow(phy_sample_data), n_distinct(phy_sample_data$sample_name), paste(unique(phy_sample_data$sample_name), collapse = ", ")
)

kable(export_inventory)
source n_rows n_unique_samples sample_ids
Metadata 4 4 SRR1145846, SRR1145844, SRR1145845, SRR1145847
Supplied with_unclassified 2 1 SRR1145846
Supplied without_unclassified 1 1 SRR1145846
Supplied phyloseq sample_data 1 1 SRR1145846

This is the first major data-quality observation:

  • The metadata and AMR folder describe four runs.
  • The supplied taxonomy exports and the supplied phyloseq directory retain only one sample: SRR1145846.
  • That means the workshop export in nejm_analysis/ is a strict post-filter subset, not the whole four-sample diagnostic panel.
Code
supplied_phyloseq_status <- tibble(
  file = c("sample_data.csv", "otu_table.csv", "tax_table.csv"),
  rows = c(nrow(phy_sample_data), nrow(phy_otu_export), nrow(phy_tax_export)),
  cols = c(ncol(phy_sample_data), ncol(phy_otu_export), ncol(phy_tax_export))
)

kable(supplied_phyloseq_status)
file rows cols
sample_data.csv 1 2
otu_table.csv 1 2
tax_table.csv 1 8
Code
strict_with %>%
  select(sample_name, lineage, match_name, unweighted_fraction, f_weighted_at_rank, bp_match_at_rank, total_weighted_hashes)
# A tibble: 2 × 7
  sample_name lineage          match_name unweighted_fraction f_weighted_at_rank
  <chr>       <chr>            <chr>                    <dbl>              <dbl>
1 SRR1145846  d__Bacteria;p__… GCF_00031…               0.379              0.117
2 SRR1145846  unclassified     unclassif…               0.621              0.883
# ℹ 2 more variables: bp_match_at_rank <dbl>, total_weighted_hashes <dbl>

What the supplied strict export tells us

  • Only the untreated CSF sample survives the strict merged export.
  • In that export, the only retained organism-level call is Leptospira santarosai.
  • Even in that sample, the supplied table still shows a large unclassified component, which immediately tells us not to over-simplify the interpretation.
  • Because the supplied export collapses the dataset to one sample, it is not sufficient on its own for a four-sample diagnostic teaching notebook.

6 Reconstruct a four-sample exploratory taxonomy table from the underlying read-based outputs

To preserve the full diagnostic set, we reconstruct a species-level table from the underlying per-sample taxonomic summaries in:

../materials/clinical-reasoning/Readbased_Analysis/Sourmash-YACHT/fastmultigather/single_sketches/taxmetagenome/

This is not identical to the supplied strict YACHT export. It is a broader exploratory view that helps explain why only one sample survived the stricter final filter.

Code
raw_tax_dir <- file.path(
  readbased_dir, "Sourmash-YACHT",
  "fastmultigather", "single_sketches", "taxmetagenome"
)

raw_tax_files <- sort(list.files(
  raw_tax_dir,
  pattern = "_51[.]gather[.]gtdb[+]euks[.]k51[.]sc10k[.]thresh10k[.]summarized[.]csv$",
  full.names = TRUE
))

raw_tax <- bind_rows(lapply(raw_tax_files, read_csv_char)) %>%
  mutate(
    across(c(fraction, f_weighted_at_rank, bp_match_at_rank, query_ani_at_rank, total_weighted_hashes), as.numeric)
  )

raw_species_with <- raw_tax %>%
  filter(rank == "species") %>%
  transmute(
    sample_name = query_name,
    lineage,
    query_md5,
    query_filename,
    match_name = extract_last_taxon(lineage),
    unweighted_fraction = fraction,
    f_weighted_at_rank,
    bp_match_at_rank,
    total_weighted_hashes,
    query_ani_at_rank
  ) %>%
  parse_lineage() %>%
  left_join(metadata %>% select(sample_name, sample_role, patient_status, approx_total_reads, clinical_use), by = "sample_name") %>%
  mutate(
    species_label = case_when(
      lineage == "unclassified" ~ "unclassified",
      !is.na(species) ~ species,
      !is.na(genus) ~ paste(genus, "unresolved species"),
      TRUE ~ extract_last_taxon(lineage)
    ),
    sample_role = factor(sample_role, levels = sample_role_levels)
  )

raw_species_without <- raw_species_with %>%
  filter(lineage != "unclassified") %>%
  group_by(sample_name) %>%
  mutate(
    f_weighted_classified = f_weighted_at_rank / sum(f_weighted_at_rank, na.rm = TRUE),
    unweighted_fraction_classified = unweighted_fraction / sum(unweighted_fraction, na.rm = TRUE)
  ) %>%
  ungroup()

raw_tax_inventory <- tibble::tribble(
  ~source, ~n_rows, ~n_unique_samples, ~sample_ids,
  "Reconstructed exploratory species table", nrow(raw_species_with), n_distinct(raw_species_with$sample_name), paste(sort(unique(raw_species_with$sample_name)), collapse = ", "),
  "Classified-only exploratory species table", nrow(raw_species_without), n_distinct(raw_species_without$sample_name), paste(sort(unique(raw_species_without$sample_name)), collapse = ", ")
)

kable(raw_tax_inventory)
source n_rows n_unique_samples sample_ids
Reconstructed exploratory species table 30 4 SRR1145844, SRR1145845, SRR1145846, SRR1145847
Classified-only exploratory species table 26 4 SRR1145844, SRR1145845, SRR1145846, SRR1145847

The reconstructed exploratory table restores the expected four-sample panel. This is the table used below for the whole-case interpretation, while the supplied strict export is still used as evidence that only one sample passed the stricter retained-hit filter.

6.1 Harmonize sample identifiers across metadata, taxonomy, AMR, and strict exports

Code
amr_files <- list.files(file.path(data_dir, "nejm.amr"), pattern = "gene_mapping_data[.]txt$", full.names = TRUE)
amr_sample_ids <- sub("[.]gene_mapping_data[.]txt$", "", basename(amr_files))

harmonization_table <- tibble::tribble(
  ~source, ~n_unique_samples, ~sample_ids,
  "Metadata", n_distinct(metadata$sample_name), paste(sort(unique(metadata$sample_name)), collapse = ", "),
  "Supplied strict taxonomy export", n_distinct(strict_with$sample_name), paste(sort(unique(strict_with$sample_name)), collapse = ", "),
  "Reconstructed exploratory species table", n_distinct(raw_species_with$sample_name), paste(sort(unique(raw_species_with$sample_name)), collapse = ", "),
  "Supplied phyloseq export", n_distinct(phy_sample_data$sample_name), paste(sort(unique(phy_sample_data$sample_name)), collapse = ", "),
  "AMR files", length(unique(amr_sample_ids)), paste(sort(unique(amr_sample_ids)), collapse = ", ")
)

kable(harmonization_table)
source n_unique_samples sample_ids
Metadata 4 SRR1145844, SRR1145845, SRR1145846, SRR1145847
Supplied strict taxonomy export 1 SRR1145846
Reconstructed exploratory species table 4 SRR1145844, SRR1145845, SRR1145846, SRR1145847
Supplied phyloseq export 1 SRR1145846
AMR files 4 SRR1145844, SRR1145845, SRR1145846, SRR1145847

Identifier harmonization summary

  • Sample identifiers are consistent across metadata and AMR outputs.
  • The mismatch is not an ID problem. It is a retention/filtering problem in the strict taxonomy export.
  • The reconstructed exploratory table is therefore needed to analyze the four-sample diagnostic set end to end.

7 Inspect the YACHT threshold behavior

The underlying YACHT outputs help explain why the merged strict export contains only one sample.

Code
yacht_dir <- file.path(
  readbased_dir, "Sourmash-YACHT",
  "yacht_results", "temp_yacht_results", "csv"
)

read_yacht_file <- function(path) {
  dat <- read_csv_char(path)
  sample_name <- sub("_min_coverage.*$", "", basename(path))
  threshold <- sub("^.*_min_coverage", "", sub("[.]csv$", "", basename(path)))
  if (nrow(dat) == 0) {
    tibble(
      sample_name = sample_name,
      threshold = threshold,
      organism_name = NA_character_,
      in_sample_est = NA,
      p_vals = NA_real_,
      actual_confidence_with_coverage = NA_real_,
      alt_confidence_mut_rate_with_coverage = NA_real_
    )
  } else {
    dat %>%
      transmute(
        sample_name = WGS_ID,
        threshold = threshold,
        organism_name,
        in_sample_est = as.logical(in_sample_est),
        p_vals = as.numeric(p_vals),
        actual_confidence_with_coverage = as.numeric(actual_confidence_with_coverage),
        alt_confidence_mut_rate_with_coverage = as.numeric(alt_confidence_mut_rate_with_coverage)
      )
  }
}

yacht_thresholds <- bind_rows(lapply(
  list.files(yacht_dir, pattern = "min_coverage(0[.]01|0[.]05)[.]csv$", full.names = TRUE),
  read_yacht_file
)) %>%
  mutate(
    threshold = factor(threshold, levels = c("0.01", "0.05")),
    retained_hit = if_else(is.na(organism_name), "No retained hit", organism_name)
  ) %>%
  left_join(metadata %>% select(sample_name, sample_role), by = "sample_name") %>%
  arrange(sample_role, threshold)

kable(
  yacht_thresholds %>%
    select(sample_name, sample_role, threshold, retained_hit, actual_confidence_with_coverage, alt_confidence_mut_rate_with_coverage),
  digits = 3
)
sample_name sample_role threshold retained_hit actual_confidence_with_coverage alt_confidence_mut_rate_with_coverage
SRR1145846 Untreated CSF 0.01 GCF_000313175.2 Leptospira santarosai serovar Shermani str. LT 821 0.978 0.020
SRR1145846 Untreated CSF 0.05 GCF_000313175.2 Leptospira santarosai serovar Shermani str. LT 821 0.988 0.011
SRR1145844 DNase-treated CSF 0.01 GCA_001321855.1 SAR11 cluster bacterium PRT-SC02 0.958 0.055
SRR1145844 DNase-treated CSF 0.05 No retained hit NA NA
SRR1145845 Patient serum 0.01 GCA_001321855.1 SAR11 cluster bacterium PRT-SC02 0.958 0.055
SRR1145845 Patient serum 0.05 No retained hit NA NA
SRR1145847 Negative-control serum 0.01 GCA_001321855.1 SAR11 cluster bacterium PRT-SC02 0.958 0.055
SRR1145847 Negative-control serum 0.05 No retained hit NA NA

Threshold interpretation

  • At the looser min_coverage = 0.01 setting, three non-untreated-CSF samples retain a weak hit to SAR11 cluster bacterium PRT-SC02, which is not clinically coherent for this case.
  • At the stricter min_coverage = 0.05 setting, only SRR1145846 retains a YACHT hit, and that hit is Leptospira santarosai.
  • This is the key reason the workshop export in nejm_analysis/ collapses to one sample.

A large number of nucleotide alignments is not the same thing as a true pathogen signal.

For a clinically persuasive metagenomic result, we need more than “some reads aligned somewhere.” We want a signal that is:

  • enriched in the right specimen,
  • concentrated in a biologically plausible organism,
  • absent or much weaker in serum/control samples,
  • robust enough to survive conservative thresholds,
  • and coherent with the patient’s syndrome.

In the original paper, the untreated CSF had a dominant Leptospira signal among bacterial reads, while serum and control did not. In contrast, many other bacterial assignments were distributed across background-like genera and appeared in comparator or control libraries. Modern stricter workflows can reasonably remove those weak, distributed calls while retaining the untreated-CSF Leptospira signal.

DNase treatment was used to enrich for viral sequencing, not to optimize bacterial species calling. It can reduce free host DNA and shift the library composition, so the relative amount of “microbial-looking” material may increase.

However, the original supplementary material also notes that DNase pretreatment can enrich conserved bacterial 16S rRNA material, which makes accurate species-level bacterial identification harder. Therefore, a DNase-treated library can have more bacterial-looking alignments but less reliable pathogen-level specificity.

For this case, the practical interpretation is:

  • untreated CSF = the strongest bacterial pathogen signal,
  • DNase-treated CSF/serum = more prone to non-specific or hard-to-classify signal,
  • serum/control = useful comparators but not diagnostic for the CNS pathogen in this recreated workflow.

8 Taxonomic result inspection

8.1 Unclassified signal is the central interpretive issue

Before running the next chunk, predict which samples will be most difficult to interpret.

Ask yourself:

  • Which specimen is most likely to contain low microbial biomass?
  • Which specimen might be dominated by host/background material?
  • Which specimen should be most clinically meaningful if this is a CNS infection?

Do not only look for the organism name. First look at how much of each sample remains unresolved. A sample can contain a taxon label but still be too incomplete or unstable for a reliable conclusion.

Code
unclassified_summary <- raw_species_with %>%
  filter(lineage == "unclassified") %>%
  select(
    sample_name, sample_role, patient_status,
    unweighted_fraction, f_weighted_at_rank,
    bp_match_at_rank, total_weighted_hashes
  ) %>%
  arrange(sample_role)

kable(unclassified_summary, digits = 3)
sample_name sample_role patient_status unweighted_fraction f_weighted_at_rank bp_match_at_rank total_weighted_hashes
SRR1145846 Untreated CSF Patient 0.473 0.925 260000 469
SRR1145844 DNase-treated CSF Patient 0.981 0.996 3540000 77552
SRR1145845 Patient serum Patient 0.984 1.000 3620000 130988
SRR1145847 Negative-control serum Control 0.500 0.210 120000 186

The unclassified burden is the single most important taxonomic warning sign in this recreated dataset:

  • SRR1145844 DNase-treated CSF: about 98.1% unweighted and 99.6% weighted unclassified.
  • SRR1145845 patient serum: about 98.4% unweighted and 99.95% weighted unclassified.
  • SRR1145846 untreated CSF: about 47.3% unweighted but 92.5% weighted unclassified in the broader species summary.
  • SRR1145847 negative-control serum: about 50.0% unweighted and 21.0% weighted unclassified, with the remaining classified portion split across multiple low-level taxa.

This means several samples are low-information or unstable at organism level, even before we ask whether any taxon is clinically plausible.

Code
unclassified_plot_data <- unclassified_summary %>%
  mutate(sample_label = factor(sample_labels[sample_name], levels = sample_labels[metadata$sample_name])) %>%
  pivot_longer(
    cols = c(unweighted_fraction, f_weighted_at_rank),
    names_to = "metric",
    values_to = "fraction"
  ) %>%
  mutate(
    metric = recode(
      metric,
      unweighted_fraction = "Unweighted fraction",
      f_weighted_at_rank = "Weighted fraction"
    ),
    percent = 100 * fraction
  )

unclassified_segments <- unclassified_plot_data %>%
  group_by(sample_name, sample_label, sample_role) %>%
  summarise(
    y_min = min(percent),
    y_max = max(percent),
    .groups = "drop"
  )

unclassified_plot <- ggplot(unclassified_plot_data, aes(x = sample_label, y = percent)) +
  geom_segment(
    data = unclassified_segments,
    aes(x = sample_label, xend = sample_label, y = y_min, yend = y_max),
    inherit.aes = FALSE,
    linewidth = 0.8,
    color = "grey55"
  ) +
  geom_point(aes(fill = metric, shape = metric), size = 3.2, color = "black", stroke = 0.35) +
  scale_fill_manual(values = c("Unweighted fraction" = "#4e79a7", "Weighted fraction" = "#e15759")) +
  scale_shape_manual(values = c("Unweighted fraction" = 21, "Weighted fraction" = 24)) +
  scale_y_continuous(labels = label_percent(scale = 1), limits = c(0, 100), expand = expansion(mult = c(0, 0.02))) +
  labs(
    title = "Unclassified signal dominates most recreated samples",
    subtitle = "Points compare the two abundance metrics within each specimen",
    x = NULL,
    y = "Unclassified fraction (%)",
    fill = "Metric",
    shape = "Metric"
  ) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

save_fig(unclassified_plot, "taxonomy_unclassified_fraction", width = 9.5, height = 5.5)
unclassified_plot

Interpretation

  • High unclassified signal weakens confidence because much of the sample remains unresolved.
  • When a specimen is dominated by unclassified sequence, the classified profile can become incomplete, unstable, and overly sensitive to small numbers of assigned hashes.
  • In a low-biomass clinical specimen, this often means the sample is not robust enough for a trustworthy organism-level conclusion, even if a few taxa are nominally present.
  • For this case, the untreated CSF is the only specimen that still retains a clinically coherent pathogen candidate after strict filtering. The others should be treated cautiously or regarded as non-informative.

Removing the human genome is necessary, but it does not turn the remaining reads into a clean microbial dataset.

After host depletion, the remaining reads may still include:

  • low-quality or low-complexity fragments,
  • short reads that lack taxonomic information,
  • amplification artifacts,
  • reagent or environmental background,
  • host-associated reads that were not removed perfectly,
  • and real biological sequences without close reference genomes in the database.

This is especially common in low-biomass clinical specimens such as CSF. In the original paper, the untreated CSF had millions of raw reads, but only a small fraction were non-human, and an even smaller fraction carried the diagnostic Leptospira signal. The clinical question is therefore not “Can we classify everything?” but “Does the classified part contain a coherent, specimen-specific pathogen signal?”

A sample may be best interpreted as non-informative when:

  • unclassified signal dominates,
  • the classified taxa are scattered across many weak background-like organisms,
  • the signal does not survive stricter thresholds,
  • the same type of signal appears in serum or control samples,
  • or the result does not fit the clinical syndrome or specimen type.

In this recreated workflow, the DNase-treated CSF, patient serum, and negative-control serum do not provide evidence strong enough for a final organism-level conclusion. That is not a failure of learning. It is one of the core lessons of clinical metagenomics.

8.2 Compare strict retained-hit export versus broader exploratory species summaries

Code
strict_comparison <- strict_with %>%
  select(sample_name, lineage, unweighted_fraction, f_weighted_at_rank) %>%
  mutate(view = "Supplied strict merged export") %>%
  bind_rows(
    raw_species_with %>%
      filter(sample_name %in% strict_with$sample_name) %>%
      select(sample_name, lineage, unweighted_fraction, f_weighted_at_rank) %>%
      mutate(view = "Reconstructed raw species summary")
  ) %>%
  arrange(sample_name, view, desc(f_weighted_at_rank))

kable(strict_comparison, digits = 4)
sample_name lineage unweighted_fraction f_weighted_at_rank view
SRR1145846 unclassified 0.4727 0.9254 Reconstructed raw species summary
SRR1145846 d__Bacteria;p__Spirochaetota;c__Leptospiria;o__Leptospirales;f__Leptospiraceae;g__Leptospira;s__Leptospira santarosai 0.4364 0.0640 Reconstructed raw species summary
SRR1145846 d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Alloprevotella;s__Alloprevotella sp009775705 0.0364 0.0043 Reconstructed raw species summary
SRR1145846 d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Providencia;s__Providencia rettgeri_E 0.0182 0.0021 Reconstructed raw species summary
SRR1145846 d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Burkholderiales;f__Burkholderiaceae;g__Castellaniella;s__Castellaniella denitrificans 0.0182 0.0021 Reconstructed raw species summary
SRR1145846 Eukaryota;Chordata;Mammalia;Primates;Cercopithecidae;Papio;Papio anubis 0.0182 0.0021 Reconstructed raw species summary
SRR1145846 unclassified 0.6213 0.8833 Supplied strict merged export
SRR1145846 d__Bacteria;p__Spirochaetota;c__Leptospiria;o__Leptospirales;f__Leptospiraceae;g__Leptospira;s__Leptospira santarosai 0.3787 0.1167 Supplied strict merged export

For the untreated CSF, both views agree on the key teaching point: Leptospira santarosai is the only clinically coherent signal. However, the exact numeric fractions differ because these files come from different summary layers of the pipeline.

That matters for interpretation:

  • The supplied strict export reflects the final, more conservative retained-hit table.
  • The reconstructed species table is broader and noisier, which is useful for teaching why the other samples did not survive stricter filtering.
  • We should therefore compare the two views qualitatively, not pretend that they are numerically interchangeable.

8.3 Diagnostic composition of each sample

Code
diagnostic_mass <- raw_species_with %>%
  mutate(
    diagnostic_category = case_when(
      lineage == "unclassified" ~ "Unclassified",
      grepl("Leptospira santarosai$", lineage) ~ "Leptospira santarosai",
      grepl("^Eukaryota", lineage) ~ "Eukaryotic/host-associated",
      TRUE ~ "Other classified"
    )
  ) %>%
  group_by(sample_name, sample_role, diagnostic_category) %>%
  summarise(
    unweighted_fraction = sum(unweighted_fraction, na.rm = TRUE),
    f_weighted_at_rank = sum(f_weighted_at_rank, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_longer(
    cols = c(unweighted_fraction, f_weighted_at_rank),
    names_to = "metric",
    values_to = "fraction"
  ) %>%
  mutate(
    metric = recode(
      metric,
      unweighted_fraction = "Unweighted fraction",
      f_weighted_at_rank = "Weighted fraction"
    ),
    diagnostic_category = factor(
      diagnostic_category,
      levels = c("Unclassified", "Leptospira santarosai", "Eukaryotic/host-associated", "Other classified")
    ),
    sample_label = factor(sample_labels[sample_name], levels = sample_labels[metadata$sample_name])
  )

diagnostic_mass_plot <- ggplot(diagnostic_mass, aes(x = sample_label, y = fraction, fill = diagnostic_category)) +
  geom_col(width = 0.75, color = "white") +
  facet_wrap(~metric, ncol = 1) +
  scale_fill_manual(values = signal_palette) +
  scale_y_continuous(labels = percent_format(accuracy = 1), expand = expansion(mult = c(0, 0.02))) +
  labs(
    title = "Only the untreated CSF carries a clinically coherent leptospiral signal",
    subtitle = "The other samples are dominated by unclassified signal or diffuse low-level background",
    x = NULL,
    y = "Fraction of sample",
    fill = "Signal category"
  ) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

save_fig(diagnostic_mass_plot, "taxonomy_diagnostic_signal_vs_unclassified", width = 10, height = 7)
diagnostic_mass_plot

Interpretation

  • The untreated CSF is the only sample where leptospiral signal is large enough to remain central to the story.
  • The DNase-treated CSF shows only a tiny leptospiral trace in the broader summary and no retained strict YACHT hit.
  • The patient serum is overwhelmingly unclassified and does not support a bloodstream leptospiral signal in this recreated workflow.
  • The negative-control serum is not cleanly blank in the broader summary, which argues for caution and supports a conservative interpretation of any low-level non-CSF hits.

8.4 Top classified taxa per sample

Code
top_classified_table <- raw_species_without %>%
  group_by(sample_name, sample_role) %>%
  slice_max(order_by = f_weighted_classified, n = 5, with_ties = FALSE) %>%
  ungroup() %>%
  select(sample_name, sample_role, species_label, f_weighted_classified, unweighted_fraction, f_weighted_at_rank)

kable(top_classified_table, digits = 4)
sample_name sample_role species_label f_weighted_classified unweighted_fraction f_weighted_at_rank
SRR1145844 DNase-treated CSF Escherichia coli 0.9794 0.0055 0.0043
SRR1145844 DNase-treated CSF Comamonas tsuruhatensis 0.0059 0.0028 0.0000
SRR1145844 DNase-treated CSF Leptospira santarosai 0.0059 0.0028 0.0000
SRR1145844 DNase-treated CSF Brucella pituitosa 0.0029 0.0028 0.0000
SRR1145844 DNase-treated CSF Sphingobacterium cladoniae 0.0029 0.0028 0.0000
SRR1145845 Patient serum Acinetobacter johnsonii 0.5385 0.0027 0.0003
SRR1145845 Patient serum Pseudomonas aeruginosa 0.2308 0.0054 0.0001
SRR1145845 Patient serum Corynebacterium matruchotii 0.1231 0.0027 0.0001
SRR1145845 Patient serum Escherichia coli 0.0923 0.0027 0.0000
SRR1145845 Patient serum Contendobacter odensis 0.0154 0.0027 0.0000
SRR1145846 Untreated CSF Leptospira santarosai 0.8571 0.4364 0.0640
SRR1145846 Untreated CSF Alloprevotella sp009775705 0.0571 0.0364 0.0043
SRR1145846 Untreated CSF Providencia rettgeri_E 0.0286 0.0182 0.0021
SRR1145846 Untreated CSF Castellaniella denitrificans 0.0286 0.0182 0.0021
SRR1145846 Untreated CSF Papio anubis 0.0286 0.0182 0.0021
SRR1145847 Negative-control serum Escherichia coli 0.4150 0.0417 0.3280
SRR1145847 Negative-control serum Solirubrobacter taibaiensis 0.2109 0.0417 0.1667
SRR1145847 Negative-control serum Antrihabitans stalagmiti 0.1497 0.0833 0.1183
SRR1145847 Negative-control serum Pseudomonas_E amygdali 0.0884 0.0833 0.0699
SRR1145847 Negative-control serum Variovorax sp003952185 0.0476 0.0417 0.0376

This table is useful, but it must be read carefully:

  • f_weighted_classified describes composition among the classified component only.
  • In samples that are nearly all unclassified, a taxon can look “dominant” within the classified slice while still being clinically trivial in the original specimen.
  • That is why the top classified taxa in the DNase-treated CSF, serum, and control serum should not be treated as actionable pathogen calls.
Code
top_species <- raw_species_without %>%
  group_by(species_label) %>%
  summarise(max_classified = max(f_weighted_classified, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(max_classified), species_label) %>%
  slice_head(n = 10) %>%
  pull(species_label)

heatmap_data <- raw_species_without %>%
  filter(species_label %in% top_species) %>%
  select(sample_name, sample_role, species_label, f_weighted_classified) %>%
  complete(
    sample_name = metadata$sample_name,
    species_label = top_species,
    fill = list(f_weighted_classified = 0)
  ) %>%
  left_join(metadata %>% select(sample_name, sample_role), by = "sample_name") %>%
  mutate(
    sample_label = factor(sample_labels[sample_name], levels = sample_labels[metadata$sample_name]),
    species_label = factor(species_label, levels = rev(top_species))
  )

species_heatmap <- ggplot(
  heatmap_data,
  aes(x = sample_label, y = species_label, fill = 100 * f_weighted_classified)
) +
  geom_tile(color = "grey80", linewidth = 0.3) +
  scale_fill_gradientn(
    colours = c("grey98", "#c7e9c0", "#41ab5d", "#005a32"),
    labels = label_number(suffix = "%", accuracy = 1)
  ) +
  labs(
    title = "Among classified signal, untreated CSF is dominated by Leptospira",
    subtitle = "These are classified-only percentages and should not be mistaken for whole-sample dominance",
    x = NULL,
    y = "Species",
    fill = "Classified-only\nabundance"
  ) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

save_fig(species_heatmap, "taxonomy_species_heatmap", width = 10, height = 6)
species_heatmap

Interpretation

  • This view helps show why the untreated CSF is clinically different from the other specimens.
  • Once we condition on the classified component, Leptospira santarosai becomes the dominant classified organism in untreated CSF.
  • The other samples do not show a similarly coherent pathogen pattern. Their classified signal is fragmented across taxa that are either implausible, background-like, or too weak to trust.

9 Sample-level diagnostic interpretation

Code
strict_hits_05 <- yacht_thresholds %>%
  filter(threshold == "0.05") %>%
  transmute(
    sample_name,
    strict_retained_hit = !is.na(organism_name),
    strict_hit_label = if_else(is.na(organism_name), "No retained hit", organism_name),
    strict_confidence = actual_confidence_with_coverage
  )

top_classified <- raw_species_without %>%
  group_by(sample_name) %>%
  slice_max(order_by = f_weighted_classified, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(
    sample_name,
    top_classified_species = species_label,
    top_classified_f_weighted_classified = f_weighted_classified
  )

sample_diagnostic_summary <- metadata %>%
  select(sample_name, sample_role, patient_status, approx_total_reads, clinical_use) %>%
  left_join(
    raw_species_with %>%
      filter(lineage == "unclassified") %>%
      select(sample_name, unclassified_unweighted = unweighted_fraction, unclassified_weighted = f_weighted_at_rank, total_weighted_hashes),
    by = "sample_name"
  ) %>%
  left_join(strict_hits_05, by = "sample_name") %>%
  left_join(top_classified, by = "sample_name") %>%
  mutate(
    diagnostic_interpretation = case_when(
      sample_role == "Untreated CSF" ~ "Most clinically meaningful sample. Retains a strict Leptospira hit, but still has substantial unresolved signal.",
      unclassified_weighted >= 0.95 ~ "Essentially non-informative at organism level. Dominated by unclassified signal.",
      sample_role == "Negative-control serum" ~ "Background/control sample with unstable mixed low-level signal. Useful mainly as a cautionary comparator.",
      TRUE ~ "No persuasive pathogen-level evidence strong enough for a final conclusion."
    )
  ) %>%
  arrange(sample_role)

kable(sample_diagnostic_summary, digits = 3)
sample_name sample_role patient_status approx_total_reads clinical_use unclassified_unweighted unclassified_weighted total_weighted_hashes strict_retained_hit strict_hit_label strict_confidence top_classified_species top_classified_f_weighted_classified diagnostic_interpretation
SRR1145846 Untreated CSF Patient 256238 Primary diagnostic CNS specimen 0.473 0.925 469 TRUE GCF_000313175.2 Leptospira santarosai serovar Shermani str. LT 821 0.988 Leptospira santarosai 0.857 Most clinically meaningful sample. Retains a strict Leptospira hit, but still has substantial unresolved signal.
SRR1145844 DNase-treated CSF Patient 2612696 Comparator CNS specimen after DNase treatment 0.981 0.996 77552 FALSE No retained hit NA Escherichia coli 0.979 Essentially non-informative at organism level. Dominated by unclassified signal.
SRR1145845 Patient serum Patient 5627514 Systemic comparator specimen 0.984 1.000 130988 FALSE No retained hit NA Acinetobacter johnsonii 0.538 Essentially non-informative at organism level. Dominated by unclassified signal.
SRR1145847 Negative-control serum Control 89194 Negative control for background/context 0.500 0.210 186 FALSE No retained hit NA Escherichia coli 0.415 Background/control sample with unstable mixed low-level signal. Useful mainly as a cautionary comparator.

9.1 Practical diagnostic conclusions from the four-sample set

9.1.1 1. Untreated CSF (SRR1145846)

  • This is the only sample that retains a confident strict YACHT hit at min_coverage = 0.05.
  • That hit is Leptospira santarosai.
  • In the broader species summary, this sample still carries a heavy unclassified fraction, so the result should be treated as clinically meaningful but not artificially perfect.
  • Even so, this is the most clinically meaningful specimen in the recreated workflow and the only sample that supports a plausible actionable diagnosis.

9.1.2 2. DNase-treated CSF (SRR1145844)

  • This sample is overwhelmingly unclassified.
  • Only a tiny low-level leptospiral trace is visible in the broader species summary, and it disappears from the strict retained-hit table.
  • The most appropriate interpretation is no reliable organism-level conclusion from this sample.
  • This is also consistent with the teaching point that DNase treatment can be helpful for viral enrichment but may reduce useful bacterial signal.

9.1.3 3. Patient serum (SRR1145845)

  • This sample is also overwhelmingly unclassified and does not yield a coherent leptospiral result.
  • The low-level classified taxa are not persuasive for a bloodstream explanation.
  • Therefore, the recreated serum does not provide trustworthy evidence for a final pathogen-level conclusion.

9.1.4 4. Negative-control serum (SRR1145847)

  • The strict YACHT export retains no hit, which is reassuring.
  • However, the broader species summary contains a diffuse mixture of low-level taxa rather than a perfectly clean blank.
  • That does not support a real alternative pathogen. Instead, it shows why low-biomass/control samples can look unstable and why conservative filtering matters.

10 AMR inspection

Code
read_amr <- function(path) {
  dat <- read_tsv_char(path)
  dat %>%
    mutate(
      sample_name = sub("[.]gene_mapping_data[.]txt$", "", basename(path)),
      `Completely Mapped Reads` = as.numeric(`Completely Mapped Reads`),
      `Average Percent Coverage` = as.numeric(`Average Percent Coverage`),
      `Reference Length` = as.numeric(`Reference Length`)
    )
}

amr_data <- bind_rows(lapply(amr_files, read_amr)) %>%
  left_join(metadata %>% select(sample_name, sample_role), by = "sample_name") %>%
  arrange(sample_role, desc(`Completely Mapped Reads`))

amr_summary <- amr_data %>%
  group_by(sample_name, sample_role) %>%
  summarise(
    n_hits = n(),
    total_complete_reads = sum(`Completely Mapped Reads`, na.rm = TRUE),
    max_coverage = max(`Average Percent Coverage`, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(sample_role)

kable(amr_summary, digits = 2)
sample_name sample_role n_hits total_complete_reads max_coverage
SRR1145846 Untreated CSF 1 2 23.51
SRR1145844 DNase-treated CSF 3 312 43.29
SRR1145845 Patient serum 2 32 2.79
SRR1145847 Negative-control serum 3 4 16.84
Code
kable(
  amr_data %>%
    select(sample_name, sample_role, `ARO Term`, `Completely Mapped Reads`, `Average Percent Coverage`, `AMR Gene Family`, `Drug Class`),
  digits = 2
)
sample_name sample_role ARO Term Completely Mapped Reads Average Percent Coverage AMR Gene Family Drug Class
SRR1145846 Untreated CSF vanT gene in vanG cluster 2 23.51 glycopeptide resistance gene cluster; vanT glycopeptide antibiotic
SRR1145844 DNase-treated CSF tet(C) 293 23.78 major facilitator superfamily (MFS) antibiotic efflux pump tetracycline antibiotic
SRR1145844 DNase-treated CSF adeF 15 2.11 resistance-nodulation-cell division (RND) antibiotic efflux pump fluoroquinolone antibiotic; tetracycline antibiotic
SRR1145844 DNase-treated CSF TEM-116 4 43.29 TEM beta-lactamase monobactam; cephalosporin; penicillin beta-lactam
SRR1145845 Patient serum adeF 31 2.05 resistance-nodulation-cell division (RND) antibiotic efflux pump fluoroquinolone antibiotic; tetracycline antibiotic
SRR1145845 Patient serum vanW gene in vanG cluster 1 2.79 vanW; glycopeptide resistance gene cluster glycopeptide antibiotic
SRR1145847 Negative-control serum ArnT 2 8.73 pmr phosphoethanolamine transferase peptide antibiotic
SRR1145847 Negative-control serum mdtM 1 1.39 major facilitator superfamily (MFS) antibiotic efflux pump fluoroquinolone antibiotic; lincosamide antibiotic; nucleoside antibiotic; phenicol antibiotic; disinfecting agents and antiseptics
SRR1145847 Negative-control serum TEM-145 1 16.84 TEM beta-lactamase monobactam; cephalosporin; penicillin beta-lactam
Code
amr_plot_data <- amr_data %>%
  mutate(
    sample_label = factor(sample_labels[sample_name], levels = sample_labels[metadata$sample_name]),
    aro_label = factor(`ARO Term`, levels = rev(unique(`ARO Term`)))
  )

amr_plot <- ggplot(
  amr_plot_data,
  aes(x = sample_label, y = aro_label)
) +
  geom_point(
    aes(size = `Completely Mapped Reads`, color = `Average Percent Coverage`),
    alpha = 0.9
  ) +
  scale_color_gradientn(colours = c("#d9d9d9", "#fdae6b", "#e6550d", "#a63603")) +
  scale_size_continuous(range = c(2.5, 10)) +
  labs(
    title = "AMR calls are sparse and not central to the diagnostic question",
    x = NULL,
    y = "ARO term",
    size = "Completely\nmapped reads",
    color = "Average %\ncoverage"
  ) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

save_fig(amr_plot, "amr_sparse_signal_summary", width = 10, height = 5.5)
amr_plot

AMR interpretation

  • The AMR output is very sparse.
  • Only one putative AMR feature is associated with the untreated CSF leptospiral sample, and it is supported by only 2 completely mapped reads.
  • The higher-read AMR signal appears in the DNase-treated CSF, but that same specimen is taxonomically non-informative for the pathogen question.
  • Coverage is low for most calls, and none of these results materially changes the diagnostic reasoning in this case.

The most appropriate conclusion is that the AMR output is illustrative rather than clinically informative here.

11 Comparison with the original paper

The original paper reported:

  • 475 single-end CSF reads mapping to Leptospira in untreated CSF,
  • 263 single-end reads in DNase-treated CSF,
  • 0 leptospiral reads in serum,
  • 0 leptospiral reads in the negative control,
  • and eventual confirmation of Leptospira santarosai by targeted follow-up testing.

Before opening the comparison table, propose two reasons why this recreated output might look different from the published report.

Think about what changed between the original study and this workshop re-analysis: preprocessing, host filtering, reference databases, classification algorithms, and thresholds.

Code
comparison_table <- tibble::tribble(
  ~Specimen, ~Original_paper, ~Recreated_workflow, ~Teaching_interpretation,
  "Untreated CSF",
  "Clear actionable leptospiral signal; 475 single-end reads and 955 paired-end bacterial reads assigned to Leptospira.",
  "Only specimen with a retained strict YACHT hit. Broad species summary still has heavy unclassified signal, but Leptospira remains the only coherent diagnosis-level candidate.",
  "This is the one specimen that materially narrows the differential.",
  "DNase-treated CSF",
  "Leptospiral reads still present in the paper, though lower than untreated CSF.",
  "No retained strict hit. Exploratory species summary is overwhelmingly unclassified, with only a tiny leptospiral trace.",
  "Best interpreted as non-informative in the recreated workflow.",
  "Patient serum",
  "No leptospiral reads detected.",
  "No retained strict hit. Serum is dominated by unclassified signal and weak background taxa.",
  "Does not add trustworthy organism-level evidence.",
  "Negative-control serum",
  "No leptospiral reads detected and effectively clean for the key pathogen question.",
  "No retained strict hit, but the broader species summary is noisier than expected, with mixed low-level taxa.",
  "Useful as a cautionary control showing why low-level signal should not be over-called."
)

kable(comparison_table)
Specimen Original_paper Recreated_workflow Teaching_interpretation
Untreated CSF Clear actionable leptospiral signal; 475 single-end reads and 955 paired-end bacterial reads assigned to Leptospira. Only specimen with a retained strict YACHT hit. Broad species summary still has heavy unclassified signal, but Leptospira remains the only coherent diagnosis-level candidate. This is the one specimen that materially narrows the differential.
DNase-treated CSF Leptospiral reads still present in the paper, though lower than untreated CSF. No retained strict hit. Exploratory species summary is overwhelmingly unclassified, with only a tiny leptospiral trace. Best interpreted as non-informative in the recreated workflow.
Patient serum No leptospiral reads detected. No retained strict hit. Serum is dominated by unclassified signal and weak background taxa. Does not add trustworthy organism-level evidence.
Negative-control serum No leptospiral reads detected and effectively clean for the key pathogen question. No retained strict hit, but the broader species summary is noisier than expected, with mixed low-level taxa. Useful as a cautionary control showing why low-level signal should not be over-called.

11.1 Main similarities with the paper

  • The untreated CSF remains the most important sample.
  • The recreated workflow still points to Leptospira santarosai in that specimen.
  • Serum does not provide stronger competing evidence for an alternative pathogen.
  • The case still teaches the same major clinical lesson: unbiased CSF metagenomics can shorten the path to a treatable diagnosis when serial conventional testing is unrevealing.

11.2 Main differences from the paper

  • The recreated dataset is much noisier, especially outside the untreated CSF.
  • The recreated DNase-treated CSF is not cleanly supportive, whereas the paper still recovered leptospiral signal there.
  • The recreated negative-control serum is not as clean as the paper’s description would lead us to expect.
  • The supplied strict workshop export collapses the dataset to one retained sample, which is a much more conservative picture than a simplified classroom result card might suggest.

This comparison should not be read as “the paper was right and the re-analysis is wrong” or the reverse.

The original paper used the tools and databases available in 2013-2014 and then confirmed the result with PCR, Sanger sequencing, phylogenetics, and clinical response. The recreated workflow uses modern sketch-based and threshold-sensitive tools that may prioritize specificity and may discard weak or distributed background calls.

Both perspectives can be compatible:

  • the clinical conclusion of the original paper remains strongly supported,
  • the modern re-analysis may be stricter about what counts as a retained pathogen-level call,
  • and samples with heavy unclassified/background signal may reasonably become non-informative.

12 Why the recreated results may differ so much from the original report

Several factors could plausibly explain the mismatch:

Separate biological explanations from workflow explanations. In this case, many differences are likely workflow-related rather than evidence that the pathogen biology was different.

  1. Different sequencing depth or retained read volume

The local metadata suggest a much smaller re-analysis input than the original paper described for some specimens, especially untreated CSF and the negative control. Less data means less sensitivity and more instability.

  1. Different preprocessing and host-removal behavior

Modern host depletion and stricter preprocessing can reduce false positives, but they can also leave very sparse non-host signal in low-biomass samples.

  1. Different classification framework

The original paper used SURPI with nt/protein alignment. This notebook uses outputs from Sourmash/YACHT and taxmetagenome summaries, which are sketch-based and threshold-sensitive.

  1. Different databases

The modern database now contains Leptospira santarosai, so direct species-level matching is easier than it was in 2014. At the same time, different databases can redistribute marginal reads into different taxa.

  1. Conservative thresholds

The strict YACHT export uses a min_coverage = 0.05 retained-hit view. That improves specificity, but it also means borderline samples drop out completely.

  1. Low-biomass CSF and comparator samples

Low-biomass clinical metagenomics is especially vulnerable to instability, high host background, stochastic classification, and over-interpretation of tiny signal.

  1. Unclassified-heavy samples are inherently incomplete

When most of a sample is unclassified, the organism-level profile is only a partial summary of what remains after filtering and database matching. That is exactly why some recreated samples should be called non-informative rather than forced into a diagnosis.

The important teaching message is that disagreement with the original paper does not automatically mean the recreated workflow is wrong. It may mean that the recreated workflow is being more conservative with weak or unstable samples.

13 Integrated clinical interpretation

13.1 Moving from broad differential to sequencing-based narrowing

Before sequencing, the clinically plausible differential remains broad:

  • tuberculous meningitis,
  • viral meningoencephalitis,
  • fungal CNS infection,
  • partially treated bacterial meningitis,
  • autoimmune/inflammatory disease,
  • and less common zoonotic infection.

The sequencing results change that differential unevenly:

  • Most samples do not help enough to justify a pathogen-level conclusion.
  • The untreated CSF is the exception. It provides the only coherent organism-level signal that survives strict filtering and fits the clinical story.

13.2 Most likely diagnosis

The most likely final diagnosis is neuroleptospirosis caused by pathogenic Leptospira, most consistent with Leptospira santarosai in this recreated workflow.

That conclusion is supported by the combination of:

  • a difficult CNS infection syndrome with repeatedly unrevealing conventional testing,
  • the clinical logic of the original case,
  • the untreated CSF being the best bacterial specimen,
  • a retained strict YACHT hit to Leptospira santarosai only in untreated CSF,
  • and the lack of a comparably coherent alternative explanation from the other samples.

13.3 Would these recreated results change management?

  • Untreated CSF: yes, this result would meaningfully narrow the diagnostic pathway and justify targeted confirmatory discussion.
  • DNase-treated CSF: no, this specimen does not provide evidence strong enough for a final organism-level conclusion.
  • Patient serum: no, this specimen does not return trustworthy evidence strong enough for a final conclusion.
  • Negative control: no, it is informative only as a cautionary background comparator.

14 Why some samples may be non-informative

This case is valuable precisely because not every specimen helps equally.

In the recreated workflow, a sample should be considered non-informative when:

  • it is dominated by unclassified signal,
  • the classified component is small and fragmented,
  • the top taxa are clinically implausible or inconsistent with the specimen type,
  • the negative control is noisy enough to raise caution about low-level background,
  • or the strict retained-hit step drops the sample completely.

That is exactly what happens for the DNase-treated CSF, patient serum, and control serum here.

In a workshop setting, this is an important lesson: a metagenomic workflow is useful not only when it finds the answer, but also when it shows that a specimen does not contain sufficiently trustworthy evidence for a conclusion.

15 Limitations

  • This is a teaching re-analysis, not a full reproduction of the original study.
  • The supplied strict workshop export retains only one sample, so a broader four-sample view had to be reconstructed from lower-level per-sample summaries.
  • The dataset contains only four runs, one of which is a negative control.
  • Heavy unclassified signal makes several sample profiles unstable.
  • The abundance metrics differ across summary layers, so exact percentages should not be over-harmonized.
  • The AMR workflow is present but sparse and not clinically central to this case.
  • Diversity and ordination results are purely descriptive and should not be treated as formal microbiome statistics.

16 Suggested discussion points for learners

  1. Why is the untreated CSF the most useful sample for this case?
  2. Why does a large unclassified fraction weaken confidence in the DNase-treated CSF and serum results?
  3. Why is a noisy negative control important even when it does not show the final pathogen?
  4. Why can a recreated workflow disagree with a published paper without being obviously wrong?
  5. How should clinicians balance an imperfect but coherent metagenomic result against ongoing diagnostic uncertainty in a deteriorating patient?

17 Suggested next steps and follow-up tests

  • Re-review the exposure history for water and animal-associated leptospiral risk.
  • Send targeted CSF confirmatory molecular testing for pathogenic Leptospira if available.
  • Discuss disease-directed therapy in parallel with confirmatory testing if the full clinical picture is strongly compatible.
  • If clinically appropriate, consider convalescent serology with the clear caveat that immune dysfunction or immunoglobulin replacement can complicate interpretation.
  • Preserve the untreated CSF as the most important specimen for further confirmatory work.

18 Optional research-style add-on: ecological summaries

This final add-on is not needed to reach the clinical diagnosis. It is included for learners who want to see how the same processed table can be adapted into a microbiome-style R workflow, similar to the previous Vietnamese healthy gut microbiome notebook.

Important

These plots are descriptive only. With four runs, heavy unclassified signal, and clinical low-biomass specimens, alpha diversity and ordination should not be treated as formal microbiome evidence.

18.1 Build an exploratory phyloseq object

The supplied nejm.phyloseq/ export is too incomplete for four-sample comparisons, so we build a small exploratory object from the reconstructed classified-only species table.

Code
otu_export <- raw_species_without %>%
  select(species_label, sample_name, f_weighted_classified) %>%
  distinct() %>%
  pivot_wider(names_from = sample_name, values_from = f_weighted_classified, values_fill = 0)

tax_export <- raw_species_without %>%
  select(species_label, domain, phylum, class, order, family, genus, species) %>%
  distinct()

sample_export <- metadata %>%
  select(sample_name, sample_role, patient_status, approx_total_reads, clinical_use)

otu_mat <- as.matrix(as.data.frame(otu_export[, -1]))
rownames(otu_mat) <- otu_export$species_label

tax_mat <- as.matrix(as.data.frame(tax_export[, -1]))
rownames(tax_mat) <- tax_export$species_label

sample_df <- as.data.frame(sample_export)
rownames(sample_df) <- sample_df$sample_name
sample_df$sample_name <- NULL

phy_exploratory <- phyloseq(
  otu_table(otu_mat, taxa_are_rows = TRUE),
  tax_table(tax_mat),
  sample_data(sample_df)
)

phy_exploratory
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 23 taxa and 4 samples ]
sample_data() Sample Data:       [ 4 samples by 4 sample variables ]
tax_table()   Taxonomy Table:    [ 23 taxa by 7 taxonomic ranks ]

18.2 Alpha diversity

Before running this chunk, predict whether the most clinically informative sample should necessarily have the highest diversity.

In low-biomass clinical metagenomics, higher diversity can mean diffuse background, not a better diagnostic specimen.

Code
sample_matrix_alpha <- t(otu_mat)

alpha_diversity <- tibble(
  sample_name = rownames(sample_matrix_alpha),
  Observed = specnumber(sample_matrix_alpha),
  Shannon = diversity(sample_matrix_alpha, index = "shannon")
) %>%
  left_join(metadata %>% select(sample_name, sample_role), by = "sample_name") %>%
  mutate(
    sample_label = factor(sample_labels[sample_name], levels = sample_labels[metadata$sample_name])
  )

kable(alpha_diversity, digits = 3)
sample_name Observed Shannon sample_role sample_label
SRR1145844 6 0.132 DNase-treated CSF DNase-treated CSF
SRR1145844
SRR1145845 5 1.214 Patient serum Patient serum
SRR1145845
SRR1145846 5 0.600 Untreated CSF Untreated CSF
SRR1145846
SRR1145847 10 1.690 Negative-control serum Negative-control serum
SRR1145847
Code
alpha_observed_plot <- ggplot(alpha_diversity, aes(x = sample_label, y = Observed, fill = sample_role)) +
  geom_col(width = 0.75, color = "black") +
  scale_fill_manual(values = sample_role_palette) +
  labs(
    title = "Observed classified species",
    x = NULL,
    y = "Observed species",
    fill = "Sample role"
  ) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

alpha_shannon_plot <- ggplot(alpha_diversity, aes(x = sample_label, y = Shannon, fill = sample_role)) +
  geom_col(width = 0.75, color = "black") +
  scale_fill_manual(values = sample_role_palette) +
  labs(
    title = "Shannon diversity of classified-only signal",
    x = NULL,
    y = "Shannon index",
    fill = "Sample role"
  ) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

alpha_plot <- alpha_observed_plot / alpha_shannon_plot
save_fig(alpha_plot, "taxonomy_alpha_diversity", width = 10, height = 8)
alpha_plot

Interpretation

  • These alpha diversity values are descriptive only.
  • In a clinical low-biomass dataset, higher diversity does not automatically mean a more informative specimen. It can just mean “more diffuse background.”
  • The untreated CSF is actually less diverse than the noisier comparator specimens, which is compatible with a more coherent pathogen-dominated classified component.
  • The DNase-treated CSF, patient serum, and control serum may look more diverse because their classified reads are split across many weak taxa, not because they are clinically richer samples.

18.3 Beta diversity and ordination

Code
sample_matrix <- t(otu_mat)
bray_dist <- vegdist(sample_matrix, method = "bray")
ordination <- cmdscale(bray_dist, eig = TRUE, k = 2)

ord_df <- as.data.frame(ordination$points)
colnames(ord_df) <- c("Axis1", "Axis2")
ord_df$sample_name <- rownames(ord_df)

ord_df <- ord_df %>%
  left_join(metadata %>% select(sample_name, sample_role), by = "sample_name") %>%
  mutate(
    sample_label = sample_labels[sample_name]
  )

ord_plot <- ggplot(ord_df, aes(x = Axis1, y = Axis2, color = sample_role, label = sample_label)) +
  geom_hline(yintercept = 0, linewidth = 0.3, color = "grey80") +
  geom_vline(xintercept = 0, linewidth = 0.3, color = "grey80") +
  geom_point(size = 3.5) +
  geom_text(nudge_y = 0.02, size = 3.2, show.legend = FALSE) +
  scale_color_manual(values = sample_role_palette) +
  labs(
    title = "Bray-Curtis ordination of classified-only species profiles",
    subtitle = "Useful as a descriptive picture only; not suitable for formal inference",
    x = "PCoA 1",
    y = "PCoA 2",
    color = "Sample role"
  )

save_fig(ord_plot, "taxonomy_bray_curtis_ordination", width = 9, height = 5.5)
ord_plot

Interpretation

  • The ordination is best read as a visual summary of how different the classified-only profiles are.
  • It reinforces that the untreated CSF behaves differently from the other samples.
  • However, with only four runs and heavy unclassified signal, this plot is a teaching visual, not evidence of a stable ecological pattern.

19 Conclusion

This recreated notebook preserves the main clinical-metagenomic lesson of the original NEJM case while being honest about the limits of the local data.

The key conclusion is not that every sample yields a useful answer. It is that only one sample, the untreated CSF, provides credible organism-level evidence strong enough to narrow the case toward neuroleptospirosis, while the remaining recreated samples are dominated by unclassified signal or unstable background and should be treated as non-informative.

That is exactly the kind of interpretation healthcare workers need in practice: not just how to spot a plausible pathogen, but also how to recognize when a specimen does not support a trustworthy conclusion.

Tip

Return to the Day 4 overview.

Back to top