Actionable Diagnostics & Clinical Reasoning: Signal Analysis
Interpreting metagenomic outputs for a neuroleptospirosis case
Author
Minh-Quan Ton-Ngoc, Hao Chung The
Published
May 15, 2026
Transition to analysis: The sections below provide the full reproducible debrief behind the simulation. They show how the local files were inspected, why only one specimen survives strict filtering, and why high unclassified signal forces a conservative clinical interpretation.
The original NEJM paper described a 14-year-old boy with severe combined immunodeficiency who had a prolonged, difficult-to-diagnose meningoencephalitic illness. Conventional testing remained unrevealing, and unbiased sequencing of CSF eventually identified Leptospira, leading to targeted therapy and clinical recovery.
For this workshop, that story has been adapted into a Vietnam-facing hospital teaching case. The educational aim is not only to say “the pathogen was found,” but to help healthcare workers reason through a more realistic diagnostic pathway:
broad CNS infection differential at admission,
staged release of clinical clues,
sequencing as a tool for narrowing uncertainty,
careful interpretation of which samples are and are not trustworthy,
and final integration of sequencing with clinical management.
This notebook now implements that staged pedagogy directly as a self-guided exercise, then continues into the reproducible analysis.
This matches the logic of the original paper, where CSF was the key specimen and serum/control provided context.
The local read_count values are notably smaller than the multi-million read counts reported in the paper, especially for the untreated CSF and control serum. Whether these reflect SRA spots, processed subsets, or downsampled inputs, they are an important clue that the recreated workflow is not operating on exactly the same data volume as the paper.
4 Part A: quick interpretation from direct final results
This first interpretation layer uses the direct final results from:
This is the easier, more obvious layer of interpretation. Before running the code, predict what you expect to see:
Will all four samples produce a confident pathogen-level result?
Which specimen should be most diagnostically useful for CNS infection?
Would serum be expected to show the same pathogen signal?
Click to hint
For a CNS infection, CSF is usually more informative than serum. Also remember that a negative control is not expected to produce a disease-causing pathogen signal.
This code follows the same practical style as the previous Vietnamese healthy gut microbiome notebook: first check what files exist, then load the smallest interpretable summary table, then only move to deeper analysis if the first result is incomplete or confusing.
How many samples show a meaningful retained organism-level result?
Which organism is detected?
Does this match the pathogen described in the original paper?
Click to reveal interpretation
The direct final result is very simple:
Only one sample yields a meaningful pathogen-level result.
The informative sample is the untreated CSF (SRR1145846).
The retained organism is Leptospira santarosai.
This matches the final pathogen interpretation of the original paper.
This is the clearest direct result layer. It gives learners the main diagnostic message quickly, before they explore why the other samples are weak or non-informative.
The AMR output does not materially change the diagnosis in this case.
The AMR calls are sparse, low-coverage, and not central to recognizing neuroleptospirosis. They are useful as a demonstration that clinical metagenomic pipelines can return AMR-related outputs, but here they are mainly illustrative rather than clinically decisive.
5 Part B: deeper interpretation using processed analysis tables
The quick result above is enough to identify the most likely pathogen. The deeper analysis below asks harder questions:
Why does only one sample survive the strict retained-hit layer?
Why are unclassified reads so prominent?
Why might the recreated workflow look different from the original paper?
When should a sample be interpreted as non-informative rather than forced into a diagnosis?
Before opening interpretations in this section, try to predict what the table or figure should show.
Click to hint
The important comparison is not only “which taxon is present?” It is also “how much of each sample is unresolved, and does the taxon survive conservative filtering?”
Only the untreated CSF sample survives the strict merged export.
In that export, the only retained organism-level call is Leptospira santarosai.
Even in that sample, the supplied table still shows a large unclassified component, which immediately tells us not to over-simplify the interpretation.
Because the supplied export collapses the dataset to one sample, it is not sufficient on its own for a four-sample diagnostic teaching notebook.
6 Reconstruct a four-sample exploratory taxonomy table from the underlying read-based outputs
To preserve the full diagnostic set, we reconstruct a species-level table from the underlying per-sample taxonomic summaries in:
This is not identical to the supplied strict YACHT export. It is a broader exploratory view that helps explain why only one sample survived the stricter final filter.
The reconstructed exploratory table restores the expected four-sample panel. This is the table used below for the whole-case interpretation, while the supplied strict export is still used as evidence that only one sample passed the stricter retained-hit filter.
6.1 Harmonize sample identifiers across metadata, taxonomy, AMR, and strict exports
GCF_000313175.2 Leptospira santarosai serovar Shermani str. LT 821
0.978
0.020
SRR1145846
Untreated CSF
0.05
GCF_000313175.2 Leptospira santarosai serovar Shermani str. LT 821
0.988
0.011
SRR1145844
DNase-treated CSF
0.01
GCA_001321855.1 SAR11 cluster bacterium PRT-SC02
0.958
0.055
SRR1145844
DNase-treated CSF
0.05
No retained hit
NA
NA
SRR1145845
Patient serum
0.01
GCA_001321855.1 SAR11 cluster bacterium PRT-SC02
0.958
0.055
SRR1145845
Patient serum
0.05
No retained hit
NA
NA
SRR1145847
Negative-control serum
0.01
GCA_001321855.1 SAR11 cluster bacterium PRT-SC02
0.958
0.055
SRR1145847
Negative-control serum
0.05
No retained hit
NA
NA
Threshold interpretation
At the looser min_coverage = 0.01 setting, three non-untreated-CSF samples retain a weak hit to SAR11 cluster bacterium PRT-SC02, which is not clinically coherent for this case.
At the stricter min_coverage = 0.05 setting, only SRR1145846 retains a YACHT hit, and that hit is Leptospira santarosai.
This is the key reason the workshop export in nejm_analysis/ collapses to one sample.
Click to reveal: why “many aligned reads” can still mean no pathogen call
A large number of nucleotide alignments is not the same thing as a true pathogen signal.
For a clinically persuasive metagenomic result, we need more than “some reads aligned somewhere.” We want a signal that is:
enriched in the right specimen,
concentrated in a biologically plausible organism,
absent or much weaker in serum/control samples,
robust enough to survive conservative thresholds,
and coherent with the patient’s syndrome.
In the original paper, the untreated CSF had a dominant Leptospira signal among bacterial reads, while serum and control did not. In contrast, many other bacterial assignments were distributed across background-like genera and appeared in comparator or control libraries. Modern stricter workflows can reasonably remove those weak, distributed calls while retaining the untreated-CSF Leptospira signal.
Click to reveal: why DNase-treated samples can be confusing for bacterial interpretation
DNase treatment was used to enrich for viral sequencing, not to optimize bacterial species calling. It can reduce free host DNA and shift the library composition, so the relative amount of “microbial-looking” material may increase.
However, the original supplementary material also notes that DNase pretreatment can enrich conserved bacterial 16S rRNA material, which makes accurate species-level bacterial identification harder. Therefore, a DNase-treated library can have more bacterial-looking alignments but less reliable pathogen-level specificity.
For this case, the practical interpretation is:
untreated CSF = the strongest bacterial pathogen signal,
DNase-treated CSF/serum = more prone to non-specific or hard-to-classify signal,
serum/control = useful comparators but not diagnostic for the CNS pathogen in this recreated workflow.
8 Taxonomic result inspection
8.1 Unclassified signal is the central interpretive issue
Before running the next chunk, predict which samples will be most difficult to interpret.
Ask yourself:
Which specimen is most likely to contain low microbial biomass?
Which specimen might be dominated by host/background material?
Which specimen should be most clinically meaningful if this is a CNS infection?
Click to hint
Do not only look for the organism name. First look at how much of each sample remains unresolved. A sample can contain a taxon label but still be too incomplete or unstable for a reliable conclusion.
The unclassified burden is the single most important taxonomic warning sign in this recreated dataset:
SRR1145844 DNase-treated CSF: about 98.1% unweighted and 99.6% weighted unclassified.
SRR1145845 patient serum: about 98.4% unweighted and 99.95% weighted unclassified.
SRR1145846 untreated CSF: about 47.3% unweighted but 92.5% weighted unclassified in the broader species summary.
SRR1145847 negative-control serum: about 50.0% unweighted and 21.0% weighted unclassified, with the remaining classified portion split across multiple low-level taxa.
This means several samples are low-information or unstable at organism level, even before we ask whether any taxon is clinically plausible.
High unclassified signal weakens confidence because much of the sample remains unresolved.
When a specimen is dominated by unclassified sequence, the classified profile can become incomplete, unstable, and overly sensitive to small numbers of assigned hashes.
In a low-biomass clinical specimen, this often means the sample is not robust enough for a trustworthy organism-level conclusion, even if a few taxa are nominally present.
For this case, the untreated CSF is the only specimen that still retains a clinically coherent pathogen candidate after strict filtering. The others should be treated cautiously or regarded as non-informative.
Click to reveal: why unclassified can remain high after host removal
Removing the human genome is necessary, but it does not turn the remaining reads into a clean microbial dataset.
After host depletion, the remaining reads may still include:
low-quality or low-complexity fragments,
short reads that lack taxonomic information,
amplification artifacts,
reagent or environmental background,
host-associated reads that were not removed perfectly,
and real biological sequences without close reference genomes in the database.
This is especially common in low-biomass clinical specimens such as CSF. In the original paper, the untreated CSF had millions of raw reads, but only a small fraction were non-human, and an even smaller fraction carried the diagnostic Leptospira signal. The clinical question is therefore not “Can we classify everything?” but “Does the classified part contain a coherent, specimen-specific pathogen signal?”
Click to reveal: when “no reliable conclusion” is the right answer
A sample may be best interpreted as non-informative when:
unclassified signal dominates,
the classified taxa are scattered across many weak background-like organisms,
the signal does not survive stricter thresholds,
the same type of signal appears in serum or control samples,
or the result does not fit the clinical syndrome or specimen type.
In this recreated workflow, the DNase-treated CSF, patient serum, and negative-control serum do not provide evidence strong enough for a final organism-level conclusion. That is not a failure of learning. It is one of the core lessons of clinical metagenomics.
8.2 Compare strict retained-hit export versus broader exploratory species summaries
For the untreated CSF, both views agree on the key teaching point: Leptospira santarosaiis the only clinically coherent signal. However, the exact numeric fractions differ because these files come from different summary layers of the pipeline.
That matters for interpretation:
The supplied strict export reflects the final, more conservative retained-hit table.
The reconstructed species table is broader and noisier, which is useful for teaching why the other samples did not survive stricter filtering.
We should therefore compare the two views qualitatively, not pretend that they are numerically interchangeable.
The untreated CSF is the only sample where leptospiral signal is large enough to remain central to the story.
The DNase-treated CSF shows only a tiny leptospiral trace in the broader summary and no retained strict YACHT hit.
The patient serum is overwhelmingly unclassified and does not support a bloodstream leptospiral signal in this recreated workflow.
The negative-control serum is not cleanly blank in the broader summary, which argues for caution and supports a conservative interpretation of any low-level non-CSF hits.
This table is useful, but it must be read carefully:
f_weighted_classified describes composition among the classified component only.
In samples that are nearly all unclassified, a taxon can look “dominant” within the classified slice while still being clinically trivial in the original specimen.
That is why the top classified taxa in the DNase-treated CSF, serum, and control serum should not be treated as actionable pathogen calls.
Code
top_species <- raw_species_without %>%group_by(species_label) %>%summarise(max_classified =max(f_weighted_classified, na.rm =TRUE), .groups ="drop") %>%arrange(desc(max_classified), species_label) %>%slice_head(n =10) %>%pull(species_label)heatmap_data <- raw_species_without %>%filter(species_label %in% top_species) %>%select(sample_name, sample_role, species_label, f_weighted_classified) %>%complete(sample_name = metadata$sample_name,species_label = top_species,fill =list(f_weighted_classified =0) ) %>%left_join(metadata %>%select(sample_name, sample_role), by ="sample_name") %>%mutate(sample_label =factor(sample_labels[sample_name], levels = sample_labels[metadata$sample_name]),species_label =factor(species_label, levels =rev(top_species)) )species_heatmap <-ggplot( heatmap_data,aes(x = sample_label, y = species_label, fill =100* f_weighted_classified)) +geom_tile(color ="grey80", linewidth =0.3) +scale_fill_gradientn(colours =c("grey98", "#c7e9c0", "#41ab5d", "#005a32"),labels =label_number(suffix ="%", accuracy =1) ) +labs(title ="Among classified signal, untreated CSF is dominated by Leptospira",subtitle ="These are classified-only percentages and should not be mistaken for whole-sample dominance",x =NULL,y ="Species",fill ="Classified-only\nabundance" ) +theme(axis.text.x =element_text(angle =20, hjust =1))save_fig(species_heatmap, "taxonomy_species_heatmap", width =10, height =6)species_heatmap
Interpretation
This view helps show why the untreated CSF is clinically different from the other specimens.
Once we condition on the classified component, Leptospira santarosai becomes the dominant classified organism in untreated CSF.
The other samples do not show a similarly coherent pathogen pattern. Their classified signal is fragmented across taxa that are either implausible, background-like, or too weak to trust.
9 Sample-level diagnostic interpretation
Code
strict_hits_05 <- yacht_thresholds %>%filter(threshold =="0.05") %>%transmute( sample_name,strict_retained_hit =!is.na(organism_name),strict_hit_label =if_else(is.na(organism_name), "No retained hit", organism_name),strict_confidence = actual_confidence_with_coverage )top_classified <- raw_species_without %>%group_by(sample_name) %>%slice_max(order_by = f_weighted_classified, n =1, with_ties =FALSE) %>%ungroup() %>%transmute( sample_name,top_classified_species = species_label,top_classified_f_weighted_classified = f_weighted_classified )sample_diagnostic_summary <- metadata %>%select(sample_name, sample_role, patient_status, approx_total_reads, clinical_use) %>%left_join( raw_species_with %>%filter(lineage =="unclassified") %>%select(sample_name, unclassified_unweighted = unweighted_fraction, unclassified_weighted = f_weighted_at_rank, total_weighted_hashes),by ="sample_name" ) %>%left_join(strict_hits_05, by ="sample_name") %>%left_join(top_classified, by ="sample_name") %>%mutate(diagnostic_interpretation =case_when( sample_role =="Untreated CSF"~"Most clinically meaningful sample. Retains a strict Leptospira hit, but still has substantial unresolved signal.", unclassified_weighted >=0.95~"Essentially non-informative at organism level. Dominated by unclassified signal.", sample_role =="Negative-control serum"~"Background/control sample with unstable mixed low-level signal. Useful mainly as a cautionary comparator.",TRUE~"No persuasive pathogen-level evidence strong enough for a final conclusion." ) ) %>%arrange(sample_role)kable(sample_diagnostic_summary, digits =3)
sample_name
sample_role
patient_status
approx_total_reads
clinical_use
unclassified_unweighted
unclassified_weighted
total_weighted_hashes
strict_retained_hit
strict_hit_label
strict_confidence
top_classified_species
top_classified_f_weighted_classified
diagnostic_interpretation
SRR1145846
Untreated CSF
Patient
256238
Primary diagnostic CNS specimen
0.473
0.925
469
TRUE
GCF_000313175.2 Leptospira santarosai serovar Shermani str. LT 821
0.988
Leptospira santarosai
0.857
Most clinically meaningful sample. Retains a strict Leptospira hit, but still has substantial unresolved signal.
SRR1145844
DNase-treated CSF
Patient
2612696
Comparator CNS specimen after DNase treatment
0.981
0.996
77552
FALSE
No retained hit
NA
Escherichia coli
0.979
Essentially non-informative at organism level. Dominated by unclassified signal.
SRR1145845
Patient serum
Patient
5627514
Systemic comparator specimen
0.984
1.000
130988
FALSE
No retained hit
NA
Acinetobacter johnsonii
0.538
Essentially non-informative at organism level. Dominated by unclassified signal.
SRR1145847
Negative-control serum
Control
89194
Negative control for background/context
0.500
0.210
186
FALSE
No retained hit
NA
Escherichia coli
0.415
Background/control sample with unstable mixed low-level signal. Useful mainly as a cautionary comparator.
9.1 Practical diagnostic conclusions from the four-sample set
9.1.1 1. Untreated CSF (SRR1145846)
This is the only sample that retains a confident strict YACHT hit at min_coverage = 0.05.
That hit is Leptospira santarosai.
In the broader species summary, this sample still carries a heavy unclassified fraction, so the result should be treated as clinically meaningful but not artificially perfect.
Even so, this is the most clinically meaningful specimen in the recreated workflow and the only sample that supports a plausible actionable diagnosis.
9.1.2 2. DNase-treated CSF (SRR1145844)
This sample is overwhelmingly unclassified.
Only a tiny low-level leptospiral trace is visible in the broader species summary, and it disappears from the strict retained-hit table.
The most appropriate interpretation is no reliable organism-level conclusion from this sample.
This is also consistent with the teaching point that DNase treatment can be helpful for viral enrichment but may reduce useful bacterial signal.
9.1.3 3. Patient serum (SRR1145845)
This sample is also overwhelmingly unclassified and does not yield a coherent leptospiral result.
The low-level classified taxa are not persuasive for a bloodstream explanation.
Therefore, the recreated serum does not provide trustworthy evidence for a final pathogen-level conclusion.
9.1.4 4. Negative-control serum (SRR1145847)
The strict YACHT export retains no hit, which is reassuring.
However, the broader species summary contains a diffuse mixture of low-level taxa rather than a perfectly clean blank.
That does not support a real alternative pathogen. Instead, it shows why low-biomass/control samples can look unstable and why conservative filtering matters.
amr_plot_data <- amr_data %>%mutate(sample_label =factor(sample_labels[sample_name], levels = sample_labels[metadata$sample_name]),aro_label =factor(`ARO Term`, levels =rev(unique(`ARO Term`))) )amr_plot <-ggplot( amr_plot_data,aes(x = sample_label, y = aro_label)) +geom_point(aes(size =`Completely Mapped Reads`, color =`Average Percent Coverage`),alpha =0.9 ) +scale_color_gradientn(colours =c("#d9d9d9", "#fdae6b", "#e6550d", "#a63603")) +scale_size_continuous(range =c(2.5, 10)) +labs(title ="AMR calls are sparse and not central to the diagnostic question",x =NULL,y ="ARO term",size ="Completely\nmapped reads",color ="Average %\ncoverage" ) +theme(axis.text.x =element_text(angle =20, hjust =1))save_fig(amr_plot, "amr_sparse_signal_summary", width =10, height =5.5)amr_plot
AMR interpretation
The AMR output is very sparse.
Only one putative AMR feature is associated with the untreated CSF leptospiral sample, and it is supported by only 2 completely mapped reads.
The higher-read AMR signal appears in the DNase-treated CSF, but that same specimen is taxonomically non-informative for the pathogen question.
Coverage is low for most calls, and none of these results materially changes the diagnostic reasoning in this case.
The most appropriate conclusion is that the AMR output is illustrative rather than clinically informative here.
11 Comparison with the original paper
The original paper reported:
475 single-end CSF reads mapping to Leptospira in untreated CSF,
263 single-end reads in DNase-treated CSF,
0 leptospiral reads in serum,
0 leptospiral reads in the negative control,
and eventual confirmation of Leptospira santarosai by targeted follow-up testing.
Before opening the comparison table, propose two reasons why this recreated output might look different from the published report.
Click to hint
Think about what changed between the original study and this workshop re-analysis: preprocessing, host filtering, reference databases, classification algorithms, and thresholds.
Code
comparison_table <- tibble::tribble(~Specimen, ~Original_paper, ~Recreated_workflow, ~Teaching_interpretation,"Untreated CSF","Clear actionable leptospiral signal; 475 single-end reads and 955 paired-end bacterial reads assigned to Leptospira.","Only specimen with a retained strict YACHT hit. Broad species summary still has heavy unclassified signal, but Leptospira remains the only coherent diagnosis-level candidate.","This is the one specimen that materially narrows the differential.","DNase-treated CSF","Leptospiral reads still present in the paper, though lower than untreated CSF.","No retained strict hit. Exploratory species summary is overwhelmingly unclassified, with only a tiny leptospiral trace.","Best interpreted as non-informative in the recreated workflow.","Patient serum","No leptospiral reads detected.","No retained strict hit. Serum is dominated by unclassified signal and weak background taxa.","Does not add trustworthy organism-level evidence.","Negative-control serum","No leptospiral reads detected and effectively clean for the key pathogen question.","No retained strict hit, but the broader species summary is noisier than expected, with mixed low-level taxa.","Useful as a cautionary control showing why low-level signal should not be over-called.")kable(comparison_table)
Specimen
Original_paper
Recreated_workflow
Teaching_interpretation
Untreated CSF
Clear actionable leptospiral signal; 475 single-end reads and 955 paired-end bacterial reads assigned to Leptospira.
Only specimen with a retained strict YACHT hit. Broad species summary still has heavy unclassified signal, but Leptospira remains the only coherent diagnosis-level candidate.
This is the one specimen that materially narrows the differential.
DNase-treated CSF
Leptospiral reads still present in the paper, though lower than untreated CSF.
No retained strict hit. Exploratory species summary is overwhelmingly unclassified, with only a tiny leptospiral trace.
Best interpreted as non-informative in the recreated workflow.
Patient serum
No leptospiral reads detected.
No retained strict hit. Serum is dominated by unclassified signal and weak background taxa.
Does not add trustworthy organism-level evidence.
Negative-control serum
No leptospiral reads detected and effectively clean for the key pathogen question.
No retained strict hit, but the broader species summary is noisier than expected, with mixed low-level taxa.
Useful as a cautionary control showing why low-level signal should not be over-called.
11.1 Main similarities with the paper
The untreated CSF remains the most important sample.
The recreated workflow still points to Leptospira santarosai in that specimen.
Serum does not provide stronger competing evidence for an alternative pathogen.
The case still teaches the same major clinical lesson: unbiased CSF metagenomics can shorten the path to a treatable diagnosis when serial conventional testing is unrevealing.
11.2 Main differences from the paper
The recreated dataset is much noisier, especially outside the untreated CSF.
The recreated DNase-treated CSF is not cleanly supportive, whereas the paper still recovered leptospiral signal there.
The recreated negative-control serum is not as clean as the paper’s description would lead us to expect.
The supplied strict workshop export collapses the dataset to one retained sample, which is a much more conservative picture than a simplified classroom result card might suggest.
Click to reveal: how to interpret disagreement without calling either analysis wrong
This comparison should not be read as “the paper was right and the re-analysis is wrong” or the reverse.
The original paper used the tools and databases available in 2013-2014 and then confirmed the result with PCR, Sanger sequencing, phylogenetics, and clinical response. The recreated workflow uses modern sketch-based and threshold-sensitive tools that may prioritize specificity and may discard weak or distributed background calls.
Both perspectives can be compatible:
the clinical conclusion of the original paper remains strongly supported,
the modern re-analysis may be stricter about what counts as a retained pathogen-level call,
and samples with heavy unclassified/background signal may reasonably become non-informative.
12 Why the recreated results may differ so much from the original report
Several factors could plausibly explain the mismatch:
Click to hint before reading the list
Separate biological explanations from workflow explanations. In this case, many differences are likely workflow-related rather than evidence that the pathogen biology was different.
Different sequencing depth or retained read volume
The local metadata suggest a much smaller re-analysis input than the original paper described for some specimens, especially untreated CSF and the negative control. Less data means less sensitivity and more instability.
Different preprocessing and host-removal behavior
Modern host depletion and stricter preprocessing can reduce false positives, but they can also leave very sparse non-host signal in low-biomass samples.
Different classification framework
The original paper used SURPI with nt/protein alignment. This notebook uses outputs from Sourmash/YACHT and taxmetagenome summaries, which are sketch-based and threshold-sensitive.
Different databases
The modern database now contains Leptospira santarosai, so direct species-level matching is easier than it was in 2014. At the same time, different databases can redistribute marginal reads into different taxa.
Conservative thresholds
The strict YACHT export uses a min_coverage = 0.05 retained-hit view. That improves specificity, but it also means borderline samples drop out completely.
Low-biomass CSF and comparator samples
Low-biomass clinical metagenomics is especially vulnerable to instability, high host background, stochastic classification, and over-interpretation of tiny signal.
Unclassified-heavy samples are inherently incomplete
When most of a sample is unclassified, the organism-level profile is only a partial summary of what remains after filtering and database matching. That is exactly why some recreated samples should be called non-informative rather than forced into a diagnosis.
The important teaching message is that disagreement with the original paper does not automatically mean the recreated workflow is wrong. It may mean that the recreated workflow is being more conservative with weak or unstable samples.
13 Integrated clinical interpretation
13.1 Moving from broad differential to sequencing-based narrowing
Before sequencing, the clinically plausible differential remains broad:
tuberculous meningitis,
viral meningoencephalitis,
fungal CNS infection,
partially treated bacterial meningitis,
autoimmune/inflammatory disease,
and less common zoonotic infection.
The sequencing results change that differential unevenly:
Most samples do not help enough to justify a pathogen-level conclusion.
The untreated CSF is the exception. It provides the only coherent organism-level signal that survives strict filtering and fits the clinical story.
13.2 Most likely diagnosis
The most likely final diagnosis is neuroleptospirosis caused by pathogenic Leptospira, most consistent with Leptospira santarosai in this recreated workflow.
That conclusion is supported by the combination of:
a difficult CNS infection syndrome with repeatedly unrevealing conventional testing,
the clinical logic of the original case,
the untreated CSF being the best bacterial specimen,
a retained strict YACHT hit to Leptospira santarosai only in untreated CSF,
and the lack of a comparably coherent alternative explanation from the other samples.
13.3 Would these recreated results change management?
Untreated CSF: yes, this result would meaningfully narrow the diagnostic pathway and justify targeted confirmatory discussion.
DNase-treated CSF: no, this specimen does not provide evidence strong enough for a final organism-level conclusion.
Patient serum: no, this specimen does not return trustworthy evidence strong enough for a final conclusion.
Negative control: no, it is informative only as a cautionary background comparator.
14 Why some samples may be non-informative
This case is valuable precisely because not every specimen helps equally.
In the recreated workflow, a sample should be considered non-informative when:
it is dominated by unclassified signal,
the classified component is small and fragmented,
the top taxa are clinically implausible or inconsistent with the specimen type,
the negative control is noisy enough to raise caution about low-level background,
or the strict retained-hit step drops the sample completely.
That is exactly what happens for the DNase-treated CSF, patient serum, and control serum here.
In a workshop setting, this is an important lesson: a metagenomic workflow is useful not only when it finds the answer, but also when it shows that a specimen does not contain sufficiently trustworthy evidence for a conclusion.
15 Limitations
This is a teaching re-analysis, not a full reproduction of the original study.
The supplied strict workshop export retains only one sample, so a broader four-sample view had to be reconstructed from lower-level per-sample summaries.
The dataset contains only four runs, one of which is a negative control.
Heavy unclassified signal makes several sample profiles unstable.
The abundance metrics differ across summary layers, so exact percentages should not be over-harmonized.
The AMR workflow is present but sparse and not clinically central to this case.
Diversity and ordination results are purely descriptive and should not be treated as formal microbiome statistics.
16 Suggested discussion points for learners
Why is the untreated CSF the most useful sample for this case?
Why does a large unclassified fraction weaken confidence in the DNase-treated CSF and serum results?
Why is a noisy negative control important even when it does not show the final pathogen?
Why can a recreated workflow disagree with a published paper without being obviously wrong?
How should clinicians balance an imperfect but coherent metagenomic result against ongoing diagnostic uncertainty in a deteriorating patient?
17 Suggested next steps and follow-up tests
Re-review the exposure history for water and animal-associated leptospiral risk.
Send targeted CSF confirmatory molecular testing for pathogenic Leptospira if available.
Discuss disease-directed therapy in parallel with confirmatory testing if the full clinical picture is strongly compatible.
If clinically appropriate, consider convalescent serology with the clear caveat that immune dysfunction or immunoglobulin replacement can complicate interpretation.
Preserve the untreated CSF as the most important specimen for further confirmatory work.
This final add-on is not needed to reach the clinical diagnosis. It is included for learners who want to see how the same processed table can be adapted into a microbiome-style R workflow, similar to the previous Vietnamese healthy gut microbiome notebook.
Important
These plots are descriptive only. With four runs, heavy unclassified signal, and clinical low-biomass specimens, alpha diversity and ordination should not be treated as formal microbiome evidence.
18.1 Build an exploratory phyloseq object
The supplied nejm.phyloseq/ export is too incomplete for four-sample comparisons, so we build a small exploratory object from the reconstructed classified-only species table.
These alpha diversity values are descriptive only.
In a clinical low-biomass dataset, higher diversity does not automatically mean a more informative specimen. It can just mean “more diffuse background.”
The untreated CSF is actually less diverse than the noisier comparator specimens, which is compatible with a more coherent pathogen-dominated classified component.
The DNase-treated CSF, patient serum, and control serum may look more diverse because their classified reads are split across many weak taxa, not because they are clinically richer samples.
18.3 Beta diversity and ordination
Code
sample_matrix <-t(otu_mat)bray_dist <-vegdist(sample_matrix, method ="bray")ordination <-cmdscale(bray_dist, eig =TRUE, k =2)ord_df <-as.data.frame(ordination$points)colnames(ord_df) <-c("Axis1", "Axis2")ord_df$sample_name <-rownames(ord_df)ord_df <- ord_df %>%left_join(metadata %>%select(sample_name, sample_role), by ="sample_name") %>%mutate(sample_label = sample_labels[sample_name] )ord_plot <-ggplot(ord_df, aes(x = Axis1, y = Axis2, color = sample_role, label = sample_label)) +geom_hline(yintercept =0, linewidth =0.3, color ="grey80") +geom_vline(xintercept =0, linewidth =0.3, color ="grey80") +geom_point(size =3.5) +geom_text(nudge_y =0.02, size =3.2, show.legend =FALSE) +scale_color_manual(values = sample_role_palette) +labs(title ="Bray-Curtis ordination of classified-only species profiles",subtitle ="Useful as a descriptive picture only; not suitable for formal inference",x ="PCoA 1",y ="PCoA 2",color ="Sample role" )save_fig(ord_plot, "taxonomy_bray_curtis_ordination", width =9, height =5.5)ord_plot
Interpretation
The ordination is best read as a visual summary of how different the classified-only profiles are.
It reinforces that the untreated CSF behaves differently from the other samples.
However, with only four runs and heavy unclassified signal, this plot is a teaching visual, not evidence of a stable ecological pattern.
19 Conclusion
This recreated notebook preserves the main clinical-metagenomic lesson of the original NEJM case while being honest about the limits of the local data.
The key conclusion is not that every sample yields a useful answer. It is that only one sample, the untreated CSF, provides credible organism-level evidence strong enough to narrow the case toward neuroleptospirosis, while the remaining recreated samples are dominated by unclassified signal or unstable background and should be treated as non-informative.
That is exactly the kind of interpretation healthcare workers need in practice: not just how to spot a plausible pathogen, but also how to recognize when a specimen does not support a trustworthy conclusion.