Clinical Foundations & AMR Surveillance: AMR Analysis & Summary

AMR burden, filtering choices, and cautious comparison with Pereira-Dias et al.

Author

Hao Chung The, Minh-Quan Ton-Ngoc

Published

May 15, 2026

1 R-based exploratory analysis

The sections below assume that the pipeline outputs above already exist. From this point onward, the notebook shifts from pipeline execution to exploratory interpretation.

Code

library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(ggrepel)
library(phyloseq)
library(vegan)
library(patchwork)
library(scales)
library(data.table)
library(broom)

data_dir <- file.path("..", "materials", "vietnamese-healthy-gut-microbiome")

group_palette <- c(
  "Child 0-23 months" = "#D55E00",
  "Child 24-59 months" = "#0072B2",
  "Adult" = "#009E73"
)

group_shapes <- c(
  "Child 0-23 months" = 16,
  "Child 24-59 months" = 17,
  "Adult" = 15
)

group_labels <- c(
  "Child 0-23 months" = "Child\n(0-23 months)",
  "Child 24-59 months" = "Child\n(24-59 months)",
  "Adult" = "Adult"
)

group_labeller <- labeller(analysis_group = as_labeller(group_labels))

my_color <- c(
  "#4e79a7", "#f28e2b", "#59a14f", "#b6992d",
  "#499894", "#e15759", "#b07aa1", "#9d7660"
)

make_fill_palette <- function(levels_vector, other_label = "Other") {
  core_levels <- setdiff(levels_vector, other_label)
  n_cols <- length(core_levels)
  if (n_cols > length(my_color)) {
    extra_cols <- scales::hue_pal()(n_cols - length(my_color))
    core_cols <- c(my_color, extra_cols)
  } else {
    core_cols <- my_color[seq_len(n_cols)]
  }
  cols <- setNames(core_cols, core_levels)
  if (other_label %in% levels_vector) {
    cols[other_label] <- "grey85"
  }
  cols[levels_vector]
}

paper_amr_class_levels <- c(
  "Aminoglycosides", "β-lactamases", "Fosfomycin", "Fluoroquinolones",
  "Glycopeptides", "MLS", "Chloramphenicol", "Sulfonamides",
  "Tetracyclines", "Trimethoprim", "Other"
)

map_paper_amr_class <- function(drug_class) {
  case_when(
    drug_class %in% c("aminoglycoside antibiotic") ~ "Aminoglycosides",
    drug_class %in% c("cephalosporin", "penicillin beta-lactam", "carbapenem", "monobactam") ~ "β-lactamases",
    drug_class %in% c("phosphonic acid antibiotic") ~ "Fosfomycin",
    drug_class %in% c("fluoroquinolone antibiotic") ~ "Fluoroquinolones",
    drug_class %in% c("glycopeptide antibiotic") ~ "Glycopeptides",
    drug_class %in% c(
      "macrolide antibiotic", "lincosamide antibiotic", "streptogramin antibiotic",
      "streptogramin A antibiotic", "streptogramin B antibiotic"
    ) ~ "MLS",
    drug_class %in% c("phenicol antibiotic") ~ "Chloramphenicol",
    drug_class %in% c("sulfonamide antibiotic") ~ "Sulfonamides",
    drug_class %in% c("tetracycline antibiotic", "glycylcycline") ~ "Tetracyclines",
    drug_class %in% c("diaminopyrimidine antibiotic") ~ "Trimethoprim",
    TRUE ~ "Other"
  )
}

theme_set(
  theme_bw(base_size = 12) +
    theme(
      panel.grid.minor = element_blank(),
      plot.title.position = "plot",
      legend.position = "right"
    )
)

fig_dir <- "figures"
dir.create(fig_dir, showWarnings = FALSE)

save_fig <- function(plot, filename, width = 9, height = 5.5, dpi = 300) {
  ggsave(
    filename = file.path(fig_dir, filename),
    plot = plot,
    width = width,
    height = height,
    dpi = dpi,
    bg = "white"
  )
}

parse_lineage <- function(df) {
  df %>%
    tidyr::separate(
      col = lineage,
      into = c("domain", "phylum", "class", "order", "family", "genus", "species"),
      sep = ";",
      fill = "right",
      extra = "drop",
      remove = FALSE
    )
}

trim_sample_ids <- function(x) trimws(as.character(x))

Code note

This setup chunk loads the R packages used throughout the notebook and defines a few helper objects.
The notebook now uses ggplot2 only for plotting, to keep figure sizing predictable in the rendered output.
For color, the stacked bar plots use a richer custom palette, while group-level plots still combine color and point shape so they remain readable on screen and when printed.
The default ggplot2 theme is theme_bw(), so any plot built directly with ggplot2 follows the workshop style guide.

2 Context from the original paper

Before looking at the 6 local samples, we extract the main analytical context from the published article and the local summary notes.

Paper-level findings that are reasonable to compare against in a workshop subset

The study analyzed 42 healthy Vietnamese participants split into 0-23 months, 2-5 years, and adults.
The paper reported a clear age-related microbiome structure difference, with children under 2 years showing a distinct gut community structure.
Alpha diversity was lower in the youngest children than in older children and adults.
Beta diversity showed broad overlap between older children and adults, while younger children were more distinct and more dispersed.
The healthy gut resistome was described as a substantial AMR reservoir, with tetracycline and MLS resistance genes especially common.
Genes highlighted in the paper included ermB, lnuC, tetM, tetO, tetQ, tetW, and tet-32 as near-ubiquitous signals.
The paper reported a higher AMR burden in the youngest children, and noted clinically important genes such as blaCTX-M and mphA/mphD in younger children.

Code

paper_context <- tibble::tibble(
  theme = c(
    "Age-related microbiome structure",
    "Alpha diversity",
    "Beta diversity",
    "Broad AMR burden",
    "AMR classes emphasized",
    "AMR genes emphasized"
  ),
  published_finding = c(
    "Children under 2 years had a distinct gut microbiome structure relative to older children and adults.",
    "Youngest children had lower Shannon diversity; older children were closer to adults.",
    "Bray-Curtis ordination showed young children dispersed away from the overlapping older-child and adult cloud.",
    "Youngest children carried the greatest number of detected AMR genes.",
    "Tetracycline and MLS resistance were nearly ubiquitous.",
    "ermB, lnuC, tet-32, tetM, tetO, tetQ, tetW were common; blaCTX-M and mphA/mphD were highlighted as clinically important signals."
  )
)

knitr::kable(paper_context, col.names = c("Theme", "Finding extracted from paper / summary"))

Theme	Finding extracted from paper / summary
Age-related microbiome structure	Children under 2 years had a distinct gut microbiome structure relative to older children and adults.
Alpha diversity	Youngest children had lower Shannon diversity; older children were closer to adults.
Beta diversity	Bray-Curtis ordination showed young children dispersed away from the overlapping older-child and adult cloud.
Broad AMR burden	Youngest children carried the greatest number of detected AMR genes.
AMR classes emphasized	Tetracycline and MLS resistance were nearly ubiquitous.
AMR genes emphasized	ermB, lnuC, tet-32, tetM, tetO, tetQ, tetW were common; blaCTX-M and mphA/mphD were highlighted as clinically important signals.

3 Analysis-ready files used below

The R analysis below assumes that the workflow and post-processing steps have already produced an analysis folder containing:

Metadata for the 6 re-analyzed samples.
Taxonomy tables with and without unclassified rows.
A vietnamese_gut.phyloseq folder that contains otu_table.csv, sample_data.csv, and tax_table.csv.
A vietnamese_gut_amr folder with one rgi-bwt gene mapping table per sample.
The original paper PDF and a local summary file used only for context.

Code

base_dir <- "."

paths <- tibble::tibble(
  object = c(
    "metadata",
    "taxonomy_with_unclassified",
    "taxonomy_without_unclassified",
    "phyloseq_export_dir",
    "amr_dir",
    "local_summary_md"
  ),
  file_name = c(
    "vietnamese_gut_sample_information.csv",
    "vietnamese_gut_with_unclassified.csv",
    "vietnamese_gut_without_unclassified.csv",
    "vietnamese_gut.phyloseq",
    "vietnamese_gut_amr",
    "vietnamese_gut_summary.md"
  )
) %>%
  mutate(
  path = file.path(data_dir, file_name),
  exists = file.exists(path) | dir.exists(path)
)

knitr::kable(paths)

object	file_name	path	exists
metadata	vietnamese_gut_sample_information.csv	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_sample_information.csv	TRUE
taxonomy_with_unclassified	vietnamese_gut_with_unclassified.csv	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_with_unclassified.csv	TRUE
taxonomy_without_unclassified	vietnamese_gut_without_unclassified.csv	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_without_unclassified.csv	TRUE
phyloseq_export_dir	vietnamese_gut.phyloseq	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut.phyloseq	TRUE
amr_dir	vietnamese_gut_amr	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_amr	TRUE
local_summary_md	vietnamese_gut_summary.md	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_summary.md	FALSE

4 Metadata inspection

4.1 Load data

Code note

This chunk reads the local sample metadata and creates a simple analysis_group variable that we can use consistently in plots and summaries.
The grouping is intentionally simple because the metadata for this workshop subset are limited.

Code

metadata <- read_csv(file.path(data_dir, "vietnamese_gut_sample_information.csv"), show_col_types = FALSE) %>%
  mutate(
    sample_name = trim_sample_ids(sample_name),
    studycode = trim_sample_ids(studycode),
    status = recode(status, "Children" = "Child", .default = status),
    analysis_group = case_when(
      status == "Adult" ~ "Adult",
      age_month < 24 ~ "Child 0-23 months",
      TRUE ~ "Child 24-59 months"
    ),
    analysis_group = factor(
      analysis_group,
      levels = c("Child 0-23 months", "Child 24-59 months", "Adult")
    )
  ) %>%
  arrange(analysis_group, studycode)

sample_order <- metadata %>%
  arrange(analysis_group, studycode) %>%
  pull(sample_name)

studycode_order <- metadata %>%
  arrange(analysis_group, studycode) %>%
  pull(studycode)

metadata

# A tibble: 6 × 7
  studycode age_month date_of_sampling to_sequence sample_name status
  <chr>         <dbl> <chr>            <chr>       <chr>       <chr> 
1 22EN-C-02        16 18-May-2017      1st only    ERR9904449  Child 
2 22EN-C-03         9 18-May-2017      1st only    ERR9904450  Child 
3 22EN-C-07        59 22-May-2017      1st only    ERR9904458  Child 
4 22EN-A-04        25 18-May-2017      1st only    ERR9904447  Adult 
5 22EN-A-05        59 19-May-2017      1st only    ERR9904452  Adult 
6 22EN-A-17        32 24-May-2017      2nd only    ERR9904480  Adult 
# ℹ 1 more variable: analysis_group <fct>

Metadata notes

status is the most reliable top-level grouping variable in the local file.
For children, age_month is usable as age in months.
For adults, the age_month column appears to contain values such as 25, 32, and 59, which are much more plausible as adult ages in years than months.
For that reason, this notebook uses analysis_group derived from status and child age thresholds, and uses the adult age values only as descriptive metadata.

Code

metadata_summary <- metadata %>%
  count(status, analysis_group, name = "n_samples")

knitr::kable(metadata_summary)

status	analysis_group	n_samples
Adult	Adult	3
Child	Child 0-23 months	2
Child	Child 24-59 months	1

4.2 Harmonize sample IDs across inputs

Code note

This chunk checks whether the same sample IDs are present in metadata, taxonomy tables, the exported phyloseq tables, and the AMR result files.
Harmonizing IDs early helps avoid silent joins to the wrong sample later in the notebook.
We keep sample_name for joins, but use studycode later for plotting because it is more compact on figure axes.

Code

tax_with <- read_csv(file.path(data_dir, "vietnamese_gut_with_unclassified.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

tax_without <- read_csv(file.path(data_dir, "vietnamese_gut_without_unclassified.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

phy_sample_data <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "sample_data.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

amr_files <- list.files(file.path(data_dir, "vietnamese_gut_amr"), pattern = "gene_mapping_data\\.txt$", full.names = TRUE)
amr_file_samples <- sub("\\.gene_mapping_data\\.txt$", "", basename(amr_files))

sample_harmonization <- tibble::tibble(
  source = c("metadata", "taxonomy_with", "taxonomy_without", "phyloseq_export", "amr_files"),
  n_unique_samples = c(
    n_distinct(metadata$sample_name),
    n_distinct(tax_with$sample_name),
    n_distinct(tax_without$sample_name),
    n_distinct(phy_sample_data$sample_name),
    length(unique(amr_file_samples))
  ),
  sample_ids = c(
    paste(sort(unique(metadata$sample_name)), collapse = ", "),
    paste(sort(unique(tax_with$sample_name)), collapse = ", "),
    paste(sort(unique(tax_without$sample_name)), collapse = ", "),
    paste(sort(unique(phy_sample_data$sample_name)), collapse = ", "),
    paste(sort(unique(amr_file_samples)), collapse = ", ")
  )
)

knitr::kable(sample_harmonization)

source	n_unique_samples	sample_ids
metadata	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
taxonomy_with	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
taxonomy_without	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
phyloseq_export	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
amr_files	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480

All four sources contain the same six sample IDs, so harmonization is straightforward.

4.3 Inspect the exported phyloseq components

Code

phy_otu_export <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "otu_table.csv"), show_col_types = FALSE)
phy_tax_export <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "tax_table.csv"), show_col_types = FALSE)

phyloseq_export_status <- tibble::tibble(
  file = c("sample_data.csv", "otu_table.csv", "tax_table.csv"),
  rows = c(nrow(phy_sample_data), nrow(phy_otu_export), nrow(phy_tax_export)),
  columns = c(ncol(phy_sample_data), ncol(phy_otu_export), ncol(phy_tax_export))
)

knitr::kable(phyloseq_export_status)

file	rows	columns
sample_data.csv	6	2
otu_table.csv	736	7
tax_table.csv	736	8

The vietnamese_gut.phyloseq path is not a serialized .rds phyloseq object. It is an export directory containing the component tables needed to reconstruct a phyloseq object.

4.4 Inspect the AMR folder layout

Code

amr_file_inventory <- tibble::tibble(
  file = basename(amr_files),
  sample_name = sub("\\.gene_mapping_data\\.txt$", "", basename(amr_files)),
  size_bytes = file.info(amr_files)$size
)

knitr::kable(amr_file_inventory)

file	sample_name	size_bytes
ERR9904447.gene_mapping_data.txt	ERR9904447	109567
ERR9904449.gene_mapping_data.txt	ERR9904449	100676
ERR9904450.gene_mapping_data.txt	ERR9904450	96742
ERR9904452.gene_mapping_data.txt	ERR9904452	80172
ERR9904458.gene_mapping_data.txt	ERR9904458	103677
ERR9904480.gene_mapping_data.txt	ERR9904480	126389

Code

amr_example <- fread(amr_files[1], sep = "\t", data.table = FALSE)

knitr::kable(head(amr_example, 8))

ARO Term	ARO Accession	Reference Model Type	Reference DB	Alleles with Mapped Reads	Reference Allele(s) Identity to CARD Reference Protein (%)	Resistomes & Variants: Observed in Genome(s)	Resistomes & Variants: Observed in Plasmid(s)	Resistomes & Variants: Observed Pathogen(s)	Completely Mapped Reads	All Mapped Reads	Average Percent Coverage	Average Length Coverage (bp)	Average MAPQ (Completely Mapped Reads)	Number of reads mapping to baits and mapping to complete gene	Number of reads mapping to baits and mapping to complete gene (%)	Mate Pair Linkage (# reads)	Reference Length	AMR Gene Family	Drug Class	Resistance Mechanism
ErmG	3000522	protein homolog model	CARD; Resistomes & Variants	2	98.36 - 100.0	no data	no data	Bacteroides thetaiotaomicron	2	2	16.05	118.00	189.50	N/A	N/A	NA	735; 735	Erm 23S ribosomal RNA methyltransferase	macrolide antibiotic; lincosamide antibiotic; streptogramin antibiotic; streptogramin A antibiotic; streptogramin B antibiotic	antibiotic target alteration
AAC(6’)-Iw	3002567	protein homolog model	CARD	1	100.0	no data	no data	Acinetobacter sp.	1	1	3.85	17.00	23.00	N/A	N/A	NA	441	AAC(6’)	aminoglycoside antibiotic	antibiotic inactivation
tet(W)	3000194	protein homolog model	CARD; Resistomes & Variants	22	95.85 - 100.0	YES	no data	Butyrivibrio fibrisolvens; Bifidobacterium longum; Collinsella aerofaciens; Faecalibacterium prausnitzii; Lactobacillus johnsonii; Eggerthella lenta; Eubacterium maltosivorans; Lactobacillus crispatus; Streptococcus suis; Bifidobacterium breve	1247	1247	32.04	575.82	160.75	N/A	N/A	NA	1920; 1920; 1920; 2082; 1617; 1920; 1920; 1959; 2070; 1920; 1920; 1920; 1920; 1920; 1920; 1920; 1920; 1932; 1920; 1920; 1920; 219	tetracycline-resistant ribosomal protection protein	tetracycline antibiotic	antibiotic target protection
CGB-1	3000841	protein homolog model	CARD	1	100.0	no data	no data	Chryseobacterium gleum	2	2	2.74	20.00	40.00	N/A	N/A	NA	729	CGB beta-lactamase	carbapenem; cephalosporin; penicillin beta-lactam	antibiotic inactivation
vanL	3002910	protein homolog model	CARD	1	100.0	no data	no data	Enterococcus faecalis	1	1	3.05	32.00	56.00	N/A	N/A	NA	1050	glycopeptide resistance gene cluster; Van ligase	glycopeptide antibiotic	antibiotic target alteration
lsaB	3003111	protein homolog model	CARD	1	100.0	no data	no data	Mammaliicoccus sciuri	14	14	2.91	43.00	39.64	N/A	N/A	NA	1479	lsa-type ABC-F protein	lincosamide antibiotic; streptogramin antibiotic; pleuromutilin antibiotic	antibiotic target protection
ErmQ	3000593	protein homolog model	CARD	1	100.0	no data	no data	Clostridium perfringens	4	4	44.44	344.00	160.00	N/A	N/A	NA	774	Erm 23S ribosomal RNA methyltransferase	macrolide antibiotic; lincosamide antibiotic; streptogramin antibiotic; streptogramin A antibiotic; streptogramin B antibiotic	antibiotic target alteration
catP	3002686	protein homolog model	CARD	1	100.0	no data	no data	Clostridium perfringens	30	30	100.00	624.00	156.97	N/A	N/A	NA	624	chloramphenicol acetyltransferase (CAT)	phenicol antibiotic	antibiotic inactivation

The AMR files are one table per sample and include:

ARO term
reference model metadata
mapped read counts
average coverage
AMR gene family
drug class
resistance mechanism

This is sufficient for descriptive gene-level and drug-class summaries.

5 AMR analysis

5.1 Inspect rgi-bwt outputs and harmonize sample IDs

Code note

This chunk reads all per-sample rgi-bwt result tables into one long table.
It also converts the key AMR columns to numeric form and extracts a usable gene length from Reference Length.
We then build two normalized AMR datasets:
1. a version without a coverage filter
2. a stricter version retaining only hits with Average Percent Coverage ≥ 90

Code

amr_raw <- lapply(amr_files, function(path) {
  read_tsv(path, show_col_types = FALSE) %>%
    mutate(
      sample_name = sub("\\.gene_mapping_data\\.txt$", "", basename(path)),
      sample_name = trim_sample_ids(sample_name)
    )
}) %>%
  bind_rows() %>%
  left_join(metadata, by = "sample_name") %>%
  mutate(
    `Completely Mapped Reads` = as.numeric(`Completely Mapped Reads`),
    `All Mapped Reads` = as.numeric(`All Mapped Reads`),
    `Average Percent Coverage` = as.numeric(`Average Percent Coverage`),
    `Average Length Coverage (bp)` = as.numeric(`Average Length Coverage (bp)`),
    reference_length_raw = as.character(`Reference Length`),
    reference_length_bp = readr::parse_number(sub(";.*$", "", reference_length_raw))
  )

amr_base <- amr_raw %>%
  filter(
    !is.na(`Completely Mapped Reads`),
    !is.na(reference_length_bp),
    reference_length_bp > 0,
    `Completely Mapped Reads` > 0
  ) %>%
  mutate(
    rpk = `Completely Mapped Reads` / (reference_length_bp / 1000)
  )

amr_all <- amr_base %>%
  group_by(sample_name) %>%
  mutate(
    retained_amr_reads = sum(`Completely Mapped Reads`, na.rm = TRUE),
    retained_amr_reads_million = retained_amr_reads / 1e6,
    rpkm = rpk / retained_amr_reads_million
  ) %>%
  ungroup() %>%
  mutate(amr_case = "No coverage filter")

amr_filtered <- amr_base %>%
  filter(!is.na(`Average Percent Coverage`), `Average Percent Coverage` >= 90) %>%
  group_by(sample_name) %>%
  mutate(
    retained_amr_reads = sum(`Completely Mapped Reads`, na.rm = TRUE),
    retained_amr_reads_million = retained_amr_reads / 1e6,
    rpkm = rpk / retained_amr_reads_million
  ) %>%
  ungroup() %>%
  mutate(amr_case = "Average Percent Coverage ≥ 90")

amr_cases <- bind_rows(amr_all, amr_filtered) %>%
  mutate(
    amr_case = factor(amr_case, levels = c("No coverage filter", "Average Percent Coverage ≥ 90"))
  )

amr_overview <- bind_rows(
  amr_all %>%
    summarise(
      amr_case = "No coverage filter",
      n_rows = n(),
      n_samples = n_distinct(sample_name),
      n_aro_terms = n_distinct(`ARO Term`)
    ),
  amr_filtered %>%
    summarise(
      amr_case = "Average Percent Coverage ≥ 90",
      n_rows = n(),
      n_samples = n_distinct(sample_name),
      n_aro_terms = n_distinct(`ARO Term`)
    )
) %>%
  ungroup()

knitr::kable(amr_overview)

amr_case	n_rows	n_samples	n_aro_terms
No coverage filter	1859	6	683
Average Percent Coverage ≥ 90	83	6	50

Before any downstream AMR summary, this notebook applies the same normalization logic in two different ways:

compute RPK by normalizing Completely Mapped Reads by Reference Length
compute RPKM by scaling RPK within each sample by the retained per-sample AMR read total
compare results with and without the additional filter Average Percent Coverage ≥ 90

In other words, the number of reads mapped to each gene (RGI Completely Mapped Reads) is normalized by the gene length (RGI Reference Length) to create a Reads per kilobase of AMR gene (RPK) metric. That RPK value is then converted to RPKM using the sample-specific total retained AMR reads.

\[ \mathrm{RPK}_i = \frac{C_i}{L_i / 1000} \]

\[ \mathrm{RPKM}_i = \frac{\mathrm{RPK}_i}{\left(\sum_{j=1}^{n} C_j\right) / 10^6} \]

\(C_i\): Completely Mapped Reads for AMR gene \(i\)
\(L_i\): Reference Length for AMR gene \(i\), in base pairs
\(\sum_{j=1}^{n} C_j\): total Completely Mapped Reads across all retained AMR genes in the same sample
\(n\): number of retained AMR genes in that sample

This makes it possible for workshop participants to compare the effect of the coverage threshold directly, instead of seeing only one filtered view.

5.2 Per-sample gene burden

Code note

This section summarizes both AMR cases at the sample level.
The main idea is simple: count how many AMR hits remain per sample in each case, then compare their total normalized signal using RPKM.

Code

amr_sample_summary <- amr_cases %>%
  group_by(amr_case, sample_name) %>%
  summarise(
    n_retained_aro_terms = n_distinct(`ARO Term`),
    total_completely_mapped_reads = sum(`Completely Mapped Reads`, na.rm = TRUE),
    total_rpk = sum(rpk, na.rm = TRUE),
    total_rpkm = sum(rpkm, na.rm = TRUE),
    median_rpkm = median(rpkm, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  left_join(metadata, by = "sample_name") %>%
  arrange(amr_case, analysis_group, studycode)

knitr::kable(amr_sample_summary, digits = 2)

amr_case	sample_name	n_retained_aro_terms	total_completely_mapped_reads	total_rpk	total_rpkm	median_rpkm	studycode	age_month	date_of_sampling	to_sequence	status	analysis_group
No coverage filter	ERR9904449	303	65713	42891.62	652711.3	162.32	22EN-C-02	16	18-May-2017	1st only	Child	Child 0-23 months
No coverage filter	ERR9904450	285	101696	77348.26	760583.1	733.65	22EN-C-03	9	18-May-2017	1st only	Child	Child 0-23 months
No coverage filter	ERR9904458	318	57902	51150.38	883395.7	237.06	22EN-C-07	59	22-May-2017	1st only	Child	Child 24-59 months
No coverage filter	ERR9904447	330	41747	32638.71	781821.6	260.60	22EN-A-04	25	18-May-2017	1st only	Adult	Adult
No coverage filter	ERR9904452	246	31889	22535.53	706686.5	212.98	22EN-A-05	59	19-May-2017	1st only	Adult	Adult
No coverage filter	ERR9904480	377	52596	43358.38	824366.4	270.13	22EN-A-17	32	24-May-2017	2nd only	Adult	Adult
Average Percent Coverage ≥ 90	ERR9904449	9	3185	3507.82	1101356.6	56453.05	22EN-C-02	16	18-May-2017	1st only	Child	Child 0-23 months
Average Percent Coverage ≥ 90	ERR9904450	19	4624	6158.69	1331895.8	28661.36	22EN-C-03	9	18-May-2017	1st only	Child	Child 0-23 months
Average Percent Coverage ≥ 90	ERR9904458	15	4208	5903.73	1402977.8	33680.14	22EN-C-07	59	22-May-2017	1st only	Child	Child 24-59 months
Average Percent Coverage ≥ 90	ERR9904447	12	5815	5940.86	1021643.9	10479.33	22EN-A-04	25	18-May-2017	1st only	Adult	Adult
Average Percent Coverage ≥ 90	ERR9904452	13	2320	2556.41	1101898.8	51340.01	22EN-A-05	59	19-May-2017	1st only	Adult	Adult
Average Percent Coverage ≥ 90	ERR9904480	15	3307	3309.80	1000846.7	24667.45	22EN-A-17	32	24-May-2017	2nd only	Adult	Adult

Code

amr_jitter <- position_jitter(width = 0.05, height = 0, seed = 42)

amr_gene_count_plot <- ggplot(amr_sample_summary, aes(x = analysis_group, y = n_retained_aro_terms, color = analysis_group)) +
  stat_summary(fun = median, geom = "crossbar", width = 0.5, color = "grey30") +
  geom_point(aes(shape = analysis_group), size = 3, position = amr_jitter) +
  ggrepel::geom_text_repel(
    aes(label = studycode),
    position = amr_jitter,
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.25,
    point.padding = 0.2,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  facet_wrap(~ amr_case, nrow = 1) +
  scale_x_discrete(labels = group_labels) +
  scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  labs(
    title = "AMR hit counts by sample under two filtering strategies",
    x = "Analysis group",
    y = "Number of retained ARO terms",
    shape = "Group"
  ) +
  theme(legend.position = "none")

amr_total_rpkm_plot <- ggplot(amr_sample_summary, aes(x = analysis_group, y = total_rpkm, color = analysis_group)) +
  stat_summary(fun = median, geom = "crossbar", width = 0.5, color = "grey30") +
  geom_point(aes(shape = analysis_group), size = 3, position = amr_jitter) +
  ggrepel::geom_text_repel(
    aes(label = studycode),
    position = amr_jitter,
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.25,
    point.padding = 0.2,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  facet_wrap(~ amr_case, nrow = 1) +
  scale_x_discrete(labels = group_labels) +
  scale_y_continuous(labels = comma_format(), limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  labs(
    title = "Total retained AMR signal after RPKM normalization",
    x = "Analysis group",
    y = "Summed RPKM",
    shape = "Group"
  ) +
  theme(legend.position = "none")

amr_burden_plot <- amr_gene_count_plot / amr_total_rpkm_plot
save_fig(amr_burden_plot, "amr_burden_summary.png", width = 9, height = 9.5)
amr_burden_plot

Take-home note. Comparing the top and bottom panels makes it easier to see that the strict coverage filter reduces both the number of retained hits and the apparent AMR burden.

Interpretation

Without the coverage filter, the AMR summary is broader and tends to preserve more of the signal that directionally resembles the paper.
After applying Average Percent Coverage ≥ 90, the retained AMR set becomes much smaller and more conservative.
The child samples still tend to show higher total AMR signal on average, but the difference becomes weaker after strict filtering, and the age pattern remains exploratory rather than definitive.

5.3 Gene-level summaries

Code

amr_gene_summary <- amr_cases %>%
  group_by(amr_case, `ARO Term`) %>%
  summarise(
    n_samples = n_distinct(sample_name),
    total_rpk = sum(rpk, na.rm = TRUE),
    total_rpkm = sum(rpkm, na.rm = TRUE),
    max_rpkm_in_one_sample = max(rpkm, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(amr_case, desc(n_samples), desc(total_rpkm))

knitr::kable(amr_gene_summary %>% group_by(amr_case) %>% slice_head(n = 20), digits = 2)

amr_case	ARO Term	n_samples	total_rpk	total_rpkm	max_rpkm_in_one_sample
No coverage filter	tet(Q)	6	16052.58	348350.05	144184.02
No coverage filter	ErmB	6	15638.81	308079.29	76867.24
No coverage filter	tet(O)	6	11677.08	228032.19	74950.74
No coverage filter	lnuC	6	13230.53	223310.48	167960.56
No coverage filter	vanT gene in vanG cluster	6	11604.81	212833.21	52746.43
No coverage filter	vanY gene in vanB cluster	6	10553.70	177275.29	44499.57
No coverage filter	tet(W)	6	10096.88	157028.43	52279.10
No coverage filter	Bifidobacterium adolescentis rpoB mutants conferring resistance to rifampicin	6	9735.19	143061.05	43965.06
No coverage filter	ErmF	6	6550.58	125364.49	37229.06
No coverage filter	adeF	6	6128.31	117506.45	27757.48
No coverage filter	tet(32)	6	4889.58	105120.92	28945.09
No coverage filter	ErmX	6	6581.29	94651.85	38070.92
No coverage filter	qacG	6	4910.30	85917.18	36038.63
No coverage filter	Bifidobacterium bifidum ileS conferring resistance to mupirocin	6	5701.25	73419.13	30847.43
No coverage filter	kdpE	6	5101.77	67845.54	31617.17
No coverage filter	vanG	6	3575.18	67426.88	15181.64
No coverage filter	tet(M)	6	3364.58	53357.76	12525.85
No coverage filter	dfrF	6	3277.45	50005.18	15792.76
No coverage filter	Escherichia coli emrE	6	3270.90	45729.89	24381.66
No coverage filter	tet(40)	6	2088.45	43034.56	11120.68
Average Percent Coverage ≥ 90	mel	6	1541.87	409727.75	237252.37
Average Percent Coverage ≥ 90	catP	6	623.40	194780.91	93943.41
Average Percent Coverage ≥ 90	CblA-1	4	1094.28	388729.91	226401.95
Average Percent Coverage ≥ 90	CfxA6	3	7946.79	1883098.37	830493.08
Average Percent Coverage ≥ 90	dfrF	3	2365.66	511392.24	347331.45
Average Percent Coverage ≥ 90	OXA-347	3	975.76	246717.37	149210.74
Average Percent Coverage ≥ 90	LnuP	3	203.59	47828.37	31306.21
Average Percent Coverage ≥ 90	mef(H)	2	2123.14	510194.12	497609.17
Average Percent Coverage ≥ 90	lnuC	2	2030.14	501758.30	300237.43
Average Percent Coverage ≥ 90	lnuB	2	889.78	195006.41	166335.60
Average Percent Coverage ≥ 90	Erm(52)	2	656.99	174428.07	133642.01
Average Percent Coverage ≥ 90	Mef(En2)	2	364.01	103810.41	91119.49
Average Percent Coverage ≥ 90	vanU gene in vanG cluster	2	286.38	96217.24	68803.63
Average Percent Coverage ≥ 90	cfr(B)	2	314.29	80272.06	43198.41
Average Percent Coverage ≥ 90	clcD	2	339.05	66420.14	48478.89
Average Percent Coverage ≥ 90	EreD	2	196.41	63698.60	37939.47
Average Percent Coverage ≥ 90	ErmQ	2	202.84	50669.40	40528.19
Average Percent Coverage ≥ 90	gadW	2	74.07	17023.85	9475.14
Average Percent Coverage ≥ 90	tetB(P)	2	61.77	14216.91	8335.37
Average Percent Coverage ≥ 90	CfxA4	1	1874.74	588615.76	588615.76

The top gene list changes substantially after strict filtering. Several genes that are prominent without a coverage cutoff do not remain after the Coverage ≥ 90 rule, so the workshop interpretation below compares both cases directly.

Code

paper_theme_genes <- c("ErmB", "lnuC", "tet(32)", "tet(M)", "tet(O)", "tet(Q)", "tet(W)", "mphA")
detected_ctxm_variants <- sort(unique(amr_cases$`ARO Term`[grepl("CTX-M", amr_cases$`ARO Term`, ignore.case = TRUE)]))
focus_gene_set <- unique(c(paper_theme_genes, detected_ctxm_variants))

amr_focus_gene_table <- amr_cases %>%
  filter(`ARO Term` %in% focus_gene_set) %>%
  group_by(amr_case, sample_name, `ARO Term`) %>%
  summarise(
    rpk = sum(rpk, na.rm = TRUE),
    rpkm = sum(rpkm, na.rm = TRUE),
    max_coverage = max(`Average Percent Coverage`, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  complete(
    amr_case = levels(amr_cases$amr_case),
    sample_name = sample_order,
    `ARO Term` = focus_gene_set,
    fill = list(rpk = 0, rpkm = 0, max_coverage = NA_real_)
  )

focus_gene_order <- amr_focus_gene_table %>%
  group_by(amr_case, `ARO Term`) %>%
  summarise(total_rpkm = sum(rpkm, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = amr_case, values_from = total_rpkm, values_fill = 0) %>%
  mutate(delta_rpkm = `No coverage filter` - `Average Percent Coverage ≥ 90`) %>%
  arrange(desc(abs(delta_rpkm)), desc(`No coverage filter`), `ARO Term`) %>%
  pull(`ARO Term`)

amr_focus_gene_table <- amr_focus_gene_table %>%
  left_join(metadata %>% select(sample_name, studycode, analysis_group), by = "sample_name") %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    amr_case = factor(amr_case, levels = levels(amr_cases$amr_case)),
    `ARO Term` = factor(`ARO Term`, levels = rev(focus_gene_order))
  )

gene_fill_max <- max(log10(amr_focus_gene_table$rpkm + 1), na.rm = TRUE)
gene_fill_breaks <- pretty(c(0, gene_fill_max), n = 5)

amr_focus_gene_plot <- ggplot(amr_focus_gene_table, aes(x = studycode, y = `ARO Term`, fill = log10(rpkm + 1))) +
  geom_tile(color = "grey70", linewidth = 0.25) +
  facet_grid(
    amr_case ~ analysis_group,
    scales = "free_x",
    space = "free_x",
    labeller = labeller(analysis_group = as_labeller(group_labels))
  ) +
  scale_fill_gradientn(
    colours = c("grey99", "#FEE08B", "#F46D43", "#A50026"),
    limits = c(0, gene_fill_max),
    breaks = gene_fill_breaks
  ) +
  labs(
    title = "Paper-themed AMR genes under both filtering strategies",
    subtitle = "Rows share one color scale; pale bottom-row tiles indicate signals removed by the Coverage ≥ 90 filter",
    x = "Study code",
    y = "ARO term",
    fill = "log10(RPKM + 1)"
  ) +
  theme(
    #axis.text.x = element_text(angle = 0, hjust = 0),
    axis.text.y = element_text(size = 8),
    strip.background = element_rect(fill = "grey92", color = "black"),
    strip.text = element_text(face = "bold"),
    panel.spacing = grid::unit(0.5, "lines")
  )

save_fig(amr_focus_gene_plot, "amr_paper_theme_gene_heatmap.png", width = 10, height = 8)
amr_focus_gene_plot

Take-home note. Because both rows use the same fill scale and the genes are ordered by how much signal is lost after filtering, the coverage effect is much easier to read at a glance.

Code

amr_focus_gene_summary <- amr_cases %>%
  filter(`ARO Term` %in% focus_gene_set) %>%
  group_by(amr_case, `ARO Term`) %>%
  summarise(
    n_samples = n_distinct(sample_name),
    total_rpk = sum(rpk, na.rm = TRUE),
    total_rpkm = sum(rpkm, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(amr_case, desc(n_samples), desc(total_rpkm))

knitr::kable(amr_focus_gene_summary, digits = 2)

amr_case	ARO Term	n_samples	total_rpk	total_rpkm
No coverage filter	tet(Q)	6	16052.58	348350.05
No coverage filter	ErmB	6	15638.81	308079.29
No coverage filter	tet(O)	6	11677.08	228032.19
No coverage filter	lnuC	6	13230.53	223310.48
No coverage filter	tet(W)	6	10096.88	157028.43
No coverage filter	tet(32)	6	4889.58	105120.92
No coverage filter	tet(M)	6	3364.58	53357.76
No coverage filter	mphA	4	27.74	473.48
No coverage filter	CTX-M-134	2	20.55	358.85
No coverage filter	CTX-M-15	1	28.54	683.61
No coverage filter	CTX-M-202	1	25.93	492.93
No coverage filter	CTX-M-78	1	2.28	43.41
No coverage filter	CTX-M-122	1	2.28	39.43
No coverage filter	CTX-M-65	1	2.28	34.74
No coverage filter	CTX-M-254	1	1.14	27.34
No coverage filter	CTX-M-137	1	1.14	21.70
No coverage filter	CTX-M-112	1	1.14	17.37
Average Percent Coverage ≥ 90	lnuC	2	2030.14	501758.30
Average Percent Coverage ≥ 90	CTX-M-15	1	28.54	4907.79

5.4 Drug-class summaries

Code

present_paper_classes <- amr_cases %>%
  select(`Drug Class`) %>%
  separate_rows(`Drug Class`, sep = ";\\s*") %>%
  mutate(paper_class = map_paper_amr_class(`Drug Class`)) %>%
  distinct(paper_class) %>%
  filter(!is.na(paper_class)) %>%
  mutate(paper_class = factor(paper_class, levels = paper_amr_class_levels)) %>%
  arrange(paper_class) %>%
  pull(paper_class)

amr_class_summary <- amr_cases %>%
  select(amr_case, sample_name, studycode, analysis_group, `ARO Term`, `Drug Class`, rpk, rpkm) %>%
  separate_rows(`Drug Class`, sep = ";\\s*") %>%
  mutate(paper_class = map_paper_amr_class(`Drug Class`)) %>%
  group_by(amr_case, sample_name, studycode, analysis_group, paper_class) %>%
  summarise(
    total_rpk = sum(rpk, na.rm = TRUE),
    total_rpkm = sum(rpkm, na.rm = TRUE),
    n_genes = n_distinct(`ARO Term`),
    .groups = "drop"
  ) %>%
  complete(
    amr_case = levels(amr_cases$amr_case),
    sample_name = sample_order,
    paper_class = present_paper_classes,
    fill = list(total_rpk = 0, total_rpkm = 0, n_genes = 0)
  ) %>%
  left_join(metadata %>% select(sample_name, studycode, analysis_group), by = "sample_name") %>%
  mutate(
    studycode = coalesce(studycode.x, studycode.y),
    analysis_group = coalesce(analysis_group.x, analysis_group.y)
  ) %>%
  select(-studycode.x, -studycode.y, -analysis_group.x, -analysis_group.y) %>%
  mutate(amr_case = factor(amr_case, levels = levels(amr_cases$amr_case)))

amr_class_order <- amr_class_summary %>%
  group_by(amr_case, paper_class) %>%
  summarise(total_rpkm = sum(total_rpkm, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = amr_case, values_from = total_rpkm, values_fill = 0) %>%
  mutate(delta_rpkm = `No coverage filter` - `Average Percent Coverage ≥ 90`) %>%
  arrange(desc(abs(delta_rpkm)), desc(`No coverage filter`), paper_class) %>%
  pull(paper_class)

amr_class_summary <- amr_class_summary %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    paper_class = factor(paper_class, levels = rev(amr_class_order))
  )

class_fill_max <- max(log10(amr_class_summary$total_rpkm + 1), na.rm = TRUE)
class_fill_breaks <- pretty(c(0, class_fill_max), n = 5)

amr_class_plot <- ggplot(amr_class_summary, aes(x = studycode, y = paper_class, fill = log10(total_rpkm + 1))) +
  geom_tile(color = "grey70", linewidth = 0.25) +
  facet_grid(
    amr_case ~ analysis_group,
    scales = "free_x",
    space = "free_x",
    labeller = labeller(analysis_group = as_labeller(group_labels))
  ) +
  scale_fill_gradientn(
    colours = c("grey99", "#D1E5F0", "#67A9CF", "#2166AC"),
    limits = c(0, class_fill_max),
    breaks = class_fill_breaks
  ) +
  labs(
    title = "Paper-style AMR classes under both filtering strategies",
    subtitle = "Rows share one color scale, so pale bottom-row tiles highlight class-level signal lost after strict filtering",
    x = "Study code",
    y = "Paper-style AMR class",
    fill = "log10(RPKM + 1)"
  ) +
  theme(
    #axis.text.x = element_text(angle = 45, hjust = 1),
    strip.background = element_rect(fill = "grey92", color = "black"),
    strip.text = element_text(face = "bold"),
    panel.spacing = grid::unit(0.5, "lines")
  )

save_fig(amr_class_plot, "amr_drug_class_heatmap.png", width = 10, height = 7.5)
amr_class_plot

Take-home note. Showing every class-sample combination, including zeros, makes the class-level difference between the no-filter and coverage-filter views much more explicit.

Only the paper-style AMR classes that are actually present in this 6-sample re-analysis are shown here; classes absent from the current data are omitted to avoid implying that they were observed.

Code

amr_class_overall <- amr_class_summary %>%
  group_by(amr_case, paper_class) %>%
  summarise(
    n_samples = n_distinct(sample_name),
    total_rpk = sum(total_rpk),
    total_rpkm = sum(total_rpkm),
    .groups = "drop"
  ) %>%
  arrange(amr_case, desc(total_rpkm))

knitr::kable(amr_class_overall, digits = 2)

amr_case	paper_class	n_samples	total_rpk	total_rpkm
No coverage filter	MLS	6	207552.96	3699918.52
No coverage filter	Tetracyclines	6	79823.82	1439291.97
No coverage filter	Other	6	59507.79	888185.32
No coverage filter	β-lactamases	6	49331.79	821052.43
No coverage filter	Glycopeptides	6	35488.10	637193.72
No coverage filter	Fluoroquinolones	6	24974.18	380010.54
No coverage filter	Aminoglycosides	6	25777.29	367891.01
No coverage filter	Chloramphenicol	6	10360.76	157092.03
No coverage filter	Trimethoprim	6	6656.90	96039.79
No coverage filter	Fosfomycin	6	3060.14	54163.42
No coverage filter	Sulfonamides	6	2499.65	34694.74
Average Percent Coverage ≥ 90	MLS	6	14943.15	3742046.25
Average Percent Coverage ≥ 90	β-lactamases	6	13831.48	3623264.35
Average Percent Coverage ≥ 90	Trimethoprim	6	2627.48	568014.87
Average Percent Coverage ≥ 90	Chloramphenicol	6	1594.43	445992.86
Average Percent Coverage ≥ 90	Other	6	1692.12	409178.97
Average Percent Coverage ≥ 90	Aminoglycosides	6	1031.04	229773.90
Average Percent Coverage ≥ 90	Tetracyclines	6	667.27	218785.09
Average Percent Coverage ≥ 90	Glycopeptides	6	359.73	112079.32
Average Percent Coverage ≥ 90	Fluoroquinolones	6	428.21	111009.16
Average Percent Coverage ≥ 90	Sulfonamides	6	0.00	0.00
Average Percent Coverage ≥ 90	Fosfomycin	6	0.00	0.00

5.5 Clinically highlighted signals

Code

critical_hits <- amr_cases %>%
  filter(grepl("CTX-M|mphA", `ARO Term`, ignore.case = TRUE)) %>%
  select(amr_case, studycode, analysis_group, `ARO Term`, `Completely Mapped Reads`, `Average Percent Coverage`, rpk, rpkm, `Drug Class`) %>%
  arrange(amr_case, `ARO Term`, desc(`Completely Mapped Reads`))

knitr::kable(critical_hits, digits = 2)

amr_case	studycode	analysis_group	ARO Term	Completely Mapped Reads	Average Percent Coverage	rpk	rpkm	Drug Class
No coverage filter	22EN-C-02	Child 0-23 months	CTX-M-112	1	6.51	1.14	17.37	cephalosporin
No coverage filter	22EN-C-07	Child 24-59 months	CTX-M-122	2	21.69	2.28	39.43	cephalosporin
No coverage filter	22EN-C-07	Child 24-59 months	CTX-M-134	16	79.11	18.26	315.44	cephalosporin
No coverage filter	22EN-A-17	Adult	CTX-M-134	2	25.23	2.28	43.41	cephalosporin
No coverage filter	22EN-A-17	Adult	CTX-M-137	1	13.58	1.14	21.70	cephalosporin
No coverage filter	22EN-A-04	Adult	CTX-M-15	25	92.01	28.54	683.61	cephalosporin; penicillin beta-lactam
No coverage filter	22EN-A-17	Adult	CTX-M-202	21	11.36	25.93	492.93	cephalosporin
No coverage filter	22EN-A-04	Adult	CTX-M-254	1	13.58	1.14	27.34	cephalosporin
No coverage filter	22EN-C-02	Child 0-23 months	CTX-M-65	2	21.00	2.28	34.74	cephalosporin
No coverage filter	22EN-A-17	Adult	CTX-M-78	2	27.17	2.28	43.41	cephalosporin
No coverage filter	22EN-C-07	Child 24-59 months	mphA	20	44.41	22.00	379.99	macrolide antibiotic
No coverage filter	22EN-A-17	Adult	mphA	3	5.42	3.39	64.45	macrolide antibiotic
No coverage filter	22EN-C-02	Child 0-23 months	mphA	1	2.55	1.11	16.85	macrolide antibiotic
No coverage filter	22EN-C-03	Child 0-23 months	mphA	1	3.47	1.24	12.18	macrolide antibiotic
Average Percent Coverage ≥ 90	22EN-A-04	Adult	CTX-M-15	25	92.01	28.54	4907.79	cephalosporin; penicillin beta-lactam

AMR interpretation

Without the coverage filter, the AMR profile is broader and closer to the kind of descriptive signal emphasized in the paper.
Once Average Percent Coverage ≥ 90 is imposed, the profile becomes much more conservative and several previously visible genes disappear.
In the paper-style class grouping, MLS and β-lactamase-related signals remain prominent, but the balance between classes changes substantially between the two AMR cases.
Among the highlighted genes, some signals remain only in the no-filter case, while the strict-filter case retains a much smaller subset, including only limited CTX-M signal and no retained mphA.
This side-by-side comparison is useful for teaching how sensitive resistome interpretation can be to filtering choices.

6 Integrated exploratory notes

The next plot is deliberately exploratory. It asks whether samples with lower taxonomic diversity also tend to show a larger AMR signal under the two AMR filtering strategies.

Code

integrated_summary <- alpha_diversity %>%
  select(sample_name, studycode, analysis_group, shannon) %>%
  left_join(amr_sample_summary %>% select(amr_case, sample_name, total_rpkm, n_retained_aro_terms), by = "sample_name")

integrated_plot <- ggplot(integrated_summary, aes(x = shannon, y = total_rpkm, color = analysis_group, label = studycode)) +
  geom_point(aes(shape = analysis_group), size = 3) +
  ggrepel::geom_text_repel(
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.25,
    point.padding = 0.25,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  facet_wrap(~ amr_case, nrow = 1) +
  scale_x_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  scale_y_continuous(labels = comma_format(), limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  labs(
    title = "Exploratory view: AMR RPKM versus Shannon diversity",
    x = "Shannon diversity",
    y = "Summed AMR RPKM",
    color = "Group",
    shape = "Group"
  )

save_fig(integrated_plot, "integrated_amr_vs_alpha.png", width = 9, height = 5)
integrated_plot

This plot suggests that the lower-diversity child samples also have relatively high retained AMR signal in both AMR summaries, but the strength of that pattern depends on the filter used. With only six samples, this should be treated as a visual prompt for a larger follow-up analysis, not as an inferred biological relationship.

Take-home note. The relationship between diversity and AMR signal looks broadly similar in both AMR cases, but the apparent strength of the pattern depends on the filtering rule.

7 Comparison with the original paper

Code

comparison_table <- tibble::tibble(
  theme = c(
    "Age-related microbiome structure",
    "Alpha diversity pattern",
    "Beta diversity pattern",
    "Broad AMR burden",
    "Key AMR classes",
    "Clinically highlighted genes"
  ),
  paper = c(
    "Young children under 2 years were structurally distinct from older children and adults.",
    "Youngest children had lower diversity; older children approached adult values.",
    "Older children and adults overlapped more than younger children.",
    "Youngest children had the highest median AMR gene burden.",
    "Tetracycline and MLS signals were widespread.",
    "blaCTX-M and mphA/mphD were emphasized in younger children."
  ),
  subset_reanalysis = c(
    "The two youngest-child samples are distinct from the adult cluster; the single older-child sample is intermediate in phylum composition but not fully adult-like.",
    "Adults have the highest Shannon values in this subset; all three child samples are lower.",
    "Adults cluster closer to one another; child samples are more separated from the adults than adults are from each other.",
    "Without the coverage filter, children still tend to show higher AMR signal on average; with the ≥ 90 filter that separation becomes weaker and the retained hit counts do not follow a clean age gradient.",
    "Using paper-style class grouping, MLS and β-lactamase-related signals remain visible in both analyses, but the class balance changes substantially after strict filtering.",
    "CTX-M and mphA-like signals are more visible without the coverage filter; only limited CTX-M signal remains after strict filtering, and mphA is not retained."
  ),
  take_home = c(
    "Broad directional concordance, but only partial support because n is very small.",
    "Directionally consistent with the paper.",
    "Directionally consistent for adult clustering; weaker support for an older-child/adult overlap claim.",
    "Suggestive but not a clean reproduction.",
    "Concordance is stronger in the no-filter view and weaker in the strict-filter view.",
    "Agreement for these clinically highlighted genes is highly filter-dependent."
  )
)

knitr::kable(comparison_table)

theme	paper	subset_reanalysis	take_home
Age-related microbiome structure	Young children under 2 years were structurally distinct from older children and adults.	The two youngest-child samples are distinct from the adult cluster; the single older-child sample is intermediate in phylum composition but not fully adult-like.	Broad directional concordance, but only partial support because n is very small.
Alpha diversity pattern	Youngest children had lower diversity; older children approached adult values.	Adults have the highest Shannon values in this subset; all three child samples are lower.	Directionally consistent with the paper.
Beta diversity pattern	Older children and adults overlapped more than younger children.	Adults cluster closer to one another; child samples are more separated from the adults than adults are from each other.	Directionally consistent for adult clustering; weaker support for an older-child/adult overlap claim.
Broad AMR burden	Youngest children had the highest median AMR gene burden.	Without the coverage filter, children still tend to show higher AMR signal on average; with the ≥ 90 filter that separation becomes weaker and the retained hit counts do not follow a clean age gradient.	Suggestive but not a clean reproduction.
Key AMR classes	Tetracycline and MLS signals were widespread.	Using paper-style class grouping, MLS and β-lactamase-related signals remain visible in both analyses, but the class balance changes substantially after strict filtering.	Concordance is stronger in the no-filter view and weaker in the strict-filter view.
Clinically highlighted genes	blaCTX-M and mphA/mphD were emphasized in younger children.	CTX-M and mphA-like signals are more visible without the coverage filter; only limited CTX-M signal remains after strict filtering, and mphA is not retained.	Agreement for these clinically highlighted genes is highly filter-dependent.

7.1 Quick summary in English

Findings that remain broadly similar to the original paper

On the taxonomy side, the adult samples in this 6-sample subset still show higher alpha diversity and tighter clustering than the younger-child samples.
The two children under 24 months still look compositionally distinct from the adult cluster, which remains directionally consistent with the paper’s main microbiome maturation result.
The overall structure of the subset is therefore still compatible with the broad idea that younger gut microbiomes are less mature and more distinct than adult microbiomes.

Findings that differ, weaken, or only partially agree once AMR filtering is tightened

We still cannot fully evaluate the paper’s “older children become more adult-like” pattern because this subset contains only one child aged 24-59 months.
On the beta-diversity side, adults cluster more tightly, but the single older-child sample does not clearly overlap with the adult cluster the way the full paper suggested.
In the AMR analysis, the no-filter view and the strict-filter view do not tell exactly the same story.
Under the stricter workflow, many canonical paper-highlighted genes drop out, while a smaller set of higher-coverage hits remains.
This means that AMR concordance with the paper is stronger descriptively without the coverage filter and weaker but more conservative with the ≥ 90 filter.

Short take-home message

This 6-sample re-analysis still supports the paper at the broad microbiome-structure level.
On the AMR side, agreement becomes stronger in the no-filter descriptive view and weaker but more conservative once the analysis is restricted to high-coverage hits and RPKM-based summaries.
The comparison should therefore be read as a workshop-oriented directional cross-check, not as a full replication of the original paper.

8 Limitations

This notebook should be read with several major limitations in mind:

It is based on only six re-analyzed samples.
The original paper analyzed 42 participants and had more balanced age strata.
The local metadata are sparse, and even the age_month field is ambiguous for adults.
Taxonomic ecology here uses f_weighted_at_rank, not the paper’s original Kraken2/Bracken abundance outputs.
The unclassified fraction is large in every sample, so the “without unclassified” analyses describe only the classified component.
The AMR workflow is not directly equivalent to the paper’s SRST2 + ARGANNOT pipeline.
The notebook compares two AMR views: one without a coverage filter and one with Average Percent Coverage ≥ 90. The stricter view is conservative and may remove biologically real but lower-coverage hits.
The AMR normalization uses Completely Mapped Reads, Reference Length, and a sample-internal RPKM scaling based on retained AMR reads available in the local RGI tables.
Because the local AMR tables do not include total metagenomic read counts, this RPKM calculation is an available-data approximation rather than a perfect reconstruction of the paper’s SRST2-based RPKM framework.
With group sizes of 2, 1, and 3, formal inference is not the main point of the exercise.

9 Suggested next analyses

If this workshop analysis were expanded into a more complete re-analysis, the next priorities would be:

Re-analyze a much larger subset, ideally recovering age groups closer to the original paper design.
Standardize metadata carefully, especially adult age, diet, household structure, and recent antimicrobial exposure.
Re-run taxonomic profiling with a workflow closer to the original study if strict replication is the goal.
Add a more principled AMR comparison layer, for example family-level harmonization between CARD and ARGANNOT outputs.
Investigate whether the strongest CTX-M detections remain after stricter filtering on coverage and breadth.
Explore organism-to-gene linkage methods if the training objective extends beyond simple resistome description.

10 Conclusion

This 6-sample workshop re-analysis captures several of the paper’s broad ideas. The taxonomy results remain directionally compatible with a more adult-like and more diverse microbiome in adults.

At the same time, the subset does not cleanly reproduce every published AMR pattern, and the answer depends strongly on which AMR view is used. The no-filter normalized view preserves more of the broad descriptive signal that resembles the paper, whereas the high-coverage view is more conservative and retains a smaller subset of genes and class-level patterns. Because the sample count is tiny and the AMR detection workflow differs, these differences should be interpreted as a reminder about workflow dependence, filtering sensitivity, and sampling uncertainty, not as evidence that the original paper was wrong.

The main workshop take-home is that this small re-analysis is useful for learning how to structure a metagenomic notebook, inspect taxonomy with and without unclassified signal, summarize AMR at gene and drug-class levels, and compare cautiously back to a published study. It is not sufficient for definitive biological conclusions.

Tip

Continue to Actionable Diagnostics & Clinical Reasoning.