Clinical Foundations & AMR Surveillance: AMR Analysis & Summary

AMR burden, filtering choices, and cautious comparison with Pereira-Dias et al.

Author

Hao Chung The, Minh-Quan Ton-Ngoc

Published

May 15, 2026

1 R-based exploratory analysis

The sections below assume that the pipeline outputs above already exist. From this point onward, the notebook shifts from pipeline execution to exploratory interpretation.

Code
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(ggrepel)
library(phyloseq)
library(vegan)
library(patchwork)
library(scales)
library(data.table)
library(broom)

data_dir <- file.path("..", "materials", "vietnamese-healthy-gut-microbiome")

group_palette <- c(
  "Child 0-23 months" = "#D55E00",
  "Child 24-59 months" = "#0072B2",
  "Adult" = "#009E73"
)

group_shapes <- c(
  "Child 0-23 months" = 16,
  "Child 24-59 months" = 17,
  "Adult" = 15
)

group_labels <- c(
  "Child 0-23 months" = "Child\n(0-23 months)",
  "Child 24-59 months" = "Child\n(24-59 months)",
  "Adult" = "Adult"
)

group_labeller <- labeller(analysis_group = as_labeller(group_labels))

my_color <- c(
  "#4e79a7", "#f28e2b", "#59a14f", "#b6992d",
  "#499894", "#e15759", "#b07aa1", "#9d7660"
)

make_fill_palette <- function(levels_vector, other_label = "Other") {
  core_levels <- setdiff(levels_vector, other_label)
  n_cols <- length(core_levels)
  if (n_cols > length(my_color)) {
    extra_cols <- scales::hue_pal()(n_cols - length(my_color))
    core_cols <- c(my_color, extra_cols)
  } else {
    core_cols <- my_color[seq_len(n_cols)]
  }
  cols <- setNames(core_cols, core_levels)
  if (other_label %in% levels_vector) {
    cols[other_label] <- "grey85"
  }
  cols[levels_vector]
}

paper_amr_class_levels <- c(
  "Aminoglycosides", "β-lactamases", "Fosfomycin", "Fluoroquinolones",
  "Glycopeptides", "MLS", "Chloramphenicol", "Sulfonamides",
  "Tetracyclines", "Trimethoprim", "Other"
)

map_paper_amr_class <- function(drug_class) {
  case_when(
    drug_class %in% c("aminoglycoside antibiotic") ~ "Aminoglycosides",
    drug_class %in% c("cephalosporin", "penicillin beta-lactam", "carbapenem", "monobactam") ~ "β-lactamases",
    drug_class %in% c("phosphonic acid antibiotic") ~ "Fosfomycin",
    drug_class %in% c("fluoroquinolone antibiotic") ~ "Fluoroquinolones",
    drug_class %in% c("glycopeptide antibiotic") ~ "Glycopeptides",
    drug_class %in% c(
      "macrolide antibiotic", "lincosamide antibiotic", "streptogramin antibiotic",
      "streptogramin A antibiotic", "streptogramin B antibiotic"
    ) ~ "MLS",
    drug_class %in% c("phenicol antibiotic") ~ "Chloramphenicol",
    drug_class %in% c("sulfonamide antibiotic") ~ "Sulfonamides",
    drug_class %in% c("tetracycline antibiotic", "glycylcycline") ~ "Tetracyclines",
    drug_class %in% c("diaminopyrimidine antibiotic") ~ "Trimethoprim",
    TRUE ~ "Other"
  )
}

theme_set(
  theme_bw(base_size = 12) +
    theme(
      panel.grid.minor = element_blank(),
      plot.title.position = "plot",
      legend.position = "right"
    )
)

fig_dir <- "figures"
dir.create(fig_dir, showWarnings = FALSE)

save_fig <- function(plot, filename, width = 9, height = 5.5, dpi = 300) {
  ggsave(
    filename = file.path(fig_dir, filename),
    plot = plot,
    width = width,
    height = height,
    dpi = dpi,
    bg = "white"
  )
}

parse_lineage <- function(df) {
  df %>%
    tidyr::separate(
      col = lineage,
      into = c("domain", "phylum", "class", "order", "family", "genus", "species"),
      sep = ";",
      fill = "right",
      extra = "drop",
      remove = FALSE
    )
}

trim_sample_ids <- function(x) trimws(as.character(x))

Code note

  • This setup chunk loads the R packages used throughout the notebook and defines a few helper objects.
  • The notebook now uses ggplot2 only for plotting, to keep figure sizing predictable in the rendered output.
  • For color, the stacked bar plots use a richer custom palette, while group-level plots still combine color and point shape so they remain readable on screen and when printed.
  • The default ggplot2 theme is theme_bw(), so any plot built directly with ggplot2 follows the workshop style guide.

2 Context from the original paper

Before looking at the 6 local samples, we extract the main analytical context from the published article and the local summary notes.

Paper-level findings that are reasonable to compare against in a workshop subset

  • The study analyzed 42 healthy Vietnamese participants split into 0-23 months, 2-5 years, and adults.
  • The paper reported a clear age-related microbiome structure difference, with children under 2 years showing a distinct gut community structure.
  • Alpha diversity was lower in the youngest children than in older children and adults.
  • Beta diversity showed broad overlap between older children and adults, while younger children were more distinct and more dispersed.
  • The healthy gut resistome was described as a substantial AMR reservoir, with tetracycline and MLS resistance genes especially common.
  • Genes highlighted in the paper included ermB, lnuC, tetM, tetO, tetQ, tetW, and tet-32 as near-ubiquitous signals.
  • The paper reported a higher AMR burden in the youngest children, and noted clinically important genes such as blaCTX-M and mphA/mphD in younger children.
Code
paper_context <- tibble::tibble(
  theme = c(
    "Age-related microbiome structure",
    "Alpha diversity",
    "Beta diversity",
    "Broad AMR burden",
    "AMR classes emphasized",
    "AMR genes emphasized"
  ),
  published_finding = c(
    "Children under 2 years had a distinct gut microbiome structure relative to older children and adults.",
    "Youngest children had lower Shannon diversity; older children were closer to adults.",
    "Bray-Curtis ordination showed young children dispersed away from the overlapping older-child and adult cloud.",
    "Youngest children carried the greatest number of detected AMR genes.",
    "Tetracycline and MLS resistance were nearly ubiquitous.",
    "ermB, lnuC, tet-32, tetM, tetO, tetQ, tetW were common; blaCTX-M and mphA/mphD were highlighted as clinically important signals."
  )
)

knitr::kable(paper_context, col.names = c("Theme", "Finding extracted from paper / summary"))
Theme Finding extracted from paper / summary
Age-related microbiome structure Children under 2 years had a distinct gut microbiome structure relative to older children and adults.
Alpha diversity Youngest children had lower Shannon diversity; older children were closer to adults.
Beta diversity Bray-Curtis ordination showed young children dispersed away from the overlapping older-child and adult cloud.
Broad AMR burden Youngest children carried the greatest number of detected AMR genes.
AMR classes emphasized Tetracycline and MLS resistance were nearly ubiquitous.
AMR genes emphasized ermB, lnuC, tet-32, tetM, tetO, tetQ, tetW were common; blaCTX-M and mphA/mphD were highlighted as clinically important signals.

3 Analysis-ready files used below

The R analysis below assumes that the workflow and post-processing steps have already produced an analysis folder containing:

  • Metadata for the 6 re-analyzed samples.
  • Taxonomy tables with and without unclassified rows.
  • A vietnamese_gut.phyloseq folder that contains otu_table.csv, sample_data.csv, and tax_table.csv.
  • A vietnamese_gut_amr folder with one rgi-bwt gene mapping table per sample.
  • The original paper PDF and a local summary file used only for context.
Code
base_dir <- "."

paths <- tibble::tibble(
  object = c(
    "metadata",
    "taxonomy_with_unclassified",
    "taxonomy_without_unclassified",
    "phyloseq_export_dir",
    "amr_dir",
    "local_summary_md"
  ),
  file_name = c(
    "vietnamese_gut_sample_information.csv",
    "vietnamese_gut_with_unclassified.csv",
    "vietnamese_gut_without_unclassified.csv",
    "vietnamese_gut.phyloseq",
    "vietnamese_gut_amr",
    "vietnamese_gut_summary.md"
  )
) %>%
  mutate(
  path = file.path(data_dir, file_name),
  exists = file.exists(path) | dir.exists(path)
)

knitr::kable(paths)
object file_name path exists
metadata vietnamese_gut_sample_information.csv ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_sample_information.csv TRUE
taxonomy_with_unclassified vietnamese_gut_with_unclassified.csv ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_with_unclassified.csv TRUE
taxonomy_without_unclassified vietnamese_gut_without_unclassified.csv ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_without_unclassified.csv TRUE
phyloseq_export_dir vietnamese_gut.phyloseq ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut.phyloseq TRUE
amr_dir vietnamese_gut_amr ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_amr TRUE
local_summary_md vietnamese_gut_summary.md ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_summary.md FALSE

4 Metadata inspection

4.1 Load data

Code note

  • This chunk reads the local sample metadata and creates a simple analysis_group variable that we can use consistently in plots and summaries.
  • The grouping is intentionally simple because the metadata for this workshop subset are limited.
Code
metadata <- read_csv(file.path(data_dir, "vietnamese_gut_sample_information.csv"), show_col_types = FALSE) %>%
  mutate(
    sample_name = trim_sample_ids(sample_name),
    studycode = trim_sample_ids(studycode),
    status = recode(status, "Children" = "Child", .default = status),
    analysis_group = case_when(
      status == "Adult" ~ "Adult",
      age_month < 24 ~ "Child 0-23 months",
      TRUE ~ "Child 24-59 months"
    ),
    analysis_group = factor(
      analysis_group,
      levels = c("Child 0-23 months", "Child 24-59 months", "Adult")
    )
  ) %>%
  arrange(analysis_group, studycode)

sample_order <- metadata %>%
  arrange(analysis_group, studycode) %>%
  pull(sample_name)

studycode_order <- metadata %>%
  arrange(analysis_group, studycode) %>%
  pull(studycode)

metadata
# A tibble: 6 × 7
  studycode age_month date_of_sampling to_sequence sample_name status
  <chr>         <dbl> <chr>            <chr>       <chr>       <chr> 
1 22EN-C-02        16 18-May-2017      1st only    ERR9904449  Child 
2 22EN-C-03         9 18-May-2017      1st only    ERR9904450  Child 
3 22EN-C-07        59 22-May-2017      1st only    ERR9904458  Child 
4 22EN-A-04        25 18-May-2017      1st only    ERR9904447  Adult 
5 22EN-A-05        59 19-May-2017      1st only    ERR9904452  Adult 
6 22EN-A-17        32 24-May-2017      2nd only    ERR9904480  Adult 
# ℹ 1 more variable: analysis_group <fct>

Metadata notes

  • status is the most reliable top-level grouping variable in the local file.
  • For children, age_month is usable as age in months.
  • For adults, the age_month column appears to contain values such as 25, 32, and 59, which are much more plausible as adult ages in years than months.
  • For that reason, this notebook uses analysis_group derived from status and child age thresholds, and uses the adult age values only as descriptive metadata.
Code
metadata_summary <- metadata %>%
  count(status, analysis_group, name = "n_samples")

knitr::kable(metadata_summary)
status analysis_group n_samples
Adult Adult 3
Child Child 0-23 months 2
Child Child 24-59 months 1

4.2 Harmonize sample IDs across inputs

Code note

  • This chunk checks whether the same sample IDs are present in metadata, taxonomy tables, the exported phyloseq tables, and the AMR result files.
  • Harmonizing IDs early helps avoid silent joins to the wrong sample later in the notebook.
  • We keep sample_name for joins, but use studycode later for plotting because it is more compact on figure axes.
Code
tax_with <- read_csv(file.path(data_dir, "vietnamese_gut_with_unclassified.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

tax_without <- read_csv(file.path(data_dir, "vietnamese_gut_without_unclassified.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

phy_sample_data <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "sample_data.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

amr_files <- list.files(file.path(data_dir, "vietnamese_gut_amr"), pattern = "gene_mapping_data\\.txt$", full.names = TRUE)
amr_file_samples <- sub("\\.gene_mapping_data\\.txt$", "", basename(amr_files))

sample_harmonization <- tibble::tibble(
  source = c("metadata", "taxonomy_with", "taxonomy_without", "phyloseq_export", "amr_files"),
  n_unique_samples = c(
    n_distinct(metadata$sample_name),
    n_distinct(tax_with$sample_name),
    n_distinct(tax_without$sample_name),
    n_distinct(phy_sample_data$sample_name),
    length(unique(amr_file_samples))
  ),
  sample_ids = c(
    paste(sort(unique(metadata$sample_name)), collapse = ", "),
    paste(sort(unique(tax_with$sample_name)), collapse = ", "),
    paste(sort(unique(tax_without$sample_name)), collapse = ", "),
    paste(sort(unique(phy_sample_data$sample_name)), collapse = ", "),
    paste(sort(unique(amr_file_samples)), collapse = ", ")
  )
)

knitr::kable(sample_harmonization)
source n_unique_samples sample_ids
metadata 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
taxonomy_with 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
taxonomy_without 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
phyloseq_export 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
amr_files 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480

All four sources contain the same six sample IDs, so harmonization is straightforward.

4.3 Inspect the exported phyloseq components

Code
phy_otu_export <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "otu_table.csv"), show_col_types = FALSE)
phy_tax_export <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "tax_table.csv"), show_col_types = FALSE)

phyloseq_export_status <- tibble::tibble(
  file = c("sample_data.csv", "otu_table.csv", "tax_table.csv"),
  rows = c(nrow(phy_sample_data), nrow(phy_otu_export), nrow(phy_tax_export)),
  columns = c(ncol(phy_sample_data), ncol(phy_otu_export), ncol(phy_tax_export))
)

knitr::kable(phyloseq_export_status)
file rows columns
sample_data.csv 6 2
otu_table.csv 736 7
tax_table.csv 736 8

The vietnamese_gut.phyloseq path is not a serialized .rds phyloseq object. It is an export directory containing the component tables needed to reconstruct a phyloseq object.

4.4 Inspect the AMR folder layout

Code
amr_file_inventory <- tibble::tibble(
  file = basename(amr_files),
  sample_name = sub("\\.gene_mapping_data\\.txt$", "", basename(amr_files)),
  size_bytes = file.info(amr_files)$size
)

knitr::kable(amr_file_inventory)
file sample_name size_bytes
ERR9904447.gene_mapping_data.txt ERR9904447 109567
ERR9904449.gene_mapping_data.txt ERR9904449 100676
ERR9904450.gene_mapping_data.txt ERR9904450 96742
ERR9904452.gene_mapping_data.txt ERR9904452 80172
ERR9904458.gene_mapping_data.txt ERR9904458 103677
ERR9904480.gene_mapping_data.txt ERR9904480 126389
Code
amr_example <- fread(amr_files[1], sep = "\t", data.table = FALSE)

knitr::kable(head(amr_example, 8))
ARO Term ARO Accession Reference Model Type Reference DB Alleles with Mapped Reads Reference Allele(s) Identity to CARD Reference Protein (%) Resistomes & Variants: Observed in Genome(s) Resistomes & Variants: Observed in Plasmid(s) Resistomes & Variants: Observed Pathogen(s) Completely Mapped Reads Mapped Reads with Flanking Sequence All Mapped Reads Average Percent Coverage Average Length Coverage (bp) Average MAPQ (Completely Mapped Reads) Number of Mapped Baits Number of Mapped Baits with Reads Average Number of reads per Bait Number of reads per Bait Coefficient of Variation (%) Number of reads mapping to baits and mapping to complete gene Number of reads mapping to baits and mapping to complete gene (%) Mate Pair Linkage (# reads) Reference Length AMR Gene Family Drug Class Resistance Mechanism
ErmG 3000522 protein homolog model CARD; Resistomes & Variants 2 98.36 - 100.0 no data no data Bacteroides thetaiotaomicron 2 0 2 16.05 118.00 189.50 0 0 0 0 N/A N/A NA 735; 735 Erm 23S ribosomal RNA methyltransferase macrolide antibiotic; lincosamide antibiotic; streptogramin antibiotic; streptogramin A antibiotic; streptogramin B antibiotic antibiotic target alteration
AAC(6’)-Iw 3002567 protein homolog model CARD 1 100.0 no data no data Acinetobacter sp. 1 0 1 3.85 17.00 23.00 0 0 0 0 N/A N/A NA 441 AAC(6’) aminoglycoside antibiotic antibiotic inactivation
tet(W) 3000194 protein homolog model CARD; Resistomes & Variants 22 95.85 - 100.0 YES no data Butyrivibrio fibrisolvens; Bifidobacterium longum; Collinsella aerofaciens; Faecalibacterium prausnitzii; Lactobacillus johnsonii; Eggerthella lenta; Eubacterium maltosivorans; Lactobacillus crispatus; Streptococcus suis; Bifidobacterium breve 1247 0 1247 32.04 575.82 160.75 0 0 0 0 N/A N/A NA 1920; 1920; 1920; 2082; 1617; 1920; 1920; 1959; 2070; 1920; 1920; 1920; 1920; 1920; 1920; 1920; 1920; 1932; 1920; 1920; 1920; 219 tetracycline-resistant ribosomal protection protein tetracycline antibiotic antibiotic target protection
CGB-1 3000841 protein homolog model CARD 1 100.0 no data no data Chryseobacterium gleum 2 0 2 2.74 20.00 40.00 0 0 0 0 N/A N/A NA 729 CGB beta-lactamase carbapenem; cephalosporin; penicillin beta-lactam antibiotic inactivation
vanL 3002910 protein homolog model CARD 1 100.0 no data no data Enterococcus faecalis 1 0 1 3.05 32.00 56.00 0 0 0 0 N/A N/A NA 1050 glycopeptide resistance gene cluster; Van ligase glycopeptide antibiotic antibiotic target alteration
lsaB 3003111 protein homolog model CARD 1 100.0 no data no data Mammaliicoccus sciuri 14 0 14 2.91 43.00 39.64 0 0 0 0 N/A N/A NA 1479 lsa-type ABC-F protein lincosamide antibiotic; streptogramin antibiotic; pleuromutilin antibiotic antibiotic target protection
ErmQ 3000593 protein homolog model CARD 1 100.0 no data no data Clostridium perfringens 4 0 4 44.44 344.00 160.00 0 0 0 0 N/A N/A NA 774 Erm 23S ribosomal RNA methyltransferase macrolide antibiotic; lincosamide antibiotic; streptogramin antibiotic; streptogramin A antibiotic; streptogramin B antibiotic antibiotic target alteration
catP 3002686 protein homolog model CARD 1 100.0 no data no data Clostridium perfringens 30 0 30 100.00 624.00 156.97 0 0 0 0 N/A N/A NA 624 chloramphenicol acetyltransferase (CAT) phenicol antibiotic antibiotic inactivation

The AMR files are one table per sample and include:

  • ARO term
  • reference model metadata
  • mapped read counts
  • average coverage
  • AMR gene family
  • drug class
  • resistance mechanism

This is sufficient for descriptive gene-level and drug-class summaries.

5 AMR analysis

5.1 Inspect rgi-bwt outputs and harmonize sample IDs

Code note

  • This chunk reads all per-sample rgi-bwt result tables into one long table.
  • It also converts the key AMR columns to numeric form and extracts a usable gene length from Reference Length.
  • We then build two normalized AMR datasets:
    1. a version without a coverage filter
    2. a stricter version retaining only hits with Average Percent Coverage ≥ 90
Code
amr_raw <- lapply(amr_files, function(path) {
  read_tsv(path, show_col_types = FALSE) %>%
    mutate(
      sample_name = sub("\\.gene_mapping_data\\.txt$", "", basename(path)),
      sample_name = trim_sample_ids(sample_name)
    )
}) %>%
  bind_rows() %>%
  left_join(metadata, by = "sample_name") %>%
  mutate(
    `Completely Mapped Reads` = as.numeric(`Completely Mapped Reads`),
    `All Mapped Reads` = as.numeric(`All Mapped Reads`),
    `Average Percent Coverage` = as.numeric(`Average Percent Coverage`),
    `Average Length Coverage (bp)` = as.numeric(`Average Length Coverage (bp)`),
    reference_length_raw = as.character(`Reference Length`),
    reference_length_bp = readr::parse_number(sub(";.*$", "", reference_length_raw))
  )

amr_base <- amr_raw %>%
  filter(
    !is.na(`Completely Mapped Reads`),
    !is.na(reference_length_bp),
    reference_length_bp > 0,
    `Completely Mapped Reads` > 0
  ) %>%
  mutate(
    rpk = `Completely Mapped Reads` / (reference_length_bp / 1000)
  )

amr_all <- amr_base %>%
  group_by(sample_name) %>%
  mutate(
    retained_amr_reads = sum(`Completely Mapped Reads`, na.rm = TRUE),
    retained_amr_reads_million = retained_amr_reads / 1e6,
    rpkm = rpk / retained_amr_reads_million
  ) %>%
  ungroup() %>%
  mutate(amr_case = "No coverage filter")

amr_filtered <- amr_base %>%
  filter(!is.na(`Average Percent Coverage`), `Average Percent Coverage` >= 90) %>%
  group_by(sample_name) %>%
  mutate(
    retained_amr_reads = sum(`Completely Mapped Reads`, na.rm = TRUE),
    retained_amr_reads_million = retained_amr_reads / 1e6,
    rpkm = rpk / retained_amr_reads_million
  ) %>%
  ungroup() %>%
  mutate(amr_case = "Average Percent Coverage ≥ 90")

amr_cases <- bind_rows(amr_all, amr_filtered) %>%
  mutate(
    amr_case = factor(amr_case, levels = c("No coverage filter", "Average Percent Coverage ≥ 90"))
  )

amr_overview <- bind_rows(
  amr_all %>%
    summarise(
      amr_case = "No coverage filter",
      n_rows = n(),
      n_samples = n_distinct(sample_name),
      n_aro_terms = n_distinct(`ARO Term`)
    ),
  amr_filtered %>%
    summarise(
      amr_case = "Average Percent Coverage ≥ 90",
      n_rows = n(),
      n_samples = n_distinct(sample_name),
      n_aro_terms = n_distinct(`ARO Term`)
    )
) %>%
  ungroup()

knitr::kable(amr_overview)
amr_case n_rows n_samples n_aro_terms
No coverage filter 1859 6 683
Average Percent Coverage ≥ 90 83 6 50

Before any downstream AMR summary, this notebook applies the same normalization logic in two different ways:

  • compute RPK by normalizing Completely Mapped Reads by Reference Length
  • compute RPKM by scaling RPK within each sample by the retained per-sample AMR read total
  • compare results with and without the additional filter Average Percent Coverage ≥ 90

In other words, the number of reads mapped to each gene (RGI Completely Mapped Reads) is normalized by the gene length (RGI Reference Length) to create a Reads per kilobase of AMR gene (RPK) metric. That RPK value is then converted to RPKM using the sample-specific total retained AMR reads.

\[ \mathrm{RPK}_i = \frac{C_i}{L_i / 1000} \]

\[ \mathrm{RPKM}_i = \frac{\mathrm{RPK}_i}{\left(\sum_{j=1}^{n} C_j\right) / 10^6} \]

  • \(C_i\): Completely Mapped Reads for AMR gene \(i\)
  • \(L_i\): Reference Length for AMR gene \(i\), in base pairs
  • \(\sum_{j=1}^{n} C_j\): total Completely Mapped Reads across all retained AMR genes in the same sample
  • \(n\): number of retained AMR genes in that sample

This makes it possible for workshop participants to compare the effect of the coverage threshold directly, instead of seeing only one filtered view.

5.2 Per-sample gene burden

Code note

  • This section summarizes both AMR cases at the sample level.
  • The main idea is simple: count how many AMR hits remain per sample in each case, then compare their total normalized signal using RPKM.
Code
amr_sample_summary <- amr_cases %>%
  group_by(amr_case, sample_name) %>%
  summarise(
    n_retained_aro_terms = n_distinct(`ARO Term`),
    total_completely_mapped_reads = sum(`Completely Mapped Reads`, na.rm = TRUE),
    total_rpk = sum(rpk, na.rm = TRUE),
    total_rpkm = sum(rpkm, na.rm = TRUE),
    median_rpkm = median(rpkm, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  left_join(metadata, by = "sample_name") %>%
  arrange(amr_case, analysis_group, studycode)

knitr::kable(amr_sample_summary, digits = 2)
amr_case sample_name n_retained_aro_terms total_completely_mapped_reads total_rpk total_rpkm median_rpkm studycode age_month date_of_sampling to_sequence status analysis_group
No coverage filter ERR9904449 303 65713 42891.62 652711.3 162.32 22EN-C-02 16 18-May-2017 1st only Child Child 0-23 months
No coverage filter ERR9904450 285 101696 77348.26 760583.1 733.65 22EN-C-03 9 18-May-2017 1st only Child Child 0-23 months
No coverage filter ERR9904458 318 57902 51150.38 883395.7 237.06 22EN-C-07 59 22-May-2017 1st only Child Child 24-59 months
No coverage filter ERR9904447 330 41747 32638.71 781821.6 260.60 22EN-A-04 25 18-May-2017 1st only Adult Adult
No coverage filter ERR9904452 246 31889 22535.53 706686.5 212.98 22EN-A-05 59 19-May-2017 1st only Adult Adult
No coverage filter ERR9904480 377 52596 43358.38 824366.4 270.13 22EN-A-17 32 24-May-2017 2nd only Adult Adult
Average Percent Coverage ≥ 90 ERR9904449 9 3185 3507.82 1101356.6 56453.05 22EN-C-02 16 18-May-2017 1st only Child Child 0-23 months
Average Percent Coverage ≥ 90 ERR9904450 19 4624 6158.69 1331895.8 28661.36 22EN-C-03 9 18-May-2017 1st only Child Child 0-23 months
Average Percent Coverage ≥ 90 ERR9904458 15 4208 5903.73 1402977.8 33680.14 22EN-C-07 59 22-May-2017 1st only Child Child 24-59 months
Average Percent Coverage ≥ 90 ERR9904447 12 5815 5940.86 1021643.9 10479.33 22EN-A-04 25 18-May-2017 1st only Adult Adult
Average Percent Coverage ≥ 90 ERR9904452 13 2320 2556.41 1101898.8 51340.01 22EN-A-05 59 19-May-2017 1st only Adult Adult
Average Percent Coverage ≥ 90 ERR9904480 15 3307 3309.80 1000846.7 24667.45 22EN-A-17 32 24-May-2017 2nd only Adult Adult
Code
amr_jitter <- position_jitter(width = 0.05, height = 0, seed = 42)

amr_gene_count_plot <- ggplot(amr_sample_summary, aes(x = analysis_group, y = n_retained_aro_terms, color = analysis_group)) +
  stat_summary(fun = median, geom = "crossbar", width = 0.5, color = "grey30") +
  geom_point(aes(shape = analysis_group), size = 3, position = amr_jitter) +
  ggrepel::geom_text_repel(
    aes(label = studycode),
    position = amr_jitter,
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.25,
    point.padding = 0.2,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  facet_wrap(~ amr_case, nrow = 1) +
  scale_x_discrete(labels = group_labels) +
  scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  labs(
    title = "AMR hit counts by sample under two filtering strategies",
    x = "Analysis group",
    y = "Number of retained ARO terms",
    shape = "Group"
  ) +
  theme(legend.position = "none")

amr_total_rpkm_plot <- ggplot(amr_sample_summary, aes(x = analysis_group, y = total_rpkm, color = analysis_group)) +
  stat_summary(fun = median, geom = "crossbar", width = 0.5, color = "grey30") +
  geom_point(aes(shape = analysis_group), size = 3, position = amr_jitter) +
  ggrepel::geom_text_repel(
    aes(label = studycode),
    position = amr_jitter,
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.25,
    point.padding = 0.2,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  facet_wrap(~ amr_case, nrow = 1) +
  scale_x_discrete(labels = group_labels) +
  scale_y_continuous(labels = comma_format(), limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  labs(
    title = "Total retained AMR signal after RPKM normalization",
    x = "Analysis group",
    y = "Summed RPKM",
    shape = "Group"
  ) +
  theme(legend.position = "none")

amr_burden_plot <- amr_gene_count_plot / amr_total_rpkm_plot
save_fig(amr_burden_plot, "amr_burden_summary.png", width = 9, height = 9.5)
amr_burden_plot

Take-home note. Comparing the top and bottom panels makes it easier to see that the strict coverage filter reduces both the number of retained hits and the apparent AMR burden.

Interpretation

  • Without the coverage filter, the AMR summary is broader and tends to preserve more of the signal that directionally resembles the paper.
  • After applying Average Percent Coverage ≥ 90, the retained AMR set becomes much smaller and more conservative.
  • The child samples still tend to show higher total AMR signal on average, but the difference becomes weaker after strict filtering, and the age pattern remains exploratory rather than definitive.

5.3 Gene-level summaries

Code
amr_gene_summary <- amr_cases %>%
  group_by(amr_case, `ARO Term`) %>%
  summarise(
    n_samples = n_distinct(sample_name),
    total_rpk = sum(rpk, na.rm = TRUE),
    total_rpkm = sum(rpkm, na.rm = TRUE),
    max_rpkm_in_one_sample = max(rpkm, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(amr_case, desc(n_samples), desc(total_rpkm))

knitr::kable(amr_gene_summary %>% group_by(amr_case) %>% slice_head(n = 20), digits = 2)
amr_case ARO Term n_samples total_rpk total_rpkm max_rpkm_in_one_sample
No coverage filter tet(Q) 6 16052.58 348350.05 144184.02
No coverage filter ErmB 6 15638.81 308079.29 76867.24
No coverage filter tet(O) 6 11677.08 228032.19 74950.74
No coverage filter lnuC 6 13230.53 223310.48 167960.56
No coverage filter vanT gene in vanG cluster 6 11604.81 212833.21 52746.43
No coverage filter vanY gene in vanB cluster 6 10553.70 177275.29 44499.57
No coverage filter tet(W) 6 10096.88 157028.43 52279.10
No coverage filter Bifidobacterium adolescentis rpoB mutants conferring resistance to rifampicin 6 9735.19 143061.05 43965.06
No coverage filter ErmF 6 6550.58 125364.49 37229.06
No coverage filter adeF 6 6128.31 117506.45 27757.48
No coverage filter tet(32) 6 4889.58 105120.92 28945.09
No coverage filter ErmX 6 6581.29 94651.85 38070.92
No coverage filter qacG 6 4910.30 85917.18 36038.63
No coverage filter Bifidobacterium bifidum ileS conferring resistance to mupirocin 6 5701.25 73419.13 30847.43
No coverage filter kdpE 6 5101.77 67845.54 31617.17
No coverage filter vanG 6 3575.18 67426.88 15181.64
No coverage filter tet(M) 6 3364.58 53357.76 12525.85
No coverage filter dfrF 6 3277.45 50005.18 15792.76
No coverage filter Escherichia coli emrE 6 3270.90 45729.89 24381.66
No coverage filter tet(40) 6 2088.45 43034.56 11120.68
Average Percent Coverage ≥ 90 mel 6 1541.87 409727.75 237252.37
Average Percent Coverage ≥ 90 catP 6 623.40 194780.91 93943.41
Average Percent Coverage ≥ 90 CblA-1 4 1094.28 388729.91 226401.95
Average Percent Coverage ≥ 90 CfxA6 3 7946.79 1883098.37 830493.08
Average Percent Coverage ≥ 90 dfrF 3 2365.66 511392.24 347331.45
Average Percent Coverage ≥ 90 OXA-347 3 975.76 246717.37 149210.74
Average Percent Coverage ≥ 90 LnuP 3 203.59 47828.37 31306.21
Average Percent Coverage ≥ 90 mef(H) 2 2123.14 510194.12 497609.17
Average Percent Coverage ≥ 90 lnuC 2 2030.14 501758.30 300237.43
Average Percent Coverage ≥ 90 lnuB 2 889.78 195006.41 166335.60
Average Percent Coverage ≥ 90 Erm(52) 2 656.99 174428.07 133642.01
Average Percent Coverage ≥ 90 Mef(En2) 2 364.01 103810.41 91119.49
Average Percent Coverage ≥ 90 vanU gene in vanG cluster 2 286.38 96217.24 68803.63
Average Percent Coverage ≥ 90 cfr(B) 2 314.29 80272.06 43198.41
Average Percent Coverage ≥ 90 clcD 2 339.05 66420.14 48478.89
Average Percent Coverage ≥ 90 EreD 2 196.41 63698.60 37939.47
Average Percent Coverage ≥ 90 ErmQ 2 202.84 50669.40 40528.19
Average Percent Coverage ≥ 90 gadW 2 74.07 17023.85 9475.14
Average Percent Coverage ≥ 90 tetB(P) 2 61.77 14216.91 8335.37
Average Percent Coverage ≥ 90 CfxA4 1 1874.74 588615.76 588615.76

The top gene list changes substantially after strict filtering. Several genes that are prominent without a coverage cutoff do not remain after the Coverage ≥ 90 rule, so the workshop interpretation below compares both cases directly.

Code
paper_theme_genes <- c("ErmB", "lnuC", "tet(32)", "tet(M)", "tet(O)", "tet(Q)", "tet(W)", "mphA")
detected_ctxm_variants <- sort(unique(amr_cases$`ARO Term`[grepl("CTX-M", amr_cases$`ARO Term`, ignore.case = TRUE)]))
focus_gene_set <- unique(c(paper_theme_genes, detected_ctxm_variants))

amr_focus_gene_table <- amr_cases %>%
  filter(`ARO Term` %in% focus_gene_set) %>%
  group_by(amr_case, sample_name, `ARO Term`) %>%
  summarise(
    rpk = sum(rpk, na.rm = TRUE),
    rpkm = sum(rpkm, na.rm = TRUE),
    max_coverage = max(`Average Percent Coverage`, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  complete(
    amr_case = levels(amr_cases$amr_case),
    sample_name = sample_order,
    `ARO Term` = focus_gene_set,
    fill = list(rpk = 0, rpkm = 0, max_coverage = NA_real_)
  )

focus_gene_order <- amr_focus_gene_table %>%
  group_by(amr_case, `ARO Term`) %>%
  summarise(total_rpkm = sum(rpkm, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = amr_case, values_from = total_rpkm, values_fill = 0) %>%
  mutate(delta_rpkm = `No coverage filter` - `Average Percent Coverage ≥ 90`) %>%
  arrange(desc(abs(delta_rpkm)), desc(`No coverage filter`), `ARO Term`) %>%
  pull(`ARO Term`)

amr_focus_gene_table <- amr_focus_gene_table %>%
  left_join(metadata %>% select(sample_name, studycode, analysis_group), by = "sample_name") %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    amr_case = factor(amr_case, levels = levels(amr_cases$amr_case)),
    `ARO Term` = factor(`ARO Term`, levels = rev(focus_gene_order))
  )

gene_fill_max <- max(log10(amr_focus_gene_table$rpkm + 1), na.rm = TRUE)
gene_fill_breaks <- pretty(c(0, gene_fill_max), n = 5)

amr_focus_gene_plot <- ggplot(amr_focus_gene_table, aes(x = studycode, y = `ARO Term`, fill = log10(rpkm + 1))) +
  geom_tile(color = "grey70", linewidth = 0.25) +
  facet_grid(
    amr_case ~ analysis_group,
    scales = "free_x",
    space = "free_x",
    labeller = labeller(analysis_group = as_labeller(group_labels))
  ) +
  scale_fill_gradientn(
    colours = c("grey99", "#FEE08B", "#F46D43", "#A50026"),
    limits = c(0, gene_fill_max),
    breaks = gene_fill_breaks
  ) +
  labs(
    title = "Paper-themed AMR genes under both filtering strategies",
    subtitle = "Rows share one color scale; pale bottom-row tiles indicate signals removed by the Coverage ≥ 90 filter",
    x = "Study code",
    y = "ARO term",
    fill = "log10(RPKM + 1)"
  ) +
  theme(
    #axis.text.x = element_text(angle = 0, hjust = 0),
    axis.text.y = element_text(size = 8),
    strip.background = element_rect(fill = "grey92", color = "black"),
    strip.text = element_text(face = "bold"),
    panel.spacing = grid::unit(0.5, "lines")
  )

save_fig(amr_focus_gene_plot, "amr_paper_theme_gene_heatmap.png", width = 10, height = 8)
amr_focus_gene_plot

Take-home note. Because both rows use the same fill scale and the genes are ordered by how much signal is lost after filtering, the coverage effect is much easier to read at a glance.

Code
amr_focus_gene_summary <- amr_cases %>%
  filter(`ARO Term` %in% focus_gene_set) %>%
  group_by(amr_case, `ARO Term`) %>%
  summarise(
    n_samples = n_distinct(sample_name),
    total_rpk = sum(rpk, na.rm = TRUE),
    total_rpkm = sum(rpkm, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(amr_case, desc(n_samples), desc(total_rpkm))

knitr::kable(amr_focus_gene_summary, digits = 2)
amr_case ARO Term n_samples total_rpk total_rpkm
No coverage filter tet(Q) 6 16052.58 348350.05
No coverage filter ErmB 6 15638.81 308079.29
No coverage filter tet(O) 6 11677.08 228032.19
No coverage filter lnuC 6 13230.53 223310.48
No coverage filter tet(W) 6 10096.88 157028.43
No coverage filter tet(32) 6 4889.58 105120.92
No coverage filter tet(M) 6 3364.58 53357.76
No coverage filter mphA 4 27.74 473.48
No coverage filter CTX-M-134 2 20.55 358.85
No coverage filter CTX-M-15 1 28.54 683.61
No coverage filter CTX-M-202 1 25.93 492.93
No coverage filter CTX-M-78 1 2.28 43.41
No coverage filter CTX-M-122 1 2.28 39.43
No coverage filter CTX-M-65 1 2.28 34.74
No coverage filter CTX-M-254 1 1.14 27.34
No coverage filter CTX-M-137 1 1.14 21.70
No coverage filter CTX-M-112 1 1.14 17.37
Average Percent Coverage ≥ 90 lnuC 2 2030.14 501758.30
Average Percent Coverage ≥ 90 CTX-M-15 1 28.54 4907.79

5.4 Drug-class summaries

Code
present_paper_classes <- amr_cases %>%
  select(`Drug Class`) %>%
  separate_rows(`Drug Class`, sep = ";\\s*") %>%
  mutate(paper_class = map_paper_amr_class(`Drug Class`)) %>%
  distinct(paper_class) %>%
  filter(!is.na(paper_class)) %>%
  mutate(paper_class = factor(paper_class, levels = paper_amr_class_levels)) %>%
  arrange(paper_class) %>%
  pull(paper_class)

amr_class_summary <- amr_cases %>%
  select(amr_case, sample_name, studycode, analysis_group, `ARO Term`, `Drug Class`, rpk, rpkm) %>%
  separate_rows(`Drug Class`, sep = ";\\s*") %>%
  mutate(paper_class = map_paper_amr_class(`Drug Class`)) %>%
  group_by(amr_case, sample_name, studycode, analysis_group, paper_class) %>%
  summarise(
    total_rpk = sum(rpk, na.rm = TRUE),
    total_rpkm = sum(rpkm, na.rm = TRUE),
    n_genes = n_distinct(`ARO Term`),
    .groups = "drop"
  ) %>%
  complete(
    amr_case = levels(amr_cases$amr_case),
    sample_name = sample_order,
    paper_class = present_paper_classes,
    fill = list(total_rpk = 0, total_rpkm = 0, n_genes = 0)
  ) %>%
  left_join(metadata %>% select(sample_name, studycode, analysis_group), by = "sample_name") %>%
  mutate(
    studycode = coalesce(studycode.x, studycode.y),
    analysis_group = coalesce(analysis_group.x, analysis_group.y)
  ) %>%
  select(-studycode.x, -studycode.y, -analysis_group.x, -analysis_group.y) %>%
  mutate(amr_case = factor(amr_case, levels = levels(amr_cases$amr_case)))

amr_class_order <- amr_class_summary %>%
  group_by(amr_case, paper_class) %>%
  summarise(total_rpkm = sum(total_rpkm, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = amr_case, values_from = total_rpkm, values_fill = 0) %>%
  mutate(delta_rpkm = `No coverage filter` - `Average Percent Coverage ≥ 90`) %>%
  arrange(desc(abs(delta_rpkm)), desc(`No coverage filter`), paper_class) %>%
  pull(paper_class)

amr_class_summary <- amr_class_summary %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    paper_class = factor(paper_class, levels = rev(amr_class_order))
  )

class_fill_max <- max(log10(amr_class_summary$total_rpkm + 1), na.rm = TRUE)
class_fill_breaks <- pretty(c(0, class_fill_max), n = 5)

amr_class_plot <- ggplot(amr_class_summary, aes(x = studycode, y = paper_class, fill = log10(total_rpkm + 1))) +
  geom_tile(color = "grey70", linewidth = 0.25) +
  facet_grid(
    amr_case ~ analysis_group,
    scales = "free_x",
    space = "free_x",
    labeller = labeller(analysis_group = as_labeller(group_labels))
  ) +
  scale_fill_gradientn(
    colours = c("grey99", "#D1E5F0", "#67A9CF", "#2166AC"),
    limits = c(0, class_fill_max),
    breaks = class_fill_breaks
  ) +
  labs(
    title = "Paper-style AMR classes under both filtering strategies",
    subtitle = "Rows share one color scale, so pale bottom-row tiles highlight class-level signal lost after strict filtering",
    x = "Study code",
    y = "Paper-style AMR class",
    fill = "log10(RPKM + 1)"
  ) +
  theme(
    #axis.text.x = element_text(angle = 45, hjust = 1),
    strip.background = element_rect(fill = "grey92", color = "black"),
    strip.text = element_text(face = "bold"),
    panel.spacing = grid::unit(0.5, "lines")
  )

save_fig(amr_class_plot, "amr_drug_class_heatmap.png", width = 10, height = 7.5)
amr_class_plot

Take-home note. Showing every class-sample combination, including zeros, makes the class-level difference between the no-filter and coverage-filter views much more explicit.

Only the paper-style AMR classes that are actually present in this 6-sample re-analysis are shown here; classes absent from the current data are omitted to avoid implying that they were observed.

Code
amr_class_overall <- amr_class_summary %>%
  group_by(amr_case, paper_class) %>%
  summarise(
    n_samples = n_distinct(sample_name),
    total_rpk = sum(total_rpk),
    total_rpkm = sum(total_rpkm),
    .groups = "drop"
  ) %>%
  arrange(amr_case, desc(total_rpkm))

knitr::kable(amr_class_overall, digits = 2)
amr_case paper_class n_samples total_rpk total_rpkm
No coverage filter MLS 6 207552.96 3699918.52
No coverage filter Tetracyclines 6 79823.82 1439291.97
No coverage filter Other 6 59507.79 888185.32
No coverage filter β-lactamases 6 49331.79 821052.43
No coverage filter Glycopeptides 6 35488.10 637193.72
No coverage filter Fluoroquinolones 6 24974.18 380010.54
No coverage filter Aminoglycosides 6 25777.29 367891.01
No coverage filter Chloramphenicol 6 10360.76 157092.03
No coverage filter Trimethoprim 6 6656.90 96039.79
No coverage filter Fosfomycin 6 3060.14 54163.42
No coverage filter Sulfonamides 6 2499.65 34694.74
Average Percent Coverage ≥ 90 MLS 6 14943.15 3742046.25
Average Percent Coverage ≥ 90 β-lactamases 6 13831.48 3623264.35
Average Percent Coverage ≥ 90 Trimethoprim 6 2627.48 568014.87
Average Percent Coverage ≥ 90 Chloramphenicol 6 1594.43 445992.86
Average Percent Coverage ≥ 90 Other 6 1692.12 409178.97
Average Percent Coverage ≥ 90 Aminoglycosides 6 1031.04 229773.90
Average Percent Coverage ≥ 90 Tetracyclines 6 667.27 218785.09
Average Percent Coverage ≥ 90 Glycopeptides 6 359.73 112079.32
Average Percent Coverage ≥ 90 Fluoroquinolones 6 428.21 111009.16
Average Percent Coverage ≥ 90 Sulfonamides 6 0.00 0.00
Average Percent Coverage ≥ 90 Fosfomycin 6 0.00 0.00

5.5 Clinically highlighted signals

Code
critical_hits <- amr_cases %>%
  filter(grepl("CTX-M|mphA", `ARO Term`, ignore.case = TRUE)) %>%
  select(amr_case, studycode, analysis_group, `ARO Term`, `Completely Mapped Reads`, `Average Percent Coverage`, rpk, rpkm, `Drug Class`) %>%
  arrange(amr_case, `ARO Term`, desc(`Completely Mapped Reads`))

knitr::kable(critical_hits, digits = 2)
amr_case studycode analysis_group ARO Term Completely Mapped Reads Average Percent Coverage rpk rpkm Drug Class
No coverage filter 22EN-C-02 Child 0-23 months CTX-M-112 1 6.51 1.14 17.37 cephalosporin
No coverage filter 22EN-C-07 Child 24-59 months CTX-M-122 2 21.69 2.28 39.43 cephalosporin
No coverage filter 22EN-C-07 Child 24-59 months CTX-M-134 16 79.11 18.26 315.44 cephalosporin
No coverage filter 22EN-A-17 Adult CTX-M-134 2 25.23 2.28 43.41 cephalosporin
No coverage filter 22EN-A-17 Adult CTX-M-137 1 13.58 1.14 21.70 cephalosporin
No coverage filter 22EN-A-04 Adult CTX-M-15 25 92.01 28.54 683.61 cephalosporin; penicillin beta-lactam
No coverage filter 22EN-A-17 Adult CTX-M-202 21 11.36 25.93 492.93 cephalosporin
No coverage filter 22EN-A-04 Adult CTX-M-254 1 13.58 1.14 27.34 cephalosporin
No coverage filter 22EN-C-02 Child 0-23 months CTX-M-65 2 21.00 2.28 34.74 cephalosporin
No coverage filter 22EN-A-17 Adult CTX-M-78 2 27.17 2.28 43.41 cephalosporin
No coverage filter 22EN-C-07 Child 24-59 months mphA 20 44.41 22.00 379.99 macrolide antibiotic
No coverage filter 22EN-A-17 Adult mphA 3 5.42 3.39 64.45 macrolide antibiotic
No coverage filter 22EN-C-02 Child 0-23 months mphA 1 2.55 1.11 16.85 macrolide antibiotic
No coverage filter 22EN-C-03 Child 0-23 months mphA 1 3.47 1.24 12.18 macrolide antibiotic
Average Percent Coverage ≥ 90 22EN-A-04 Adult CTX-M-15 25 92.01 28.54 4907.79 cephalosporin; penicillin beta-lactam

AMR interpretation

  • Without the coverage filter, the AMR profile is broader and closer to the kind of descriptive signal emphasized in the paper.
  • Once Average Percent Coverage ≥ 90 is imposed, the profile becomes much more conservative and several previously visible genes disappear.
  • In the paper-style class grouping, MLS and β-lactamase-related signals remain prominent, but the balance between classes changes substantially between the two AMR cases.
  • Among the highlighted genes, some signals remain only in the no-filter case, while the strict-filter case retains a much smaller subset, including only limited CTX-M signal and no retained mphA.
  • This side-by-side comparison is useful for teaching how sensitive resistome interpretation can be to filtering choices.

6 Integrated exploratory notes

The next plot is deliberately exploratory. It asks whether samples with lower taxonomic diversity also tend to show a larger AMR signal under the two AMR filtering strategies.

Code
integrated_summary <- alpha_diversity %>%
  select(sample_name, studycode, analysis_group, shannon) %>%
  left_join(amr_sample_summary %>% select(amr_case, sample_name, total_rpkm, n_retained_aro_terms), by = "sample_name")

integrated_plot <- ggplot(integrated_summary, aes(x = shannon, y = total_rpkm, color = analysis_group, label = studycode)) +
  geom_point(aes(shape = analysis_group), size = 3) +
  ggrepel::geom_text_repel(
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.25,
    point.padding = 0.25,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  facet_wrap(~ amr_case, nrow = 1) +
  scale_x_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  scale_y_continuous(labels = comma_format(), limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  labs(
    title = "Exploratory view: AMR RPKM versus Shannon diversity",
    x = "Shannon diversity",
    y = "Summed AMR RPKM",
    color = "Group",
    shape = "Group"
  )

save_fig(integrated_plot, "integrated_amr_vs_alpha.png", width = 9, height = 5)
integrated_plot

This plot suggests that the lower-diversity child samples also have relatively high retained AMR signal in both AMR summaries, but the strength of that pattern depends on the filter used. With only six samples, this should be treated as a visual prompt for a larger follow-up analysis, not as an inferred biological relationship.

Take-home note. The relationship between diversity and AMR signal looks broadly similar in both AMR cases, but the apparent strength of the pattern depends on the filtering rule.

7 Comparison with the original paper

Code
comparison_table <- tibble::tibble(
  theme = c(
    "Age-related microbiome structure",
    "Alpha diversity pattern",
    "Beta diversity pattern",
    "Broad AMR burden",
    "Key AMR classes",
    "Clinically highlighted genes"
  ),
  paper = c(
    "Young children under 2 years were structurally distinct from older children and adults.",
    "Youngest children had lower diversity; older children approached adult values.",
    "Older children and adults overlapped more than younger children.",
    "Youngest children had the highest median AMR gene burden.",
    "Tetracycline and MLS signals were widespread.",
    "blaCTX-M and mphA/mphD were emphasized in younger children."
  ),
  subset_reanalysis = c(
    "The two youngest-child samples are distinct from the adult cluster; the single older-child sample is intermediate in phylum composition but not fully adult-like.",
    "Adults have the highest Shannon values in this subset; all three child samples are lower.",
    "Adults cluster closer to one another; child samples are more separated from the adults than adults are from each other.",
    "Without the coverage filter, children still tend to show higher AMR signal on average; with the ≥ 90 filter that separation becomes weaker and the retained hit counts do not follow a clean age gradient.",
    "Using paper-style class grouping, MLS and β-lactamase-related signals remain visible in both analyses, but the class balance changes substantially after strict filtering.",
    "CTX-M and mphA-like signals are more visible without the coverage filter; only limited CTX-M signal remains after strict filtering, and mphA is not retained."
  ),
  take_home = c(
    "Broad directional concordance, but only partial support because n is very small.",
    "Directionally consistent with the paper.",
    "Directionally consistent for adult clustering; weaker support for an older-child/adult overlap claim.",
    "Suggestive but not a clean reproduction.",
    "Concordance is stronger in the no-filter view and weaker in the strict-filter view.",
    "Agreement for these clinically highlighted genes is highly filter-dependent."
  )
)

knitr::kable(comparison_table)
theme paper subset_reanalysis take_home
Age-related microbiome structure Young children under 2 years were structurally distinct from older children and adults. The two youngest-child samples are distinct from the adult cluster; the single older-child sample is intermediate in phylum composition but not fully adult-like. Broad directional concordance, but only partial support because n is very small.
Alpha diversity pattern Youngest children had lower diversity; older children approached adult values. Adults have the highest Shannon values in this subset; all three child samples are lower. Directionally consistent with the paper.
Beta diversity pattern Older children and adults overlapped more than younger children. Adults cluster closer to one another; child samples are more separated from the adults than adults are from each other. Directionally consistent for adult clustering; weaker support for an older-child/adult overlap claim.
Broad AMR burden Youngest children had the highest median AMR gene burden. Without the coverage filter, children still tend to show higher AMR signal on average; with the ≥ 90 filter that separation becomes weaker and the retained hit counts do not follow a clean age gradient. Suggestive but not a clean reproduction.
Key AMR classes Tetracycline and MLS signals were widespread. Using paper-style class grouping, MLS and β-lactamase-related signals remain visible in both analyses, but the class balance changes substantially after strict filtering. Concordance is stronger in the no-filter view and weaker in the strict-filter view.
Clinically highlighted genes blaCTX-M and mphA/mphD were emphasized in younger children. CTX-M and mphA-like signals are more visible without the coverage filter; only limited CTX-M signal remains after strict filtering, and mphA is not retained. Agreement for these clinically highlighted genes is highly filter-dependent.

7.1 Quick summary in English

Findings that remain broadly similar to the original paper

  • On the taxonomy side, the adult samples in this 6-sample subset still show higher alpha diversity and tighter clustering than the younger-child samples.
  • The two children under 24 months still look compositionally distinct from the adult cluster, which remains directionally consistent with the paper’s main microbiome maturation result.
  • The overall structure of the subset is therefore still compatible with the broad idea that younger gut microbiomes are less mature and more distinct than adult microbiomes.

Findings that differ, weaken, or only partially agree once AMR filtering is tightened

  • We still cannot fully evaluate the paper’s “older children become more adult-like” pattern because this subset contains only one child aged 24-59 months.
  • On the beta-diversity side, adults cluster more tightly, but the single older-child sample does not clearly overlap with the adult cluster the way the full paper suggested.
  • In the AMR analysis, the no-filter view and the strict-filter view do not tell exactly the same story.
  • Under the stricter workflow, many canonical paper-highlighted genes drop out, while a smaller set of higher-coverage hits remains.
  • This means that AMR concordance with the paper is stronger descriptively without the coverage filter and weaker but more conservative with the ≥ 90 filter.

Short take-home message

  • This 6-sample re-analysis still supports the paper at the broad microbiome-structure level.
  • On the AMR side, agreement becomes stronger in the no-filter descriptive view and weaker but more conservative once the analysis is restricted to high-coverage hits and RPKM-based summaries.
  • The comparison should therefore be read as a workshop-oriented directional cross-check, not as a full replication of the original paper.

8 Limitations

This notebook should be read with several major limitations in mind:

  • It is based on only six re-analyzed samples.
  • The original paper analyzed 42 participants and had more balanced age strata.
  • The local metadata are sparse, and even the age_month field is ambiguous for adults.
  • Taxonomic ecology here uses f_weighted_at_rank, not the paper’s original Kraken2/Bracken abundance outputs.
  • The unclassified fraction is large in every sample, so the “without unclassified” analyses describe only the classified component.
  • The AMR workflow is not directly equivalent to the paper’s SRST2 + ARGANNOT pipeline.
  • The notebook compares two AMR views: one without a coverage filter and one with Average Percent Coverage ≥ 90. The stricter view is conservative and may remove biologically real but lower-coverage hits.
  • The AMR normalization uses Completely Mapped Reads, Reference Length, and a sample-internal RPKM scaling based on retained AMR reads available in the local RGI tables.
  • Because the local AMR tables do not include total metagenomic read counts, this RPKM calculation is an available-data approximation rather than a perfect reconstruction of the paper’s SRST2-based RPKM framework.
  • With group sizes of 2, 1, and 3, formal inference is not the main point of the exercise.

9 Suggested next analyses

If this workshop analysis were expanded into a more complete re-analysis, the next priorities would be:

  1. Re-analyze a much larger subset, ideally recovering age groups closer to the original paper design.
  2. Standardize metadata carefully, especially adult age, diet, household structure, and recent antimicrobial exposure.
  3. Re-run taxonomic profiling with a workflow closer to the original study if strict replication is the goal.
  4. Add a more principled AMR comparison layer, for example family-level harmonization between CARD and ARGANNOT outputs.
  5. Investigate whether the strongest CTX-M detections remain after stricter filtering on coverage and breadth.
  6. Explore organism-to-gene linkage methods if the training objective extends beyond simple resistome description.

10 Conclusion

This 6-sample workshop re-analysis captures several of the paper’s broad ideas. The taxonomy results remain directionally compatible with a more adult-like and more diverse microbiome in adults.

At the same time, the subset does not cleanly reproduce every published AMR pattern, and the answer depends strongly on which AMR view is used. The no-filter normalized view preserves more of the broad descriptive signal that resembles the paper, whereas the high-coverage view is more conservative and retains a smaller subset of genes and class-level patterns. Because the sample count is tiny and the AMR detection workflow differs, these differences should be interpreted as a reminder about workflow dependence, filtering sensitivity, and sampling uncertainty, not as evidence that the original paper was wrong.

The main workshop take-home is that this small re-analysis is useful for learning how to structure a metagenomic notebook, inspect taxonomy with and without unclassified signal, summarize AMR at gene and drug-class levels, and compare cautiously back to a published study. It is not sufficient for definitive biological conclusions.

Back to top