Clinical Foundations & AMR Surveillance: Taxonomy Analysis

Classified and unclassified signal in a six-sample Vietnamese gut microbiome subset

Author

Hao Chung The, Minh-Quan Ton-Ngoc

Published

May 15, 2026

1 R-based exploratory analysis

The sections below assume that the pipeline outputs above already exist. From this point onward, the notebook shifts from pipeline execution to exploratory interpretation.

Code

library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(ggrepel)
library(phyloseq)
library(vegan)
library(patchwork)
library(scales)
library(data.table)
library(broom)

data_dir <- file.path("..", "materials", "vietnamese-healthy-gut-microbiome")

group_palette <- c(
  "Child 0-23 months" = "#D55E00",
  "Child 24-59 months" = "#0072B2",
  "Adult" = "#009E73"
)

group_shapes <- c(
  "Child 0-23 months" = 16,
  "Child 24-59 months" = 17,
  "Adult" = 15
)

group_labels <- c(
  "Child 0-23 months" = "Child\n(0-23 months)",
  "Child 24-59 months" = "Child\n(24-59 months)",
  "Adult" = "Adult"
)

group_labeller <- labeller(analysis_group = as_labeller(group_labels))

my_color <- c(
  "#4e79a7", "#f28e2b", "#59a14f", "#b6992d",
  "#499894", "#e15759", "#b07aa1", "#9d7660"
)

make_fill_palette <- function(levels_vector, other_label = "Other") {
  core_levels <- setdiff(levels_vector, other_label)
  n_cols <- length(core_levels)
  if (n_cols > length(my_color)) {
    extra_cols <- scales::hue_pal()(n_cols - length(my_color))
    core_cols <- c(my_color, extra_cols)
  } else {
    core_cols <- my_color[seq_len(n_cols)]
  }
  cols <- setNames(core_cols, core_levels)
  if (other_label %in% levels_vector) {
    cols[other_label] <- "grey85"
  }
  cols[levels_vector]
}

paper_amr_class_levels <- c(
  "Aminoglycosides", "β-lactamases", "Fosfomycin", "Fluoroquinolones",
  "Glycopeptides", "MLS", "Chloramphenicol", "Sulfonamides",
  "Tetracyclines", "Trimethoprim", "Other"
)

map_paper_amr_class <- function(drug_class) {
  case_when(
    drug_class %in% c("aminoglycoside antibiotic") ~ "Aminoglycosides",
    drug_class %in% c("cephalosporin", "penicillin beta-lactam", "carbapenem", "monobactam") ~ "β-lactamases",
    drug_class %in% c("phosphonic acid antibiotic") ~ "Fosfomycin",
    drug_class %in% c("fluoroquinolone antibiotic") ~ "Fluoroquinolones",
    drug_class %in% c("glycopeptide antibiotic") ~ "Glycopeptides",
    drug_class %in% c(
      "macrolide antibiotic", "lincosamide antibiotic", "streptogramin antibiotic",
      "streptogramin A antibiotic", "streptogramin B antibiotic"
    ) ~ "MLS",
    drug_class %in% c("phenicol antibiotic") ~ "Chloramphenicol",
    drug_class %in% c("sulfonamide antibiotic") ~ "Sulfonamides",
    drug_class %in% c("tetracycline antibiotic", "glycylcycline") ~ "Tetracyclines",
    drug_class %in% c("diaminopyrimidine antibiotic") ~ "Trimethoprim",
    TRUE ~ "Other"
  )
}

theme_set(
  theme_bw(base_size = 12) +
    theme(
      panel.grid.minor = element_blank(),
      plot.title.position = "plot",
      legend.position = "right"
    )
)

fig_dir <- "figures"
dir.create(fig_dir, showWarnings = FALSE)

save_fig <- function(plot, filename, width = 9, height = 5.5, dpi = 300) {
  ggsave(
    filename = file.path(fig_dir, filename),
    plot = plot,
    width = width,
    height = height,
    dpi = dpi,
    bg = "white"
  )
}

parse_lineage <- function(df) {
  df %>%
    tidyr::separate(
      col = lineage,
      into = c("domain", "phylum", "class", "order", "family", "genus", "species"),
      sep = ";",
      fill = "right",
      extra = "drop",
      remove = FALSE
    )
}

trim_sample_ids <- function(x) trimws(as.character(x))

Code note

This setup chunk loads the R packages used throughout the notebook and defines a few helper objects.
The notebook now uses ggplot2 only for plotting, to keep figure sizing predictable in the rendered output.
For color, the stacked bar plots use a richer custom palette, while group-level plots still combine color and point shape so they remain readable on screen and when printed.
The default ggplot2 theme is theme_bw(), so any plot built directly with ggplot2 follows the workshop style guide.

2 Context from the original paper

Before looking at the 6 local samples, we extract the main analytical context from the published article and the local summary notes.

Paper-level findings that are reasonable to compare against in a workshop subset

The study analyzed 42 healthy Vietnamese participants split into 0-23 months, 2-5 years, and adults.
The paper reported a clear age-related microbiome structure difference, with children under 2 years showing a distinct gut community structure.
Alpha diversity was lower in the youngest children than in older children and adults.
Beta diversity showed broad overlap between older children and adults, while younger children were more distinct and more dispersed.
The healthy gut resistome was described as a substantial AMR reservoir, with tetracycline and MLS resistance genes especially common.
Genes highlighted in the paper included ermB, lnuC, tetM, tetO, tetQ, tetW, and tet-32 as near-ubiquitous signals.
The paper reported a higher AMR burden in the youngest children, and noted clinically important genes such as blaCTX-M and mphA/mphD in younger children.

Code

paper_context <- tibble::tibble(
  theme = c(
    "Age-related microbiome structure",
    "Alpha diversity",
    "Beta diversity",
    "Broad AMR burden",
    "AMR classes emphasized",
    "AMR genes emphasized"
  ),
  published_finding = c(
    "Children under 2 years had a distinct gut microbiome structure relative to older children and adults.",
    "Youngest children had lower Shannon diversity; older children were closer to adults.",
    "Bray-Curtis ordination showed young children dispersed away from the overlapping older-child and adult cloud.",
    "Youngest children carried the greatest number of detected AMR genes.",
    "Tetracycline and MLS resistance were nearly ubiquitous.",
    "ermB, lnuC, tet-32, tetM, tetO, tetQ, tetW were common; blaCTX-M and mphA/mphD were highlighted as clinically important signals."
  )
)

knitr::kable(paper_context, col.names = c("Theme", "Finding extracted from paper / summary"))

Theme	Finding extracted from paper / summary
Age-related microbiome structure	Children under 2 years had a distinct gut microbiome structure relative to older children and adults.
Alpha diversity	Youngest children had lower Shannon diversity; older children were closer to adults.
Beta diversity	Bray-Curtis ordination showed young children dispersed away from the overlapping older-child and adult cloud.
Broad AMR burden	Youngest children carried the greatest number of detected AMR genes.
AMR classes emphasized	Tetracycline and MLS resistance were nearly ubiquitous.
AMR genes emphasized	ermB, lnuC, tet-32, tetM, tetO, tetQ, tetW were common; blaCTX-M and mphA/mphD were highlighted as clinically important signals.

3 Analysis-ready files used below

The R analysis below assumes that the workflow and post-processing steps have already produced an analysis folder containing:

Metadata for the 6 re-analyzed samples.
Taxonomy tables with and without unclassified rows.
A vietnamese_gut.phyloseq folder that contains otu_table.csv, sample_data.csv, and tax_table.csv.
A vietnamese_gut_amr folder with one rgi-bwt gene mapping table per sample.
The original paper PDF and a local summary file used only for context.

Code

base_dir <- "."

paths <- tibble::tibble(
  object = c(
    "metadata",
    "taxonomy_with_unclassified",
    "taxonomy_without_unclassified",
    "phyloseq_export_dir",
    "amr_dir",
    "local_summary_md"
  ),
  file_name = c(
    "vietnamese_gut_sample_information.csv",
    "vietnamese_gut_with_unclassified.csv",
    "vietnamese_gut_without_unclassified.csv",
    "vietnamese_gut.phyloseq",
    "vietnamese_gut_amr",
    "vietnamese_gut_summary.md"
  )
) %>%
  mutate(
  path = file.path(data_dir, file_name),
  exists = file.exists(path) | dir.exists(path)
)

knitr::kable(paths)

object	file_name	path	exists
metadata	vietnamese_gut_sample_information.csv	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_sample_information.csv	TRUE
taxonomy_with_unclassified	vietnamese_gut_with_unclassified.csv	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_with_unclassified.csv	TRUE
taxonomy_without_unclassified	vietnamese_gut_without_unclassified.csv	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_without_unclassified.csv	TRUE
phyloseq_export_dir	vietnamese_gut.phyloseq	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut.phyloseq	TRUE
amr_dir	vietnamese_gut_amr	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_amr	TRUE
local_summary_md	vietnamese_gut_summary.md	../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_summary.md	FALSE

4 Metadata inspection

4.1 Load data

Code note

This chunk reads the local sample metadata and creates a simple analysis_group variable that we can use consistently in plots and summaries.
The grouping is intentionally simple because the metadata for this workshop subset are limited.

Code

metadata <- read_csv(file.path(data_dir, "vietnamese_gut_sample_information.csv"), show_col_types = FALSE) %>%
  mutate(
    sample_name = trim_sample_ids(sample_name),
    studycode = trim_sample_ids(studycode),
    status = recode(status, "Children" = "Child", .default = status),
    analysis_group = case_when(
      status == "Adult" ~ "Adult",
      age_month < 24 ~ "Child 0-23 months",
      TRUE ~ "Child 24-59 months"
    ),
    analysis_group = factor(
      analysis_group,
      levels = c("Child 0-23 months", "Child 24-59 months", "Adult")
    )
  ) %>%
  arrange(analysis_group, studycode)

sample_order <- metadata %>%
  arrange(analysis_group, studycode) %>%
  pull(sample_name)

studycode_order <- metadata %>%
  arrange(analysis_group, studycode) %>%
  pull(studycode)

metadata

# A tibble: 6 × 7
  studycode age_month date_of_sampling to_sequence sample_name status
  <chr>         <dbl> <chr>            <chr>       <chr>       <chr> 
1 22EN-C-02        16 18-May-2017      1st only    ERR9904449  Child 
2 22EN-C-03         9 18-May-2017      1st only    ERR9904450  Child 
3 22EN-C-07        59 22-May-2017      1st only    ERR9904458  Child 
4 22EN-A-04        25 18-May-2017      1st only    ERR9904447  Adult 
5 22EN-A-05        59 19-May-2017      1st only    ERR9904452  Adult 
6 22EN-A-17        32 24-May-2017      2nd only    ERR9904480  Adult 
# ℹ 1 more variable: analysis_group <fct>

Metadata notes

status is the most reliable top-level grouping variable in the local file.
For children, age_month is usable as age in months.
For adults, the age_month column appears to contain values such as 25, 32, and 59, which are much more plausible as adult ages in years than months.
For that reason, this notebook uses analysis_group derived from status and child age thresholds, and uses the adult age values only as descriptive metadata.

Code

metadata_summary <- metadata %>%
  count(status, analysis_group, name = "n_samples")

knitr::kable(metadata_summary)

status	analysis_group	n_samples
Adult	Adult	3
Child	Child 0-23 months	2
Child	Child 24-59 months	1

4.2 Harmonize sample IDs across inputs

Code note

This chunk checks whether the same sample IDs are present in metadata, taxonomy tables, the exported phyloseq tables, and the AMR result files.
Harmonizing IDs early helps avoid silent joins to the wrong sample later in the notebook.
We keep sample_name for joins, but use studycode later for plotting because it is more compact on figure axes.

Code

tax_with <- read_csv(file.path(data_dir, "vietnamese_gut_with_unclassified.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

tax_without <- read_csv(file.path(data_dir, "vietnamese_gut_without_unclassified.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

phy_sample_data <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "sample_data.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

amr_files <- list.files(file.path(data_dir, "vietnamese_gut_amr"), pattern = "gene_mapping_data\\.txt$", full.names = TRUE)
amr_file_samples <- sub("\\.gene_mapping_data\\.txt$", "", basename(amr_files))

sample_harmonization <- tibble::tibble(
  source = c("metadata", "taxonomy_with", "taxonomy_without", "phyloseq_export", "amr_files"),
  n_unique_samples = c(
    n_distinct(metadata$sample_name),
    n_distinct(tax_with$sample_name),
    n_distinct(tax_without$sample_name),
    n_distinct(phy_sample_data$sample_name),
    length(unique(amr_file_samples))
  ),
  sample_ids = c(
    paste(sort(unique(metadata$sample_name)), collapse = ", "),
    paste(sort(unique(tax_with$sample_name)), collapse = ", "),
    paste(sort(unique(tax_without$sample_name)), collapse = ", "),
    paste(sort(unique(phy_sample_data$sample_name)), collapse = ", "),
    paste(sort(unique(amr_file_samples)), collapse = ", ")
  )
)

knitr::kable(sample_harmonization)

source	n_unique_samples	sample_ids
metadata	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
taxonomy_with	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
taxonomy_without	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
phyloseq_export	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
amr_files	6	ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480

All four sources contain the same six sample IDs, so harmonization is straightforward.

4.3 Inspect the exported phyloseq components

Code

phy_otu_export <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "otu_table.csv"), show_col_types = FALSE)
phy_tax_export <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "tax_table.csv"), show_col_types = FALSE)

phyloseq_export_status <- tibble::tibble(
  file = c("sample_data.csv", "otu_table.csv", "tax_table.csv"),
  rows = c(nrow(phy_sample_data), nrow(phy_otu_export), nrow(phy_tax_export)),
  columns = c(ncol(phy_sample_data), ncol(phy_otu_export), ncol(phy_tax_export))
)

knitr::kable(phyloseq_export_status)

file	rows	columns
sample_data.csv	6	2
otu_table.csv	736	7
tax_table.csv	736	8

The vietnamese_gut.phyloseq path is not a serialized .rds phyloseq object. It is an export directory containing the component tables needed to reconstruct a phyloseq object.

4.4 Inspect the AMR folder layout

Code

amr_file_inventory <- tibble::tibble(
  file = basename(amr_files),
  sample_name = sub("\\.gene_mapping_data\\.txt$", "", basename(amr_files)),
  size_bytes = file.info(amr_files)$size
)

knitr::kable(amr_file_inventory)

file	sample_name	size_bytes
ERR9904447.gene_mapping_data.txt	ERR9904447	109567
ERR9904449.gene_mapping_data.txt	ERR9904449	100676
ERR9904450.gene_mapping_data.txt	ERR9904450	96742
ERR9904452.gene_mapping_data.txt	ERR9904452	80172
ERR9904458.gene_mapping_data.txt	ERR9904458	103677
ERR9904480.gene_mapping_data.txt	ERR9904480	126389

Code

amr_example <- fread(amr_files[1], sep = "\t", data.table = FALSE)

knitr::kable(head(amr_example, 8))

ARO Term	ARO Accession	Reference Model Type	Reference DB	Alleles with Mapped Reads	Reference Allele(s) Identity to CARD Reference Protein (%)	Resistomes & Variants: Observed in Genome(s)	Resistomes & Variants: Observed in Plasmid(s)	Resistomes & Variants: Observed Pathogen(s)	Completely Mapped Reads	All Mapped Reads	Average Percent Coverage	Average Length Coverage (bp)	Average MAPQ (Completely Mapped Reads)	Number of reads mapping to baits and mapping to complete gene	Number of reads mapping to baits and mapping to complete gene (%)	Mate Pair Linkage (# reads)	Reference Length	AMR Gene Family	Drug Class	Resistance Mechanism
ErmG	3000522	protein homolog model	CARD; Resistomes & Variants	2	98.36 - 100.0	no data	no data	Bacteroides thetaiotaomicron	2	2	16.05	118.00	189.50	N/A	N/A	NA	735; 735	Erm 23S ribosomal RNA methyltransferase	macrolide antibiotic; lincosamide antibiotic; streptogramin antibiotic; streptogramin A antibiotic; streptogramin B antibiotic	antibiotic target alteration
AAC(6’)-Iw	3002567	protein homolog model	CARD	1	100.0	no data	no data	Acinetobacter sp.	1	1	3.85	17.00	23.00	N/A	N/A	NA	441	AAC(6’)	aminoglycoside antibiotic	antibiotic inactivation
tet(W)	3000194	protein homolog model	CARD; Resistomes & Variants	22	95.85 - 100.0	YES	no data	Butyrivibrio fibrisolvens; Bifidobacterium longum; Collinsella aerofaciens; Faecalibacterium prausnitzii; Lactobacillus johnsonii; Eggerthella lenta; Eubacterium maltosivorans; Lactobacillus crispatus; Streptococcus suis; Bifidobacterium breve	1247	1247	32.04	575.82	160.75	N/A	N/A	NA	1920; 1920; 1920; 2082; 1617; 1920; 1920; 1959; 2070; 1920; 1920; 1920; 1920; 1920; 1920; 1920; 1920; 1932; 1920; 1920; 1920; 219	tetracycline-resistant ribosomal protection protein	tetracycline antibiotic	antibiotic target protection
CGB-1	3000841	protein homolog model	CARD	1	100.0	no data	no data	Chryseobacterium gleum	2	2	2.74	20.00	40.00	N/A	N/A	NA	729	CGB beta-lactamase	carbapenem; cephalosporin; penicillin beta-lactam	antibiotic inactivation
vanL	3002910	protein homolog model	CARD	1	100.0	no data	no data	Enterococcus faecalis	1	1	3.05	32.00	56.00	N/A	N/A	NA	1050	glycopeptide resistance gene cluster; Van ligase	glycopeptide antibiotic	antibiotic target alteration
lsaB	3003111	protein homolog model	CARD	1	100.0	no data	no data	Mammaliicoccus sciuri	14	14	2.91	43.00	39.64	N/A	N/A	NA	1479	lsa-type ABC-F protein	lincosamide antibiotic; streptogramin antibiotic; pleuromutilin antibiotic	antibiotic target protection
ErmQ	3000593	protein homolog model	CARD	1	100.0	no data	no data	Clostridium perfringens	4	4	44.44	344.00	160.00	N/A	N/A	NA	774	Erm 23S ribosomal RNA methyltransferase	macrolide antibiotic; lincosamide antibiotic; streptogramin antibiotic; streptogramin A antibiotic; streptogramin B antibiotic	antibiotic target alteration
catP	3002686	protein homolog model	CARD	1	100.0	no data	no data	Clostridium perfringens	30	30	100.00	624.00	156.97	N/A	N/A	NA	624	chloramphenicol acetyltransferase (CAT)	phenicol antibiotic	antibiotic inactivation

The AMR files are one table per sample and include:

ARO term
reference model metadata
mapped read counts
average coverage
AMR gene family
drug class
resistance mechanism

This is sufficient for descriptive gene-level and drug-class summaries.

5 Taxonomy analysis

5.1 Load data

Code note

Here we load the two taxonomy tables and split the lineage string into standard taxonomic ranks.
We then join the taxonomy data back to the metadata so every row knows which sample and age group it belongs to.

Code

tax_with_parsed <- parse_lineage(tax_with) %>%
  left_join(metadata, by = "sample_name")

tax_without_parsed <- parse_lineage(tax_without) %>%
  left_join(metadata, by = "sample_name")

tax_structure_summary <- tibble::tibble(
  table = c("with_unclassified", "without_unclassified"),
  rows = c(nrow(tax_with_parsed), nrow(tax_without_parsed)),
  unique_samples = c(n_distinct(tax_with_parsed$sample_name), n_distinct(tax_without_parsed$sample_name)),
  unique_matches = c(n_distinct(tax_with_parsed$match_name), n_distinct(tax_without_parsed$match_name))
)

knitr::kable(tax_structure_summary)

table	rows	unique_samples	unique_matches
with_unclassified	1344	6	737
without_unclassified	1338	6	736

5.2 Inspect data structure

Code

taxonomy_numeric_checks <- bind_rows(
  tax_with_parsed %>%
    group_by(sample_name) %>%
    summarise(
      table = "with_unclassified",
      f_weighted_sum = sum(f_weighted_at_rank, na.rm = TRUE),
      unweighted_sum = sum(unweighted_fraction, na.rm = TRUE),
      .groups = "drop"
    ),
  tax_without_parsed %>%
    group_by(sample_name) %>%
    summarise(
      table = "without_unclassified",
      f_weighted_sum = sum(f_weighted_at_rank, na.rm = TRUE),
      unweighted_sum = sum(unweighted_fraction, na.rm = TRUE),
      .groups = "drop"
    )
) %>%
  arrange(table, sample_name)

knitr::kable(taxonomy_numeric_checks, digits = 4)

sample_name	table	f_weighted_sum	unweighted_sum
ERR9904447	with_unclassified	1	1.0000
ERR9904449	with_unclassified	1	1.0000
ERR9904450	with_unclassified	1	1.0000
ERR9904452	with_unclassified	1	1.0000
ERR9904458	with_unclassified	1	1.0000
ERR9904480	with_unclassified	1	1.0000
ERR9904447	without_unclassified	1	0.5407
ERR9904449	without_unclassified	1	0.5545
ERR9904450	without_unclassified	1	0.4473
ERR9904452	without_unclassified	1	0.5550
ERR9904458	without_unclassified	1	0.5369
ERR9904480	without_unclassified	1	0.5560

Two points matter here:

f_weighted_at_rank sums to approximately 1 within each sample in both taxonomy tables, so it is appropriate for abundance-based ecological summaries.
unweighted_fraction does not behave like a standard compositional percentage and should not be interpreted as if it must sum to 1 or 100.

5.3 Compare with vs without unclassified

Code note

This section isolates the rows labelled unclassified in the full taxonomy table.
The goal is to show workshop participants how much of each sample remains taxonomically unresolved before using the renormalized table for ecological summaries.

Code

unclassified_summary <- tax_with_parsed %>%
  filter(tolower(lineage) == "unclassified") %>%
  transmute(
    sample_name,
    analysis_group,
    unclassified_f_weighted = f_weighted_at_rank,
    unclassified_unweighted = unweighted_fraction
  ) %>%
  arrange(analysis_group, sample_name)

knitr::kable(unclassified_summary, digits = 4)

sample_name	analysis_group	unclassified_f_weighted	unclassified_unweighted
ERR9904449	Child 0-23 months	0.4177	0.4455
ERR9904450	Child 0-23 months	0.3812	0.5527
ERR9904458	Child 24-59 months	0.3779	0.4631
ERR9904447	Adult	0.4890	0.4593
ERR9904452	Adult	0.4177	0.4450
ERR9904480	Adult	0.4896	0.4440

Code

unclassified_plot_data <- unclassified_summary %>%
    pivot_longer(
        cols = c(unclassified_f_weighted, unclassified_unweighted),
        names_to = "metric",
        values_to = "fraction"
    ) %>%
    mutate(
        metric = recode(
            metric,
            unclassified_f_weighted = "f_weighted_at_rank",
            unclassified_unweighted = "unweighted_fraction"
        ),
        fraction_percent = 100 * fraction,
        studycode = factor(metadata$studycode[match(sample_name, metadata$sample_name)], levels = studycode_order)
    )

unclassified_segments <- unclassified_plot_data %>%
    group_by(sample_name, studycode, analysis_group) %>%
    summarise(
        y_min = min(fraction_percent, na.rm = TRUE),
        y_max = max(fraction_percent, na.rm = TRUE),
        .groups = "drop"
    ) %>%
    mutate(studycode = factor(studycode, levels = studycode_order))

unclassified_plot <- ggplot(unclassified_plot_data, aes(x = studycode, y = fraction_percent)) +
    geom_segment(
        data = unclassified_segments,
        aes(x = studycode, xend = studycode, y = y_min, yend = y_max),
        linewidth = 0.7,
        color = "grey55"
    ) +
    geom_point(aes(fill = metric, shape = metric), size = 3.2, color = "black", stroke = 0.35) +
    facet_wrap(~analysis_group, scales = "free_x", labeller = group_labeller) +
    scale_y_continuous(
        labels = label_percent(scale = 1, accuracy = 1),
        limits = c(0, 100),
        expand = expansion(mult = c(0, 0.02))
    ) +
    scale_fill_manual(values = c("f_weighted_at_rank" = "#4e79a7", "unweighted_fraction" = "#e15759")) +
    scale_shape_manual(values = c("f_weighted_at_rank" = 21, "unweighted_fraction" = 24)) +
    labs(
        title = "Large unclassified fractions remain in all six samples",
        # subtitle = "Within each sample, the two points are connected vertically to compare f_weighted_at_rank and unweighted_fraction directly",
        x = "Study code",
        y = "Fraction assigned to 'unclassified'",
        fill = "Metric",
        shape = "Metric"
    ) +
    theme(axis.text.x = element_text(angle = 20, hjust = 1))

save_fig(unclassified_plot, "taxonomy_unclassified_fraction.png", width = 8, height = 5)
unclassified_plot

Take-home note. The gap between the two points for each sample shows how much the interpretation changes depending on whether we use the weighted classified fraction or the unweighted fraction.

In this subset, the weighted unclassified fraction ranges from roughly 38% to 49%. That is large enough that analyses using the table without unclassified rows should be interpreted as composition conditional on classified signal, not as a complete description of all reads.

When a sample has a large unclassified fraction, what is the safest way to read the classified-only composition table?

A. As a complete picture of the whole metagenome B. As the composition of the classified portion of the sample, not the entire DNA signal C. As a table that becomes automatically unbiased once unclassified rows are removed D. As a result that should always be discarded without further discussion

Hint

Ask what information is lost when the database cannot confidently assign a large part of the sample.

5.4 Reconstruct a phyloseq object for the demonstration

Code note

This chunk converts the long-format taxonomy table into the three core pieces needed by phyloseq: an OTU table, a taxonomy table, and sample metadata.
We do not need phyloseq for every step, but it is useful to show how these data could be reused in a standard microbiome workflow.

Code

otu_demo <- tax_without_parsed %>%
  select(match_name, sample_name, f_weighted_at_rank) %>%
  distinct() %>%
  pivot_wider(names_from = sample_name, values_from = f_weighted_at_rank, values_fill = 0)

tax_demo <- tax_without_parsed %>%
  select(match_name, domain, phylum, class, order, family, genus, species) %>%
  distinct()

sample_demo <- metadata %>%
  select(sample_name, studycode, status, age_month, analysis_group)

otu_mat <- as.matrix(as.data.frame(otu_demo[, -1]))
rownames(otu_mat) <- otu_demo$match_name

tax_mat <- as.matrix(as.data.frame(tax_demo[, -1]))
rownames(tax_mat) <- tax_demo$match_name

sample_df <- as.data.frame(sample_demo)
rownames(sample_df) <- sample_df$sample_name
sample_df$sample_name <- NULL

phy_demo <- phyloseq(
  otu_table(otu_mat, taxa_are_rows = TRUE),
  tax_table(tax_mat),
  sample_data(sample_df)
)

tibble::tibble(
  nsamples = nsamples(phy_demo),
  ntaxa = ntaxa(phy_demo),
  rank_names = paste(rank_names(phy_demo), collapse = ", ")
)

# A tibble: 1 × 3
  nsamples ntaxa rank_names                                          
     <int> <int> <chr>                                               
1        6   736 domain, phylum, class, order, family, genus, species

This reconstructed object is convenient for workshop workflows, but the ecology below still uses the same underlying f_weighted_at_rank abundances from vietnamese_gut_without_unclassified.csv.

5.5 Composition plots

5.5.1 Phylum composition

Code note

We first rank phyla by their total abundance across all six samples.
To keep the stacked bar plot readable, we keep the top 10 phyla and collapse the remainder into Other.
The x-axis uses studycode rather than sample_name, which keeps the figure easier to read.
Within each sample, the stack is ordered from highest abundance at the bottom to lower abundance above, with Other forced to the top.
The stack order is controlled explicitly so the dominant phyla sit lower in each bar and Other remains at the top of each sample.

Code

top_phyla <- tax_without_parsed %>%
  group_by(phylum) %>%
  summarise(total_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  slice_head(n = 10) %>%
  pull(phylum)

phylum_plot_data <- tax_without_parsed %>%
  mutate(phylum_plot = if_else(phylum %in% top_phyla, phylum, "Other")) %>%
  group_by(sample_name, studycode, analysis_group, phylum_plot) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(sample_name, studycode, analysis_group) %>%
  mutate(stack_order = if_else(phylum_plot == "Other", -1, rank(rel_abundance, ties.method = "first"))) %>%
  ungroup()

phylum_levels <- phylum_plot_data %>%
  group_by(phylum_plot) %>%
  summarise(total_abundance = sum(rel_abundance), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  pull(phylum_plot)

phylum_fill <- make_fill_palette(phylum_levels)

phylum_plot_data <- phylum_plot_data %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    phylum_plot = factor(phylum_plot, levels = phylum_levels)
  )

phylum_plot <- ggplot(phylum_plot_data, aes(x = studycode, y = rel_abundance, fill = phylum_plot, order = stack_order)) +
  geom_col(width = 0.75, color = "grey15", linewidth = 0.15) +
  facet_wrap(~ analysis_group, scales = "free_x", labeller = group_labeller) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 1),
    expand = expansion(mult = c(0, 0.02))
  ) +
  scale_fill_manual(values = phylum_fill, breaks = phylum_levels) +
  labs(
    title = "Phylum-level composition after removing unclassified rows",
    subtitle = "Top 10 phyla by overall relative abundance; 'Other' is stacked on top within each sample",
    x = "Study code",
    y = "Relative abundance (f_weighted_at_rank)",
    fill = "Phylum"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1),
    legend.position = "right"
  )

save_fig(phylum_plot, "taxonomy_phylum_composition.png", width = 8, height = 5.5)
phylum_plot

Take-home note. This phylum-level stacked bar plot gives the quickest visual summary of how the dominant classified signal shifts across the six samples.

Code

phylum_top10_by_sample <- tax_without_parsed %>%
  group_by(studycode, analysis_group, phylum) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(studycode, analysis_group) %>%
  arrange(desc(rel_abundance), .by_group = TRUE) %>%
  slice_head(n = 10) %>%
  ungroup()

phylum_top10_by_group <- tax_without_parsed %>%
  group_by(studycode, analysis_group, phylum) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(analysis_group, phylum) %>%
  summarise(mean_rel_abundance = mean(rel_abundance), .groups = "drop") %>%
  group_by(analysis_group) %>%
  arrange(desc(mean_rel_abundance), .by_group = TRUE) %>%
  slice_head(n = 10) %>%
  ungroup()

knitr::kable(phylum_top10_by_sample, digits = 4, caption = "Top 10 phyla by relative abundance within each sample")

Top 10 phyla by relative abundance within each sample
studycode	analysis_group	phylum	rel_abundance
22EN-A-04	Adult	p__Bacillota	0.5340
22EN-A-04	Adult	p__Bacteroidota	0.4262
22EN-A-04	Adult	p__Pseudomonadota	0.0133
22EN-A-04	Adult	p__Actinomycetota	0.0107
22EN-A-04	Adult	p__Methanobacteriota	0.0095
22EN-A-04	Adult	p__Desulfobacterota_I	0.0054
22EN-A-04	Adult	p__Verrucomicrobiota	0.0008
22EN-A-04	Adult	p__Fusobacteriota	0.0002
22EN-A-05	Adult	p__Bacillota	0.5304
22EN-A-05	Adult	p__Bacteroidota	0.3622
22EN-A-05	Adult	p__Cyanobacteriota	0.0523
22EN-A-05	Adult	p__Actinomycetota	0.0523
22EN-A-05	Adult	p__Desulfobacterota_I	0.0016
22EN-A-05	Adult	p__Pseudomonadota	0.0009
22EN-A-05	Adult	p__Verrucomicrobiota	0.0003
22EN-A-17	Adult	p__Bacillota	0.5838
22EN-A-17	Adult	p__Bacteroidota	0.3094
22EN-A-17	Adult	p__Pseudomonadota	0.0785
22EN-A-17	Adult	p__Actinomycetota	0.0168
22EN-A-17	Adult	p__Desulfobacterota_I	0.0083
22EN-A-17	Adult	p__Cyanobacteriota	0.0027
22EN-A-17	Adult	p__Fusobacteriota	0.0004
22EN-C-02	Child 0-23 months	p__Actinomycetota	0.3518
22EN-C-02	Child 0-23 months	p__Bacillota	0.3461
22EN-C-02	Child 0-23 months	p__Bacteroidota	0.2856
22EN-C-02	Child 0-23 months	p__Pseudomonadota	0.0152
22EN-C-02	Child 0-23 months	p__Desulfobacterota_I	0.0008
22EN-C-02	Child 0-23 months	p__Fusobacteriota	0.0005
22EN-C-03	Child 0-23 months	p__Actinomycetota	0.5505
22EN-C-03	Child 0-23 months	p__Bacillota	0.3671
22EN-C-03	Child 0-23 months	p__Pseudomonadota	0.0788
22EN-C-03	Child 0-23 months	p__Bacteroidota	0.0036
22EN-C-07	Child 24-59 months	p__Bacillota	0.6576
22EN-C-07	Child 24-59 months	p__Bacteroidota	0.2635
22EN-C-07	Child 24-59 months	p__Actinomycetota	0.0734
22EN-C-07	Child 24-59 months	p__Pseudomonadota	0.0047
22EN-C-07	Child 24-59 months	p__Desulfobacterota_I	0.0005
22EN-C-07	Child 24-59 months	p__Fusobacteriota	0.0002
22EN-C-07	Child 24-59 months	p__Patescibacteriota	0.0001

Code

knitr::kable(phylum_top10_by_group, digits = 4, caption = "Top 10 phyla by mean relative abundance within each age group")

Top 10 phyla by mean relative abundance within each age group
analysis_group	phylum	mean_rel_abundance
Child 0-23 months	p__Actinomycetota	0.4511
Child 0-23 months	p__Bacillota	0.3566
Child 0-23 months	p__Bacteroidota	0.1446
Child 0-23 months	p__Pseudomonadota	0.0470
Child 0-23 months	p__Desulfobacterota_I	0.0008
Child 0-23 months	p__Fusobacteriota	0.0005
Child 24-59 months	p__Bacillota	0.6576
Child 24-59 months	p__Bacteroidota	0.2635
Child 24-59 months	p__Actinomycetota	0.0734
Child 24-59 months	p__Pseudomonadota	0.0047
Child 24-59 months	p__Desulfobacterota_I	0.0005
Child 24-59 months	p__Fusobacteriota	0.0002
Child 24-59 months	p__Patescibacteriota	0.0001
Adult	p__Bacillota	0.5494
Adult	p__Bacteroidota	0.3659
Adult	p__Pseudomonadota	0.0309
Adult	p__Cyanobacteriota	0.0275
Adult	p__Actinomycetota	0.0266
Adult	p__Methanobacteriota	0.0095
Adult	p__Desulfobacterota_I	0.0051
Adult	p__Verrucomicrobiota	0.0006
Adult	p__Fusobacteriota	0.0003

5.5.2 Genus composition

Code note

A stacked bar plot at genus level is harder to read if we include too many genera.
We therefore keep the top 10 genera across the dataset and collapse the rest into Other.
This gives a sample-level overview that is more intuitive for workshop participants before moving to the genus heatmap.

Code

top_genera_bar <- tax_without_parsed %>%
  filter(!is.na(genus), genus != "") %>%
  group_by(genus) %>%
  summarise(total_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  slice_head(n = 10) %>%
  pull(genus)

genus_bar_plot_data <- tax_without_parsed %>%
  mutate(genus_plot = if_else(!is.na(genus) & genus %in% top_genera_bar, genus, "Other")) %>%
  group_by(sample_name, studycode, analysis_group, genus_plot) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(sample_name, studycode, analysis_group) %>%
  mutate(stack_order = if_else(genus_plot == "Other", -1, rank(rel_abundance, ties.method = "first"))) %>%
  ungroup()

genus_levels <- genus_bar_plot_data %>%
  group_by(genus_plot) %>%
  summarise(total_abundance = sum(rel_abundance), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  pull(genus_plot)

genus_fill <- make_fill_palette(genus_levels)

genus_bar_plot_data <- genus_bar_plot_data %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    genus_plot = factor(genus_plot, levels = genus_levels)
  )

genus_bar_plot <- ggplot(genus_bar_plot_data, aes(x = studycode, y = rel_abundance, fill = genus_plot, order = stack_order)) +
  geom_col(width = 0.75, color = "grey15", linewidth = 0.15) +
  facet_wrap(~ analysis_group, scales = "free_x", labeller = group_labeller) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 1),
    expand = expansion(mult = c(0, 0.02))
  ) +
  scale_fill_manual(values = genus_fill, breaks = genus_levels) +
  labs(
    title = "Genus-level composition after removing unclassified rows",
    subtitle = "Top 10 genera by overall relative abundance; 'Other' is stacked on top within each sample",
    x = "Study code",
    y = "Relative abundance (f_weighted_at_rank)",
    fill = "Genus"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1),
    legend.position = "right"
  )

save_fig(genus_bar_plot, "taxonomy_genus_composition.png", width = 10, height = 6)
genus_bar_plot

Take-home note. The genus-level stacked bar plot is more detailed than the phylum plot and helps show which named genera drive the broad taxonomic shifts.

Code

genus_top10_by_sample <- tax_without_parsed %>%
  filter(!is.na(genus), genus != "") %>%
  group_by(studycode, analysis_group, genus) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(studycode, analysis_group) %>%
  arrange(desc(rel_abundance), .by_group = TRUE) %>%
  slice_head(n = 10) %>%
  ungroup()

genus_top10_by_group <- tax_without_parsed %>%
  filter(!is.na(genus), genus != "") %>%
  group_by(studycode, analysis_group, genus) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(analysis_group, genus) %>%
  summarise(mean_rel_abundance = mean(rel_abundance), .groups = "drop") %>%
  group_by(analysis_group) %>%
  arrange(desc(mean_rel_abundance), .by_group = TRUE) %>%
  slice_head(n = 10) %>%
  ungroup()

knitr::kable(genus_top10_by_sample, digits = 4, caption = "Top 10 genera by relative abundance within each sample")

Top 10 genera by relative abundance within each sample
studycode	analysis_group	genus	rel_abundance
22EN-A-04	Adult	g__Prevotella	0.2903
22EN-A-04	Adult	g__Blautia_A	0.0480
22EN-A-04	Adult	g__Cryptobacteroides	0.0448
22EN-A-04	Adult	g__Megamonas	0.0343
22EN-A-04	Adult	g__CAG-877	0.0289
22EN-A-04	Adult	g__Coprococcus	0.0279
22EN-A-04	Adult	g__Bacteroides	0.0270
22EN-A-04	Adult	g__Faecalibacterium	0.0269
22EN-A-04	Adult	g__Anaerostipes	0.0199
22EN-A-04	Adult	g__Alloprevotella	0.0171
22EN-A-05	Adult	g__Bacteroides	0.1816
22EN-A-05	Adult	g__Blautia_A	0.0865
22EN-A-05	Adult	g__Prevotella	0.0655
22EN-A-05	Adult	g__Alistipes	0.0644
22EN-A-05	Adult	g__Stercorousia	0.0523
22EN-A-05	Adult	g__Bifidobacterium	0.0466
22EN-A-05	Adult	g__Agathobacter	0.0426
22EN-A-05	Adult	g__Fusicatenibacter	0.0342
22EN-A-05	Adult	g__CAG-302	0.0308
22EN-A-05	Adult	g__Phascolarctobacterium	0.0299
22EN-A-17	Adult	g__Prevotella	0.1827
22EN-A-17	Adult	g__Agathobacter	0.0849
22EN-A-17	Adult	g__Blautia_A	0.0849
22EN-A-17	Adult	g__Faecalibacterium	0.0577
22EN-A-17	Adult	g__Fimenecus	0.0572
22EN-A-17	Adult	g__Klebsiella	0.0496
22EN-A-17	Adult	g__Bacteroides	0.0295
22EN-A-17	Adult	g__Alloprevotella	0.0286
22EN-A-17	Adult	g__Dorea_A	0.0266
22EN-A-17	Adult	g__Phocaeicola	0.0221
22EN-C-02	Child 0-23 months	g__Bifidobacterium	0.3456
22EN-C-02	Child 0-23 months	g__Bacteroides	0.1499
22EN-C-02	Child 0-23 months	g__Phocaeicola	0.1343
22EN-C-02	Child 0-23 months	g__Anaerostipes	0.0578
22EN-C-02	Child 0-23 months	g__Megasphaera	0.0363
22EN-C-02	Child 0-23 months	g__Mediterraneibacter	0.0342
22EN-C-02	Child 0-23 months	g__Blautia_A	0.0313
22EN-C-02	Child 0-23 months	g__Megamonas	0.0259
22EN-C-02	Child 0-23 months	g__Faecalibacterium	0.0218
22EN-C-02	Child 0-23 months	g__Sellimonas	0.0173
22EN-C-03	Child 0-23 months	g__Bifidobacterium	0.4785
22EN-C-03	Child 0-23 months	g__Streptococcus	0.1488
22EN-C-03	Child 0-23 months	g__Mediterraneibacter	0.1143
22EN-C-03	Child 0-23 months	g__Escherichia	0.0735
22EN-C-03	Child 0-23 months	g__Collinsella	0.0713
22EN-C-03	Child 0-23 months	g__Veillonella	0.0216
22EN-C-03	Child 0-23 months	g__Fusicatenibacter	0.0192
22EN-C-03	Child 0-23 months	g__Blautia_A	0.0162
22EN-C-03	Child 0-23 months	g__Enterococcus	0.0152
22EN-C-03	Child 0-23 months	g__Dorea	0.0073
22EN-C-07	Child 24-59 months	g__Megamonas	0.1603
22EN-C-07	Child 24-59 months	g__Ruminococcus_B	0.1507
22EN-C-07	Child 24-59 months	g__Phocaeicola	0.1279
22EN-C-07	Child 24-59 months	g__Bacteroides	0.1163
22EN-C-07	Child 24-59 months	g__Bifidobacterium	0.0680
22EN-C-07	Child 24-59 months	g__Faecalibacterium	0.0484
22EN-C-07	Child 24-59 months	g__Ruminococcoides	0.0474
22EN-C-07	Child 24-59 months	g__Streptococcus	0.0453
22EN-C-07	Child 24-59 months	g__Blautia_A	0.0380
22EN-C-07	Child 24-59 months	g__Veillonella	0.0372

Code

knitr::kable(genus_top10_by_group, digits = 4, caption = "Top 10 genera by mean relative abundance within each age group")

Top 10 genera by mean relative abundance within each age group
analysis_group	genus	mean_rel_abundance
Child 0-23 months	g__Bifidobacterium	0.4120
Child 0-23 months	g__Bacteroides	0.1499
Child 0-23 months	g__Phocaeicola	0.1343
Child 0-23 months	g__Streptococcus	0.0793
Child 0-23 months	g__Mediterraneibacter	0.0743
Child 0-23 months	g__Anaerostipes	0.0578
Child 0-23 months	g__Escherichia	0.0391
Child 0-23 months	g__Collinsella	0.0374
Child 0-23 months	g__Megasphaera	0.0363
Child 0-23 months	g__Megamonas	0.0259
Child 24-59 months	g__Megamonas	0.1603
Child 24-59 months	g__Ruminococcus_B	0.1507
Child 24-59 months	g__Phocaeicola	0.1279
Child 24-59 months	g__Bacteroides	0.1163
Child 24-59 months	g__Bifidobacterium	0.0680
Child 24-59 months	g__Faecalibacterium	0.0484
Child 24-59 months	g__Ruminococcoides	0.0474
Child 24-59 months	g__Streptococcus	0.0453
Child 24-59 months	g__Blautia_A	0.0380
Child 24-59 months	g__Veillonella	0.0372
Adult	g__Prevotella	0.1795
Adult	g__Bacteroides	0.0794
Adult	g__Blautia_A	0.0731
Adult	g__Stercorousia	0.0523
Adult	g__Agathobacter	0.0466
Adult	g__Cryptobacteroides	0.0448
Adult	g__Faecalibacterium	0.0363
Adult	g__Alistipes	0.0295
Adult	g__Fimenecus	0.0289
Adult	g__CAG-877	0.0289

5.5.3 Genus heatmap

Code note

The stacked bar plot above is good for overall composition, but it becomes hard to see smaller differences.
This heatmap shows the same data in a matrix view, which makes it easier to spot sample-specific enrichment patterns.

Code

top_genera <- tax_without_parsed %>%
  filter(!is.na(genus), genus != "") %>%
  group_by(genus) %>%
  summarise(total_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  slice_head(n = 15) %>%
  pull(genus)

genus_plot_data <- tax_without_parsed %>%
  filter(genus %in% top_genera) %>%
  group_by(sample_name, studycode, analysis_group, genus) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    genus = factor(genus, levels = rev(top_genera))
  )

genus_fill_max_raw <- max(genus_plot_data$rel_abundance, na.rm = TRUE)
genus_fill_limit <- ((floor((100 * genus_fill_max_raw) / 10) + 1) * 10) / 100
genus_fill_breaks <- seq(0, genus_fill_limit, by = 0.1)

genus_plot <- ggplot(genus_plot_data, aes(x = studycode, y = genus, fill = rel_abundance)) +
  geom_tile(color = "white", linewidth = 0.2) +
  facet_wrap(~ analysis_group, scales = "free_x", labeller = group_labeller) +
  scale_fill_gradient(
    low = "grey95",
    high = "#2C7FB8",
    limits = c(0, genus_fill_limit),
    breaks = genus_fill_breaks,
    labels = percent_format(accuracy = 1)
  ) +
  labs(
    title = "Top genera across the six re-analyzed samples",
    x = "Study code",
    y = "Genus",
    fill = "Relative abundance"
  ) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

save_fig(genus_plot, "taxonomy_genus_heatmap.png", width = 10, height = 6.5)
genus_plot

Take-home note. The heatmap makes it easier to see sample-specific genus enrichment patterns that are harder to spot in the stacked bar plot.

Descriptive interpretation

The two youngest-child samples show more Actinomycetota than the adult samples, driven largely by Bifidobacterium.
Adult samples are richer in the classic adult-associated Bacillota + Bacteroidota balance.
The single older-child sample has a phylum profile that looks more adult-like than the two youngest samples, but a single sample is not enough to claim convergence.

5.6 Alpha diversity

Code note

Alpha diversity asks how diverse each sample is on its own.
We use the Shannon index because it captures both richness and evenness and is familiar to many microbiome readers.

Code

alpha_diversity <- tax_without_parsed %>%
  group_by(sample_name) %>%
  summarise(
    shannon = vegan::diversity(f_weighted_at_rank, index = "shannon"),
    observed_taxa = n_distinct(match_name),
    .groups = "drop"
  ) %>%
  left_join(metadata, by = "sample_name") %>%
  arrange(analysis_group, studycode)

knitr::kable(alpha_diversity, digits = 3)

sample_name	shannon	observed_taxa	studycode	age_month	date_of_sampling	to_sequence	status	analysis_group
ERR9904449	3.530	140	22EN-C-02	16	18-May-2017	1st only	Child	Child 0-23 months
ERR9904450	2.848	81	22EN-C-03	9	18-May-2017	1st only	Child	Child 0-23 months
ERR9904458	3.282	153	22EN-C-07	59	22-May-2017	1st only	Child	Child 24-59 months
ERR9904447	4.701	376	22EN-A-04	25	18-May-2017	1st only	Adult	Adult
ERR9904452	4.191	252	22EN-A-05	59	19-May-2017	1st only	Adult	Adult
ERR9904480	4.602	336	22EN-A-17	32	24-May-2017	2nd only	Adult	Adult

Code

alpha_jitter <- position_jitter(width = 0.06, height = 0, seed = 42)

alpha_plot <- ggplot(alpha_diversity, aes(x = analysis_group, y = shannon, color = analysis_group)) +
  geom_boxplot(width = 0.55, outlier.shape = NA, alpha = 0.2, linewidth = 0.4) +
  geom_point(aes(shape = analysis_group), size = 3, position = alpha_jitter) +
  ggrepel::geom_text_repel(
    aes(label = studycode),
    position = alpha_jitter,
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.35,
    point.padding = 0.2,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  scale_x_discrete(labels = group_labels) +
  scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  labs(
    #title = "Adults show the highest Shannon diversity in this 6-sample subset",
    x = "Analysis group",
    y = "Shannon diversity",
    color = "Group",
    shape = "Group"
  ) +
  theme(legend.position = "none")

save_fig(alpha_plot, "taxonomy_alpha_diversity.png", width = 5, height = 5)
alpha_plot

Take-home note. In this subset, adult samples sit at the higher end of Shannon diversity, while the child samples are consistently lower.

No formal groupwise alpha-diversity test is emphasized here. With group sizes of 2, 1, and 3, a p-value would suggest more precision than the data can support.

5.7 Beta diversity / ordination

Code note

Beta diversity compares samples against one another rather than within a single sample.
Here we build a species-by-sample matrix from f_weighted_at_rank, compute Bray-Curtis dissimilarity, and then use classical multidimensional scaling to visualize those distances.

Code

species_wide <- tax_without_parsed %>%
    select(sample_name, match_name, f_weighted_at_rank) %>%
    distinct() %>%
    pivot_wider(names_from = match_name, values_from = f_weighted_at_rank, values_fill = 0) %>%
    arrange(match(sample_name, sample_order))

species_mat <- as.matrix(as.data.frame(species_wide[, -1]))
rownames(species_mat) <- species_wide$sample_name

bray_dist <- vegdist(species_mat, method = "euclidean")
bray_cmd <- cmdscale(bray_dist, eig = TRUE, k = 2)

beta_diversity <- tibble::tibble(
    sample_name = rownames(bray_cmd$points),
    Axis1 = bray_cmd$points[, 1],
    Axis2 = bray_cmd$points[, 2]
) %>%
    left_join(metadata, by = "sample_name")

positive_eig <- bray_cmd$eig[bray_cmd$eig > 0]
axis1_var <- positive_eig[1] / sum(positive_eig)
axis2_var <- positive_eig[2] / sum(positive_eig)

pairwise_bray <- as.matrix(bray_dist) %>%
    as.data.frame() %>%
    tibble::rownames_to_column("sample_1") %>%
    pivot_longer(-sample_1, names_to = "sample_2", values_to = "bray_curtis") %>%
    filter(sample_1 < sample_2) %>%
    left_join(metadata %>% select(sample_name, group_1 = analysis_group), by = c("sample_1" = "sample_name")) %>%
    left_join(metadata %>% select(sample_name, group_2 = analysis_group), by = c("sample_2" = "sample_name")) %>%
    left_join(metadata %>% select(sample_name, studycode_1 = studycode), by = c("sample_1" = "sample_name")) %>%
    left_join(metadata %>% select(sample_name, studycode_2 = studycode), by = c("sample_2" = "sample_name")) %>%
    select(studycode_1, studycode_2, group_1, group_2, bray_curtis) %>%
    arrange(bray_curtis)

knitr::kable(pairwise_bray, digits = 3)

studycode_1	studycode_2	group_1	group_2	bray_curtis
22EN-A-04	22EN-A-17	Adult	Adult	0.167
22EN-A-05	22EN-A-17	Adult	Adult	0.196
22EN-A-04	22EN-A-05	Adult	Adult	0.219
22EN-C-02	22EN-A-17	Child 0-23 months	Adult	0.256
22EN-C-02	22EN-A-05	Child 0-23 months	Adult	0.257
22EN-A-04	22EN-C-02	Adult	Child 0-23 months	0.264
22EN-C-02	22EN-C-07	Child 0-23 months	Child 24-59 months	0.281
22EN-A-04	22EN-C-07	Adult	Child 24-59 months	0.289
22EN-C-07	22EN-A-17	Child 24-59 months	Adult	0.296
22EN-A-05	22EN-C-07	Adult	Child 24-59 months	0.300
22EN-C-02	22EN-C-03	Child 0-23 months	Child 0-23 months	0.322
22EN-C-03	22EN-A-17	Child 0-23 months	Adult	0.341
22EN-A-04	22EN-C-03	Adult	Child 0-23 months	0.345
22EN-C-03	22EN-A-05	Child 0-23 months	Adult	0.354
22EN-C-03	22EN-C-07	Child 0-23 months	Child 24-59 months	0.407

Code

beta_plot <- ggplot(beta_diversity, aes(x = Axis1, y = Axis2, color = analysis_group)) +
  geom_point(aes(shape = analysis_group), size = 3) +
  stat_ellipse(aes(group = analysis_group), linewidth = 0.6, linetype = 2, level = 0.8, show.legend = FALSE) +
  ggrepel::geom_text_repel(
    aes(label = studycode),
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.25,
    point.padding = 0.25,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.05))) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.05))) +
  labs(
    title = "Bray-Curtis ordination on species-level f_weighted_at_rank",
    x = sprintf("Axis 1 (%.1f%% explained)", 100 * axis1_var),
    y = sprintf("Axis 2 (%.1f%% explained)", 100 * axis2_var),
    color = "Group",
    shape = "Group"
  )

save_fig(beta_plot, "taxonomy_bray_curtis_ordination.png", width = 8.5, height = 5.8)
beta_plot

Take-home note. The ordination shows that the adult samples group more tightly together, while the child samples are farther away from that adult cluster.

5.8 Taxonomy interpretation

This subset is directionally consistent with the paper in some ways:

The adult samples cluster more closely to one another than the two youngest-child samples.
Adults also show the highest Shannon diversity in the subset.
The two youngest-child samples are compositionally distinct, especially because of stronger Actinomycetota / Bifidobacterium signal and lower Shannon diversity.

However, the subset does not fully recapitulate the paper’s age gradient:

There is only one older-child sample.
That older-child sample has a more adult-like phylum balance than the younger-child samples, but it still remains Bray-Curtis-distant from the adults.
Therefore, we can say the local data are compatible with the reported maturation trend, but this subset is too small to claim a clean replication.

Tip

Back to workflow setup | Continue to AMR Analysis & Summary.