Clinical Foundations & AMR Surveillance: Taxonomy Analysis

Classified and unclassified signal in a six-sample Vietnamese gut microbiome subset

Author

Hao Chung The, Minh-Quan Ton-Ngoc

Published

May 15, 2026

1 R-based exploratory analysis

The sections below assume that the pipeline outputs above already exist. From this point onward, the notebook shifts from pipeline execution to exploratory interpretation.

Code
library(dplyr)
library(tidyr)
library(readr)
library(ggplot2)
library(ggrepel)
library(phyloseq)
library(vegan)
library(patchwork)
library(scales)
library(data.table)
library(broom)

data_dir <- file.path("..", "materials", "vietnamese-healthy-gut-microbiome")

group_palette <- c(
  "Child 0-23 months" = "#D55E00",
  "Child 24-59 months" = "#0072B2",
  "Adult" = "#009E73"
)

group_shapes <- c(
  "Child 0-23 months" = 16,
  "Child 24-59 months" = 17,
  "Adult" = 15
)

group_labels <- c(
  "Child 0-23 months" = "Child\n(0-23 months)",
  "Child 24-59 months" = "Child\n(24-59 months)",
  "Adult" = "Adult"
)

group_labeller <- labeller(analysis_group = as_labeller(group_labels))

my_color <- c(
  "#4e79a7", "#f28e2b", "#59a14f", "#b6992d",
  "#499894", "#e15759", "#b07aa1", "#9d7660"
)

make_fill_palette <- function(levels_vector, other_label = "Other") {
  core_levels <- setdiff(levels_vector, other_label)
  n_cols <- length(core_levels)
  if (n_cols > length(my_color)) {
    extra_cols <- scales::hue_pal()(n_cols - length(my_color))
    core_cols <- c(my_color, extra_cols)
  } else {
    core_cols <- my_color[seq_len(n_cols)]
  }
  cols <- setNames(core_cols, core_levels)
  if (other_label %in% levels_vector) {
    cols[other_label] <- "grey85"
  }
  cols[levels_vector]
}

paper_amr_class_levels <- c(
  "Aminoglycosides", "β-lactamases", "Fosfomycin", "Fluoroquinolones",
  "Glycopeptides", "MLS", "Chloramphenicol", "Sulfonamides",
  "Tetracyclines", "Trimethoprim", "Other"
)

map_paper_amr_class <- function(drug_class) {
  case_when(
    drug_class %in% c("aminoglycoside antibiotic") ~ "Aminoglycosides",
    drug_class %in% c("cephalosporin", "penicillin beta-lactam", "carbapenem", "monobactam") ~ "β-lactamases",
    drug_class %in% c("phosphonic acid antibiotic") ~ "Fosfomycin",
    drug_class %in% c("fluoroquinolone antibiotic") ~ "Fluoroquinolones",
    drug_class %in% c("glycopeptide antibiotic") ~ "Glycopeptides",
    drug_class %in% c(
      "macrolide antibiotic", "lincosamide antibiotic", "streptogramin antibiotic",
      "streptogramin A antibiotic", "streptogramin B antibiotic"
    ) ~ "MLS",
    drug_class %in% c("phenicol antibiotic") ~ "Chloramphenicol",
    drug_class %in% c("sulfonamide antibiotic") ~ "Sulfonamides",
    drug_class %in% c("tetracycline antibiotic", "glycylcycline") ~ "Tetracyclines",
    drug_class %in% c("diaminopyrimidine antibiotic") ~ "Trimethoprim",
    TRUE ~ "Other"
  )
}

theme_set(
  theme_bw(base_size = 12) +
    theme(
      panel.grid.minor = element_blank(),
      plot.title.position = "plot",
      legend.position = "right"
    )
)

fig_dir <- "figures"
dir.create(fig_dir, showWarnings = FALSE)

save_fig <- function(plot, filename, width = 9, height = 5.5, dpi = 300) {
  ggsave(
    filename = file.path(fig_dir, filename),
    plot = plot,
    width = width,
    height = height,
    dpi = dpi,
    bg = "white"
  )
}

parse_lineage <- function(df) {
  df %>%
    tidyr::separate(
      col = lineage,
      into = c("domain", "phylum", "class", "order", "family", "genus", "species"),
      sep = ";",
      fill = "right",
      extra = "drop",
      remove = FALSE
    )
}

trim_sample_ids <- function(x) trimws(as.character(x))

Code note

  • This setup chunk loads the R packages used throughout the notebook and defines a few helper objects.
  • The notebook now uses ggplot2 only for plotting, to keep figure sizing predictable in the rendered output.
  • For color, the stacked bar plots use a richer custom palette, while group-level plots still combine color and point shape so they remain readable on screen and when printed.
  • The default ggplot2 theme is theme_bw(), so any plot built directly with ggplot2 follows the workshop style guide.

2 Context from the original paper

Before looking at the 6 local samples, we extract the main analytical context from the published article and the local summary notes.

Paper-level findings that are reasonable to compare against in a workshop subset

  • The study analyzed 42 healthy Vietnamese participants split into 0-23 months, 2-5 years, and adults.
  • The paper reported a clear age-related microbiome structure difference, with children under 2 years showing a distinct gut community structure.
  • Alpha diversity was lower in the youngest children than in older children and adults.
  • Beta diversity showed broad overlap between older children and adults, while younger children were more distinct and more dispersed.
  • The healthy gut resistome was described as a substantial AMR reservoir, with tetracycline and MLS resistance genes especially common.
  • Genes highlighted in the paper included ermB, lnuC, tetM, tetO, tetQ, tetW, and tet-32 as near-ubiquitous signals.
  • The paper reported a higher AMR burden in the youngest children, and noted clinically important genes such as blaCTX-M and mphA/mphD in younger children.
Code
paper_context <- tibble::tibble(
  theme = c(
    "Age-related microbiome structure",
    "Alpha diversity",
    "Beta diversity",
    "Broad AMR burden",
    "AMR classes emphasized",
    "AMR genes emphasized"
  ),
  published_finding = c(
    "Children under 2 years had a distinct gut microbiome structure relative to older children and adults.",
    "Youngest children had lower Shannon diversity; older children were closer to adults.",
    "Bray-Curtis ordination showed young children dispersed away from the overlapping older-child and adult cloud.",
    "Youngest children carried the greatest number of detected AMR genes.",
    "Tetracycline and MLS resistance were nearly ubiquitous.",
    "ermB, lnuC, tet-32, tetM, tetO, tetQ, tetW were common; blaCTX-M and mphA/mphD were highlighted as clinically important signals."
  )
)

knitr::kable(paper_context, col.names = c("Theme", "Finding extracted from paper / summary"))
Theme Finding extracted from paper / summary
Age-related microbiome structure Children under 2 years had a distinct gut microbiome structure relative to older children and adults.
Alpha diversity Youngest children had lower Shannon diversity; older children were closer to adults.
Beta diversity Bray-Curtis ordination showed young children dispersed away from the overlapping older-child and adult cloud.
Broad AMR burden Youngest children carried the greatest number of detected AMR genes.
AMR classes emphasized Tetracycline and MLS resistance were nearly ubiquitous.
AMR genes emphasized ermB, lnuC, tet-32, tetM, tetO, tetQ, tetW were common; blaCTX-M and mphA/mphD were highlighted as clinically important signals.

3 Analysis-ready files used below

The R analysis below assumes that the workflow and post-processing steps have already produced an analysis folder containing:

  • Metadata for the 6 re-analyzed samples.
  • Taxonomy tables with and without unclassified rows.
  • A vietnamese_gut.phyloseq folder that contains otu_table.csv, sample_data.csv, and tax_table.csv.
  • A vietnamese_gut_amr folder with one rgi-bwt gene mapping table per sample.
  • The original paper PDF and a local summary file used only for context.
Code
base_dir <- "."

paths <- tibble::tibble(
  object = c(
    "metadata",
    "taxonomy_with_unclassified",
    "taxonomy_without_unclassified",
    "phyloseq_export_dir",
    "amr_dir",
    "local_summary_md"
  ),
  file_name = c(
    "vietnamese_gut_sample_information.csv",
    "vietnamese_gut_with_unclassified.csv",
    "vietnamese_gut_without_unclassified.csv",
    "vietnamese_gut.phyloseq",
    "vietnamese_gut_amr",
    "vietnamese_gut_summary.md"
  )
) %>%
  mutate(
  path = file.path(data_dir, file_name),
  exists = file.exists(path) | dir.exists(path)
)

knitr::kable(paths)
object file_name path exists
metadata vietnamese_gut_sample_information.csv ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_sample_information.csv TRUE
taxonomy_with_unclassified vietnamese_gut_with_unclassified.csv ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_with_unclassified.csv TRUE
taxonomy_without_unclassified vietnamese_gut_without_unclassified.csv ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_without_unclassified.csv TRUE
phyloseq_export_dir vietnamese_gut.phyloseq ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut.phyloseq TRUE
amr_dir vietnamese_gut_amr ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_amr TRUE
local_summary_md vietnamese_gut_summary.md ../materials/vietnamese-healthy-gut-microbiome/vietnamese_gut_summary.md FALSE

4 Metadata inspection

4.1 Load data

Code note

  • This chunk reads the local sample metadata and creates a simple analysis_group variable that we can use consistently in plots and summaries.
  • The grouping is intentionally simple because the metadata for this workshop subset are limited.
Code
metadata <- read_csv(file.path(data_dir, "vietnamese_gut_sample_information.csv"), show_col_types = FALSE) %>%
  mutate(
    sample_name = trim_sample_ids(sample_name),
    studycode = trim_sample_ids(studycode),
    status = recode(status, "Children" = "Child", .default = status),
    analysis_group = case_when(
      status == "Adult" ~ "Adult",
      age_month < 24 ~ "Child 0-23 months",
      TRUE ~ "Child 24-59 months"
    ),
    analysis_group = factor(
      analysis_group,
      levels = c("Child 0-23 months", "Child 24-59 months", "Adult")
    )
  ) %>%
  arrange(analysis_group, studycode)

sample_order <- metadata %>%
  arrange(analysis_group, studycode) %>%
  pull(sample_name)

studycode_order <- metadata %>%
  arrange(analysis_group, studycode) %>%
  pull(studycode)

metadata
# A tibble: 6 × 7
  studycode age_month date_of_sampling to_sequence sample_name status
  <chr>         <dbl> <chr>            <chr>       <chr>       <chr> 
1 22EN-C-02        16 18-May-2017      1st only    ERR9904449  Child 
2 22EN-C-03         9 18-May-2017      1st only    ERR9904450  Child 
3 22EN-C-07        59 22-May-2017      1st only    ERR9904458  Child 
4 22EN-A-04        25 18-May-2017      1st only    ERR9904447  Adult 
5 22EN-A-05        59 19-May-2017      1st only    ERR9904452  Adult 
6 22EN-A-17        32 24-May-2017      2nd only    ERR9904480  Adult 
# ℹ 1 more variable: analysis_group <fct>

Metadata notes

  • status is the most reliable top-level grouping variable in the local file.
  • For children, age_month is usable as age in months.
  • For adults, the age_month column appears to contain values such as 25, 32, and 59, which are much more plausible as adult ages in years than months.
  • For that reason, this notebook uses analysis_group derived from status and child age thresholds, and uses the adult age values only as descriptive metadata.
Code
metadata_summary <- metadata %>%
  count(status, analysis_group, name = "n_samples")

knitr::kable(metadata_summary)
status analysis_group n_samples
Adult Adult 3
Child Child 0-23 months 2
Child Child 24-59 months 1

4.2 Harmonize sample IDs across inputs

Code note

  • This chunk checks whether the same sample IDs are present in metadata, taxonomy tables, the exported phyloseq tables, and the AMR result files.
  • Harmonizing IDs early helps avoid silent joins to the wrong sample later in the notebook.
  • We keep sample_name for joins, but use studycode later for plotting because it is more compact on figure axes.
Code
tax_with <- read_csv(file.path(data_dir, "vietnamese_gut_with_unclassified.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

tax_without <- read_csv(file.path(data_dir, "vietnamese_gut_without_unclassified.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

phy_sample_data <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "sample_data.csv"), show_col_types = FALSE) %>%
  mutate(sample_name = trim_sample_ids(sample_name))

amr_files <- list.files(file.path(data_dir, "vietnamese_gut_amr"), pattern = "gene_mapping_data\\.txt$", full.names = TRUE)
amr_file_samples <- sub("\\.gene_mapping_data\\.txt$", "", basename(amr_files))

sample_harmonization <- tibble::tibble(
  source = c("metadata", "taxonomy_with", "taxonomy_without", "phyloseq_export", "amr_files"),
  n_unique_samples = c(
    n_distinct(metadata$sample_name),
    n_distinct(tax_with$sample_name),
    n_distinct(tax_without$sample_name),
    n_distinct(phy_sample_data$sample_name),
    length(unique(amr_file_samples))
  ),
  sample_ids = c(
    paste(sort(unique(metadata$sample_name)), collapse = ", "),
    paste(sort(unique(tax_with$sample_name)), collapse = ", "),
    paste(sort(unique(tax_without$sample_name)), collapse = ", "),
    paste(sort(unique(phy_sample_data$sample_name)), collapse = ", "),
    paste(sort(unique(amr_file_samples)), collapse = ", ")
  )
)

knitr::kable(sample_harmonization)
source n_unique_samples sample_ids
metadata 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
taxonomy_with 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
taxonomy_without 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
phyloseq_export 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480
amr_files 6 ERR9904447, ERR9904449, ERR9904450, ERR9904452, ERR9904458, ERR9904480

All four sources contain the same six sample IDs, so harmonization is straightforward.

4.3 Inspect the exported phyloseq components

Code
phy_otu_export <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "otu_table.csv"), show_col_types = FALSE)
phy_tax_export <- read_csv(file.path(data_dir, "vietnamese_gut.phyloseq", "tax_table.csv"), show_col_types = FALSE)

phyloseq_export_status <- tibble::tibble(
  file = c("sample_data.csv", "otu_table.csv", "tax_table.csv"),
  rows = c(nrow(phy_sample_data), nrow(phy_otu_export), nrow(phy_tax_export)),
  columns = c(ncol(phy_sample_data), ncol(phy_otu_export), ncol(phy_tax_export))
)

knitr::kable(phyloseq_export_status)
file rows columns
sample_data.csv 6 2
otu_table.csv 736 7
tax_table.csv 736 8

The vietnamese_gut.phyloseq path is not a serialized .rds phyloseq object. It is an export directory containing the component tables needed to reconstruct a phyloseq object.

4.4 Inspect the AMR folder layout

Code
amr_file_inventory <- tibble::tibble(
  file = basename(amr_files),
  sample_name = sub("\\.gene_mapping_data\\.txt$", "", basename(amr_files)),
  size_bytes = file.info(amr_files)$size
)

knitr::kable(amr_file_inventory)
file sample_name size_bytes
ERR9904447.gene_mapping_data.txt ERR9904447 109567
ERR9904449.gene_mapping_data.txt ERR9904449 100676
ERR9904450.gene_mapping_data.txt ERR9904450 96742
ERR9904452.gene_mapping_data.txt ERR9904452 80172
ERR9904458.gene_mapping_data.txt ERR9904458 103677
ERR9904480.gene_mapping_data.txt ERR9904480 126389
Code
amr_example <- fread(amr_files[1], sep = "\t", data.table = FALSE)

knitr::kable(head(amr_example, 8))
ARO Term ARO Accession Reference Model Type Reference DB Alleles with Mapped Reads Reference Allele(s) Identity to CARD Reference Protein (%) Resistomes & Variants: Observed in Genome(s) Resistomes & Variants: Observed in Plasmid(s) Resistomes & Variants: Observed Pathogen(s) Completely Mapped Reads Mapped Reads with Flanking Sequence All Mapped Reads Average Percent Coverage Average Length Coverage (bp) Average MAPQ (Completely Mapped Reads) Number of Mapped Baits Number of Mapped Baits with Reads Average Number of reads per Bait Number of reads per Bait Coefficient of Variation (%) Number of reads mapping to baits and mapping to complete gene Number of reads mapping to baits and mapping to complete gene (%) Mate Pair Linkage (# reads) Reference Length AMR Gene Family Drug Class Resistance Mechanism
ErmG 3000522 protein homolog model CARD; Resistomes & Variants 2 98.36 - 100.0 no data no data Bacteroides thetaiotaomicron 2 0 2 16.05 118.00 189.50 0 0 0 0 N/A N/A NA 735; 735 Erm 23S ribosomal RNA methyltransferase macrolide antibiotic; lincosamide antibiotic; streptogramin antibiotic; streptogramin A antibiotic; streptogramin B antibiotic antibiotic target alteration
AAC(6’)-Iw 3002567 protein homolog model CARD 1 100.0 no data no data Acinetobacter sp. 1 0 1 3.85 17.00 23.00 0 0 0 0 N/A N/A NA 441 AAC(6’) aminoglycoside antibiotic antibiotic inactivation
tet(W) 3000194 protein homolog model CARD; Resistomes & Variants 22 95.85 - 100.0 YES no data Butyrivibrio fibrisolvens; Bifidobacterium longum; Collinsella aerofaciens; Faecalibacterium prausnitzii; Lactobacillus johnsonii; Eggerthella lenta; Eubacterium maltosivorans; Lactobacillus crispatus; Streptococcus suis; Bifidobacterium breve 1247 0 1247 32.04 575.82 160.75 0 0 0 0 N/A N/A NA 1920; 1920; 1920; 2082; 1617; 1920; 1920; 1959; 2070; 1920; 1920; 1920; 1920; 1920; 1920; 1920; 1920; 1932; 1920; 1920; 1920; 219 tetracycline-resistant ribosomal protection protein tetracycline antibiotic antibiotic target protection
CGB-1 3000841 protein homolog model CARD 1 100.0 no data no data Chryseobacterium gleum 2 0 2 2.74 20.00 40.00 0 0 0 0 N/A N/A NA 729 CGB beta-lactamase carbapenem; cephalosporin; penicillin beta-lactam antibiotic inactivation
vanL 3002910 protein homolog model CARD 1 100.0 no data no data Enterococcus faecalis 1 0 1 3.05 32.00 56.00 0 0 0 0 N/A N/A NA 1050 glycopeptide resistance gene cluster; Van ligase glycopeptide antibiotic antibiotic target alteration
lsaB 3003111 protein homolog model CARD 1 100.0 no data no data Mammaliicoccus sciuri 14 0 14 2.91 43.00 39.64 0 0 0 0 N/A N/A NA 1479 lsa-type ABC-F protein lincosamide antibiotic; streptogramin antibiotic; pleuromutilin antibiotic antibiotic target protection
ErmQ 3000593 protein homolog model CARD 1 100.0 no data no data Clostridium perfringens 4 0 4 44.44 344.00 160.00 0 0 0 0 N/A N/A NA 774 Erm 23S ribosomal RNA methyltransferase macrolide antibiotic; lincosamide antibiotic; streptogramin antibiotic; streptogramin A antibiotic; streptogramin B antibiotic antibiotic target alteration
catP 3002686 protein homolog model CARD 1 100.0 no data no data Clostridium perfringens 30 0 30 100.00 624.00 156.97 0 0 0 0 N/A N/A NA 624 chloramphenicol acetyltransferase (CAT) phenicol antibiotic antibiotic inactivation

The AMR files are one table per sample and include:

  • ARO term
  • reference model metadata
  • mapped read counts
  • average coverage
  • AMR gene family
  • drug class
  • resistance mechanism

This is sufficient for descriptive gene-level and drug-class summaries.

5 Taxonomy analysis

5.1 Load data

Code note

  • Here we load the two taxonomy tables and split the lineage string into standard taxonomic ranks.
  • We then join the taxonomy data back to the metadata so every row knows which sample and age group it belongs to.
Code
tax_with_parsed <- parse_lineage(tax_with) %>%
  left_join(metadata, by = "sample_name")

tax_without_parsed <- parse_lineage(tax_without) %>%
  left_join(metadata, by = "sample_name")

tax_structure_summary <- tibble::tibble(
  table = c("with_unclassified", "without_unclassified"),
  rows = c(nrow(tax_with_parsed), nrow(tax_without_parsed)),
  unique_samples = c(n_distinct(tax_with_parsed$sample_name), n_distinct(tax_without_parsed$sample_name)),
  unique_matches = c(n_distinct(tax_with_parsed$match_name), n_distinct(tax_without_parsed$match_name))
)

knitr::kable(tax_structure_summary)
table rows unique_samples unique_matches
with_unclassified 1344 6 737
without_unclassified 1338 6 736

5.2 Inspect data structure

Code
taxonomy_numeric_checks <- bind_rows(
  tax_with_parsed %>%
    group_by(sample_name) %>%
    summarise(
      table = "with_unclassified",
      f_weighted_sum = sum(f_weighted_at_rank, na.rm = TRUE),
      unweighted_sum = sum(unweighted_fraction, na.rm = TRUE),
      .groups = "drop"
    ),
  tax_without_parsed %>%
    group_by(sample_name) %>%
    summarise(
      table = "without_unclassified",
      f_weighted_sum = sum(f_weighted_at_rank, na.rm = TRUE),
      unweighted_sum = sum(unweighted_fraction, na.rm = TRUE),
      .groups = "drop"
    )
) %>%
  arrange(table, sample_name)

knitr::kable(taxonomy_numeric_checks, digits = 4)
sample_name table f_weighted_sum unweighted_sum
ERR9904447 with_unclassified 1 1.0000
ERR9904449 with_unclassified 1 1.0000
ERR9904450 with_unclassified 1 1.0000
ERR9904452 with_unclassified 1 1.0000
ERR9904458 with_unclassified 1 1.0000
ERR9904480 with_unclassified 1 1.0000
ERR9904447 without_unclassified 1 0.5407
ERR9904449 without_unclassified 1 0.5545
ERR9904450 without_unclassified 1 0.4473
ERR9904452 without_unclassified 1 0.5550
ERR9904458 without_unclassified 1 0.5369
ERR9904480 without_unclassified 1 0.5560

Two points matter here:

  • f_weighted_at_rank sums to approximately 1 within each sample in both taxonomy tables, so it is appropriate for abundance-based ecological summaries.
  • unweighted_fraction does not behave like a standard compositional percentage and should not be interpreted as if it must sum to 1 or 100.

5.3 Compare with vs without unclassified

Code note

  • This section isolates the rows labelled unclassified in the full taxonomy table.
  • The goal is to show workshop participants how much of each sample remains taxonomically unresolved before using the renormalized table for ecological summaries.
Code
unclassified_summary <- tax_with_parsed %>%
  filter(tolower(lineage) == "unclassified") %>%
  transmute(
    sample_name,
    analysis_group,
    unclassified_f_weighted = f_weighted_at_rank,
    unclassified_unweighted = unweighted_fraction
  ) %>%
  arrange(analysis_group, sample_name)

knitr::kable(unclassified_summary, digits = 4)
sample_name analysis_group unclassified_f_weighted unclassified_unweighted
ERR9904449 Child 0-23 months 0.4177 0.4455
ERR9904450 Child 0-23 months 0.3812 0.5527
ERR9904458 Child 24-59 months 0.3779 0.4631
ERR9904447 Adult 0.4890 0.4593
ERR9904452 Adult 0.4177 0.4450
ERR9904480 Adult 0.4896 0.4440
Code
unclassified_plot_data <- unclassified_summary %>%
    pivot_longer(
        cols = c(unclassified_f_weighted, unclassified_unweighted),
        names_to = "metric",
        values_to = "fraction"
    ) %>%
    mutate(
        metric = recode(
            metric,
            unclassified_f_weighted = "f_weighted_at_rank",
            unclassified_unweighted = "unweighted_fraction"
        ),
        fraction_percent = 100 * fraction,
        studycode = factor(metadata$studycode[match(sample_name, metadata$sample_name)], levels = studycode_order)
    )

unclassified_segments <- unclassified_plot_data %>%
    group_by(sample_name, studycode, analysis_group) %>%
    summarise(
        y_min = min(fraction_percent, na.rm = TRUE),
        y_max = max(fraction_percent, na.rm = TRUE),
        .groups = "drop"
    ) %>%
    mutate(studycode = factor(studycode, levels = studycode_order))

unclassified_plot <- ggplot(unclassified_plot_data, aes(x = studycode, y = fraction_percent)) +
    geom_segment(
        data = unclassified_segments,
        aes(x = studycode, xend = studycode, y = y_min, yend = y_max),
        linewidth = 0.7,
        color = "grey55"
    ) +
    geom_point(aes(fill = metric, shape = metric), size = 3.2, color = "black", stroke = 0.35) +
    facet_wrap(~analysis_group, scales = "free_x", labeller = group_labeller) +
    scale_y_continuous(
        labels = label_percent(scale = 1, accuracy = 1),
        limits = c(0, 100),
        expand = expansion(mult = c(0, 0.02))
    ) +
    scale_fill_manual(values = c("f_weighted_at_rank" = "#4e79a7", "unweighted_fraction" = "#e15759")) +
    scale_shape_manual(values = c("f_weighted_at_rank" = 21, "unweighted_fraction" = 24)) +
    labs(
        title = "Large unclassified fractions remain in all six samples",
        # subtitle = "Within each sample, the two points are connected vertically to compare f_weighted_at_rank and unweighted_fraction directly",
        x = "Study code",
        y = "Fraction assigned to 'unclassified'",
        fill = "Metric",
        shape = "Metric"
    ) +
    theme(axis.text.x = element_text(angle = 20, hjust = 1))

save_fig(unclassified_plot, "taxonomy_unclassified_fraction.png", width = 8, height = 5)
unclassified_plot

Take-home note. The gap between the two points for each sample shows how much the interpretation changes depending on whether we use the weighted classified fraction or the unweighted fraction.

In this subset, the weighted unclassified fraction ranges from roughly 38% to 49%. That is large enough that analyses using the table without unclassified rows should be interpreted as composition conditional on classified signal, not as a complete description of all reads.

When a sample has a large unclassified fraction, what is the safest way to read the classified-only composition table?

Hint

Ask what information is lost when the database cannot confidently assign a large part of the sample.

5.4 Reconstruct a phyloseq object for the demonstration

Code note

  • This chunk converts the long-format taxonomy table into the three core pieces needed by phyloseq: an OTU table, a taxonomy table, and sample metadata.
  • We do not need phyloseq for every step, but it is useful to show how these data could be reused in a standard microbiome workflow.
Code
otu_demo <- tax_without_parsed %>%
  select(match_name, sample_name, f_weighted_at_rank) %>%
  distinct() %>%
  pivot_wider(names_from = sample_name, values_from = f_weighted_at_rank, values_fill = 0)

tax_demo <- tax_without_parsed %>%
  select(match_name, domain, phylum, class, order, family, genus, species) %>%
  distinct()

sample_demo <- metadata %>%
  select(sample_name, studycode, status, age_month, analysis_group)

otu_mat <- as.matrix(as.data.frame(otu_demo[, -1]))
rownames(otu_mat) <- otu_demo$match_name

tax_mat <- as.matrix(as.data.frame(tax_demo[, -1]))
rownames(tax_mat) <- tax_demo$match_name

sample_df <- as.data.frame(sample_demo)
rownames(sample_df) <- sample_df$sample_name
sample_df$sample_name <- NULL

phy_demo <- phyloseq(
  otu_table(otu_mat, taxa_are_rows = TRUE),
  tax_table(tax_mat),
  sample_data(sample_df)
)

tibble::tibble(
  nsamples = nsamples(phy_demo),
  ntaxa = ntaxa(phy_demo),
  rank_names = paste(rank_names(phy_demo), collapse = ", ")
)
# A tibble: 1 × 3
  nsamples ntaxa rank_names                                          
     <int> <int> <chr>                                               
1        6   736 domain, phylum, class, order, family, genus, species

This reconstructed object is convenient for workshop workflows, but the ecology below still uses the same underlying f_weighted_at_rank abundances from vietnamese_gut_without_unclassified.csv.

5.5 Composition plots

5.5.1 Phylum composition

Code note

  • We first rank phyla by their total abundance across all six samples.
  • To keep the stacked bar plot readable, we keep the top 10 phyla and collapse the remainder into Other.
  • The x-axis uses studycode rather than sample_name, which keeps the figure easier to read.
  • Within each sample, the stack is ordered from highest abundance at the bottom to lower abundance above, with Other forced to the top.
  • The stack order is controlled explicitly so the dominant phyla sit lower in each bar and Other remains at the top of each sample.
Code
top_phyla <- tax_without_parsed %>%
  group_by(phylum) %>%
  summarise(total_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  slice_head(n = 10) %>%
  pull(phylum)

phylum_plot_data <- tax_without_parsed %>%
  mutate(phylum_plot = if_else(phylum %in% top_phyla, phylum, "Other")) %>%
  group_by(sample_name, studycode, analysis_group, phylum_plot) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(sample_name, studycode, analysis_group) %>%
  mutate(stack_order = if_else(phylum_plot == "Other", -1, rank(rel_abundance, ties.method = "first"))) %>%
  ungroup()

phylum_levels <- phylum_plot_data %>%
  group_by(phylum_plot) %>%
  summarise(total_abundance = sum(rel_abundance), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  pull(phylum_plot)

phylum_fill <- make_fill_palette(phylum_levels)

phylum_plot_data <- phylum_plot_data %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    phylum_plot = factor(phylum_plot, levels = phylum_levels)
  )

phylum_plot <- ggplot(phylum_plot_data, aes(x = studycode, y = rel_abundance, fill = phylum_plot, order = stack_order)) +
  geom_col(width = 0.75, color = "grey15", linewidth = 0.15) +
  facet_wrap(~ analysis_group, scales = "free_x", labeller = group_labeller) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 1),
    expand = expansion(mult = c(0, 0.02))
  ) +
  scale_fill_manual(values = phylum_fill, breaks = phylum_levels) +
  labs(
    title = "Phylum-level composition after removing unclassified rows",
    subtitle = "Top 10 phyla by overall relative abundance; 'Other' is stacked on top within each sample",
    x = "Study code",
    y = "Relative abundance (f_weighted_at_rank)",
    fill = "Phylum"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1),
    legend.position = "right"
  )

save_fig(phylum_plot, "taxonomy_phylum_composition.png", width = 8, height = 5.5)
phylum_plot

Take-home note. This phylum-level stacked bar plot gives the quickest visual summary of how the dominant classified signal shifts across the six samples.

Code
phylum_top10_by_sample <- tax_without_parsed %>%
  group_by(studycode, analysis_group, phylum) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(studycode, analysis_group) %>%
  arrange(desc(rel_abundance), .by_group = TRUE) %>%
  slice_head(n = 10) %>%
  ungroup()

phylum_top10_by_group <- tax_without_parsed %>%
  group_by(studycode, analysis_group, phylum) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(analysis_group, phylum) %>%
  summarise(mean_rel_abundance = mean(rel_abundance), .groups = "drop") %>%
  group_by(analysis_group) %>%
  arrange(desc(mean_rel_abundance), .by_group = TRUE) %>%
  slice_head(n = 10) %>%
  ungroup()

knitr::kable(phylum_top10_by_sample, digits = 4, caption = "Top 10 phyla by relative abundance within each sample")
Top 10 phyla by relative abundance within each sample
studycode analysis_group phylum rel_abundance
22EN-A-04 Adult p__Bacillota 0.5340
22EN-A-04 Adult p__Bacteroidota 0.4262
22EN-A-04 Adult p__Pseudomonadota 0.0133
22EN-A-04 Adult p__Actinomycetota 0.0107
22EN-A-04 Adult p__Methanobacteriota 0.0095
22EN-A-04 Adult p__Desulfobacterota_I 0.0054
22EN-A-04 Adult p__Verrucomicrobiota 0.0008
22EN-A-04 Adult p__Fusobacteriota 0.0002
22EN-A-05 Adult p__Bacillota 0.5304
22EN-A-05 Adult p__Bacteroidota 0.3622
22EN-A-05 Adult p__Cyanobacteriota 0.0523
22EN-A-05 Adult p__Actinomycetota 0.0523
22EN-A-05 Adult p__Desulfobacterota_I 0.0016
22EN-A-05 Adult p__Pseudomonadota 0.0009
22EN-A-05 Adult p__Verrucomicrobiota 0.0003
22EN-A-17 Adult p__Bacillota 0.5838
22EN-A-17 Adult p__Bacteroidota 0.3094
22EN-A-17 Adult p__Pseudomonadota 0.0785
22EN-A-17 Adult p__Actinomycetota 0.0168
22EN-A-17 Adult p__Desulfobacterota_I 0.0083
22EN-A-17 Adult p__Cyanobacteriota 0.0027
22EN-A-17 Adult p__Fusobacteriota 0.0004
22EN-C-02 Child 0-23 months p__Actinomycetota 0.3518
22EN-C-02 Child 0-23 months p__Bacillota 0.3461
22EN-C-02 Child 0-23 months p__Bacteroidota 0.2856
22EN-C-02 Child 0-23 months p__Pseudomonadota 0.0152
22EN-C-02 Child 0-23 months p__Desulfobacterota_I 0.0008
22EN-C-02 Child 0-23 months p__Fusobacteriota 0.0005
22EN-C-03 Child 0-23 months p__Actinomycetota 0.5505
22EN-C-03 Child 0-23 months p__Bacillota 0.3671
22EN-C-03 Child 0-23 months p__Pseudomonadota 0.0788
22EN-C-03 Child 0-23 months p__Bacteroidota 0.0036
22EN-C-07 Child 24-59 months p__Bacillota 0.6576
22EN-C-07 Child 24-59 months p__Bacteroidota 0.2635
22EN-C-07 Child 24-59 months p__Actinomycetota 0.0734
22EN-C-07 Child 24-59 months p__Pseudomonadota 0.0047
22EN-C-07 Child 24-59 months p__Desulfobacterota_I 0.0005
22EN-C-07 Child 24-59 months p__Fusobacteriota 0.0002
22EN-C-07 Child 24-59 months p__Patescibacteriota 0.0001
Code
knitr::kable(phylum_top10_by_group, digits = 4, caption = "Top 10 phyla by mean relative abundance within each age group")
Top 10 phyla by mean relative abundance within each age group
analysis_group phylum mean_rel_abundance
Child 0-23 months p__Actinomycetota 0.4511
Child 0-23 months p__Bacillota 0.3566
Child 0-23 months p__Bacteroidota 0.1446
Child 0-23 months p__Pseudomonadota 0.0470
Child 0-23 months p__Desulfobacterota_I 0.0008
Child 0-23 months p__Fusobacteriota 0.0005
Child 24-59 months p__Bacillota 0.6576
Child 24-59 months p__Bacteroidota 0.2635
Child 24-59 months p__Actinomycetota 0.0734
Child 24-59 months p__Pseudomonadota 0.0047
Child 24-59 months p__Desulfobacterota_I 0.0005
Child 24-59 months p__Fusobacteriota 0.0002
Child 24-59 months p__Patescibacteriota 0.0001
Adult p__Bacillota 0.5494
Adult p__Bacteroidota 0.3659
Adult p__Pseudomonadota 0.0309
Adult p__Cyanobacteriota 0.0275
Adult p__Actinomycetota 0.0266
Adult p__Methanobacteriota 0.0095
Adult p__Desulfobacterota_I 0.0051
Adult p__Verrucomicrobiota 0.0006
Adult p__Fusobacteriota 0.0003

5.5.2 Genus composition

Code note

  • A stacked bar plot at genus level is harder to read if we include too many genera.
  • We therefore keep the top 10 genera across the dataset and collapse the rest into Other.
  • This gives a sample-level overview that is more intuitive for workshop participants before moving to the genus heatmap.
Code
top_genera_bar <- tax_without_parsed %>%
  filter(!is.na(genus), genus != "") %>%
  group_by(genus) %>%
  summarise(total_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  slice_head(n = 10) %>%
  pull(genus)

genus_bar_plot_data <- tax_without_parsed %>%
  mutate(genus_plot = if_else(!is.na(genus) & genus %in% top_genera_bar, genus, "Other")) %>%
  group_by(sample_name, studycode, analysis_group, genus_plot) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(sample_name, studycode, analysis_group) %>%
  mutate(stack_order = if_else(genus_plot == "Other", -1, rank(rel_abundance, ties.method = "first"))) %>%
  ungroup()

genus_levels <- genus_bar_plot_data %>%
  group_by(genus_plot) %>%
  summarise(total_abundance = sum(rel_abundance), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  pull(genus_plot)

genus_fill <- make_fill_palette(genus_levels)

genus_bar_plot_data <- genus_bar_plot_data %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    genus_plot = factor(genus_plot, levels = genus_levels)
  )

genus_bar_plot <- ggplot(genus_bar_plot_data, aes(x = studycode, y = rel_abundance, fill = genus_plot, order = stack_order)) +
  geom_col(width = 0.75, color = "grey15", linewidth = 0.15) +
  facet_wrap(~ analysis_group, scales = "free_x", labeller = group_labeller) +
  scale_y_continuous(
    labels = percent_format(accuracy = 1),
    limits = c(0, 1),
    expand = expansion(mult = c(0, 0.02))
  ) +
  scale_fill_manual(values = genus_fill, breaks = genus_levels) +
  labs(
    title = "Genus-level composition after removing unclassified rows",
    subtitle = "Top 10 genera by overall relative abundance; 'Other' is stacked on top within each sample",
    x = "Study code",
    y = "Relative abundance (f_weighted_at_rank)",
    fill = "Genus"
  ) +
  theme(
    axis.text.x = element_text(angle = 20, hjust = 1),
    legend.position = "right"
  )

save_fig(genus_bar_plot, "taxonomy_genus_composition.png", width = 10, height = 6)
genus_bar_plot

Take-home note. The genus-level stacked bar plot is more detailed than the phylum plot and helps show which named genera drive the broad taxonomic shifts.

Code
genus_top10_by_sample <- tax_without_parsed %>%
  filter(!is.na(genus), genus != "") %>%
  group_by(studycode, analysis_group, genus) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(studycode, analysis_group) %>%
  arrange(desc(rel_abundance), .by_group = TRUE) %>%
  slice_head(n = 10) %>%
  ungroup()

genus_top10_by_group <- tax_without_parsed %>%
  filter(!is.na(genus), genus != "") %>%
  group_by(studycode, analysis_group, genus) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  group_by(analysis_group, genus) %>%
  summarise(mean_rel_abundance = mean(rel_abundance), .groups = "drop") %>%
  group_by(analysis_group) %>%
  arrange(desc(mean_rel_abundance), .by_group = TRUE) %>%
  slice_head(n = 10) %>%
  ungroup()

knitr::kable(genus_top10_by_sample, digits = 4, caption = "Top 10 genera by relative abundance within each sample")
Top 10 genera by relative abundance within each sample
studycode analysis_group genus rel_abundance
22EN-A-04 Adult g__Prevotella 0.2903
22EN-A-04 Adult g__Blautia_A 0.0480
22EN-A-04 Adult g__Cryptobacteroides 0.0448
22EN-A-04 Adult g__Megamonas 0.0343
22EN-A-04 Adult g__CAG-877 0.0289
22EN-A-04 Adult g__Coprococcus 0.0279
22EN-A-04 Adult g__Bacteroides 0.0270
22EN-A-04 Adult g__Faecalibacterium 0.0269
22EN-A-04 Adult g__Anaerostipes 0.0199
22EN-A-04 Adult g__Alloprevotella 0.0171
22EN-A-05 Adult g__Bacteroides 0.1816
22EN-A-05 Adult g__Blautia_A 0.0865
22EN-A-05 Adult g__Prevotella 0.0655
22EN-A-05 Adult g__Alistipes 0.0644
22EN-A-05 Adult g__Stercorousia 0.0523
22EN-A-05 Adult g__Bifidobacterium 0.0466
22EN-A-05 Adult g__Agathobacter 0.0426
22EN-A-05 Adult g__Fusicatenibacter 0.0342
22EN-A-05 Adult g__CAG-302 0.0308
22EN-A-05 Adult g__Phascolarctobacterium 0.0299
22EN-A-17 Adult g__Prevotella 0.1827
22EN-A-17 Adult g__Agathobacter 0.0849
22EN-A-17 Adult g__Blautia_A 0.0849
22EN-A-17 Adult g__Faecalibacterium 0.0577
22EN-A-17 Adult g__Fimenecus 0.0572
22EN-A-17 Adult g__Klebsiella 0.0496
22EN-A-17 Adult g__Bacteroides 0.0295
22EN-A-17 Adult g__Alloprevotella 0.0286
22EN-A-17 Adult g__Dorea_A 0.0266
22EN-A-17 Adult g__Phocaeicola 0.0221
22EN-C-02 Child 0-23 months g__Bifidobacterium 0.3456
22EN-C-02 Child 0-23 months g__Bacteroides 0.1499
22EN-C-02 Child 0-23 months g__Phocaeicola 0.1343
22EN-C-02 Child 0-23 months g__Anaerostipes 0.0578
22EN-C-02 Child 0-23 months g__Megasphaera 0.0363
22EN-C-02 Child 0-23 months g__Mediterraneibacter 0.0342
22EN-C-02 Child 0-23 months g__Blautia_A 0.0313
22EN-C-02 Child 0-23 months g__Megamonas 0.0259
22EN-C-02 Child 0-23 months g__Faecalibacterium 0.0218
22EN-C-02 Child 0-23 months g__Sellimonas 0.0173
22EN-C-03 Child 0-23 months g__Bifidobacterium 0.4785
22EN-C-03 Child 0-23 months g__Streptococcus 0.1488
22EN-C-03 Child 0-23 months g__Mediterraneibacter 0.1143
22EN-C-03 Child 0-23 months g__Escherichia 0.0735
22EN-C-03 Child 0-23 months g__Collinsella 0.0713
22EN-C-03 Child 0-23 months g__Veillonella 0.0216
22EN-C-03 Child 0-23 months g__Fusicatenibacter 0.0192
22EN-C-03 Child 0-23 months g__Blautia_A 0.0162
22EN-C-03 Child 0-23 months g__Enterococcus 0.0152
22EN-C-03 Child 0-23 months g__Dorea 0.0073
22EN-C-07 Child 24-59 months g__Megamonas 0.1603
22EN-C-07 Child 24-59 months g__Ruminococcus_B 0.1507
22EN-C-07 Child 24-59 months g__Phocaeicola 0.1279
22EN-C-07 Child 24-59 months g__Bacteroides 0.1163
22EN-C-07 Child 24-59 months g__Bifidobacterium 0.0680
22EN-C-07 Child 24-59 months g__Faecalibacterium 0.0484
22EN-C-07 Child 24-59 months g__Ruminococcoides 0.0474
22EN-C-07 Child 24-59 months g__Streptococcus 0.0453
22EN-C-07 Child 24-59 months g__Blautia_A 0.0380
22EN-C-07 Child 24-59 months g__Veillonella 0.0372
Code
knitr::kable(genus_top10_by_group, digits = 4, caption = "Top 10 genera by mean relative abundance within each age group")
Top 10 genera by mean relative abundance within each age group
analysis_group genus mean_rel_abundance
Child 0-23 months g__Bifidobacterium 0.4120
Child 0-23 months g__Bacteroides 0.1499
Child 0-23 months g__Phocaeicola 0.1343
Child 0-23 months g__Streptococcus 0.0793
Child 0-23 months g__Mediterraneibacter 0.0743
Child 0-23 months g__Anaerostipes 0.0578
Child 0-23 months g__Escherichia 0.0391
Child 0-23 months g__Collinsella 0.0374
Child 0-23 months g__Megasphaera 0.0363
Child 0-23 months g__Megamonas 0.0259
Child 24-59 months g__Megamonas 0.1603
Child 24-59 months g__Ruminococcus_B 0.1507
Child 24-59 months g__Phocaeicola 0.1279
Child 24-59 months g__Bacteroides 0.1163
Child 24-59 months g__Bifidobacterium 0.0680
Child 24-59 months g__Faecalibacterium 0.0484
Child 24-59 months g__Ruminococcoides 0.0474
Child 24-59 months g__Streptococcus 0.0453
Child 24-59 months g__Blautia_A 0.0380
Child 24-59 months g__Veillonella 0.0372
Adult g__Prevotella 0.1795
Adult g__Bacteroides 0.0794
Adult g__Blautia_A 0.0731
Adult g__Stercorousia 0.0523
Adult g__Agathobacter 0.0466
Adult g__Cryptobacteroides 0.0448
Adult g__Faecalibacterium 0.0363
Adult g__Alistipes 0.0295
Adult g__Fimenecus 0.0289
Adult g__CAG-877 0.0289

5.5.3 Genus heatmap

Code note

  • The stacked bar plot above is good for overall composition, but it becomes hard to see smaller differences.
  • This heatmap shows the same data in a matrix view, which makes it easier to spot sample-specific enrichment patterns.
Code
top_genera <- tax_without_parsed %>%
  filter(!is.na(genus), genus != "") %>%
  group_by(genus) %>%
  summarise(total_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(total_abundance)) %>%
  slice_head(n = 15) %>%
  pull(genus)

genus_plot_data <- tax_without_parsed %>%
  filter(genus %in% top_genera) %>%
  group_by(sample_name, studycode, analysis_group, genus) %>%
  summarise(rel_abundance = sum(f_weighted_at_rank, na.rm = TRUE), .groups = "drop") %>%
  mutate(
    studycode = factor(studycode, levels = studycode_order),
    genus = factor(genus, levels = rev(top_genera))
  )

genus_fill_max_raw <- max(genus_plot_data$rel_abundance, na.rm = TRUE)
genus_fill_limit <- ((floor((100 * genus_fill_max_raw) / 10) + 1) * 10) / 100
genus_fill_breaks <- seq(0, genus_fill_limit, by = 0.1)

genus_plot <- ggplot(genus_plot_data, aes(x = studycode, y = genus, fill = rel_abundance)) +
  geom_tile(color = "white", linewidth = 0.2) +
  facet_wrap(~ analysis_group, scales = "free_x", labeller = group_labeller) +
  scale_fill_gradient(
    low = "grey95",
    high = "#2C7FB8",
    limits = c(0, genus_fill_limit),
    breaks = genus_fill_breaks,
    labels = percent_format(accuracy = 1)
  ) +
  labs(
    title = "Top genera across the six re-analyzed samples",
    x = "Study code",
    y = "Genus",
    fill = "Relative abundance"
  ) +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

save_fig(genus_plot, "taxonomy_genus_heatmap.png", width = 10, height = 6.5)
genus_plot

Take-home note. The heatmap makes it easier to see sample-specific genus enrichment patterns that are harder to spot in the stacked bar plot.

Descriptive interpretation

  • The two youngest-child samples show more Actinomycetota than the adult samples, driven largely by Bifidobacterium.
  • Adult samples are richer in the classic adult-associated Bacillota + Bacteroidota balance.
  • The single older-child sample has a phylum profile that looks more adult-like than the two youngest samples, but a single sample is not enough to claim convergence.

5.6 Alpha diversity

Code note

  • Alpha diversity asks how diverse each sample is on its own.
  • We use the Shannon index because it captures both richness and evenness and is familiar to many microbiome readers.
Code
alpha_diversity <- tax_without_parsed %>%
  group_by(sample_name) %>%
  summarise(
    shannon = vegan::diversity(f_weighted_at_rank, index = "shannon"),
    observed_taxa = n_distinct(match_name),
    .groups = "drop"
  ) %>%
  left_join(metadata, by = "sample_name") %>%
  arrange(analysis_group, studycode)

knitr::kable(alpha_diversity, digits = 3)
sample_name shannon observed_taxa studycode age_month date_of_sampling to_sequence status analysis_group
ERR9904449 3.530 140 22EN-C-02 16 18-May-2017 1st only Child Child 0-23 months
ERR9904450 2.848 81 22EN-C-03 9 18-May-2017 1st only Child Child 0-23 months
ERR9904458 3.282 153 22EN-C-07 59 22-May-2017 1st only Child Child 24-59 months
ERR9904447 4.701 376 22EN-A-04 25 18-May-2017 1st only Adult Adult
ERR9904452 4.191 252 22EN-A-05 59 19-May-2017 1st only Adult Adult
ERR9904480 4.602 336 22EN-A-17 32 24-May-2017 2nd only Adult Adult
Code
alpha_jitter <- position_jitter(width = 0.06, height = 0, seed = 42)

alpha_plot <- ggplot(alpha_diversity, aes(x = analysis_group, y = shannon, color = analysis_group)) +
  geom_boxplot(width = 0.55, outlier.shape = NA, alpha = 0.2, linewidth = 0.4) +
  geom_point(aes(shape = analysis_group), size = 3, position = alpha_jitter) +
  ggrepel::geom_text_repel(
    aes(label = studycode),
    position = alpha_jitter,
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.35,
    point.padding = 0.2,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  scale_x_discrete(labels = group_labels) +
  scale_y_continuous(limits = c(0, NA), expand = expansion(mult = c(0, 0.05))) +
  labs(
    #title = "Adults show the highest Shannon diversity in this 6-sample subset",
    x = "Analysis group",
    y = "Shannon diversity",
    color = "Group",
    shape = "Group"
  ) +
  theme(legend.position = "none")

save_fig(alpha_plot, "taxonomy_alpha_diversity.png", width = 5, height = 5)
alpha_plot

Take-home note. In this subset, adult samples sit at the higher end of Shannon diversity, while the child samples are consistently lower.

No formal groupwise alpha-diversity test is emphasized here. With group sizes of 2, 1, and 3, a p-value would suggest more precision than the data can support.

5.7 Beta diversity / ordination

Code note

  • Beta diversity compares samples against one another rather than within a single sample.
  • Here we build a species-by-sample matrix from f_weighted_at_rank, compute Bray-Curtis dissimilarity, and then use classical multidimensional scaling to visualize those distances.
Code
species_wide <- tax_without_parsed %>%
    select(sample_name, match_name, f_weighted_at_rank) %>%
    distinct() %>%
    pivot_wider(names_from = match_name, values_from = f_weighted_at_rank, values_fill = 0) %>%
    arrange(match(sample_name, sample_order))

species_mat <- as.matrix(as.data.frame(species_wide[, -1]))
rownames(species_mat) <- species_wide$sample_name

bray_dist <- vegdist(species_mat, method = "euclidean")
bray_cmd <- cmdscale(bray_dist, eig = TRUE, k = 2)

beta_diversity <- tibble::tibble(
    sample_name = rownames(bray_cmd$points),
    Axis1 = bray_cmd$points[, 1],
    Axis2 = bray_cmd$points[, 2]
) %>%
    left_join(metadata, by = "sample_name")

positive_eig <- bray_cmd$eig[bray_cmd$eig > 0]
axis1_var <- positive_eig[1] / sum(positive_eig)
axis2_var <- positive_eig[2] / sum(positive_eig)

pairwise_bray <- as.matrix(bray_dist) %>%
    as.data.frame() %>%
    tibble::rownames_to_column("sample_1") %>%
    pivot_longer(-sample_1, names_to = "sample_2", values_to = "bray_curtis") %>%
    filter(sample_1 < sample_2) %>%
    left_join(metadata %>% select(sample_name, group_1 = analysis_group), by = c("sample_1" = "sample_name")) %>%
    left_join(metadata %>% select(sample_name, group_2 = analysis_group), by = c("sample_2" = "sample_name")) %>%
    left_join(metadata %>% select(sample_name, studycode_1 = studycode), by = c("sample_1" = "sample_name")) %>%
    left_join(metadata %>% select(sample_name, studycode_2 = studycode), by = c("sample_2" = "sample_name")) %>%
    select(studycode_1, studycode_2, group_1, group_2, bray_curtis) %>%
    arrange(bray_curtis)

knitr::kable(pairwise_bray, digits = 3)
studycode_1 studycode_2 group_1 group_2 bray_curtis
22EN-A-04 22EN-A-17 Adult Adult 0.167
22EN-A-05 22EN-A-17 Adult Adult 0.196
22EN-A-04 22EN-A-05 Adult Adult 0.219
22EN-C-02 22EN-A-17 Child 0-23 months Adult 0.256
22EN-C-02 22EN-A-05 Child 0-23 months Adult 0.257
22EN-A-04 22EN-C-02 Adult Child 0-23 months 0.264
22EN-C-02 22EN-C-07 Child 0-23 months Child 24-59 months 0.281
22EN-A-04 22EN-C-07 Adult Child 24-59 months 0.289
22EN-C-07 22EN-A-17 Child 24-59 months Adult 0.296
22EN-A-05 22EN-C-07 Adult Child 24-59 months 0.300
22EN-C-02 22EN-C-03 Child 0-23 months Child 0-23 months 0.322
22EN-C-03 22EN-A-17 Child 0-23 months Adult 0.341
22EN-A-04 22EN-C-03 Adult Child 0-23 months 0.345
22EN-C-03 22EN-A-05 Child 0-23 months Adult 0.354
22EN-C-03 22EN-C-07 Child 0-23 months Child 24-59 months 0.407
Code
beta_plot <- ggplot(beta_diversity, aes(x = Axis1, y = Axis2, color = analysis_group)) +
  geom_point(aes(shape = analysis_group), size = 3) +
  stat_ellipse(aes(group = analysis_group), linewidth = 0.6, linetype = 2, level = 0.8, show.legend = FALSE) +
  ggrepel::geom_text_repel(
    aes(label = studycode),
    size = 3,
    show.legend = FALSE,
    seed = 42,
    box.padding = 0.25,
    point.padding = 0.25,
    min.segment.length = 0
  ) +
  scale_color_manual(values = group_palette, labels = group_labels) +
  scale_shape_manual(values = group_shapes, labels = group_labels) +
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.05))) +
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.05))) +
  labs(
    title = "Bray-Curtis ordination on species-level f_weighted_at_rank",
    x = sprintf("Axis 1 (%.1f%% explained)", 100 * axis1_var),
    y = sprintf("Axis 2 (%.1f%% explained)", 100 * axis2_var),
    color = "Group",
    shape = "Group"
  )

save_fig(beta_plot, "taxonomy_bray_curtis_ordination.png", width = 8.5, height = 5.8)
beta_plot

Take-home note. The ordination shows that the adult samples group more tightly together, while the child samples are farther away from that adult cluster.

5.8 Taxonomy interpretation

This subset is directionally consistent with the paper in some ways:

  • The adult samples cluster more closely to one another than the two youngest-child samples.
  • Adults also show the highest Shannon diversity in the subset.
  • The two youngest-child samples are compositionally distinct, especially because of stronger Actinomycetota / Bifidobacterium signal and lower Shannon diversity.

However, the subset does not fully recapitulate the paper’s age gradient:

  • There is only one older-child sample.
  • That older-child sample has a more adult-like phylum balance than the younger-child samples, but it still remains Bray-Curtis-distant from the adults.
  • Therefore, we can say the local data are compatible with the reported maturation trend, but this subset is too small to claim a clean replication.
Back to top