Clinical Foundations & AMR Surveillance

Workflow setup and six-sample teaching re-analysis

Authors

Minh-Quan Ton-Ngoc, Hao Chung The

Published

May 15, 2026

1 How to use this notebook

Learning objectives

By the end of this notebook, learners should be able to:

  • explain why the Vietnamese healthy gut microbiome exercise is framed as a 6-sample demonstration re-analysis rather than a full replication study,
  • retrieve paired-end FASTQ files for the selected samples from ENA,
  • prepare a simple sample CSV and configure metaflow for a workshop run,
  • process the raw metaflow outputs into taxonomy, phyloseq, taxburst, and AMR-ready files,
  • and interpret a compact R-based exploratory analysis with appropriate caution.
Suggested workshop flow
  • Start with the orientation and setup sections so learners understand the purpose of the exercise before any commands are run.
  • Work through the pipeline steps in order: download reads, prepare the input CSV, run metaflow, and process the output files.
  • Only then move into the R analysis, where the emphasis shifts from file handling to interpretation.
  • Use the reflection prompts and collapsible hints if some participants move more quickly than others.
Why this format works well for doctors and researchers

This notebook is designed to be self-guided and practical. It lets learners move from:

  • a clear study question,
  • to reproducible data retrieval,
  • to pipeline execution,
  • to analysis-ready objects,
  • and finally to biological interpretation.

That structure helps mixed-background groups stay oriented, even when some learners are new to command-line bioinformatics.

2 What this Vietnamese healthy gut microbiome exercise is

Exercise framing. This notebook uses 6 selected samples from the 2021 study by Pereira-Dias et al. as a compact teaching dataset. The aim is to show how to go from raw shotgun metagenomic reads to interpretable taxonomic and AMR summaries in a way that is practical for workshop learners.

This is not a full replication of the original paper:

  • The original study profiled 42 healthy Vietnamese participants.
  • The original data were paired-end Illumina shotgun metagenomes deposited in ENA under accessions ERS1865478–ERS1865519.
  • The published paper performed broader age-stratified microbiome and AMR analyses than are possible in a 6-sample teaching subset.
  • This notebook therefore focuses on workflow demonstration, directional comparison, and cautious interpretation.

The paper used for this exercise is:

Pereira-Dias J, et al. The Gut Microbiome of Healthy Vietnamese Adults and Children Is a Major Reservoir for Resistance Genes Against Critical Antimicrobials. Journal of Infectious Diseases. 2021;224(S7):S840-S847.

Interpretive guardrails used throughout this notebook

  • This is a 6-sample demonstration re-analysis; the original paper analyzed 42 individuals.
  • Taxonomic ecology in this notebook uses f_weighted_at_rank as the main abundance value.
  • The taxonomy column unweighted_fraction should not be assumed to sum to 100%.
  • The treatment of unclassified reads materially changes interpretation.
  • The AMR workflow here is rgi-bwt against CARD-like annotations, not SRST2 + ARGANNOT as used in the paper.
  • Therefore, taxonomic and AMR comparisons are directional and exploratory, not definitive replication claims.
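
To make the last two guardrails concrete, here is a toy calculation (invented numbers, not workshop data) showing how dropping the unclassified fraction and renormalizing changes the apparent abundance of every remaining taxon:

```python
# Toy profile: 40% of reads unclassified (numbers invented for illustration)
profile = {"unclassified": 0.40, "Bacteroides": 0.30,
           "Prevotella": 0.20, "Bifidobacterium": 0.10}

# Drop the unclassified row and renormalize the rest so it sums to 1
classified = {k: v for k, v in profile.items() if k != "unclassified"}
total = sum(classified.values())
renormalized = {k: v / total for k, v in classified.items()}

# Bacteroides is 30% of all reads, but 50% of classified reads
print(round(renormalized["Bacteroides"], 2))  # → 0.5
```

The same taxon reads as 30% or 50% depending only on the denominator, which is why both table variants are produced later in the workflow.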

Which statement best captures the purpose of this Vietnamese healthy gut microbiome notebook?

Hint

Focus on the difference between a teaching re-analysis and a definitive replication study.

3 metaflow overview

metaflow is a modular Nextflow pipeline for Illumina short-read metagenomics. In workshop terms, it gives learners one practical route from paired-end FASTQ files to read-based taxonomic profiling, host-removal outputs, and AMR screening results without forcing them to build every step from scratch.

Workflow diagram

The diagram below summarizes the planned workflow for upcoming releases of metaflow. Some modules shown may not be available in the current version.

Further reading. More detailed setup and usage notes for the pipeline are available in the metaflow wiki. This notebook keeps the workshop version shorter and more task-focused.

4 Software and setup guidance

4.1 Software requirements

For this workshop, learners mainly need:

  • git to clone the pipeline,
  • nextflow (version 25 or later recommended) to launch the workflow,
  • a Java runtime compatible with Nextflow (the wiki recommends Java 16-22),
  • and either conda or apptainer to provide software dependencies.

If you are using the prepared workshop machine, these pieces may already be installed.

4.2 Hardware requirements

Practical expectations for a small workshop run are:

  • Linux, WSL, or a remote Linux server environment,
  • reliable internet access for the initial FASTQ downloads,
  • and enough free disk space for FASTQ files, Nextflow cache files, containers, and outputs.

The broader installation guide is more conservative:

  • at least 32 GB RAM for MEGAHIT-based assembly,
  • up to 128 GB RAM if metaSPAdes is used,
  • and roughly 256 GB disk once databases, environments, caches, and outputs are included.

For this six-sample workshop, disk space is usually the first constraint learners hit, well before CPU becomes the bottleneck.
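
Before downloading anything, a quick pre-flight check of the current machine can save time. This is a sketch that assumes GNU coreutils (`df --output`, `free`, `nproc`) on Linux or WSL:

```shell
# Report free disk, RAM, and CPU count before launching anything
avail_gb=$(df -BG --output=avail . | tail -n 1 | tr -dc '0-9')
echo "Free disk here: ${avail_gb} GB (full install guide suggests ~256 GB)"
if [ "${avail_gb}" -lt 256 ]; then
    echo "Note: below the conservative install-guide figure; usually still fine for the six-sample subset" >&2
fi
free -g | awk '/^Mem:/ {print "RAM: " $2 " GB"}'
echo "CPUs: $(nproc)"
```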

4.3 How to clone the repository

cd /path/to/your/folder
git clone https://github.com/tnmquann/metaflow.git
cd metaflow

4.4 Optional dependency setup

These are concise workshop-style reminders rather than full installation manuals.

4.4.1 Nextflow

curl -fsSL https://get.nextflow.io | bash
chmod +x nextflow
mv nextflow ~/bin/
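
A quick check that the install worked (this assumes `~/bin` is on your PATH):

```shell
# Verify Nextflow is reachable; print a hint if ~/bin is not on PATH yet
if command -v nextflow >/dev/null 2>&1; then
    nextflow -version
else
    echo "nextflow not found; add ~/bin to PATH, e.g. export PATH=\"\$HOME/bin:\$PATH\""
fi
```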

4.4.2 Java

Install a recent Java runtime and confirm it works:

java -version

4.4.3 Conda

If conda is not already available, install Miniconda or a similar lightweight distribution and then create or activate the workshop environment:

conda activate microgen2026

4.4.4 Apptainer

If your site prefers containers instead of Conda environments, ensure apptainer is installed and available:

apptainer --version

Which setup combination is most central to launching the workshop pipeline successfully?

Hint

Think about what is required before any R analysis begins.

4.5 Database note

metaflow depends on several prebuilt resources, especially for read-based analysis. In a normal installation, users must prepare these databases before running the pipeline. For this workshop, they have already been prepared to save time.

Available workshop databases:

  • sourmash (GTDB-reps-rs226, ksize=31, scaled=1000): /database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31.rocksdb
  • YACHT (GTDB-reps-rs226, ksize=31, ANI threshold 0.995): /database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31_0.995_pretrained
  • sourmash taxonomy lineage: /database/gtdb/gtdb-rs226/gtdb-rs226.lineages.sqldb
  • hostile (human-t2t-hla): /datasets/human_genome/hostile
  • CARD database for rgi_bwt: /database/rgi_db

Workshop shortcut. Because these resources already exist, the practical job for learners is usually to point the config file to the correct prepared paths rather than building the databases from scratch.

4.6 Practical cautions

Run-safety reminders
  • If you plan to run jobs overnight, keep the session inside screen or tmux.
  • metaflow writes cache files wherever the nextflow run command is launched, so start the run from a directory with enough storage.
  • Do not reuse the same Nextflow cache folder or the same output folder for multiple active runs at the same time.
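
For the screen/tmux reminder, a minimal detached-session pattern looks like this (session name and directory are illustrative; the cleanup line exists only so the demo leaves nothing behind):

```shell
# Keep long jobs alive even if the SSH connection drops
if command -v tmux >/dev/null 2>&1; then
    tmux new-session -d -s metaflow_demo 2>/dev/null || true  # create a detached session
    tmux ls                                                   # confirm it exists
    # work inside it with: tmux attach -t metaflow_demo ; detach again with Ctrl-b d
    tmux kill-session -t metaflow_demo 2>/dev/null || true    # demo cleanup only; skip for a real run
else
    echo "tmux not installed; screen -S metaflow_demo is an equivalent fallback"
fi
```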

5 CSV input format for metaflow

According to the current Usage.md, the documented samplesheet format for CSV input is:

sample_id,run_id,group,short_reads_1,short_reads_2,long_reads

For this workshop, you can think of that file as a sample manifest: one row per paired-end sample, with optional metadata columns that help the pipeline track multiple runs or groups.

Practical requirements from the wiki:

  • use a .csv file,
  • include at least the short-read columns,
  • keep run_id unique if one biological sample has more than one sequencing run,
  • and leave long_reads blank for this short-read-only exercise.

Example samplesheet:

sample_id,run_id,group,short_reads_1,short_reads_2,long_reads
ERR9904447,,0,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904447_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904447_2.fastq.gz,
ERR9904449,,1,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904449_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904449_2.fastq.gz,
ERR9904450,,1,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904450_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904450_2.fastq.gz,
ERR9904452,,0,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904452_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904452_2.fastq.gz,
ERR9904458,,1,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904458_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904458_2.fastq.gz,
ERR9904480,,0,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904480_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904480_2.fastq.gz,

Column guide:

  • sample_id: the sample identifier that should stay stable through the workflow; here we use the ENA run accession so downstream file matching stays simple
  • run_id: optional run label; leave blank when each sample has only one run
  • group: optional grouping label; simple numeric or text labels are fine for this workshop
  • short_reads_1: full path to mate 1
  • short_reads_2: full path to mate 2
  • long_reads: left blank because this exercise uses Illumina paired-end short reads only

Practical tip. Absolute paths are safest for workshop workstations. If you use relative paths, make sure they resolve correctly from the directory where the pipeline is launched.
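
If you want to check a manifest before handing it to the pipeline, a small validator along these lines can catch the usual mistakes. The function name and messages are illustrative, not part of metaflow:

```python
import csv
import os

REQUIRED = ["sample_id", "run_id", "group", "short_reads_1", "short_reads_2", "long_reads"]

def check_samplesheet(path):
    """Return a list of problems found in a metaflow-style samplesheet."""
    problems = []
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {missing}"]
        for i, row in enumerate(reader, start=2):  # line 1 is the header
            for col in ("short_reads_1", "short_reads_2"):
                p = row[col]
                if not p:
                    problems.append(f"line {i}: empty {col}")
                elif not os.path.isabs(p):
                    problems.append(f"line {i}: {col} is not an absolute path")
                elif not os.path.exists(p):
                    problems.append(f"line {i}: {col} not found on disk")
    return problems
```

An empty return value means the manifest passed all three checks.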

Which samplesheet format best matches the current metaflow wiki for this workshop?

Hint

The answer should match the wording in Usage.md, not just a minimal manifest that happens to work elsewhere.

6 Workshop config: microgen2026.config

The pipeline run below uses the prepared workshop config file microgen2026.config.

Current config content:

params {
    trace_report_suffix   = new java.util.Date().format('yyyy-MM-dd_HH-mm-ss')
    enable_readbase = true
    input = "day-4/materials/vietnamese-healthy-gut-microbiome"
    input_format = "csv"
    outdir = "day-4/materials/vietnamese-healthy-gut-microbiome/out"
    hostile_reference = "/datasets/human_genome/hostile"
    hostile_index = "human-t2t-hla"
    sourmash_database = "/database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31.rocksdb"
    sourmash_ksize = 31
    sourmash_taxonomy_csv = "/database/gtdb/gtdb-rs226/gtdb-rs226.lineages.sqldb"
    yacht_database = "/database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31_0.995_pretrained/gtdb-rs226-reps.k31_0.995_pretrained_config.json"
    enable_rgi_bwt = true
    rgi_preparecarddb_dir = "/database/rgi_db"
    rgi_include_other_models = true
    //enable_singlesketch = true
}

process {
    cpus     = { 1 * task.attempt }
    memory   = { 7.GB * task.attempt }
    time     = { 4.h * task.attempt }

    errorStrategy = { task.exitStatus in ((130..145) + [104, 134, 137, 139, 140, 143]) ? 'retry' : 'finish' }
    maxRetries = 3
    maxErrors = '-1'

    withLabel: process_single {
        cpus   = { 2 }
        memory = { 7.GB * task.attempt }
        time   = { 4.h * task.attempt }
    }
    withLabel: process_low {
        cpus   = { 6 * task.attempt }
        memory = { 12.GB * task.attempt }
        time   = { 4.h * task.attempt }
    }
    withLabel: process_medium {
        cpus   = { 12 * task.attempt }
        memory = { 36.GB * task.attempt }
        time   = { 8.h * task.attempt }
    }
    withLabel: process_high {
        cpus   = { 24 * task.attempt }
        memory = { 120.GB * task.attempt }
        time   = { 24.h * task.attempt }
    }
    withLabel: process_long {
        time = { 72.h * task.attempt }
    }
    withLabel: process_high_memory {
        memory = { 120.GB * task.attempt }
    }
}

trace {
    enabled = true
    file = "${params.outdir}/pipeline_info/trace_complete_${params.trace_report_suffix}_rawtrace.txt"
    raw = true
    fields = 'task_id,hash,native_id,process,tag,name,status,exit,attempt,submit,start,complete,duration,realtime,queue,cpus,memory,disk,time,cpu_model,hostname,%cpu,%mem,peak_rss,peak_vmem,rss,vmem,rchar,wchar,syscr,syscw,read_bytes,write_bytes,vol_ctxt,inv_ctxt,module,container,workdir,error_action'
}

timeline {
    enabled = true
    file = "${params.outdir}/pipeline_info/timeline_${params.trace_report_suffix}.html"
}

report {
    enabled = true
    file = "${params.outdir}/pipeline_info/report_${params.trace_report_suffix}.html"
}

Key parameters that matter most for this exercise:

  • input: points to the sample manifest. For this exercise, learners should edit it to the final path of vietnamese_gut_sample.csv.
  • outdir: the base output directory for the run. For this exercise, set it to /day-4/materials/vietnamese-healthy-gut-microbiome/out.
  • enable_readbase: switches on the read-based portion of the pipeline, which is central to this workshop.
  • hostile_reference and hostile_index: control host-removal resources. Learners normally leave these alone unless their host database changes.
  • sourmash_database, sourmash_ksize, and sourmash_taxonomy_csv: control read-based taxonomic profiling and lineage annotation.
  • yacht_database: points to the prepared YACHT model configuration JSON inside the prebuilt workshop database directory.
  • enable_rgi_bwt, rgi_preparecarddb_dir, and rgi_include_other_models: control the AMR screening step and the CARD resources used for it.
  • process { ... }: sets CPU, memory, time, and retry behavior. Learners usually do not edit this section during a short workshop unless the execution environment is very constrained.

Why use -c here? The broader metaflow usage notes prefer parameter files for portable runs, but this workshop intentionally uses a prepared Nextflow config because it already bundles the local database paths and resource defaults for the training environment.

Required edits before the Vietnamese healthy gut microbiome run.

  • Change input to the path of vietnamese_gut_sample.csv.
  • Change outdir to /day-4/materials/vietnamese-healthy-gut-microbiome/out.

Before running the workshop job, which config variables are the most important to update first?

Hint

Ask which settings tell the pipeline what to read and where to write.

7 Hands-on Vietnamese healthy gut microbiome practical exercise

7.1 Part 1: introduce the paper

The workshop paper is a short, practical example for learning age-stratified gut metagenomics and AMR interpretation.

We use it here because it has a clear microbiome-plus-AMR question, paired-end ENA data, and a manageable six-sample subset for teaching. The goal is to demonstrate workflow and interpretation, not to recreate every result in the paper.

7.2 Part 2: retrieve the sequencing data

Inside /day-4/materials/vietnamese-healthy-gut-microbiome, learners are given:

  • vietnamese_gut_sample_information.csv

The task is to find and download all samples whose identifiers appear in the studycode column.

7.2.1 Guided ENA retrieval exercise

  1. Start from one study sample accession given in the paper, for example ERS1865478.
  2. Visit https://www.ebi.ac.uk/ena/browser/view/ERS1865478.
  3. Open the accession page and go to the Data section.
  4. Click Read Files: Show.
  5. Identify the Study Accession for the parent project.
  6. Open that study accession page.
  7. Click Show Column Selection.
  8. Add the fields you need to recover the FASTQ download links.
  9. The most important field is fastq_ftp.
  10. Also keep fields such as sample_title, sample_accession, and run_accession so you can map records back to the workshop metadata.
  11. Download the study report as TSV.
  12. In the TSV, use sample_title to match the studycode values listed in vietnamese_gut_sample_information.csv.
  13. Extract the paired-end FASTQ URLs for the six required samples.

Why use sample_title? In this exercise, sample_title is the practical bridge between the ENA report and the workshop metadata because it corresponds to the studycode values that learners are asked to follow.
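
Steps 11-13 can also be scripted. The helper below is a sketch (function name invented) that assumes the report was exported with the sample_title, run_accession, and fastq_ftp columns selected:

```python
import csv

def fastq_urls_for_studycodes(report_tsv, studycodes):
    """Collect paired FASTQ links for selected samples from an ENA study report.

    Assumes fastq_ftp carries ';'-separated FTP links, as in ENA TSV exports.
    """
    wanted = set(studycodes)
    hits = {}
    with open(report_tsv, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            if row.get("sample_title") in wanted:  # studycode <-> sample_title bridge
                hits[row["sample_title"]] = row.get("fastq_ftp", "").split(";")
    return hits
```

Feeding it the six studycode values from vietnamese_gut_sample_information.csv should return exactly six entries, each with two URLs.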

When learners export the ENA study report, which field is the most important for recovering the paired FASTQ download links?

Hint

One field helps you identify the right samples, but another actually carries the FTP links you need to download.

Hint: example batch download script
#!/bin/bash

mkdir -p fastq && cd fastq || exit 1

urls=(
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/007/ERR9904447/ERR9904447_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/007/ERR9904447/ERR9904447_2.fastq.gz"

"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/002/ERR9904452/ERR9904452_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/002/ERR9904452/ERR9904452_2.fastq.gz"

"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/000/ERR9904480/ERR9904480_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/000/ERR9904480/ERR9904480_2.fastq.gz"

"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/009/ERR9904449/ERR9904449_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/009/ERR9904449/ERR9904449_2.fastq.gz"

"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/000/ERR9904450/ERR9904450_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/000/ERR9904450/ERR9904450_2.fastq.gz"

"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/008/ERR9904458/ERR9904458_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/008/ERR9904458/ERR9904458_2.fastq.gz"
)

for url in "${urls[@]}"; do
    wget -c "$url"
done

For the remainder of this notebook, assume the paired-end FASTQ files have been downloaded into:

/day-4/materials/vietnamese-healthy-gut-microbiome/fastq

7.3 Create the metaflow input CSV

The input CSV tells metaflow exactly which samples, optional grouping fields, and FASTQ pairs should be processed together. It is the handoff between manual data retrieval and reproducible pipeline execution.

Create:

  • vietnamese_gut_sample.csv

Example content:

sample_id,run_id,group,short_reads_1,short_reads_2,long_reads
ERR9904447,,adult,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904447_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904447_2.fastq.gz,
ERR9904452,,adult,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904452_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904452_2.fastq.gz,
ERR9904480,,child_24_59m,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904480_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904480_2.fastq.gz,
ERR9904449,,child_0_23m,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904449_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904449_2.fastq.gz,
ERR9904450,,child_0_23m,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904450_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904450_2.fastq.gz,
ERR9904458,,child_0_23m,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904458_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904458_2.fastq.gz,

Column guide:

  • sample_id: use a stable identifier, such as the ENA run accession, so downstream outputs stay easy to match
  • run_id: leave blank when there is only one sequencing run per sample
  • group: optional learner-friendly grouping label; here it helps keep the age strata visible during teaching
  • short_reads_1: full path to mate 1
  • short_reads_2: full path to mate 2
  • long_reads: leave blank for this short-read-only workshop dataset
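
Typing six long rows by hand invites path typos, so the manifest can also be generated from the fastq folder. This helper is illustrative (not part of metaflow) and assumes the ENA-style *_1.fastq.gz / *_2.fastq.gz naming used above:

```python
import csv
import pathlib

def write_samplesheet(fastq_dir, groups, out_csv):
    """Write a metaflow-style samplesheet from paired *_1/*_2.fastq.gz files.

    `groups` maps sample_id -> group label; unlabelled samples get an empty group.
    Only complete pairs are emitted.
    """
    header = ["sample_id", "run_id", "group", "short_reads_1", "short_reads_2", "long_reads"]
    fastq_dir = pathlib.Path(fastq_dir)
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for r1 in sorted(fastq_dir.glob("*_1.fastq.gz")):
            sample_id = r1.name[: -len("_1.fastq.gz")]
            r2 = r1.with_name(f"{sample_id}_2.fastq.gz")
            if r2.exists():  # skip orphaned mate-1 files
                writer.writerow([sample_id, "", groups.get(sample_id, ""),
                                 str(r1.resolve()), str(r2.resolve()), ""])
```

Using `.resolve()` also satisfies the absolute-path tip from the CSV section above.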

Before launching the workflow, edit the config so that:

  • input points to vietnamese_gut_sample.csv
  • outdir points to /day-4/materials/vietnamese-healthy-gut-microbiome/out

7.4 Run the pipeline

Launch the pipeline with:

nextflow run /path/to/metaflow/main.nf -c /path/to/microgen2026.config -profile conda

Command breakdown:

  • nextflow run: start the workflow
  • /path/to/metaflow/main.nf: the main pipeline entry point
  • -c /path/to/microgen2026.config: load the workshop config file
  • -profile conda: resolve software dependencies through Conda
Worked solution for the workshop file layout
cd /day-4/materials
nextflow run /path/to/metaflow/main.nf \
  -c /path/to/microgen2026.config \
  -profile conda

7.5 Process the metaflow output into analysis-ready files

After the pipeline finishes, the raw metaflow output is still not in the most convenient format for downstream teaching. In this workshop we post-process the merged read-based results into:

  • taxonomy tables with and without unclassified rows,
  • a phyloseq export directory,
  • taxburst-ready split files,
  • and a cleaned AMR file collection.

The read-based result table is:

  • /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv

The processing helper validated in the current workshop folder is:

  • /day-4/materials/process_metaflow_daa_microgen2026.py

If your teaching folder also provides a wrapper named process_metagenomic_microgen2026.py, you can substitute only the script filename; the command structure below stays the same.

This matters because the live CLI supports --output_format default, phyloseq, and taxburst, and it has a few rules that differ slightly from the original draft instructions:

  • --remove_unclassified is supported for default output,
  • taxburst output already omits synthetic unclassified rows, so --remove_unclassified should not be combined with it,
  • and phyloseq / taxburst outputs should write into an existing directory.
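
Those rules are easy to forget mid-workshop. A tiny guard like this (hypothetical helper that mirrors the rules above, not the script's real internals) encodes the flag logic:

```python
def check_processor_args(output_format, remove_unclassified):
    """Reject flag combinations the processing CLI does not support."""
    if output_format not in {"default", "phyloseq", "taxburst"}:
        raise ValueError(f"unknown output_format: {output_format!r}")
    if remove_unclassified and output_format != "default":
        # taxburst already omits synthetic unclassified rows; phyloseq keeps them
        raise ValueError("--remove_unclassified is only supported with default output")
```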

Suggested workflow:

conda activate microgen2026
mkdir -p /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis

python /day-4/materials/process_metaflow_daa_microgen2026.py \
  --input /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv \
  --input_format raw_metaflow \
  --min_coverage 0.05 \
  --output_format default \
  --output /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis \
  --remove_unclassified

mv /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/merged_sourmash_yacht.csv.processed_metaflow.csv \
   /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut_without_unclassified.csv

python /day-4/materials/process_metaflow_daa_microgen2026.py \
  --input /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv \
  --input_format raw_metaflow \
  --min_coverage 0.05 \
  --output_format default \
  --output /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis

mv /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/merged_sourmash_yacht.csv.processed_metaflow.csv \
   /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut_with_unclassified.csv

python /day-4/materials/process_metaflow_daa_microgen2026.py \
  --input /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv \
  --input_format raw_metaflow \
  --min_coverage 0.05 \
  --output_format taxburst \
  --output /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis

mv /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/merged_sourmash_yacht.csv.taxburst \
   /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut.taxburst

python /day-4/materials/process_metaflow_daa_microgen2026.py \
  --input /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv \
  --input_format raw_metaflow \
  --min_coverage 0.05 \
  --output_format phyloseq \
  --output /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis

mv /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/merged_sourmash_yacht.csv.phyloseq \
   /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut.phyloseq

mkdir -p /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut_amr
find /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/rgi_bwt -name '*.gene_mapping_data.txt' \
  -exec cp {} /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut_amr/ \;

What each command is doing:

  • the first run creates the classified-only table and renormalizes f_weighted_at_rank
  • the second run keeps the with-unclassified version
  • the third run creates a directory of taxburst-ready split files
  • the fourth run writes a phyloseq export directory
  • the final find ... cp step collects the per-sample rgi_bwt mapping tables into one AMR folder

After this step, the analysis folder should contain:

  • vietnamese_gut.phyloseq
  • vietnamese_gut.taxburst
  • vietnamese_gut_amr
  • vietnamese_gut_with_unclassified.csv
  • vietnamese_gut_without_unclassified.csv
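
A quick way to confirm the folder is complete before moving on (paths match the workshop layout above; a MISSING line means a post-processing step needs re-running):

```shell
# Check that every file the R analysis section expects actually exists
base="/day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis"
missing=0
for item in vietnamese_gut.phyloseq vietnamese_gut.taxburst vietnamese_gut_amr \
            vietnamese_gut_with_unclassified.csv vietnamese_gut_without_unclassified.csv; do
    if [ -e "${base}/${item}" ]; then
        echo "OK      ${item}"
    else
        echo "MISSING ${item}"
        missing=$((missing + 1))
    fi
done
echo "${missing} item(s) missing"
```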

7.6 Generate Krona-like taxburst HTML files

Once the processed taxburst folder exists, move into it and render one HTML file per split table.

conda activate microgen2026
cd /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut.taxburst

for file in *.taxburst.split.csv; do
    sample_name="${file%.taxburst.split.csv}"
    taxburst -F tax_annotate "$file" -o "${sample_name}.taxburst.split.html"
done

Small logic hint:

  • *.taxburst.split.csv matches each sample-level taxburst input file
  • the loop strips the suffix to recover the sample name
  • taxburst then writes a browsable HTML summary for that sample

Those HTML files work as Krona-like views of sample relative abundance and are useful for quick qualitative inspection in class.

Catch-up option. If learners fall behind, precomputed outputs are already available in /day-4/materials/vietnamese-healthy-gut-microbiome, so the workshop can continue into the R analysis without waiting for every pipeline run to finish.

Why does the workflow deliberately produce both vietnamese_gut_with_unclassified.csv and vietnamese_gut_without_unclassified.csv?

Hint

Think about what changes when you remove unclassified rows and renormalize the remaining abundances.

Tip

Continue to Taxonomy Analysis.
