Clinical Foundations & AMR Surveillance
Workflow setup and six-sample teaching re-analysis
1 How to use this notebook
By the end of this notebook, learners should be able to:
- explain why the Vietnamese healthy gut microbiome exercise is framed as a 6-sample demonstration re-analysis rather than a full replication study,
- retrieve paired-end FASTQ files for the selected samples from ENA,
- prepare a simple sample CSV and configure metaflow for a workshop run,
- process the raw metaflow outputs into taxonomy, phyloseq, taxburst, and AMR-ready files,
- and interpret a compact R-based exploratory analysis with appropriate caution.
- Start with the orientation and setup sections so learners understand the purpose of the exercise before any commands are run.
- Work through the pipeline steps in order: download reads, prepare the input CSV, run metaflow, and process the output files.
- Only then move into the R analysis, where the emphasis shifts from file handling to interpretation.
- Use the reflection prompts and collapsible hints if some participants move more quickly than others.
This notebook is designed to be self-guided and practical. It lets learners move from:
- a clear study question,
- to reproducible data retrieval,
- to pipeline execution,
- to analysis-ready objects,
- and finally to biological interpretation.
That structure helps mixed-background groups stay oriented, even when some learners are new to command-line bioinformatics.
2 What this Vietnamese healthy gut microbiome exercise is
Exercise framing. This notebook uses 6 selected samples from the 2021 study by Pereira-Dias et al. as a compact teaching dataset. The aim is to show how to go from raw shotgun metagenomic reads to interpretable taxonomic and AMR summaries in a way that is practical for workshop learners.
This is not a full replication of the original paper:
- The original study profiled 42 healthy Vietnamese participants.
- The original data were paired-end Illumina shotgun metagenomes deposited in ENA under accessions ERS1865478–ERS1865519.
- The published paper performed broader age-stratified microbiome and AMR analyses than are possible in a 6-sample teaching subset.
- This notebook therefore focuses on workflow demonstration, directional comparison, and cautious interpretation.
The paper used for this exercise is:
Pereira-Dias J, et al. The Gut Microbiome of Healthy Vietnamese Adults and Children Is a Major Reservoir for Resistance Genes Against Critical Antimicrobials. Journal of Infectious Diseases. 2021;224(S7):S840-S847.
Interpretive guardrails used throughout this notebook
- This is a 6-sample demonstration re-analysis; the original paper analyzed 42 individuals.
- Taxonomic ecology in this notebook uses f_weighted_at_rank as the main abundance value.
- The taxonomy column unweighted_fraction should not be assumed to sum to 100%.
- The treatment of unclassified reads materially changes interpretation.
- The AMR workflow here is rgi-bwt against CARD-like annotations, not SRST2 + ARGANNOT as used in the paper.
- Therefore, taxonomic and AMR comparisons are directional and exploratory, not definitive replication claims.
Which statement best captures the purpose of this Vietnamese healthy gut microbiome notebook?
3 metaflow overview
metaflow is a modular Nextflow pipeline for Illumina short-read metagenomics. In workshop terms, it gives learners one practical route from paired-end FASTQ files to read-based taxonomic profiling, host-removal outputs, and AMR screening results without forcing them to build every step from scratch.
Further reading. More detailed setup and usage notes for the pipeline are available in the metaflow wiki. This notebook keeps the workshop version shorter and more task-focused.
4 Software and setup guidance
4.1 Software requirements
For this workshop, learners mainly need:
- git to clone the pipeline,
- nextflow (version 25 or later recommended) to launch the workflow,
- a Java runtime compatible with Nextflow (the wiki recommends Java 16-22),
- and either conda or apptainer to provide software dependencies.
If you are using the prepared workshop machine, these pieces may already be installed.
4.2 Hardware requirements
Practical expectations for a small workshop run are:
- Linux, WSL, or a remote Linux server environment,
- reliable internet access for the initial FASTQ downloads,
- and enough free disk space for FASTQ files, Nextflow cache files, containers, and outputs.
The broader installation guide is more conservative:
- at least 32 GB RAM for MEGAHIT-based assembly,
- up to 128 GB RAM if metaSPAdes is used,
- and roughly 256 GB disk once databases, environments, caches, and outputs are included.
For this six-sample workshop, storage pressure is often felt before CPU becomes the main bottleneck.
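A quick generic check before the first downloads is to confirm how much disk is free where the run will write:

```shell
# Free space in the current working directory's filesystem;
# compare against the FASTQ, cache, and output estimates above.
df -h .
```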
4.3 How to clone the repository
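The clone step itself is a single command. A minimal sketch is shown below; the repository URL is a placeholder and should be taken from the metaflow wiki, and the destination path is illustrative:

```shell
# Substitute the real repository URL from the metaflow wiki;
# the destination directory is only an example.
git clone https://example.org/metaflow.git /path/to/metaflow
```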
4.4 Optional dependency setup
These are concise workshop-style reminders rather than full installation manuals.
4.4.1 Nextflow
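One common installation route is the self-installing launcher from the Nextflow project (assumes curl and a working Java runtime are already available; the ~/bin destination is an example):

```shell
# Fetch the Nextflow launcher, put it on PATH, and confirm the version
curl -s https://get.nextflow.io | bash
chmod +x nextflow
mv nextflow ~/bin/   # or any other directory on your PATH
nextflow -version
```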
4.4.2 Java
Install a recent Java runtime and confirm it works:
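For example:

```shell
# Print the active Java version; Nextflow needs a compatible runtime
# (the wiki recommends Java 16-22)
java -version
```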
4.4.3 Conda
If conda is not already available, install Miniconda or a similar lightweight distribution and then create or activate the workshop environment:
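A minimal sketch, assuming the prepared environment is the microgen2026 environment activated later in this notebook:

```shell
# Verify conda is on PATH, then activate the prepared workshop environment
conda --version
conda activate microgen2026
```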
4.4.4 Apptainer
If your site prefers containers instead of Conda environments, ensure apptainer is installed and available:
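A one-line check is usually enough:

```shell
# Confirm apptainer is installed and callable
apptainer --version
```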
Which setup combination is most central to launching the workshop pipeline successfully?
4.5 Database note
metaflow depends on several prebuilt resources, especially for read-based analysis. In a normal installation, users must prepare these databases before running the pipeline. For this workshop, they have already been prepared to save time.
Available workshop databases:
- sourmash (GTDB-reps-rs226, ksize=31, scaled=1000): /database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31.rocksdb
- YACHT (GTDB-reps-rs226, ksize=31, ANI threshold 0.995): /database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31_0.995_pretrained
- sourmash taxonomy lineage: /database/gtdb/gtdb-rs226/gtdb-rs226.lineages.sqldb
- hostile (human-t2t-hla): /datasets/human_genome/hostile
- CARD database for rgi_bwt: /database/rgi_db
Workshop shortcut. Because these resources already exist, the practical job for learners is usually to point the config file to the correct prepared paths rather than building the databases from scratch.
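Before launching, it can be worth confirming that each prepared resource is visible from your session. This small loop (paths copied from the list above) reports reachability without touching the databases:

```shell
# Report which of the prepared workshop resources are reachable
for p in \
  /database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31.rocksdb \
  /database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31_0.995_pretrained \
  /database/gtdb/gtdb-rs226/gtdb-rs226.lineages.sqldb \
  /datasets/human_genome/hostile \
  /database/rgi_db
do
  if [ -e "$p" ]; then echo "OK       $p"; else echo "MISSING  $p"; fi
done
```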
4.6 Practical cautions
- If you plan to run jobs overnight, keep the session inside screen or tmux.
- metaflow writes cache files wherever the nextflow run command is launched, so start the run from a directory with enough storage.
- Do not reuse the same Nextflow cache folder or the same output folder for multiple active runs at the same time.
5 CSV input format for metaflow
According to the current Usage.md, the documented samplesheet format for CSV input is:
sample_id,run_id,group,short_reads_1,short_reads_2,long_reads
For this workshop, you can think of that file as a sample manifest: one row per paired-end sample, with optional metadata columns that help the pipeline track multiple runs or groups.
Practical requirements from the wiki:
- use a .csv file,
- include at least the short-read columns,
- keep run_id unique if one biological sample has more than one sequencing run,
- and leave long_reads blank for this short-read-only exercise.
sample_id,run_id,group,short_reads_1,short_reads_2,long_reads
ERR9904447,,0,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904447_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904447_2.fastq.gz,
ERR9904449,,1,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904449_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904449_2.fastq.gz,
ERR9904450,,1,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904450_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904450_2.fastq.gz,
ERR9904452,,0,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904452_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904452_2.fastq.gz,
ERR9904458,,1,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904458_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904458_2.fastq.gz,
ERR9904480,,0,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904480_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904480_2.fastq.gz,
Column guide:
- sample_id: the sample identifier that should stay stable through the workflow; here we use the ENA run accession so downstream file matching stays simple
- run_id: optional run label; leave blank when each sample has only one run
- group: optional grouping label; simple numeric or text labels are fine for this workshop
- short_reads_1: full path to mate 1
- short_reads_2: full path to mate 2
- long_reads: left blank because this exercise uses Illumina paired-end short reads only
Practical tip. Absolute paths are safest for workshop workstations. If you use relative paths, make sure they resolve correctly from the directory where the pipeline is launched.
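One lightweight way to catch path mistakes early is to scan the manifest and flag unreadable FASTQ paths before Nextflow ever starts. The sketch below uses a throwaway toy manifest with deliberately missing files; in practice, point csv at vietnamese_gut_sample.csv instead:

```shell
# Flag samplesheet rows whose FASTQ paths are missing or unreadable.
# A throwaway toy manifest is used here; substitute vietnamese_gut_sample.csv.
csv=$(mktemp)
cat > "$csv" <<'EOF'
sample_id,run_id,group,short_reads_1,short_reads_2,long_reads
ERR9904447,,0,/nonexistent/ERR9904447_1.fastq.gz,/nonexistent/ERR9904447_2.fastq.gz,
EOF

missing=0
while IFS=, read -r sid run grp r1 r2 lr; do
  [ "$sid" = "sample_id" ] && continue      # skip the header row
  for f in "$r1" "$r2"; do
    if [ ! -r "$f" ]; then
      echo "MISSING: $sid -> $f"
      missing=$((missing + 1))
    fi
  done
done < "$csv"
echo "$missing unreadable path(s)"
# -> 2 unreadable path(s) for the toy manifest above
```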
Which samplesheet format best matches the current metaflow wiki for this workshop?
6 Workshop config: microgen2026.config
The workshop pipeline run below uses: microgen2026.config
Current config content:
params {
trace_report_suffix = new java.util.Date().format('yyyy-MM-dd_HH-mm-ss')
enable_readbase = true
input = "day-4/materials/vietnamese-healthy-gut-microbiome"
input_format = "csv"
outdir = "day-4/materials/vietnamese-healthy-gut-microbiome/out"
hostile_reference = "/datasets/human_genome/hostile"
hostile_index = "human-t2t-hla"
sourmash_database = "/database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31.rocksdb"
sourmash_ksize = 31
sourmash_taxonomy_csv = "/database/gtdb/gtdb-rs226/gtdb-rs226.lineages.sqldb"
yacht_database = "/database/gtdb/gtdb-rs226/gtdb-rs226-reps.k31_0.995_pretrained/gtdb-rs226-reps.k31_0.995_pretrained_config.json"
enable_rgi_bwt = true
rgi_preparecarddb_dir = "/database/rgi_db"
rgi_include_other_models = true
//enable_singlesketch = true
}
process {
cpus = { 1 * task.attempt }
memory = { 7.GB * task.attempt }
time = { 4.h * task.attempt }
errorStrategy = { task.exitStatus in ((130..145) + [104, 134, 137, 139, 140, 143]) ? 'retry' : 'finish' }
maxRetries = 3
maxErrors = '-1'
withLabel: process_single {
cpus = { 2 }
memory = { 7.GB * task.attempt }
time = { 4.h * task.attempt }
}
withLabel: process_low {
cpus = { 6 * task.attempt }
memory = { 12.GB * task.attempt }
time = { 4.h * task.attempt }
}
withLabel: process_medium {
cpus = { 12 * task.attempt }
memory = { 36.GB * task.attempt }
time = { 8.h * task.attempt }
}
withLabel: process_high {
cpus = { 24 * task.attempt }
memory = { 120.GB * task.attempt }
time = { 24.h * task.attempt }
}
withLabel: process_long {
time = { 72.h * task.attempt }
}
withLabel: process_high_memory {
memory = { 120.GB * task.attempt }
}
}
trace {
enabled = true
file = "${params.outdir}/pipeline_info/trace_complete_${params.trace_report_suffix}_rawtrace.txt"
raw = true
fields = 'task_id,hash,native_id,process,tag,name,status,exit,attempt,submit,start,complete,duration,realtime,queue,cpus,memory,disk,time,cpu_model,hostname,%cpu,%mem,peak_rss,peak_vmem,rss,vmem,rchar,wchar,syscr,syscw,read_bytes,write_bytes,vol_ctxt,inv_ctxt,module,container,workdir,error_action'
}
timeline {
enabled = true
file = "${params.outdir}/pipeline_info/timeline_${params.trace_report_suffix}.html"
}
report {
enabled = true
file = "${params.outdir}/pipeline_info/report_${params.trace_report_suffix}.html"
}
Key parameters that matter most for this exercise:
- input: points to the sample manifest. For this exercise, learners should edit it to the final path of vietnamese_gut_sample.csv.
- outdir: the base output directory for the run. For this exercise, set it to /day-4/materials/vietnamese-healthy-gut-microbiome/out.
- enable_readbase: switches on the read-based portion of the pipeline, which is central to this workshop.
- hostile_reference and hostile_index: control host-removal resources. Learners normally leave these alone unless their host database changes.
- sourmash_database, sourmash_ksize, and sourmash_taxonomy_csv: control read-based taxonomic profiling and lineage annotation.
- yacht_database: points to the prepared YACHT model configuration JSON inside the prebuilt workshop database directory.
- enable_rgi_bwt, rgi_preparecarddb_dir, and rgi_include_other_models: control the AMR screening step and the CARD resources used for it.
- process { ... }: sets CPU, memory, time, and retry behavior. Learners usually do not edit this section during a short workshop unless the execution environment is very constrained.
Why use -c here? The broader metaflow usage notes prefer parameter files for portable runs, but this workshop intentionally uses a prepared Nextflow config because it already bundles the local database paths and resource defaults for the training environment.
Required edits before the Vietnamese healthy gut microbiome run.
- Change input to the path of vietnamese_gut_sample.csv.
- Change outdir to /day-4/materials/vietnamese-healthy-gut-microbiome/out.
Before running the workshop job, which config variables are the most important to update first?
7 Hands-on Vietnamese healthy gut microbiome practical exercise
7.1 Part 1: introduce the paper
The workshop paper is a short, practical example for learning age-stratified gut metagenomics and AMR interpretation:
- Pereira-Dias J, et al. (2021), Journal of Infectious Diseases, doi:10.1093/infdis/jiab398
We use it here because it has a clear microbiome-plus-AMR question, paired-end ENA data, and a manageable six-sample subset for teaching. The goal is to demonstrate workflow and interpretation, not to recreate every result in the paper.
7.2 Part 2: retrieve the sequencing data
Inside /day-4/materials/vietnamese-healthy-gut-microbiome, learners are given:
vietnamese_gut_sample_information.csv
The task is to find and download all samples whose identifiers appear in the studycode column.
7.2.1 Guided ENA retrieval exercise
- Start from one study sample accession given in the paper, for example ERS1865478.
- Visit https://www.ebi.ac.uk/ena/browser/view/ERS1865478.
- Open the accession page and go to the Data section.
- Click Read Files: Show.
- Identify the Study Accession for the parent project.
- Open that study accession page.
- Click Show Column Selection.
- Add the fields you need to recover the FASTQ download links.
- The most important field is
fastq_ftp. - Also keep fields such as
sample_title,sample_accession, andrun_accessionso you can map records back to the workshop metadata. - Download the study report as TSV.
- In the TSV, use
sample_titleto match thestudycodevalues listed invietnamese_gut_sample_information.csv. - Extract the paired-end FASTQ URLs for the six required samples.
Why use sample_title? In this exercise, sample_title is the practical bridge between the ENA report and the workshop metadata because it corresponds to the studycode values that learners are asked to follow.
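That matching step can be scripted. The sketch below builds a toy two-row stand-in for the ENA study report (real reports come from the TSV download) and keeps only rows whose sample_title appears in a studycode list; the column names follow the fields described above, while the file names and values are illustrative:

```shell
# Toy stand-ins for the ENA study report and the studycode list
printf 'run_accession\tsample_title\tfastq_ftp\n'                 >  ena_report.tsv
printf 'ERR9904447\tHC001\tftp://example/ERR9904447_1.fastq.gz\n' >> ena_report.tsv
printf 'ERR9999999\tHC099\tftp://example/ERR9999999_1.fastq.gz\n' >> ena_report.tsv
echo 'HC001' > studycodes.txt

# Keep fastq_ftp values for rows whose sample_title is a wanted studycode;
# column positions are looked up from the header, not hard-coded.
awk -F'\t' '
  NR==FNR { want[$1]; next }                       # file 1: studycodes to keep
  FNR==1  { for (i=1;i<=NF;i++) col[$i]=i; next }  # header -> column index
  ($(col["sample_title"]) in want) { print $(col["fastq_ftp"]) }
' studycodes.txt ena_report.tsv
# -> ftp://example/ERR9904447_1.fastq.gz
```

Looking columns up by header name keeps the snippet working even if learners add or reorder fields in the ENA column selection.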
When learners export the ENA study report, which field is the most important for recovering the paired FASTQ download links?
Hint: example batch download script
#!/bin/bash
mkdir -p fastq && cd fastq || exit 1
urls=(
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/007/ERR9904447/ERR9904447_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/007/ERR9904447/ERR9904447_2.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/002/ERR9904452/ERR9904452_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/002/ERR9904452/ERR9904452_2.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/000/ERR9904480/ERR9904480_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/000/ERR9904480/ERR9904480_2.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/009/ERR9904449/ERR9904449_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/009/ERR9904449/ERR9904449_2.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/000/ERR9904450/ERR9904450_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/000/ERR9904450/ERR9904450_2.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/008/ERR9904458/ERR9904458_1.fastq.gz"
"ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR990/008/ERR9904458/ERR9904458_2.fastq.gz"
)
for url in "${urls[@]}"; do
wget -c "$url"
done
For the remainder of this notebook, assume the paired-end FASTQ files have been downloaded into:
/day-4/materials/vietnamese-healthy-gut-microbiome/fastq
7.3 Create the metaflow input CSV
The input CSV tells metaflow exactly which samples, optional grouping fields, and FASTQ pairs should be processed together. It is the handoff between manual data retrieval and reproducible pipeline execution.
Create:
vietnamese_gut_sample.csv
Example content:
sample_id,run_id,group,short_reads_1,short_reads_2,long_reads
ERR9904447,,adult,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904447_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904447_2.fastq.gz,
ERR9904452,,adult,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904452_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904452_2.fastq.gz,
ERR9904480,,child_24_59m,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904480_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904480_2.fastq.gz,
ERR9904449,,child_0_23m,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904449_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904449_2.fastq.gz,
ERR9904450,,child_0_23m,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904450_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904450_2.fastq.gz,
ERR9904458,,child_0_23m,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904458_1.fastq.gz,/day-4/materials/vietnamese-healthy-gut-microbiome/fastq/ERR9904458_2.fastq.gz,
Column guide:
- sample_id: use a stable identifier, such as the ENA run accession, so downstream outputs stay easy to match
- run_id: leave blank when there is only one sequencing run per sample
- group: optional learner-friendly grouping label; here it helps keep the age strata visible during teaching
- short_reads_1: full path to mate 1
- short_reads_2: full path to mate 2
- long_reads: leave blank for this short-read-only workshop dataset
Before launching the workflow, edit the config so that:
- input points to vietnamese_gut_sample.csv
- outdir points to /day-4/materials/vietnamese-healthy-gut-microbiome/out
7.4 Run the pipeline
Launch the pipeline with:
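The launch command itself is missing here; a reconstruction from the pieces explained in the command breakdown that follows, with the pipeline and config paths left as placeholders to fill in for your machine:

```shell
# Placeholder paths: substitute the real metaflow checkout and config locations
nextflow run /path/to/metaflow/main.nf \
  -c /path/to/microgen2026.config \
  -profile conda
```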
Command breakdown:
- nextflow run: start the workflow
- /path/to/metaflow/main.nf: the main pipeline entry point
- -c /path/to/microgen2026.config: load the workshop config file
- -profile conda: resolve software dependencies through Conda
7.5 Process the metaflow output into analysis-ready files
After the pipeline finishes, the raw metaflow output is still not in the most convenient format for downstream teaching. In this workshop we post-process the merged read-based results into:
- taxonomy tables with and without unclassified rows,
- a phyloseq export directory,
- taxburst-ready split files,
- and a cleaned AMR file collection.
The read-based result table is:
/day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv
The processing helper validated in the current workshop folder is:
/day-4/materials/process_metaflow_daa_microgen2026.py
If your teaching folder also provides a wrapper named process_metagenomic_microgen2026.py, you can substitute only the script filename; the command structure below stays the same.
This matters because the live CLI supports --output_format default, phyloseq, and taxburst, and it has a few rules that differ slightly from the original draft instructions:
- --remove_unclassified is supported for default output,
- taxburst output already omits synthetic unclassified rows, so --remove_unclassified should not be combined with it,
- and phyloseq/taxburst outputs should write into an existing directory.
Suggested workflow:
conda activate microgen2026
mkdir -p /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis
python /day-4/materials/process_metaflow_daa_microgen2026.py \
--input /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv \
--input_format raw_metaflow \
--min_coverage 0.05 \
--output_format default \
--output /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis \
--remove_unclassified
mv /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/merged_sourmash_yacht.csv.processed_metaflow.csv \
/day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut_without_unclassified.csv
python /day-4/materials/process_metaflow_daa_microgen2026.py \
--input /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv \
--input_format raw_metaflow \
--min_coverage 0.05 \
--output_format default \
--output /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis
mv /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/merged_sourmash_yacht.csv.processed_metaflow.csv \
/day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut_with_unclassified.csv
python /day-4/materials/process_metaflow_daa_microgen2026.py \
--input /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv \
--input_format raw_metaflow \
--min_coverage 0.05 \
--output_format taxburst \
--output /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis
mv /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/merged_sourmash_yacht.csv.taxburst \
/day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut.taxburst
python /day-4/materials/process_metaflow_daa_microgen2026.py \
--input /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/Sourmash-YACHT/final_results/final_results/merged_sourmash_yacht.csv \
--input_format raw_metaflow \
--min_coverage 0.05 \
--output_format phyloseq \
--output /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis
mv /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/merged_sourmash_yacht.csv.phyloseq \
/day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut.phyloseq
mkdir -p /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut_amr
find /day-4/materials/vietnamese-healthy-gut-microbiome/out/Readbased_Analysis/rgi_bwt -name '*.gene_mapping_data.txt' \
-exec cp {} /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut_amr/ \;
What each command is doing:
- the first run creates the classified-only table and renormalizes f_weighted_at_rank
- the second run keeps the with-unclassified version
- the third run creates a directory of taxburst-ready split files
- the fourth run writes a phyloseq export directory
- the final find ... cp step collects the per-sample rgi_bwt mapping tables into one AMR folder
After this step, the analysis folder should contain:
- vietnamese_gut.phyloseq
- vietnamese_gut.taxburst
- vietnamese_gut_amr
- vietnamese_gut_with_unclassified.csv
- vietnamese_gut_without_unclassified.csv
7.6 Generate Krona-like taxburst HTML files
Once the processed taxburst folder exists, move into it and render one HTML file per split table.
conda activate microgen2026
cd /day-4/materials/vietnamese-healthy-gut-microbiome/out/vietnamese_gut_analysis/vietnamese_gut.taxburst
for file in *.taxburst.split.csv; do
sample_name="${file%.taxburst.split.csv}"
taxburst -F tax_annotate "$file" -o "${sample_name}.taxburst.split.html"
done
Small logic hint:
- *.taxburst.split.csv matches each sample-level taxburst input file
- the loop strips the suffix to recover the sample name
- taxburst then writes a browsable HTML summary for that sample
Those HTML files work as Krona-like views of sample relative abundance and are useful for quick qualitative inspection in class.
Catch-up option. If learners fall behind, precomputed outputs are already available in /day-4/materials/vietnamese-healthy-gut-microbiome, so the workshop can continue into the R analysis without waiting for every pipeline run to finish.
Why does the workflow deliberately produce both vietnamese_gut_with_unclassified.csv and vietnamese_gut_without_unclassified.csv?
Continue to Taxonomy Analysis.