1.3 From raw sequencing data to assemblies

Author

Nguyen MTS

Published

May 12, 2026

Introduction

From now until the end of Day-3, we will work with a real-world dataset: a nosocomial K. pneumoniae outbreak in Italy.

The dataset is under the BioProject ID: PRJNA1260529, acompanied by the publication of Captani et al, Euro Surveill, 2026. It describes an outbreak of multidrug-resistant and carbapenemase-producing K. pneumoniae ST147 in the sub-intensive care unit (S-ICU) of a tertiary care hospital in Italy between mid-February and early March 2025. NDM-producing K. pneumoniae isolates were sequenced using both Illumina short-read and Oxford Nanopore long-read technologies. The sequencing data helped illuminating the transmission dynamics of the outbreak and the genomic features of the isolates.

Using this dataset, we will learn how to transform “raw” sequencing data into “intepretable” genomes, and get a practical experience with a typical workflow in genomic epidemiology in the context of infectious disease outbreaks.

Our work-flow will comprise the following main processes:

Quality assessment (and improvement) of raw sequencing data
Genome assembly
Species identification and MLST typing
AMR gene profiling
Genome annotation
Gene variation analysis
Phylogenetic analysis

This module will guide you through the first 2 processes.

We hope you will enjoy the flow. Now, let’s begin!

Data directory

If you have completed Day-0, you probably have downloaded the dataset in the last exercise. But if you haven’t, no worries, we’ve got you covered! The data is pre-loaded on your instance under these directories:

/data/raw_shortread: Short-read data from Illumina - our focus today.
/data/raw_longread: Long-read data (Oxford Nanopore) - we will study in module 2.2.

We will also need metadata information of the dataset, such as sample name and sequencing technique for each data file. These are stored in the tab-delimited file PRJNA1260529.metadata.tsv.

Let’s take a look at the files in the raw_shortread directory:

ls raw_shortread/

You shall see a directory structure that looks like this:

├── raw_shortread 
│   ├── 925Kp_1.fastq.gz 
│   ├── 925Kp_2.fastq.gz 
│   ├── 930Kp_1.fastq.gz 
│   ├── 930Kp_2.fastq.gz 
│   ├── ...

Try to answer the following questions:

How many files are there?
What are the sizes of these files?
What do the file names tell you?

Challenge

Check if the files are truly corresponding to short-read sequencing data, but not the long-read, using command-lines.

Hint: Use the metadata file. Loop (recursively run) through the forward read files to get the sample names.

Code

metadata="path-to-metadata-file" 
for file in *_1.fastq.gz; do # start of the loop

    sample=${file%%_*}; # split the filename by the character "_" and get the first part - the sample name
    
    # add your code to search for the sample and get its sequencing tech in the metadata file

done # end of the loop

FASTQ files

Raw sequencing data is stored in FASTQ format. The filename usually ends with .fastq (or .fq), or .fastq.gz (or .fq.gz) if they are compressed.

For paired-end short-reads, each sample has two files: one for forward reads (_1) and one for reverse reads (_2).

In our dataset, the files are named as {sample-name}_1.fastq.gz and {sample-name}_2.fastq.gz for forward and reverse reads, respectively.

Each FASTQ file contains four lines for every single read:

Sequence ID: The “name” of the read (starts with @)
The DNA sequence: The actual A, T, C, and G bases.
Separator: Usually just a + sign or + followed by the sequence ID.
Quality Scores: Symbols representing how confident the sequencing machine is about each base (Phred scores).

Let’s try printing out the first 12 lines of one of the FASTQ files:

zcat raw_shortread/925Kp_1.fastq.gz | head -n12; # zcat is cat for gzipped files

The output should look like below:

@SRR33475255.1 VH01637:40:AAGVVC2M5:1:1101:22246:1000 length=301
AAGTACCTCAACGGCCATTCCGATGTCGTCGCCGGCCTGGCGGTGGTTGGCGATAACTCCGGGCTCGCGGAAAAGCTTGGCTATCTGCAGAATGCGGTAGGCGGCGTGCTTGACCCGTTCAGCAGCTTCCTGACGCTGCGCGGCATCCGTACCCTGGCGCTGCGGATGGAGCGCCACAGCGCGAACGCGCTGCAGCTCGCCGAATGGCTGGAACAGCAGCCGGAAGTGGAGCGAGTCTGGTTTCCCTGGCTGGCCTCGCACCCGCATCATCAGCTGGCGCGCCAGCAAATGGCCCTGCCCG
+SRR33475255.1 VH01637:40:AAGVVC2M5:1:1101:22246:1000 length=301
CCCCCC*CCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5C5CCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC*CCCCCCCCCCC5CCCCCCCC5CCCCC5CCCCCCC5CCCCCCCCCCCCCCC*CCCCCCCCCCCCCC5*CCCC5CCCCCC5CCCC*CCCCCCCCCCCCCCCCCCCCCC*CCC*C5CCCCCCCCCCCC*CCCCCCCCCCCC5CCCCCCCCC*CC*CCC55*5CC55*5C*C555C5CCC5C**
@SRR33475255.2 VH01637:40:AAGVVC2M5:1:1101:28797:1000 length=199
ATGTTGACCTGTAAGAACGAAAACACTGTTCTTTCCTGTGTGTCTTCCATTTCTTAGAACAACGTGCATATCTATTCTTGCATTTAAAGTCGCTTCGTCAATCCCCTATATTTCTAGGCGGGTATCTTTTAATGCAGATGGGAATACTCATCTCATAAACAGTGAATAATTATAAAGATGAAATCACACGGTTCTATTT
+SRR33475255.2 VH01637:40:AAGVVC2M5:1:1101:28797:1000 length=199
*CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCC*CCCCC5CCC*5CCCCCCC5CCCCCCC5CCC*C55CCCCC5
@SRR33475255.3 VH01637:40:AAGVVC2M5:1:1101:32244:1000 length=295
CGGTAACTGATGCCGTATTTGCAGTACCAGCGTACGGCCCACAGAATGATGTCACGCTGAAAATGCCGGCCTTTGAATGGGTTCATGTGCAGCTCCATCAGCAAAAGGGGATGATAAGTTTATCACCACCGACTATTTGCAACAGTGCCCATCTCCGGGTTTATTATATTCGGGCGCATTTGCGCGCGGCAGCTGCTGACTGTTCAGTCTATGTACGCGGAGCCGTTCATGGAAATATTCGCGCAGGTGTTCAGGCTGTTCCCGTTCAACAAGTTCCGGCACGACGGGCTTGTTC
+SRR33475255.3 VH01637:40:AAGVVC2M5:1:1101:32244:1000 length=295
*CCC5CCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCCCCCCCCCC5CCCCCCC5CCCCCCCCCCCCCC5CCCC*CCCC*CCCCC5CCC*CCCCC5CCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCCCCCCC*CCCCCCC5CCCCCCCCCCCCCCCCCCCCCCCCCCCC5CCCCCCCCCC5C5C5CCCCCCCCCCCCCC

Can you tell how many reads are there in this example output? And what are their IDs, base sequences, and lengths?

Challenge

Count the number of forward and reverse reads of sample AMB201Kp in a single command-line.

Hint: Use zgrep (grep for gzipped files) to catch only the sequence ID lines (starting with @)

Correct output: forward = reverse = 617,535 reads

Reads Quality Control (QC)

We shouldn’t trust our data blindly. We will use FastQC to generate a report on things like base quality and adapter contamination.

Since we have many samples, we will then use MultiQC to summarize all outputs from FastQC into a single, interactive HTML report for all samples.

The following command-lines call the two programs on our short-read data:

in_dir="raw_shortread/" # directory for input FASTQ files

fastqc_outdir="fastqc_shortread/" # directory for FastQC outputs
mkdir -p $fastqc_outdir

multiqc_outdir="multiqc_shortread/" # directory for MultiQC output
mkdir -p $multiqc_outdir

# activate the environment with fastqc & multiqc installed
conda activate multiqc 

nthreads=5 # allows analysing 5 samples in parallel

# run fastqc on all fastq files
fastqc -o $fastqc_outdir -t $nthreads $in_dir/*.fastq.gz 

# run multiqc on all fastqc outputs
multiqc --interactive --force -o $multiqc_outdir $fastqc_outdir

MultiQC report

Now you can open the file multiqc_shortread/multiqc_report.html in your browser to view the report. Or in the VSCode, right click on the file and select “Open Preview”.

Let’s go through the report and answer the following questions:

What was the coverage?
What is the overall quality of the raw reads?
Are there any adapter contamination?
Are there any issues with the sequencing?

Making decision

Based on the MultiQC report, do you want to improve the raw reads quality?

If yes, what specific modification to the raw reads would you like to perform? If no, then why?

Assembly

Assembly is the process of building the full genome by finding overlaps between short reads. We will use Unicycler, which is highly regarded for bacterial assembly because it acts as an optimizer for the SPAdes assembler.

The syntax for a command that run unicycler on a single sample with paired-end reads is as follows:

unicycler -1 $forward -2 $reverse -o $outdir --threads 4

where

-1 is the forward read file
-2 is the reverse read file
-o is the output directory
--threads is the number of threads to use, higher number of threads will speed up the assembly process but will also consume more computing resources.

To do the assembly for our batch of 12 samples in a time-efficient manner, we will automate this using GNU Parallel.

# create a directory for assembly output
mkdir -p unicycler_shortread

# create a tsv file with sample names and their corresponding fastq files
while read -r sample; do
    echo -e ${sample}"\t"raw_shortread/${sample}_1.fastq.gz"\t"raw_shortread/${sample}_2.fastq.gz >> samples_shortread.tsv;
done < <(grep "Illumina MiSeq" PRJNA1260529.metadata.tsv | cut -f28)

# run unicycler on 6 samples at a time
conda activate unicycler
parallel -j 6 --colsep "\t" unicycler -1 {2} -2 {3} -o unicycler_shortread/{1} --threads 4 :::: samples_shortread.tsv

Self-Exploration

Improve reads quality for assembly

Trimming and adapter removal can be done with FASTP or Trimmomatic

Example command-lines:

Is my isolate really a K. pneumoniae?

Species check can be done directly on the raw short-reads sequencing data, thanks to techniques from metagenomics.

SYLPH_DB="sylph-db/gtdb-r220-c200-dbv1.syldb"
sylph profile $SYLPH_DB -1 raw_shortread/*_1.fastq.gz -2 raw_shortread/*_2.fastq.gz -o sylph_shortread.tsv