0.2 Commands & Environments

Author

Nguyen MTS

Published

May 11, 2026

Why Linux?

You might wonder why we aren’t using Windows or macOS for this. There are three main reasons:

  1. Big Data: Genomic datasets are often too large for Excel or standard text editors. Data at that scale calls for large servers and computing clusters, and these almost always run Linux.
  2. Automation: In genomics, we often need to run the same analysis on hundreds of samples. The command-line interface (CLI) lets us automate this with scripts.
  3. Reproducibility: An analysis is a series of commands given to the Linux computer, and those commands are plain text. That makes it easy to record exactly what we ran, and to share and reproduce the analysis.

We give commands to the Linux computer through the Terminal. For many, the “black screen” of the terminal can be intimidating, but it is one of the most powerful tools in a bioinformatician’s toolkit.

Intro to Linux

The Terminal

The Terminal (or console) is your window into the operating system. Inside it runs a Shell, the program that interprets the commands you type and tells the computer what to do. On most modern systems, the default shell is bash or zsh.

  • The Prompt: Usually looks like user@computer:~$. The $ indicates that the shell is waiting for your input.
  • Case Sensitivity: Linux is case-sensitive. File.txt and file.txt are two different files.

The Command Line

A command usually follows this structure: command [options] [arguments]

Examples:

# show Linux OS name, version, and hardware architecture
hostnamectl 

# list CPU information
lscpu

echo "Hello World"
  • echo is the command (print text).
  • "Hello World" is the argument (the text to print).

ls -l genomes/
  • ls is the command (list files).
  • -l is an option (long format).
  • genomes/ is the argument (the folder to look into).

Getting Help

You don’t need to memorize every command. Here is how to find out what a command does:

  • command --help: Most tools have a built-in help menu (e.g., ls --help).
  • man command: Opens the “manual” page (e.g., man ls). Press q to exit.
  • Search & AI: Google, Stack Overflow, and AI tools are your friends when you get stuck!

The File System

Think of the Linux file system as a giant tree.

  • Root (/): The base of the entire system.

  • Home Directory (~): Your personal space where you have full control.

  • Path: The “address” of a file or folder.

    • Absolute Path: Starts from the root (e.g., /home/user/data/sample.fastq).

    • Relative Path: Starts from where you currently are (e.g., data/sample.fastq).
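
To make the difference between the two kinds of path concrete, here is a minimal sketch. The directory layout and the tiny FASTQ file are invented for illustration and are created in a temporary directory, so nothing in your home directory is touched.

```shell
# Hypothetical layout, created in a temporary directory for illustration
workdir=$(mktemp -d)
mkdir -p "$workdir/data"
printf '@read1\nACGT\n+\nIIII\n' > "$workdir/data/sample.fastq"

cd "$workdir"   # move into the working directory
pwd             # print the absolute path of where you are now

# Relative path: resolved from the current directory
rel_lines=$(wc -l < data/sample.fastq)

# Absolute path: starts at / and works from anywhere
cd /
abs_lines=$(wc -l < "$workdir/data/sample.fastq")

echo "relative: $rel_lines lines, absolute: $abs_lines lines"
```

Both paths reach the same file; the relative one only works while you are inside the working directory.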

Files in Linux

Anatomy of a File Name

In Windows, the file extension (.docx, .exe) tells the system what the file is. In Linux, extensions are mostly for humans. Tools such as the file command identify a file by looking at its content (specifically the “magic bytes” at the beginning), not its name.

You can name a file anything:

  • genome.fasta (Helpful)
  • ingredients_list.banhmi (Valid)
  • data_source.nuocmam (Valid)

However, we usually keep standard extensions so we don’t confuse ourselves or our colleagues!
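
The idea of magic bytes can be checked by hand. The sketch below assumes gzip and od are available (they are on most Linux systems) and uses a deliberately misleading file name, notes.txt, invented for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Compress some text, then give the result a misleading name
printf 'ACGT\n' | gzip -c > notes.txt

# A gzip stream always starts with the two bytes 1f 8b, whatever the file is called
magic=$(od -An -tx1 -N2 notes.txt | tr -d ' \n')
echo "first two bytes: $magic"

# gzip looks at the content, not the extension, so decompression still works
content=$(gzip -cd < notes.txt)
echo "decompressed: $content"

# If the file utility is installed, it also reports the type from the content
if command -v file > /dev/null; then
    file notes.txt
fi
```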

Types of Files (in Genomics)

  1. Text Files: Plain text you can read. Examples: .fasta, .fastq, .csv, .sh.
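  2. Compressed Files: To save space, we often compress files. Common extensions: .gz, .zip, .tar.gz. Tools such as zcat, zless, and zgrep, as well as many bioinformatics programs, can read .gz files directly without decompressing them to disk first.
  3. Binary Files: Not human-readable. These are programs or optimized data formats (e.g., compiled tools, .bam or .bcf files).
  4. Executable Files: Files that can be run as programs, such as shell, Python, or R scripts (.sh, .py, .R) and compiled tools. What makes a file executable is its execute permission, not its extension.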
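
As a rough illustration of working with compressed text, the sketch below builds a hypothetical two-read FASTQ file, compresses it, and then reads it back as a stream. The file name reads.fastq is an assumption made for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical two-read FASTQ file
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > reads.fastq
gzip reads.fastq   # produces reads.fastq.gz and removes the original

# Stream the decompressed text to other tools; nothing is written to disk
# (on Linux, zcat is a common shorthand for gzip -cd)
gzip -cd reads.fastq.gz | head -n 4

# Count reads: one FASTQ record is 4 lines
n_lines=$(gzip -cd reads.fastq.gz | wc -l)
n_reads=$((n_lines / 4))
echo "reads: $n_reads"
```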

Peeking into File Contents

We rarely open huge genomics files in editors. We use these instead:

  • cat: Print the whole file (only for small files!).
  • head: Show the first few lines.
  • tail: Show the last few lines.
  • less: The “gold standard” for viewing large files. Use arrows to scroll, / to search, and q to quit.
Practice
# Print the whole file
cat README.txt

# Show the first 10 lines
head -n 10 README.txt

# Show the last 5 lines
tail -n 5 README.txt

# Less is More
less README.txt

Creating & Managing Files

  • mkdir: Create a new directory (which is just a special type of file!).
    • mkdir raw_reads
  • cp: Copy files or directories.
    • cp source.txt destination.txt
  • mv: Move or rename files.
    • mv old_name.txt new_name.txt
  • rm: Remove (delete) files. Be careful! There is no “Trash”.
    • rm -r folder_name: Recursively delete a directory.
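
The four commands above can be tried safely in a throwaway directory. This is only a sketch; notes.txt and notes_v1.txt are placeholder names.

```shell
workdir=$(mktemp -d)
cd "$workdir"

mkdir raw_reads                                # create a directory
echo "sample notes" > notes.txt                # create a small file to work with

cp notes.txt raw_reads/notes.txt               # copy the file into the directory
mv raw_reads/notes.txt raw_reads/notes_v1.txt  # rename the copy

listing=$(ls raw_reads)                        # check what the directory holds now
echo "$listing"

rm notes.txt                                   # delete a single file
rm -r raw_reads                                # delete the directory and everything in it
```

When deleting by hand, rm -i asks for confirmation before each removal, which is a cheap safeguard while you are learning.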
Challenge 1: Exploring the Directory
  1. Navigate to the day-0/materials/ directory.
  2. List all files and find out which one is the largest.
  3. Create a new directory called results.
  4. Copy README.txt into the results directory.
  5. Rename README.txt to README_backup.txt.
  6. Remove README_backup.txt.

Text Analysis & Manipulation

grep: Search for a pattern

wc: Count lines, words, and characters

cut: Cut out sections from each line of files

paste: Merge lines of files

tr: Translate or delete characters

sed: Stream editor for text manipulation
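
A short sketch of each of the six tools follows, run on a small table invented for this example (isolates.csv and its columns are hypothetical, not part of the course materials). It shows one typical use per tool rather than a complete reference.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical table of isolates, invented for this example
cat > isolates.csv <<'EOF'
id,host,city
iso_a,human,Hanoi
iso_b,bat,Hue
iso_c,human,Hue
EOF

# grep: keep only the lines that contain a pattern (-c counts them instead)
grep "human" isolates.csv
n_human=$(grep -c "human" isolates.csv)

# wc: -l counts lines (here: header + 3 records)
n_lines=$(wc -l < isolates.csv)

# cut: pick columns; -d sets the delimiter, -f the field number(s)
cut -d ',' -f 1 isolates.csv > ids.txt
cut -d ',' -f 3 isolates.csv > cities.txt

# paste: glue files together side by side (tab-separated by default)
paste ids.txt cities.txt > id_city.tsv
cat id_city.tsv

# tr: replace or delete single characters, reading from standard input
upper=$(echo "acgt" | tr 'a-z' 'A-Z')
echo "$upper"

# sed: search and replace with s/old/new/ on every line
sed 's/^iso_/isolate_/' isolates.csv > renamed.csv
cat renamed.csv
```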

Challenge 2: Extracting Information
  1. Print the first 20 entries of metadata.csv.
  2. Find all entries in metadata.csv that belong to the year “2024”.
  3. Get all sample names.
  4. Remove _1 and _2 from the sample names.
  5. Change filenames from .fq.gz to .fastq.gz.

Building Complexity from Simple Operations

Pipes (|)

The pipe takes the output of one command and sends it as input to another. This allows you to build complex workflows by chaining simple commands.

Example: Count how many sequences are in a FASTA file. FASTA headers start with >. We can find all lines starting with > and count them:

grep "^>" genome.fasta | wc -l
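
Pipes can be chained as many times as needed. As a sketch, the pipeline below counts how many records each species has in a small FASTA file; the file viruses.fasta and its headers are invented for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical FASTA file: three records from two species
cat > viruses.fasta <<'EOF'
>seq1 Zaire ebolavirus
ACGTACGT
>seq2 Sudan ebolavirus
TTGACA
>seq3 Zaire ebolavirus
GGCATT
EOF

# headers -> species name (field 2 onward) -> sort -> count duplicates -> largest count first
counts=$(grep "^>" viruses.fasta | cut -d ' ' -f 2- | sort | uniq -c | sort -k1,1nr)
echo "$counts"
```

Each stage does one small job; uniq -c only merges adjacent duplicates, which is why the first sort is needed.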

Standard Input, Output, and Error

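  • Standard In (stdin): Where the command reads its input from (the keyboard, a file, or a pipe).
  • Standard Out (stdout): Where the command sends its regular results (by default, the screen).
  • Standard Error (stderr): Where the command sends error messages (also the screen by default).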
  • Redirection:
    • command > file: Save output (stdout) to a file (overwrites).
    • command >> file: Append output (stdout) to a file (adds to the end).
    • command 2> file: Redirect error messages (stderr) to a file.
    • command > file 2>&1: Redirect both stdout and stderr to a single file.

Example: Save the list of files to a text file.

ls -lh > file_list.txt
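
The sketch below exercises each redirection on purpose by asking ls for one file that exists and one that does not. The file names are placeholders; the "|| true" is only there because ls reports failure when a file is missing.

```shell
workdir=$(mktemp -d)
cd "$workdir"
echo "hello" > present.txt

# ls prints the existing file on stdout and complains about the missing one on stderr
ls present.txt missing.txt > found.txt 2> errors.txt || true

# Append a second run to the same file instead of overwriting it
ls present.txt >> found.txt

# Capture both streams in a single log
ls present.txt missing.txt > all.log 2>&1 || true

# "<" feeds a file to a command's standard input
n_found=$(wc -l < found.txt)
echo "lines in found.txt: $n_found"
cat errors.txt
```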

Variables

Variables store information for later use. In scripts, we often store filenames in variables.

Example:

MY_SAMPLE="Sample_A1"
echo $MY_SAMPLE
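
A few more things that variables can do, sketched with made-up sample and file names. Note that there must be no spaces around the = sign in an assignment.

```shell
SAMPLE="Sample_A1"
READS="${SAMPLE}_1.fq.gz"   # braces mark where the variable name ends
echo "Input file: $READS"

# Command substitution: store the output of a command in a variable
TODAY=$(date +%Y-%m-%d)
echo "Run date: $TODAY"

# Parameter expansion: strip a suffix without calling another program
PREFIX="${READS%.fq.gz}"
echo "Prefix: $PREFIX"

# Single quotes switch expansion off; this prints the literal text $SAMPLE
echo '$SAMPLE'
```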
Challenge 3: Extracting sample names
  1. Extract the names of all samples from the year “2024”.
  2. Save the results to a file called sample_2024.txt.
Challenge 4: Extracting sequences
  1. We have a file called ebola_virus.fasta.
  2. Find the sequence of the record with accession number NC_045512.2
  3. Find the length of that sequence.
  4. Extract the above sequence to a file called NC_045512.2.fasta.

Bash Script - A Flow of Commands

The example script below downloads the SARS-CoV-2 reference genome from NCBI.

# View the content of the script
cat download_covid_genome.sh
download_covid_genome.sh
#!/bin/bash
# Download the SARS-CoV-2 reference genome from NCBI

# Create a directory to store the genome
mkdir -p sars-cov-2

# Download the genome
wget https://www.ncbi.nlm.nih.gov/nuccore/MN908947.3/fasta -O sars-cov-2/MN908947.3.fasta

# List the downloaded file
ls -lh sars-cov-2/MN908947.3.fasta
# Run the script by calling the bash interpreter
bash download_covid_genome.sh

# Make the script executable
chmod +x download_covid_genome.sh

# Run the script by calling the script directly
./download_covid_genome.sh
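
Scripts become more reusable when they take arguments. The sketch below is not part of the course materials: plan_download.sh is a hypothetical script that only prints what it would do, so it needs no network access. It also shows both ways of running a script.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Write a tiny hypothetical script; the quoted 'EOF' keeps $1 literal until the script runs
cat > plan_download.sh <<'EOF'
#!/bin/bash
# Usage: plan_download.sh ACCESSION
set -euo pipefail    # stop on errors and on unset variables

accession="$1"       # first command-line argument
outdir="genomes"

mkdir -p "$outdir"
echo "Would save ${accession} to ${outdir}/${accession}.fasta"
EOF

# Run it through the interpreter (no execute permission needed)
out1=$(bash plan_download.sh MN908947.3)
echo "$out1"

# Or make it executable and call it directly
chmod +x plan_download.sh
out2=$(./plan_download.sh MN908947.3)
echo "$out2"
```

The "set -euo pipefail" line is an optional habit rather than a requirement; it makes a script stop at the first failing command instead of carrying on with missing files.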
Challenge 5: Downloading a BioProject dataset of raw sequencing data

Write a script named download_dataset.sh that downloads the raw sequencing data for the BioProject PRJNA1260529.

The download links are provided in a file named PRJNA1260529_links.txt.

Download all files into a directory named PRJNA1260529_raw.

Software Environments: Conda & Mamba

Different bioinformatics tools often require different versions of the same software. If you install everything together, they might clash. We use Mamba, a faster drop-in replacement for the Conda package manager, to create an isolated “bubble”, or environment, for each tool.

Example: Installing Kingfisher - a fast downloading tool for public sequence files and their metadata annotations.

# Create an environment for the kingfisher program
mamba create -n kingfisher -c conda-forge -c bioconda kingfisher

# Enter the bubble
mamba activate kingfisher

# Get the program's help message
kingfisher

# Leave the bubble
mamba deactivate

Self Exploration

Once you are comfortable with the basics, these tools will make your life much easier:

tmux & screen

Bioinformatics jobs often take hours or days to run. If your connection to the server drops, the job may be killed. tmux and screen let you run “sessions” that keep going even after you disconnect or log out, and that you can reattach to later.

Loops

Automate tasks across hundreds of samples.

for file in *.fastq; do
    echo "Processing $file..."
done
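
A loop becomes useful once the body does real work. As a sketch, the loop below counts the reads in each FASTQ file and writes a small summary table; the two sample files are invented for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical FASTQ files with 1 and 2 reads
printf '@r1\nACGT\n+\nIIII\n' > sampleA.fastq
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > sampleB.fastq

for file in *.fastq; do
    sample="${file%.fastq}"                 # file name without the extension
    n_reads=$(( $(wc -l < "$file") / 4 ))   # 4 lines per FASTQ record
    printf '%s\t%s\n' "$sample" "$n_reads"
done > read_counts.tsv                      # redirect the output of the whole loop

cat read_counts.tsv
```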

find

Find files based on name, size, or modification date.

find . -name "*.vcf" -size +10M
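
The sketch below runs a few find variants on a throwaway directory tree. The file names and the 10 kB threshold are chosen only to keep the example small.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p run1 run2
printf 'small\n' > run1/a.vcf
head -c 20000 /dev/zero > run2/b.vcf   # placeholder file of about 20 kB
touch run2/notes.txt

# All .vcf files, at any depth below the current directory
all_vcf=$(find . -type f -name "*.vcf" | sort)
echo "$all_vcf"

# Only .vcf files larger than 10 kB
big=$(find . -type f -name "*.vcf" -size +10k)
echo "$big"

# Run a command on every match: here, show the sizes
find . -type f -name "*.vcf" -exec ls -lh {} +
```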

GNU parallel

Run many commands at once using multiple CPU cores. A true game-changer for processing large cohorts.

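# Example: count the sequences in every FASTA file, running up to 4 jobs at a time
# The command is quoted as a whole so that '>' reaches grep as a pattern;
# unquoted, the shell started by parallel would treat > as a redirection and overwrite the file
ls *.fasta | parallel -j 4 "grep -c '>' {}"

# Do you remember how to find the number of CPU cores on your system?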
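
If GNU parallel is not installed, xargs with the -P option is a simpler stand-in for basic cases. This sketch uses xargs, not parallel, and two FASTA files invented for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf '>a\nACGT\n>b\nTTGA\n' > one.fasta
printf '>c\nGGCC\n' > two.fasta

# -n 1: one file per command; -P 4: up to 4 commands at a time
# -H makes grep print the file name, because the output order is not guaranteed
printf '%s\n' *.fasta | xargs -n 1 -P 4 grep -c -H '>' | sort > counts.txt
cat counts.txt
```

xargs runs grep directly rather than through a shell, so the '>' pattern needs no extra quoting here. File names containing spaces would need more care than this sketch takes.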

Knowledge Check

Ready to test your Linux skills? Take the Final Quiz (10 mins) to see how much you’ve learned!
