0.2 Commands & Environments
Why Linux?
You might wonder why we aren’t using Windows or macOS for this. There are three main reasons:
- Big Data: Genomic datasets are often too large for Excel or standard text editors. Big data requires big computers, and big computers usually run Linux.
- Automation: In genomics, we often need to run the same analysis on hundreds of samples. The CLI (Command Line Interface) allows us to automate this with scripts.
- Reproducibility: An analysis is a series of commands given to the Linux computer, and those commands are just text. This makes it easy to record exactly what we did (every command we ran), and therefore to share and reproduce our analyses.
The interface for us to give commands to the Linux computer is called the Terminal. For many, the “black screen” of the terminal can be intimidating, but it is the most powerful tool in a bioinformatician’s toolkit.
Intro to Linux
The Terminal
The Terminal (or console) is your window into the operating system. It provides a Shell, which is the program that interprets the commands you type and tells the computer what to do. On most modern systems, the default shell is bash or zsh.
- The Prompt: Usually looks like user@computer:~$. The $ indicates that the shell is waiting for your input.
- Case Sensitivity: Linux is case-sensitive. File.txt and file.txt are two different files.
The Command Line
A command usually follows this structure: command [options] [arguments]
Examples:
- echo "Hello World": echo is the command (print text). "Hello World" is the argument (the text to print).
- ls -l genomes/: ls is the command (list files). -l is an option (long format). genomes/ is the argument (the folder to look into).
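To make the structure concrete, the two commands described above can be tried as follows. This is only a small sketch: the genomes/ folder and the file inside it are created here as placeholders so the example runs anywhere.

```shell
# Create a placeholder folder so that ls has something to list
# (assumption: you do not already have important data in genomes/)
mkdir -p genomes
touch genomes/example.fasta

# command = echo, argument = "Hello World"
echo "Hello World"

# command = ls, option = -l (long format), argument = genomes/
ls -l genomes/
```

Try removing the -l option to see how the output changes.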
Getting Help
You don’t need to memorize every command. Here is how to find out what a command does:
- command --help: Most tools have a built-in help menu (e.g., ls --help).
- man command: Opens the “manual” page (e.g., man ls). Press q to exit.
- Search & AI: Google, Stack Overflow, and AI tools are your friends when you get stuck!
The File System
Think of the Linux file system as a giant tree.
- Root (/): The base of the entire system.
- Home Directory (~): Your personal space where you have full control.
- Path: The “address” of a file or folder.
  - Absolute Path: Starts from the root (e.g., /home/user/data/sample.fastq).
  - Relative Path: Starts from where you currently are (e.g., data/sample.fastq).
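The difference between absolute and relative paths is easiest to see by trying both. The sketch below builds a small practice tree in a temporary directory; the folder and file names are placeholders, not part of the course materials.

```shell
# Build a small practice tree in a temporary directory (placeholder names)
practice=$(mktemp -d)
mkdir -p "$practice/data"
touch "$practice/data/sample.fastq"

# Absolute path: starts from the root (/), works from anywhere
ls "$practice/data/sample.fastq"

# Relative path: interpreted from the current directory
cd "$practice"
pwd                      # print where we currently are
ls data/sample.fastq     # same file, addressed relative to the current directory

# ~ is shorthand for your home directory
echo ~
```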
Files in Linux
Anatomy of a File Name
In Windows, the file extension (.docx, .exe) tells the system what the file is. In Linux, extensions are mostly for humans. Linux programs generally identify a file by its content (for example, the “magic bytes” at the beginning), not by its name.
You can name a file anything:
- genome.fasta (Helpful)
- ingredients_list.banhmi (Valid)
- data_source.nuocmam (Valid)
However, we usually keep standard extensions so we don’t confuse ourselves or our colleagues!
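As a small illustration of “content, not name”, the sketch below compresses a file with gzip, gives it an unrelated extension, and then inspects its first two bytes. The file names are made up for the demonstration; the file utility (where installed) relies on the same idea.

```shell
# Work in a temporary directory
cd "$(mktemp -d)"

# Make a small text file and compress it with gzip
echo "ACGTACGT" > reads.txt
gzip reads.txt                      # produces reads.txt.gz

# Rename it to something with a misleading extension
mv reads.txt.gz reads.banhmi

# Show the first two bytes in hexadecimal: gzip data starts with 1f 8b
magic=$(od -An -tx1 -N2 reads.banhmi)
echo "first two bytes:$magic"

# gzip can still decompress the data, whatever the name says
gzip -dc < reads.banhmi
```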
Types of Files (in Genomics)
- Text Files: Plain text you can read. Examples: .fasta, .fastq, .csv, .sh.
- Compressed Files: To save space, we often compress files. Common extensions: .gz, .zip, .tar.gz. Tools such as zcat, zless, and zgrep can “peek” inside .gz files without unzipping them first.
- Binary Files: Not human-readable. These are programs or optimized data formats (e.g., compiled tools, .bam or .bcf files).
- Executable Files: Files that can be run as programs. Examples: scripts ending in .sh, .py, .R (they need execute permission, or an interpreter, to run).
Peeking into File Contents
We rarely open huge genomics files in editors. We use these instead:
- cat: Print the whole file (only for small files!).
- head: Show the first few lines.
- tail: Show the last few lines.
- less: The “gold standard” for viewing large files. Use arrows to scroll, / to search, and q to quit.
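A quick way to get a feel for these viewers is to try them on a file whose content you already know. The sketch below generates a 100-line practice file (numbers.txt is a made-up name) and also shows one way to look at the top of a gzip-compressed file.

```shell
cd "$(mktemp -d)"

# Create a 100-line practice file
seq 1 100 > numbers.txt

head -n 3 numbers.txt     # first 3 lines
tail -n 2 numbers.txt     # last 2 lines
wc -l numbers.txt         # how many lines in total

# For compressed files: decompress to the screen and keep only the top
gzip -c numbers.txt > numbers.txt.gz
gzip -dc numbers.txt.gz | head -n 3

# less is interactive, so run it yourself in a terminal:
#   less numbers.txt
```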
Creating & Managing Files
- mkdir: Create a new directory (which is just a special type of file!). Example: mkdir raw_reads
- cp: Copy files or directories. Example: cp source.txt destination.txt
- mv: Move or rename files. Example: mv old_name.txt new_name.txt
- rm: Remove (delete) files. Be careful! There is no “Trash”. Use rm -r folder_name to recursively delete a directory.
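The commands above can be combined into a short practice session. This sketch runs in a temporary directory so nothing important can be deleted by accident; all file names are placeholders.

```shell
cd "$(mktemp -d)"

mkdir raw_reads                         # create a directory
echo "example" > source.txt             # create a small file to work with

cp source.txt destination.txt           # copy: both files now exist
mv destination.txt raw_reads/copy.txt   # move (and rename) into raw_reads/
cp -r raw_reads raw_reads_backup        # -r copies a whole directory

ls -R                                   # list everything, recursively

rm source.txt                           # delete a file (no undo!)
rm -r raw_reads_backup                  # delete a directory and its contents
```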
Exercise:
- Navigate to the day-0/materials/ directory.
- List all files and find out which one is the largest.
- Create a new directory called results.
- Copy the file README.txt to the results directory.
- Rename README.txt to README_backup.txt.
- Remove README_backup.txt.
Text Analysis & Manipulation
grep: Search for a pattern
wc: Count lines, words, and characters
cut: Cut out sections from each line of files
paste: Merge lines of files
tr: Translate or delete characters
sed: Stream editor for text manipulation
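Each of these tools is easier to understand on a tiny file. The sketch below uses a made-up three-record table (demo.csv, with invented values) to show one typical call per tool.

```shell
cd "$(mktemp -d)"

# A tiny made-up table: header line plus three records
printf 'id,species,length\nseqA,ebola,18959\nseqB,sars-cov-2,29903\nseqC,ebola,18961\n' > demo.csv

grep "ebola" demo.csv                 # lines containing "ebola"
wc -l demo.csv                        # number of lines (header included)
cut -d ',' -f 1 demo.csv              # first column only

# Save two columns separately, then glue them back together
cut -d ',' -f 1 demo.csv > ids.txt
cut -d ',' -f 3 demo.csv > lengths.txt
paste ids.txt lengths.txt             # side by side, tab-separated

tr ',' '\t' < demo.csv                # translate every comma into a tab
sed 's/seq/sample_/' demo.csv         # replace the first "seq" on each line
```

None of these commands changes demo.csv itself; they only print a modified copy to the screen.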
Exercise:
- Print the first 20 entries of metadata.csv.
- Find all entries in metadata.csv that belong to the year “2024”.
- Get all sample names.
- Remove _1 and _2 from the sample names.
- Change filenames from .fq.gz to .fastq.gz.
Building Complexity from Simple Operations
Pipes (|)
The pipe takes the output of one command and sends it as input to another. This allows you to build complex workflows by chaining simple commands.
Example: Count how many sequences are in a FASTA file. FASTA headers start with >. We can find all lines starting with > and count them:
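A minimal sketch of that pipeline, run on a toy FASTA file created on the spot (toy.fasta and its three records are made up for illustration):

```shell
cd "$(mktemp -d)"

# Toy FASTA file with three records
printf '>seq1\nACGT\n>seq2\nGGCC\nTTAA\n>seq3\nAT\n' > toy.fasta

# grep keeps only the header lines; wc -l counts them
grep "^>" toy.fasta | wc -l
```

For this toy file the count is 3. The quotes around ^> matter: without them, the shell would treat > as a redirection.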
Standard Input, Output, and Error
- Standard In (stdin): Where the command reads its input from (the keyboard by default, or the output of the previous command in a pipe).
- Standard Out (stdout): Where the command sends its regular results (the screen by default).
- Standard Error (stderr): Where the command sends error messages (also the screen by default).
- Redirection:
  - command > file: Save output (stdout) to a file (overwrites).
  - command >> file: Append output (stdout) to a file (adds to the end).
  - command 2> file: Redirect error messages (stderr) to a file.
  - command > file 2>&1: Redirect both stdout and stderr to a single file.
Example: Save the list of files to a text file.
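One possible version of this example, extended to show the other redirection forms as well. The file names are placeholders, and no_such_folder is deliberately missing so that ls produces an error message.

```shell
cd "$(mktemp -d)"
touch a.txt b.txt

# stdout goes to file_list.txt (the file is created first, so it lists itself too)
ls > file_list.txt

# >> appends instead of overwriting
echo "listing done" >> file_list.txt

# The error message goes to errors.txt instead of the screen
# (|| true only keeps a script going after the expected failure)
ls no_such_folder 2> errors.txt || true

# Both the normal output and the error message go to all.txt
ls a.txt no_such_folder > all.txt 2>&1 || true

cat file_list.txt
```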
Variables
Variables store information for later use. In scripts, we often store filenames in variables.
Example:
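A short sketch of how variables are typically set and used in bash. The sample name sample_01 and the file-name pattern are assumptions chosen for illustration.

```shell
# Store a value in a variable (no spaces around =)
sample="sample_01"

# Use the value with $name or ${name}
echo "Processing $sample"
reads_1="${sample}_1.fastq.gz"
reads_2="${sample}_2.fastq.gz"
echo "Forward reads: $reads_1"
echo "Reverse reads: $reads_2"

# Store the output of a command with $( ... )
n_files=$(ls | wc -l)
echo "Files in this directory: $n_files"
```

The braces in ${sample}_1 are needed here: $sample_1 would be read as a different (empty) variable named sample_1.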
Exercise:
- Extract sample names in year “2024”.
- Save results to a file called sample_2024.txt.
Exercise:
- We have a file called ebola_virus.fasta.
- Find the sequence of the record with accession number NC_045512.2.
- Find the length of that sequence.
- Extract the above sequence to a file called NC_045512.2.fasta.
Bash Script - A Flow of Commands
An example script below will download the SARS-CoV-2 reference genome from NCBI.
download_covid_genome.sh
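The following is a sketch of what such a script could look like, not necessarily the course's own version. It assumes curl is installed and uses the NCBI E-utilities efetch service to request the record NC_045512.2 as FASTA; the variable names are illustrative. To keep the snippet self-contained, the script is written to a file with a “here document” and only syntax-checked, since the download itself needs internet access.

```shell
# Write the script to a file: everything between the two EOF markers is the script
cat > download_covid_genome.sh <<'EOF'
#!/usr/bin/env bash
# Sketch: download the SARS-CoV-2 reference genome (NC_045512.2) from NCBI.
set -euo pipefail   # stop at the first error or undefined variable

accession="NC_045512.2"
output="${accession}.fasta"
# Assumption: NCBI E-utilities efetch endpoint, asking for FASTA text
url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${accession}&rettype=fasta&retmode=text"

echo "Downloading ${accession} ..."
curl -sSfL -o "$output" "$url"
echo "Saved to ${output}"
EOF

chmod +x download_covid_genome.sh   # make the script executable
bash -n download_covid_genome.sh    # check the syntax without running it

# To actually download (needs internet access):
#   ./download_covid_genome.sh
```

Keeping the accession in a variable means the same script can be reused for another record by changing a single line.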
Exercise: Write a script named download_dataset.sh that downloads the raw sequencing data for the BioProject PRJNA1260529.
- Downloadable links are provided in a file named PRJNA1260529_links.txt.
- Download all files into a directory named PRJNA1260529_raw.
Software Environments: Conda & Mamba
Different bioinformatics tools often require different versions of the same software. If you install everything together, they might clash. We use Mamba (a faster drop-in replacement for Conda) to create isolated “bubbles”, or environments, for each tool.
Example: Installing Kingfisher - a fast downloading tool for public sequence files and their metadata annotations.
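A possible way to do this, sketched under two assumptions: that mamba is already installed, and that the kingfisher package is taken from the bioconda channel. The environment name is arbitrary.

```shell
env_name="kingfisher"   # name of the new environment (any name works)

if command -v mamba >/dev/null 2>&1; then
    # Create an isolated environment and install kingfisher into it
    # (assumption: package provided by the bioconda channel)
    mamba create -y -n "$env_name" -c conda-forge -c bioconda kingfisher \
        || echo "environment creation failed (check your internet connection)"
else
    echo "mamba not found: install it first (for example via Miniforge)"
fi

# Afterwards, in an interactive terminal:
#   mamba activate kingfisher     # enter the environment
#   kingfisher --help             # check that the tool is available
#   mamba deactivate              # leave the environment
```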
Self Exploration
Once you are comfortable with the basics, these tools will make your life much easier:
tmux & screen
Bioinformatics jobs often take hours or days to run. If your connection to the server drops, the job might die. tmux (or screen) allows you to run “sessions” that keep going even if you log out.
Loops
Automate tasks across hundreds of samples.
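As a first taste, here is a minimal for loop that runs the same command on every file matching a pattern. The three sample files are made-up stand-ins for real sequencing data.

```shell
cd "$(mktemp -d)"

# Made-up sample files standing in for real data
printf 'ACGT\nGGCC\n' > sampleA.txt
printf 'TTAA\n' > sampleB.txt
printf 'AT\nCG\nGC\n' > sampleC.txt

# Run the same command on every file matching the pattern
for f in sample*.txt; do
    n=$(wc -l < "$f")
    echo "$f has $n lines"
done > line_counts.txt

cat line_counts.txt
```

With hundreds of samples, only the pattern after "in" needs to change; the body of the loop stays the same.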
find
Find files based on name, size, or modification date.
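A small sketch of the three kinds of search mentioned above, run on a practice tree built in a temporary directory (all names are placeholders):

```shell
# Build a practice tree
practice=$(mktemp -d)
mkdir -p "$practice/run1" "$practice/run2"
touch "$practice/run1/a.fastq.gz" "$practice/run2/b.fastq.gz" "$practice/notes.txt"
seq 1 5000 > "$practice/run2/big.txt"     # a file of a few tens of kilobytes

find "$practice" -name "*.fastq.gz"       # by name (quote the pattern!)
find "$practice" -type f -size +10k       # regular files larger than 10 KiB
find "$practice" -type f -mtime -1        # regular files modified within the last day
```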
GNU parallel
Run many commands at once using multiple CPU cores. A true game-changer for processing large cohorts.
Knowledge Check
Ready to test your Linux skills? Take the Final Quiz (10 mins) to see how much you’ve learned!