Published

30 March, 2026

Introduction to Command Line

Welcome to the UKRI Digital Research Skills Catalyst introductory guide to the command line!

This short 90-minute tutorial covers the basics you need to start using the command line, including:

What is a file system?
What is the shell?
How can I use the shell commands to navigate the file system?
How can I use the shell commands to use software and manipulate data?

Before we begin there is a brief setup required, as detailed below:

Getting started

The first thing you need to do is set up the working files on your computer.

1. Download and unzip the course folder

📂Click here to download the zipped course folder.

Download the provided course folder to your computer. This folder contains all the files and directories you will need for the exercises.

Once downloaded:

Locate the file (usually in your Downloads folder)
Right-click and select Extract All (Windows) or double-click to unzip (Mac)
Choose a suitable location (we recommend your Desktop for easy access)

After extracting, you should see a folder containing several subfolders and files.

2. Open your command line interface

Next, open the command line (also known as the terminal or shell) and navigate to your extracted folder.

Windows users:

Open the folder in File Explorer
Right-click inside the folder
Select Git Bash Here (or open Command Prompt/PowerShell and navigate using cd)

Mac users:

Open Terminal
Type cd (with a space after it)
Drag and drop the extracted folder into the Terminal window
Press Enter

You should now be working inside your course directory. Let’s begin!

Part One: What is a file system?

A file system is exactly what it sounds like - a way for your computer to store your data in an organised way using files.

You will definitely have come across files before. All the data stored on your computer is split into separate files, making it much easier to keep track of than if it was all in one big blob.

There are lots of different file types; we can often find out something about what kind of data a file contains by looking at its filename extension. For example:

.txt tells us that the file contains text
.exe tells us that the file contains a program to be run
.html tells us that the file contains a webpage, and should be opened in a web browser

We will come across various different file types during this course, some of which you may not have seen before. Do not worry. We will introduce them to you and explain how to use them when necessary.

In a file system, files are organised into directories, which can also be called folders. Hopefully you will have used folders to organise your files before! Folders can contain sub-folders, which can contain their own sub-folders, and so on almost without limit.

It is easiest to picture a file system, or part of it, as a tree that starts at a directory and branches out from there. This is called a hierarchical structure. The figure below shows an example of a hierarchical file structure that starts at the “home directory” of the user named user1:

A file hierarchy containing 4 levels of folders and files.

The directory you are working inside is called your working directory. For example, if you were editing doc2.txt in the diagram above, your working directory would be the folder called docs.

Exercise

Think about your own computer and how your files and directories are organised. Sketch a tree diagram like the one above for your file system.

Hint: remind yourself of your file system’s layout using a file manager application such as:

File Explorer (Windows)
Finder (Mac)

You probably already use these applications regularly to find, open and organise files.

File paths

It is not practical to draw out a tree diagram every time we want to refer to a file’s location. Instead, we can represent the information as a file path.

In a file path, each directory is represented as a separate component separated by a character such as \ or /. It is like writing an address or set of instructions for someone to follow if they want to find a specific file.

For example, the path for the file called doc3.txt in the file system above looks like this: user1/docs/data/doc3.txt in a Unix or Linux computer.

It is useful to note that Windows uses backslashes (\) to separate path components, while Unix, Linux and Mac use forward slashes (/).

Root and home directories

The root is the top-level directory, which contains all other directories further down the tree.

The root is represented as a / in Unix, Linux, and Mac operating systems.

In the Windows operating system, the root directory is also known as a drive. In most cases, this will be the C:\ drive.

Even though the root directory is at the base of the file tree (or the top, depending on how you view it), it is not necessarily where our journey through the file system starts when we launch a new session on our computer. Instead, our journey begins in the so-called “home directory”.

In Windows, Mac, Unix, and Linux, the “home directory” is a folder named with your username. Your personal files and directories can be found inside this folder. This is where your computer assumes you want to start when you open your file manager.

On Windows and Mac, your home directory sits inside a directory called Users and is named with your username. On Unix/Linux systems, the equivalent parent directory is called home.

From the root, the file system is:

A file hierarchy with root and home directories labelled.

Your home directory is:

C:\Users\user1\ on Windows
/Users/user1/ on Mac
/home/user1/ on Linux

In Linux (the operating system we will use later in the course), a tilde symbol (~) is used as a shortcut for your home directory. So, for example, the path ~/docs/doc2.txt is equivalent to /home/user1/docs/doc2.txt.

Absolute vs relative paths

There are two ways of writing a file path - absolute paths and relative paths.

An absolute path contains the complete list of directories needed to locate a file on your computer. This allows you to reach the file no matter where you are.

A relative path describes the location of a file relative to your current working directory. For example, if you were already in the folder called docs, the relative path for doc3.txt would be data/doc3.txt. There is no need to give instructions to navigate a route you have already taken.

If, however, you were in the folder called docs and you wanted to open one of the .exe files, you would need to give the path to .exe relative to the docs folder or give the absolute path.
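To make the difference concrete, here is a minimal sketch that rebuilds a small stand-in for the example tree inside a throwaway directory (created with mktemp, so nothing else on your computer is touched) and then reaches doc3.txt by both kinds of path. The file contents and the variable names are assumptions made purely for this demonstration.

```shell
# Build a throwaway copy of part of the example tree (layout assumed for illustration).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/user1/docs/data"
echo "hello from doc3" > "$sandbox/user1/docs/data/doc3.txt"

# Make docs our working directory, as in the example above.
cd "$sandbox/user1/docs"

# Relative path: starts from the working directory (docs).
relative_result=$(cat data/doc3.txt)

# Absolute path: starts from the root, so it works from any working directory.
absolute_result=$(cat "$sandbox/user1/docs/data/doc3.txt")

echo "relative: $relative_result"
echo "absolute: $absolute_result"
```

Both commands read the same file. Only the relative one would stop working if we changed our working directory first.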

Challenge

Use the file system above to answer these questions.

What is the absolute path for the document doc4.txt on a Linux computer?
Assuming you are currently in the directory called docs, what is the relative path for the document doc2.txt?

Solution

The absolute path is /home/user1/docs/data/doc4.txt or ~/docs/data/doc4.txt.
The relative path is doc2.txt (as you are already in the directory where doc2.txt is stored).

Part Two: What is a shell and why should I care?

A shell is a computer program that has a command line where you type commands to do things on your computer rather than using menus and buttons on a Graphical User Interface (GUI).

There are many reasons to learn how to use the shell/command line:

Software access - many bioinformatics tools can only be used through a command line interface, or have extra capabilities in the command line version that are not available in the GUI (this is true of most of the software used in this course).
Cloud access - bioinformatics tasks which require large amounts of computing power (like the ones we’ll do later in this course!) are best performed on remote computers or cloud computing platforms, which are accessed via a shell.
Automation - repetitive tasks (e.g. doing the same set of tasks on a large number of files) can be easily automated in the shell, saving you time and preventing human error.
Reproducibility - when using the shell your computer keeps a record of every step that you’ve carried out, which you can use to re-do your work when you need to.

In this lesson you will learn how to use the command line interface to move around in your file system.

How to access the shell

To recap how to access your shell:

Windows users:
- Right click anywhere inside the blank space of the file manager, then select Git Bash Here. A new window will open - this is your command line interface, also known as the shell or the terminal. It will automatically open with your ybcmd directory as the working directory.
Mac users, you have two options:
- EITHER: Open Terminal in one window and type cd followed by a space. Do not press enter! Now open Finder in another window. Drag and drop the ybcmd folder from the Finder to the Terminal. You should see the file path leading to your ybcmd folder appear. Now press enter to navigate to the folder.
- OR: Open Terminal and type cd followed by the absolute path that leads to your ybcmd folder. Press enter.

Navigating your file system

Several commands are frequently used to create, inspect, rename, and delete files and directories.

Let’s find out where we are by running a command called pwd (which stands for “print working directory”). At any moment, our current working directory is our current default directory, i.e., the directory that the computer assumes we want to run commands in, unless we explicitly specify something else. Here, the computer’s response is /c/Users/jpm513/Desktop/ybcmd (your username will differ):

Code

pwd

Output

/c/Users/jpm513/Desktop/ybcmd

This ybcmd directory is our working directory within our folder system.

Let’s look at how our file system is organised. We can see what files and subdirectories are in this directory by running ls, which stands for “list”:

Code

ls

Output

bin/  cs_course/  software/

ls prints the names of the files and directories in the current directory in alphabetical order, arranged neatly into columns. We’ll be working within the cs_course subdirectory, and creating new subdirectories, throughout this workshop.

The command to change locations in our file system is cd, followed by a directory name to change our working directory. cd stands for “change directory”.

Let’s say we want to navigate to the cs_course directory we saw above. We can use the following command to get there:

Code

cd cs_course

Let’s look at what is in this directory:

Code

ls

Output

data/   databases/

We can make the ls output easier to understand by using the flag -F, which tells ls to add a trailing / to the names of directories:

Code

ls -F

Output

data/   databases/

Anything with a / after it is a directory. Things with a * after them are programs. If there are no decorations, it’s a file.
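If you would like to see all three decorations side by side, the following sketch creates one entry of each kind in a throwaway directory and lists it with -F. The names results, notes.txt and run_me.sh are made up for the example, and chmod +x (which marks a file as executable) is used here only to produce a “program”.

```shell
# Throwaway directory with one of each kind of entry (names are made up).
sandbox=$(mktemp -d)
mkdir "$sandbox/results"          # a directory
touch "$sandbox/notes.txt"        # a plain file
touch "$sandbox/run_me.sh"
chmod +x "$sandbox/run_me.sh"     # mark this file as an executable program

# -F appends / to directories and * to executables; plain files stay unmarked.
listing=$(ls -F "$sandbox")
echo "$listing"
```

In the listing, results should carry a trailing /, run_me.sh a trailing *, and notes.txt no decoration at all.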

ls has lots of other options. To find out what they are, we can type:

Code

man ls

man (short for manual) displays detailed documentation (also referred to as a man page or man file) for bash commands. It is a powerful resource for exploring bash commands and understanding their usage and flags. Some manual files are very long. You can scroll through the file using your keyboard’s down arrow or use the Space key to go forward one page and the b key to go backwards one page. When you are done reading, hit q to quit.

Challenge

Use the -l option for the ls command to display more information for each item in the directory. What is one piece of additional information this long format gives you that you don’t see with the bare ls command?

Solution

Code

ls -l

Output

total 0
drwxr-xr-x 1 ITSYORK+jpm513 4096 0 Mar 30 09:30 data/
drwxr-xr-x 1 ITSYORK+jpm513 4096 0 Mar 30 09:30 databases/

The additional information given includes the name of the owner of the file, when the file was last modified, and whether the current user has permission to read and write to the file.

No one can possibly learn all of these arguments - that’s what the manual page is for! It does take practice to get used to using and understanding the information in the manual. Often, you don’t need to understand completely what it is saying to be able to guess what to try.

Let’s go into the data directory and see what is in there using the cd and ls commands.

Code

cd data 
ls -F

Output

illumina_fastq/   nano_fastq/

This directory contains two subdirectories. We can tell they are directories and not files because of the trailing ‘/’. They contain all of the raw data we will need for the rest of the course.

For now, let’s have a look in illumina_fastq. We can do this without changing directories using the ls command followed by the name of the directory.

Code

ls illumina_fastq

Output

ERR4998593_1.fastq  ERR4998593_2.fastq

This directory contains two files with .fastq extensions. FASTQ is a format for storing information about sequencing reads and their quality. We will be learning more about FASTQ files in a later lesson.

Let’s also have a look in the nano_fastq directory.

Code

ls nano_fastq

Output

ERR5000342_sub12.fastq

Learning to navigate a new file directory can be confusing at first. To help, here is a tree diagram showing what we have explored so far.

First we moved from our starting directory, ybcmd, into the cs_course directory, which is one level down. From there we opened up the data directory, which contains two subdirectories: illumina_fastq and nano_fastq. We had a peek inside both of these directories and found that illumina_fastq contained two files, while nano_fastq contained one.

Shortcut: Tab Completion

It is very easy to make mistakes typing our filenames and commands. Thankfully, “tab completion” can help us! When you start typing out the name of a directory or file, then hit the tab key, the shell will try to fill in the rest of the directory or file name.

First of all, typing cd after the prompt and pressing enter will always take you back to your home directory. Let’s do this:

Code

cd

then type:

Code

cd Desktop/ybcmd/

then type:

Code

cd cs

and press tab.

The shell will fill in the rest of the directory name for cs_course. Press enter to execute the command and move directories.

Now change directories again to data.

Code

cd data

And again into illumina_fastq.

Code

cd illumina_fastq

Using tab complete can be very helpful. However, it will only autocomplete a file or directory name if you’ve typed enough characters to provide a unique identifier for the file or directory you are trying to access.

For example, if we now try to list the files in illumina_fastq with names starting with ERR by using tab complete:

Code

ls ERR<tab>

The shell auto-completes your command to ERR4998593_, because all file names in the directory begin with this prefix. When you hit tab again, the shell will list the possible choices.

Output

ERR4998593_1.fastq  ERR4998593_2.fastq

Tab completion can also fill in the names of programs, which can be useful if you remember the beginning of a program name.

Code

pw<tab><tab>

Output

pwck      pwconv    pwd       pwdx      pwunconv

Displays the name of every program that starts with pw.

Tip

You might find it useful to keep a note of the commands you learn in this course, so you can easily remember them in future. This will be faster than scrolling through the course each time you forget a command. While using your Cloud-SPAN AWS instance you can also type csguide into the command prompt and hit enter for a text-based guide to the command line, including frequently used commands.

Moving around the file system

Now we’re going to learn some additional commands for moving around within our file system.

Use the commands we’ve learned so far to navigate to the illumina_fastq directory from our home:

Code

cd 
cd Desktop
cd ybcmd
cd cs_course 
cd data 
cd illumina_fastq

What if we want to move back up and out of this directory and to our top level directory? Can we type cd data? Try it and see what happens.

Code

cd data

Output

-bash: cd: data: No such file or directory

Your computer looked for a directory or file called data within the directory you were already in. It didn’t know you wanted to look at a directory level above the one you were located in.

We have a special command to tell the computer to move us back or up one directory level.

Code

cd ..

Now we can use pwd to make sure that we are in the directory we intended to navigate to, and ls to check that the contents of the directory are correct.

Code

pwd

Output

/c/Users/jpm513/Desktop/ybcmd/cs_course/data

Code

ls

Output

illumina_fastq/  nano_fastq/

From this output, we can see that .. did indeed take us back one level in our file system, to data.

You can chain these together like so:

Code

ls ../../

Output

bin  cs_course  software

This prints the contents of the folder called ybcmd (our course folder), which is two levels above data.
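As a self-contained way to experiment with .., the sketch below recreates the course layout with empty stand-in directories inside a throwaway location and then walks back up it. The sandbox and the variable names are assumptions for illustration. basename is used only to print the last component of the path.

```shell
# Recreate the course layout in a throwaway directory (empty stand-ins only).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/ybcmd/cs_course/data/illumina_fastq"
mkdir -p "$sandbox/ybcmd/bin" "$sandbox/ybcmd/software"

cd "$sandbox/ybcmd/cs_course/data/illumina_fastq"

cd ..                                  # move up one level
after_one_up=$(basename "$(pwd)")
echo "$after_one_up"

two_levels_up=$(ls ../..)              # look two levels up without moving
echo "$two_levels_up"

cd ../..                               # now actually move up two levels
after_three_up=$(basename "$(pwd)")
echo "$after_three_up"
```

Notice that ls ../.. only looks at the other directory, whereas cd ../.. actually changes where we are.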

Finding hidden directories

There is a hidden directory inside the cs_course directory. Explore the options for ls in the man page to find out how to see hidden directories. List the contents of the directory and identify the name of the text file in the hidden directory.

Hint 1: hidden files and folders in Unix start with ., for example .my_hidden_directory

Hint 2: cs_course is one level above data, our current working directory.

Solution

First use the man command to look at the options for ls.

Code

man ls

The -a option is short for all and says that it causes ls to “not ignore entries starting with .” This is the option we want.

We can use .. to view the contents of the directory above our current working directory.

Code

ls -a ..

Output

./  ../  .hidden/  data/  databases/

The name of the hidden directory is .hidden. We can navigate to that directory using cd.

Code

cd .. 
cd .hidden

And then list the contents of the directory using ls.

Code

ls

Output

youfoundit.txt

The name of the text file is youfoundit.txt.

In most commands the flags can be combined together in no particular order to obtain the desired results/output.

Code

ls -Fa 
ls -laF

Examining the contents of other directories

In the above section we learned how to use pwd to find our current location within our file system. We also learned how to use cd to change locations and ls to list the contents of a directory.

By default, the ls command lists the contents of the working directory (i.e. the directory you are in). You can always find the directory you are in using the pwd command. However, you can also give ls the names of other directories to view. Navigate to your ybcmd directory if you are not already there.

Code

cd
cd Desktop/ybcmd/

Then enter the command:

Code

ls cs_course

Output

data/   databases/

This will list the contents of the cs_course directory without you needing to navigate there.

The cd command works in a similar way.

Try entering:

Code

cd 
cd ~/Desktop/ybcmd/cs_course/data/

This will take you to the data directory without having to go through the intermediate cs_course directory.
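The sketch below illustrates both points on a throwaway copy of the layout: giving ls a directory name leaves the working directory unchanged, while giving cd a longer path moves through several levels in one step. The sandbox and variable names are assumed for the demonstration.

```shell
# Empty stand-in for the cs_course layout inside a throwaway directory.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/cs_course/data" "$sandbox/cs_course/databases"
cd "$sandbox"

before=$(pwd)
contents=$(ls cs_course)      # look inside cs_course from the outside
after=$(pwd)                  # we have not moved

echo "$contents"
echo "still in: $after"

cd cs_course/data             # jump two levels down in a single command
landed_in=$(basename "$(pwd)")
echo "now in: $landed_in"
```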

Navigating practice

Navigate to your home directory. From there, list the contents of the illumina_fastq directory.

Solution

Code

cd 
ls ~/Desktop/ybcmd/cs_course/data/illumina_fastq

Output

ERR4998593_1.fastq  ERR4998593_2.fastq

Full vs. Relative Paths

The cd command takes an argument which is a directory name. Directories can be specified using either a relative path or a full absolute path. The directories on the computer are arranged into a hierarchy. The full path tells you where a directory is in that hierarchy. Navigate to the home directory, then enter the pwd command.

Code

cd   
pwd

You will see:

Output

/c/Users/jpm513

This is the full name of your home directory. This tells you that you are in a directory called the same as your username, which sits inside a directory called Users which sits inside the very top directory in the hierarchy. The very top of the hierarchy is a directory called C which is usually referred to as the root directory.

Now enter the following command:

Code

cd ~/Desktop/ybcmd/cs_course/.hidden

This jumps forward multiple levels to the .hidden directory. Now go back to the home directory.

Code

cd

You can also navigate to the .hidden directory using:

Code

cd Desktop/ybcmd/cs_course/.hidden

These two commands have the same effect - they both take us to the .hidden directory. The first uses an absolute path: the shell expands ~ to the full address of your home directory, which starts from the root directory /. The second uses a relative path, giving only the address from the working directory. An absolute (full) path always starts with a / (or with ~, which expands to one). A relative path does not.

You can usually use either a full path or a relative path depending on what is most convenient. If you want to reach a directory further down the same branch as your current working directory, it’s easiest to use the relative path since it involves less typing. If you’re trying to get to a directory in a different branch, it might be more convenient to use the full path instead of navigating “backwards” and then forwards.

Over time, it will become easier for you to keep a mental note of the structure of the directories that you are using and how to quickly navigate amongst them. We will be using the same directory structure for this whole course so navigating it should get easier as you progress.

Relative path resolution

Using the file system diagram below, if pwd displays /home/csuser/cs_course/data/illumina_fastq, what will ls ../nano_fastq display?

Can you explain why?

Figure: Instance file tree.

Solution

Output

ERR5000342_sub12.fastq

The .. in ../nano_fastq tells ls to go up one folder level (from illumina_fastq to data) and then look inside nano_fastq. Our working directory does not change.
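You can reproduce this answer without the course files. The following sketch builds just the two sibling directories in a throwaway location, with an empty placeholder standing in for the real FASTQ file, and runs the same ls command. The sandbox and the variable names are assumptions.

```shell
# Two sibling directories under data, with an empty placeholder file.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/data/illumina_fastq" "$sandbox/data/nano_fastq"
touch "$sandbox/data/nano_fastq/ERR5000342_sub12.fastq"

cd "$sandbox/data/illumina_fastq"

sibling_listing=$(ls ../nano_fastq)   # up to data, then down into nano_fastq
still_here=$(basename "$(pwd)")       # ls did not move us

echo "$sibling_listing"
echo "working directory is still: $still_here"
```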

Navigational Shortcuts

The root directory is the highest level directory in your file system and contains files that are important for your computer to perform its daily work. While you will be using the root (C) at the beginning of your absolute paths, it is important that you avoid working with data in these higher-level directories, as your commands can permanently alter files that the operating system needs to function. In many cases, trying to run commands in root directories will require special permissions which are not discussed here, so it’s best to avoid them and work within your home directory.

Dealing with your home directory is very common. The tilde character, ~, is a shortcut for your home directory. In our case, the root directory is two levels above our home directory, so cd or cd ~ will take you to /c/Users/<username> and cd /c/ will take you to /c.

Navigate to the illumina_fastq directory:

Code

cd 
cd ~/Desktop/ybcmd/cs_course/data/illumina_fastq

Then enter the command:

Code

ls ~

Output

AppData/
'Application Data'@
 Contacts/
 Cookies@
 Desktop/
 Documents/
 Downloads/
 Favorites/
 IntelGraphicsProfiles/
 Links/
'Local Settings'@
 Music/
'My Documents'@
 NTUSER.DAT
 NTUSER.DAT{a2332f17-cdbf-11ec-8680-002248483d79}.TxR.0.regtrans-ms
 NTUSER.DAT{a2332f17-cdbf-11ec-8680-002248483d79}.TxR.1.regtrans-ms
 NTUSER.DAT{a2332f17-cdbf-11ec-8680-002248483d79}.TxR.2.regtrans-ms
 NTUSER.DAT{a2332f17-cdbf-11ec-8680-002248483d79}.TxR.blf
 NTUSER.DAT{a2332f18-cdbf-11ec-8680-002248483d79}.TM.blf
 NTUSER.DAT{a2332f18-cdbf-11ec-8680-002248483d79}.TMContainer00000000000000000001.regtrans-ms
 NTUSER.DAT{a2332f18-cdbf-11ec-8680-002248483d79}.TMContainer00000000000000000002.regtrans-ms
 NetHood@
 OneDrive/
 PSAppDeployToolkit/
 Pictures/
 PrintHood@
 Recent@
'Saved Games'/
 Searches/
 SendTo@
'Start Menu'@
 Templates@
'Ubuntu-22.04 (wsl.localhost) - Shortcut.lnk'*
 Videos/
 _tmp_cloudspan_aws/
 ntuser.dat.LOG1
 ntuser.dat.LOG2
 ntuser.ini

This prints the contents of your home directory, without you needing to type the full path.

The commands cd and cd ~ are very useful for quickly navigating back to your home directory. We will be using the ~ character in later lessons to specify our home directory.
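As a small check that the two shortcuts really are equivalent, this sketch records where each one takes us. It assumes a normally configured shell in which your home directory is set. The variable names are made up for the example.

```shell
start_dir=$(pwd)        # remember where we started

cd                      # no argument: go to the home directory
via_plain_cd=$(pwd)

cd "$start_dir"         # return to the starting point
cd ~                    # the shell expands ~ to the home directory
via_tilde=$(pwd)

echo "cd   took us to: $via_plain_cd"
echo "cd ~ took us to: $via_tilde"
```

Both lines should show the same directory.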

Part Three: Working with Files

Our data set: FASTQ files

Now that we know how to navigate around our directory structure, let’s start working with our sequencing files. We are looking at the results from a short-read sequencing experiment, which are stored in our illumina_fastq directory.

Wildcards

Navigate to your illumina_fastq directory:

Code

cd ~/Desktop/ybcmd/cs_course/data/illumina_fastq

We are interested in looking at the fastq files in this directory. We can list all files with the .fastq extension using the command:

Code

ls *.fastq

Output

ERR4998593_1.fastq  ERR4998593_2.fastq

The * character is a special type of character called a wildcard, which can be used to represent any number of any type of character. Thus, *.fastq matches every file that ends with .fastq.

This command:

Code

ls *_2.fastq

Output

ERR4998593_2.fastq

lists only the file that ends with _2.fastq.

This command:

Code

ls ~/Desktop/ybcmd/bin/*.sh

Output

/c/Users/jpm513/Desktop/ybcmd/bin/colours_functions.sh*
/c/Users/jpm513/Desktop/ybcmd/bin/csguide.sh*
/c/Users/jpm513/Desktop/ybcmd/bin/csguideeditnprint.sh*
/c/Users/jpm513/Desktop/ybcmd/bin/textcolorsandstyles.sh*

lists every file in ~/Desktop/ybcmd/bin that ends in the characters .sh. Note that the output displays full paths to files, since each result starts with /.

Exercise

What command would you use for each of the following tasks? Start from your current directory using a single ls command for each:

List all of the files in ~/Desktop/ybcmd/bin that start with the letter ‘c’.
List all of the files in ~/Desktop/ybcmd/bin that contain the letter ‘a’.
List all of the files in ~/Desktop/ybcmd/bin that end with the letter ‘o’.
List all of the files in ~/Desktop/ybcmd/bin that contain the letter ‘a’ or the letter ‘c’.

Bonus: What would the output look like if a wildcard could not be matched? Try listing all files that start with ‘missing’.

Hint: Question 4 requires a Unix wildcard that we haven’t talked about yet. Try searching the internet for information about Unix wildcards to find what you need to solve it.

Solution

ls ~/Desktop/ybcmd/bin/c*
ls ~/Desktop/ybcmd/bin/*a*
ls ~/Desktop/ybcmd/bin/*o
ls ~/Desktop/ybcmd/bin/*[ac]*

Bonus: ls: cannot access 'missing*': No such file or directory
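To experiment with wildcards safely, you could try something like the sketch below, which works on empty placeholder files in a throwaway directory. It shows *, the [ ] set used in the solution, and one further standard wildcard, ?, which matches exactly one character. The file notes.txt and all variable names are assumptions made for the demonstration.

```shell
# Empty placeholder files in a throwaway directory.
sandbox=$(mktemp -d)
cd "$sandbox"
touch ERR4998593_1.fastq ERR4998593_2.fastq notes.txt

any_fastq=$(ls *.fastq)                  # * matches any run of characters
one_char=$(ls ERR4998593_?.fastq)        # ? matches exactly one character
from_set=$(ls ERR4998593_[12].fastq)     # [12] matches a single 1 or 2

echo "$any_fastq"

# By default, bash passes an unmatched pattern to ls unchanged, so ls reports an error.
ls missing* 2>/dev/null || echo "nothing starts with 'missing'"
```

Here all three patterns select the same two FASTQ files and leave notes.txt out.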

Command History

If you want to repeat a command that you’ve run recently, you can access previous commands using the up arrow on your keyboard to go back to the most recent command. Likewise, the down arrow takes you forward in the command history.

You can also review your recent commands with the history command, by entering:

Code

history

to see a numbered list of recent commands. You can reuse one of these commands directly by referring to the number of that command.

For example, if your history looked like this:

Output

259  ls * 
260  ls ~/Desktop/ybcmd/bin/*.sh 
261  ls *R1*fastq

then you could repeat command #260 by entering:

Code

!260

Type ! (exclamation point) and then the number of the command from your history. You will be glad you learned this when you need to re-run very complicated commands. For more information on advanced usage of history, read section 9.3 of the Bash manual.

Examining Files — the `less` program

We now know how to switch directories, run programs, and look at the contents of directories, but how do we look at the contents of files?

One way to examine a file is to open the file in a read-only format and navigate through it using a program called less. The commands for navigating less are the same as the man program:

key	action
`Space`	to go forward
`b`	to go backward
`g`	to go to the beginning
`G`	to go to the end
`q`	to quit

Enter the following command from within the illumina_fastq directory:

Code

cd ~/Desktop/ybcmd/cs_course/data/illumina_fastq 
less ERR4998593_1.fastq

FASTQ format

The contents might look a bit confusing. That’s because they are in FASTQ format, a popular way to store sequencing data in text-based format. These files contain both sequences and information about each sequence’s read accuracy.

Each sequence is described in four lines:

Line	Description
1	Always begins with ‘@’ and gives the sequence identifier and an optional description
2	The actual DNA sequence
3	Always begins with a ‘+’ and sometimes repeats the info from line 1
4	Has a string of characters which represent the PHRED quality score for each of the bases in line 2; must have same number of characters as line 2

PHRED score

Quality encoding: !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
                   |         |         |         |         |
Quality score:    01........11........21........31........41

Each quality score represents the probability that the corresponding nucleotide call is incorrect. To make it possible to line up each individual nucleotide with its quality score, the numerical score is encoded as a single character. The scale is logarithmic: a quality score of 10 reflects a base call accuracy of 90%, while a quality score of 20 reflects a base call accuracy of 99%.
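As a rough worked example, the sketch below decodes a single quality character and converts a score into an accuracy. It assumes the encoding shown in the chart above, where ! stands for a score of 0 (an offset of 33 from the character’s numeric code), and the standard PHRED relationship P_error = 10^(-Q/10). The variable names, and the use of printf and awk for the arithmetic, are choices made for this illustration rather than part of the course material.

```shell
# Decode one quality character, assuming "!" (character code 33) means score 0.
quality_char="F"
ascii_code=$(printf '%d' "'$quality_char")   # numeric code of the character
score=$((ascii_code - 33))
echo "$quality_char -> Q$score"

# Convert a score to base call accuracy using P_error = 10^(-Q/10).
# exp(x * log(10)) is just another way of writing 10^x.
accuracy_q20=$(awk 'BEGIN { q = 20; printf "%.2f", (1 - exp(-q / 10 * log(10))) * 100 }')
echo "Q20 accuracy: $accuracy_q20%"
```

With this encoding, the F characters that dominate our quality lines decode to a score of 37, well above the 99% accuracy mark at Q20.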

Exercise

Open the ~/Desktop/ybcmd/cs_course/data/illumina_fastq/ERR4998593_1.fastq file in less. What is the last line of the file? (Hint: use the shortcuts above to speed things up)

Solution

Output

F:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFF:FFFFFFFFFFFFFF:FFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFF,F,FFF,FFFFFF,FFFF

Other programs to look into files: `cat`, `more`, `head`, and `tail`

Another way to look at files is using the command cat. This command prints out the entire contents of the file to the console. In large files, like the ones we’re working with today, this can take a long time and should generally be avoided. For small files, it can be a useful tool.

The more command prints to the console only as much of a file as fits on the screen, then waits for you to press the space bar before printing the next portion. This continues until either the last portion of the file has been printed or you press the q key (for quit) to exit more.

There’s one final way that we can look at files, and in this case, just look at part of them. This can be particularly useful if we just want to see the beginning or end of the file, or see how it’s formatted.

The commands are head and tail and they let you look at the beginning and end of a file, respectively.

Code

head ERR4998593_1.fastq

Output

@ERR4998593.40838091 40838091 length=151
CCACATGCTTTAAGTGCATGTGGTACTGCTCCAGGACCAGCATTGTAGGTCGCCAATGCTTTGGCGTAGGTGCCATCAAACATATTCGTGTAATGAGCCATGAGATGGGCTGCTCCCTTCAATGCATCAACCGGATTCCACGGATCAATGC
+
:FFFFFFF:,FFF:FFFFFFFFFFFFFFFFFFFF:FFFF:FFFFFFFFFFF:FFFFFFFFF:FFFF:FFFFFFFFFFFFFFFFFFF:FFFF,FFFFFFF,FFF:FF,FFFFFFF::FF::FFFF:FFFFF:,:FFFFF,,FFFFFFFF,,F
@ERR4998593.57624042 57624042 length=151
CCTTACCACACCGGGGCTGTGGCGTTCGACCCCATCGGCAAGGCACTCTGGGTTTCCGATAGCTCGCACCATCGGCTGCTGCGCGTCCGCAATCCGGACGGCTGGGAGAGCAAACTGCTCGTGGACACGGTCATCGGTCAGAAGGACAGGT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF
@ERR4998593.3 3 length=151
GNGGTGCTCGACGGTGGCTCGGCGGATGCGCATGGCGTCGGGCCTGCGGTCCAGCCGCTCCCGCATGGCGTCGATCACCGCCTCATGCTCCCAGCGTTTGATGCGGCGCTCCTTGCCGCTCGTACACCGGCTCTTCAGCGGGCAGCCGGCGta

Code

tail ERR4998593_1.fastq

Output

+
FFF:F,FF:FFFFFF:F,:FFF,FF:FF,FFF::F:F,FF,FF,FFF,FFFFFF:FFFFFFF,F,,FFF:FFFF:,FFFF:FF::F:FFF,F:FFFF,:FFFFF,F:F:FF,FF:F:FFFF:FFF:FFF::FF:FF:,::FF,FF:,F,FF
@ERR4998593.55595926 55595926 length=151
CAGTACAACGTTCGCTCCCTGAATTTCTGTTTCTCGGCCGGCGAAGCAATTGCTGTGGCTATCCAGGAGCGGTTCAAGCGGATGTTCGGCGTCGAAATTACGGAAGGCTGCGGGATGACCGAACTGCAAATTTACTCCATGAATCCGCCAT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFF:FF:FFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@ERR4998593.34263610 34263610 length=151
ACGCCCCACAGGGCGGCACCGACGCCGCCGCCCGGGCCCGCCGGCCCGCCCCGGTGGGCACCGGTTGCCACTGCGGCTTGCTCGGCCGTCTCACTCACTTGGACACACTTCCGTTCTTCACCGTCTCCACTGGCCGGCTAGACCGGTCCCG
+
FFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFF,FFFF:FFFFFFFF,FFFFFF::F:F,FFFFFF,FFFFFFFFFFFFF,F:F:FF,FFFF

The -n option to either of these commands can be used to print the first or last n lines of a file.

Code

head -n 1 ERR4998593_1.fastq

Output

@ERR4998593.40838091 40838091 length=151

Code

tail -n 1 ERR4998593_1.fastq

Output

F:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFF:FFFFFFFFFFFFFF:FFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFF,F,FFF,FFFFFF,FFFF
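Because each FASTQ record spans four lines, asking for four lines at a time shows one complete record. The sketch below illustrates this on a tiny made-up file; the name toy.fastq and the reads inside it are hypothetical and are not part of the course data.

```shell
# Build a tiny, made-up FASTQ file (two reads) in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'FFFFFFFFFF' \
  '@read2' 'ACGNNACGTA' '+' 'FF##FFFFFF' \
  > "$workdir/toy.fastq"

# Each FASTQ record is four lines, so -n 4 shows one complete record.
first_record=$(head -n 4 "$workdir/toy.fastq")
last_record=$(tail -n 4 "$workdir/toy.fastq")

echo "First record:"
echo "$first_record"
echo "Last record:"
echo "$last_record"

# Tidy up the temporary directory.
rm -r "$workdir"
```

The same idea applies to the course files: head -n 4 ERR4998593_1.fastq would print the first read in full.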

Creating, moving, copying, and removing files

Now we can move around in the file structure, look at files, and search files. But what if we want to copy files, move them around, or get rid of them? Most of the time, you can do these sorts of file manipulations without the command line, but there will be some cases (for example, when you’re working on a remote computer) where it will be impossible. You may also find yourself working with hundreds of files and wanting to do similar manipulations to all of them. In cases like this, it’s much faster to do these operations at the command line.

We’ll continue looking at our large Illumina sequencing files for the next part of the lesson.

Copying Files

When working with computational data, it’s important to keep a safe copy of that data that can’t be accidentally overwritten or deleted. For this lesson, our raw data is our FASTQ files.

First, let’s make a copy of one of our FASTQ files using the cp command.

Navigate to the illumina_fastq directory and enter:

Code

cp ERR4998593_1.fastq ERR4998593_1_copy.fastq 
ls -F

Output

ERR4998593_1.fastq  ERR4998593_1_copy.fastq  ERR4998593_2.fastq

The prompt will disappear for up to two minutes and reappear when the command is completed and the backup is made.

We now have two copies of the ERR4998593_1.fastq file, one of them named ERR4998593_1_copy.fastq. We’ll move this file to a new directory called backup where we’ll store our backup data files.
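If you want some reassurance that a copy really matches the original, one option is the standard cmp command, which is not covered in this lesson. The sketch below uses a small made-up file (toy.fastq is a hypothetical name) rather than the course data.

```shell
# Create a small, made-up FASTQ file in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' '@read1' 'ACGTACGTAC' '+' 'FFFFFFFFFF' > "$workdir/toy.fastq"

# Copy the file, just as we did with cp above.
cp "$workdir/toy.fastq" "$workdir/toy_copy.fastq"

# cmp -s prints nothing; its exit status is 0 only when the files are identical.
if cmp -s "$workdir/toy.fastq" "$workdir/toy_copy.fastq"; then
  copy_status="identical"
else
  copy_status="different"
fi
echo "The copy is $copy_status to the original."

# Tidy up the temporary directory.
rm -r "$workdir"
```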

Creating Directories

The mkdir command is used to make a directory. Enter mkdir followed by a space, then the name of the directory you want to create:

Code

mkdir backup
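Plain mkdir reports an error if the directory already exists or if its parent directory is missing. The standard -p option, which this lesson does not otherwise use, handles both cases. A minimal sketch, with hypothetical directory names:

```shell
# Work in a temporary directory so nothing in the course folder is touched.
workdir=$(mktemp -d)

# -p creates any missing parent directories in one go...
mkdir -p "$workdir/backup/2026/raw"

# ...and does not report an error if the directory already exists.
mkdir -p "$workdir/backup"

# List everything that was created (-R also lists subdirectories).
ls -R "$workdir"

# When you are done, remove the temporary directory with: rm -r "$workdir"
```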

Moving / Renaming files and directories

We can now move our backup file to this directory. We can move files around using the command mv:

Code

mv ERR4998593_1_copy.fastq backup 
ls backup

Output

ERR4998593_1_copy.fastq

The mv command is also how you rename files. Let’s rename this file to make it clear that this is a backup:

Code

cd backup 
mv ERR4998593_1_copy.fastq ERR4998593_1_backup.fastq 
ls

Output

ERR4998593_1_backup.fastq
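Moving and renaming can also be combined into a single mv call by giving a destination that includes the new file name. The sketch below shows this on a made-up file; toy_copy.fastq and toy_backup.fastq are hypothetical names.

```shell
# Set up a temporary directory with a backup folder and one made-up file.
workdir=$(mktemp -d)
mkdir "$workdir/backup"
printf '%s\n' '@read1' 'ACGT' '+' 'FFFF' > "$workdir/toy_copy.fastq"

# The destination includes a new file name, so the file is moved
# into backup/ and renamed in the same step.
mv "$workdir/toy_copy.fastq" "$workdir/backup/toy_backup.fastq"

# The top level now only holds backup/, which holds the renamed file.
ls "$workdir" "$workdir/backup"

# When you are done, remove the temporary directory with: rm -r "$workdir"
```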

Removing files and directories

You can delete or remove files with the rm command:

Code

rm ERR4998593_1_backup.fastq

Important: The rm command permanently removes the file. Be careful with this command. It doesn’t just nicely put the files in the recycling. They’re really gone.

By default, rm will not delete directories. You can tell rm to delete a directory using the -r (recursive) option. Let’s delete the backup directory we just made:

Code

cd .. 
rm -r backup

This will delete not only the directory, but all files within the directory.
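Since rm -r removes everything inside a directory without asking, a more cautious alternative is the standard rmdir command, which is not covered in this lesson and only removes directories that are already empty. A minimal sketch with hypothetical names:

```shell
# Set up a temporary directory containing a non-empty backup folder.
workdir=$(mktemp -d)
mkdir "$workdir/backup"
touch "$workdir/backup/toy_backup.fastq"   # touch creates an empty file

# rmdir only removes EMPTY directories, so it refuses here and nothing is lost.
if rmdir "$workdir/backup" 2>/dev/null; then
  rmdir_result="removed"
else
  rmdir_result="refused"
fi
echo "rmdir on a non-empty directory: $rmdir_result"

# Once the contents have been removed deliberately, rmdir succeeds.
rm "$workdir/backup/toy_backup.fastq"
rmdir "$workdir/backup"

# When you are done, remove the temporary directory with: rmdir "$workdir"
```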

Exercise

Starting in the illumina_fastq directory, do the following:

Make sure that you have deleted your backup directory and all files it contains.
Create a backup of each of your FASTQ files using cp. (Note: You’ll need to do this individually for each of the two FASTQ files. We haven’t learned yet how to do this with a wildcard.)
Use a wildcard to move all of your backup files to a new backup directory.
It doesn’t make sense to keep our backup directory inside the directory it is backing up. What if we accidentally delete the illumina_fastq directory? To fix this, move your new backup directory out of illumina_fastq and into the parent folder, data.

Solution

rm -r backup
cp ERR4998593_1.fastq ERR4998593_1_backup.fastq and cp ERR4998593_2.fastq ERR4998593_2_backup.fastq
mkdir backup and mv *_backup.fastq backup
mv backup .. or mv backup ~/Desktop/ybcmd/cs_course/data/ (note that you do not need to use the -r flag to move directories like you do when deleting them)

It’s always a good idea to check your work. Move to the data folder with cd .. and then list the contents of backup with ls -l backup. You should see something like:

Output

-rw-rw-r-- 1 csuser csuser 2811886584 Feb 22 11:25 ERR4998593_1_backup.fastq
-rw-rw-r-- 1 csuser csuser 2302264784 Feb 22 11:29 ERR4998593_2_backup.fastq

Here is what your file structure should look like at the end of this episode:

Searching files

We discussed in a previous episode how to search within a file using less. We can also search within files without even opening them, using grep. grep is a command-line utility for searching plain-text files for lines matching a specific set of characters (sometimes called a string) or a particular pattern (which can be specified using something called regular expressions). We’re not going to work with regular expressions in this lesson, and are instead going to specify the strings we are searching for. Let’s give it a try!

Nucleotide abbreviations

The four nucleotides that appear in DNA are abbreviated A, C, T and G. Unknown nucleotides are represented with the letter N. An N appearing in a sequencing file represents a position where the sequencing machine was not able to confidently determine the nucleotide in that position. You can think of an N as being aNy nucleotide at that position in the DNA sequence.

We’ll search for strings inside of our fastq files. Let’s first make sure we are in the correct directory:

Code

cd ~/Desktop/ybcmd/cs_course/data/illumina_fastq

Suppose we want to see how many reads in our file contain unclassified nucleotides (N).

Determining quality

In this lesson, we’re going to be manually searching for strings of bases within our sequence results to illustrate some principles of file searching. It can be really useful to do this type of searching to get a feel for the quality of your sequencing results, however, in your research you will most likely use a bioinformatics tool that has a built-in program for filtering out low-quality reads. You’ll learn how to use one such tool in a later lesson.

We will do our searching with grep, introduced above. The name “grep” is an abbreviation of “global regular expression print”.

Let’s search for the string N in the ERR4998593_1.fastq file:

Code

grep N ERR4998593_1.fastq

This command returns quite a lot of output to the terminal. Every single line in the ERR4998593_1.fastq file that contains at least one N is printed to the terminal, regardless of how long or short the file is.

We may be interested not only in the actual sequence which contains this string, but in the name (or identifier) of that sequence. Think back to the FASTQ format we discussed previously: each read takes up four lines, namely an identifier line starting with @, the sequence itself, a line starting with +, and the quality scores.

To get all of the information about each read, we will return the line immediately before each match and the two lines immediately after each match.

We can use the -B argument for grep to return a specific number of lines before each match. The -A argument returns a specific number of lines after each matching line. Here we want the line before and the two lines after each matching line, so we add -B1 -A2 to our grep command:

Code

grep -B1 -A2 N ERR4998593_1.fastq

One of the sets of lines returned by this command is:

Output

@ERR4998593.3 3 length=151
GNGGTGCTCGACGGTGGCTCGGCGGATGCGCATGGCGTCGGGCCTGCGGTCCAGCCGCTCCCGCATGGCGTCGATCACCGCCTCATGCTCCCAGCGTTTGATGCGGCGCTCCTTGCCGCTCGTACACCGGCTCTTCAGCGGGCAGCCGGCG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
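To see exactly which lines -B1 and -A2 pull in, it can help to try them on a file small enough to read in full. The sketch below uses a made-up three-read file (toy.fastq and its reads are hypothetical). It also uses the standard -c option, not covered in this lesson, which prints the number of matching lines instead of the lines themselves.

```shell
# Build a made-up FASTQ file with three reads; only read2 contains N.
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'FFFFFFFFFF' \
  '@read2' 'ACGNNACGTA' '+' 'FF##FFFFFF' \
  '@read3' 'TTGACCGGTT' '+' 'FFFFFFFFFF' \
  > "$workdir/toy.fastq"

# Print the matching line plus 1 line before it and 2 lines after it,
# which together make up the complete four-line record for read2.
record=$(grep -B1 -A2 N "$workdir/toy.fastq")
echo "$record"

# -c prints the number of matching lines instead of the lines themselves.
n_matches=$(grep -c N "$workdir/toy.fastq")
echo "Lines containing N: $n_matches"

# Tidy up the temporary directory.
rm -r "$workdir"
```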

Exercise

Search for the sequence AACC in the ERR4998593_1.fastq file. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. What is the output?
Search for the sequence CCGGTT in both FASTQ files. Have your search return all matching lines and the name (or identifier) for each sequence that contains a match. How many matching sequences do you get? (this may take up to 4 minutes to complete so be patient!)

Solution

grep -B1 AACC ERR4998593_1.fastq

Output

@ERR4998593.40838091 40838091 length=151
CCACATGCTTTAAGTGCATGTGGTACTGCTCCAGGACCAGCATTGTAGGTCGCCAATGCTTTGGCGTAGGTGCCATCAAACATATTCGTGTAATGAGCCATGAGATGGGCTGCTCCCTTCAATGCATCAACCGGATTCCACGGATCAATGC
--
@ERR4998593.34862640 34862640 length=151
CAGGCAGATTGCCGCCCGACACTTTCTTGGTCGGCGAAGAGTCGTCATTGCCAAGCCAGACTCCGGCGACCAGATAGCTCGTGTAGCCGAGGAACCAGGCGTCGCGCCAATCCTGGCTGGTCCCCGTTTTGCCCGCGGCCTGCCACCCCGG
--
@ERR4998593.63611176 63611176 length=151
GTTGTGGCCACTCGGGCGGGAAAACGTGTCGGTCGAGGAGGCTAACGAGATGGGTTATTCGGGCCTCGGCGTCCCGTCGTAGTCGCCGCCGCGAGCTAACCTACCGCGCCATGCCTGGGATGGGTGCTCGCCGCGCCCGGTCGGCCGGGAT
--
@ERR4998593.47948452 47948452 length=151
GTCCAGAGACAAACGACATCTGCTACGCGACGCAAAATCGACAGATGGCAGTACGTGAGCTAACCAAGATTGCCGATGTGATATTAGTTGTTGGGGCCAAAAACAGCTCGAATTCCAATCGGCTAAGAGAGATCGGCGAAGAGGCTGGCGT
--
@ERR4998593.44850488 44850488 length=151
TCTGATCTGCGGCTGGGCCATCACCTCCAAGGACGGCACCCCGATGAAGATGGGCGCATGGCCGGCCATGGCGCCGGAGACGGTGCGCTTCGTCGGCCAGGCGGTCGCGGTGGTGATCGCCGAGACCAAGAACCAGGCCAAGGACGCGGCC
--
@ERR4998593.67878655 67878655 length=151
CTCGTCGTCGGCGAGACATTCGCAGCGCGGATGACCGCCTACGACATCGCCGGCGACGGCGGCCTCTCGAACCGCCGCGTCTGGGCGGCGCTGCCGCAGGGCGCGGTGCCGGACGGCTGCTGCCTCGACGCCGAGGGCGCCATATGGGTGG

grep -B1 CCGGTT *.fastq

Output

ERR4998593_2.fastq-@ERR4998593.34862640 34862640 length=151
ERR4998593_2.fastq:CATCCGACCTTGATCCGGTTCCGTCGATCGCGCTTGGCTCCTCAGACGTGACGCCTCTGGAGATGGTATCGGCCTATGCGGCTTTCGCCAACGGCGGCCTGGGCGTCCAACCGCATGTGATCGCACGCGTGCGCACGGCGAACGGCAAGCA
--
ERR4998593_2.fastq-@ERR4998593.47948452 47948452 length=151
ERR4998593_2.fastq:CCTTCAGCTTTTGCTTGATGAGATAGGACCCGATCCGTCTTTCCTTGAAAAACGGCAATGCCATCGCTGTACTCCGGTTTGTAAAGATATCGCGCTGCGACGCGTGGTGATTTGGTTTCGGCGTTATCTGAGAGGAGTTAGATCTTAGAGC
--
ERR4998593_2.fastq-@ERR4998593.67878655 67878655 length=151
ERR4998593_2.fastq:CGGCAACCCCGCCCCAGGCACATCCACCTGCACCGCCTCGATGCGTCCGGTTCGGCTCGCGCGGCACTCTGCCGGGTCGGAGCTGGCGGCGGTGAGCAGGTAGAGGGTCCGGCGGTCGGCGCCGCCCAGCATGCAGGCATATACGCCCTGG
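The exercise asks how many matching sequences there are, and counting them by eye gets tedious for larger outputs. One possible shortcut is grep's standard -c option, which is not covered in this lesson. The sketch below runs it on two made-up files (toy_1.fastq and toy_2.fastq are hypothetical) rather than the course data.

```shell
# Build two made-up FASTQ files in a temporary directory.
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'TTGACCGGTT' '+' 'FFFFFFFFFF' \
  '@read2' 'ACGTACGTAC' '+' 'FFFFFFFFFF' \
  > "$workdir/toy_1.fastq"
printf '%s\n' \
  '@read1' 'CCGGTTACGT' '+' 'FFFFFFFFFF' \
  '@read2' 'AACCGGTTAA' '+' 'FFFFFFFFFF' \
  > "$workdir/toy_2.fastq"

# With several files, grep -c reports one count per file as "name:count".
counts=$(grep -c CCGGTT "$workdir"/*.fastq)
echo "$counts"

# Tidy up the temporary directory.
rm -r "$workdir"
```

Here the first file has one matching line and the second has two. Note that -c counts matching lines, so it equals the number of matching reads only when the pattern cannot also appear in an identifier or quality line.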

Redirecting output

grep allowed us to identify sequences in our FASTQ files that match a particular pattern. All of these sequences were printed to our terminal screen, but in order to work with these sequences and perform other operations on them, we will need to capture that output in some way.

We can do this with something called “redirection”. The idea is that we are taking what would ordinarily be printed to the terminal screen and redirecting it to another location. In our case, we want to print this information to a file so that we can look at it later and use other commands to analyse this data.

Redirecting output to a file — the `>` command

The command for redirecting output to a file is >.

Let’s try out this command and copy all the records (including all four lines of each record) in our FASTQ file that contain N to another file called bad_reads.txt. The new flag --no-group-separator stops grep from putting a line of two dashes (--) between matches. The reason this is necessary will become apparent shortly.

Code

grep -B1 -A2 N --no-group-separator ERR4998593_1.fastq > bad_reads.txt

The prompt should sit there a little bit, and then it should look like nothing happened. But type ls. You should see a new file called bad_reads.txt.

File extensions

You might be confused about why we’re naming our output file with a .txt extension. After all, it will be holding FASTQ formatted data that we’re extracting from our FASTQ files.

Won’t it also be a FASTQ file?

The answer is, yes - it will be a FASTQ file and it would make sense to name it with a .fastq extension.

However, using an extension such as .txt makes it easy to distinguish files you generate through exploratory processing, like the one we just made with grep, from the original sequencing files of your project. You can then easily select all the files with a specific extension for further processing using the wildcard * character, for example: grep .. *.fastq or mv *.txt newlocation.

Counting number of lines in files — the `wc` program

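We can check the number of lines in our new file using a command called wc. wc stands for word count. This command counts the number of lines, words, and characters in a file (strictly speaking it counts bytes, which for plain-text files like ours is the same as the number of characters).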

Code

wc bad_reads.txt

Output

20   78 4511 bad_reads.txt

This will tell us the number of lines, words and characters in the file. If we want only the number of lines, we can use the -l flag for lines.

Code

wc -l bad_reads.txt

Output

20 bad_reads.txt

The --no-group-separator flag used above prevents grep from adding unnecessary extra lines to the file which would alter the number of lines present.

Exercise

How many sequences are there in ERR4998593_1.fastq? Remember that every sequence is formed by four lines.
How many sequences in ERR4998593_1.fastq contain at least 5 consecutive Ns?

Solution

Code

wc -l ERR4998593_1.fastq

Output

64 ERR4998593_1.fastq

Now you can divide this number by four to get the number of sequences in your fastq file (16).

Code

grep NNNNN ERR4998593_1.fastq > bad_reads.txt 
wc -l bad_reads.txt

Output

0 bad_reads.txt
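The division by four in the first part of the solution can also be left to the shell. The sketch below is one way to do it, using a made-up file (toy.fastq is hypothetical) and two features this lesson does not cover: the < symbol, which feeds a file to a command as input, and $(( )), which does integer arithmetic.

```shell
# Build a made-up FASTQ file with three reads (12 lines).
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'FFFFFFFFFF' \
  '@read2' 'ACGNNACGTA' '+' 'FF##FFFFFF' \
  '@read3' 'TTGACCGGTT' '+' 'FFFFFFFFFF' \
  > "$workdir/toy.fastq"

# Reading the file through < makes wc print only the number, with no file name.
n_lines=$(wc -l < "$workdir/toy.fastq")

# $(( ... )) performs integer arithmetic: four lines per read.
n_reads=$(( n_lines / 4 ))
echo "$n_reads reads"

# Tidy up the temporary directory.
rm -r "$workdir"
```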

The command `>` redirects and overwrites

We might want to search multiple FASTQ files for sequences that match our search pattern. However, we need to be careful, because each time we use the > command to redirect output to a file, the new output will replace the output that was already present in the file. This is called “overwriting” and, just like you don’t want to overwrite your video recording of your kid’s first birthday party, you also want to avoid overwriting your data files.

Code

grep -B1 -A2 N --no-group-separator ERR4998593_1.fastq > bad_reads.txt
wc -l bad_reads.txt

Output

20 bad_reads.txt

Code

grep -B1 -A2 NNNNNNNNNN --no-group-separator ERR4998593_1.fastq > bad_reads.txt
wc -l bad_reads.txt

Output

0 bad_reads.txt

Here, the output of our second call to wc shows that we no longer have any lines in our bad_reads.txt file. This is because the second string we searched (NNNNNNNNNN) does not match any strings in the file. So our file was overwritten and is now empty.
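If accidental overwriting worries you, the shell has a standard option called noclobber, not covered in this lesson, that makes > refuse to replace a file that already exists. The sketch below is only an illustration on a throwaway file; the file contents are made up.

```shell
# Create a throwaway output file in a temporary directory.
workdir=$(mktemp -d)
out="$workdir/bad_reads.txt"
echo "first result" > "$out"

# With noclobber switched on, > refuses to replace an existing file
# and the shell prints an error message instead.
set -o noclobber
if echo "second result" > "$out"; then
  outcome="overwritten"
else
  outcome="protected"
fi
set +o noclobber   # switch the protection off again

echo "File was $outcome and still contains: $(cat "$out")"

# When you are done, remove the temporary directory with: rm -r "$workdir"
```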

The command `>>` redirects and appends

We can avoid overwriting our files by using the command >>. >> is known as the “append redirect” and will append new output to the end of a file, rather than overwriting it.

Code

grep -B1 -A2 N --no-group-separator ERR4998593_1.fastq > bad_reads.txt
wc -l bad_reads.txt

Output

20 bad_reads.txt

Code

grep -B1 -A2 NNNNNNNNNN --no-group-separator ERR4998593_1.fastq >> bad_reads.txt
wc -l bad_reads.txt

Output

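20 bad_reads.txt

The output of our second call to wc shows that we have not overwritten our original data: the second search found no matches, so nothing was appended, and the 20 lines from the first search are still in the file.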
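Appending is most useful when collecting results from more than one file. The sketch below illustrates this with two made-up files (toy_1.fastq and toy_2.fastq are hypothetical), each containing one read with an N.

```shell
# Build two made-up FASTQ files, each with a single read containing N.
workdir=$(mktemp -d)
printf '%s\n' '@read1' 'ACGNNACGTA' '+' 'FF##FFFFFF' > "$workdir/toy_1.fastq"
printf '%s\n' '@read1' 'GGNCATCGAA' '+' 'FF#FFFFFFF' > "$workdir/toy_2.fastq"

# > creates (or replaces) the file with the matches from the first file.
grep -B1 -A2 N "$workdir/toy_1.fastq" > "$workdir/bad_reads.txt"
lines_after_first=$(wc -l < "$workdir/bad_reads.txt")

# >> adds the matches from the second file to the end of what is already there.
grep -B1 -A2 N "$workdir/toy_2.fastq" >> "$workdir/bad_reads.txt"
lines_after_second=$(wc -l < "$workdir/bad_reads.txt")

echo "After first search: $lines_after_first lines"
echo "After second search: $lines_after_second lines"

# Tidy up the temporary directory.
rm -r "$workdir"
```

The line count goes from 4 to 8, showing that the second search added to the file rather than replacing its contents.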

Redirecting output as input to other program — the pipe `|` command

Since we might have multiple different criteria we want to search for, creating a new output file each time has the potential to clutter up our workspace. We also thus far haven’t been interested in the actual contents of those files, only in the number of reads that we’ve found. We created the files to store the reads and then counted the lines in the file to see how many reads matched our criteria. There’s a way to do this, however, that doesn’t require us to create these intermediate files - the pipe command (|).

This is probably not a key on your keyboard you use very much, so let’s all take a minute to find that key. For the standard QWERTY keyboard layout, the | character can be found using the key combination

Shift+\

What | does is take the output that is scrolling by on the terminal and use that output as input to another command. When our output was scrolling by, we might have wished we could slow it down and look at it, like we can with less. Well, it turns out that we can! We can redirect the output from our grep call through the less command.

Code

grep -B1 -A2 N ERR4998593_1.fastq | less

We can now see the output from our grep call within the less interface. We can use the up and down arrows to scroll through the output and use q to exit less.
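A pipe works with any command that reads input, not just less. As a sketch, the example below pipes grep output into head so that only the first matching record is shown. It uses a made-up four-read file (toy.fastq is hypothetical) rather than the course data.

```shell
# Build a made-up FASTQ file with four reads; read2 and read4 contain N.
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'FFFFFFFFFF' \
  '@read2' 'ACGNNACGTA' '+' 'FF##FFFFFF' \
  '@read3' 'TTGACCGGTT' '+' 'FFFFFFFFFF' \
  '@read4' 'GGNCATCGAA' '+' 'FF#FFFFFFF' \
  > "$workdir/toy.fastq"

# Pipe the matches into head to keep only the first complete record.
first_bad=$(grep -B1 -A2 N "$workdir/toy.fastq" | head -n 4)
echo "$first_bad"

# Tidy up the temporary directory.
rm -r "$workdir"
```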

If we don’t want to create a file before counting lines of output from our grep search, we could directly pipe the output of the grep search to the command wc -l. This can be helpful for investigating your output if you are not sure you would like to save it to a file.

Code

grep -B1 -A2 N --no-group-separator ERR4998593_1.fastq | wc -l

Output

20

Custom `grep` control

Use man grep to read more about other options to customize the output of grep including extended options, anchoring characters, and much more.

Redirecting output is often not intuitive, and can take some time to get used to. Once you’re comfortable with redirection, however, you’ll be able to combine any number of commands to do all sorts of exciting things with your data!

We’ll be using the redirect > and pipe | later in the course as part of our analysis workflow, so you will get lots of practice using them.

None of the command line programs we’ve been learning do anything all that impressive on their own, but when you start chaining them together, you can do some really powerful things very efficiently.
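As a small illustration of chaining, the sketch below joins three commands with pipes to count the reads that contain an N. It uses a made-up file (toy.fastq is hypothetical). In place of --no-group-separator it drops the separator lines with grep -v, a standard option not covered in this lesson that prints only the lines that do not match.

```shell
# Build a made-up FASTQ file with four reads; read2 and read4 contain N.
workdir=$(mktemp -d)
printf '%s\n' \
  '@read1' 'ACGTACGTAC' '+' 'FFFFFFFFFF' \
  '@read2' 'ACGNNACGTA' '+' 'FF##FFFFFF' \
  '@read3' 'TTGACCGGTT' '+' 'FFFFFFFFFF' \
  '@read4' 'GGNCATCGAA' '+' 'FF#FFFFFFF' \
  > "$workdir/toy.fastq"

# 1. grep -B1 -A2 pulls out the full four-line records that contain N.
# 2. grep -v '^--$' removes the "--" separator lines between records.
# 3. wc -l counts the lines that are left.
n_lines=$(grep -B1 -A2 N "$workdir/toy.fastq" | grep -v '^--$' | wc -l)

# Four lines per record, so divide to get the number of reads.
n_bad=$(( n_lines / 4 ))
echo "$n_bad reads contain N"

# Tidy up the temporary directory.
rm -r "$workdir"
```

No intermediate file is created at any point; each command hands its output straight to the next.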
