Regular Expressions
Syntax
Regular expressions are a great way to find/manipulate patterns in files. Regular expressions follow a specific syntax:
. matches any single character
* matches the preceding item zero or more times
+ matches the preceding item one or more times
? matches the preceding item zero or one time
| the "or" operator, matches either alternative
^ matches the start of a line
$ matches the end of a line
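As a rough illustration of these symbols, the sketch below runs grep -E (extended regular expressions) over a few sample words. The words and the variable names are made up for this example and are not part of the tutorial files.

```shell
# Hypothetical sample words, one per line (not from the tutorial data).
words=$(printf '%s\n' cat cart ct coat dog)

# . matches any single character: "c", one character, then "t"
dot=$(printf '%s\n' "$words" | grep -E '^c.t$')

# * matches zero or more of the preceding item: "c", anything, then "t"
star=$(printf '%s\n' "$words" | grep -E '^c.*t$')

# + matches one or more of the preceding item: at least one "a"
plus=$(printf '%s\n' "$words" | grep -E '^ca+t$')

# ? matches zero or one of the preceding item: the "a" is optional
optional=$(printf '%s\n' "$words" | grep -E '^ca?t$')

# | matches either alternative
either=$(printf '%s\n' "$words" | grep -E '^(dog|coat)$')

echo "$dot"
echo "$either"
```

The ^ and $ anchors are used throughout so that each pattern has to match the whole word rather than just part of it.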
The shell itself supports a simpler wildcard syntax in which * stands for any run of characters. For example, to list all files that end with txt we would use:
ls *txt
output
accList.txt meta.txt
However, to use the regular expression symbols listed above we need commands that understand them, which we describe below.
Search Files
Let's learn how to use grep to search for files whose names start with the letter a:
ls | grep '^a'
output
accList.txt
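The same idea works with the other anchors. Here is a minimal sketch, assuming a throwaway directory that holds the two tutorial file names plus a made-up notes.md; the directory and variable names are hypothetical.

```shell
# Work in a temporary directory so the sketch does not touch real files.
demo_dir=$(mktemp -d)
touch "$demo_dir/accList.txt" "$demo_dir/meta.txt" "$demo_dir/notes.md"

# ^a : names that start with "a"
starts_with_a=$(ls "$demo_dir" | grep '^a')

# \.txt$ : names that end with ".txt" (the backslash makes the dot literal)
ends_with_txt=$(ls "$demo_dir" | grep '\.txt$')

echo "$starts_with_a"
echo "$ends_with_txt"

# Clean up the temporary directory.
rm -r "$demo_dir"
```

Quoting the pattern keeps the shell from interpreting characters such as $ and * before grep sees them.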
We can also search inside files for lines that match a pattern. Let's try to find the line that contains the word "Run" in our metadata file:
grep "Run" meta.txt
output
Run analyte_type Assay.Type body_site
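grep also has a few options that are handy when searching inside files. The sketch below rebuilds a small copy of the metadata shown in this tutorial, assuming the columns are separated by tabs, and tries -c, -n and -v on it. The temporary file and the variable names are hypothetical.

```shell
# Recreate a small tab-separated copy of the metadata (assumed layout).
meta=$(mktemp)
printf 'Run\tanalyte_type\tAssay.Type\tbody_site\n' > "$meta"
printf 'SRR1219879\tDNA\tWGS\tPeripheral blood\n' >> "$meta"
printf 'SRR1219880\tDNA\tWGS\tPeripheral blood\n' >> "$meta"

# -c counts the matching lines instead of printing them
srr_count=$(grep -c '^SRR' "$meta")

# -n prefixes each match with its line number; cut keeps only that number
run_line=$(grep -n '^Run' "$meta" | cut -d: -f1)

# -v inverts the match: lines that do NOT start with SRR
non_srr=$(grep -v '^SRR' "$meta" | cut -f1)

echo "$srr_count $run_line $non_srr"

# Clean up the temporary file.
rm "$meta"
```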
Replace Patterns
Sometimes you may want to replace a pattern in a file, and we can accomplish this with the sed command. Let's replace the word "body_site" with "tissue" in our meta.txt file. Note that sed prints the result to the terminal and leaves the file itself unchanged:
sed 's/body_site/tissue/g' meta.txt
output
Run analyte_type Assay.Type tissue
SRR1219879 DNA WGS Peripheral blood
SRR1219880 DNA WGS Peripheral blood
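The sed command follows the pattern s/pattern/replacement/g, where s stands for substitute and the trailing g (global) replaces every match on a line rather than only the first.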
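To see what the g at the end changes, here is a small sketch on a made-up line in which the pattern appears twice. The sample line and the variable names are hypothetical.

```shell
# A made-up line with the pattern repeated.
line='body_site and body_site'

# Without g, only the first match on the line is replaced
first_only=$(printf '%s\n' "$line" | sed 's/body_site/tissue/')

# With g, every match on the line is replaced
all_matches=$(printf '%s\n' "$line" | sed 's/body_site/tissue/g')

echo "$first_only"
echo "$all_matches"
```

Because sed writes to standard output, the result can be saved by redirecting it to a new file, for example with > followed by a file name of your choice.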
Manipulating Structured Data
If we have structured data, like a data frame (think of a CSV or TSV file), then we can manipulate the data with awk, which follows this pattern:
awk '/pattern/ { action }' file
As an example without a pattern, let's print the first column of our meta.txt file with awk:
awk '{print $1}' meta.txt
output
Run
SRR1219879
SRR1219880
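By default awk splits each line on any whitespace, which matters for a value such as "Peripheral blood" that contains a space. The sketch below assumes the metadata is tab-separated (the tutorial does not say so explicitly) and compares the default splitting with -F '\t'. The temporary file and the variable names are hypothetical.

```shell
# Recreate two lines of the metadata with tab-separated columns (assumed layout).
meta=$(mktemp)
printf 'Run\tanalyte_type\tAssay.Type\tbody_site\n' > "$meta"
printf 'SRR1219879\tDNA\tWGS\tPeripheral blood\n' >> "$meta"

# Default splitting on whitespace cuts "Peripheral blood" in two
default_split=$(awk 'NR == 2 {print $4}' "$meta")

# -F '\t' splits on tabs only, so the fourth column stays whole
tab_split=$(awk -F '\t' 'NR == 2 {print $4}' "$meta")

echo "$default_split"
echo "$tab_split"

# Clean up the temporary file.
rm "$meta"
```

Here NR == 2 is an awk condition that selects the second line of the file, so only the first data row is printed.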
You can pick other columns with $ followed by the number of the column. $0 stands for the entire line. For example, if we wanted to print every line that contains the pattern SRR we would use:
awk '/SRR/ {print $0}' meta.txt
output
SRR1219879 DNA WGS Peripheral blood
SRR1219880 DNA WGS Peripheral blood
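A pattern and a column selection can be combined, and a condition on a single column can stand in for the pattern. The following is a minimal sketch, again assuming a tab-separated copy of the metadata; the temporary file and the variable names are hypothetical.

```shell
# Recreate a small tab-separated copy of the metadata (assumed layout).
meta=$(mktemp)
printf 'Run\tanalyte_type\tAssay.Type\tbody_site\n' > "$meta"
printf 'SRR1219879\tDNA\tWGS\tPeripheral blood\n' >> "$meta"
printf 'SRR1219880\tDNA\tWGS\tPeripheral blood\n' >> "$meta"

# Pattern plus action: print only the first column of lines containing SRR
run_ids=$(awk -F '\t' '/SRR/ {print $1}' "$meta")

# Condition on one column: rows whose third column is exactly WGS
wgs_rows=$(awk -F '\t' '$3 == "WGS" {print $1, $4}' "$meta")

echo "$run_ids"
echo "$wgs_rows"

# Clean up the temporary file.
rm "$meta"
```

The comma in print $1, $4 tells awk to put its output separator, a single space by default, between the two columns.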