System administrators, developers, QA engineers, support engineers, and others often need to find a particular pattern in files: a set of IP addresses belonging to a certain range, a range of timestamps, or groups of domain or subdomain names. They might also need to find a word spelled in a particular way, or find possible typos in a file. This is where regular expressions come in.
Regular expressions are templates for matching patterns (or, sometimes, for excluding them). They provide a way to describe and parse text. This tutorial gives an insight into regular expressions without going into the particularities of any language. We will simply use egrep to explain the concepts.
Regular expressions consist of two types of characters:
the regular literal characters and
the metacharacters
The metacharacters are what give regular expressions their power.
Consider the following country.txt file, where the first column is the country name, the second column is the population of the country, and the third column is the continent.
$ cat country.txt
India,1014003817,Asia
Italy,57634327,Europe
Yemen,1184300,Asia
Argentina,36955182,Latin America
Brazil,172860370,Latin America
Cameroon,15421937,Africa
Japan,126549976,Asia
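To follow along, the sample file can be recreated with a here-document. This is only one convenient way to produce it; the contents are copied from the listing above.

```shell
# Recreate the sample data file used throughout this tutorial.
# Quoting 'EOF' keeps the shell from expanding anything inside the block.
cat > country.txt <<'EOF'
India,1014003817,Asia
Italy,57634327,Europe
Yemen,1184300,Asia
Argentina,36955182,Latin America
Brazil,172860370,Latin America
Cameroon,15421937,Africa
Japan,126549976,Asia
EOF

# Sanity check: count the lines that were written.
line_count=$(wc -l < country.txt | tr -d ' ')
echo "country.txt has $line_count lines"
```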
The first group of metacharacters we will discuss is ^ and $. They match the start and the end of a line, respectively, and are called anchor metacharacters.
To find all the countries whose name starts with I, we use the expression:
$ egrep '^I' country.txt
India,1014003817,Asia
Italy,57634327,Europe
Or, to find all the countries whose continent name ends with e, we do:
$ egrep 'e$' country.txt
Italy,57634327,Europe
The next metacharacter is the dot (.), which matches any one character. To match all the lines in which the country name is exactly 5 characters long:
$ egrep '^.....,' country.txt
India,1014003817,Asia
Italy,57634327,Europe
Yemen,1184300,Asia
Japan,126549976,Asia
How about finding all lines in which country name starts with either I or J and the country name is 5 characters long?
$ egrep '^[IJ]....,' country.txt
India,1014003817,Asia
Italy,57634327,Europe
Japan,126549976,Asia
[...] is called a character set or a character class. A character set matches exactly one character, which must be one of the characters listed inside the brackets.
A ^ at the start of a character set negates it. The following example matches country names that are five characters long but do not start with I or J.
$ egrep '^[^IJ]....,' country.txt
Yemen,1184300,Asia
To match all the lines containing Asia or Africa:
$ egrep 'Asia|Africa' country.txt
India,1014003817,Asia
Yemen,1184300,Asia
Cameroon,15421937,Africa
Japan,126549976,Asia
This can also be done by factoring out the common leading A and trailing a:
$ egrep 'A(si|fric)a' country.txt
India,1014003817,Asia
Yemen,1184300,Asia
Cameroon,15421937,Africa
Japan,126549976,Asia
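The parentheses limit how far the | reaches. Alternation has the lowest precedence, so without parentheses everything on each side of | is treated as one alternative, anchors included. A small sketch with two made-up input lines (they are not part of country.txt), using grep -E, the POSIX spelling of egrep:

```shell
# Two hypothetical input lines, used only for illustration.
sample='Africa is large
South Africa'

# Without parentheses, ^ belongs to the first alternative only,
# so "Africa" may appear anywhere in the line.
ungrouped=$(printf '%s\n' "$sample" | grep -E '^Asia|Africa')

# With parentheses, ^ applies to both alternatives.
grouped=$(printf '%s\n' "$sample" | grep -E '^(Asia|Africa)')

printf 'ungrouped:\n%s\n' "$ungrouped"
printf 'grouped:\n%s\n' "$grouped"
```

The ungrouped pattern selects both lines, while the grouped one selects only the line that starts with Africa.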
Instead of writing
$ egrep '^[IJ]....,' country.txt
we can write
$ egrep '^[IJ].{4},' country.txt
where {} is called a quantifier. A quantifier determines how many times the item before it (a character, a character set, or a parenthesized group) should occur.
We can give a range too:
$ egrep '^[IJ].{4,6},' country.txt
India,1014003817,Asia
Italy,57634327,Europe
Japan,126549976,Asia
This matches country names that start with I or J and are followed by 4 to 6 more characters.
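A quantifier applies only to the single item immediately before it, and that item can be a parenthesized group. The following sketch uses two made-up strings (not taken from country.txt) and grep -E, which is equivalent to egrep, to show the difference:

```shell
# Hypothetical inputs, for illustration only.
inputs='abab
abb'

# {2} applies to the whole group (ab): "ab" must occur twice.
group_twice=$(printf '%s\n' "$inputs" | grep -E '^(ab){2}$')

# {2} applies only to the b: one "a" followed by two "b"s.
b_twice=$(printf '%s\n' "$inputs" | grep -E '^ab{2}$')

echo "(ab){2} matched: $group_twice"
echo "ab{2} matched: $b_twice"
```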
There are some shortcuts available for the quantifiers. For example,
{0,1} is equivalent to ?
$ egrep '^ab{0,1}c$' filename
is the same as
$ egrep '^ab?c$' filename
{0,} is equivalent to *
$ egrep '^ab{0,}c$' filename
is the same as
$ egrep '^ab*c$' filename
{1,} is equivalent to +
$ egrep '^ab{1,}c$' filename
is the same as
$ egrep '^ab+c$' filename
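To see the three shortcuts side by side, here is a small sketch that feeds the same made-up strings to each pattern. The input strings are assumptions chosen for illustration, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical test strings: "a", then zero to three "b"s, then "c".
inputs='ac
abc
abbc
abbbc'

# ? : zero or one b
optional=$(printf '%s\n' "$inputs" | grep -E '^ab?c$')
# * : zero or more b
any=$(printf '%s\n' "$inputs" | grep -E '^ab*c$')
# + : one or more b
some=$(printf '%s\n' "$inputs" | grep -E '^ab+c$')

# Print each result on one line, joining the matches with spaces.
printf 'ab?c: %s\n' "$(printf '%s' "$optional" | tr '\n' ' ')"
printf 'ab*c: %s\n' "$(printf '%s' "$any" | tr '\n' ' ')"
printf 'ab+c: %s\n' "$(printf '%s' "$some" | tr '\n' ' ')"
```

Here ab?c accepts ac and abc, ab*c accepts all four strings, and ab+c accepts everything except ac.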
Let us see some examples involving the expressions we have seen so far. Here, instead of searching a file, we search standard input. The trick is that grep (or egrep) prints the entire line whenever the pattern is found in it, so every typed line that matches is echoed back and appears twice, while a line that does not match appears only once.
We would like to find all the possible ways to spell the sentence "the grey colour suit was his favourite".
The expression would be:
$ egrep 'the gr[ea]y colou?r suit was his favou?rite'
the grey color suit was his favourite
the grey color suit was his favourite
the gray colour suit was his favorite
the gray colour suit was his favorite
Looking at the expression above, we can see that:
grey can be spelled as grey or gray
colour can be written as colour or color; the u is optional, so we use u?
similarly, favourite or favorite can be written as favou?rite
How about matching a US zip code?
$ egrep '^[0-9]{5}(-[0-9]{4})?$'
83456
83456
83456-
834562
92456-1234
92456-1234
10344-2342-345
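The same check can be wrapped in a small shell function for use in scripts. This is a minimal sketch: is_zip is a hypothetical helper name, grep -E stands in for egrep, and the -q option suppresses output so that only the exit status is used.

```shell
# is_zip is a hypothetical helper; it succeeds (exit status 0) when its
# first argument is a 5-digit ZIP code, optionally followed by -NNNN.
is_zip() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{5}(-[0-9]{4})?$'
}

# Same candidates as in the interactive session above.
for candidate in 83456 83456- 834562 92456-1234 10344-2342-345; do
    if is_zip "$candidate"; then
        echo "$candidate: match"
    else
        echo "$candidate: no match"
    fi
done
```

Only 83456 and 92456-1234 are reported as matches. The (-[0-9]{4})? group makes the hyphen and the four extra digits optional as a unit, so a lone trailing hyphen is rejected.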
One more example: matching all valid times on a 24-hour clock.
$ egrep '^([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$'
23:44:02
23:44:02
33:45:11
15:45:33
15:45:33
In the above example we said that if the first digit of the hour is either 0 or 1, then the second one can be any digit from 0 to 9. But if the first digit is 2, then the allowed values for the second digit are 0, 1, 2, or 3.
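The boundary cases are easy to check non-interactively. The sketch below uses the pattern anchored at both ends with ^ and $, so that nothing may precede the hour or follow the seconds; the list of candidate times is made up for illustration, and grep -E stands in for egrep.

```shell
# 24-hour time, anchored at both ends.
time_re='^([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$'

# Hypothetical candidates: three valid times, then an hour of 24,
# a minute of 60, and a single-digit hour.
valid=$(printf '%s\n' 00:00:00 19:59:59 23:59:59 24:00:00 12:60:00 7:15:00 |
    grep -E "$time_re")

printf 'accepted:\n%s\n' "$valid"
```

Only the first three candidates are accepted.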
The metacharacters \< and \> match the start and the end of a word, respectively. Suppose we want a pattern that matches words ending with color, so that unicolor, watercolor, multicolor, etc. are matched but colorless and colorful are not. Try these examples yourself to get familiar with them:
$ egrep 'color\>'
Next, to match colorless and colorful, but not unicolor, watercolor, multicolor, etc.:
$ egrep '\<color'
To match only the exact word color, we combine both:
$ egrep '\<color\>'
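Here is a sketch that runs the three patterns against one made-up word list. It assumes GNU grep, where \< and \> are available as extensions (they are not part of POSIX extended regular expressions), and uses grep -E in place of egrep.

```shell
# Hypothetical word list, one word per line.
words='unicolor
watercolor
colorless
colorful
color
multicolored'

# Words that end with "color".
ends_with=$(printf '%s\n' "$words" | grep -E 'color\>')
# Words that start with "color".
starts_with=$(printf '%s\n' "$words" | grep -E '\<color')
# The word "color" on its own.
exact=$(printf '%s\n' "$words" | grep -E '\<color\>')

printf 'ends with color:\n%s\n\n' "$ends_with"
printf 'starts with color:\n%s\n\n' "$starts_with"
printf 'exactly color:\n%s\n' "$exact"
```

Note that the bare word color satisfies all three patterns, and multicolored satisfies none of them because color sits in the middle of the word.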
Suppose we want to find words that were typed twice in a row, like "the the" or "before before". For this we have to use backreferences. A backreference such as \1 matches the same text that was matched by the first parenthesized group.
Here’s an example:
$ egrep "\<(the)\> \1"
Or the generic way:
$ egrep "\<([A-Za-z]+)\> \1\>"
The above example can be used to find all names in which the first and the last names are the same. If there is more than one set of parentheses, the second, third, fourth, etc. can be referenced with \2, \3, \4, etc.
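The following sketch exercises \1 and \2 on a few made-up lines. It assumes GNU grep, which supports backreferences and the \< and \> word anchors in extended regular expressions as extensions; grep -E is used in place of egrep.

```shell
# Hypothetical input lines, for illustration only.
text='this is the the end
before before we go
nothing repeated here
the theory holds'

# A word, a space, then the same word again (\1), ending at a word boundary.
doubled=$(printf '%s\n' "$text" | grep -E '\<([A-Za-z]+) \1\>')

# Two groups: \2 and \1 must repeat the captured words in reverse order.
mirrored=$(printf '%s\n' 'ping pong pong ping' 'ping pong ping pong' |
    grep -E '^([a-z]+) ([a-z]+) \2 \1$')

printf 'doubled words:\n%s\n\n' "$doubled"
printf 'mirrored pair:\n%s\n' "$mirrored"
```

The first pattern reports the lines containing "the the" and "before before" but not "the theory", because the trailing \> requires the repeated text to end a word. The second pattern accepts only "ping pong pong ping".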
This is just an introduction to the power of regular expressions.