This tutorial is out of date and no longer maintained.
A regular expression (regex) is simply a sequence of characters that defines a pattern.
When you want to match a string, perhaps to validate an email or password or to extract some data, a regex is an indispensable tool.
Everything in regex is a character, even an empty space character ( ).
While Unicode characters can be used to match any international text, most patterns use normal ASCII: letters, digits, punctuation, and keyboard symbols like $@%#!.
Regular expressions are everywhere. Here are some of the reasons why you should learn them:
Are there any real-world applications?
Common applications of regex are:
Also, regex is used for text matching in spreadsheets, text editors, IDEs, and Google Analytics.
We are going to use Python to write some regex. Python is known for its readability, which makes regular expressions easier to implement and follow.
In Python, the re module provides full support for regular expressions.
A GitHub repo contains code and concepts we’ll use here.
Python uses raw string notation to write regular expressions: `r"write-expression-here"`. The `r` prefix stops Python from treating backslashes as string escapes, so they reach the regex engine unchanged.
First, we’ll import the `re` module. Then we’ll write out the regex pattern.
The purpose of the `compile` method is to compile the regex pattern into a regular expression object, which will be used for matching later.
It’s advisable to compile a regex when it’ll be used several times in your program. Saving the regular expression object that `re.compile` returns and reusing it is more efficient.
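As a minimal sketch of the two ideas above (raw strings and compiling once for reuse), the snippet below uses a made-up pattern and made-up sample strings; `\d` is the shorthand for a single digit.

```python
import re

# A raw string keeps the backslash as a literal character.
plain_length = len("\n")   # 1: a single newline character
raw_length = len(r"\n")    # 2: a backslash followed by the letter n
print(plain_length, raw_length)

# Compile once, then reuse the pattern object on several inputs.
pattern = re.compile(r"\d")  # illustrative pattern: one digit
samples = ["7 wonders", "seven wonders", "42"]  # hypothetical inputs
starts_with_digit = [pattern.match(s) is not None for s in samples]
print(starts_with_digit)  # [True, False, True]
```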
To add some regular expression inside the raw string notation, we’ll use special sequences that make our work easier.
They are simply sequences of characters that begin with a backslash (`\`).
For instance:
`\d` is a match for one digit: `[0-9]`.
`\w` is a match for one word character. For ASCII text this means a letter, a digit, or an underscore: `[a-zA-Z0-9_]`.
It’s important to know them since they help us write simpler and shorter regex.
Here’s a table with more special sequences:

Element | Description |
---|---|
`.` | Matches any character except `\n` |
`\d` | Matches any digit `[0-9]` |
`\D` | Matches any non-digit character `[^0-9]` |
`\s` | Matches any whitespace character `[ \t\n\r\f\v]` |
`\S` | Matches any non-whitespace character `[^ \t\n\r\f\v]` |
`\w` | Matches any word character (letter, digit, or underscore) `[a-zA-Z0-9_]` |
`\W` | Matches any non-word character `[^a-zA-Z0-9_]` |
Points to note:
`[0-9]` is the same as `[0123456789]`
`\d` is short for `[0-9]`
`\w` is short for `[a-zA-Z0-9_]`
`[7-9]` is the same as `[789]`
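The equivalences above can be checked quickly. This is only an illustrative sketch; the sample strings are made up, and `re.findall` (which returns every match as a list) is used just to make the results easy to see.

```python
import re

text = "Agent_007 landed at 9 pm"  # hypothetical sample text

digits = re.findall(r"\d", text)
print(digits)  # ['0', '0', '7', '9']

# \d and [0-9] pick out the same characters in this ASCII text.
same_as_class = re.findall(r"[0-9]", text) == digits
print(same_as_class)  # True

high_digits = re.findall(r"[7-9]", text)
print(high_digits)  # ['7', '9']

# The underscore counts as a word character, so only the space is non-word here.
non_word = re.findall(r"\W", "snake_case name")
print(non_word)  # [' ']
```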
Having learned something about special sequences, let’s continue with our coding. Write down and run the code below.
The `match` method returns a match object, or `None` if no match was found.
We are printing `result.group()`. `group()` is a match object method that returns the entire match. If there was no match to our compiled pattern, `match` returns `None` instead of a match object, and calling `group()` on it raises an `AttributeError`.
You may wonder why the output is only a letter and not the whole word. It’s simply because `\w` matches exactly one character, and `match` only looks at the start of the string.
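The original listing is not reproduced here, so the following is a hedged sketch of the behavior just described. The input strings are assumptions chosen for illustration.

```python
import re

pattern = re.compile(r"\w")  # one word character

result = pattern.match("hello")  # hypothetical input
first_char = result.group()
print(first_char)  # h

# A string that starts with a non-word character gives no match at all.
no_match = pattern.match("!hello")
print(no_match)  # None

# Guard before calling group(), because None has no such method.
if no_match is not None:
    print(no_match.group())
```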
We’ve just written our first regex program!
We want to do more than simply match a single letter. So we amend our code to look like this:
The plus symbol (`+`) in our second pattern is what we call a quantifier.
Quantifiers specify how many times the preceding element must occur for a match.
Here are some other regex quantifiers and how to use them.

Quantifier | Description | Example | Sample match |
---|---|---|---|
`+` | one or more times | `\w+` | ABCDEF097 |
`{2}` | exactly 2 times | `\d{2}` | 01 |
`{1,}` | one or more times | `\w{1,}` | smiling |
`{2,4}` | 2, 3 or 4 times | `\w{2,4}` | 1234 |
`*` | 0 or more times | `A*B` | AAAAB |
`?` | once or none; placed after another quantifier, it makes that quantifier lazy | `\d+?` | 1 in 12345 |
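To see the quantifiers from the table in action, here is a small sketch. The input strings are illustrative assumptions; each call uses `re.match`, so matching starts at the beginning of the string.

```python
import re

one_or_more = re.match(r"\w+", "ABCDEF097").group()
exactly_two = re.match(r"\d{2}", "0123").group()
two_to_four = re.match(r"\w{2,4}", "123456").group()
zero_or_more = re.match(r"A*B", "AAAAB").group()
greedy = re.match(r"\d+", "12345").group()
lazy = re.match(r"\d+?", "12345").group()
optional_u = re.match(r"colou?r", "color").group()

print(one_or_more)   # ABCDEF097
print(exactly_two)   # 01
print(two_to_four)   # 1234
print(zero_or_more)  # AAAAB
print(greedy)        # 12345 (takes as many digits as possible)
print(lazy)          # 1 (takes as few digits as possible)
print(optional_u)    # color (the u may appear once or not at all)
```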
Let’s write some more quantifiers in our program!
The above regex uses a quantifier to match at least one digit.
Calling the function will print this output: '007'
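The function itself is not shown above, so here is one way it might look. The function name `leading_digits` and the input string are hypothetical.

```python
import re

def leading_digits(text):
    """Return the run of digits at the start of text, or None if there is none."""
    result = re.match(r"\d+", text)  # \d+ : at least one digit
    return result.group() if result else None

print(leading_digits("007 James Bond"))  # 007
print(leading_digits("James Bond"))      # None
```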
What are `^` and `$`?
You may have noticed that a regex usually has the caret (`^`) and dollar sign (`$`) characters. For example, `r"^\w+$"`.
Here’s why.
`^` and `$` are boundaries or anchors. `^` marks the start, while `$` marks the end of the string being matched (or of each line when `re.MULTILINE` is set).
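However, when the caret is the first character inside square brackets, as in `[^ … ]`, it means not. For example, `[^\s]` will tell regex to match anything that is not a whitespace character. Inside brackets `$` loses its anchor meaning, so `[^\s$]` matches anything that is neither whitespace nor a literal dollar sign.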
Let’s write some code to prove this.
First, notice there’s no `re.compile` this time. Programs that use only a few regular expressions at a time don’t have to compile a regex. We, therefore, don’t need `re.compile` for this.
Next, the module-level `re.match()` takes the pattern as its first argument and the string to match as its second, so we fed it the `line` variable.
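A short sketch of anchors and negated character classes, calling the module-level functions directly. The `line` value and the other sample strings are assumptions made for illustration.

```python
import re

line = "regex_is_fun"  # hypothetical input

# ^\w+$ : the whole string must consist of word characters.
whole_word_ok = re.match(r"^\w+$", line) is not None
with_spaces_ok = re.match(r"^\w+$", "regex is fun") is not None
print(whole_word_ok, with_spaces_ok)  # True False

# [^\s]+ : a run of characters that are not whitespace.
first_chunk = re.match(r"[^\s]+", "regex is fun").group()
print(first_chunk)  # regex

# Inside brackets, $ is a literal dollar sign rather than an anchor.
no_dollar = re.findall(r"[^\s$]+", "cost: $5 only")
print(no_dollar)  # ['cost:', '5', 'only']
```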
Moving on swiftly!
Let’s look at a new concept: search.
The `match` method checks for a match only at the beginning of the string, while `re.search()` checks for a match anywhere in the string.
Let’s write some search functionality.
The `search` method, like the `match` method, can also take an extra flags argument.
The `re.MULTILINE` flag makes `^` and `$` match at the start and end of every line in the string (lines being separated by the newline character), rather than only at the start and end of the whole string.
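The difference between `match` and `search`, and the effect of `re.MULTILINE`, can be sketched as follows. The two-line input is a made-up example.

```python
import re

text = "first line\nsecond line"  # hypothetical two-line input

at_start = re.match(r"second", text)
anywhere = re.search(r"second", text).group()
print(at_start)  # None: match only looks at the start of the string
print(anywhere)  # second

# Without MULTILINE, ^ only matches at the very start of the string.
single_line = re.search(r"^second", text)
multi_line = re.search(r"^second", text, re.MULTILINE).group()
print(single_line)  # None
print(multi_line)   # second
```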
Let’s take a look at another example of how `search` works:
This will print out `<html>`.
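The original snippet is not available here. One plausible way to get that output, assuming some HTML-like input, is a lazy quantifier that stops at the first closing angle bracket:

```python
import re

markup = "<html><head><title>Title</title>"  # hypothetical input

# .+? is lazy: it stops at the first closing angle bracket.
first_tag = re.search(r"<.+?>", markup).group()
print(first_tag)  # <html>

# The greedy version runs on to the last closing angle bracket.
greedy_tag = re.search(r"<.+>", markup).group()
print(greedy_tag)  # <html><head><title>Title</title>
```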
`re.split()` splits a string into a list, using the passed pattern as the delimiter.
For example, consider having names read from a file that we want to put in a list:
We can use `split` to break the text at each line and collect the pieces into a list as such:
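A minimal sketch, assuming the file contents have already been read into a string. The names are invented for illustration.

```python
import re

# Hypothetical file contents: one name per line.
names_text = "Ann Lee\nBob Ray\nCid Fox"

names = re.split(r"\n", names_text)
print(names)  # ['Ann Lee', 'Bob Ray', 'Cid Fox']

# Because the delimiter is a pattern, it can absorb messy separators too.
messy = "Ann, Bob;Cid ,  Dee"
cleaned = re.split(r"\s*[,;]\s*", messy)
print(cleaned)  # ['Ann', 'Bob', 'Cid', 'Dee']
```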
But what if we wanted to find all instances of words in a string?
Enter `re.findall`.
`re.findall()` finds all occurrences of a pattern, not just the first one as `re.search()` does. Unlike `search`, which returns a match object, `findall` returns a list of matches.
Let’s write and run this functionality.
The output will be a list: ['finding', 'dory']
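A sketch that would produce that list, assuming the input string is simply "finding dory", with `search` shown alongside for contrast:

```python
import re

sentence = "finding dory"  # assumed input

words = re.findall(r"\w+", sentence)
print(words)  # ['finding', 'dory']

# search stops at the first occurrence and returns a match object instead.
first_word = re.search(r"\w+", sentence).group()
print(first_word)  # finding
```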
Let’s say we want to search for people with 5 or 6-figure salaries.
Regex will make it easy for us. Let’s try it out:
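One possible approach, using made-up salary figures. The `\b` word boundary is an addition not covered above; it keeps the pattern from matching part of a longer number.

```python
import re

# Hypothetical salary figures separated by spaces.
salaries = "120000 140000 10000 1000 200 1500000"

# \d{5,6} : five or six digits; \b on each side rejects longer or shorter numbers.
five_or_six = re.findall(r"\b\d{5,6}\b", salaries)
print(five_or_six)  # ['120000', '140000', '10000']
```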
Suppose we wanted to do some string replacement. The `re.sub` method will help us do that.
It returns a new string in which every match of the pattern has been replaced.
Let’s write code to do some string replacement:
The program’s aim is to replace any digit in the string with the `_` character.
Therefore, the print output will be there is only __ thing __ do
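As an illustrative sketch of `re.sub` with a different, made-up input, note how the choice between `\d` and `\d+` changes the result:

```python
import re

text = "room 42, floor 7"  # hypothetical input

# \d replaces every digit individually.
per_digit = re.sub(r"\d", "_", text)
print(per_digit)  # room __, floor _

# \d+ replaces each whole run of digits with a single underscore.
per_number = re.sub(r"\d+", "_", text)
print(per_number)  # room _, floor _
```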
Let’s try out another example. Write down the following code:
We have managed to replace the words in the input string with the word “regex”. Regex is very powerful in string manipulations.
Sometimes you might encounter `(?=...)` in regex. This syntax defines a lookahead.
A lookahead matches an entity only if it is followed by the given pattern, without including that pattern in the match.
For instance, `r"a(?=b)"` will match `a` only if it’s followed by `b`.
Let’s write some code to elaborate on that.
The pattern tries to match the closest string that is followed by a space character and the word `fox`.
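A hedged sketch of such a lookahead. The sentence is an assumption, and the pattern is one way to express "a word followed by a space and fox".

```python
import re

sentence = "the quick brown fox"  # hypothetical input

# \w+ must be followed by " fox", but " fox" is not part of the match.
before_fox = re.search(r"\w+(?= fox)", sentence).group()
print(before_fox)  # brown

# a(?=b): only the a characters that are directly followed by b.
a_before_b = re.findall(r"a(?=b)", "ab ac ab")
print(a_before_b)  # ['a', 'a']
```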
Let’s look at another example. Go ahead and write this snippet:
The above regex tries to match all instances of characters that are followed by a comma.
When we run this, we should print out a list containing: [ 'Me', 'myself' ]
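A sketch that yields that list, assuming an input such as "Me, myself, and I":

```python
import re

text = "Me, myself, and I"  # assumed input

# Each run of word characters that is immediately followed by a comma.
before_commas = re.findall(r"\w+(?=,)", text)
print(before_commas)  # ['Me', 'myself']
```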
What if you wanted to match a string that has a bunch of these special regex characters?
A backslash is used to define special characters in regex. So to match those characters literally in our pattern string, we need to escape each one by putting a backslash in front of it. A literal backslash itself is written as `\\`.
Here’s an example.
Let’s try it one more time just to get it. Suppose we want to include a `+` (a reserved quantifier) in a string to be matched by a pattern. We’ll do something like this:
We have successfully escaped the `+` character so that regex does not mistake it for a quantifier.
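A minimal sketch of escaping, with made-up inputs. `re.escape` is a standard-library helper that adds the backslashes for you when the text to match comes from elsewhere.

```python
import re

question = "what is 1+1?"  # hypothetical input

# \+ matches a literal plus sign.
escaped = re.search(r"1\+1", question).group()
print(escaped)  # 1+1

# Unescaped, 1+1 means "one or more 1s followed by a 1", which is not in the text.
unescaped = re.search(r"1+1", question)
print(unescaped)  # None

# re.escape builds the escaped pattern automatically.
auto = re.search(re.escape("1+1"), question).group()
print(auto)  # 1+1

# A literal backslash is matched with \\ inside a raw string.
backslash = re.search(r"\\", r"C:\temp").group()
print(len(backslash))  # 1
```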
For a real-world application, here’s a function that monetizes a number using thousands separator commas.
As you might have noticed, the pattern uses a look-ahead mechanism. The brackets are responsible for grouping the digits into clusters, which can be separated by the commas.
For example, the number `1223456` will become `1,223,456`.
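The original function is not reproduced here. The sketch below is one way to write it with a lookahead; the name `monetize` and the exact pattern are assumptions, not necessarily the author's code.

```python
import re

def monetize(number):
    """Insert thousands separators into an integer (hypothetical helper)."""
    # Match a digit that is followed by one or more groups of exactly
    # three digits running to the end of the string, then add a comma after it.
    return re.sub(r"(\d)(?=(\d{3})+$)", r"\1,", str(number))

print(monetize(1223456))  # 1,223,456
print(monetize(999))      # 999
print(monetize(1000))     # 1,000
```

For plain integers, Python's built-in formatting (`f"{1223456:,}"`) gives the same result without a regex; the regex version is shown because it illustrates the lookahead.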
Congratulations on making it to the end of this intro! From the special sequences of characters, matching and searching, to finding all using reliable look aheads and manipulating strings in regex – we’ve covered quite a lot.
There are some advanced concepts in regex such as backtracking and performance optimization which we can continue to learn as we grow. A good resource for more intricate details would be the re module documentation.
Great job learning something that many consider difficult!
If you found this helpful, spread the word.