Regular Expressions in Python: Theory and Practice

Let’s take a look at regular expressions in Python, starting with syntax and ending with usage examples.

Note: You are reading an improved version of an article we released earlier.

  • Regular Expression Basics
  • Regular Expressions in Python
  • Tasks

Regular Expression Basics

Regular expressions are patterns used to search text and match sequences of characters in it.

Suppose we have an input field where an email address should be entered. Until we validate the input, the string can contain absolutely any set of characters, which is not what we want.

To detect an error when an invalid email address is entered, we can use the following regular expression:
r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+$'

Basically, our pattern is a set of characters that checks the string against a given rule. Let’s see how it works.
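As a rough sketch of such a check in Python: the pattern below is assumed to be the common "local part, @, dotted domain" form, and the helper name is_valid_email is purely illustrative. It is a simple format check, not a full validator for every legal email address.

```python
import re

# Assumed pattern: one or more allowed characters, "@", a domain label,
# then at least one ".label" group.
EMAIL_RE = re.compile(r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+$")


def is_valid_email(address: str) -> bool:
    """Return True if the address fits the simple pattern above."""
    return EMAIL_RE.match(address) is not None


print(is_valid_email("user.name+tag@example.com"))  # True
print(is_valid_email("not an email"))               # False
print(is_valid_email("user@localhost"))             # False: no dot in the domain
```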

The syntax of RegEx

The syntax of regular expressions is unusual. A pattern consists of ordinary characters, such as letters and digits, and metacharacters, which define the structure of the string to match.

[Figure: Syntax of regular expressions]

There are also additional constructs that allow you to abbreviate regular expressions:

  • \d – matches any single digit and replaces the expression [0-9];
  • \D – matches any character that is not a digit and replaces [^0-9];
  • \w – matches any letter, digit, or underscore;
  • \W – matches any character except letters, digits, and underscores;
  • \s – matches any whitespace character;
  • \S – matches any non-whitespace character.

What regular expressions are used for

  • to specify the desired format, such as a phone number or email address;
  • to break strings into substrings;
  • for searching, substituting and extracting characters;
  • for fast execution of non-trivial operations.
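
Looking ahead to Python's re module, the uses listed above can be sketched with the shorthand classes \d, \w, \s and \D. The sample strings and the phone format (three digits, a hyphen, four digits) are assumptions made up for illustration.

```python
import re

# Specifying a format: a hypothetical phone format, e.g. 555-1234.
phone_format = re.compile(r"\d{3}-\d{4}")
ok = phone_format.fullmatch("555-1234") is not None
bad = phone_format.fullmatch("5551234") is not None
print(ok, bad)  # True False

# Breaking a string into substrings on runs of whitespace (\s).
parts = re.split(r"\s+", "one  two\tthree")
print(parts)  # ['one', 'two', 'three']

# Extracting runs of word characters (\w: letters, digits, underscore).
words = re.findall(r"\w+", "a_1, b-2")
print(words)  # ['a_1', 'b', '2']

# Substituting: drop every non-digit character (\D).
digits_only = re.sub(r"\D", "", "555-1234")
print(digits_only)  # 5551234
```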

The syntax of regular expressions is mostly standardized, so once you learn it, you can apply it in almost any programming language.

Note: Regular expressions are not always the best tool. For simple operations, Python's built-in string methods are often sufficient.

Regular Expressions in Python

Python has a re module for regular expressions. You just have to import it:
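
A minimal sketch of the import, followed by one call to confirm the module is available (the sample string is made up for illustration):

```python
import re

# Once imported, the module's functions are called as re.<name>(...).
digits = re.findall(r"\d", "a1b2")
print(digits)  # ['1', '2']
```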

And here are the most commonly used functions the module provides:

  • re.match()
  • re.findall()
  • re.split()
  • re.sub()
  • re.compile()

Let’s take a closer look at each of them.

re.match()

This function looks for a given pattern only at the beginning of the string. For example, if we call match() on the string "AV Analytics AV" with the pattern "AV", the match succeeds. But if we look for "Analytics", the result is None, because that word is not at the start of the string:
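
The two calls described above might look like this (the variable names are illustrative):

```python
import re

text = "AV Analytics AV"

# "AV" sits at the very start of the string, so match() returns a match object.
at_start = re.match(r"AV", text)
print(at_start is not None)  # True

# "Analytics" occurs later in the string, so match() returns None.
not_at_start = re.match(r"Analytics", text)
print(not_at_start)  # None
```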

When the substring is found, match() returns a match object. To get the matched text, apply the group() method. (The "r" prefix before the pattern marks it as a raw string in Python, so backslashes in it are not treated as escape characters.)
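
A short sketch of group() on the same string. The start() and end() calls are an addition not mentioned in the text; they report where the match begins and ends.

```python
import re

result = re.match(r"AV", "AV Analytics AV")

if result is not None:
    print(result.group(0))               # AV
    print(result.start(), result.end())  # 0 2

# A raw string keeps the backslash as-is, so these two literals are equal.
print(r"\d" == "\\d")  # True
```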

re.findall()

Returns a list of all matches found. Unlike match(), findall() is not restricted to the beginning of the string. If we search for "AV" in our string, it returns every occurrence of "AV". Because the pattern can be left unanchored or anchored with ^, findall() can cover the cases handled by re.search(), which finds the first match anywhere in the string, and by re.match().
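
For illustration, here is findall() on the same sample string, first unanchored and then anchored to the start with ^:

```python
import re

text = "AV Analytics AV"

# Every occurrence, wherever it appears in the string.
everywhere = re.findall(r"AV", text)
print(everywhere)  # ['AV', 'AV']

# Anchoring with ^ limits the search to the start, similar to match().
only_start = re.findall(r"^AV", text)
print(only_start)  # ['AV']

# No occurrences gives an empty list rather than None.
nothing = re.findall(r"Python", text)
print(nothing)  # []
```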

re.split()

Splits a string by the occurrences of a pattern. For example, splitting the word "Analytics" by the letter "y" gives "Anal" and "tics". The split() function also takes a maxsplit argument with a default value of 0, which means the string is split as many times as possible. If you pass a positive value, the string is split at most that many times.

For example, a string with two separators normally splits into three parts. With maxsplit set to 1, it is split into two parts instead of three.
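
A sketch of both behaviors. Splitting "AV Analytics AV" on whitespace is an assumed example chosen because it yields three parts; it is not necessarily the string the original example used.

```python
import re

# Splitting "Analytics" by the letter "y".
by_y = re.split(r"y", "Analytics")
print(by_y)  # ['Anal', 'tics']

text = "AV Analytics AV"

# Default maxsplit=0: split at every whitespace character.
all_parts = re.split(r"\s", text)
print(all_parts)  # ['AV', 'Analytics', 'AV']

# maxsplit=1: split only once, so two parts instead of three.
two_parts = re.split(r"\s", text, maxsplit=1)
print(two_parts)  # ['AV', 'Analytics AV']
```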

re.sub()

Looks for a pattern in the string and replaces every match with the specified substring. If the pattern is not found, the string is returned unchanged.
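
A minimal sketch of re.sub() on the sample string. The replacement text "XY" is arbitrary, and the count argument shown last is an extra not covered in the text; it limits the number of replacements.

```python
import re

text = "AV Analytics AV"

# Every match of the pattern is replaced.
replaced = re.sub(r"AV", "XY", text)
print(replaced)  # XY Analytics XY

# No match: the string comes back unchanged.
unchanged = re.sub(r"Python", "XY", text)
print(unchanged)  # AV Analytics AV

# count=1 replaces only the first match.
first_only = re.sub(r"AV", "XY", text, count=1)
print(first_only)  # XY Analytics AV
```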

So far, we have searched for a fixed sequence of characters. But what if there is no fixed text to look for, and we need to extract from a string any characters that satisfy certain rules? This problem comes up often when extracting information from strings, and it is solved by writing an expression with special characters.

Operator Description

  • . – any single character except the newline \n
  • ? – 0 or 1 occurrence of the pattern on the left
  • + – 1 or more occurrences of the pattern on the left
  • * – 0 or more occurrences of the pattern on the left
  • \w – any letter, digit, or underscore (\W matches anything else)
  • \d – any digit, same as [0-9] (\D matches anything but a digit)
  • \s – any whitespace character (\S matches any non-whitespace character)
  • \b – word boundary
  • […] – one of the characters in brackets ([^…] matches any character except those in brackets)
  • \ – escapes special characters (\. matches a literal dot, \+ a literal plus sign)
  • ^ and $ – start and end of the string, respectively
  • {n,m} – from n to m occurrences of the pattern on the left ({,m} means from 0 to m)
  • a|b – matches a or b
  • () – groups the expression and captures the matched text
  • \t, \n, \r – tab, newline, and carriage return characters, respectively
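
Several of the operators above can be tried with findall(). The sample sentence below is made up for illustration, and the phone-like format in it is an assumption.

```python
import re

text = "Call 555-1234 or 555-9876 before 9 am."

# \d+ : runs of one or more digits.
numbers = re.findall(r"\d+", text)
print(numbers)  # ['555', '1234', '555', '9876', '9']

# {n} : exactly n occurrences of the pattern on the left.
phones = re.findall(r"\d{3}-\d{4}", text)
print(phones)  # ['555-1234', '555-9876']

# () : with groups, findall() returns the captured text of each group.
pairs = re.findall(r"(\d{3})-(\d{4})", text)
print(pairs)  # [('555', '1234'), ('555', '9876')]

# \b : without word boundaries, "or" is also found inside "before".
loose = re.findall(r"or", text)
whole_word = re.findall(r"\bor\b", text)
print(loose, whole_word)  # ['or', 'or'] ['or']

# a|b : either alternative.
period = re.findall(r"am|pm", text)
print(period)  # ['am']

# \. : an escaped dot matches only a literal dot.
dots = re.findall(r"\.", text)
print(dots)  # ['.']

# ^ : anchor to the start of the string.
first_word = re.findall(r"^\w+", text)
print(first_word)  # ['Call']
```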

More information on special characters can be found in the documentation for regular expressions in Python 3.