
Python3 Regular Expressions

A regular expression is a special sequence of characters that helps you easily check whether a string matches a certain pattern.

Python has included the re module since version 1.5; it provides Perl-style regular expression patterns.

The re module gives Python the full capability of regular expressions.

The compile function generates a regular expression object from a pattern string and optional flags. This object has a series of methods for matching and substitution.

The re module also provides functions that are identical in functionality to these methods, using a pattern string as their first parameter.
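As a minimal sketch of the two equivalent styles described above (the pattern and sample text are illustrative and not part of the original tutorial):

```python
import re

text = "order 66 shipped"   # illustrative sample text

# Style 1: compile once, then call methods on the pattern object
number = re.compile(r"\d+")
m1 = number.search(text)

# Style 2: call the module-level function, passing the pattern string first
m2 = re.search(r"\d+", text)

print(m1.group(), m2.group())   # both calls find the same substring
```

Compiling first is mainly useful when the same pattern is applied many times; for a one-off search the module-level function is usually enough.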

This section mainly introduces the commonly used regular expression processing functions in Python. If you are unfamiliar with regular expressions, you can refer to our Regular Expressions - Tutorial.

re.match Function

re.match attempts to match a pattern starting from the beginning of the string. If the match is not successful at the beginning, match() returns None.

Function Syntax:

re.match(pattern, string, flags=0)

Function parameter description:

Parameter	Description
pattern	The regular expression to be matched.
string	The string to be matched.
flags	Flags for controlling the matching mode of the regular expression, such as case sensitivity, multi-line matching, etc. See: Regular Expression Modifiers - Optional Flags

If the match is successful, re.match returns a match object; otherwise, it returns None.

We can use the group(num) or groups() match object functions to retrieve the matched expressions.

Match object method	Description
group(num=0)	The string of the entire matched expression. `group()` can take multiple group numbers, and it will return a tuple containing the values corresponding to those groups.
groups()	Returns a tuple containing all subgroup strings, from 1 to the number of groups.
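The following sketch shows how group() and groups() relate to each other. The pattern and the input string are made up for illustration:

```python
import re

m = re.match(r"(\w+)@(\w+)\.(\w+)", "user@example.org")  # illustrative input

whole = m.group()        # same as m.group(0): the entire match
first = m.group(1)       # text captured by the first group
pair = m.group(1, 3)     # several group numbers return a tuple
all_groups = m.groups()  # every group, starting from group 1

print(whole, first, pair, all_groups)
```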

Example

#!/usr/bin/python3

import re
print(re.match('www', 'www.tutorialpro.org').span())  # Matches at the start
print(re.match('org', 'www.tutorialpro.org'))         # 'org' is not at the start, so no match

The output of the above example is:

(0, 3)
None

Example

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"
# (.*) is greedy: it matches as many characters (except newlines) as possible
# (.*?) is non-greedy: it matches as few characters as possible
matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")

The output of the above example is:

matchObj.group() :  Cats are smarter than dogs
matchObj.group(1) :  Cats
matchObj.group(2) :  smarter

re.search Function

re.search scans the entire string and returns the first successful match.

Function syntax:

re.search(pattern, string, flags=0)

Function parameter description:

Parameter	Description
pattern	The regular expression to be matched.
string	The string to be matched.
flags	Flags for controlling the matching mode of the regular expression, such as case sensitivity, multi-line matching, etc. See: Regular Expression Modifiers - Optional Flags

If the match is successful, re.search returns a match object; otherwise, it returns None.

We can use the group(num) or groups() match object functions to retrieve the matched expressions.

Match object method	Description
group(num=0)	The entire matched string of the expression. group() can take multiple group numbers as input, in which case it returns a tuple containing the values corresponding to those groups.
groups()	Returns a tuple containing all the subgroup strings, from 1 to the number of groups.

Example

#!/usr/bin/python3

import re

print(re.search('www', 'www.tutorialpro.org').span())  # Matches at the start
print(re.search('org', 'www.tutorialpro.org').span())  # Matches later in the string

The output of the above example is:

(0, 3)
(16, 19)

Example

#!/usr/bin/python3

import re

line = "Cats are smarter than dogs"

searchObj = re.search( r'(.*) are (.*?) .*', line, re.M|re.I)

if searchObj:
   print ("searchObj.group() : ", searchObj.group())
   print ("searchObj.group(1) : ", searchObj.group(1))
   print ("searchObj.group(2) : ", searchObj.group(2))
else:
   print ("Nothing found!!")

The output is:

searchObj.group() :  Cats are smarter than dogs
searchObj.group(1) :  Cats
searchObj.group(2) :  smarter

Difference between re.match and re.search

re.match only matches at the beginning of the string. If the start of the string does not match the regex, it fails and returns None. re.search, on the other hand, scans through the whole string and returns the first match it finds at any position.

Example

#!/usr/bin/python3

import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'dogs', line, re.M|re.I)
if matchObj:
   print ("match --> matchObj.group() : ", matchObj.group())
else:
   print ("No match!!")

matchObj = re.search( r'dogs', line, re.M|re.I)
if matchObj:
   print ("search --> matchObj.group() : ", matchObj.group())
else:
   print ("No match!!")

The output is:

No match!!
search --> matchObj.group() :  dogs
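The difference also holds in multi-line mode. As a small sketch (the two-line string is illustrative), re.match still only tries position 0, while re.search with ^ and re.M can match at the start of any line:

```python
import re

text = "first line\nsecond line"   # illustrative two-line string

# re.match only tries position 0, even with re.M
a = re.match(r"second", text, re.M)

# re.search with ^ and re.M can match at the start of any line
b = re.search(r"^second", text, re.M)

# Without re.M, ^ only matches at the very start of the string
c = re.search(r"^second", text)

print(a, b.span(), c)
```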

Search and Replace

Python's re module provides re.sub for replacing matched parts of the string.

Syntax:

re.sub(pattern, repl, string, count=0, flags=0)

Parameters:

pattern: The regex pattern string.
repl: The replacement string or a function.
string: The original string to be searched and replaced.
count: The maximum number of pattern occurrences to be replaced. Default is 0, which replaces all occurrences.
flags: Optional flags that modify matching, such as re.I. Several flags can be combined with |.

The first three are required parameters, and the last two are optional.
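A short sketch of the count parameter, using illustrative dates. The second call goes slightly beyond the parameter list above: it uses the backreferences \1, \2 and \3 in repl to reuse the captured groups.

```python
import re

text = "2024-01-15 and 2025-02-20"   # illustrative dates

# count=1 replaces only the first occurrence
first_only = re.sub(r"\d{4}", "YYYY", text, count=1)

# \1, \2, \3 in repl refer to the groups captured by the pattern
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(first_only)
print(swapped)
```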

Example

#!/usr/bin/python3
import re

phone = "2004-959-559 # This is a phone number"

# Remove comments
num = re.sub(r'#.*$', "", phone)
print ("Phone Number : ", num)

# Remove non-digit characters
num = re.sub(r'\D', "", phone)
print ("Phone Number : ", num)

The output is:

Phone Number :  2004-959-559 
Phone Number :  2004959559

repl parameter as a function

In the following example, the matched numbers are multiplied by 2:

Example

#!/usr/bin/python

import re

# Multiply matched numbers by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)

s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))

The output is:

A46G8HFD1134


compile Function

The compile function compiles a regular expression into a pattern object, whose match() and search() methods can then be called.

Syntax:

re.compile(pattern, flags=0)

Parameters:

pattern: A string representing a regular expression.
flags (optional): Flags that specify matching modes, such as ignoring case, multi-line mode, etc. Common values are:
- re.I: Ignores case.
- re.L: Makes \w, \W, \b, \B dependent on the current locale. In Python 3 it can only be used with bytes patterns.
- re.M: Multi-line mode.
- re.S: Makes the '.' special character match any character including a newline.
- re.U: Makes \w, \W, \b, \B, \d, \D, \s, \S dependent on the Unicode character property database. In Python 3 this is already the default for str patterns, so the flag is redundant.
- re.X: Ignores whitespace and comments for better readability.
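As a sketch of how two of these flags change behavior (the patterns and inputs are illustrative): re.X allows a commented, multi-line pattern, and re.S lets '.' match a newline.

```python
import re

# re.X lets the pattern span several lines and carry comments
number = re.compile(r"""
    \d+        # integer part
    (\.\d+)?   # optional fractional part
""", re.X)

# re.S lets '.' cross line breaks; without it, '.' stops at '\n'
with_s = re.compile(r"a.b", re.S).match("a\nb")
without_s = re.compile(r"a.b").match("a\nb")

print(number.match("3.14").group(), with_s is not None, without_s)
```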

Example

>>> import re
>>> pattern = re.compile(r'\d+')                    # Matches at least one digit
>>> m = pattern.match('one12twothree34four')        # Tries to match at the beginning, no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10) # Tries to match starting from 'e', no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10) # Tries to match starting from '1', matches
>>> print(m)                                        # Returns a Match object
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0)   # Can omit 0
'12'
>>> m.start(0)   # Can omit 0
3
>>> m.end(0)     # Can omit 0
5
>>> m.span(0)    # Can omit 0
(3, 5)

When a match is successful, a Match object is returned, which includes:

group([group1, ...]): Retrieves one or more subgroups of the match. Use group() or group(0) to get the entire matched substring.
start([group]): Returns the starting index of the matched substring in the entire string. The default parameter is 0.
end([group]): Returns the index just past the last character of the matched substring in the entire string (the index of its last character + 1). The default parameter is 0.
span([group]): Returns a tuple (start(group), end(group)).
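The methods above also accept a group number or name. A minimal sketch with an illustrative input and named groups:

```python
import re

m = re.search(r"(?P<key>\w+)=(?P<val>\d+)", "set width=20")  # illustrative

print(m.group("key"), m.group("val"))   # groups can be read by name
print(m.start(), m.end(), m.span())     # positions of the whole match
print(m.span(1), m.span(2))             # positions of each group
```

Because end() is the index just past the match, the slice "set width=20"[m.start():m.end()] gives back the matched text.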

Another example:

>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)   # re.I ignores case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m)                            # Match successful, returns a Match object
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0)                            # Returns the entire matched substring
'Hello World'
>>> m.span(0)                             # Returns the indices of the entire matched substring
(0, 11)
>>> m.group(1)                            # Returns the first matched subgroup
'Hello'
>>> m.span(1)                             # Returns the indices of the first matched subgroup
(0, 5)
>>> m.group(2)                            # Returns the second matched subgroup
'World'
>>> m.span(2)                             # Returns the indices of the second matched subgroup
(6, 11)
>>> m.groups()                            # Equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3)                            # No third group exists
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group

findall

Finds all substrings in the string that match the regex pattern and returns them as a list. If the pattern contains more than one capturing group, it returns a list of tuples, one tuple per match. If no matches are found, it returns an empty list.

Note: match and search match once, while findall matches all occurrences.

Syntax:

re.findall(pattern, string, flags=0)
or
pattern.findall(string[, pos[, endpos]])

Parameters:

pattern: The matching pattern.
string: The string to be matched.
pos: Optional parameter, specifies the starting position in the string, default is 0.
endpos: Optional parameter, specifies the ending position in the string, default is the length of the string.

Finding all numbers in a string:

Example

import re

result1 = re.findall(r'\d+', 'tutorialpro 123 google 456')

pattern = re.compile(r'\d+')   # Find numbers
result2 = pattern.findall('tutorialpro 123 google 456')
result3 = pattern.findall('run88oob123google456', 0, 10)

print(result1)
print(result2)
print(result3)

Output:

['123', '456']
['123', '456']
['88', '12']

A pattern with several capturing groups returns a list of tuples:

Example

import re

result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print(result)

Output:

[('width', '20'), ('height', '10')]
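What findall returns depends on how many capturing groups the pattern has. A small sketch comparing the three cases on the same input:

```python
import re

text = "set width=20 and height=10"

no_group = re.findall(r"\w+=\d+", text)        # whole matches
one_group = re.findall(r"(\w+)=\d+", text)     # only the group's text
two_groups = re.findall(r"(\w+)=(\d+)", text)  # one tuple per match

print(no_group)
print(one_group)
print(two_groups)
```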

re.finditer

Similar to findall, it finds all substrings in the string where the regex pattern matches and returns them as an iterator.

re.finditer(pattern, string, flags=0)

Parameters:

Parameter	Description
pattern	The regex pattern to match
string	The string to match against
flags	Flags to control the regex matching, such as case-insensitivity, multiline, etc. See: Regex Modifiers - Optional Flags
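Unlike findall, the iterator yields match objects, so positions are available as well as text. A minimal sketch that collects both:

```python
import re

text = "12a32bc43jf3"

# Each item produced by finditer is a match object
positions = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

print(positions)
```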

Example

import re

it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())

Output:

12
32
43
3

re.split

The split method splits the string by the occurrences of the pattern and returns a list.

re.split(pattern, string, maxsplit=0, flags=0)

Parameters:

Parameter	Description
pattern	The regex pattern to match
string	The string to match against
maxsplit	Maximum number of splits to perform. The default, 0, means no limit.
flags	Flags to control the regex matching, such as case-insensitivity, multiline, etc. See: Regex Modifiers - Optional Flags

Example

>>> import re
>>> re.split(r'\W+', 'tutorialpro, tutorialpro, tutorialpro.')
['tutorialpro', 'tutorialpro', 'tutorialpro', '']
>>> re.split(r'(\W+)', ' tutorialpro, tutorialpro, tutorialpro.')   # A capturing group keeps the separators
['', ' ', 'tutorialpro', ', ', 'tutorialpro', ', ', 'tutorialpro', '.', '']
>>> re.split(r'\W+', ' tutorialpro, tutorialpro, tutorialpro.', maxsplit=1)
['', 'tutorialpro, tutorialpro, tutorialpro.']
>>> re.split(r'a+', 'hello world')   # The pattern never matches, so no split is performed
['hello world']

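Regular Expression Object

re.Pattern

re.compile() returns a re.Pattern object (called RegexObject in Python 2).

re.Match

Functions such as match() and search() return a re.Match object (called MatchObject in Python 2) when they succeed. Its main methods are:

group() returns the string matched by the RE
start() returns the starting position of the match
end() returns the ending position of the match (the index just past the last matched character)
span() returns a tuple containing the (start, end) positions of the match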
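In Python 3 these two object types are exposed as re.Pattern and re.Match. A short sketch, with an illustrative input, that checks the types and calls the methods listed above:

```python
import re

pattern = re.compile(r"\d+")
m = pattern.search("abc123")   # illustrative input

print(type(pattern).__name__, type(m).__name__)
print(m.group(), m.start(), m.end(), m.span())
```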

Regular Expression Modifiers - Optional Flags

Regular expression functions accept optional flags that control the matching mode. Multiple flags can be combined with bitwise OR (|). For example, re.I | re.M sets both the I and M flags:

Modifier	Description
re.I	Makes the match case-insensitive
re.L	Performs locale-aware matching
re.M	Makes $ match the end of a line and makes ^ match the start of any line
re.S	Makes a period (dot) match any character, including a newline
re.U	Interprets characters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B. In Python 3 this is already the default for str patterns, so the flag is redundant.
re.X	This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments
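A minimal sketch of combining flags with |, using an illustrative two-line input. Each flag alone finds only one of the two names; together they find both:

```python
import re

text = "Name: alice\nname: BOB"   # illustrative two-line input

# re.I ignores case; re.M lets ^ match at the start of every line
both = re.findall(r"^name: (\w+)", text, re.I | re.M)
only_i = re.findall(r"^name: (\w+)", text, re.I)
only_m = re.findall(r"^name: (\w+)", text, re.M)

print(both, only_i, only_m)
```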

Regular Expression Patterns

Pattern strings use a special syntax to represent a regular expression.

Letters and digits match themselves.

Many letters and digits take on a special meaning when preceded by a backslash (for example, \d or \1).

Punctuation characters that have a special meaning in a pattern (such as . * + ? ( ) [ ] { } ^ $ |) match themselves only when escaped with a backslash.

The backslash itself needs to be escaped.

Since regular expressions often contain backslashes, it's best to write patterns as raw strings. For example, r'\t' is the same string as '\\t', and as a pattern it matches a tab character.
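The effect of raw strings can be checked directly. In this sketch the variable names and the sample path are illustrative:

```python
import re

raw = r"\t"        # two characters: a backslash and the letter t
escaped = "\\t"    # the same two characters, written with an escaped backslash
tab = "\t"         # one character: an actual tab

same_text = (raw == escaped)
# As a pattern, backslash + t is read by re as "match a tab"
matches_tab = re.fullmatch(raw, tab) is not None

# To match a literal backslash, the regex itself needs \\
path = "C:\\temp"                                 # the text C:\temp
found = re.search(r"\\temp", path).group()        # raw string: two backslashes
also_found = re.search("\\\\temp", path).group()  # plain string: four backslashes

print(same_text, matches_tab, found == also_found)
```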

The following table lists the special elements in the syntax of regular expression patterns. If you use patterns with the optional flags argument, the meaning of some pattern elements may change.

Pattern	Description
^	Matches the start of the string
$	Matches the end of the string.
.	Matches any character except a newline. With re.DOTALL, it can match any character including a newline.
[...]	Represents a set of characters. [amk] matches 'a', 'm', or 'k'
[^...]	Matches characters not in the list. [^abc] matches any character except 'a', 'b', or 'c'.
re*	Matches 0 or more repetitions of the preceding expression.
re+	Matches 1 or more repetitions of the preceding expression.
re?	Matches 0 or 1 repetition of the preceding expression (greedy). When ? follows another quantifier (*?, +?, ??, {n,m}?), it makes that quantifier non-greedy.
re{n}	Matches exactly n repetitions of the preceding expression. For example, "o{2}" does not match "o" in "Bob" but matches the two o's in "food".
re{n,}	Matches at least n repetitions of the preceding expression. For example, "o{2,}" does not match "o" in "Bob" but matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
re{n,m}	Matches at least n and at most m repetitions of the preceding expression, in greedy mode
a|b	Matches either a or b
(re)	Groups regular expressions and remembers the matched text
(?imx)	Turns on the i, m, or x options for the whole pattern. In Python it must appear at the start of the pattern.
(?-imx)	Turns off the i, m, or x options. Python's re module does not accept this standalone form; use (?-imx: re) instead.
(?: re)	Groups regular expressions without remembering matched text
(?imx: re)	Temporarily toggles on i, m, or x options within the parentheses
(?-imx: re)	Temporarily toggles off i, m, or x options within the parentheses
(?#...)	Comment
(?= re)	Positive lookahead: matches at a position where re matches next, without consuming any characters.
(?! re)	Negative lookahead: matches at a position where re does not match next, without consuming any characters.
(?> re)	Atomic group: matches re independently, without backtracking into it (supported by Python's re module since version 3.11)
\w	Matches digits, letters, and underscores
\W	Matches non-digits, non-letters, and non-underscores
\s	Matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character
\d	Matches any digit, equivalent to [0-9]
\D	Matches any non-digit
\A	Matches the start of the string
\Z	Matches only at the end of the string (in Python, \Z does not match before a trailing newline)
\z	Matches the end of the string. Older Python versions reject \z; use \Z instead.
\G	Matches the position where the last match finished (Perl syntax; not supported by Python's re module)
\b	Matches a word boundary, the position between a word character and a non-word character. For example, 'er\b' can match the "er" in "never" but not in "verb"
\B	Matches a non-word boundary. 'er\B' can match the "er" in "verb" but not in "never"
\n, \t, etc.	Match a newline character, a tab character, etc.
\1...\9	Matches the content of the nth group
\10	Matches the content of group 10. In Python, a number that starts with 0 or is three octal digits long (such as \012) is instead read as an octal character code.
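A few rows of the pattern table above can be tried out as follows. This is only a sketch; the input strings are illustrative:

```python
import re

html = "<b>one</b><b>two</b>"   # illustrative input

greedy = re.findall(r"<b>.*</b>", html)    # .* takes as much as possible
lazy = re.findall(r"<b>.*?</b>", html)     # .*? takes as little as possible

alt = re.findall(r"cat|dog", "cat, dog, bird")          # a|b
lookahead = re.findall(r"\w+(?=!)", "hi! there wow!")   # words followed by '!'
repeated = re.search(r"\b(\w+) \1\b", "this is is a test").group()  # \1 backreference
bounded = re.findall(r"o{2,}", "Bob food foooood")      # re{n,}

print(greedy, lazy, alt, lookahead, repeated, bounded)
```

The lookahead results do not include the '!' because a lookahead only checks what follows and consumes nothing.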

Regular Expression Examples

Character Matching

Example	Description
python	Matches "python"

Character Classes

Example	Description
[Pp]ython	Matches "Python" or "python"
rub[ye]	Matches "ruby" or "rube"
[aeiou]	Matches any one of the enclosed letters
[0-9]	Matches any digit. Equivalent to [0123456789]
[a-z]	Matches any lowercase letter
[A-Z]	Matches any uppercase letter
[a-zA-Z0-9]	Matches any ASCII letter or digit
[^aeiou]	Matches any character except the enclosed letters
[^0-9]	Matches any character except digits
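The character classes in this table can be checked with findall. A minimal sketch with illustrative inputs:

```python
import re

either_case = re.findall(r"[Pp]ython", "Python and python")
ruby_rube = re.findall(r"rub[ye]", "ruby rube rubx")
vowels = re.findall(r"[aeiou]", "regex")
non_digits = re.findall(r"[^0-9]", "a1b2")
alnum_runs = re.findall(r"[a-zA-Z0-9]+", "id_42-ok")   # '_' and '-' are not in the class

print(either_case, ruby_rube, vowels, non_digits, alnum_runs)
```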

Special Character Classes

Example	Description
.	Matches any single character except "\n". To match any character including "\n", use the re.S (re.DOTALL) flag or a pattern like '[\s\S]'
\d	Matches a digit character. For ASCII text, equivalent to [0-9]
\D	Matches a non-digit character. For ASCII text, equivalent to [^0-9]
\s	Matches any whitespace character, including spaces, tabs, form feeds, etc. For ASCII text, equivalent to [ \f\n\r\t\v]
\S	Matches any non-whitespace character. For ASCII text, equivalent to [^ \f\n\r\t\v]
\w	Matches any word character including underscores. For ASCII text, equivalent to '[A-Za-z0-9_]'
\W	Matches any non-word character. For ASCII text, equivalent to '[^A-Za-z0-9_]'
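A short sketch of the special character classes, using illustrative inputs. The last two lines show that '.' only matches a newline when re.S is given:

```python
import re

text = "call 555 1234\tnow_ok"   # illustrative; contains a space and a tab

digits = re.findall(r"\d+", text)
non_digits = re.findall(r"\D+", "a1b22c")
fields = re.split(r"\s+", text)       # split on runs of whitespace
words = re.findall(r"\w+", text)      # '_' counts as a word character

dot_default = re.findall(r"a.c", "abc a\nc")
dot_all = re.findall(r"a.c", "abc a\nc", re.S)

print(digits, non_digits, fields, words, dot_default, dot_all)
```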
