Python3 Regular Expressions
A regular expression is a special sequence of characters that helps you easily check whether a string matches a certain pattern.
The re module has been part of Python since version 1.5 and provides Perl-style regular expression patterns.
The re module gives Python the full capability of regular expressions.
The compile function generates a regular expression object from a pattern string and optional flags. This object has a series of methods for matching and substitution.
The re module also provides functions that are identical in functionality to these methods, using a pattern string as their first parameter.
This section mainly introduces the commonly used regular expression processing functions in Python. If you are unfamiliar with regular expressions, you can refer to our Regular Expressions - Tutorial.
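As a minimal sketch of the equivalence between the module-level functions and the methods of a compiled object, the snippet below runs the same search both ways. The sample text and the variable names are assumptions chosen only for illustration.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "order 66 shipped, order 77 pending"

# Module-level function: the pattern string is the first argument.
via_module = re.search(r"\d+", text)

# Compiled pattern object: the same search as a method call.
number = re.compile(r"\d+")
via_object = number.search(text)

print(via_module.group(), via_object.group())    # 66 66
print(via_module.span() == via_object.span())    # True
```

Compiling once is mainly useful when the same pattern is applied many times.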
re.match Function
re.match attempts to match a pattern starting from the beginning of the string. If the match is not successful at the beginning, match() returns None.
Function Syntax:
re.match(pattern, string, flags=0)
Function parameter description:
Parameter | Description |
---|---|
pattern | The regular expression to be matched. |
string | The string to be matched. |
flags | Flags for controlling the matching mode of the regular expression, such as case sensitivity, multi-line matching, etc. See: Regular Expression Modifiers - Optional Flags |
If the match is successful, re.match returns a match object; otherwise, it returns None.
We can use the group(num) or groups() match object functions to retrieve the matched expressions.
Match object method | Description |
---|---|
group(num=0) | The string of the entire matched expression. group() can take multiple group numbers, and it will return a tuple containing the values corresponding to those groups. |
groups() | Returns a tuple containing all subgroup strings, from 1 to the number of groups. |
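The two methods in the table can be sketched as follows. The date string is a hypothetical input, not taken from the original examples.

```python
import re

# Hypothetical date string used only for illustration.
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2024-05-17 release")

whole = m.group()          # same as m.group(0): the entire match
year = m.group(1)          # first parenthesized subgroup
year_day = m.group(1, 3)   # several numbers -> tuple of those groups
parts = m.groups()         # every subgroup, in order

print(whole)      # 2024-05-17
print(year)       # 2024
print(year_day)   # ('2024', '17')
print(parts)      # ('2024', '05', '17')
```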
Example
#!/usr/bin/python3
import re
print(re.match('www', 'www.tutorialpro.org').span())  # Matches at the start
print(re.match('org', 'www.tutorialpro.org'))         # 'org' is not at the start, so no match
The output of the above example is:
(0, 3)
None
Example
#!/usr/bin/python3
import re
line = "Cats are smarter than dogs"
# (.*) is greedy: it matches as many characters as possible (excluding newlines)
# (.*?) is non-greedy: it matches as few characters as possible
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
The output of the above example is:
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
re.search Function
re.search scans the entire string and returns the first successful match.
Function syntax:
re.search(pattern, string, flags=0)
Function parameter description:
Parameter | Description |
---|---|
pattern | The regular expression to be matched. |
string | The string to be matched. |
flags | Flags for controlling the matching mode of the regular expression, such as case sensitivity, multi-line matching, etc. See: Regular Expression Modifiers - Optional Flags |
If the match is successful, re.search returns a match object; otherwise, it returns None.
We can use the group(num) or groups() match object functions to retrieve the matched expressions.
Match object method | Description |
---|---|
group(num=0) | The entire matched string of the expression. group() can take multiple group numbers as input, in which case it returns a tuple containing the values corresponding to those groups. |
groups() | Returns a tuple containing all the subgroup strings, from 1 to the number of groups. |
Example
#!/usr/bin/python3
import re
print(re.search('www', 'www.tutorialpro.org').span())  # Matches at the start
print(re.search('org', 'www.tutorialpro.org').span())  # Matches, but not at the start
The output of the above example is:
(0, 3)
(16, 19)
Example
#!/usr/bin/python3
import re
line = "Cats are smarter than dogs"
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M|re.I)
if searchObj:
    print("searchObj.group() : ", searchObj.group())
    print("searchObj.group(1) : ", searchObj.group(1))
    print("searchObj.group(2) : ", searchObj.group(2))
else:
    print("Nothing found!!")
The output is:
searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter
Difference between re.match and re.search
re.match only matches at the beginning of the string. If the pattern does not match there, it fails and returns None. re.search, on the other hand, scans through the string and returns the first match it finds at any position.
Example
#!/usr/bin/python3
import re
line = "Cats are smarter than dogs"
matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")
matchObj = re.search(r'dogs', line, re.M|re.I)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")
The output is:
No match!!
search --> matchObj.group() : dogs
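One consequence worth sketching: passing re.M does not change where re.match starts. The two-line string below is a hypothetical input; the behavior shown is that re.match stays anchored to the start of the whole string, while re.search with ^ can match at the start of any line.

```python
import re

# Hypothetical two-line input for illustration.
text = "first line\nsecond line"

# re.match is anchored to the start of the string, even with re.M.
at_start = re.match(r"second", text, re.M)

# re.search with ^ and re.M can match at the start of any line.
line_start = re.search(r"^second", text, re.M)

print(at_start)            # None
print(line_start.span())   # (11, 17)
```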
Search and Replace
Python's re module provides re.sub for replacing matched parts of the string.
Syntax:
re.sub(pattern, repl, string, count=0, flags=0)
Parameters:
- pattern: The regex pattern string.
- repl: The replacement string or a function.
- string: The original string to be searched and replaced.
- count: The maximum number of pattern occurrences to be replaced. Default is 0, which replaces all occurrences.
- flags: Compilation flags, numeric form.
The first three are required parameters, and the last two are optional.
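The effect of the count parameter can be sketched as below. The input string is a made-up example.

```python
import re

# Hypothetical input string for illustration.
s = "a-b-c-d"

all_replaced = re.sub(r"-", "+", s)         # count=0 (default): replace everything
first_two = re.sub(r"-", "+", s, count=2)   # stop after two replacements

print(all_replaced)   # a+b+c+d
print(first_two)      # a+b+c-d
```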
Example
#!/usr/bin/python3
import re
phone = "2004-959-559 # This is a phone number"
# Remove comments
num = re.sub(r'#.*$', "", phone)
print ("Phone Number : ", num)
# Remove non-digit characters
num = re.sub(r'\D', "", phone)
print ("Phone Number : ", num)
The output is:
Phone Number : 2004-959-559
Phone Number : 2004959559
repl parameter as a function
In the following example, the matched numbers are multiplied by 2:
Example
#!/usr/bin/python3
import re
# Multiply matched numbers by 2
def double(matched):
    value = int(matched.group('value'))
    return str(value * 2)
s = 'A23G4HFD567'
print(re.sub(r'(?P<value>\d+)', double, s))
The output is:
A46G8HFD1134
compile Function
The compile function compiles a regular expression into a pattern object, which provides methods such as match() and search().
Syntax:
re.compile(pattern[, flags])
Parameters:
- pattern: A string representing a regular expression.
- flags (optional): Flags that specify matching modes, such as ignoring case, multi-line mode, etc. The possible values are:
  - re.I: Ignores case.
  - re.L: Makes \w, \W, \b, \B, \s, \S dependent on the current locale.
  - re.M: Multi-line mode.
  - re.S: Makes the '.' special character match any character including a newline.
  - re.U: Makes \w, \W, \b, \B, \d, \D, \s, \S dependent on the Unicode character property database.
  - re.X: Ignores whitespace and comments for better readability.
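As an illustration of the re.X flag, the sketch below spreads a pattern over several lines with comments. The phone-number layout is borrowed from the re.sub example and is otherwise an assumption.

```python
import re

# re.X lets the pattern span several lines with comments;
# whitespace inside the pattern is ignored.
phone = re.compile(r"""
    (\d{4})   # first block of digits
    -         # literal separator
    (\d{3})   # second block
    -
    (\d{3})   # third block
""", re.X)

m = phone.match("2004-959-559")
print(m.groups())   # ('2004', '959', '559')
```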
Example
>>> import re
>>> pattern = re.compile(r'\d+') # Matches at least one digit
>>> m = pattern.match('one12twothree34four') # Tries to match at the beginning, no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 2, 10) # Tries to match starting from 'e', no match
>>> print(m)
None
>>> m = pattern.match('one12twothree34four', 3, 10) # Tries to match starting from '1', matches
>>> print(m) # Returns a Match object
<re.Match object; span=(3, 5), match='12'>
>>> m.group(0) # Can omit 0
'12'
>>> m.start(0) # Can omit 0
3
>>> m.end(0) # Can omit 0
5
>>> m.span(0) # Can omit 0
(3, 5)
When a match is successful, a Match object is returned, which includes:
- group([group1, ...]): Retrieves one or more subgroups of the match. Use group() or group(0) to get the entire matched substring.
- start([group]): Returns the starting index of the matched substring in the entire string. The default parameter is 0.
- end([group]): Returns the ending index + 1 of the matched substring in the entire string. The default parameter is 0.
- span([group]): Returns the tuple (start(group), end(group)).
Another example:
>>> import re
>>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I) # re.I ignores case
>>> m = pattern.match('Hello World Wide Web')
>>> print(m) # Match successful, returns a Match object
<re.Match object; span=(0, 11), match='Hello World'>
>>> m.group(0) # Returns the entire matched substring
'Hello World'
>>> m.span(0) # Returns the indices of the entire matched substring
(0, 11)
>>> m.group(1) # Returns the first matched subgroup
'Hello'
>>> m.span(1) # Returns the indices of the first matched subgroup
(0, 5)
>>> m.group(2) # Returns the second matched subgroup
'World'
>>> m.span(2) # Returns the indices of the second matched subgroup
(6, 11)
>>> m.groups() # Equivalent to (m.group(1), m.group(2), ...)
('Hello', 'World')
>>> m.group(3) # No third group exists
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: no such group
findall
Finds all substrings in the string that match the regex pattern and returns them as a list. If the pattern contains more than one group, it returns a list of tuples. If no matches are found, it returns an empty list.
Note: match and search return at most one match, while findall returns all of them.
Syntax:
re.findall(pattern, string, flags=0)
or
pattern.findall(string[, pos[, endpos]])
Parameters:
- pattern: The matching pattern.
- string: The string to be matched.
- pos: Optional parameter, specifies the starting position in the string, default is 0.
- endpos: Optional parameter, specifies the ending position in the string, default is the length of the string.
Finding all numbers in a string:
Example
import re
result1 = re.findall(r'\d+', 'tutorialpro 123 google 456')
pattern = re.compile(r'\d+') # Find numbers
result2 = pattern.findall('tutorialpro 123 google 456')
result3 = pattern.findall('run88oob123google456', 0, 10)
print(result1)
print(result2)
print(result3)
Output:
['123', '456']
['123', '456']
['88', '12']
Multiple groups in the pattern, returning a list of tuples:
Example
import re
result = re.findall(r'(\w+)=(\d+)', 'set width=20 and height=10')
print(result)
Output:
[('width', '20'), ('height', '10')]
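How the number of groups changes the result of findall can be sketched with the same input string. The variable names are illustrative only.

```python
import re

text = "set width=20 and height=10"

no_group = re.findall(r"\w+=\d+", text)            # no group: whole matches
one_group = re.findall(r"\w+=(\d+)", text)         # one group: only that group's text
non_capturing = re.findall(r"(?:\w+)=\d+", text)   # (?:...) does not count as a group

print(no_group)        # ['width=20', 'height=10']
print(one_group)       # ['20', '10']
print(non_capturing)   # ['width=20', 'height=10']
```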
re.finditer
Similar to findall, it finds all substrings in the string where the regex pattern matches and returns them as an iterator.
re.finditer(pattern, string, flags=0)
Parameters:
Parameter | Description |
---|---|
pattern | The regex pattern to match |
string | The string to match against |
flags | Flags to control the regex matching, such as case-insensitivity, multiline, etc. See: Regex Modifiers - Optional Flags |
Example
import re
it = re.finditer(r"\d+", "12a32bc43jf3")
for match in it:
    print(match.group())
Output:
12
32
43
3
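Because finditer yields match objects rather than strings, positions are available as well. A small sketch on the same input:

```python
import re

text = "12a32bc43jf3"

# Each item is a match object, so group() and span() can both be used.
found = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]
print(found)
# [('12', (0, 2)), ('32', (3, 5)), ('43', (7, 9)), ('3', (11, 12))]
```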
re.split
The split method splits the string by the occurrences of the pattern and returns a list.
re.split(pattern, string[, maxsplit=0, flags=0])
Parameters:
Parameter | Description |
---|---|
pattern | The regex pattern to match |
string | The string to match against |
maxsplit | Number of splits to perform, default is 0, which means no limit. |
flags | Flags to control the regex matching, such as case-insensitivity, multiline, etc. See: Regex Modifiers - Optional Flags |
Example
>>> import re
>>> re.split(r'\W+', 'tutorialpro, tutorialpro, tutorialpro.')
['tutorialpro', 'tutorialpro', 'tutorialpro', '']
>>> re.split(r'(\W+)', ' tutorialpro, tutorialpro, tutorialpro.') # A capturing group keeps the separators in the result
['', ' ', 'tutorialpro', ', ', 'tutorialpro', ', ', 'tutorialpro', '.', '']
>>> re.split(r'\W+', ' tutorialpro, tutorialpro, tutorialpro.', 1)
['', 'tutorialpro, tutorialpro, tutorialpro.']
>>> re.split(r'x+', 'hello world') # If the pattern never matches, split returns the string unsplit
['hello world']
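A common use of split is breaking a record on several kinds of separator at once. The record below is a hypothetical input.

```python
import re

# Hypothetical record with mixed separators.
record = "alpha, beta;gamma   delta"

fields = re.split(r"[,;\s]+", record)
print(fields)   # ['alpha', 'beta', 'gamma', 'delta']

# With maxsplit=2 the remainder is left untouched.
head = re.split(r"[,;\s]+", record, maxsplit=2)
print(head)     # ['alpha', 'beta', 'gamma   delta']
```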
Regular Expression Object
re.Pattern
re.compile() returns a pattern object (re.Pattern; older documentation calls it RegexObject).
re.Match
A match object (re.Match; formerly MatchObject) provides these methods:
- group() returns the string matched by the RE
- start() returns the starting position of the match
- end() returns the ending position of the match
- span() returns a tuple containing the (start, end) positions of the match
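The sketch below inspects a compiled pattern and the match object it produces. The e-mail-like input is hypothetical. The attributes pattern, groups and flags on the pattern object, and re and string on the match object, are part of the standard re module.

```python
import re

pattern = re.compile(r"(\w+)@(\w+)", re.I)

print(pattern.pattern)              # (\w+)@(\w+)
print(pattern.groups)               # 2
print(bool(pattern.flags & re.I))   # True

# Hypothetical input string.
m = pattern.search("contact: Admin@Example")
print(m.group())             # Admin@Example
print(m.start(), m.end())    # 9 22
print(m.span())              # (9, 22)
print(m.re is pattern)       # True
```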
Regular Expression Modifiers - Optional Flags
Regular expressions can take optional flag modifiers that control the matching mode. Multiple flags can be combined with bitwise OR (|). For example, re.I | re.M sets both the I and M flags:
Modifier | Description |
---|---|
re.I | Makes the match case-insensitive |
re.L | Performs locale-aware matching |
re.M | Makes $ match the end of a line and makes ^ match the start of any line |
re.S | Makes a period (dot) match any character, including a newline |
re.U | Interprets characters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, \B, \d, \D, \s, \S. It is the default for str patterns in Python 3. |
re.X | This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments |
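A short sketch of combining flags with |, and of the effect of re.S on the dot. The two-line text is a made-up input.

```python
import re

text = "Name: Ann\nname: Bob"   # hypothetical two-line input

# re.I ignores case; re.M lets ^ match at the start of each line.
names = re.findall(r"^name: (\w+)", text, re.I | re.M)

# Without re.S the dot stops at the newline; with re.S it crosses it.
without_s = re.search(r"Ann.*Bob", text)
with_s = re.search(r"Ann.*Bob", text, re.S)

print(names)           # ['Ann', 'Bob']
print(without_s)       # None
print(with_s.span())   # (6, 19)
```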
Regular Expression Patterns
Pattern strings use a special syntax to represent a regular expression.
Letters and digits match themselves: a pattern made of ordinary letters and digits matches exactly that text.
Many letters and digits take on a special meaning when preceded by a backslash.
Punctuation characters that have a special meaning match themselves only when escaped.
The backslash itself needs to be escaped.
Since regular expressions often contain backslashes, it's best to write them as raw strings. For example, the raw string r'\t' is equivalent to '\\t', and as a pattern it matches a tab character.
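Why raw strings matter can be sketched with \b, which Python itself reads as a backspace character in an ordinary string literal. The sample sentence is hypothetical.

```python
import re

# The raw string and the doubled-backslash string are the same pattern.
same = r"\d+" == "\\d+"
print(same)   # True

# Without the r prefix, "\b" is a single backspace character, not a word boundary.
print(len("\b"), len(r"\b"))   # 1 2

raw_hits = re.findall(r"\bcat\b", "cat concat cat.")
plain_hits = re.findall("\bcat\b", "cat concat cat.")
print(raw_hits)     # ['cat', 'cat']
print(plain_hits)   # []
```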
The following table lists the special elements in the syntax of regular expression patterns. If you use patterns with the optional flags argument, the meaning of some pattern elements may change.
Pattern | Description |
---|---|
^ | Matches the start of the string |
$ | Matches the end of the string, or just before a newline at the end of the string |
. | Matches any character except a newline. With re.DOTALL, it can match any character including a newline. |
[...] | Represents a set of characters. [amk] matches 'a', 'm', or 'k' |
[^...] | Matches characters not in the set. [^abc] matches any character except 'a', 'b', or 'c'. |
re* | Matches 0 or more repetitions of the preceding expression. |
re+ | Matches 1 or more repetitions of the preceding expression. |
re? | Matches 0 or 1 repetition of the preceding expression. Placed after another quantifier (as in *?, +?, ??), it makes that quantifier non-greedy |
re{n} | Matches exactly n repetitions of the preceding expression. For example, "o{2}" does not match the "o" in "Bob" but matches the two o's in "food". |
re{n,} | Matches at least n repetitions of the preceding expression. For example, "o{2,}" does not match the "o" in "Bob" but matches all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*". |
re{n,m} | Matches at least n and at most m repetitions of the preceding expression, in greedy mode |
a \| b | Matches either a or b |
(re) | Groups regular expressions and remembers the matched text |
(?imx) | Turns on the i, m, or x options for the whole pattern; it must appear at the start of the pattern |
(?-imx) | Turns off the i, m, or x options. Python's re module does not accept this form on its own; use (?-imx: re) instead |
(?: re) | Groups regular expressions without remembering matched text |
(?imx: re) | Temporarily toggles on i, m, or x options within the parentheses |
(?-imx: re) | Temporarily toggles off i, m, or x options within the parentheses |
(?#...) | Comment |
(?= re) | Positive lookahead: asserts that re matches at this position without consuming any characters |
(?! re) | Negative lookahead: asserts that re does not match at this position, without consuming any characters |
(?> re) | Atomic group: matches re without backtracking into it (Python 3.11 and later) |
\w | Matches digits, letters, and underscores |
\W | Matches non-digits, non-letters, and non-underscores |
\s | Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v] |
\S | Matches any non-whitespace character |
\d | Matches any digit; for ASCII text, equivalent to [0-9] |
\D | Matches any non-digit |
\A | Matches the start of the string |
\Z | Matches only at the end of the string |
\z | Matches the end of the string. Python versions before 3.14 do not accept \z; use \Z instead |
\G | Matches the position where the last match finished (Perl syntax; not supported by Python's re module) |
\b | Matches a word boundary, the position between a word character and a non-word character. For example, 'er\b' can match "never" but not "verb" |
\B | Matches a non-word boundary. 'er\B' can match "verb" but not "never" |
\n, \t, etc. | Match a newline character, a tab character, etc. |
\1...\9 | Matches the content of the nth group |
\10 | Matches the content of group 10. A backslash followed by a leading 0 or by three octal digits (such as \012) is read as an octal character code instead |
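A few of the table entries are easier to see in action. The sketch below contrasts greedy and non-greedy quantifiers, then shows a lookahead and a backreference. All input strings are hypothetical.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r"<.+>", html)    # .+ grabs as much as possible
lazy = re.findall(r"<.+?>", html)     # .+? stops at the first '>'
print(greedy)   # ['<b>bold</b> and <i>italic</i>']
print(lazy)     # ['<b>', '</b>', '<i>', '</i>']

# (?=...) checks what follows without consuming it.
stems = re.findall(r"\w+(?=\.py)", "run main.py then util.py")
print(stems)    # ['main', 'util']

# \1 must repeat whatever group 1 matched.
doubled = re.findall(r"\b(\w+) \1\b", "it is is a a test")
print(doubled)  # ['is', 'a']
```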
Regular Expression Examples
Character Matching
Example | Description |
---|---|
python | Matches "python" |
Character Classes
Example | Description |
---|---|
[Pp]ython | Matches "Python" or "python" |
rub[ye] | Matches "ruby" or "rube" |
[aeiou] | Matches any one of the enclosed letters |
[0-9] | Matches any digit. Similar to [0123456789] |
[a-z] | Matches any lowercase letter |
[A-Z] | Matches any uppercase letter |
[a-zA-Z0-9] | Matches any ASCII letter or digit |
[^aeiou] | Matches any character except the enclosed letters |
[^0-9] | Matches any character except digits |
Special Character Classes
Example | Description |
---|---|
. | Matches any single character except "\n". To match any character including '\n', use a pattern like '[\s\S]' or pass the re.S flag |
\d | Matches a digit character. Equivalent to [0-9] |
\D | Matches a non-digit character. Equivalent to [^0-9] |
\s | Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v] |
\S | Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v] |
\w | Matches any word character including underscores. Equivalent to '[A-Za-z0-9_]' |
\W | Matches any non-word character. Equivalent to '[^A-Za-z0-9_]' |
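The character classes above can be checked with a short sketch. It also shows that a dot inside square brackets is literal, which is why '[\s\S]' or re.S is the way to match every character including newlines. The test strings are made up.

```python
import re

text = "a.b\nc"   # hypothetical five-character input

# Inside [...] the dot is literal, so [.\n] matches only '.' or a newline.
dot_or_newline = re.findall(r"[.\n]", text)
print(dot_or_newline)   # ['.', '\n']

any_char = re.findall(r"[\s\S]", text)        # every character, newline included
dot_default = re.findall(r".", text)          # newline excluded
dot_with_s = re.findall(r".", text, re.S)     # newline included
print(len(any_char), len(dot_default), len(dot_with_s))   # 5 4 5

both_cases = re.findall(r"[Pp]ython", "Python and python")
print(both_cases)   # ['Python', 'python']
```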