Regular Expressions - Matching Rules
Basic Pattern Matching
Everything starts with the basics. A pattern is the fundamental element of a regular expression: a sequence of characters that describes the strings it should match. Patterns can be simple, consisting of ordinary characters, or highly complex, using special characters to represent ranges of characters, repetition, or context. For example:
^once
This pattern includes the special character ^, which means the pattern matches only strings that start with once. For instance, it matches the string "once upon a time" but not "There once was a man from New York". Just as ^ denotes the beginning of a string, $ is used to match strings that end with a given pattern.
bucket$
This pattern matches "Who kept all of his cash in a bucket" but not "buckets". When ^ and $ are used together, they denote an exact match (the whole string must match the pattern). For example:
^bucket$
This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern:
once
Matches the string:
There once was a man from New York
Who kept all of his cash in a bucket.
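The anchoring rules above can be checked with a short script. This is a minimal sketch in Python, chosen only for illustration since the rules themselves are engine-neutral. The helper name `matches` is our own, and `re.search` is used because it looks for the pattern anywhere in the string, so the anchors do all the work.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if pattern is found anywhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# ^ anchors the pattern to the start of the string.
print(matches(r"^once", "once upon a time"))                     # True
print(matches(r"^once", "There once was a man from New York"))   # False

# $ anchors the pattern to the end of the string.
print(matches(r"bucket$", "Who kept all of his cash in a bucket"))  # True
print(matches(r"bucket$", "buckets"))                               # False

# Both anchors together require an exact match.
print(matches(r"^bucket$", "bucket"))    # True
print(matches(r"^bucket$", "a bucket"))  # False

# Without anchors the pattern may occur anywhere.
print(matches(r"once", "There once was a man from New York"))    # True
```

One Python-specific detail: `$` also matches just before a single trailing newline, so "bucket\n" would still satisfy `bucket$`.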
In this pattern, the letters (o-n-c-e) are literal characters, meaning they represent themselves; digits behave the same way. Some other characters need escape sequences: whitespace such as tabs and line breaks, and punctuation that has a special meaning inside a pattern. All escape sequences start with a backslash (\). The escape sequence for a tab is \t, so to check whether a string starts with a tab, we can use this pattern:
^\t
Similarly, \n represents a newline and \r represents a carriage return. Other special characters can be matched literally by prefixing them with a backslash: the backslash itself is written \\, the period is written \., and so on.
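The escape sequences described above can be tried out as follows. This is a small Python sketch, not part of the original text; the variable names and sample strings are made up for illustration. Raw strings (`r"..."`) are used so that the backslashes reach the regex engine unchanged.

```python
import re

starts_with_tab = re.compile(r"^\t")     # \t is a tab character
literal_dot = re.compile(r"^3\.14$")     # \. is a real period, not "any character"
literal_backslash = re.compile(r"\\")    # \\ is one backslash

print(bool(starts_with_tab.search("\tindented line")))  # True
print(bool(starts_with_tab.search("not indented")))     # False

print(bool(literal_dot.search("3.14")))  # True
print(bool(literal_dot.search("3x14")))  # False: x is not a period

# The Python string "C:\\temp" contains a single backslash.
print(bool(literal_backslash.search("C:\\temp")))  # True
print(bool(literal_backslash.search("C:/temp")))   # False
```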
Character Classes
In Internet applications, regular expressions are often used to validate user input. After a user submits a form, it is necessary to determine if the input phone number, address, email address, credit card number, etc., is valid. Plain literal characters are not sufficient for this.
Therefore, a more flexible method is needed to describe the patterns we want, and that is where character classes come in. To create a character class representing all vowel characters, place all the vowel characters inside square brackets:
[AaEeIiOoUu]
This pattern matches any vowel, but the whole class stands for exactly one character. A hyphen can be used to denote a range of characters, such as:
[a-z] // Matches all lowercase letters
[A-Z] // Matches all uppercase letters
[a-zA-Z] // Matches all letters
[0-9] // Matches all digits
[0-9\.\-] // Matches all digits, periods, and hyphens
[ \f\r\t\n] // Matches all whitespace characters
Each of these classes also stands for exactly one character, which is crucial. To match a string composed of one lowercase letter followed by one digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:
^[a-z][0-9]$
Although [a-z] covers a range of 26 letters, it matches only one character at a time: here, the first character of the string must be a lowercase letter.
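The point that a character class consumes exactly one character can be seen in a quick experiment. The sketch below uses Python; the constant names are our own and the sample strings are the ones quoted above.

```python
import re

# One lowercase letter followed by one digit, and nothing else.
LETTER_DIGIT = re.compile(r"^[a-z][0-9]$")

for text in ["z2", "t6", "g7", "ab2", "r2d3", "b52"]:
    print(text, bool(LETTER_DIGIT.search(text)))
# z2, t6 and g7 print True; ab2, r2d3 and b52 print False.

# A class matches a single character per match, so findall returns
# the vowels one at a time.
VOWEL = re.compile(r"[AaEeIiOoUu]")
print(VOWEL.findall("Regular Expressions"))
# ['e', 'u', 'a', 'E', 'e', 'i', 'o']
```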
Earlier, ^ was described as denoting the start of a string, but it has a second meaning. When it is the first character inside square brackets, ^ means "not" or "exclude" and is often used to rule out specific characters. Extending the previous example, we can require that the first character not be a digit:
^[^0-9][0-9]$
This pattern matches "&5", "g7", and "-2" but not "12" or "66". Here are some examples of excluding specific characters:
[^a-z] // All characters except lowercase letters
[^\\\/\^] // All characters except (\)(/)(^)
[^\"\'] // All characters except double quotes (") and single quotes (')
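Negated classes can be verified the same way. The following Python sketch reuses the sample strings given above; the `re.sub` call at the end is an extra illustration of ours, not something the original text describes.

```python
import re

# First character must NOT be a digit; second character must be a digit.
NON_DIGIT_THEN_DIGIT = re.compile(r"^[^0-9][0-9]$")

for text in ["&5", "g7", "-2", "12", "66"]:
    print(repr(text), bool(NON_DIGIT_THEN_DIGIT.search(text)))
# '&5', 'g7' and '-2' print True; '12' and '66' print False.

# A negated class is also handy for stripping unwanted characters:
# everything that is not a lowercase letter is removed here.
print(re.sub(r"[^a-z]", "", "Hello, World 42"))  # elloorld
```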
The special character . (dot, period) represents any single character except a line break. Therefore, the pattern ^.5$ matches any two-character string that ends with the digit 5 and starts with any character other than a line break. On its own, the pattern . matches any one character except a line break (typically \n; whether \r is also excluded depends on the engine and its settings), so it never matches an empty string.
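The behaviour of the dot can be checked with a few sample strings. Note that this sketch uses Python's `re` module, which is an assumption about the engine: by default it excludes only \n from what the dot matches, and the `re.DOTALL` flag removes even that exception. Other engines may draw the line differently.

```python
import re

TWO_CHARS_ENDING_IN_5 = re.compile(r"^.5$")

print(bool(TWO_CHARS_ENDING_IN_5.search("a5")))   # True
print(bool(TWO_CHARS_ENDING_IN_5.search("55")))   # True
print(bool(TWO_CHARS_ENDING_IN_5.search("ab5")))  # False: three characters
print(bool(TWO_CHARS_ENDING_IN_5.search("\n5")))  # False: the dot rejects a newline

# With re.DOTALL the dot accepts a newline as well.
print(bool(re.search(r"^.5$", "\n5", re.DOTALL)))  # True
```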
PHP's regular expressions provide some built-in general character classes (written in POSIX bracket notation), listed below:
Character Class | Description |
---|---|
[[:alpha:]] | Any letter |
[[:digit:]] | Any digit |
[[:alnum:]] | Any letter or digit |
[[:space:]] | Any whitespace character |
[[:upper:]] | Any uppercase letter |
[[:lower:]] | Any lowercase letter |
[[:punct:]] | Any punctuation mark |
[[:xdigit:]] | Any hexadecimal digit, equivalent to [0-9a-fA-F] |
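Not every engine understands these bracket classes. Python's built-in `re` module, for example, does not support the `[[:name:]]` notation, so the sketch below spells the classes out by hand. The mapping assumes plain ASCII text (roughly the "C" locale). The dictionary `POSIX_ASCII` and the helper `posix_class` are hypothetical names introduced only for this illustration.

```python
import re
import string

# ASCII-only stand-ins for the named classes (assumption: no locale or
# Unicode awareness is needed).
POSIX_ASCII = {
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
    "lower": "a-z",
    "punct": re.escape(string.punctuation),
    "xdigit": "0-9a-fA-F",
}

def posix_class(name: str) -> str:
    """Build an explicit bracket expression, e.g. [a-zA-Z] for 'alpha'."""
    return "[" + POSIX_ASCII[name] + "]"

# A three-letter word, written with the expanded class.
print(bool(re.fullmatch(posix_class("alpha") + "{3}", "cat")))  # True
print(bool(re.fullmatch(posix_class("alpha") + "{3}", "c4t")))  # False

# Hexadecimal digits only.
print(bool(re.fullmatch(posix_class("xdigit") + "+", "1aF")))   # True
print(bool(re.fullmatch(posix_class("xdigit") + "+", "1g")))    # False
```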
Determining Repetition
So far, you have learned how to match a single letter or digit, but more often, you may need to match a word or a group of digits. A word consists of multiple letters, and a group of digits consists of multiple single digits. The curly braces ({}) following a character or character cluster are used to determine the number of times the preceding content repeats.
Character Cluster | Description |
---|---|
^[a-zA-Z_]$ | A single letter or underscore |
^[[:alpha:]]{3}$ | All 3-letter words |
^a$ | Letter a |
^a{4}$ | aaaa |
^a{2,4}$ | aa, aaa, or aaaa |
^a{1,3}$ | a, aa, or aaa |
^a{2,}$ | Strings consisting of two or more a's and nothing else |
^a{2,} | E.g., aardvark and aaab, but apple does not work |
a{2,} | E.g., baad and aaa, but Nantucket does not work |
\t{2} | Two tab characters |
.{2} | Any two characters |
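Several rows of the table can be confirmed directly. The following is a minimal Python sketch; the helper `found` is a name of our own, and the sample strings come from the table.

```python
import re

def found(pattern: str, text: str) -> bool:
    """True if pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# Exactly four repetitions.
print(found(r"^a{4}$", "aaaa"), found(r"^a{4}$", "aaa"))        # True False
# Between two and four repetitions.
print(found(r"^a{2,4}$", "aaa"), found(r"^a{2,4}$", "aaaaa"))   # True False
# Two or more repetitions, and nothing else in the string.
print(found(r"^a{2,}$", "aaaaaaa"), found(r"^a{2,}$", "a"))     # True False
# Dropping $ constrains only the start of the string.
print(found(r"^a{2,}", "aardvark"), found(r"^a{2,}", "apple"))  # True False
# Dropping both anchors lets the run of a's appear anywhere.
print(found(r"a{2,}", "baad"), found(r"a{2,}", "Nantucket"))    # True False
```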
These examples illustrate three different uses of curly braces. A single number, {x}, means the preceding character or character cluster appears exactly x times; a number followed by a comma, {x,}, means the preceding content appears x or more times; two numbers separated by a comma, {x,y}, mean the preceding content appears at least x times but no more than y times. We can extend the pattern to more words or digits:
^[a-zA-Z0-9_]{1,}$ // Strings containing one or more letters, digits, or underscores
^[1-9][0-9]{0,}$ // All positive integers
^\-{0,1}[0-9]{1,}$ // All integers
^[-]?[0-9]+\.?[0-9]+$ // All floating-point numbers
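The last example is a bit hard to understand, right? Here is a simpler way to look at it: it starts with an optional minus sign ([-]?), followed by one or more digits ([0-9]+), an optional decimal point (\.?), one or more further digits ([0-9]+), and nothing else after that ($). Because digits are required on both sides of the optional decimal point, the pattern needs at least two digits: "3.14" and "42" match, but "5" does not. Below, you will learn about a simpler notation you can use.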
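The four patterns listed above can be exercised with a handful of inputs. This Python sketch copies the patterns verbatim; the constant names, the helper `ok`, and the sample strings are our own additions.

```python
import re

WORD_CHARS = re.compile(r"^[a-zA-Z0-9_]{1,}$")
POSITIVE_INT = re.compile(r"^[1-9][0-9]{0,}$")
INTEGER = re.compile(r"^\-{0,1}[0-9]{1,}$")
FLOAT_LIKE = re.compile(r"^[-]?[0-9]+\.?[0-9]+$")

def ok(regex, text: str) -> bool:
    """True if the compiled pattern matches the text (hypothetical helper)."""
    return regex.search(text) is not None

print(ok(WORD_CHARS, "user_01"), ok(WORD_CHARS, "user 01"))  # True False
print(ok(POSITIVE_INT, "42"), ok(POSITIVE_INT, "042"))       # True False
print(ok(INTEGER, "-17"), ok(INTEGER, "1.5"))                # True False
# The decimal point is optional, but at least two digits are required.
print(ok(FLOAT_LIKE, "-3.14"), ok(FLOAT_LIKE, "42"))         # True True
print(ok(FLOAT_LIKE, "5"), ok(FLOAT_LIKE, "3."))             # False False
```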
The special character ? is equivalent to {0,1}: both mean 0 or 1 occurrences of the preceding content, in other words the preceding content is optional. So the previous example can be rewritten as:
^\-?[0-9]{1,}\.?[0-9]{1,}$
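The special character * is equivalent to {0,}: both mean 0 or more occurrences of the preceding content. Finally, the character + is equivalent to {1,}, meaning 1 or more occurrences of the preceding content. Therefore, the four examples above can be written as follows (the last one also wraps the decimal point and the digits after it in an optional group, so a whole number such as "5" now matches):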
^[a-zA-Z0-9_]+$ // Strings containing one or more letters, digits, or underscores
^[1-9][0-9]*$ // All positive integers
^\-?[0-9]+$ // All integers
^[-]?[0-9]+(\.[0-9]+)?$ // All floating-point numbers
Although this does not technically reduce the complexity of the regular expressions, it makes them easier to read.
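The equivalence between the brace forms and the shorthand quantifiers can be spot-checked by running both versions over the same inputs. The sketch below is in Python; `PAIRS`, `SAMPLES`, and `same_results` are names invented for this illustration, and agreement on a few samples is only a sanity check, not a proof.

```python
import re

# (brace form, shorthand form) for the first three patterns.
PAIRS = [
    (r"^[a-zA-Z0-9_]{1,}$", r"^[a-zA-Z0-9_]+$"),
    (r"^[1-9][0-9]{0,}$", r"^[1-9][0-9]*$"),
    (r"^\-{0,1}[0-9]{1,}$", r"^\-?[0-9]+$"),
]
SAMPLES = ["abc_1", "", "42", "007", "-17", "3.14", "5", "a b"]

def same_results(long_form: str, short_form: str, samples) -> bool:
    """True if both patterns accept and reject exactly the same samples."""
    return all(
        bool(re.search(long_form, s)) == bool(re.search(short_form, s))
        for s in samples
    )

for long_form, short_form in PAIRS:
    print(short_form, same_results(long_form, short_form, SAMPLES))  # True each time

# The floating-point pattern is the exception: the grouped version makes the
# whole fractional part optional, so a single digit is accepted.
OLD_FLOAT = r"^[-]?[0-9]+\.?[0-9]+$"
NEW_FLOAT = r"^[-]?[0-9]+(\.[0-9]+)?$"
print(bool(re.search(OLD_FLOAT, "5")), bool(re.search(NEW_FLOAT, "5")))  # False True
```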