Regular Expressions - Matching Rules

Basic Pattern Matching

Everything starts with the basics. A pattern is the fundamental building block of a regular expression: a sequence of characters that describes a set of strings. Patterns can be simple, consisting of ordinary literal characters, or highly complex, using special characters to represent character ranges, repetition, or context. For example:

^once

This pattern contains a special character ^, indicating that the pattern matches only those strings that start with once. For instance, this pattern matches the string "once upon a time" but does not match "There once was a man from NewYork". Just as the ^ symbol denotes the beginning, the $ symbol is used to match strings that end with a given pattern.

bucket$

This pattern matches "Who kept all of his cash in a bucket" but does not match "buckets". When ^ and $ are used together, they require an exact match: the whole string must be identical to the pattern. For example:

^bucket$

This matches only the string "bucket". If a pattern includes neither ^ nor $, it matches any string that contains the pattern. For example, the pattern:

once

Matches the string:

There once was a man from NewYork
Who kept all of his cash in a bucket.
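As a rough illustration, the anchoring rules above can be checked with Python's standard re module. Python is an assumption made for this sketch, and the helper name matches is hypothetical; the patterns themselves are taken from the text. re.search looks for a match anywhere in the string, so the anchors do all of the positional work.

```python
import re

def matches(pattern, text):
    """Return True if the pattern matches somewhere in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# ^ anchors the pattern to the start of the string.
print(matches(r"^once", "once upon a time"))                   # True
print(matches(r"^once", "There once was a man from NewYork"))  # False

# $ anchors the pattern to the end of the string.
print(matches(r"bucket$", "Who kept all of his cash in a bucket"))  # True
print(matches(r"bucket$", "buckets"))                               # False

# Both anchors together: of these two inputs, only "bucket" itself matches.
print(matches(r"^bucket$", "bucket"))   # True
print(matches(r"^bucket$", "buckets"))  # False

# Without anchors, the pattern may match anywhere in the string.
print(matches(r"once", "There once was a man from NewYork"))  # True
```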

The letters (o-n-c-e) in this pattern are literal characters, meaning they represent themselves, as do digits. Some other characters, such as certain punctuation marks and whitespace characters like tabs, are written with escape sequences. All escape sequences begin with a backslash \. The escape sequence for a tab is \t. So to check whether a string starts with a tab, we can use this pattern:

^\t

Similarly, \n represents a new line, and \r represents a carriage return. Characters that have a special meaning in a pattern can be matched literally by prefixing them with a backslash: the backslash itself is written \\, the period is written \., and so on.
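A minimal sketch of these escape sequences, assuming Python's re module (the variable name starts_with_tab is made up for the example). The patterns are written as raw strings so that the backslashes reach the regex engine unchanged.

```python
import re

# ^\t: the string must start with a tab character.
starts_with_tab = re.compile(r"^\t")
print(bool(starts_with_tab.search("\tindented line")))  # True
print(bool(starts_with_tab.search("plain line")))       # False

# An unescaped dot matches any character; an escaped dot matches only ".".
print(bool(re.search(r"a.c", "abc")))   # True
print(bool(re.search(r"a\.c", "abc")))  # False
print(bool(re.search(r"a\.c", "a.c")))  # True

# A literal backslash is written as \\ in the pattern.
print(bool(re.search(r"\\", "C:\\temp")))  # True
```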

Character Classes

In Internet applications, regular expressions are often used to validate user input. After a user submits a form, it is necessary to determine if the input phone numbers, addresses, email addresses, credit card numbers, etc., are valid, which cannot be achieved with plain literal characters.

Therefore, a more flexible method is needed to describe the patterns we want, and that is where character classes come in. To create a character class representing all vowel characters, place all the vowel characters within square brackets:

[AaEeIiOoUu]

This pattern matches any single vowel: a character class always stands for exactly one character. A hyphen can be used to denote a range of characters, such as:

[a-z] // Matches all lowercase letters
[A-Z] // Matches all uppercase letters
[a-zA-Z] // Matches all letters
[0-9] // Matches all digits
[0-9\.\-] // Matches all digits, periods, and hyphens
[ \f\r\t\n] // Matches all whitespace characters

Similarly, these represent only one character, which is crucial. To match a string composed of a lowercase letter followed by a digit, such as "z2", "t6", or "g7", but not "ab2", "r2d3", or "b52", use this pattern:

^[a-z][0-9]$

Although [a-z] covers all 26 lowercase letters, it stands for just one character here: the first character of the string must be a lowercase letter.
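The letter-then-digit pattern can be tried against the strings listed above. This is only a sketch in Python's re module, and the name letter_digit is introduced here for illustration.

```python
import re

# One lowercase letter followed by one digit, and nothing else.
letter_digit = re.compile(r"^[a-z][0-9]$")

for text in ["z2", "t6", "g7", "ab2", "r2d3", "b52"]:
    print(text, bool(letter_digit.search(text)))
# z2, t6 and g7 print True; ab2, r2d3 and b52 print False
```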

Earlier, it was mentioned that ^ denotes the start of a string, but it has another meaning. When used within square brackets, ^ signifies "not" or "exclude", often used to exclude specific characters. Using the previous example, we require the first character not to be a digit:

^[^0-9][0-9]$

This pattern matches "&5", "g7", and "-2" but does not match "12" or "66". Here are a few examples of excluding specific characters:

[^a-z] // All characters except lowercase letters
[^\\\/\^] // All characters except (\)(/)(^)
[^\"\'] // All characters except double quotes (") and single quotes (')
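A short sketch of negated character classes, again assuming Python's re module; the variable name is hypothetical and the test strings are the ones mentioned above.

```python
import re

# First character: anything except a digit; second character: a digit.
non_digit_then_digit = re.compile(r"^[^0-9][0-9]$")

for text in ["&5", "g7", "-2", "12", "66"]:
    print(text, bool(non_digit_then_digit.search(text)))
# &5, g7 and -2 print True; 12 and 66 print False

# A negated class can also collect every character outside a range,
# here everything that is not a lowercase letter.
print(re.findall(r"[^a-z]", "abc-DEF 1"))  # ['-', 'D', 'E', 'F', ' ', '1']
```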

The special character . (dot, period) in regular expressions represents any single character except a line break. Therefore, the pattern ^.5$ matches any two-character string that ends with the digit 5 and starts with any character other than a line break. On its own, . matches any one character other than a line break; exactly which line-break characters are excluded (\n only, or also \r) depends on the regex engine and its options.
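The behavior of the dot can be sketched as follows, assuming Python's re module with default flags, where . matches everything except \n. The variable name is made up for the example.

```python
import re

# Any one character (except a newline) followed by the digit 5, and nothing else.
two_chars_ending_in_5 = re.compile(r"^.5$")

print(bool(two_chars_ending_in_5.search("a5")))    # True
print(bool(two_chars_ending_in_5.search("55")))    # True
print(bool(two_chars_ending_in_5.search("5")))     # False: only one character
print(bool(two_chars_ending_in_5.search("\n5")))   # False: . does not match a newline
print(bool(two_chars_ending_in_5.search("abc5")))  # False: too many characters
```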

PHP's regular expressions include some built-in generic character classes, listed as follows:

Character Class	Description
[[:alpha:]]	Any letter
[[:digit:]]	Any digit
[[:alnum:]]	Any letter or digit
[[:space:]]	Any whitespace character
[[:upper:]]	Any uppercase letter
[[:lower:]]	Any lowercase letter
[[:punct:]]	Any punctuation mark

Determining Repetitions

So far, you've learned how to match a single letter or digit, but more often, you might need to match a word or a group of digits. A word consists of multiple letters, and a group of digits consists of multiple single digits. The curly braces ({}) following a character or character class are used to specify the number of times the preceding element can repeat.

Pattern	Description
^[a-zA-Z_]$	A single letter or underscore
^[[:alpha:]]{3}$	Any three-letter word
^a$	The letter a
^a{4}$	aaaa
^a{2,4}$	aa, aaa, or aaaa
^a{1,3}$	a, aa, or aaa
^a{2,}$	Strings of two or more a's
^a{2,}	Strings starting with two or more a's, such as aardvark and aaab, but not apple
a{2,}	Strings containing two or more consecutive a's, such as baad and aaa, but not Nantucket
\t{2}	Two tab characters
.{2}	Any two characters

These examples illustrate three different uses of curly braces. A single number {x} means the preceding character or character class appears exactly x times; a number followed by a comma {x,} means the preceding element appears x or more times; two numbers separated by a comma {x,y} indicate the preceding element appears at least x times but no more than y times. We can extend this pattern to more words or digits:

^[a-zA-Z0-9_]{1,}$      // All strings consisting of one or more letters, digits, or underscores
^[1-9][0-9]{0,}$        // All positive integers
^\-{0,1}[0-9]{1,}$      // All integers
^[-]?[0-9]+\.?[0-9]+$   // Numbers with an optional decimal point (at least two digits)

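The last example might look confusing, so read it piece by piece: it starts with an optional minus sign ([-]?), followed by one or more digits ([0-9]+), an optional decimal point (\.?), and one or more further digits ([0-9]+), with nothing following ($). Because both digit groups are required, the pattern needs at least two digits: it matches "12" and "-3.14" but not "5". The shorthand characters ? and + used here are explained below.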
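To see what the number pattern actually accepts, it can be run against a few inputs. This is a sketch assuming Python's re module; the sample strings and the name number are chosen here and are not part of the original text.

```python
import re

# The pattern from the text, copied verbatim.
number = re.compile(r"^[-]?[0-9]+\.?[0-9]+$")

for text in ["3.14", "-0.5", "12", "5", "3.", ".5", "abc"]:
    print(repr(text), bool(number.search(text)))
# '3.14', '-0.5' and '12' print True; '5', '3.', '.5' and 'abc' print False
```

Note that "5" is rejected: each [0-9]+ needs at least one digit of its own, so the shortest accepted input has two digits.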

The special character ? is equivalent to {0,1}: both mean zero or one occurrence of the preceding element, that is, the preceding element is optional. The number pattern above can therefore also be written with explicit braces for the repetitions:

^\-?[0-9]{1,}\.?[0-9]{1,}$

The special character * is equivalent to {0,}: both mean zero or more occurrences of the preceding element. Lastly, + is equivalent to {1,}: one or more occurrences of the preceding element. The four examples above can therefore be written as follows (the last one is also tightened so that the decimal point and the digits after it are optional as a unit):

^[a-zA-Z0-9_]+$      // All strings consisting of one or more letters, digits, or underscores
^[1-9][0-9]*$        // All positive integers
^\-?[0-9]+$          // All integers
^[-]?[0-9]+(\.[0-9]+)?$ // All floating-point numbers (the fractional part is optional)

This doesn't technically reduce the complexity of the regular expressions, but it makes them easier to read.
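The equivalence between the brace forms and the shorthand forms can be spot-checked with a small script. This is a sketch assuming Python's re module; the sample strings are arbitrary choices, and agreement on them illustrates the equivalence rather than proving it.

```python
import re

# Each pair is (brace form, shorthand form), taken from the examples above.
pairs = [
    (r"^[a-zA-Z0-9_]{1,}$", r"^[a-zA-Z0-9_]+$"),
    (r"^[1-9][0-9]{0,}$", r"^[1-9][0-9]*$"),
    (r"^\-{0,1}[0-9]{1,}$", r"^\-?[0-9]+$"),
]
samples = ["abc_123", "42", "-42", "007", "0", "", "4.2", "a b"]

for braces, shorthand in pairs:
    for text in samples:
        same = bool(re.search(braces, text)) == bool(re.search(shorthand, text))
        assert same, (braces, shorthand, text)
print("all pairs agree on all samples")

# A bounded repetition: between two and four a's, and nothing else.
bounded = re.compile(r"^a{2,4}$")
print([s for s in ["a", "aa", "aaa", "aaaa", "aaaaa"] if bounded.search(s)])
# ['aa', 'aaa', 'aaaa']
```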
