Lookahead and Lookbehind Assertions in Regular Expressions
There are four forms of lookahead and lookbehind assertions in regular expressions:
(?=pattern): zero-width positive lookahead assertion
(?!pattern): zero-width negative lookahead assertion
(?<=pattern): zero-width positive lookbehind assertion
(?<!pattern): zero-width negative lookbehind assertion
Here, pattern is a regular expression.
Similar to how ^ represents the beginning, $ represents the end, and \b represents word boundaries, lookahead and lookbehind assertions match certain positions without consuming characters, hence they are called "zero-width". A position is the spot to the left of the first character, to the right of the last character, or between two adjacent characters in a string (assuming left-to-right text direction).
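As a minimal sketch of what "zero-width" means, the following Python snippet (using the standard re module, chosen here only for illustration) lists the positions matched by \b and by a lookahead. Every match is an empty string, so its start and end offsets are equal.

```python
import re

text = "a regular expression"

# \b matches positions, not characters: each match is an empty string.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)

# A lookahead behaves the same way: (?=re) matches the empty position
# just before each "re", so start == end for every match.
spans = [m.span() for m in re.finditer(r"(?=re)", text)]
print(spans)
```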
Below are examples illustrating the meaning of these four assertions.
(?=pattern) Positive Lookahead Assertion
Represents a position in the string where the characters following this position match the pattern.
For example, in the string "a regular expression", to match the "re" in "regular" but not the one in "expression", you can use re(?=gular). This expression restricts the match to a "re" that is immediately followed by "gular", without consuming the characters "gular".
By contrast, re(?=gular). matches "reg": the lookahead consumes nothing, so the metacharacter . (which matches any character) consumes the "g".
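The positive lookahead example can be tried out with a short Python sketch (the re module is used here purely as a convenient test bed; the variable names are illustrative):

```python
import re

text = "a regular expression"

# Only the "re" in "regular" is followed by "gular".
m = re.search(r"re(?=gular)", text)
matched, span = m.group(), m.span()
print(matched, span)          # the match is just "re"; "gular" is not consumed

# Adding a trailing . consumes the next character, which is "g".
with_dot = re.search(r"re(?=gular).", text).group()
print(with_dot)

# The "re" in "expression" is never matched.
starts = [m.start() for m in re.finditer(r"re(?=gular)", text)]
print(starts)
```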
(?!pattern) Negative Lookahead Assertion
Represents a position in the string where the characters following this position do not match the pattern.
For example, in the string "regex represents regular expression", to match "re" that is not followed by "g", you can use re(?!g). This expression limits the position to the right of "re" where "g" does not follow.
The difference between positive and negative assertions lies in whether the characters following the position match the pattern in the parentheses.
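A small Python sketch makes the contrast concrete by listing where each form matches in the example string (offsets are zero-based; the variable names are illustrative):

```python
import re

text = "regex represents regular expression"

# "re" NOT followed by "g": both occurrences in "represents" and the one in "expression".
not_before_g = [m.start() for m in re.finditer(r"re(?!g)", text)]
print(not_before_g)

# "re" followed by "g": the occurrences in "regex" and "regular".
before_g = [m.start() for m in re.finditer(r"re(?=g)", text)]
print(before_g)
```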
(?<=pattern) Positive Lookbehind Assertion
Represents a position in the string where the characters preceding this position match the pattern.
For example, in the string "regex represents regular expression", to match "re" within words but not at the beginning of words, you can use (?<=\w)re. The "re" within words should have a word character before it.
The term "lookbehind" is used because the regex engine, while scanning characters from left to right, needs to check the characters it has already scanned when encountering this assertion, moving backward relative to the scanning direction.
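The positive lookbehind example can be checked with a brief Python sketch (again using the re module only as an illustration):

```python
import re

text = "regex represents regular expression"

# "re" preceded by a word character: the second "re" in "represents"
# and the "re" inside "expression".
inside_words = [m.start() for m in re.finditer(r"(?<=\w)re", text)]
print(inside_words)

# The lookbehind consumes nothing, so each match is only "re".
matches = re.findall(r"(?<=\w)re", text)
print(matches)
```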
(?<!pattern) Negative Lookbehind Assertion
Represents a position in the string where the characters preceding this position do not match the pattern.
For example, in the string "regex represents regular expression", to match "re" at the beginning of words, you can use (?<!\w)re. The "re" at the beginning of words, in this case, means "re" not within words, i.e., "re" not preceded by a word character. Alternatively, you can use \bre to match.
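As a quick check, this Python sketch compares the negative lookbehind with the \b alternative on the example string; for this input the two find the same positions.

```python
import re

text = "regex represents regular expression"

# "re" not preceded by a word character, i.e. at the start of a word.
word_initial = [m.start() for m in re.finditer(r"(?<!\w)re", text)]
print(word_initial)

# \bre finds the same positions here, because "r" is itself a word character.
with_boundary = [m.start() for m in re.finditer(r"\bre", text)]
print(with_boundary)
```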
Understanding these four assertions can be approached from two perspectives:
1. Lookahead and Lookbehind: The regex engine scans characters from left to right, with a hypothetical pointer moving along the text. Lookahead assertions attempt to match characters ahead of the pointer, hence the term "lookahead". Lookbehind assertions attempt to match characters behind the pointer, hence the term "lookbehind".
2. Positive and Negative: A positive assertion succeeds when the pattern in the parentheses matches at that position, while a negative assertion succeeds when it does not.
To remember these four assertion forms:
1. Lookahead and Lookbehind: In lookbehind assertions (?<=pattern) and (?<!pattern), the less-than symbol acts as an arrow pointing backward, which aligns with common text direction. Removing the less-than symbol converts it to a lookahead assertion.
2. Positive and Negative: The ! symbol is used in != (not equal) and logical NOT (!), so forms with ! denote negative or non-matching assertions. Replacing ! with = denotes positive or matching assertions.
Often, regular expressions are used to check if a string contains a certain substring. To express that a string does not contain any of a set of characters, a negated character class [^...] can be used. However, to express that a string does not contain a specific substring, lookahead or lookbehind assertions, or both, are needed.
For example, to check if a sentence contains "this" but not "that":
Including "this" is straightforward. To exclude "that", you can ensure that no character is preceded or followed by "that". The following regular expressions can be used:
^((?<!that).)*this((?<!that).)*$
or
^(.(?!that))*this(.(?!that))*$
Both expressions will match "this is tutorialpro test" successfully and fail for "this and that is tutorialpro test".
Under normal circumstances, these expressions are sufficient. However, they break down in edge cases, such as sentences that start or end with "that", or in which "that" and "this" are adjacent. Examples are "tutorialpro thatthis is the case" and "this is the case, not that".
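The behaviour of these two expressions can be explored with the Python sketch below. The names lookbehind_first and lookahead_last are illustrative, and the sentence "that is not this" is a made-up extra test case, not one from the original examples. Each expression only checks one side of every character, so a "that" at the very edge of the region it scans can slip through.

```python
import re

# The two expressions from the text above.
lookbehind_first = re.compile(r"^((?<!that).)*this((?<!that).)*$")
lookahead_last = re.compile(r"^(.(?!that))*this(.(?!that))*$")

def accepts(regex, sentence):
    """Return True if the regex matches the whole sentence."""
    return regex.search(sentence) is not None

# Ordinary sentences behave as intended for both expressions.
ok_1 = accepts(lookbehind_first, "this is tutorialpro test")
ok_2 = accepts(lookahead_last, "this is tutorialpro test")
rejected_1 = not accepts(lookbehind_first, "this and that is tutorialpro test")
rejected_2 = not accepts(lookahead_last, "this and that is tutorialpro test")

# Edge cases: these sentences contain "that" but are still accepted.
# (?<!that). never checks the text right before "this" or at the very end.
wrong_adjacent = accepts(lookbehind_first, "tutorialpro thatthis is the case")
wrong_at_end = accepts(lookbehind_first, "this is the case, not that")
# .(?!that) never checks the very first position of the string.
wrong_at_start = accepts(lookahead_last, "that is not this")

print(ok_1, ok_2, rejected_1, rejected_2)
print(wrong_adjacent, wrong_at_end, wrong_at_start)
```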
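The fix is to place each assertion so that it covers every position: a lookbehind goes after the . (checking the text that ends at the character just consumed), and a lookahead goes before the . (checking the text that starts at the character about to be consumed). This gives the following expressions: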
^(.(?<!that))*this(.(?<!that))*$
^(.(?<!that))*this((?!that).)*$
^((?!that).)*this(.(?<!that))*$
^((?!that).)*this((?!that).)*$
These four regular expressions handle the edge cases above correctly.
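As a sanity check, the following Python sketch runs the four expressions against the sentences discussed above and compares each result with a plain substring test. The helper name plain_check and the extra sentences "that is not this" and "thisthat" are assumptions added for illustration.

```python
import re

# The four expressions from the text above.
patterns = [
    r"^(.(?<!that))*this(.(?<!that))*$",
    r"^(.(?<!that))*this((?!that).)*$",
    r"^((?!that).)*this(.(?<!that))*$",
    r"^((?!that).)*this((?!that).)*$",
]

sentences = [
    "this is tutorialpro test",
    "this and that is tutorialpro test",
    "tutorialpro thatthis is the case",
    "this is the case, not that",
    "that is not this",
    "thisthat",
]

def plain_check(sentence):
    """Reference answer using ordinary substring tests."""
    return "this" in sentence and "that" not in sentence

# For each sentence, record what every pattern says and what the reference says.
results = {
    s: ([re.search(p, s) is not None for p in patterns], plain_check(s))
    for s in sentences
}

for sentence, (regex_answers, expected) in results.items():
    print(f"{sentence!r}: regex={regex_answers} expected={expected}")
```

On these sentences, all four expressions agree with the substring test: only the first sentence is accepted.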
The pattern in parentheses for these four assertions is itself a regular expression. However, there are restrictions for lookbehind assertions in Perl and Python: the expression must be fixed-length, meaning quantifiers like *, +, and ? cannot be used. For example, (?<=abc) is valid, but (?<=a*bc) is not supported. This is because the engine cannot determine how far to step back when checking a lookbehind assertion. Java supports ?, {m}, and {n,m}, but not the unbounded quantifiers * and +. JavaScript did not support lookbehind assertions before ECMAScript 2018; engines that implement ES2018 or later do support them.
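For Python specifically, the fixed-length restriction can be observed with the standard re module, which rejects a variable-length lookbehind when the pattern is compiled. The sketch below is illustrative only; other engines, and third-party Python regex libraries, may behave differently.

```python
import re

# A fixed-length lookbehind compiles and works as expected.
fixed = re.compile(r"(?<=abc)d")
found = fixed.search("abcd").group()
print(found)

# A variable-length lookbehind is rejected at compile time by the re module.
try:
    re.compile(r"(?<=a*bc)d")
    variable_compiles = True
except re.error as exc:
    variable_compiles = False
    print("rejected:", exc)
```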
Lookahead and lookbehind assertions are somewhat analogous to using if statements to validate characters before and after a match.
Usage of ?=, ?<=, ?!, and ?<!
exp1(?=exp2): Find exp1 followed by exp2.
(?<=exp2)exp1: Find exp1 preceded by exp2.
exp1(?!exp2): Find exp1 not followed by exp2.
(?<!exp2)exp1: Find exp1 not preceded by exp2.
Reference link: https://blog.51cto.com/cnn237111/749047