Easy Tutorial

Regular Expressions - Syntax

A regular expression (regex) describes a pattern for matching strings, which can be used to check if a string contains a certain substring, replace matching substrings, or extract substrings that meet certain criteria from a string.
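
As a minimal sketch of those three uses in JavaScript (the sample string and the variable names are illustrative assumptions, not part of the original tutorial):

```javascript
// Illustrative sample text; any string works.
const text = "Order 66 shipped, order 7 pending";

// 1. Check whether the string contains a run of digits.
const hasDigits = /[0-9]+/.test(text);

// 2. Replace every run of digits with a placeholder.
const masked = text.replace(/[0-9]+/g, "#");

// 3. Extract every run of digits.
const numbers = text.match(/[0-9]+/g);

console.log(hasDigits); // true
console.log(masked);    // Order # shipped, order # pending
console.log(numbers);   // [ '66', '7' ]
```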

Constructing regular expressions is similar to creating mathematical expressions. By combining various metacharacters and operators, small expressions can be combined to create larger ones. Components of a regular expression can be individual characters, character sets, character ranges, character choices, or any combination of these components.

A regular expression is a textual pattern composed of ordinary characters (such as letters from a to z) and special characters (known as "metacharacters"). The pattern describes one or more strings to be matched when searching through text. A regular expression serves as a template to match a character pattern against the string being searched.


Ordinary Characters

Ordinary characters include all printable and non-printable characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols.

Character Description
[ABC] Matches any single character listed inside the brackets. For example, [aeiou] matches every vowel (a, e, i, o, u) in the string "google tutorialpro taobao".
[^ABC] Matches any single character not listed inside the brackets. For example, [^aeiou] matches every character in the string "google tutorialpro taobao" except the vowels.
[A-Z] Indicates a range: [A-Z] matches any uppercase letter, and [a-z] matches any lowercase letter.
. Matches any single character except a line break (\n or \r); equivalent to [^\n\r].
[\s\S] Matches any character, line breaks included. \s matches any whitespace character (including line breaks), and \S matches any non-whitespace character.
\w Matches a letter, digit, or underscore. Equivalent to [A-Za-z0-9_].
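
The rows above can be tried out with a short JavaScript sketch. The variable names and the extra sample strings ("Hello World", "a1_b-2") are assumptions added for illustration:

```javascript
const text = "google tutorialpro taobao";

const vowels = text.match(/[aeiou]/g);       // characters inside the set
const others = text.match(/[^aeiou]/g);      // everything else, spaces included
const upper = "Hello World".match(/[A-Z]/g); // uppercase range
const wordChars = "a1_b-2".match(/\w/g);     // the hyphen is not a word character
const dotOnNewline = /./.test("\n");         // . does not match a line break

console.log(vowels.length, others.length);   // 12 13
console.log(upper);                          // [ 'H', 'W' ]
console.log(wordChars.join(""));             // a1_b2
console.log(dotOnNewline);                   // false
```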

Non-Printable Characters

Non-printable characters can also be part of a regular expression. The following table lists the escape sequences for non-printable characters:

Character Description
\cx Matches the control character specified by x. For example, \cM matches a Control-M or carriage return. The value of x must be A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f Matches a form feed. Equivalent to \x0c and \cL.
\n Matches a newline. Equivalent to \x0a and \cJ.
\r Matches a carriage return. Equivalent to \x0d and \cM.
\s Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v]. Note that Unicode regex will match full-width space characters.
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab. Equivalent to \x09 and \cI.
\v Matches a vertical tab. Equivalent to \x0b and \cK.
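
A small JavaScript sketch of the equivalences listed in the table; the sample strings and variable names are assumptions chosen for illustration:

```javascript
// \s splits on any whitespace: tab, newline, and space alike.
const fields = "a\tb\nc d".split(/\s/);

// Control-character and hexadecimal escapes name the same characters.
const crViaControl = /\cM/.test("\r");
const lfViaHex = /\x0a/.test("\n");
const tabViaControl = /\cI/.test("\t");

// \S is the complement of \s.
const visible = " a\tb ".match(/\S/g);

console.log(fields);                                // [ 'a', 'b', 'c', 'd' ]
console.log(crViaControl, lfViaHex, tabViaControl); // true true true
console.log(visible);                               // [ 'a', 'b' ]
```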

Special Characters

Special characters are characters with a special meaning, such as the * in runoo*b, which means that the preceding o may occur zero or more times. To search for a literal * in a string, escape it by placing a backslash (\) before it: runo\*ob matches the string runo*ob. The same applies to the other metacharacters: to match one literally, you must first "escape" it by placing \ in front of it. The table below lists the special characters in regular expressions:

Special Character Description
$ Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( ) Marks the start and end positions of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
* Matches the preceding subexpression zero or more times. To match the * character, use \*.
+ Matches the preceding subexpression one or more times. To match the + character, use \+.
. Matches any single character except the newline character \n. To match a ., use \.
[ Marks the start of a bracket expression. To match [, use \[.
? Matches the preceding subexpression zero or one time, or indicates a non-greedy quantifier. To match the ? character, use \?.
\ Marks the next character as either a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches the newline character. The sequence '\\' matches "\", and '\(' matches "(".
^ Matches the start position of the input string, unless used in a bracket expression, where it indicates the negation of the character set. To match the ^ character itself, use \^.
{ Marks the start of a quantifier expression. To match {, use \{.
| Indicates a choice between two items. To match |, use \|.
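
The effect of escaping can be sketched in JavaScript as follows. The escapeRegExp helper is a hypothetical utility written for this example (it is not a built-in function), and the price string is an invented sample:

```javascript
// Unescaped, * is a quantifier: "runo", then zero or more "o", then "b".
const quantified = /runoo*b/;
console.log(quantified.test("runob"));    // true
console.log(quantified.test("runo*ob"));  // false

// Escaped, \* matches a literal asterisk.
const literal = /runo\*ob/;
console.log(literal.test("runo*ob"));     // true

// Hypothetical helper: put a backslash before every special character
// so that arbitrary text can be matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const escaped = escapeRegExp("$5.00 (approx.)");
const price = new RegExp(escaped);
console.log(escaped);                               // \$5\.00 \(approx\.\)
console.log(price.test("Total: $5.00 (approx.)"));  // true
```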

Quantifiers

Quantifiers specify how many instances of a given component must appear for a match to occur. There are six types: *, +, ?, {n}, {n,}, and {n,m}.

Quantifiers in regular expressions include:

Character Description
* Matches the preceding subexpression zero or more times. For example, zo* can match "z" and "zoo". * is equivalent to {0,}.
+ Matches the preceding subexpression one or more times. For example, zo+ can match "zo" and "zoo", but not "z". + is equivalent to {1,}.
? Matches the preceding subexpression zero or one time. For example, do(es)? can match "do", "does", and the "do" in "doxy". ? is equivalent to {0,1}.
{n} n is a non-negative integer. Matches exactly n times. For example, o{2} cannot match the single "o" in "Bob" but matches the two o's in "food".
{n,} n is a non-negative integer. Matches at least n times. For example, o{2,} cannot match the single "o" in "Bob" but matches all the o's in "foooood". o{1,} is equivalent to o+. o{0,} is equivalent to o*.
{n,m} n and m are non-negative integers, where n <= m. Matches at least n times and at most m times. For example, o{1,3} will match the first three o's in "fooooood". o{0,1} is equivalent to o?. Note that there should be no spaces between the comma and the numbers.
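
The table's examples can be checked with a short JavaScript sketch. The first helper is an assumption introduced only to keep the example compact:

```javascript
// Helper for this sketch: return the first match, or null when there is none.
const first = (pattern, text) => {
  const m = text.match(pattern);
  return m ? m[0] : null;
};

console.log(first(/zo*/, "z"));            // z
console.log(first(/zo*/, "zoo"));          // zoo
console.log(first(/zo+/, "z"));            // null
console.log(first(/do(es)?/, "does"));     // does
console.log(first(/do(es)?/, "doxy"));     // do
console.log(first(/o{2}/, "Bob"));         // null
console.log(first(/o{2,}/, "foooood"));    // the whole run of o's
console.log(first(/o{1,3}/, "fooooood"));  // ooo
```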

The following regular expression matches a positive integer: [1-9] requires the first digit to be non-zero, and [0-9]* allows any number of further digits:

/[1-9][0-9]*/

Note that the quantifier appears after the range expression. Therefore, it applies to the entire range expression, in this case, specifying digits from 0 to 9 (inclusive).

The + quantifier is not used here because a digit is not necessarily required in the second or subsequent positions. The ? character is also not used because it would limit the integer to only two digits.

To match a one- or two-digit number from 0 to 99, you can use the following expression, which specifies at least one but no more than two digits:

/[0-9]{1,2}/

This expression has two drawbacks: in a number larger than 99 it still matches, but only the first two digits, and it also accepts 0, 00, and 01. To exclude a leading zero, so that only 1 through 99 match, use either of the following:

/[1-9][0-9]?/

or

/[1-9][0-9]{0,1}/
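
A brief JavaScript sketch of how these number patterns behave. The constant names and test strings are assumptions, and the final anchored variant is added purely for illustration (anchors are covered in a later section):

```javascript
const positiveInt = /[1-9][0-9]*/;
const upToTwoDigits = /[0-9]{1,2}/;
const oneTo99 = /[1-9][0-9]?/;

console.log("2024".match(positiveInt)[0]);   // 2024
console.log("007".match(positiveInt)[0]);    // 7
console.log("100".match(upToTwoDigits)[0]);  // 10
console.log("00".match(upToTwoDigits)[0]);   // 00
console.log("0".match(oneTo99));             // null
console.log("42".match(oneTo99)[0]);         // 42

// Assumption for illustration: anchoring with ^ and $ rejects longer
// numbers instead of matching only their first two digits.
const strictOneTo99 = /^[1-9][0-9]?$/;
console.log(strictOneTo99.test("100"));      // false
```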

The * and + quantifiers are greedy, meaning they match as much text as possible. Adding a ? after them makes them non-greedy or minimal matches.

For example, you might search an HTML document to find content within <h1> tags. Given the following HTML code:

<h1>tutorialpro-tutorialpro.org</h1>

Greedy: The following expression matches the entire string, from the first less-than symbol (<) to the last greater-than symbol (>), which closes the </h1> tag:

/<.*>/

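Non-greedy: If you only need to match the individual tags, the following non-greedy expression stops at the first greater-than symbol (>) and therefore matches <h1> (in a global search, </h1> is found as a separate match):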

/<.*?>/

You can also match the opening <h1> tag with the following regular expression, which allows only word characters between < and >:

/<\w+?>/

By placing a ? after the quantifiers *, +, or ?, the expression converts from a "greedy" to a "non-greedy" or minimal match.
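
The difference between greedy and non-greedy matching on the HTML line above can be sketched in JavaScript (variable names are illustrative assumptions):

```javascript
const html = "<h1>tutorialpro-tutorialpro.org</h1>";

const greedy = html.match(/<.*>/)[0];       // runs to the last ">"
const lazy = html.match(/<.*?>/)[0];        // stops at the first ">"
const allTags = html.match(/<.*?>/g);       // each tag as a separate match
const openingOnly = html.match(/<\w+?>/g);  // "/" is not a word character

console.log(greedy === html);  // true
console.log(lazy);             // <h1>
console.log(allTags);          // [ '<h1>', '</h1>' ]
console.log(openingOnly);      // [ '<h1>' ]
```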


Anchors

Anchors allow you to anchor the regular expression to the start or end of a line. They also enable you to create regular expressions that appear within a word, at the beginning of a word, or at the end of a word.

Anchors describe the boundaries of a string or word, where ^ and $ refer to the start and end of a string, respectively, and \b describes the boundary before or after a word, while \B denotes a non-word boundary.

The anchors for regular expressions are:

Character Description
^ Matches the position at the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position after \n or \r.
$ Matches the position at the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r.
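\b Matches a word boundary, which is the position between a word character (\w) and a non-word character such as a space, or between a word character and the start or end of the string.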
\B Matches a non-word boundary.

Note: Quantifiers cannot be used with anchors. Since there cannot be more than one position immediately before or after a newline or word boundary, expressions like ^* are not allowed.

To match text at the beginning of a line, use the ^ character at the start of the regular expression. Do not confuse this usage with its use inside square bracket expressions.

To match text at the end of a line, use the $ character at the end of the regular expression.

To use anchors when searching for chapter headings, the following regular expression matches a chapter heading that appears at the start of a line and has a one- or two-digit chapter number:

/^Chapter [1-9][0-9]{0,1}/

A true chapter heading not only appears at the start of a line but is also the only text on that line. It appears at both the start and end of the line. The following expression ensures that the match is only for chapters and not cross-references by creating a regular expression that matches the start and end of a line of text.

/^Chapter [1-9][0-9]{0,1}$/
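
A minimal JavaScript sketch of the two anchored chapter-heading patterns, using the quantifier [0-9]{0,1} for the optional second digit. The sample lines and the multi-line page string are assumptions made for illustration:

```javascript
const startOnly = /^Chapter [1-9][0-9]{0,1}/;
const wholeLine = /^Chapter [1-9][0-9]{0,1}$/;

console.log(startOnly.test("Chapter 12 Introduction")); // true
console.log(startOnly.test("See Chapter 12"));          // false
console.log(wholeLine.test("Chapter 12"));              // true
console.log(wholeLine.test("Chapter 12 Introduction")); // false
console.log(wholeLine.test("Chapter 123"));             // false

// With the m (multiline) flag, ^ and $ also match at line breaks,
// so a heading on its own line is found inside a longer text.
const page = "Contents\nChapter 7\nSee Chapter 7 for details";
const headings = page.match(/^Chapter [1-9][0-9]{0,1}$/gm);
console.log(headings);                                  // [ 'Chapter 7' ]
```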

Matching word boundaries is different but adds significant capability to regular expressions. A word boundary is the position between a word and a space (more generally, between a word character and a non-word character). A non-word boundary is any other position. The following expression matches the first three characters of the word "Chapter" because they appear after a word boundary:

/\bCha/

The position of the \b character is crucial. If it is at the start of the string to be matched, it looks for matches at the beginning of a word. If it is at the end of the string, it looks for matches at the end of a word. For example, the following expression matches the string "ter" in the word "Chapter" because it appears before a word boundary:

/ter\b/

The following expression matches the string "apt" in "Chapter" but not in "aptitude":

/\Bapt/

The string "apt" appears at a non-word boundary in "Chapter" but at a word boundary in "aptitude". The \B non-word boundary operator does not match at the start or end of a word. The following expression therefore does not match "Cha" in "Chapter":

/\BCha/
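
The word-boundary examples above can be verified with a short JavaScript sketch. The results object and the final whole-word search are assumptions added for illustration:

```javascript
const results = {
  startOfWord: /\bCha/.test("Chapter"),      // "Cha" follows a word boundary
  endOfWord: /ter\b/.test("Chapter"),        // "ter" precedes a word boundary
  insideChapter: /\Bapt/.test("Chapter"),    // "apt" sits inside the word
  startOfAptitude: /\Bapt/.test("aptitude"), // "apt" starts the word
  nonBoundaryCha: /\BCha/.test("Chapter"),   // "Cha" starts the word
};
console.log(results);
// startOfWord: true, endOfWord: true, insideChapter: true,
// startOfAptitude: false, nonBoundaryCha: false

// Wrapping a word in \b on both sides finds it only as a whole word.
const wholeWord = "cat concatenate cat.".match(/\bcat\b/g);
console.log(wholeWord.length); // 2
```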

Alternation

Enclose all alternative options in parentheses () and separate them with |.

() denotes a capturing group, which saves the matched value within the group. Multiple matched values can be accessed via a number n (where n is the number of the capturing group).

However, parentheses have the side effect of caching (capturing) the related match. To avoid this, place ?: immediately after the opening parenthesis, before the first option.
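
A small JavaScript sketch of alternation with capturing and non-capturing groups. The gray/grey sample words and the variable names are assumptions chosen for illustration:

```javascript
// Capturing group: the chosen alternative is saved as sub-match 1.
const capturing = "grey".match(/gr(a|e)y/);
console.log(capturing[0], capturing[1]); // grey e
console.log(capturing.length);           // 2

// Non-capturing group: same match, but nothing is saved.
const nonCapturing = "grey".match(/gr(?:a|e)y/);
console.log(nonCapturing.length);        // 1

// Without parentheses, | splits the whole pattern: "gra" or "ey".
const ungrouped = "ey".match(/gra|ey/)[0];
console.log(ungrouped);                  // ey

const colors = "gray grey groy".match(/gr(?:a|e)y/g);
console.log(colors);                     // [ 'gray', 'grey' ]
```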

?: is one of the non-capturing operators, along with ?= and ?!. These have additional meanings: ?= is a positive lookahead, matching any position where the pattern inside the parentheses follows the main pattern, and ?! is a negative lookahead, matching any position where the pattern inside the parentheses does not follow the main pattern.

Differences in using ?=, ?<=, ?!, ?<!

exp1(?=exp2): Find exp1 followed by exp2.

(?<=exp2)exp1: Find exp1 preceded by exp2.

exp1(?!exp2): Find exp1 not followed by exp2.

(?<!exp2)exp1: Find exp1 not preceded by exp2.
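
The four forms can be compared in a JavaScript sketch. The foo/bar sample strings are assumptions, and the lookbehind forms (?<= and ?<!) assume an engine that supports ES2018 lookbehind, such as current Node.js and modern browsers:

```javascript
const text = "foobar foobaz";

// Lookahead: is "foo" followed (or not followed) by "bar"?
const followedByBar = /foo(?=bar)/.exec(text).index;
const notFollowedByBar = /foo(?!bar)/.exec(text).index;
console.log(followedByBar, notFollowedByBar);    // 0 7

const text2 = "foobar xbar";

// Lookbehind: is "bar" preceded (or not preceded) by "foo"?
const precededByFoo = /(?<=foo)bar/.exec(text2).index;
const notPrecededByFoo = /(?<!foo)bar/.exec(text2).index;
console.log(precededByFoo, notPrecededByFoo);    // 3 8

// The lookaround text is checked but never becomes part of the match.
console.log(/foo(?=bar)/.exec(text)[0]);         // foo
```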

For more details, refer to: Regular Expression Lookahead and Lookbehind


Backreferences

Backreferences allow you to refer back to previously matched groups within the same regular expression. Adding parentheses around a regular expression pattern or part of a pattern causes the related match to be stored in a temporary buffer. Each captured sub-match is stored in the order they appear from left to right in the regular expression pattern. The buffer numbers start from 1 and can store up to 99 captured sub-expressions. Each buffer can be accessed using \n, where n is a one or two-digit decimal number that identifies a specific buffer.

Non-capturing meta-characters ?:, ?=, or ?! can be used to rewrite the capture, ignoring the saving of the related match.

One of the simplest and most useful applications of back-references is the ability to find adjacent duplicate words in text. Take the following sentence as an example:

Is is the cost of of gasoline going up up?

The sentence clearly contains several repeated words. It would be useful to have a way to locate them without searching for duplicates of every individual word. The following regular expression uses a single sub-expression to achieve this:

Example

Find duplicate words:

const str = "Is is the cost of of gasoline going up up";
const patt1 = /\b([a-z]+) \1\b/igm;
console.log(str.match(patt1)); // the pairs "Is is", "of of", "up up"

The captured expression, as specified by [a-z]+, includes one or more letters. The second part of the regular expression, \1, is a reference to the first captured sub-match: the second occurrence of the word must be identical to the text matched by the parenthesized expression.

The word boundary meta-character ensures that only whole words are detected. Otherwise, phrases like "is issued" or "this is" would be incorrectly identified as duplicates by this expression.

The global flag g at the end of the regular expression specifies that the expression should be applied to the input string to find as many matches as possible.

The case-insensitive flag i at the end of the expression specifies that the case should be ignored.

The multi-line flag m specifies that potential matches may appear on either side of a newline character.
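
As a hedged extension of the example, the same pattern can also repair the sentence: in a replacement string, $1 stands for the first captured sub-match, so each doubled word collapses to a single copy. The variable names below are assumptions, and the m flag is omitted because the pattern contains no ^ or $:

```javascript
const sentence = "Is is the cost of of gasoline going up up?";
const duplicate = /\b([a-z]+) \1\b/gi;

// Locate the doubled words.
const found = sentence.match(duplicate);
console.log(found); // [ 'Is is', 'of of', 'up up' ]

// Replace each doubled word with its first occurrence.
const fixed = sentence.replace(duplicate, "$1");
console.log(fixed); // Is the cost of gasoline going up?
```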

Back-references can also decompose Uniform Resource Identifiers (URIs) into their components. Suppose you want to decompose the following URI into its protocol (ftp, http, etc.), domain address, and page/path:

https://www.tutorialpro.org:80/html/html-tutorial.html

The following regular expression provides this functionality:

Example

Output all matching data:

const str = "https://www.tutorialpro.org:80/html/html-tutorial.html";
const patt1 = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;
const arr = str.match(patt1);
// Print the full match first, then each captured sub-match.
for (let i = 0; i < arr.length; i++) {
    console.log(arr[i]);
}

The third line, str.match(patt1), returns an array. The array in the example contains 5 elements, with index 0 corresponding to the entire match, index 1 corresponding to the first capturing group (inside the parentheses), and so on.

The first parentheses sub-expression captures the protocol part of the web address. This sub-expression matches any word before a colon and two forward slashes.

The second parentheses sub-expression captures the domain address part of the address. This sub-expression matches one or more characters other than : and /.

The third parentheses sub-expression captures the port number (if specified). This sub-expression matches a colon followed by zero or more digits. The ? after the group means it can occur at most once.

Finally, the fourth parentheses sub-expression captures the path and/or page information specified by the web address. This sub-expression matches any sequence of characters that does not include a # or a space.

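Applying the regular expression to the URI above, the sub-matches contain the following:

The first parentheses sub-expression contains https.

The second parentheses sub-expression contains www.tutorialpro.org.

The third parentheses sub-expression contains :80.

The fourth parentheses sub-expression contains /html/html-tutorial.html.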
