Easy Tutorial

Regular Expressions - Syntax

A regular expression (regex) describes a pattern for matching strings. It can be used to check if a string contains a certain substring, replace matched substrings, or extract substrings that meet specific criteria from a string.

Constructing a regular expression is similar to creating a mathematical expression. By combining various metacharacters and operators, small expressions can be combined to create larger ones. Components of a regex can be individual characters, character sets, character ranges, choices between characters, or any combination of these components.

A regular expression is a textual pattern composed of ordinary characters (e.g., characters from a to z) and special characters (called "metacharacters"). The pattern describes one or more strings to be matched in the text being searched. A regex serves as a template to match a character pattern in the searched string.


Ordinary Characters

Ordinary characters include all printable and non-printable characters that are not explicitly designated as metacharacters. This includes all uppercase and lowercase letters, all digits, all punctuation marks, and some other symbols.

Character Description
[ABC] Matches any character listed in the square brackets. For example, [aeiou] matches every e, o, u, and a in the string "google tutorialpro taobao".
[^ABC] Matches any character not listed in the square brackets. For example, [^aeiou] matches every character in the string "google tutorialpro taobao" except e, o, u, and a.
[A-Z] Represents a range that matches any uppercase letter. [a-z] matches any lowercase letter.
. Matches any single character except a line break (\n, \r); equivalent to [^\n\r].
[\s\S] Matches any character, including line breaks. \s matches any whitespace character (line breaks included); \S matches any non-whitespace character.
\w Matches a letter, digit, or underscore. Equivalent to [A-Za-z0-9_].
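The character classes in the table can be tried out with a short script. This is a minimal sketch, assuming a JavaScript runtime such as Node.js or a browser console; the variable names are illustrative, and console.log is used here for output.

```javascript
const text = "google tutorialpro taobao";

// [aeiou] matches each vowel; the g flag collects every match.
const vowels = text.match(/[aeiou]/g);

// [^aeiou] matches everything else, including the two spaces.
const nonVowels = text.match(/[^aeiou]/g);

// \w+ matches runs of letters, digits, and underscores.
const words = text.match(/\w+/g);

// [0-9]+ matches runs of digits.
const digitRuns = "123abc456edf789".match(/[0-9]+/g);

console.log(vowels.length, nonVowels.length, words, digitRuns);
```

Note that [^aeiou] also matches the spaces, since a negated set matches any character that is not listed, not only letters.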

Non-Printable Characters

Non-printable characters can also be part of a regex. The table below lists the escape sequences for non-printable characters:

Character Description
\cx Matches the control character specified by x. For example, \cM matches a Control-M or carriage return. The value of x must be A-Z or a-z. Otherwise, c is treated as a literal 'c' character.
\f Matches a form feed character. Equivalent to \x0c and \cL.
\n Matches a newline character. Equivalent to \x0a and \cJ.
\r Matches a carriage return character. Equivalent to \x0d and \cM.
\s Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v]. Note that Unicode regexes can match full-width space characters.
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab character. Equivalent to \x09 and \cI.
\v Matches a vertical tab character. Equivalent to \x0b and \cK.
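The equivalences listed in the table can be checked directly. The following is a small sketch, assuming JavaScript's regex engine; the sample string and the object keys are made up for illustration.

```javascript
// Illustrative input: a tab, a newline, and a plain space separate the letters.
const sample = "a\tb\nc d";

// \s+ treats any run of whitespace as a single separator.
const pieces = sample.split(/\s+/);

// Each pair of escapes is listed in the table as equivalent to the character tested.
const equivalents = {
  tab: /\x09/.test("\t") && /\cI/.test("\t"),
  newline: /\x0a/.test("\n") && /\cJ/.test("\n"),
  carriageReturn: /\x0d/.test("\r") && /\cM/.test("\r"),
  formFeed: /\x0c/.test("\f") && /\cL/.test("\f"),
  verticalTab: /\x0b/.test("\v") && /\cK/.test("\v"),
};

// In JavaScript, \s also matches the full-width (ideographic) space U+3000.
const fullWidthSpace = /\s/.test("\u3000");

console.log(pieces, equivalents, fullWidthSpace);
```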

Special Characters

Special characters are characters with a special meaning. For example, the * in runoo*b means that the preceding character o may appear zero or more times. To search for a literal * in a string, escape it with a backslash: runo\*ob matches the string runo*ob. In general, to match any of these metacharacters literally, you must "escape" it by placing the backslash character \ in front of it. The table below lists the special characters in regular expressions:

Special Character Description
$ Matches the end position of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before '\n' or '\r'. To match the $ character itself, use \$.
( ) Marks the start and end positions of a subexpression. Subexpressions can be captured for later use. To match these characters, use \( and \).
* Matches the preceding subexpression zero or more times. To match the * character, use \*.
+ Matches the preceding subexpression one or more times. To match the + character, use \+.
. Matches any single character except the newline character \n. To match ., use \.
[ Marks the start of a bracket expression. To match [, use \[.
? Matches the preceding subexpression zero or one time, or indicates a non-greedy quantifier. To match the ? character, use \?.
\ Marks the next character as a special character, a literal character, a backreference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline character. The sequence '\\' matches "\" and '\(' matches "(".
^ Matches the start position of the input string, unless used in a bracket expression, where it indicates the negation of the character set in the bracket expression. To match the ^ character itself, use \^.
{ Marks the start of a quantifier expression. To match {, use \{.
| Indicates a choice between two items. To match |, use \|.
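The effect of escaping can be seen by comparing an escaped and an unescaped pattern. This is a sketch only; escapeForRegex is a hypothetical helper written for this example (it is not a built-in function), and it simply puts a backslash in front of every special character listed above.

```javascript
// Unescaped: * is a quantifier that applies to the second "o".
const unescaped = /runoo*b/;
// Escaped: \* matches a literal asterisk.
const escaped = /runo\*ob/;

const results = {
  unescapedMatchesRunob: unescaped.test("runob"),      // "o" repeated zero times
  unescapedMatchesLiteral: unescaped.test("runo*ob"),  // the literal * gets in the way
  escapedMatchesLiteral: escaped.test("runo*ob"),
};

// Hypothetical helper: escape every special character so that
// arbitrary text can be matched literally.
function escapeForRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const literalPattern = new RegExp(escapeForRegex("1+1=2"));

console.log(results, literalPattern.source, literalPattern.test("1+1=2"));
```

Without escaping, /1+1=2/ would require one or more 1 characters followed by 1=2, so it would not match the text 1+1=2.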

Quantifiers

Quantifiers specify how many instances of a given component must appear for a match to occur. There are six types: *, +, ?, {n}, {n,}, and {n,m}.

The quantifiers in regular expressions are:

Character Description
* Matches the preceding subexpression zero or more times. For example, zo* can match "z" and "zoo". * is equivalent to {0,}.
+ Matches the preceding subexpression one or more times. For example, zo+ can match "zo" and "zoo", but not "z". + is equivalent to {1,}.
? Matches the preceding subexpression zero or one time. For example, do(es)? can match "do", "does", and the "do" in "doxy". ? is equivalent to {0,1}.
{n} n is a non-negative integer. Matches exactly n times. For example, o{2} cannot match the "o" in "Bob" but can match the two o's in "food".
{n,} n is a non-negative integer. Matches at least n times. For example, o{2,} cannot match the "o" in "Bob" but can match all the o's in "foooood". o{1,} is equivalent to o+. o{0,} is equivalent to o*.
{n,m} n and m are non-negative integers, where n <= m. Matches at least n and at most m times. For example, o{1,3} will match the first three o's in "fooooood". o{0,1} is equivalent to o?. Note that there should be no spaces between the comma and the numbers.
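The examples in the quantifier table can be reproduced as follows. This is a minimal sketch in JavaScript; the property names in results are illustrative.

```javascript
const results = {
  // zo* : "z" followed by zero or more "o"
  zeroOrMore: ["z", "zoo"].map((s) => s.match(/zo*/)[0]),
  // zo+ : "z" followed by at least one "o"
  oneOrMore: ["z", "zo", "zoo"].map((s) => /zo+/.test(s)),
  // do(es)? : "do" optionally followed by "es"
  optional: ["do", "does", "doxy"].map((s) => s.match(/do(es)?/)[0]),
  // o{2} : exactly two consecutive "o"
  exactlyTwo: ["Bob", "food"].map((s) => /o{2}/.test(s)),
  // o{2,} : two or more "o", as many as possible
  atLeastTwo: "foooood".match(/o{2,}/)[0],
  // o{1,3} : between one and three "o", as many as possible
  oneToThree: "fooooood".match(/o{1,3}/)[0],
};

console.log(results);
```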

The following regular expression matches a positive integer. [1-9] requires the first digit to be non-zero, and [0-9]* allows any number of further digits:

/[1-9][0-9]*/

Note that the quantifier appears after the range expression. Therefore, it applies to the entire range expression, in this case, specifying digits from 0 to 9 (inclusive).

The + quantifier is not used here because a digit is not necessarily required in the second or subsequent positions. The ? character is also not used because it would limit the integer to two digits.

If you want to match a number from 0 to 99, you can use the following expression to specify at least one but no more than two digits:

/[0-9]{1,2}/

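The drawback of the above expression is that it also matches values with a leading zero, such as 00 and 01, and that in a longer number it matches only the first two digits. An improved expression that matches positive integers from 1 to 99 is: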

/[1-9][0-9]?/

or

/[1-9][0-9]{0,1}/
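The difference between the two expressions shows up when they are tested against a few inputs. In this sketch the ^ and $ anchors (described in the Anchors section) are an addition not present in the expressions above; they are assumed here so that the whole input must match, not just a part of it.

```javascript
// Loose version from the text: one or two digits anywhere in the input.
const loose = /[0-9]{1,2}/;
// Stricter version, anchored so that the whole input must be 1 to 99.
const strict = /^[1-9][0-9]?$/;

// The loose pattern accepts a leading zero and stops after two digits.
const looseOnLeadingZero = "00".match(loose)[0];
const looseOnLongNumber = "2024".match(loose)[0];

const inputs = ["0", "00", "01", "1", "10", "99", "100"];
const strictResults = inputs.map((s) => strict.test(s));

console.log(looseOnLeadingZero, looseOnLongNumber, strictResults);
```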

The * and + quantifiers are greedy, as they match as much text as possible. Adding a ? after them makes them non-greedy or minimal matches.

For example, you might search an HTML document to find content within <h1> tags. The HTML code is as follows:

<h1>tutorialpro-tutorialpro.org</h1>

Greedy: The following expression matches everything from the opening less-than sign (<) to the greater-than sign (>) that ends the closing </h1> tag, that is, the entire line.

/<.*>/

Non-greedy: If you only need to match the individual h1 tags, use the following non-greedy expression. Its first match is <h1>.

/<.*?>/

You can also use the following regular expression to match the opening h1 tag:

/<\w+?>/

By placing a ? after the *, +, or ? quantifier, the expression converts from a "greedy" to a "non-greedy" or minimal match.
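The greedy and non-greedy behavior described above can be compared on the sample HTML line. A short sketch, with illustrative variable names:

```javascript
const html = "<h1>tutorialpro-tutorialpro.org</h1>";

// Greedy: .* takes as much as possible, so the match runs to the last ">".
const greedy = html.match(/<.*>/)[0];

// Non-greedy: .*? takes as little as possible, so the match stops at the first ">".
const lazy = html.match(/<.*?>/)[0];

// With the g flag, the non-greedy pattern finds each tag separately.
const allTags = html.match(/<.*?>/g);

// \w+? only allows word characters between the angle brackets.
const wordTag = html.match(/<\w+?>/)[0];

console.log(greedy, lazy, allTags, wordTag);
```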


Anchors

Anchors allow you to anchor the regular expression to the start or end of a line. They also enable you to create expressions that appear within a word, at the beginning of a word, or at the end of a word.

Anchors describe the boundaries of a string or word. ^ and $ refer to the start and end of a string, respectively. \b describes a word boundary, while \B represents a non-word boundary.

The anchors for regular expressions are:

Character Description
^ Matches the position at the start of the input string. If the Multiline property of the RegExp object is set, ^ also matches the position following \n or \r.
$ Matches the position at the end of the input string. If the Multiline property of the RegExp object is set, $ also matches the position before \n or \r.
\b Matches a word boundary, that is, the position between a word character and a non-word character such as a space.
\B Matches a non-word boundary.

Note: Quantifiers cannot be used with anchors. Since there cannot be more than one position immediately before or after a newline or word boundary, expressions like ^* are not allowed.

To match text at the start of a line, use the ^ character at the beginning of the regular expression. Do not confuse this usage with its use inside bracket expressions.

To match text at the end of a line, use the $ character at the end of the regular expression.

To use anchors when searching for chapter titles, the following regular expression matches a chapter title that has a one- or two-digit chapter number and appears at the start of a line:

/^Chapter [1-9][0-9]{0,1}/

A true chapter title not only appears at the start of a line but is also the only text on that line. It appears at both the start and end of the line. The following expression ensures that the match is only for chapters and not cross-references by creating a regular expression that matches the start and end of a line:

/^Chapter [1-9][0-9]{0,1}$/
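The effect of anchoring at the start only versus at both ends can be sketched as follows. The sample lines are invented for illustration, and the second pattern uses the quantifier [0-9]{0,1} together with the $ anchor, as the preceding paragraph describes.

```javascript
// Illustrative lines: real titles, a cross-reference, and near misses.
const lines = [
  "Chapter 1",
  "Chapter 12",
  "See Chapter 3 for details",
  "Chapter 100",
  "Chapter 7 continued",
];

const startOnly = /^Chapter [1-9][0-9]{0,1}/;
const startAndEnd = /^Chapter [1-9][0-9]{0,1}$/;

const startOnlyHits = lines.filter((line) => startOnly.test(line));
const exactHits = lines.filter((line) => startAndEnd.test(line));

// With the m flag, ^ and $ also match at line breaks inside one string.
const multiLineHits = lines.join("\n").match(/^Chapter [1-9][0-9]{0,1}$/gm);

console.log(startOnlyHits, exactHits, multiLineHits);
```

Only the pattern anchored at both ends rejects "Chapter 100" and "Chapter 7 continued", because the start-only pattern is satisfied by a matching prefix.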

Matching word boundaries is different but adds significant capability to regular expressions. Word boundaries are the positions between words and spaces. Non-word boundaries are any other positions. The following expression matches the first three characters of the word "Chapter" because these characters appear after a word boundary:

/\bCha/

The position of the \b character is crucial. If it is at the start of the string to be matched, it looks for a match at the beginning of a word. If it is at the end of the string, it looks for a match at the end of a word. For example, the following expression matches the string "ter" in the word "Chapter" because it appears before a word boundary:

/ter\b/

The following expression matches the string "apt" in "Chapter" but not in "aptitude":

/\Bapt/

The string "apt" appears at a non-word boundary in "Chapter" but at a word boundary in "aptitude". The \B non-word boundary operator cannot match at the start or end of a word. For example, the following expression does not match "Cha" in "Chapter":

/\BCha/
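The word-boundary examples above can be verified with a few calls. A minimal sketch; the property names are illustrative.

```javascript
const boundaryChecks = {
  // \bCha: "Cha" directly after a word boundary (here, the start of the word).
  startOfWord: /\bCha/.test("Chapter"),
  // ter\b: "ter" directly before a word boundary (here, the end of the word).
  endOfWord: "Chapter".match(/ter\b/)[0],
  // \Bapt: "apt" not at a word boundary, so it must be inside a word.
  insideChapter: /\Bapt/.test("Chapter"),
  insideAptitude: /\Bapt/.test("aptitude"),
  // \BCha fails because "Cha" sits at the start of the word.
  nonBoundaryAtStart: /\BCha/.test("Chapter"),
};

// search returns the index of the match: "apt" starts at index 2 in "Chapter".
const aptIndex = "Chapter".search(/\Bapt/);

console.log(boundaryChecks, aptIndex);
```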

Alternation

Use parentheses () to enclose all alternative options, separated by |.

() represents a capturing group, which saves the matched value within the group. Multiple matched values can be accessed by number n (where n is the number of the capturing group).

However, parentheses have a side effect of caching related matches. This can be eliminated by using ?: at the beginning of the first option.

The non-capturing elements are ?:, ?=, and ?!. The latter two have additional meanings: ?= is a positive lookahead, which matches at any position that is followed by the pattern inside the parentheses, and ?! is a negative lookahead, which matches at any position that is not followed by the pattern inside the parentheses.
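The difference between a capturing group and a non-capturing group shows up in the array returned by match. This is a small sketch; the word "industries" and the alternatives y and ies are sample data chosen for illustration.

```javascript
// Capturing group: the chosen alternative is saved as group 1.
const capturing = "industries".match(/industr(y|ies)/);

// Non-capturing group: ?: keeps the alternation but saves nothing.
const nonCapturing = "industries".match(/industr(?:y|ies)/);

console.log(capturing[0], capturing[1], capturing.length);
console.log(nonCapturing[0], nonCapturing.length);
```

Both patterns match the same text; only the capturing version adds an extra element for the group.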

Differences in using ?=, ?<=, ?!, and ?<!

exp1(?=exp2): Find exp1 followed by exp2.

(?<=exp2)exp1: Find exp1 preceded by exp2.

exp1(?!exp2): Find exp1 not followed by exp2.

(?<!exp2)exp1: Find exp1 not preceded by exp2.
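The four forms can be compared on one sample string. This is a sketch under the assumption that the JavaScript engine supports lookbehind (ES2018 or later); the sample text is invented, with "apple" playing the role of exp1.

```javascript
// "apple" occurs at index 0, at index 11, and inside "pineapple" at index 27.
const text = "apple pie, apple tart, pineapple";

// exp1(?=exp2): "apple" followed by " pie".
const followedByPie = text.search(/apple(?= pie)/);

// exp1(?!exp2): first "apple" not followed by " pie".
const notFollowedByPie = text.search(/apple(?! pie)/);

// (?<=exp2)exp1: "apple" preceded by "pine".
const precededByPine = text.search(/(?<=pine)apple/);

// (?<!exp2)exp1: every "apple" not preceded by "pine".
const notPrecededByPine = text.match(/(?<!pine)apple/g);

console.log(followedByPie, notFollowedByPie, precededByPine, notPrecededByPine.length);
```

In every case only "apple" itself is part of the match; the lookahead and lookbehind parts check the surrounding text without consuming it.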

For more details, refer to: Regular Expression Lookahead and Lookbehind


Backreferences

Backreferences allow you to refer back to previously matched groups within the same regular expression. Adding parentheses around a part of a regular expression pattern causes the matching content to be stored in a temporary buffer. Captured sub-matches are stored in the order in which they appear from left to right in the pattern. The buffer numbers start from 1, and up to 99 captured sub-expressions can be stored. Each buffer can be accessed using \n, where n is a one- or two-digit decimal number that identifies a specific buffer.

Non-capturing meta-characters ?:, ?=, or ?! can be used to override capturing and ignore saving the related matches.

One of the simplest and most useful applications of back-references is the ability to match two identical adjacent words in text. Take the following sentence as an example:

Is is the cost of of gasoline going up up?

The sentence clearly contains several repeated words. It would be useful to have a way to locate them without searching for each repeated word separately. The following regular expression uses a single sub-expression to achieve this:

Example

Finding repeated words:

var str = "Is is the cost of of gasoline going up up";
var patt1 = /\b([a-z]+) \1\b/igm;
document.write(str.match(patt1));

The captured expression, as specified by [a-z]+, includes one or more letters. The second part of the regular expression is a reference to the previously captured sub-match, that is, the second occurrence of the word must exactly match the text captured by the parenthesized expression. \1 specifies the first sub-match.

The word boundary meta-character ensures that only whole words are detected. Otherwise, phrases like "is issued" or "this is" would be incorrectly identified by this expression.

The global flag g at the end of the regular expression specifies that the expression should be applied to the input string to find as many matches as possible.

The case-insensitive flag i at the end of the expression specifies that the case should be ignored.

The multi-line flag m specifies that potential matches may appear on either side of a newline.
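The role of the word boundaries can be illustrated by running the pattern with and without them. In this sketch the string "This is issued" and the variable names are assumptions made for the example, and the final replace call is an extra step showing one possible use of the captured group.

```javascript
const withBoundaries = /\b([a-z]+) \1\b/gi;
const withoutBoundaries = /([a-z]+) \1/gi;

const sentence = "Is is the cost of of gasoline going up up";
// Each match is a word, a space, and the same word again.
const repeated = sentence.match(withBoundaries);

// Contains no repeated whole word, only repeated letter sequences.
const tricky = "This is issued";
const strictMatches = tricky.match(withBoundaries);
const looseMatches = tricky.match(withoutBoundaries);

// Replacing each match with group 1 keeps a single copy of the word.
const cleaned = sentence.replace(withBoundaries, "$1");

console.log(repeated, strictMatches, looseMatches, cleaned);
```

Without \b, the pattern reports "is is" inside "This is issued", although no whole word is repeated there.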

Back-references can also decompose Uniform Resource Identifiers (URIs) into their components. Suppose you want to decompose the following URI into its protocol (ftp, http, etc.), domain address, and page/path:

https://www.tutorialpro.org:80/html/html-tutorial.html

The following regular expression provides this functionality:

Example

Outputting all matching data:

var str = "https://www.tutorialpro.org:80/html/html-tutorial.html";
var patt1 = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;
var arr = str.match(patt1);
for (var i = 0; i < arr.length ; i++) {
    document.write(arr[i]);
    document.write("<br>");
}

The third line str.match(patt1) returns an array. In this example, the array contains 5 elements, with index 0 corresponding to the entire string, index 1 to the first capturing group (inside the parentheses), and so on.

The first parentheses sub-expression captures the protocol part of the web address. This sub-expression matches any word before a colon and two forward slashes.

The second parentheses sub-expression captures the domain address part of the address. This sub-expression matches one or more characters that are not : or /.

The third parentheses sub-expression captures the port number (if specified). This sub-expression matches zero or more digits following a colon. Because of the ? that follows it, this sub-expression is optional and matches at most once.

Finally, the fourth parentheses sub-expression captures the path and/or page information specified by the web address. This sub-expression matches any sequence of characters that does not include # or a space.

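Applying the regular expression to the URI above, the sub-matches contain the following:

The first parentheses sub-expression contains https.

The second parentheses sub-expression contains www.tutorialpro.org.

The third parentheses sub-expression contains :80.

The fourth parentheses sub-expression contains /html/html-tutorial.html.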
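The decomposition can also be checked in code. The following sketch wraps the same pattern in a hypothetical helper named splitUri (not part of the original example) and adds a second URI without a port to show what the optional third group yields in that case.

```javascript
const uriPattern = /(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)/;

// Hypothetical helper: returns the four sub-matches as named fields,
// or null when the input does not match the pattern at all.
function splitUri(uri) {
  const match = uri.match(uriPattern);
  if (match === null) {
    return null;
  }
  // Index 0 is the whole match; indexes 1 to 4 are the capturing groups.
  const [, protocol, host, port, path] = match;
  return { protocol, host, port, path };
}

const withPort = splitUri("https://www.tutorialpro.org:80/html/html-tutorial.html");
const withoutPort = splitUri("https://www.tutorialpro.org/html/html-tutorial.html");

console.log(withPort, withoutPort);
```

When the port is omitted, the third group does not take part in the match, so its entry in the array is undefined.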
