Easy Tutorial

Java Regular Expressions

Regular expressions define patterns for strings.

They can be used to search, edit, or process text.

Regular expressions are not limited to one language, but there are subtle differences in each language.

Regular Expression Examples

A string is a simple regular expression; for example, the Hello World regular expression matches the "Hello World" string.

. (dot) is also a regular expression that matches any single character, such as "a" or "1".

The table below lists some regular expression examples and their descriptions:

| Regular Expression | Description |
| --- | --- |
| this is text | Matches the string "this is text". |
| this\s+is\s+text | Note the \s+ in the expression. The \s+ after "this" matches one or more whitespace characters, followed by the "is" string, then \s+ matches one or more whitespace characters again, followed by the "text" string. It can match this example: "this   is    text". |
| ^\d+(\.\d+)? | ^ anchors the match at the start of the input. \d+ matches one or more digits. \. matches a literal "." character. ? makes the group within the parentheses optional. Examples it can match: "5", "1.5", and "2.21". |
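
The patterns in the table can be tried out directly. The following is a minimal sketch, assuming the dot in the last pattern is meant literally (written `\.`); the class name `TableExamples` and the helper `fullMatch` are made up for this illustration. Note that each backslash is doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

class TableExamples {
    // Patterns from the table; every regex backslash is written twice in Java source.
    static final Pattern WORDS = Pattern.compile("this\\s+is\\s+text");
    static final Pattern NUMBER = Pattern.compile("^\\d+(\\.\\d+)?");

    // Hypothetical helper: true only when the whole input matches the pattern.
    static boolean fullMatch(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(WORDS, "this   is    text")); // true
        System.out.println(fullMatch(WORDS, "thisistext"));        // false: \s+ needs at least one whitespace
        System.out.println(fullMatch(NUMBER, "5"));                // true
        System.out.println(fullMatch(NUMBER, "2.21"));             // true
        System.out.println(fullMatch(NUMBER, "2."));               // false: the dot must be followed by digits
    }
}
```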

Java regular expressions are most similar to those in Perl.

The java.util.regex package primarily includes the following three classes:

-Pattern Class:

A Pattern object is a compiled representation of a regular expression. The Pattern class has no public constructor. To create a Pattern object, you must call its public static compile method, which returns a Pattern object. This method takes a regular expression as its first parameter.

-Matcher Class:

A Matcher object is the engine that interprets and matches the input string. Like the Pattern class, Matcher has no public constructor. You need to call the matcher method of the Pattern object to get a Matcher object.

-PatternSyntaxException:

PatternSyntaxException is an unchecked exception class that indicates a syntax error in a regular expression pattern.

The following example uses the regular expression .*tutorialpro.* to check if the string contains the tutorialpro substring:

Example

import java.util.regex.*;

class RegexExample1 {
   public static void main(String[] args) {
      String content = "I am noob " +
        "from tutorialpro.org.";

      String pattern = ".*tutorialpro.*";

      boolean isMatch = Pattern.matches(pattern, content);
      System.out.println("Does the string contain the 'tutorialpro' substring? " + isMatch);
   }
}

The output of the example is:

Does the string contain the 'tutorialpro' substring? true
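
Pattern.matches requires the whole string to match, which is why the example wraps the word in .* on both sides. As an alternative sketch, Matcher.find() searches for a match anywhere in the input, so the bare word is enough. The class name `ContainsDemo` and the helper `containsMatch` are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContainsDemo {
    // Hypothetical helper: true if the regex matches anywhere in the text.
    static boolean containsMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find(); // find() scans the input; no leading or trailing .* is needed
    }

    public static void main(String[] args) {
        String content = "I am noob from tutorialpro.org.";
        System.out.println(containsMatch("tutorialpro", content));   // true
        System.out.println(Pattern.matches("tutorialpro", content)); // false: the whole string must match
    }
}
```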

Capturing Groups

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses.

For example, the regular expression (dog) creates a single group containing "d", "o", and "g".

Capturing groups are numbered by counting their opening parentheses from left to right. For example, in the expression ((A)(B(C))), there are four such groups:

1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)

You can see how many groups are in the expression by calling the groupCount method on the matcher object. The groupCount method returns an int indicating how many capturing groups are present in the matcher.

There is also a special group (group(0)), which always represents the entire expression. This group is not included in the count returned by groupCount.
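
The numbering rule can be checked with a short sketch. It matches the expression ((A)(B(C))) against the input "ABC" and prints every group; the class name `GroupNumberingDemo` and the helper `matchNested` are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Hypothetical helper: returns a matcher that has already matched "ABC".
    static Matcher matchNested() {
        Matcher m = Pattern.compile("((A)(B(C)))").matcher("ABC");
        m.matches(); // succeeds, so group(n) can be queried afterwards
        return m;
    }

    public static void main(String[] args) {
        Matcher m = matchNested();
        System.out.println("groupCount(): " + m.groupCount()); // 4, group 0 is not counted
        for (int i = 0; i <= m.groupCount(); i++) {
            System.out.println("group(" + i + "): " + m.group(i));
        }
        // group(0): ABC, group(1): ABC, group(2): A, group(3): BC, group(4): C
    }
}
```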

Example

The following example shows how to find numeric strings from a given string:

RegexMatches.java File Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    public static void main(String[] args) {
        // Search for a pattern in the string
        String line = "This order was placed for QT3000! OK?";
        String pattern = "(\\D*)(\\d+)(.*)";

        // Create a Pattern object
        Pattern r = Pattern.compile(pattern);

        // Now create a Matcher object
        Matcher m = r.matcher(line);
        if (m.find()) {
            System.out.println("Found value: " + m.group(0));
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2));
            System.out.println("Found value: " + m.group(3));
        } else {
            System.out.println("NO MATCH");
        }
    }
}

The above example compiles and runs with the following output:

Found value: This order was placed for QT3000! OK?
Found value: This order was placed for QT
Found value: 3000
Found value: ! OK?

Regular Expression Syntax

In other languages, \\ means: I want to insert a literal backslash in the regular expression, please do not give it any special meaning.

In Java, \\ means: I want to insert a backslash for the regular expression, so the character following it has special meaning.

So, in other languages (like Perl), a single backslash \ is enough to escape, whereas in a Java string literal two backslashes \\ are needed to produce the single backslash that the regular expression engine sees. Put simply, \\ in Java source corresponds to one \ in other languages. This is why the regular expression for a single digit is written \\d, and the one for a literal backslash is written \\\\.

System.out.print("\\");    // Outputs \
System.out.print("\\\\");  // Outputs \\
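
The doubling rule can be seen in a small sketch. The class name `BackslashDemo` and both helper methods are hypothetical names chosen for this illustration.

```java
import java.util.regex.Pattern;

class BackslashDemo {
    // The Java literal "\\d" is the two-character regex \d (one digit).
    static boolean isSingleDigit(String s) {
        return Pattern.matches("\\d", s);
    }

    // The Java literal "\\\\" is the two-character regex \\ (one literal backslash).
    static boolean isSingleBackslash(String s) {
        return Pattern.matches("\\\\", s);
    }

    public static void main(String[] args) {
        System.out.println(isSingleDigit("7"));      // true
        System.out.println(isSingleDigit("x"));      // false
        System.out.println(isSingleBackslash("\\")); // true: "\\" is the one-character string \
        System.out.println("\\\\".length());         // 2: four source characters, two actual backslashes
    }
}
```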

| Character | Description |
| --- | --- |
| \ | Marks the next character as either a special character, a literal, a back-reference, or an octal escape. For example, 'n' matches the character 'n'. '\n' matches a newline character. The sequence '\\' matches '\' and '\(' matches '('. |
| ^ | Matches the position at the beginning of the input string. If the multiline flag (Pattern.MULTILINE in Java) is set, ^ also matches the position following '\n' or '\r'. |
| $ | Matches the position at the end of the input string. If the multiline flag (Pattern.MULTILINE in Java) is set, $ also matches the position before '\n' or '\r'. |
| * | Matches the preceding subexpression zero or more times. For example, 'zo*' matches 'z' and 'zoo'. '*' is equivalent to {0,}. |
| + | Matches the preceding subexpression one or more times. For example, 'zo+' matches 'zo' and 'zoo', but not 'z'. '+' is equivalent to {1,}. |
| ? | Matches the preceding subexpression zero or one time. For example, 'do(es)?' matches 'do' or 'does'. '?' is equivalent to {0,1}. |
| {n} | n is a non-negative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two 'o's in "food". |
| {n,} | n is a non-negative integer. Matches at least n times. For example, 'o{2,}' does not match the 'o' in "Bob" and matches all the 'o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. |
| {n,m} | m and n are non-negative integers, where n <= m. Matches at least n and at most m times. For example, 'o{1,3}' matches the first three 'o's in "fooooood". 'o{0,1}' is equivalent to 'o?'. Note: You cannot insert a space between the comma and the numbers. |
| ? | When this character immediately follows any of the other quantifiers (*, +, ?, {n}, {n,}, {n,m}), the matching pattern is "non-greedy". A "non-greedy" pattern matches as few characters as possible, whereas the default "greedy" pattern matches as many characters as possible. For example, in the string "oooo", 'o+?' matches a single 'o', while 'o+' matches all 'o's. |
| . | Matches any single character except line terminators such as "\n" and "\r". To match any character including line terminators, use a pattern like "[\s\S]" or compile the pattern with Pattern.DOTALL. |
| (pattern) | Matches pattern and captures the matched substring so that it can be retrieved afterwards (in Java, with Matcher.group(n)). To match literal parenthesis characters ( ), use "\(" or "\)". |
| (?:pattern) | Matches the pattern but does not capture the match, i.e., it is a non-capturing match, not storing the match for future use. This is useful for combining parts of patterns with the "or" character (|). For example, 'industr(?:y|ies)' is a more economical expression than 'industry|industries'. |
| (?=pattern) | Positive lookahead. Matches at a position where the text that follows matches pattern, without including that text in the match. It is a non-capturing match, i.e., the lookahead text is not stored for future use. For example, 'Windows (?=95|98|NT|2000)' matches "Windows" in "Windows 2000" but not in "Windows 3.1". The lookahead does not consume characters, i.e., after a match occurs, the search for the next match starts immediately after the previous match, not after the characters that make up the lookahead. |
| (?!pattern) | Negative lookahead. Matches at a position where the text that follows does not match pattern. It is a non-capturing match, i.e., the lookahead text is not stored for future use. For example, 'Windows (?!95|98|NT|2000)' matches "Windows" in "Windows 3.1" but not in "Windows 2000". The lookahead does not consume characters, i.e., after a match occurs, the search for the next match starts immediately after the previous match, not after the characters that make up the lookahead. |
| x|y | Matches either x or y. For example, 'z|food' matches "z" or "food". '(z|f)ood' matches "zood" or "food". |
| [xyz] | Character set. Matches any one of the included characters. For example, "[abc]" matches "a" in "plain". |
| [^xyz] | Negated character set. Matches any character not included. For example, "[^abc]" matches "p", "l", "i", "n" in "plain". |
| [a-z] | Character range. Matches any character within the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z". |
| [^a-z] | Negated range character. Matches any character not within the specified range. For example, "[^a-z]" matches any character not from "a" to "z". |
| \b | Matches a word boundary, i.e., the position between a word and a space. For example, "er\b" matches "er" in "never" but not in "verb". |
| \B | Non-word boundary match. "er\B" matches "er" in "verb" but not in "never". |
| \cx | Matches the control character indicated by x. For example, \cM matches Control-M or a carriage return. The value of x must be within A-Z or a-z. If not, c is assumed to be the character "c" itself. |
| \d | Digit character match. Equivalent to [0-9]. |
| \D | Non-digit character match. Equivalent to [^0-9]. |
| \f | Form feed character match. Equivalent to \x0c and \cL. |
| \n | Newline character match. Equivalent to \x0a and \cJ. |
| \r | Matches a carriage return. Equivalent to \x0d and \cM. |
| \s | Matches any whitespace character, including space, tab, form feed, etc. In Java, equivalent to [ \t\n\x0B\f\r]. |
| \S | Matches any non-whitespace character. Equivalent to [^\s]. |
| \t | Tab character match. Equivalent to \x09 and \cI. |
| \v | In Java 8 and later, matches any vertical whitespace character: [\n\x0B\f\r\x85\u2028\u2029]. To match only the vertical tab character, use \x0B or \cK. |
| \w | Matches any word character, including underscore. Equivalent to "[A-Za-z0-9_]". |
| \W | Matches any non-word character. Equivalent to "[^A-Za-z0-9_]". |
| \xn | Matches n, where n is a hexadecimal escape code. The hexadecimal escape code must be exactly two digits long. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" & "1". ASCII codes are allowed in regular expressions. |
| \num | Matches num, where num is a positive integer. A back reference to the captured match. For example, "(.)\1" matches two consecutive identical characters. |
| \n | Back-reference to the nth capturing group, where n is a single digit from 1 to 9. In Java, \1 through \9 are always interpreted as back-references, never as octal escapes. |
| \nm | Back-reference to group nm if at least nm capturing groups exist at that point in the expression. Otherwise the parser drops trailing digits until the number refers to an existing group or only one digit remains. |
| \0n, \0nn, \0mnn | Octal escape. In Java, an octal escape must start with \0. In \0n and \0nn, n is an octal digit (0-7); in \0mnn, m is an octal digit (0-3) and n is an octal digit (0-7). |
| \un | Matches n, where n is a Unicode character represented by a four-digit hexadecimal number. For example, \u00A9 matches the copyright symbol (©). |
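
Several rows of the syntax table can be verified with a short sketch. It takes the greedy and non-greedy quantifiers, the back-reference, and the two lookahead examples from the table and prints the first match of each. The class name `SyntaxTableDemo` and the helper `firstMatch` are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SyntaxTableDemo {
    // Hypothetical helper: text of the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("o+", "oooo"));      // oooo (greedy)
        System.out.println(firstMatch("o+?", "oooo"));     // o (non-greedy)
        System.out.println(firstMatch("o{2}", "food"));    // oo
        System.out.println(firstMatch("(.)\\1", "hello")); // ll (back-reference to group 1)

        // The lookahead is checked but not consumed, so the version number is not part of the match.
        String positive = "Windows (?=95|98|NT|2000)";
        String negative = "Windows (?!95|98|NT|2000)";
        System.out.println("[" + firstMatch(positive, "Windows 2000") + "]"); // [Windows ]
        System.out.println(firstMatch(positive, "Windows 3.1"));              // null
        System.out.println("[" + firstMatch(negative, "Windows 3.1") + "]");  // [Windows ]
    }
}
```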

According to the Java Language Specification, the backslash in a Java source code string is interpreted as a Unicode escape or other character escape. Therefore, two backslashes must be used in a string literal to indicate that the regular expression is protected from being interpreted by the Java bytecode compiler. For example, when interpreted as a regular expression, the string literal "\b" matches a single backspace character, while "\\b" matches a word boundary. The string literal "\(hello\)" is illegal and will result in a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.
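
The difference between "\b" and "\\b" described above can be observed with a minimal sketch. The class name `LiteralEscapeDemo` and the helper `found` are hypothetical names for this illustration.

```java
import java.util.regex.Pattern;

class LiteralEscapeDemo {
    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // "\\bcat\\b" reaches the regex engine as \bcat\b: word boundaries around "cat".
        System.out.println(found("\\bcat\\b", "a cat sat"));   // true

        // "\bcat" contains a backspace character (U+0008) followed by "cat".
        System.out.println(found("\bcat", "a cat sat"));       // false: no backspace in the input
        System.out.println(found("\bcat", "\bcat"));           // true: the input starts with a backspace

        // "\\(hello\\)" reaches the regex engine as \(hello\): literal parentheses.
        System.out.println(found("\\(hello\\)", "say (hello)")); // true
    }
}
```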

---

## Matcher Class Methods

## Index Methods

Index methods provide useful index values that precisely indicate where in the input string a match can be found:

| Number | Method and Description |
| --- | --- |
| 1 | public int start() <br> Returns the start index of the previous match. |
| 2 | public int start(int group) <br> Returns the start index of the subsequence captured by the given group during the previous match operation. |
| 3 | public int end() <br> Returns the offset after the last character matched. |
| 4 | public int end(int group) <br> Returns the offset after the last character of the subsequence captured by the given group during the previous match operation. |
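
The group variants of start and end can be sketched as follows. The input string, the class name `GroupIndexDemo`, and the helper `span` are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexDemo {
    // Hypothetical helper: {start, end} of the given group in the first match, or null if nothing matches.
    static int[] span(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(group), m.end(group) };
    }

    public static void main(String[] args) {
        String regex = "(\\d+)-(\\d+)";
        String input = "ab12-345";
        for (int g = 0; g <= 2; g++) {
            int[] s = span(regex, input, g);
            System.out.println("group " + g + ": start=" + s[0] + ", end=" + s[1]);
        }
        // group 0: start=2, end=8
        // group 1: start=2, end=4
        // group 2: start=5, end=8
    }
}
```

Because end is the offset after the last matched character, input.substring(start, end) returns exactly the text of the group.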

## Search Methods

Search methods are used to check the input string and return a boolean value indicating whether the pattern is found:

| Number | Method and Description |
| --- | --- |
| 1 | public boolean lookingAt() <br> Attempts to match the input sequence, starting at the beginning of the region, against the pattern. |
| 2 | public boolean find() <br> Attempts to find the next subsequence of the input sequence that matches the pattern. |
| 3 | public boolean find(int start) <br> Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index. |
| 4 | public boolean matches() <br> Attempts to match the entire region against the pattern. |
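
As a small sketch of find(int start), the following code looks for the word "cat" beginning at different offsets. The class name `FindFromDemo` and the helper `indexOfMatch` are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindFromDemo {
    // Hypothetical helper: start index of the first match at or after 'from', or -1 if there is none.
    static int indexOfMatch(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.start() : -1; // find(int) resets the matcher before searching
    }

    public static void main(String[] args) {
        String regex = "\\bcat\\b";
        String input = "cat cat cat cattie cat";
        System.out.println(indexOfMatch(regex, input, 0));  // 0
        System.out.println(indexOfMatch(regex, input, 1));  // 4
        System.out.println(indexOfMatch(regex, input, 12)); // 19: "cattie" is skipped
        System.out.println(indexOfMatch(regex, input, 20)); // -1
    }
}
```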

## Replacement Methods

Replacement methods are methods that replace text in the input string:

| Number | Method and Description |
| --- | --- |
| 1 | public Matcher appendReplacement(StringBuffer sb, String replacement) <br> Implements a non-terminal append and replace step. |
| 2 | public StringBuffer appendTail(StringBuffer sb) <br> Implements a terminal append and replace step. |
| 3 | public String replaceAll(String replacement) <br> Replaces every subsequence of the input sequence that matches the pattern with the given replacement string. |
| 4 | public String replaceFirst(String replacement) <br> Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string. |
| 5 | public static String quoteReplacement(String s) <br> Returns a literal replacement string for the specified string. This method returns a string that will work as a literal replacement in the appendReplacement method of the Matcher class. |
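
In a replacement string, $ followed by a digit refers to a capturing group, so a replacement that should contain a literal $ needs quoteReplacement. The following is a minimal sketch; the class name `QuoteReplacementDemo` and both helper methods are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteReplacementDemo {
    // Hypothetical helper: replaces every match with the replacement taken literally.
    static String replaceLiteral(String input, String regex, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.replaceAll(Matcher.quoteReplacement(replacement));
    }

    // Hypothetical helper: here $1 and $2 are meant as group references, so no quoting is applied.
    static String swapNames(String input) {
        return Pattern.compile("(\\w+) (\\w+)").matcher(input).replaceAll("$2, $1");
    }

    public static void main(String[] args) {
        System.out.println(replaceLiteral("Total: AMOUNT", "AMOUNT", "$5")); // Total: $5
        System.out.println(swapNames("John Smith"));                         // Smith, John
    }
}
```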

## start and end Methods

Here is an example that counts the number of times the word "cat" appears in the input string:

## RegexMatches.java File Code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches {
    private static final String REGEX = "\\bcat\\b";
    private static final String INPUT = "cat cat cat cattie cat";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT); // Get the matcher object
        int count = 0;

        while (m.find()) {
            count++;
            System.out.println("Match number " + count);
            System.out.println("start(): " + m.start());
            System.out.println("end(): " + m.end());
        }
    }
}
```

The above example compiles and runs with the following results:

Match number 1
start(): 0
end(): 3
Match number 2
start(): 4
end(): 7
Match number 3
start(): 8
end(): 11
Match number 4
start(): 19
end(): 22

This example uses word boundaries to ensure that the letters "c", "a", "t" are not just substrings of longer words. It also provides useful information about the positions of matches in the input string.

The start method returns the index of the first character of the previous match, and the end method returns the index of the last matched character plus one.

matches and lookingAt Methods

Both matches and lookingAt methods attempt to match an input sequence against a pattern. The difference is that matches requires the entire sequence to match, while lookingAt does not.

The lookingAt method, although it does not require the entire string to match, must start matching from the first character.

Both methods always begin matching at the start of the input string.

We illustrate this functionality with the following example:

RegexMatches.java File Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    private static final String REGEX = "foo";
    private static final String INPUT = "fooooooooooooooooo";
    private static final String INPUT2 = "ooooofoooooooooooo";
    private static Pattern pattern;
    private static Matcher matcher;
    private static Matcher matcher2;

    public static void main(String[] args) {
        pattern = Pattern.compile(REGEX);
        matcher = pattern.matcher(INPUT);
        matcher2 = pattern.matcher(INPUT2);

        System.out.println("Current REGEX is: " + REGEX);
        System.out.println("Current INPUT is: " + INPUT);
        System.out.println("Current INPUT2 is: " + INPUT2);

        System.out.println("lookingAt(): " + matcher.lookingAt());
        System.out.println("matches(): " + matcher.matches());
        System.out.println("lookingAt(): " + matcher2.lookingAt());
    }
}

The above example compiles and runs with the following results:

Current REGEX is: foo
Current INPUT is: fooooooooooooooooo
Current INPUT2 is: ooooofoooooooooooo
lookingAt(): true
matches(): false
lookingAt(): false

replaceFirst and replaceAll Methods

The replaceFirst and replaceAll methods are used to replace text that matches a regular expression. The difference is that replaceFirst replaces the first match, and replaceAll replaces all matches.

The following example illustrates this functionality:

RegexMatches.java File Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
    private static String REGEX = "dog";
    private static String INPUT = "The dog says meow. All dogs say meow.";
    private static String REPLACE = "cat";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        Matcher m = p.matcher(INPUT);
        INPUT = m.replaceAll(REPLACE);
        System.out.println(INPUT);
    }
}

The above example compiles and runs with the following results:

The cat says meow. All cats say meow.
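
The example above only calls replaceAll. A minimal sketch that puts the two methods side by side on the same input follows; the class name `ReplaceFirstDemo` and both helper methods are hypothetical names for this illustration.

```java
import java.util.regex.Pattern;

class ReplaceFirstDemo {
    static final Pattern DOG = Pattern.compile("dog");

    // Replaces only the first occurrence of "dog".
    static String first(String input) {
        return DOG.matcher(input).replaceFirst("cat");
    }

    // Replaces every occurrence of "dog".
    static String all(String input) {
        return DOG.matcher(input).replaceAll("cat");
    }

    public static void main(String[] args) {
        String input = "The dog says meow. All dogs say meow.";
        System.out.println(first(input)); // The cat says meow. All dogs say meow.
        System.out.println(all(input));   // The cat says meow. All cats say meow.
    }
}
```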

appendReplacement and appendTail Methods

The Matcher class also provides the appendReplacement and appendTail methods for text replacement. The following example illustrates them:

RegexMatches.java File Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMatches
{
   private static String REGEX = "a*b";
   private static String INPUT = "aabfooaabfooabfoobkkk";
   private static String REPLACE = "-";
   public static void main(String[] args) {
      Pattern p = Pattern.compile(REGEX);
      // Get matcher object
      Matcher m = p.matcher(INPUT);
      StringBuffer sb = new StringBuffer();
      while(m.find()){
         m.appendReplacement(sb,REPLACE);
      }
      m.appendTail(sb);
      System.out.println(sb.toString());
   }
}

The above example compiles and runs with the following result:

-foo-foo-foo-kkk
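
One reason to use appendReplacement instead of replaceAll is that the replacement can be computed for each match. The following sketch doubles every number in a string; the class name `AppendReplacementDemo` and the helper `doubleNumbers` are hypothetical names for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Hypothetical helper: replaces every run of digits with twice its value.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Appends the text between the previous match and this one, then the replacement.
            // The replacement contains only digits, so it needs no quoteReplacement call.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        m.appendTail(sb); // appends whatever follows the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 10 pears")); // 6 apples and 20 pears
    }
}
```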

PatternSyntaxException Class Methods

PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern.

PatternSyntaxException class provides the following methods to help us see what went wrong.

| Number | Method and Description |
| --- | --- |
| 1 | public String getDescription() <br> Gets the description of the error. |
| 2 | public int getIndex() <br> Gets the error index. |
| 3 | public String getPattern() <br> Gets the erroneous regular expression pattern. |
| 4 | public String getMessage() <br> Returns a multi-line string containing the description of the syntax error and its index, the erroneous regular expression pattern, and a visual indication of the error index within the pattern. |
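
These methods can be exercised by compiling a deliberately broken pattern. The following is a minimal sketch; the class name `SyntaxErrorDemo`, the helper `tryCompile`, and the sample pattern "(unclosed" are made up for this illustration, and the exact wording of the messages depends on the Java runtime.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {
    // Hypothetical helper: returns the exception thrown while compiling, or null if the regex is valid.
    static PatternSyntaxException tryCompile(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            return e;
        }
    }

    public static void main(String[] args) {
        PatternSyntaxException e = tryCompile("(unclosed"); // the opening parenthesis is never closed
        if (e != null) {
            System.out.println("Description: " + e.getDescription());
            System.out.println("Index: " + e.getIndex());
            System.out.println("Pattern: " + e.getPattern());
            System.out.println("Message: " + e.getMessage());
        }
    }
}
```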