Julia Regular Expressions
Regular expressions (regexes) describe a pattern for matching strings, which can be used to check if a string contains a certain substring, replace matching substrings, or extract substrings that meet certain criteria.
Julia uses Perl-compatible regular expressions, as provided by the PCRE library.
Unlike Perl, Julia has no m//, s/// or tr/// operators, and no =~ or !~ operators. A regular expression is an ordinary value of type Regex that you pass to functions:
- Testing for a match: occursin
- Extracting match information: match
- Replacing matched text: replace
In Julia, regular expression literals are written as strings prefixed with r:
Example
julia> re = r"^\s*(?:#|$)"
r"^\s*(?:#|$)"
julia> typeof(re)
Regex
To check if a regular expression matches a string, use occursin:
Example
julia> occursin(r"^\s*(?:#|$)", "not a comment")
false
julia> occursin(r"^\s*(?:#|$)", "# a comment")
true
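As a minimal sketch of how occursin might be used in practice, the following drops blank lines and comment lines from a list of strings. The helper name is_blank_or_comment and the sample data are hypothetical, chosen only for illustration.

```julia
# Hypothetical helper: true when a line is blank or starts with '#'
# (after optional leading whitespace).
is_blank_or_comment(line) = occursin(r"^\s*(?:#|$)", line)

# Sample input, assumed for illustration.
lines = ["# a comment", "x = 1", "", "   # indented comment", "y = 2"]

# Keep only the lines that carry actual content.
code_lines = filter(!is_blank_or_comment, lines)
println(code_lines)   # prints ["x = 1", "y = 2"]
```

Because occursin returns a plain Bool, it composes directly with filter, any, all, and similar functions.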
occursin only returns true or false, indicating whether the given regular expression is found in the string. Often, however, we want to know not just whether there is a match, but how it matches. To capture match information, use the match function:
Example
julia> match(r"^\s*(?:#|$)", "not a comment")
julia> match(r"^\s*(?:#|$)", "# a comment")
RegexMatch("#")
If the regular expression does not match the given string, match returns nothing, a special value that prints nothing at the interactive prompt. Despite not printing, it is a completely normal value and can be tested programmatically:
Example
line = "# a comment"   # sample input, assumed for illustration
m = match(r"^\s*(?:#|$)", line)
if m === nothing
    println("not a comment")
else
    println("blank or comment")
end
If the regular expression matches, the return value of match is a RegexMatch object. These objects record how the expression matched, including the substring that matched the pattern and any captured substrings. The example above only captures the part of the string that matches, but perhaps we want to capture any non-blank text following the comment character. We can do this:
Example
julia> m = match(r"^\s*(?:#\s*(.*?)\s*$|$)", "# a comment ")
RegexMatch("# a comment ", 1="a comment")
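The capture group above can be wrapped in a small function that returns the comment text directly. This is only a sketch: the name comment_text is hypothetical and not part of Julia.

```julia
# Hypothetical helper: return the text of a comment line, or nothing
# when the line is blank or is not a comment at all.
function comment_text(line)
    m = match(r"^\s*(?:#\s*(.*?)\s*$|$)", line)
    m === nothing && return nothing   # line contains code, not a comment
    return m.captures[1]              # nothing when the line is blank
end

comment_text("# a comment ")   # "a comment"
comment_text("x = 1")          # nothing
```

Note that a blank line still matches the pattern through the second alternative, in which case the capture group does not participate and its value is nothing.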
When calling match, you can optionally specify the index at which to start the search. For example:
Example
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",1)
RegexMatch("1")
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",6)
RegexMatch("2")
julia> m = match(r"[0-9]","aaaa1aaaa2aaaa3",11)
RegexMatch("3")
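The start index makes it possible to walk through a string one match at a time. The sketch below assumes a hypothetical helper named all_matches and a pattern that never matches the empty string; Julia's built-in eachmatch offers the same iteration without the manual bookkeeping.

```julia
# Hypothetical helper: collect every match of `re` in `s` by calling
# match repeatedly with an updated start index.
function all_matches(re, s)
    found = String[]
    pos = 1
    while pos <= ncodeunits(s)
        m = match(re, s, pos)
        m === nothing && break
        push!(found, m.match)
        # Resume just past the end of this match
        # (assumes the pattern never matches an empty string).
        pos = m.offset + ncodeunits(m.match)
    end
    return found
end

manual = all_matches(r"[0-9]", "aaaa1aaaa2aaaa3")   # ["1", "2", "3"]

# The same result using the built-in iterator.
builtin = [m.match for m in eachmatch(r"[0-9]", "aaaa1aaaa2aaaa3")]
```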
You can extract the following information from a RegexMatch object:
- The entire matched substring: m.match
- The captured substrings as an array of strings: m.captures
- The offset at which the whole match begins: m.offset
- The offsets of the captured substrings as a vector: m.offsets
When a capture group does not match, m.captures contains nothing in that position instead of a substring, and m.offsets contains an offset of 0 (recall that Julia's indexing starts at 1, so a zero offset into a string is invalid). Here are two somewhat contrived examples:
Example
julia> m = match(r"(a|b)(c)?(d)", "acd")
RegexMatch("acd", 1="a", 2="c", 3="d")
julia> m.match
"acd"
julia> m.captures
3-element Vector{Union{Nothing, SubString{String}}}:
"a"
"c"
"d"
julia> m.offset
1
julia> m.offsets
3-element Vector{Int64}:
1
2
3
julia> m = match(r"(a|b)(c)?(d)", "ad")
RegexMatch("ad", 1="a", 2=nothing, 3="d")
julia> m.match
"ad"
julia> m.captures
3-element Vector{Union{Nothing, SubString{String}}}:
"a"
nothing
"d"
julia> m.offset
1
julia> m.offsets
3-element Vector{Int64}:
1
0
2
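Because an unmatched group yields nothing, code that consumes captures usually needs a fallback. One possible approach, shown here as a sketch, uses the standard something function; the placeholder string "<none>" is an arbitrary choice for illustration.

```julia
m = match(r"(a|b)(c)?(d)", "ad")

# Group 2 did not participate in the match.
m.captures[2] === nothing   # true
m.offsets[2] == 0           # true

# `something` returns its first argument that is not `nothing`;
# "<none>" is an arbitrary placeholder chosen for this sketch.
label = something(m.captures[2], "<none>")
println(label)              # prints <none>
```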
Returning captures as an array is convenient, allowing them to be bound to local variables using destructuring syntax. For convenience, the RegexMatch object implements an iterator method that passes through to the captures field, so you can directly destructure the match object:
Example
julia> first, second, third = m; first
"a"
Captures can also be accessed by indexing the RegexMatch object with the number or name of the capture group:
Example
julia> m = match(r"(?<hour>\d+):(?<minute>\d+)", "12:45")
RegexMatch("12:45", hour="12", minute="45")
julia> m[:minute]
"45"
julia> m[2]
"45"
When using replace, captures can be referenced in the replacement string by prefixing that string with s and writing \n for the nth capture group. Capture group 0 refers to the entire match. Named capture groups can be referenced in the replacement with \g<groupname>.
julia> replace("first second", r"(\w+) (?<agroup>\w+)" => s"\g<agroup> \1")
"second first"
Numbered capture groups can also be referenced as \g<n>, where n is the group number. This avoids ambiguity when the reference is followed directly by a digit:
julia> replace("a", r"." => s"\g<0>1")
"a1"
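You can modify the behavior of a regular expression by appending any combination of the flags i, m, s, and x after the closing double quote. The i flag makes matching case-insensitive, m makes ^ and $ match at the start and end of each line, s lets . match a newline, and x ignores unescaped whitespace in the pattern and allows # comments.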
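The effect of each flag can be seen by comparing a pattern with and without it. The inputs below are made up for illustration.

```julia
# i: ignore case
occursin(r"hello", "HELLO")     # false
occursin(r"hello"i, "HELLO")    # true

# m: ^ and $ match at the start and end of each line
occursin(r"^b", "a\nb")         # false
occursin(r"^b"m, "a\nb")        # true

# s: . also matches a newline
occursin(r"a.b", "a\nb")        # false
occursin(r"a.b"s, "a\nb")       # true

# x: unescaped whitespace in the pattern is ignored
occursin(r"(\d+) : (\d+)"x, "12:45")   # true
```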
For more information on regular expressions, refer to: Regular Expressions - Tutorial