pub struct Regex { /* private fields */ }
A compiled regular expression for searching Unicode haystacks.
A Regex can be used to search haystacks, split haystacks into substrings or replace substrings in a haystack with a different substring. All searching is done with an implicit (?s:.)*? at the beginning and end of a pattern. To force an expression to match the whole string (or a prefix or a suffix), you must use an anchor like ^ or $ (or \A and \z).
Like the Regex type in the parent module, matches with this regex return byte offsets into the haystack. Unlike the parent Regex type, these byte offsets may not correspond to UTF-8 sequence boundaries since the regexes in this module can match arbitrary bytes.
The only methods that allocate new byte strings are the string replacement methods. All other methods (searching and splitting) return borrowed references into the haystack given.
§Example
Find the offsets of a US phone number:
use regex::bytes::Regex;
let re = Regex::new("[0-9]{3}-[0-9]{3}-[0-9]{4}").unwrap();
let m = re.find(b"phone: 111-222-3333").unwrap();
assert_eq!(7..19, m.range());
§Example: extracting capture groups
A common way to use regexes is with capture groups. That is, instead of just looking for matches of an entire regex, parentheses are used to create groups that represent part of the match.
For example, consider a haystack with multiple lines, and each line has three whitespace delimited fields where the second field is expected to be a number and the third field a boolean. To make this convenient, we use the Captures::extract API to put the strings that match each group into a fixed size array:
use regex::bytes::Regex;
let hay = b"
rabbit 54 true
groundhog 2 true
does not match
fox 109 false
";
let re = Regex::new(r"(?m)^\s*(\S+)\s+([0-9]+)\s+(true|false)\s*$").unwrap();
let mut fields: Vec<(&[u8], i64, bool)> = vec![];
for (_, [f1, f2, f3]) in re.captures_iter(hay).map(|caps| caps.extract()) {
    // These unwraps are OK because our pattern is written in a way where
    // all matches for f2 and f3 will be valid UTF-8.
    let f2 = std::str::from_utf8(f2).unwrap();
    let f3 = std::str::from_utf8(f3).unwrap();
    // The parses succeed for every line of this haystack that matches.
    fields.push((f1, f2.parse().unwrap(), f3.parse().unwrap()));
}
assert_eq!(fields, vec![
(&b"rabbit"[..], 54, true),
(&b"groundhog"[..], 2, true),
(&b"fox"[..], 109, false),
]);
§Example: matching invalid UTF-8
One of the reasons for searching &[u8] haystacks is that the &[u8] might not be valid UTF-8. Indeed, with a bytes::Regex, patterns that match invalid UTF-8 are explicitly allowed. Here’s one example that looks for valid UTF-8 fields that might be separated by invalid UTF-8. In this case, we use (?s-u:.), which matches any byte. Attempting to use it in a top-level Regex will result in the regex failing to compile. Notice also that we use . with Unicode mode enabled, in which case, only valid UTF-8 is matched. In this way, we can build one pattern where some parts only match valid UTF-8 while other parts are more permissive.
use regex::bytes::Regex;
// F0 9F 92 A9 is the UTF-8 encoding for a Pile of Poo.
let hay = b"\xFF\xFFfoo\xFF\xFF\xFF\xF0\x9F\x92\xA9\xFF";
// An equivalent to '(?s-u:.)' is '(?-u:[\x00-\xFF])'.
let re = Regex::new(r"(?s)(?-u:.)*?(?<f1>.+)(?-u:.)*?(?<f2>.+)").unwrap();
let caps = re.captures(hay).unwrap();
assert_eq!(&caps["f1"], &b"foo"[..]);
assert_eq!(&caps["f2"], "💩".as_bytes());
Implementations§
impl Regex
Core regular expression methods.
pub fn new(re: &str) -> Result<Regex, Error>
Compiles a regular expression. Once compiled, it can be used repeatedly to search, split or replace substrings in a haystack.
Note that regex compilation tends to be a somewhat expensive process, and unlike higher level environments, compilation is not automatically cached for you. One should endeavor to compile a regex once and then reuse it. For example, it’s a bad idea to compile the same regex repeatedly in a loop.
§Errors
If an invalid pattern is given, then an error is returned.
An error is also returned if the pattern is valid, but would
produce a regex that is bigger than the configured size limit via
RegexBuilder::size_limit
. (A reasonable size limit is enabled by
default.)
§Example
use regex::bytes::Regex;
// An invalid pattern because of an unclosed parenthesis
assert!(Regex::new(r"foo(bar").is_err());
// An invalid pattern because the regex would be too big
// because Unicode tends to inflate things.
assert!(Regex::new(r"\w{1000}").is_err());
// Disabling Unicode can make the regex much smaller,
// potentially by up to or more than an order of magnitude.
assert!(Regex::new(r"(?-u:\w){1000}").is_ok());
pub fn is_match(&self, haystack: &[u8]) -> bool
Returns true if and only if there is a match for the regex anywhere in the haystack given.
It is recommended to use this method if all you need to do is test whether a match exists, since the underlying matching engine may be able to do less work.
§Example
Test if some haystack contains at least one word with exactly 13 Unicode word characters:
use regex::bytes::Regex;
let re = Regex::new(r"\b\w{13}\b").unwrap();
let hay = b"I categorically deny having triskaidekaphobia.";
assert!(re.is_match(hay));
pub fn find<'h>(&self, haystack: &'h [u8]) -> Option<Match<'h>>
This routine searches for the first match of this regex in the
haystack given, and if found, returns a Match
. The Match
provides access to both the byte offsets of the match and the actual
substring that matched.
Note that this should only be used if you want to find the entire
match. If instead you just want to test the existence of a match,
it’s potentially faster to use Regex::is_match(hay)
instead of
Regex::find(hay).is_some()
.
§Example
Find the first word with exactly 13 Unicode word characters:
use regex::bytes::Regex;
let re = Regex::new(r"\b\w{13}\b").unwrap();
let hay = b"I categorically deny having triskaidekaphobia.";
let mat = re.find(hay).unwrap();
assert_eq!(2..15, mat.range());
assert_eq!(b"categorically", mat.as_bytes());
pub fn find_iter<'r, 'h>(&'r self, haystack: &'h [u8]) -> Matches<'r, 'h>
Returns an iterator that yields successive non-overlapping matches in
the given haystack. The iterator yields values of type Match
.
§Time complexity
Note that since find_iter
runs potentially many searches on the
haystack and since each search has worst case O(m * n)
time
complexity, the overall worst case time complexity for iteration is
O(m * n^2)
.
§Example
Find every word with exactly 13 Unicode word characters:
use regex::bytes::Regex;
let re = Regex::new(r"\b\w{13}\b").unwrap();
let hay = b"Retroactively relinquishing remunerations is reprehensible.";
let matches: Vec<_> = re.find_iter(hay).map(|m| m.as_bytes()).collect();
assert_eq!(matches, vec![
&b"Retroactively"[..],
&b"relinquishing"[..],
&b"remunerations"[..],
&b"reprehensible"[..],
]);
pub fn captures<'h>(&self, haystack: &'h [u8]) -> Option<Captures<'h>>
This routine searches for the first match of this regex in the haystack
given, and if found, returns not only the overall match but also the
matches of each capture group in the regex. If no match is found, then
None
is returned.
Capture group 0
always corresponds to an implicit unnamed group that
includes the entire match. If a match is found, this group is always
present. Subsequent groups may be named and are numbered, starting
at 1, by the order in which the opening parenthesis appears in the
pattern. For example, in the pattern (?<a>.(?<b>.))(?<c>.)
, a
,
b
and c
correspond to capture group indices 1
, 2
and 3
,
respectively.
You should only use captures
if you need access to the capture group
matches. Otherwise, Regex::find
is generally faster for discovering
just the overall match.
§Example
Say you have some haystack with movie names and their release years, like “‘Citizen Kane’ (1941)”. It’d be nice if we could search for strings looking like that, while also extracting the movie name and its release year separately. The example below shows how to do that.
use regex::bytes::Regex;
let re = Regex::new(r"'([^']+)'\s+\((\d{4})\)").unwrap();
let hay = b"Not my favorite movie: 'Citizen Kane' (1941).";
let caps = re.captures(hay).unwrap();
assert_eq!(caps.get(0).unwrap().as_bytes(), b"'Citizen Kane' (1941)");
assert_eq!(caps.get(1).unwrap().as_bytes(), b"Citizen Kane");
assert_eq!(caps.get(2).unwrap().as_bytes(), b"1941");
// You can also access the groups by index using the Index notation.
// Note that this will panic on an invalid index. In this case, these
// accesses are always correct because the overall regex will only
// match when these capture groups match.
assert_eq!(&caps[0], b"'Citizen Kane' (1941)");
assert_eq!(&caps[1], b"Citizen Kane");
assert_eq!(&caps[2], b"1941");
Note that the full match is at capture group 0
. Each subsequent
capture group is indexed by the order of its opening (
.
We can make this example a bit clearer by using named capture groups:
use regex::bytes::Regex;
let re = Regex::new(r"'(?<title>[^']+)'\s+\((?<year>\d{4})\)").unwrap();
let hay = b"Not my favorite movie: 'Citizen Kane' (1941).";
let caps = re.captures(hay).unwrap();
assert_eq!(caps.get(0).unwrap().as_bytes(), b"'Citizen Kane' (1941)");
assert_eq!(caps.name("title").unwrap().as_bytes(), b"Citizen Kane");
assert_eq!(caps.name("year").unwrap().as_bytes(), b"1941");
// You can also access the groups by name using the Index notation.
// Note that this will panic on an invalid group name. In this case,
// these accesses are always correct because the overall regex will
// only match when these capture groups match.
assert_eq!(&caps[0], b"'Citizen Kane' (1941)");
assert_eq!(&caps["title"], b"Citizen Kane");
assert_eq!(&caps["year"], b"1941");
Here we name the capture groups, which we can access with the name
method or the Index
notation with a &str
. Note that the named
capture groups are still accessible with get
or the Index
notation
with a usize
.
The 0
th capture group is always unnamed, so it must always be
accessed with get(0)
or [0]
.
Finally, one other way to get the matched substrings is with the Captures::extract API:
use regex::bytes::Regex;
let re = Regex::new(r"'([^']+)'\s+\((\d{4})\)").unwrap();
let hay = b"Not my favorite movie: 'Citizen Kane' (1941).";
let (full, [title, year]) = re.captures(hay).unwrap().extract();
assert_eq!(full, b"'Citizen Kane' (1941)");
assert_eq!(title, b"Citizen Kane");
assert_eq!(year, b"1941");
pub fn captures_iter<'r, 'h>(&'r self, haystack: &'h [u8]) -> CaptureMatches<'r, 'h>
Returns an iterator that yields successive non-overlapping matches in
the given haystack. The iterator yields values of type Captures
.
This is the same as Regex::find_iter, but instead of only providing access to the overall match, each value yielded includes access to the matches of all capture groups in the regex. Reporting this extra match data is potentially costly, so callers should only use captures_iter over find_iter when they actually need access to the capture group matches.
§Time complexity
Note that since captures_iter
runs potentially many searches on the
haystack and since each search has worst case O(m * n)
time
complexity, the overall worst case time complexity for iteration is
O(m * n^2)
.
§Example
We can use this to find all movie titles and their release years in some haystack, where the movie is formatted like “‘Title’ (xxxx)”:
use regex::bytes::Regex;
let re = Regex::new(r"'([^']+)'\s+\(([0-9]{4})\)").unwrap();
let hay = b"'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931).";
let mut movies = vec![];
for (_, [title, year]) in re.captures_iter(hay).map(|c| c.extract()) {
    // OK because [0-9]{4} can only match valid UTF-8.
    let year = std::str::from_utf8(year).unwrap();
    // OK because four ASCII digits always fit in an i64.
    movies.push((title, year.parse::<i64>().unwrap()));
}
assert_eq!(movies, vec![
(&b"Citizen Kane"[..], 1941),
(&b"The Wizard of Oz"[..], 1939),
(&b"M"[..], 1931),
]);
Or with named groups:
use regex::bytes::Regex;
let re = Regex::new(r"'(?<title>[^']+)'\s+\((?<year>[0-9]{4})\)").unwrap();
let hay = b"'Citizen Kane' (1941), 'The Wizard of Oz' (1939), 'M' (1931).";
let mut it = re.captures_iter(hay);
let caps = it.next().unwrap();
assert_eq!(&caps["title"], b"Citizen Kane");
assert_eq!(&caps["year"], b"1941");
let caps = it.next().unwrap();
assert_eq!(&caps["title"], b"The Wizard of Oz");
assert_eq!(&caps["year"], b"1939");
let caps = it.next().unwrap();
assert_eq!(&caps["title"], b"M");
assert_eq!(&caps["year"], b"1931");
pub fn split<'r, 'h>(&'r self, haystack: &'h [u8]) -> Split<'r, 'h>
Returns an iterator of substrings of the haystack given, delimited by a match of the regex. Namely, each element of the iterator corresponds to a part of the haystack that isn’t matched by the regular expression.
§Time complexity
Since iterating over all matches requires running potentially many searches on the haystack, and since each search has worst case O(m * n) time complexity, the overall worst case time complexity for this routine is O(m * n^2).
§Example
To split a string delimited by arbitrary amounts of spaces or tabs:
use regex::bytes::Regex;
let re = Regex::new(r"[ \t]+").unwrap();
let hay = b"a b \t c\td e";
let fields: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(fields, vec![
&b"a"[..], &b"b"[..], &b"c"[..], &b"d"[..], &b"e"[..],
]);
§Example: more cases
Basic usage:
use regex::bytes::Regex;
let re = Regex::new(r" ").unwrap();
let hay = b"Mary had a little lamb";
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![
&b"Mary"[..], &b"had"[..], &b"a"[..], &b"little"[..], &b"lamb"[..],
]);
let re = Regex::new(r"X").unwrap();
let hay = b"";
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![&b""[..]]);
let re = Regex::new(r"X").unwrap();
let hay = b"lionXXtigerXleopard";
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![
&b"lion"[..], &b""[..], &b"tiger"[..], &b"leopard"[..],
]);
let re = Regex::new(r"::").unwrap();
let hay = b"lion::tiger::leopard";
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![&b"lion"[..], &b"tiger"[..], &b"leopard"[..]]);
If a haystack contains multiple contiguous matches, you will end up with empty spans yielded by the iterator:
use regex::bytes::Regex;
let re = Regex::new(r"X").unwrap();
let hay = b"XXXXaXXbXc";
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![
&b""[..], &b""[..], &b""[..], &b""[..],
&b"a"[..], &b""[..], &b"b"[..], &b"c"[..],
]);
let re = Regex::new(r"/").unwrap();
let hay = b"(///)";
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![&b"("[..], &b""[..], &b""[..], &b")"[..]]);
Separators at the start or end of a haystack are neighbored by empty substrings:
use regex::bytes::Regex;
let re = Regex::new(r"0").unwrap();
let hay = b"010";
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![&b""[..], &b"1"[..], &b""[..]]);
When the regex can match the empty string, it splits at every byte position in the haystack. This includes between all UTF-8 code units. (The top-level Regex::split will only split at valid UTF-8 boundaries.)
use regex::bytes::Regex;
let re = Regex::new(r"").unwrap();
let hay = "☃".as_bytes();
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![
&[][..], &[b'\xE2'][..], &[b'\x98'][..], &[b'\x83'][..], &[][..],
]);
Contiguous separators (which commonly show up with whitespace) can lead to possibly surprising behavior. For example, this code is correct:
use regex::bytes::Regex;
let re = Regex::new(r" ").unwrap();
let hay = b" a b c";
let got: Vec<&[u8]> = re.split(hay).collect();
assert_eq!(got, vec![
&b""[..], &b""[..], &b""[..], &b""[..],
&b"a"[..], &b""[..], &b"b"[..], &b"c"[..],
]);
It does not give you ["a", "b", "c"]
. For that behavior, you’d want
to match contiguous space characters:
use regex::bytes::Regex;
let re = Regex::new(r" +").unwrap();
let hay = b" a b c";
let got: Vec<&[u8]> = re.split(hay).collect();
// N.B. This does still include a leading empty span because ' +'
// matches at the beginning of the haystack.
assert_eq!(got, vec![&b""[..], &b"a"[..], &b"b"[..], &b"c"[..]]);
pub fn splitn<'r, 'h>(&'r self, haystack: &'h [u8], limit: usize) -> SplitN<'r, 'h>
Returns an iterator of at most limit
substrings of the haystack
given, delimited by a match of the regex. (A limit
of 0
will return
no substrings.) Namely, each element of the iterator corresponds to a
part of the haystack that isn’t matched by the regular expression.
The remainder of the haystack that is not split will be the last
element in the iterator.
§Time complexity
Since iterating over all matches requires running potentially many searches on the haystack, and since each search has worst case O(m * n) time complexity, the overall worst case time complexity for this routine is O(m * n^2).
Note, however, that the worst case time here has an upper bound given by the limit parameter.
§Example
Get the first two words in some haystack (the third element holds the unsplit remainder):
use regex::bytes::Regex;
let re = Regex::new(r"\W+").unwrap();
let hay = b"Hey! How are you?";
let fields: Vec<&[u8]> = re.splitn(hay, 3).collect();
assert_eq!(fields, vec![&b"Hey"[..], &b"How"[..], &b"are you?"[..]]);
§Examples: more cases
use regex::bytes::Regex;
let re = Regex::new(r" ").unwrap();
let hay = b"Mary had a little lamb";
let got: Vec<&[u8]> = re.splitn(hay, 3).collect();
assert_eq!(got, vec![&b"Mary"[..], &b"had"[..], &b"a little lamb"[..]]);
let re = Regex::new(r"X").unwrap();
let hay = b"";
let got: Vec<&[u8]> = re.splitn(hay, 3).collect();
assert_eq!(got, vec![&b""[..]]);
let re = Regex::new(r"X").unwrap();
let hay = b"lionXXtigerXleopard";
let got: Vec<&[u8]> = re.splitn(hay, 3).collect();
assert_eq!(got, vec![&b"lion"[..], &b""[..], &b"tigerXleopard"[..]]);
let re = Regex::new(r"::").unwrap();
let hay = b"lion::tiger::leopard";
let got: Vec<&[u8]> = re.splitn(hay, 2).collect();
assert_eq!(got, vec![&b"lion"[..], &b"tiger::leopard"[..]]);
let re = Regex::new(r"X").unwrap();
let hay = b"abcXdef";
let got: Vec<&[u8]> = re.splitn(hay, 1).collect();
assert_eq!(got, vec![&b"abcXdef"[..]]);
let re = Regex::new(r"X").unwrap();
let hay = b"abcdef";
let got: Vec<&[u8]> = re.splitn(hay, 2).collect();
assert_eq!(got, vec![&b"abcdef"[..]]);
let re = Regex::new(r"X").unwrap();
let hay = b"abcXdef";
let got: Vec<&[u8]> = re.splitn(hay, 0).collect();
assert!(got.is_empty());
pub fn replace<'h, R: Replacer>( &self, haystack: &'h [u8], rep: R ) -> Cow<'h, [u8]>
Replaces the leftmost-first match in the given haystack with the
replacement provided. The replacement can be a regular string (where
$N
and $name
are expanded to match capture groups) or a function
that takes a Captures
and returns the replaced string.
If no match is found, then the haystack is returned unchanged. In that
case, this implementation will likely return a Cow::Borrowed
value
such that no allocation is performed.
When a Cow::Borrowed
is returned, the value returned is guaranteed
to be equivalent to the haystack
given.
§Replacement string syntax
All instances of $ref
in the replacement string are replaced with
the substring corresponding to the capture group identified by ref
.
ref
may be an integer corresponding to the index of the capture group
(counted by order of opening parenthesis where 0
is the entire match)
or it can be a name (consisting of letters, digits or underscores)
corresponding to a named capture group.
If ref
isn’t a valid capture group (whether the name doesn’t exist or
isn’t a valid index), then it is replaced with the empty string.
The longest possible name is used. For example, $1a
looks up the
capture group named 1a
and not the capture group at index 1
. To
exert more precise control over the name, use braces, e.g., ${1}a
.
To write a literal $
use $$
.
§Example
Note that this function is polymorphic with respect to the replacement. In typical usage, this can just be a normal string:
use regex::bytes::Regex;
let re = Regex::new(r"[^01]+").unwrap();
assert_eq!(re.replace(b"1078910", b""), &b"1010"[..]);
But anything satisfying the Replacer trait will work. For example, a closure of type |&Captures| -> Vec<u8> (or any other return type that implements AsRef<[u8]>, such as String) provides direct access to the captures corresponding to a match. This allows one to access capturing group matches easily:
use regex::bytes::{Captures, Regex};
let re = Regex::new(r"([^,\s]+),\s+(\S+)").unwrap();
let result = re.replace(b"Springsteen, Bruce", |caps: &Captures| {
let mut buf = vec![];
buf.extend_from_slice(&caps[2]);
buf.push(b' ');
buf.extend_from_slice(&caps[1]);
buf
});
assert_eq!(result, &b"Bruce Springsteen"[..]);
But this is a bit cumbersome to use all the time. Instead, a simple
syntax is supported (as described above) that expands $name
into the
corresponding capture group. Here’s the last example, but using this
expansion technique with named capture groups:
use regex::bytes::Regex;
let re = Regex::new(r"(?<last>[^,\s]+),\s+(?<first>\S+)").unwrap();
let result = re.replace(b"Springsteen, Bruce", b"$first $last");
assert_eq!(result, &b"Bruce Springsteen"[..]);
Note that using $2
instead of $first
or $1
instead of $last
would produce the same result. To write a literal $
use $$
.
Sometimes the replacement string requires use of curly braces to delineate a capture group replacement when it is adjacent to some other literal text. For example, if we wanted to join two words together with an underscore:
use regex::bytes::Regex;
let re = Regex::new(r"(?<first>\w+)\s+(?<second>\w+)").unwrap();
let result = re.replace(b"deep fried", b"${first}_$second");
assert_eq!(result, &b"deep_fried"[..]);
Without the curly braces, the capture group name first_
would be
used, and since it doesn’t exist, it would be replaced with the empty
string.
Finally, sometimes you just want to replace a literal string with no
regard for capturing group expansion. This can be done by wrapping a
string with NoExpand
:
use regex::bytes::{NoExpand, Regex};
let re = Regex::new(r"(?<last>[^,\s]+),\s+(\S+)").unwrap();
let result = re.replace(b"Springsteen, Bruce", NoExpand(b"$2 $last"));
assert_eq!(result, &b"$2 $last"[..]);
Using NoExpand
may also be faster, since the replacement string won’t
need to be parsed for the $
syntax.
pub fn replace_all<'h, R: Replacer>( &self, haystack: &'h [u8], rep: R ) -> Cow<'h, [u8]>
Replaces all non-overlapping matches in the haystack with the
replacement provided. This is the same as calling replacen
with
limit
set to 0
.
If no match is found, then the haystack is returned unchanged. In that
case, this implementation will likely return a Cow::Borrowed
value
such that no allocation is performed.
When a Cow::Borrowed
is returned, the value returned is guaranteed
to be equivalent to the haystack
given.
The documentation for Regex::replace
goes into more detail about
what kinds of replacement strings are supported.
§Time complexity
Since iterating over all matches requires running potentially many searches on the haystack, and since each search has worst case O(m * n) time complexity, the overall worst case time complexity for this routine is O(m * n^2).
§Fallibility
If you need to write a replacement routine where any individual replacement might “fail,” doing so with this API isn’t really feasible because there’s no way to stop the search process if a replacement fails. Instead, if you need this functionality, you should consider implementing your own replacement routine:
use regex::bytes::{Captures, Regex};
fn replace_all<E>(
re: &Regex,
haystack: &[u8],
replacement: impl Fn(&Captures) -> Result<Vec<u8>, E>,
) -> Result<Vec<u8>, E> {
let mut new = Vec::with_capacity(haystack.len());
let mut last_match = 0;
for caps in re.captures_iter(haystack) {
let m = caps.get(0).unwrap();
new.extend_from_slice(&haystack[last_match..m.start()]);
new.extend_from_slice(&replacement(&caps)?);
last_match = m.end();
}
new.extend_from_slice(&haystack[last_match..]);
Ok(new)
}
// Let's replace each word with the number of bytes in that word.
// But if we see a word that is "too long," we'll give up.
let re = Regex::new(r"\w+").unwrap();
let replacement = |caps: &Captures| -> Result<Vec<u8>, &'static str> {
if caps[0].len() >= 5 {
return Err("word too long");
}
Ok(caps[0].len().to_string().into_bytes())
};
assert_eq!(
Ok(b"2 3 3 3?".to_vec()),
replace_all(&re, b"hi how are you?", &replacement),
);
assert!(replace_all(&re, b"hi there", &replacement).is_err());
§Example
This example shows how to flip the order of whitespace-delimited fields (where the whitespace excludes line terminators) and normalize the whitespace that delimits the fields:
use regex::bytes::Regex;
let re = Regex::new(r"(?m)^(\S+)[\s--\r\n]+(\S+)$").unwrap();
let hay = b"
Greetings 1973
Wild\t1973
BornToRun\t\t\t\t1975
Darkness 1978
TheRiver 1980
";
let new = re.replace_all(hay, b"$2 $1");
assert_eq!(new, &b"
1973 Greetings
1973 Wild
1975 BornToRun
1978 Darkness
1980 TheRiver
"[..]);
pub fn replacen<'h, R: Replacer>( &self, haystack: &'h [u8], limit: usize, rep: R ) -> Cow<'h, [u8]>
Replaces at most limit
non-overlapping matches in the haystack with
the replacement provided. If limit
is 0
, then all non-overlapping
matches are replaced. That is, Regex::replace_all(hay, rep)
is
equivalent to Regex::replacen(hay, 0, rep)
.
If no match is found, then the haystack is returned unchanged. In that
case, this implementation will likely return a Cow::Borrowed
value
such that no allocation is performed.
When a Cow::Borrowed
is returned, the value returned is guaranteed
to be equivalent to the haystack
given.
The documentation for Regex::replace
goes into more detail about
what kinds of replacement strings are supported.
§Time complexity
Since iterating over all matches requires running potentially many searches on the haystack, and since each search has worst case O(m * n) time complexity, the overall worst case time complexity for this routine is O(m * n^2).
Note, however, that the worst case time here has an upper bound given by the limit parameter.
§Fallibility
See the corresponding section in the docs for Regex::replace_all
for tips on how to deal with a replacement routine that can fail.
§Example
This example shows how to flip the order of whitespace-delimited fields (where the whitespace excludes line terminators) and normalize the whitespace that delimits the fields, but only for the first two matches:
use regex::bytes::Regex;
let re = Regex::new(r"(?m)^(\S+)[\s--\r\n]+(\S+)$").unwrap();
let hay = b"
Greetings 1973
Wild\t1973
BornToRun\t\t\t\t1975
Darkness 1978
TheRiver 1980
";
let new = re.replacen(hay, 2, b"$2 $1");
assert_eq!(new, &b"
1973 Greetings
1973 Wild
BornToRun\t\t\t\t1975
Darkness 1978
TheRiver 1980
"[..]);
impl Regex
A group of advanced or “lower level” search methods. Some methods permit
starting the search at a position greater than 0
in the haystack. Other
methods permit reusing allocations, for example, when extracting the
matches for capture groups.
pub fn shortest_match(&self, haystack: &[u8]) -> Option<usize>
Returns the end byte offset of the first match in the haystack given.
This method may have the same performance characteristics as is_match. Behaviorally, it doesn’t just report whether a match occurs, but also the end offset for a match. In particular, the offset returned may be shorter than the proper end of the leftmost-first match that you would find via Regex::find.
Note that it is not guaranteed that this routine finds the shortest or “earliest” possible match. Instead, the main idea of this API is that it returns the offset at the point at which the internal regex engine has determined that a match has occurred. This may vary depending on which internal regex engine is used, and thus, the offset itself may change based on internal heuristics.
§Example
Typically, a+
would match the entire first sequence of a
in some
haystack, but shortest_match
may give up as soon as it sees the
first a
.
use regex::bytes::Regex;
let re = Regex::new(r"a+").unwrap();
let offset = re.shortest_match(b"aaaaa").unwrap();
assert_eq!(offset, 1);
pub fn shortest_match_at(&self, haystack: &[u8], start: usize) -> Option<usize>
Returns the same as shortest_match
, but starts the search at the
given offset.
The significance of the starting point is that it takes the surrounding
context into consideration. For example, the \A
anchor can only match
when start == 0
.
If a match is found, the offset returned is relative to the beginning of the haystack, not the beginning of the search.
§Panics
This panics when start >= haystack.len() + 1
.
§Example
This example shows the significance of start
by demonstrating how it
can be used to permit look-around assertions in a regex to take the
surrounding context into account.
use regex::bytes::Regex;
let re = Regex::new(r"\bchew\b").unwrap();
let hay = b"eschew";
// We get a match here, but it's probably not intended.
assert_eq!(re.shortest_match(&hay[2..]), Some(4));
// No match because the assertions take the context into account.
assert_eq!(re.shortest_match_at(hay, 2), None);
pub fn is_match_at(&self, haystack: &[u8], start: usize) -> bool
Returns the same as Regex::is_match
, but starts the search at the
given offset.
The significance of the starting point is that it takes the surrounding
context into consideration. For example, the \A
anchor can only
match when start == 0
.
§Panics
This panics when start >= haystack.len() + 1
.
§Example
This example shows the significance of start
by demonstrating how it
can be used to permit look-around assertions in a regex to take the
surrounding context into account.
use regex::bytes::Regex;
let re = Regex::new(r"\bchew\b").unwrap();
let hay = b"eschew";
// We get a match here, but it's probably not intended.
assert!(re.is_match(&hay[2..]));
// No match because the assertions take the context into account.
assert!(!re.is_match_at(hay, 2));
pub fn find_at<'h>(&self, haystack: &'h [u8], start: usize) -> Option<Match<'h>>
Returns the same as Regex::find
, but starts the search at the given
offset.
The significance of the starting point is that it takes the surrounding
context into consideration. For example, the \A
anchor can only
match when start == 0
.
§Panics
This panics when start >= haystack.len() + 1
.
§Example
This example shows the significance of start
by demonstrating how it
can be used to permit look-around assertions in a regex to take the
surrounding context into account.
use regex::bytes::Regex;
let re = Regex::new(r"\bchew\b").unwrap();
let hay = b"eschew";
// We get a match here, but it's probably not intended.
assert_eq!(re.find(&hay[2..]).map(|m| m.range()), Some(0..4));
// No match because the assertions take the context into account.
assert_eq!(re.find_at(hay, 2), None);
pub fn captures_at<'h>( &self, haystack: &'h [u8], start: usize ) -> Option<Captures<'h>>
Returns the same as Regex::captures, but starts the search at the given offset.
The significance of the starting point is that it takes the surrounding context into consideration. For example, the \A anchor can only match when start == 0.
§Panics
This panics when start >= haystack.len() + 1.
§Example
This example shows the significance of start by demonstrating how it can be used to permit look-around assertions in a regex to take the surrounding context into account.
use regex::bytes::Regex;
let re = Regex::new(r"\bchew\b").unwrap();
let hay = b"eschew";
// We get a match here, but it's probably not intended.
assert_eq!(&re.captures(&hay[2..]).unwrap()[0], b"chew");
// No match because the assertions take the context into account.
assert!(re.captures_at(hay, 2).is_none());
pub fn captures_read<'h>(&self, locs: &mut CaptureLocations, haystack: &'h [u8]) -> Option<Match<'h>>
This is like Regex::captures, but writes the byte offsets of each capture group match into the locations given.
A CaptureLocations stores the same byte offsets as a Captures, but does not store a reference to the haystack. This makes its API a bit lower level and less convenient. But in exchange, callers may allocate their own CaptureLocations and reuse it for multiple searches. This may be helpful if allocating a Captures shows up in a profile as too costly.
To create a CaptureLocations value, use the Regex::capture_locations method.
This also returns the overall match if one was found. When a match is found, its offsets are also always stored in locs at index 0.
§Example
use regex::bytes::Regex;
let re = Regex::new(r"^([a-z]+)=(\S*)$").unwrap();
let mut locs = re.capture_locations();
assert!(re.captures_read(&mut locs, b"id=foo123").is_some());
assert_eq!(Some((0, 9)), locs.get(0));
assert_eq!(Some((0, 2)), locs.get(1));
assert_eq!(Some((3, 9)), locs.get(2));
pub fn captures_read_at<'h>(&self, locs: &mut CaptureLocations, haystack: &'h [u8], start: usize) -> Option<Match<'h>>
Returns the same as Regex::captures_read, but starts the search at the given offset.
The significance of the starting point is that it takes the surrounding context into consideration. For example, the \A anchor can only match when start == 0.
§Panics
This panics when start >= haystack.len() + 1.
§Example
This example shows the significance of start by demonstrating how it can be used to permit look-around assertions in a regex to take the surrounding context into account.
use regex::bytes::Regex;
let re = Regex::new(r"\bchew\b").unwrap();
let hay = b"eschew";
let mut locs = re.capture_locations();
// We get a match here, but it's probably not intended.
assert!(re.captures_read(&mut locs, &hay[2..]).is_some());
// No match because the assertions take the context into account.
assert!(re.captures_read_at(&mut locs, hay, 2).is_none());
impl Regex
Auxiliary methods.
pub fn as_str(&self) -> &str
Returns the original string of this regex.
§Example
use regex::bytes::Regex;
let re = Regex::new(r"foo\w+bar").unwrap();
assert_eq!(re.as_str(), r"foo\w+bar");
pub fn capture_names(&self) -> CaptureNames<'_>
Returns an iterator over the capture names in this regex.
The iterator returned yields elements of type Option<&str>. That is, the iterator yields values for all capture groups, even ones that are unnamed. The order of the groups corresponds to the order of the group’s corresponding opening parenthesis.
The first element of the iterator always yields the group corresponding to the overall match, and this group is always unnamed. Therefore, the iterator always yields at least one group.
§Example
This shows basic usage with a mix of named and unnamed capture groups:
use regex::bytes::Regex;
let re = Regex::new(r"(?<a>.(?<b>.))(.)(?:.)(?<c>.)").unwrap();
let mut names = re.capture_names();
assert_eq!(names.next(), Some(None));
assert_eq!(names.next(), Some(Some("a")));
assert_eq!(names.next(), Some(Some("b")));
assert_eq!(names.next(), Some(None));
// the '(?:.)' group is non-capturing and so doesn't appear here!
assert_eq!(names.next(), Some(Some("c")));
assert_eq!(names.next(), None);
The iterator always yields at least one element, even for regexes with no capture groups and even for regexes that can never match:
use regex::bytes::Regex;
let re = Regex::new(r"").unwrap();
let mut names = re.capture_names();
assert_eq!(names.next(), Some(None));
assert_eq!(names.next(), None);
let re = Regex::new(r"[a&&b]").unwrap();
let mut names = re.capture_names();
assert_eq!(names.next(), Some(None));
assert_eq!(names.next(), None);
pub fn captures_len(&self) -> usize
Returns the number of capture groups in this regex.
This includes all named and unnamed groups, including the implicit unnamed group that is always present and corresponds to the entire match.
Since the implicit unnamed group is always included in this length, the length returned is guaranteed to be greater than zero.
§Example
use regex::bytes::Regex;
let re = Regex::new(r"foo").unwrap();
assert_eq!(1, re.captures_len());
let re = Regex::new(r"(foo)").unwrap();
assert_eq!(2, re.captures_len());
let re = Regex::new(r"(?<a>.(?<b>.))(.)(?:.)(?<c>.)").unwrap();
assert_eq!(5, re.captures_len());
let re = Regex::new(r"[a&&b]").unwrap();
assert_eq!(1, re.captures_len());
pub fn static_captures_len(&self) -> Option<usize>
Returns the total number of capturing groups that appear in every possible match.
If the number of capture groups can vary depending on the match, then this returns None. That is, a value is only returned when the number of matching groups is invariant or “static.”
Note that like Regex::captures_len, this does include the implicit capturing group corresponding to the entire match. Therefore, when a non-None value is returned, it is guaranteed to be at least 1. Stated differently, a return value of Some(0) is impossible.
§Example
This shows a few cases where a static number of capture groups is available and a few cases where it is not.
use regex::bytes::Regex;
let len = |pattern| {
Regex::new(pattern).map(|re| re.static_captures_len())
};
assert_eq!(Some(1), len("a")?);
assert_eq!(Some(2), len("(a)")?);
assert_eq!(Some(2), len("(a)|(b)")?);
assert_eq!(Some(3), len("(a)(b)|(c)(d)")?);
assert_eq!(None, len("(a)|b")?);
assert_eq!(None, len("a|(b)")?);
assert_eq!(None, len("(b)*")?);
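assert_eq!(Some(2), len("(b)+")?);
// The `?` operator above requires the enclosing function to return a Result.
Ok::<(), Box<dyn std::error::Error>>(())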
pub fn capture_locations(&self) -> CaptureLocations
Returns a freshly allocated set of capture locations that can be reused in multiple calls to Regex::captures_read or Regex::captures_read_at.
§Example
use regex::bytes::Regex;
let re = Regex::new(r"(.)(.)(\w+)").unwrap();
let mut locs = re.capture_locations();
assert!(re.captures_read(&mut locs, b"Padron").is_some());
assert_eq!(locs.get(0), Some((0, 6)));
assert_eq!(locs.get(1), Some((0, 1)));
assert_eq!(locs.get(2), Some((1, 2)));
assert_eq!(locs.get(3), Some((2, 6)));