Struct regex::RegexSetBuilder
pub struct RegexSetBuilder { /* private fields */ }
A configurable builder for a RegexSet.
This builder can be used to programmatically set flags such as i (case insensitive) and x (for verbose mode). This builder can also be used to configure things like the line terminator and a size limit on the compiled regular expression.
Implementations
impl RegexSetBuilder
pub fn new<I, S>(patterns: I) -> RegexSetBuilder
Create a new builder with a default configuration for the given patterns.
If the patterns are invalid or exceed the configured size limits, then an error will be returned when RegexSetBuilder::build is called.
pub fn build(&self) -> Result<RegexSet, Error>
Compiles the patterns given to RegexSetBuilder::new with the configuration set on this builder.
If the patterns aren’t valid regexes or if a configured size limit was exceeded, then an error is returned.
pub fn unicode(&mut self, yes: bool) -> &mut RegexSetBuilder
This configures Unicode mode for all of the patterns.
Enabling Unicode mode does a number of things:
- Most fundamentally, it causes the fundamental atom of matching to be a single codepoint. When Unicode mode is disabled, it’s a single byte. For example, when Unicode mode is enabled, . will match 💩 once, whereas it will match 4 times when Unicode mode is disabled. (Since the UTF-8 encoding of 💩 is 4 bytes long.)
- Case insensitive matching uses Unicode simple case folding rules.
- Unicode character classes like \p{Letter} and \p{Greek} are available.
- Perl character classes are Unicode aware. That is, \w, \s and \d.
- The word boundary assertions, \b and \B, use the Unicode definition of a word character.
Note that if Unicode mode is disabled, then the regex will fail to compile if it could match invalid UTF-8. For example, when Unicode mode is disabled, . matches any byte (except for \n), so it can match invalid UTF-8 and building a regex from it will fail. Another example is \w and \W. Since \w can only match ASCII bytes when Unicode mode is disabled, it’s allowed. But \W can match more than ASCII bytes, including invalid UTF-8, and so it is not allowed. This restriction can be lifted only by using a bytes::RegexSet.
For more details on the Unicode support in this crate, see the Unicode section in this crate’s top-level documentation.
The default for this is true.
Example
use regex::RegexSetBuilder;
let re = RegexSetBuilder::new([r"\w"])
.unicode(false)
.build()
.unwrap();
// Normally greek letters would be included in \w, but since
// Unicode mode is disabled, it only matches ASCII letters.
assert!(!re.is_match("δ"));
let re = RegexSetBuilder::new([r"s"])
.case_insensitive(true)
.unicode(false)
.build()
.unwrap();
// Normally 'ſ' is included when searching for 's' case
// insensitively due to Unicode's simple case folding rules. But
// when Unicode mode is disabled, only ASCII case insensitive rules
// are used.
assert!(!re.is_match("ſ"));
pub fn case_insensitive(&mut self, yes: bool) -> &mut RegexSetBuilder
This configures whether to enable case insensitive matching for all of the patterns.
This setting can also be configured using the inline flag i in the pattern. For example, (?i:foo) matches foo case insensitively while (?-i:foo) matches foo case sensitively.
The default for this is false.
Example
use regex::RegexSetBuilder;
let re = RegexSetBuilder::new([r"foo(?-i:bar)quux"])
.case_insensitive(true)
.build()
.unwrap();
assert!(re.is_match("FoObarQuUx"));
// Even though case insensitive matching is enabled in the builder,
// it can be locally disabled within the pattern. In this case,
// `bar` is matched case sensitively.
assert!(!re.is_match("fooBARquux"));
pub fn multi_line(&mut self, yes: bool) -> &mut RegexSetBuilder
This configures multi-line mode for all of the patterns.
Enabling multi-line mode changes the behavior of the ^ and $ anchor assertions. Instead of only matching at the beginning and end of a haystack, respectively, multi-line mode causes them to match at the beginning and end of a line in addition to the beginning and end of a haystack. More precisely, ^ will match at the position immediately following a \n and $ will match at the position immediately preceding a \n.
The behavior of this option can be impacted by other settings too:
- The RegexSetBuilder::line_terminator option changes \n above to any ASCII byte.
- The RegexSetBuilder::crlf option changes the line terminator to be either \r or \n, but never at the position between a \r and \n.
This setting can also be configured using the inline flag m in the pattern.
The default for this is false.
Example
use regex::RegexSetBuilder;
let re = RegexSetBuilder::new([r"^foo$"])
.multi_line(true)
.build()
.unwrap();
assert!(re.is_match("\nfoo\n"));
pub fn dot_matches_new_line(&mut self, yes: bool) -> &mut RegexSetBuilder
This configures dot-matches-new-line mode for all of the patterns.
Perhaps surprisingly, the default behavior for . is not to match any character, but rather, to match any character except for the line terminator (which is \n by default). When this mode is enabled, the behavior changes such that . truly matches any character.
This setting can also be configured using the inline flag s in the pattern. For example, (?s:.) and \p{any} are equivalent regexes.
The default for this is false.
Example
use regex::RegexSetBuilder;
let re = RegexSetBuilder::new([r"foo.bar"])
.dot_matches_new_line(true)
.build()
.unwrap();
let hay = "foo\nbar";
assert!(re.is_match(hay));
pub fn crlf(&mut self, yes: bool) -> &mut RegexSetBuilder
This configures CRLF mode for all of the patterns.
When CRLF mode is enabled, both \r (“carriage return” or CR for short) and \n (“line feed” or LF for short) are treated as line terminators. This results in the following:
- Unless dot-matches-new-line mode is enabled, . will now match any character except for \n and \r.
- When multi-line mode is enabled, ^ will match immediately following a \n or a \r. Similarly, $ will match immediately preceding a \n or a \r. Neither ^ nor $ will ever match between \r and \n.
This setting can also be configured using the inline flag R in the pattern.
The default for this is false.
Example
use regex::RegexSetBuilder;
let re = RegexSetBuilder::new([r"^foo$"])
.multi_line(true)
.crlf(true)
.build()
.unwrap();
let hay = "\r\nfoo\r\n";
// If CRLF mode weren't enabled here, then '$' wouldn't match
// immediately after 'foo', and thus no match would be found.
assert!(re.is_match(hay));
This example demonstrates that ^ will never match at a position between \r and \n. ($ will similarly not match between a \r and a \n.)
use regex::RegexSetBuilder;
let re = RegexSetBuilder::new([r"^\n"])
.multi_line(true)
.crlf(true)
.build()
.unwrap();
assert!(!re.is_match("\r\n"));
pub fn line_terminator(&mut self, byte: u8) -> &mut RegexSetBuilder
Configures the line terminator to be used by the regex.
The line terminator is relevant in two ways for a particular regex:
- When dot-matches-new-line mode is not enabled (the default), then . will match any character except for the configured line terminator.
- When multi-line mode is enabled (not the default), then ^ and $ will match immediately after and before, respectively, a line terminator.
In both cases, if CRLF mode is enabled in a particular context, then it takes precedence over any configured line terminator.
This option cannot be configured from within the pattern.
The default line terminator is \n.
Example
This shows how to treat the NUL byte as a line terminator. This can be a useful heuristic when searching binary data.
use regex::RegexSetBuilder;
let re = RegexSetBuilder::new([r"^foo$"])
.multi_line(true)
.line_terminator(b'\x00')
.build()
.unwrap();
let hay = "\x00foo\x00";
assert!(re.is_match(hay));
This example shows that the behavior of . is impacted by this setting as well:
use regex::RegexSetBuilder;
let re = RegexSetBuilder::new([r"."])
.line_terminator(b'\x00')
.build()
.unwrap();
assert!(re.is_match("\n"));
assert!(!re.is_match("\x00"));
This shows that building a regex will fail if the byte given is not ASCII and the pattern could result in matching invalid UTF-8. This is because any singular non-ASCII byte is not valid UTF-8, and it is not permitted for a RegexSet to match invalid UTF-8. (It is permissible to use a non-ASCII byte when building a bytes::RegexSet.)
use regex::RegexSetBuilder;
assert!(
RegexSetBuilder::new([r"."])
.line_terminator(0x80)
.build()
.is_err()
);
// Note that using a non-ASCII byte isn't enough on its own to
// cause regex compilation to fail. You actually have to make use
// of it in the regex in a way that leads to matching invalid
// UTF-8. If you don't, then regex compilation will succeed!
assert!(
RegexSetBuilder::new([r"a"])
.line_terminator(0x80)
.build()
.is_ok()
);
pub fn swap_greed(&mut self, yes: bool) -> &mut RegexSetBuilder
This configures swap-greed mode for all of the patterns.
When swap-greed mode is enabled, patterns like a+ will become non-greedy and patterns like a+? will become greedy. In other words, the meanings of a+ and a+? are switched.
This setting can also be configured using the inline flag U in the pattern.
Note that this is generally not useful for a RegexSet since a RegexSet can only report whether a pattern matches or not. Since greediness never impacts whether a match is found or not (only the offsets of the match), it follows that whether parts of a pattern are greedy or not doesn’t matter for a RegexSet.
The default for this is false.
pub fn ignore_whitespace(&mut self, yes: bool) -> &mut RegexSetBuilder
This configures verbose mode for all of the patterns.
When enabled, whitespace will be treated as insignificant in the pattern and # can be used to start a comment until the next new line.
Normally, in most places in a pattern, whitespace is treated literally. For example, " +" (a space followed by +) will match one or more ASCII whitespace characters.
When verbose mode is enabled, \# can be used to match a literal # and "\ " (a backslash followed by a space) can be used to match a literal ASCII whitespace character.
Verbose mode is useful for permitting regexes to be formatted and broken up more nicely. This may make them more easily readable.
This setting can also be configured using the inline flag x in the pattern.
The default for this is false.
Example
use regex::RegexSetBuilder;
let pat = r"
\b
(?<first>\p{Uppercase}\w*) # always start with uppercase letter
[\s--\n]+ # whitespace should separate names
(?: # middle name can be an initial!
(?:(?<initial>\p{Uppercase})\.|(?<middle>\p{Uppercase}\w*))
[\s--\n]+
)?
(?<last>\p{Uppercase}\w*)
\b
";
let re = RegexSetBuilder::new([pat])
.ignore_whitespace(true)
.build()
.unwrap();
assert!(re.is_match("Harry Potter"));
assert!(re.is_match("Harry J. Potter"));
assert!(re.is_match("Harry James Potter"));
assert!(!re.is_match("harry J. Potter"));
pub fn octal(&mut self, yes: bool) -> &mut RegexSetBuilder
This configures octal mode for all of the patterns.
Octal syntax is a little-known way of uttering Unicode codepoints in a pattern. For example, a, \x61, \u0061 and \141 are all equivalent patterns, where the last example shows octal syntax.
While supporting octal syntax isn’t in and of itself a problem, it does make good error messages harder. That is, in PCRE based regex engines, syntax like \1 invokes a backreference, which is explicitly unsupported in this library. However, many users expect backreferences to be supported. Therefore, when octal support is disabled, the error message will explicitly mention that backreferences aren’t supported.
The default for this is false.
Example
use regex::RegexSetBuilder;
// Normally this pattern would not compile, with an error message
// about backreferences not being supported. But with octal mode
// enabled, octal escape sequences work.
let re = RegexSetBuilder::new([r"\141"])
.octal(true)
.build()
.unwrap();
assert!(re.is_match("a"));
pub fn size_limit(&mut self, bytes: usize) -> &mut RegexSetBuilder
Sets the approximate size limit, in bytes, of the compiled regex.
This roughly corresponds to the amount of heap memory, in bytes, occupied by a single regex. If the regex would otherwise approximately exceed this limit, then compiling that regex will fail.
The main utility of a method like this is to avoid compiling regexes that use an unexpected amount of resources, such as time and memory. Even if the memory usage of a large regex is acceptable, its search time may not be. Namely, worst case time complexity for search is O(m * n), where m ~ len(pattern) and n ~ len(haystack). That is, search time depends, in part, on the size of the compiled regex. This means that putting a limit on the size of the regex limits how much a regex can impact search time.
For more information about regex size limits, see the section on untrusted inputs in the top-level crate documentation.
The default for this is some reasonable number that permits most patterns to compile successfully.
Example
use regex::RegexSetBuilder;
// It may surprise you how big some seemingly small patterns can
// be! Since \w is Unicode aware, this generates a regex that can
// match approximately 140,000 distinct codepoints.
assert!(
RegexSetBuilder::new([r"\w"])
.size_limit(45_000)
.build()
.is_err()
);
pub fn dfa_size_limit(&mut self, bytes: usize) -> &mut RegexSetBuilder
Set the approximate capacity, in bytes, of the cache of transitions used by the lazy DFA.
While the lazy DFA isn’t always used, it tends to be the most commonly used regex engine in default configurations. It tends to adopt the performance profile of a fully built DFA, but without the downside of taking worst case exponential time to build.
The downside is that it needs to keep a cache of transitions and states that are built while running a search, and this cache can fill up. When it fills up, the cache will reset itself. Any previously generated states and transitions will then need to be re-generated. If this happens too many times, then this library will bail out of using the lazy DFA and switch to a different regex engine.
If your regex provokes this particular downside of the lazy DFA, then it may be beneficial to increase its cache capacity. This will potentially reduce the frequency of cache resetting (ideally to 0). While it won’t fix all potential performance problems with the lazy DFA, increasing the cache capacity does fix some.
There is no easy way to determine, a priori, whether increasing this cache capacity will help. In general, the larger your regex, the more cache it’s likely to use. But that isn’t an ironclad rule. For example, a regex like [01]*1[01]{N} would normally produce a fully built DFA that is exponential in size with respect to N. The lazy DFA will prevent exponential space blow-up, but its cache is likely to fill up, even when it’s large and even for smallish values of N.
If you aren’t sure whether this helps or not, it is sensible to set this to some arbitrarily large number in testing, such as usize::MAX. Namely, this represents the amount of capacity that may be used. It’s probably not a good idea to use usize::MAX in production though, since it implies there are no controls on heap memory used by this library during a search. In effect, set it to whatever you’re willing to allocate for a single regex search.
pub fn nest_limit(&mut self, limit: u32) -> &mut RegexSetBuilder
Set the nesting limit for this parser.
The nesting limit controls how deep the abstract syntax tree is allowed to be. If the AST exceeds the given limit (e.g., with too many nested groups), then an error is returned by the parser.
The purpose of this limit is to act as a heuristic to prevent stack overflow for consumers that do structural induction on an AST using explicit recursion. While this crate never does this (instead using constant stack space and moving the call stack to the heap), other crates may.
This limit is not checked until the entire AST is parsed. Therefore, if callers want to put a limit on the amount of heap space used, then they should impose a limit on the length, in bytes, of the concrete pattern string. In particular, this is viable since this parser implementation will limit itself to heap space proportional to the length of the pattern string. See also the untrusted inputs section in the top-level crate documentation for more information about this.
Note that a nest limit of 0 will return a nest limit error for most patterns but not all. For example, a nest limit of 0 permits a but not ab, since ab requires an explicit concatenation, which results in a nest depth of 1. In general, a nest limit is not something that manifests in an obvious way in the concrete syntax. Therefore, it should not be used in a granular way.
Example
use regex::RegexSetBuilder;
assert!(RegexSetBuilder::new([r"a"]).nest_limit(0).build().is_ok());
assert!(RegexSetBuilder::new([r"ab"]).nest_limit(0).build().is_err());
Trait Implementations
impl Clone for RegexSetBuilder
fn clone(&self) -> RegexSetBuilder
fn clone_from(&mut self, source: &Self)