Struct regex_automata::meta::Config
pub struct Config { /* private fields */ }
An object describing the configuration of a Regex.
This configuration only includes options for the non-syntax behavior of a Regex, and can be applied via the Builder::configure method. For configuring the syntax options, see util::syntax::Config.
§Example: lower the NFA size limit
In some cases, the default size limit might be too big. The size limit can be lowered, which will prevent large regex patterns from compiling.
use regex_automata::meta::Regex;
let result = Regex::builder()
    .configure(Regex::config().nfa_size_limit(Some(20 * (1<<10))))
    // Not even 20KB is enough to build a single large Unicode class!
    .build(r"\pL");
assert!(result.is_err());
Implementations§
impl Config
pub fn match_kind(self, kind: MatchKind) -> Config
Set the match semantics for a Regex.
The default value is MatchKind::LeftmostFirst.
§Example
use regex_automata::{meta::Regex, Match, MatchKind};
// By default, leftmost-first semantics are used, which
// disambiguates matches at the same position by selecting
// the one that corresponds earlier in the pattern.
let re = Regex::new("sam|samwise")?;
assert_eq!(Some(Match::must(0, 0..3)), re.find("samwise"));
// But with 'all' semantics, match priority is ignored
// and all match states are included. When coupled with
// a leftmost search, the search will report the last
// possible match.
let re = Regex::builder()
    .configure(Regex::config().match_kind(MatchKind::All))
    .build("sam|samwise")?;
assert_eq!(Some(Match::must(0, 0..7)), re.find("samwise"));
// Beware that this can lead to skipping matches!
// Usually 'all' is used for anchored reverse searches
// only, or for overlapping searches.
assert_eq!(Some(Match::must(0, 4..11)), re.find("sam samwise"));
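The difference between the two match kinds can be modeled without the library. What follows is a minimal std-only sketch, not how regex-automata implements match semantics: it treats the pattern as a list of literal alternatives and only considers matches anchored at the start of the haystack. The function names leftmost_first_end and all_end are hypothetical.

```rust
/// Leftmost-first: return the end offset of the first alternative, in
/// pattern order, that matches at the start of `hay`.
fn leftmost_first_end(alternatives: &[&str], hay: &str) -> Option<usize> {
    alternatives
        .iter()
        .find(|alt| hay.starts_with(**alt))
        .map(|alt| alt.len())
}

/// "All": priority between alternatives is ignored, so the search keeps
/// going and the last (here: longest) match end is the one reported.
fn all_end(alternatives: &[&str], hay: &str) -> Option<usize> {
    alternatives
        .iter()
        .filter(|alt| hay.starts_with(**alt))
        .map(|alt| alt.len())
        .max()
}
```

For the alternatives sam and samwise against the haystack samwise, the first function stops at offset 3 and the second at offset 7, mirroring the two assertions in the example above.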
pub fn utf8_empty(self, yes: bool) -> Config
Toggles whether empty matches are permitted to occur between the code units of a UTF-8 encoded codepoint.
This should generally be enabled when searching a &str or anything that you otherwise know is valid UTF-8. It should be disabled in all other cases. Namely, if the haystack is not valid UTF-8 and this is enabled, then behavior is unspecified.
By default, this is enabled.
§Example
use regex_automata::{meta::Regex, Match};
let re = Regex::new("")?;
let got: Vec<Match> = re.find_iter("☃").collect();
// Matches only occur at the beginning and end of the snowman.
assert_eq!(got, vec![
    Match::must(0, 0..0),
    Match::must(0, 3..3),
]);
let re = Regex::builder()
    .configure(Regex::config().utf8_empty(false))
    .build("")?;
let got: Vec<Match> = re.find_iter("☃").collect();
// Matches now occur at every position!
assert_eq!(got, vec![
    Match::must(0, 0..0),
    Match::must(0, 1..1),
    Match::must(0, 2..2),
    Match::must(0, 3..3),
]);
Ok::<(), Box<dyn std::error::Error>>(())
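The positions reported above can be reproduced with the standard library alone. This is a small sketch under the assumption that an empty pattern matches at every offset unless it is filtered out; the function name empty_match_positions is hypothetical and this is not the library's implementation.

```rust
/// Offsets at which an empty pattern may match in `hay`.
///
/// When `utf8_empty` is true, offsets that fall inside the encoding of a
/// codepoint are dropped; otherwise every byte offset is kept.
fn empty_match_positions(hay: &str, utf8_empty: bool) -> Vec<usize> {
    (0..=hay.len())
        .filter(|&i| !utf8_empty || hay.is_char_boundary(i))
        .collect()
}
```

The snowman occupies three bytes in UTF-8, so offsets 1 and 2 are the ones that disappear when the option is enabled.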
pub fn auto_prefilter(self, yes: bool) -> Config
Toggles whether automatic prefilter support is enabled.
If this is disabled and Config::prefilter is not set, then the meta regex engine will not use any prefilters. This can sometimes be beneficial in cases where you know (or have measured) that the prefilter leads to overall worse search performance.
By default, this is enabled.
§Example
use regex_automata::{meta::Regex, Match};
let re = Regex::builder()
    .configure(Regex::config().auto_prefilter(false))
    .build(r"Bruce \w+")?;
let hay = "Hello Bruce Springsteen!";
assert_eq!(Some(Match::must(0, 6..23)), re.find(hay));
Ok::<(), Box<dyn std::error::Error>>(())
pub fn prefilter(self, pre: Option<Prefilter>) -> Config
Overrides and sets the prefilter to use inside a Regex.
This permits one to forcefully set a prefilter in cases where the caller knows better than whatever the automatic prefilter logic is capable of.
By default, this is set to None and an automatic prefilter will be used if one could be built. (Assuming Config::auto_prefilter is enabled, which it is by default.)
§Example
This example shows how to set your own prefilter. In the case of a pattern like Bruce \w+, the automatic prefilter is likely to be constructed in a way that it will look for occurrences of Bruce. In most cases, this is the best choice. But in some cases, it may be the case that running memchr on B is the best choice. One can achieve that behavior by overriding the automatic prefilter logic and providing a prefilter that just matches B.
use regex_automata::{
    meta::Regex,
    util::prefilter::Prefilter,
    Match, MatchKind,
};
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["B"])
    .expect("a prefilter");
let re = Regex::builder()
    .configure(Regex::config().prefilter(Some(pre)))
    .build(r"Bruce \w+")?;
let hay = "Hello Bruce Springsteen!";
assert_eq!(Some(Match::must(0, 6..23)), re.find(hay));
§Example: incorrect prefilters can lead to incorrect results!
Be warned that setting an incorrect prefilter can lead to missed matches. So if you use this option, ensure your prefilter can never report false negatives. (A false positive is, on the other hand, quite okay and generally unavoidable.)
use regex_automata::{
    meta::Regex,
    util::prefilter::Prefilter,
    Match, MatchKind,
};
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["Z"])
    .expect("a prefilter");
let re = Regex::builder()
    .configure(Regex::config().prefilter(Some(pre)))
    .build(r"Bruce \w+")?;
let hay = "Hello Bruce Springsteen!";
// Oops! No match found, but there should be one!
assert_eq!(None, re.find(hay));
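Why a false positive is harmless while a false negative is not can be seen in a toy model of the prefilter-then-verify loop. This is a std-only sketch under simplifying assumptions: the prefilter is a single byte, the "regex" is reduced to a literal prefix, and the name find_with_prefilter is hypothetical. It is not the meta regex engine's actual search loop.

```rust
/// Find `literal` in `hay`, but only look at positions the prefilter
/// proposes. The prefilter here is a scan for one byte.
fn find_with_prefilter(hay: &str, prefilter_byte: u8, literal: &str) -> Option<usize> {
    let bytes = hay.as_bytes();
    let mut at = 0;
    while at < bytes.len() {
        // Prefilter step: jump to the next candidate. If the prefilter
        // finds nothing, the search gives up without ever verifying.
        let candidate = at + bytes[at..].iter().position(|&b| b == prefilter_byte)?;
        // Verification step: a false positive is simply rejected here.
        if bytes[candidate..].starts_with(literal.as_bytes()) {
            return Some(candidate);
        }
        at = candidate + 1;
    }
    None
}
```

With the byte B, the candidate at the start of Bob is rejected by verification and the search moves on. With the byte Z, no candidate is ever proposed, so the real match is never examined.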
pub fn which_captures(self, which_captures: WhichCaptures) -> Config
Configures what kinds of groups are compiled as “capturing” in the underlying regex engine.
This is set to WhichCaptures::All by default. Callers may wish to use WhichCaptures::Implicit in cases where one wants to avoid the overhead of capture states for explicit groups.
Note that another approach to avoiding the overhead of capture groups is by using non-capturing groups in the regex pattern. That is, (?:a) instead of (a). This option is useful when you can’t control the concrete syntax but know that you don’t need the underlying capture states. For example, using WhichCaptures::Implicit will behave as if all explicit capturing groups in the pattern were non-capturing.
Setting this to WhichCaptures::None is usually not the right thing to do. When no capture states are compiled, some regex engines (such as the PikeVM) won’t be able to report match offsets. This will manifest as no match being found.
§Example
This example demonstrates how the results of capture groups can change based on this option. First we show the default (all capture groups in the pattern are capturing):
use regex_automata::{meta::Regex, Match, Span};
let re = Regex::new(r"foo([0-9]+)bar")?;
let hay = "foo123bar";
let mut caps = re.create_captures();
re.captures(hay, &mut caps);
assert_eq!(Some(Span::from(0..9)), caps.get_group(0));
assert_eq!(Some(Span::from(3..6)), caps.get_group(1));
Ok::<(), Box<dyn std::error::Error>>(())
And now we show the behavior when we only include implicit capture groups. In this case, we can only find the overall match span, but the spans of any other explicit group don’t exist because they are treated as non-capturing. (In effect, when WhichCaptures::Implicit is used, there is no real point in using Regex::captures since it will never be able to report more information than Regex::find.)
use regex_automata::{
    meta::Regex,
    nfa::thompson::WhichCaptures,
    Match,
    Span,
};
let re = Regex::builder()
    .configure(Regex::config().which_captures(WhichCaptures::Implicit))
    .build(r"foo([0-9]+)bar")?;
let hay = "foo123bar";
let mut caps = re.create_captures();
re.captures(hay, &mut caps);
assert_eq!(Some(Span::from(0..9)), caps.get_group(0));
assert_eq!(None, caps.get_group(1));
Ok::<(), Box<dyn std::error::Error>>(())
pub fn nfa_size_limit(self, limit: Option<usize>) -> Config
Sets the size limit, in bytes, to enforce on the construction of every NFA built by the meta regex engine.
Setting it to None disables the limit. This is not recommended if you’re compiling untrusted patterns.
Note that this limit is applied to each NFA built, and if any of them exceed the limit, then construction will fail. This limit does not correspond to the total memory used by all NFAs in the meta regex engine.
This defaults to some reasonable number that permits most reasonable patterns.
§Example
use regex_automata::meta::Regex;
let result = Regex::builder()
    .configure(Regex::config().nfa_size_limit(Some(20 * (1<<10))))
    // Not even 20KB is enough to build a single large Unicode class!
    .build(r"\pL");
assert!(result.is_err());
// But notice that building such a regex with the exact same limit
// can succeed depending on other aspects of the configuration. For
// example, a single *forward* NFA will (at time of writing) fit into
// the 20KB limit, but a *reverse* NFA of the same pattern will not.
// So if one configures a meta regex such that a reverse NFA is never
// needed and thus never built, then the 20KB limit will be enough for
// a pattern like \pL!
let result = Regex::builder()
    .configure(Regex::config()
        .nfa_size_limit(Some(20 * (1<<10)))
        // The DFAs are the only thing that (currently) need a reverse
        // NFA. So if both are disabled, the meta regex engine will
        // skip building the reverse NFA. Note that this isn't an API
        // guarantee. A future semver compatible version may introduce
        // new use cases for a reverse NFA.
        .hybrid(false)
        .dfa(false)
    )
    // With no reverse NFA to build, the forward NFA alone fits in 20KB.
    .build(r"\pL");
assert!(result.is_ok());
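The limits on this page are written as shifted constants such as 20 * (1<<10). As a readability aid, one could wrap that arithmetic in two small helpers. The names kib and mib are hypothetical and not part of regex-automata.

```rust
/// `n` kibibytes expressed in bytes, i.e. `n * (1 << 10)`.
const fn kib(n: usize) -> usize {
    n * (1 << 10)
}

/// `n` mebibytes expressed in bytes, i.e. `n * (1 << 20)`.
const fn mib(n: usize) -> usize {
    n * (1 << 20)
}
```

With these, the limit used above would read nfa_size_limit(Some(kib(20))), which is the same number of bytes as Some(20 * (1<<10)).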
pub fn onepass_size_limit(self, limit: Option<usize>) -> Config
Sets the size limit, in bytes, for the one-pass DFA.
Setting it to None disables the limit. Disabling the limit is strongly discouraged when compiling untrusted patterns. Even if the patterns are trusted, it still may not be a good idea, since a one-pass DFA can use a lot of memory. With that said, as the size of a regex increases, the likelihood of it being one-pass decreases.
This defaults to some reasonable number that permits most reasonable one-pass patterns.
§Example
This shows how to set the one-pass DFA size limit. Note that since a one-pass DFA is an optional component of the meta regex engine, this size limit only impacts what is built internally and will never determine whether a Regex itself fails to build.
use regex_automata::meta::Regex;
let result = Regex::builder()
    .configure(Regex::config().onepass_size_limit(Some(2 * (1<<20))))
    .build(r"\pL{5}");
assert!(result.is_ok());
pub fn hybrid_cache_capacity(self, limit: usize) -> Config
Set the cache capacity, in bytes, for the lazy DFA.
The cache capacity of the lazy DFA determines approximately how much heap memory it is allowed to use to store its state transitions. The state transitions are computed at search time, and if the cache fills up, it is cleared. At this point, any previously generated state transitions are lost and are re-generated if they’re needed again.
This sort of cache filling and clearing works quite well so long as cache clearing happens infrequently. If it happens too often, then the meta regex engine will stop using the lazy DFA and switch over to a different regex engine.
In cases where the cache is cleared too often, it may be possible to give the cache more space and reduce (or eliminate) how often it is cleared. Similarly, sometimes a regex is so big that the lazy DFA isn’t used at all if its cache capacity isn’t big enough.
The capacity set here is a limit on how much memory is used. The actual memory used is only allocated as it’s needed.
Determining the right value for this is a little tricky and will likely require some profiling. Enabling the logging feature and setting the log level to trace will also tell you how often the cache is being cleared.
§Example
use regex_automata::meta::Regex;
let result = Regex::builder()
    .configure(Regex::config().hybrid_cache_capacity(20 * (1<<20)))
    .build(r"\pL{5}");
assert!(result.is_ok());
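The fill-then-clear behavior described for the lazy DFA cache can be illustrated with a toy cache. This is only a sketch under loud assumptions: capacity is counted in entries rather than bytes, the cached "transition" is a stand-in computation, and the names ClearingCache and count_clears are hypothetical. It does not reflect the real data structure.

```rust
use std::collections::HashMap;

/// A toy cache that is wiped entirely whenever it is full and a new
/// entry is needed.
struct ClearingCache {
    capacity: usize,
    entries: HashMap<u32, u32>,
    clears: usize,
}

impl ClearingCache {
    fn new(capacity: usize) -> Self {
        ClearingCache { capacity, entries: HashMap::new(), clears: 0 }
    }

    fn get_or_compute(&mut self, key: u32) -> u32 {
        if let Some(&value) = self.entries.get(&key) {
            return value;
        }
        if self.entries.len() >= self.capacity {
            // Full: everything computed so far is lost.
            self.entries.clear();
            self.clears += 1;
        }
        // Stand-in for computing a state transition at search time.
        let value = key.wrapping_mul(31);
        self.entries.insert(key, value);
        value
    }
}

/// Number of times the cache was cleared while serving `keys` in order.
fn count_clears(capacity: usize, keys: &[u32]) -> usize {
    let mut cache = ClearingCache::new(capacity);
    for &key in keys {
        cache.get_or_compute(key);
    }
    cache.clears
}
```

Cycling through three keys with room for only two forces repeated clears, while a capacity that covers the working set never clears at all. That is the effect a larger hybrid_cache_capacity is meant to achieve.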
pub fn dfa_size_limit(self, limit: Option<usize>) -> Config
Sets the size limit, in bytes, for heap memory used for a fully compiled DFA.
NOTE: If you increase this, you’ll likely also need to increase Config::dfa_state_limit.
In contrast to the lazy DFA, building a full DFA requires computing all of its state transitions up front. This can be a very expensive process, and runs in worst case 2^n time and space (where n is proportional to the size of the regex). However, a full DFA unlocks some additional optimization opportunities.
Because full DFAs can be so expensive, the default limits for them are incredibly small. Generally speaking, if your regex is moderately big or if you’re using Unicode features (\w is Unicode-aware by default for example), then you can expect that the meta regex engine won’t even attempt to build a DFA for it.
If this and Config::dfa_state_limit are set to None, then the meta regex will not use any sort of limits when deciding whether to build a DFA. This in turn makes construction of a Regex take worst case exponential time and space. Even short patterns can result in huge space blow ups. So it is strongly recommended to keep some kind of limit set!
The default is set to a small number that permits some simple regexes to get compiled into DFAs in reasonable time.
§Example
use regex_automata::meta::Regex;
let result = Regex::builder()
    // 100MB is much bigger than the default.
    .configure(Regex::config()
        .dfa_size_limit(Some(100 * (1<<20)))
        // We don't care about size too much here, so just
        // remove the NFA state limit altogether.
        .dfa_state_limit(None))
    .build(r"\pL{5}");
assert!(result.is_ok());
pub fn dfa_state_limit(self, limit: Option<usize>) -> Config
Sets a limit on the total number of NFA states, beyond which compilation of a full DFA is not attempted.
This limit works in concert with Config::dfa_size_limit. Namely, whereas Config::dfa_size_limit is applied by attempting to construct a DFA, this limit is used to avoid the attempt in the first place. This is useful to avoid hefty initialization costs associated with building a DFA for cases where it is obvious the DFA will ultimately be too big.
By default, this is set to a very small number.
§Example
use regex_automata::meta::Regex;
let result = Regex::builder()
    .configure(Regex::config()
        // Sometimes the default state limit rejects DFAs even
        // if they would fit in the size limit. Here, we disable
        // the check on the number of NFA states and just rely on
        // the size limit.
        .dfa_state_limit(None))
    .build(r"(?-u)\w{30}");
assert!(result.is_ok());
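How the two DFA limits relate can be written down as a two-stage decision. The following is a sketch of the ordering described above, not the engine's code: DfaDecision and decide_full_dfa are hypothetical names, and dfa_bytes stands in for a size that in reality is only discovered by attempting the build.

```rust
#[derive(Debug, PartialEq)]
enum DfaDecision {
    /// The NFA had too many states, so no build was attempted.
    NotAttempted,
    /// A build was attempted but exceeded the size limit.
    TooBig,
    /// The DFA was built within both limits.
    Built,
}

fn decide_full_dfa(
    nfa_states: usize,
    dfa_bytes: usize,
    state_limit: Option<usize>,
    size_limit: Option<usize>,
) -> DfaDecision {
    // Cheap pre-check first: skip the expensive build entirely.
    if state_limit.map_or(false, |limit| nfa_states > limit) {
        return DfaDecision::NotAttempted;
    }
    // Only now is the size limit relevant, since it is enforced during
    // the (potentially costly) build itself.
    if size_limit.map_or(false, |limit| dfa_bytes > limit) {
        return DfaDecision::TooBig;
    }
    DfaDecision::Built
}
```

Passing None for the state limit, as in the example above, removes the first gate and leaves the size limit as the only guard.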
pub fn byte_classes(self, yes: bool) -> Config
Whether to attempt to shrink the size of the alphabet for the regex pattern or not. When enabled, the alphabet is shrunk into a set of equivalence classes, where every byte in the same equivalence class cannot discriminate between a match or non-match.
WARNING: This is only useful for debugging DFAs. Disabling this does not yield any speed advantages. Indeed, disabling it can result in much higher memory usage. Disabling byte classes is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.
This option is enabled by default and should never be disabled unless one is debugging the meta regex engine’s internals.
§Example
use regex_automata::{meta::Regex, Match};
let re = Regex::builder()
    .configure(Regex::config().byte_classes(false))
    .build(r"[a-z]+")?;
let hay = "!!quux!!";
assert_eq!(Some(Match::must(0, 2..6)), re.find(hay));
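To make the idea of byte equivalence classes concrete, here is a std-only sketch that partitions the 256 byte values given the byte ranges a pattern distinguishes. It is an illustration under assumptions, with hypothetical names byte_classes and alphabet_len, and it ignores any extra sentinel classes a real engine may add.

```rust
/// Map every byte to an equivalence class, given the inclusive byte
/// ranges that the pattern can tell apart.
fn byte_classes(ranges: &[(u8, u8)]) -> [u8; 256] {
    // A boundary after byte `b` means `b` and `b + 1` must be in
    // different classes.
    let mut boundary = [false; 256];
    for &(start, end) in ranges {
        if start > 0 {
            boundary[start as usize - 1] = true;
        }
        boundary[end as usize] = true;
    }
    let mut classes = [0u8; 256];
    let mut class = 0u8;
    for b in 0..256 {
        classes[b] = class;
        if boundary[b] && b < 255 {
            class += 1;
        }
    }
    classes
}

/// Number of distinct classes, i.e. the size of the shrunken alphabet.
fn alphabet_len(classes: &[u8; 256]) -> usize {
    classes[255] as usize + 1
}
```

For a pattern like [a-z]+ this yields three classes: bytes below a, bytes from a through z, and bytes above z. A transition table indexed by class instead of by raw byte then needs 3 columns per state rather than 256, which is where the memory savings come from.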
pub fn line_terminator(self, byte: u8) -> Config
Set the line terminator to be used by the ^ and $ anchors in multi-line mode.
This option has no effect when CRLF mode is enabled. That is, regardless of this setting, (?Rm:^) and (?Rm:$) will always treat \r and \n as line terminators (and will never match between a \r and a \n).
By default, \n is the line terminator.
Warning: This does not change the behavior of the dot metacharacter (.). To do that, you’ll need to configure the syntax option syntax::Config::line_terminator in addition to this. Otherwise, . will continue to match any character other than \n.
§Example
use regex_automata::{meta::Regex, util::syntax, Match};
let re = Regex::builder()
    .syntax(syntax::Config::new().multi_line(true))
    .configure(Regex::config().line_terminator(b'\x00'))
    .build(r"^foo$")?;
let hay = "\x00foo\x00";
assert_eq!(Some(Match::must(0, 1..4)), re.find(hay));
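Where the multi-line anchors can match for a given terminator byte can be computed by hand. This is a std-only sketch with hypothetical function names line_starts and line_ends. It ignores CRLF mode and is not the library's look-around implementation.

```rust
/// Offsets where a multi-line `^` could match: the start of the
/// haystack and every position right after a terminator byte.
fn line_starts(hay: &[u8], terminator: u8) -> Vec<usize> {
    let mut starts = vec![0];
    starts.extend(
        hay.iter()
            .enumerate()
            .filter(|&(_, &b)| b == terminator)
            .map(|(i, _)| i + 1),
    );
    starts
}

/// Offsets where a multi-line `$` could match: every position right
/// before a terminator byte and the end of the haystack.
fn line_ends(hay: &[u8], terminator: u8) -> Vec<usize> {
    let mut ends: Vec<usize> = hay
        .iter()
        .enumerate()
        .filter(|&(_, &b)| b == terminator)
        .map(|(i, _)| i)
        .collect();
    ends.push(hay.len());
    ends
}
```

For the haystack in the example above, offset 1 is a line start and offset 4 is a line end, which is why ^foo$ matches at 1..4 once NUL is the terminator.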
pub fn hybrid(self, yes: bool) -> Config
Toggle whether the hybrid NFA/DFA (also known as the “lazy DFA”) should be available for use by the meta regex engine.
Enabling this does not necessarily mean that the lazy DFA will definitely be used. It just means that it will be available for use if the meta regex engine thinks it will be useful.
When the hybrid crate feature is enabled, then this is enabled by default. Otherwise, if the crate feature is disabled, then this is always disabled, regardless of its setting by the caller.
pub fn dfa(self, yes: bool) -> Config
Toggle whether a fully compiled DFA should be available for use by the meta regex engine.
Enabling this does not necessarily mean that a DFA will definitely be used. It just means that it will be available for use if the meta regex engine thinks it will be useful.
When the dfa-build crate feature is enabled, then this is enabled by default. Otherwise, if the crate feature is disabled, then this is always disabled, regardless of its setting by the caller.
pub fn onepass(self, yes: bool) -> Config
Toggle whether a one-pass DFA should be available for use by the meta regex engine.
Enabling this does not necessarily mean that a one-pass DFA will definitely be used. It just means that it will be available for use if the meta regex engine thinks it will be useful. (Indeed, a one-pass DFA can only be used when the regex is one-pass. See the dfa::onepass module for more details.)
When the dfa-onepass crate feature is enabled, then this is enabled by default. Otherwise, if the crate feature is disabled, then this is always disabled, regardless of its setting by the caller.
pub fn backtrack(self, yes: bool) -> Config
Toggle whether a bounded backtracking regex engine should be available for use by the meta regex engine.
Enabling this does not necessarily mean that a bounded backtracker will definitely be used. It just means that it will be available for use if the meta regex engine thinks it will be useful.
When the nfa-backtrack crate feature is enabled, then this is enabled by default. Otherwise, if the crate feature is disabled, then this is always disabled, regardless of its setting by the caller.
pub fn get_match_kind(&self) -> MatchKind
Returns the match kind on this configuration, as set by Config::match_kind.
If it was not explicitly set, then a default value is returned.
pub fn get_utf8_empty(&self) -> bool
Returns whether empty matches must fall on valid UTF-8 boundaries, as set by Config::utf8_empty.
If it was not explicitly set, then a default value is returned.
pub fn get_auto_prefilter(&self) -> bool
Returns whether automatic prefilters are enabled, as set by Config::auto_prefilter.
If it was not explicitly set, then a default value is returned.
pub fn get_prefilter(&self) -> Option<&Prefilter>
Returns a manually set prefilter, if one was set by Config::prefilter.
If it was not explicitly set, then a default value is returned.
pub fn get_which_captures(&self) -> WhichCaptures
Returns the capture configuration, as set by Config::which_captures.
If it was not explicitly set, then a default value is returned.
pub fn get_nfa_size_limit(&self) -> Option<usize>
Returns the NFA size limit, as set by Config::nfa_size_limit.
If it was not explicitly set, then a default value is returned.
pub fn get_onepass_size_limit(&self) -> Option<usize>
Returns the one-pass DFA size limit, as set by Config::onepass_size_limit.
If it was not explicitly set, then a default value is returned.
pub fn get_hybrid_cache_capacity(&self) -> usize
Returns the hybrid NFA/DFA cache capacity, as set by Config::hybrid_cache_capacity.
If it was not explicitly set, then a default value is returned.
pub fn get_dfa_size_limit(&self) -> Option<usize>
Returns the DFA size limit, as set by Config::dfa_size_limit.
If it was not explicitly set, then a default value is returned.
pub fn get_dfa_state_limit(&self) -> Option<usize>
Returns the DFA limit expressed as a number of NFA states, as set by Config::dfa_state_limit.
If it was not explicitly set, then a default value is returned.
pub fn get_byte_classes(&self) -> bool
Returns whether byte classes are enabled, as set by Config::byte_classes.
If it was not explicitly set, then a default value is returned.
pub fn get_line_terminator(&self) -> u8
Returns the line terminator for this configuration, as set by Config::line_terminator.
If it was not explicitly set, then a default value is returned.
pub fn get_hybrid(&self) -> bool
Returns whether the hybrid NFA/DFA regex engine may be used, as set by Config::hybrid.
If it was not explicitly set, then a default value is returned.
pub fn get_dfa(&self) -> bool
Returns whether the DFA regex engine may be used, as set by Config::dfa.
If it was not explicitly set, then a default value is returned.
pub fn get_onepass(&self) -> bool
Returns whether the one-pass DFA regex engine may be used, as set by Config::onepass.
If it was not explicitly set, then a default value is returned.
pub fn get_backtrack(&self) -> bool
Returns whether the bounded backtracking regex engine may be used, as set by Config::backtrack.
If it was not explicitly set, then a default value is returned.
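The setter and getter pairs on this type follow a common pattern: setters consume and return the configuration, and getters fall back to a default when nothing was set. Since the real fields are private, the following is only a guess at the shape, using a hypothetical SketchConfig with two of the options and the defaults stated on this page (\n as line terminator, utf8_empty enabled).

```rust
/// A hypothetical, cut-down stand-in for a configuration type.
#[derive(Clone, Debug, Default)]
struct SketchConfig {
    // `None` means "not explicitly set by the caller".
    line_terminator: Option<u8>,
    utf8_empty: Option<bool>,
}

impl SketchConfig {
    fn new() -> Self {
        Self::default()
    }

    // Setters take `self` by value so calls can be chained.
    fn line_terminator(self, byte: u8) -> Self {
        SketchConfig { line_terminator: Some(byte), ..self }
    }

    fn utf8_empty(self, yes: bool) -> Self {
        SketchConfig { utf8_empty: Some(yes), ..self }
    }

    // Getters return the explicit value or fall back to the default.
    fn get_line_terminator(&self) -> u8 {
        self.line_terminator.unwrap_or(b'\n')
    }

    fn get_utf8_empty(&self) -> bool {
        self.utf8_empty.unwrap_or(true)
    }
}
```

Keeping "unset" distinct from "set to the default" is what lets a getter always return a usable value while still allowing one configuration to be layered over another.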