Struct regex_automata::nfa::thompson::Config
source · pub struct Config { /* private fields */ }
Expand description
The configuration used for a Thompson NFA compiler.
Implementations§
source§impl Config
impl Config
sourcepub fn utf8(self, yes: bool) -> Config
pub fn utf8(self, yes: bool) -> Config
Whether to enable UTF-8 mode during search or not.
A regex engine is said to be in UTF-8 mode when it guarantees that all matches returned by it have spans consisting of only valid UTF-8. That is, it is impossible for a match span to be returned that contains any invalid UTF-8.
UTF-8 mode generally consists of two things:
- Whether the NFA’s states are constructed such that all paths to a match state that consume at least one byte always correspond to valid UTF-8.
- Whether all paths to a match state that do not consume any bytes should always correspond to valid UTF-8 boundaries.
(1) is a guarantee made by whoever constructs the NFA.
If you’re parsing a regex from its concrete syntax, then
syntax::Config::utf8
can make
this guarantee for you. It does it by returning an error if the regex
pattern could every report a non-empty match span that contains invalid
UTF-8. So long as syntax::Config::utf8
mode is enabled and your regex
successfully parses, then you’re guaranteed that the corresponding NFA
will only ever report non-empty match spans containing valid UTF-8.
(2) is a trickier guarantee because it cannot be enforced by the NFA
state graph itself. Consider, for example, the regex a*
. It matches
the empty strings in ☃
at positions 0
, 1
, 2
and 3
, where
positions 1
and 2
occur within the UTF-8 encoding of a codepoint,
and thus correspond to invalid UTF-8 boundaries. Therefore, this
guarantee must be made at a higher level than the NFA state graph
itself. This crate deals with this case in each regex engine. Namely,
when a zero-width match that splits a codepoint is found and UTF-8
mode enabled, then it is ignored and the engine moves on looking for
the next match.
Thus, UTF-8 mode is both a promise that the NFA built only reports non-empty matches that are valid UTF-8, and an instruction to regex engines that empty matches that split codepoints should be banned.
Because UTF-8 mode is fundamentally about avoiding invalid UTF-8 spans,
it only makes sense to enable this option when you know your haystack
is valid UTF-8. (For example, a &str
.) Enabling UTF-8 mode and
searching a haystack that contains invalid UTF-8 leads to unspecified
behavior.
Therefore, it may make sense to enable syntax::Config::utf8
while
simultaneously disabling this option. That would ensure all non-empty
match spans are valid UTF-8, but that empty match spans may still split
a codepoint or match at other places that aren’t valid UTF-8.
In general, this mode is only relevant if your regex can match the empty string. Most regexes don’t.
This is enabled by default.
§Example
This example shows how UTF-8 mode can impact the match spans that may be reported in certain cases.
use regex_automata::{
nfa::thompson::{self, pikevm::PikeVM},
Match, Input,
};
let re = PikeVM::new("")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
// UTF-8 mode is enabled by default.
let mut input = Input::new("☃");
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..0)), caps.get_match());
// Even though an empty regex matches at 1..1, our next match is
// 3..3 because 1..1 and 2..2 split the snowman codepoint (which is
// three bytes long).
input.set_start(1);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 3..3)), caps.get_match());
// But if we disable UTF-8, then we'll get matches at 1..1 and 2..2:
let re = PikeVM::builder()
.thompson(thompson::Config::new().utf8(false))
.build("")?;
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 1..1)), caps.get_match());
input.set_start(2);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 2..2)), caps.get_match());
input.set_start(3);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 3..3)), caps.get_match());
input.set_start(4);
re.search(&mut cache, &input, &mut caps);
assert_eq!(None, caps.get_match());
sourcepub fn reverse(self, yes: bool) -> Config
pub fn reverse(self, yes: bool) -> Config
Reverse the NFA.
A NFA reversal is performed by reversing all of the concatenated sub-expressions in the original pattern, recursively. (Look around operators are also inverted.) The resulting NFA can be used to match the pattern starting from the end of a string instead of the beginning of a string.
Reversing the NFA is useful for building a reverse DFA, which is most useful for finding the start of a match after its ending position has been found. NFA execution engines typically do not work on reverse NFAs. For example, currently, the Pike VM reports the starting location of matches without a reverse NFA.
Currently, enabling this setting requires disabling the
captures
setting. If both are enabled, then the
compiler will return an error. It is expected that this limitation will
be lifted in the future.
This is disabled by default.
§Example
This example shows how to build a DFA from a reverse NFA, and then use the DFA to search backwards.
use regex_automata::{
dfa::{self, Automaton},
nfa::thompson::{NFA, WhichCaptures},
HalfMatch, Input,
};
let dfa = dfa::dense::Builder::new()
.thompson(NFA::config()
.which_captures(WhichCaptures::None)
.reverse(true)
)
.build("baz[0-9]+")?;
let expected = Some(HalfMatch::must(0, 3));
assert_eq!(
expected,
dfa.try_search_rev(&Input::new("foobaz12345bar"))?,
);
sourcepub fn nfa_size_limit(self, bytes: Option<usize>) -> Config
pub fn nfa_size_limit(self, bytes: Option<usize>) -> Config
Sets an approximate size limit on the total heap used by the NFA being compiled.
This permits imposing constraints on the size of a compiled NFA. This may be useful in contexts where the regex pattern is untrusted and one wants to avoid using too much memory.
This size limit does not apply to auxiliary heap used during compilation that is not part of the built NFA.
Note that this size limit is applied during compilation in order for the limit to prevent too much heap from being used. However, the implementation may use an intermediate NFA representation that is otherwise slightly bigger than the final public form. Since the size limit may be applied to an intermediate representation, there is not necessarily a precise correspondence between the configured size limit and the heap usage of the final NFA.
There is no size limit by default.
§Example
This example demonstrates how Unicode mode can greatly increase the size of the NFA.
use regex_automata::nfa::thompson::NFA;
// 300KB isn't enough!
NFA::compiler()
.configure(NFA::config().nfa_size_limit(Some(300_000)))
.build(r"\w{20}")
.unwrap_err();
// ... but 400KB probably is.
let nfa = NFA::compiler()
.configure(NFA::config().nfa_size_limit(Some(400_000)))
.build(r"\w{20}")?;
assert_eq!(nfa.pattern_len(), 1);
sourcepub fn shrink(self, yes: bool) -> Config
pub fn shrink(self, yes: bool) -> Config
Apply best effort heuristics to shrink the NFA at the expense of more time/memory.
Generally speaking, if one is using an NFA to compile a DFA, then the extra time used to shrink the NFA will be more than made up for during DFA construction (potentially by a lot). In other words, enabling this can substantially decrease the overall amount of time it takes to build a DFA.
A reason to keep this disabled is if you want to compile an NFA and start using it as quickly as possible without needing to build a DFA, and you don’t mind using a bit of extra memory for the NFA. e.g., for an NFA simulation or for a lazy DFA.
NFA shrinking is currently most useful when compiling a reverse NFA with large Unicode character classes. In particular, it trades additional CPU time during NFA compilation in favor of generating fewer NFA states.
This is disabled by default because it can increase compile times quite a bit if you aren’t building a full DFA.
§Example
This example shows that NFA shrinking can lead to substantial space savings in some cases. Notice that, as noted above, we build a reverse DFA and use a pattern with a large Unicode character class.
use regex_automata::nfa::thompson::{NFA, WhichCaptures};
// Currently we have to disable captures when enabling reverse NFA.
let config = NFA::config()
.which_captures(WhichCaptures::None)
.reverse(true);
let not_shrunk = NFA::compiler()
.configure(config.clone().shrink(false))
.build(r"\w")?;
let shrunk = NFA::compiler()
.configure(config.clone().shrink(true))
.build(r"\w")?;
// While a specific shrink factor is not guaranteed, the savings can be
// considerable in some cases.
assert!(shrunk.states().len() * 2 < not_shrunk.states().len());
sourcepub fn captures(self, yes: bool) -> Config
👎Deprecated since 0.3.5: use which_captures instead
pub fn captures(self, yes: bool) -> Config
Whether to include ‘Capture’ states in the NFA.
Currently, enabling this setting requires disabling the
reverse
setting. If both are enabled, then the
compiler will return an error. It is expected that this limitation will
be lifted in the future.
This is enabled by default.
§Example
This example demonstrates that some regex engines, like the Pike VM, require capturing states to be present in the NFA to report match offsets.
(Note that since this method is deprecated, the example below uses
Config::which_captures
to disable capture states.)
use regex_automata::nfa::thompson::{
pikevm::PikeVM,
NFA,
WhichCaptures,
};
let re = PikeVM::builder()
.thompson(NFA::config().which_captures(WhichCaptures::None))
.build(r"[a-z]+")?;
let mut cache = re.create_cache();
assert!(re.is_match(&mut cache, "abc"));
assert_eq!(None, re.find(&mut cache, "abc"));
sourcepub fn which_captures(self, which_captures: WhichCaptures) -> Config
pub fn which_captures(self, which_captures: WhichCaptures) -> Config
Configures what kinds of capture groups are compiled into
State::Capture
states in a
Thompson NFA.
Currently, using any option except for WhichCaptures::None
requires
disabling the reverse
setting. If both are
enabled, then the compiler will return an error. It is expected that
this limitation will be lifted in the future.
This is set to WhichCaptures::All
by default. Callers may wish to
use WhichCaptures::Implicit
in cases where one wants avoid the
overhead of capture states for explicit groups. Usually this occurs
when one wants to use the PikeVM
only for determining the overall
match. Otherwise, the PikeVM
could use much more memory than is
necessary.
§Example
This example demonstrates that some regex engines, like the Pike VM, require capturing states to be present in the NFA to report match offsets.
use regex_automata::nfa::thompson::{
pikevm::PikeVM,
NFA,
WhichCaptures,
};
let re = PikeVM::builder()
.thompson(NFA::config().which_captures(WhichCaptures::None))
.build(r"[a-z]+")?;
let mut cache = re.create_cache();
assert!(re.is_match(&mut cache, "abc"));
assert_eq!(None, re.find(&mut cache, "abc"));
The same applies to the bounded backtracker:
use regex_automata::nfa::thompson::{
backtrack::BoundedBacktracker,
NFA,
WhichCaptures,
};
let re = BoundedBacktracker::builder()
.thompson(NFA::config().which_captures(WhichCaptures::None))
.build(r"[a-z]+")?;
let mut cache = re.create_cache();
assert!(re.try_is_match(&mut cache, "abc")?);
assert_eq!(None, re.try_find(&mut cache, "abc")?);
sourcepub fn look_matcher(self, m: LookMatcher) -> Config
pub fn look_matcher(self, m: LookMatcher) -> Config
Sets the look-around matcher that should be used with this NFA.
A look-around matcher determines how to match look-around assertions.
In particular, some assertions are configurable. For example, the
(?m:^)
and (?m:$)
assertions can have their line terminator changed
from the default of \n
to any other byte.
§Example
This shows how to change the line terminator for multi-line assertions.
use regex_automata::{
nfa::thompson::{self, pikevm::PikeVM},
util::look::LookMatcher,
Match, Input,
};
let mut lookm = LookMatcher::new();
lookm.set_line_terminator(b'\x00');
let re = PikeVM::builder()
.thompson(thompson::Config::new().look_matcher(lookm))
.build(r"(?m)^[a-z]+$")?;
let mut cache = re.create_cache();
// Multi-line assertions now use NUL as a terminator.
assert_eq!(
Some(Match::must(0, 1..4)),
re.find(&mut cache, b"\x00abc\x00"),
);
// ... and \n is no longer recognized as a terminator.
assert_eq!(
None,
re.find(&mut cache, b"\nabc\n"),
);
sourcepub fn get_reverse(&self) -> bool
pub fn get_reverse(&self) -> bool
Returns whether this configuration has enabled reverse NFA compilation.
sourcepub fn get_nfa_size_limit(&self) -> Option<usize>
pub fn get_nfa_size_limit(&self) -> Option<usize>
Return the configured NFA size limit, if it exists, in the number of bytes of heap used.
sourcepub fn get_shrink(&self) -> bool
pub fn get_shrink(&self) -> bool
Return whether NFA shrinking is enabled.
sourcepub fn get_captures(&self) -> bool
👎Deprecated since 0.3.5: use get_which_captures instead
pub fn get_captures(&self) -> bool
Return whether NFA compilation is configured to produce capture states.
sourcepub fn get_which_captures(&self) -> WhichCaptures
pub fn get_which_captures(&self) -> WhichCaptures
Return what kinds of capture states will be compiled into an NFA.
sourcepub fn get_look_matcher(&self) -> LookMatcher
pub fn get_look_matcher(&self) -> LookMatcher
Return the look-around matcher for this NFA.