Struct regex_automata::nfa::thompson::Config

source ·

pub struct Config { /* private fields */ }

Expand description

The configuration used for a Thompson NFA compiler.

Implementations§

source §

impl Config

source

pub fn new() -> Config

Return a new default Thompson NFA compiler configuration.

source

pub fn utf8(self, yes: bool) -> Config

Whether to enable UTF-8 mode during search or not.

A regex engine is said to be in UTF-8 mode when it guarantees that all matches returned by it have spans consisting of only valid UTF-8. That is, it is impossible for a match span to be returned that contains any invalid UTF-8.

UTF-8 mode generally consists of two things:

Whether the NFA’s states are constructed such that all paths to a match state that consume at least one byte always correspond to valid UTF-8.
Whether all paths to a match state that do not consume any bytes should always correspond to valid UTF-8 boundaries.

(1) is a guarantee made by whoever constructs the NFA. If you’re parsing a regex from its concrete syntax, then syntax::Config::utf8 can make this guarantee for you. It does it by returning an error if the regex pattern could every report a non-empty match span that contains invalid UTF-8. So long as syntax::Config::utf8 mode is enabled and your regex successfully parses, then you’re guaranteed that the corresponding NFA will only ever report non-empty match spans containing valid UTF-8.

(2) is a trickier guarantee because it cannot be enforced by the NFA state graph itself. Consider, for example, the regex a*. It matches the empty strings in ☃ at positions 0, 1, 2 and 3, where positions 1 and 2 occur within the UTF-8 encoding of a codepoint, and thus correspond to invalid UTF-8 boundaries. Therefore, this guarantee must be made at a higher level than the NFA state graph itself. This crate deals with this case in each regex engine. Namely, when a zero-width match that splits a codepoint is found and UTF-8 mode enabled, then it is ignored and the engine moves on looking for the next match.

Thus, UTF-8 mode is both a promise that the NFA built only reports non-empty matches that are valid UTF-8, and an instruction to regex engines that empty matches that split codepoints should be banned.

Because UTF-8 mode is fundamentally about avoiding invalid UTF-8 spans, it only makes sense to enable this option when you know your haystack is valid UTF-8. (For example, a &str.) Enabling UTF-8 mode and searching a haystack that contains invalid UTF-8 leads to unspecified behavior.

Therefore, it may make sense to enable syntax::Config::utf8 while simultaneously disabling this option. That would ensure all non-empty match spans are valid UTF-8, but that empty match spans may still split a codepoint or match at other places that aren’t valid UTF-8.

In general, this mode is only relevant if your regex can match the empty string. Most regexes don’t.

This is enabled by default.

§Example

This example shows how UTF-8 mode can impact the match spans that may be reported in certain cases.

use regex_automata::{
    nfa::thompson::{self, pikevm::PikeVM},
    Match, Input,
};

let re = PikeVM::new("")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

// UTF-8 mode is enabled by default.
let mut input = Input::new("☃");
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..0)), caps.get_match());

// Even though an empty regex matches at 1..1, our next match is
// 3..3 because 1..1 and 2..2 split the snowman codepoint (which is
// three bytes long).
input.set_start(1);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 3..3)), caps.get_match());

// But if we disable UTF-8, then we'll get matches at 1..1 and 2..2:
let re = PikeVM::builder()
    .thompson(thompson::Config::new().utf8(false))
    .build("")?;
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 1..1)), caps.get_match());

input.set_start(2);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 2..2)), caps.get_match());

input.set_start(3);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 3..3)), caps.get_match());

input.set_start(4);
re.search(&mut cache, &input, &mut caps);
assert_eq!(None, caps.get_match());

source

pub fn reverse(self, yes: bool) -> Config

Reverse the NFA.

A NFA reversal is performed by reversing all of the concatenated sub-expressions in the original pattern, recursively. (Look around operators are also inverted.) The resulting NFA can be used to match the pattern starting from the end of a string instead of the beginning of a string.

Reversing the NFA is useful for building a reverse DFA, which is most useful for finding the start of a match after its ending position has been found. NFA execution engines typically do not work on reverse NFAs. For example, currently, the Pike VM reports the starting location of matches without a reverse NFA.

Currently, enabling this setting requires disabling the captures setting. If both are enabled, then the compiler will return an error. It is expected that this limitation will be lifted in the future.

This is disabled by default.

§Example

This example shows how to build a DFA from a reverse NFA, and then use the DFA to search backwards.

use regex_automata::{
    dfa::{self, Automaton},
    nfa::thompson::{NFA, WhichCaptures},
    HalfMatch, Input,
};

let dfa = dfa::dense::Builder::new()
    .thompson(NFA::config()
        .which_captures(WhichCaptures::None)
        .reverse(true)
    )
    .build("baz[0-9]+")?;
let expected = Some(HalfMatch::must(0, 3));
assert_eq!(
    expected,
    dfa.try_search_rev(&Input::new("foobaz12345bar"))?,
);

source

pub fn nfa_size_limit(self, bytes: Option<usize>) -> Config

Sets an approximate size limit on the total heap used by the NFA being compiled.

This permits imposing constraints on the size of a compiled NFA. This may be useful in contexts where the regex pattern is untrusted and one wants to avoid using too much memory.

This size limit does not apply to auxiliary heap used during compilation that is not part of the built NFA.

Note that this size limit is applied during compilation in order for the limit to prevent too much heap from being used. However, the implementation may use an intermediate NFA representation that is otherwise slightly bigger than the final public form. Since the size limit may be applied to an intermediate representation, there is not necessarily a precise correspondence between the configured size limit and the heap usage of the final NFA.

There is no size limit by default.

§Example

This example demonstrates how Unicode mode can greatly increase the size of the NFA.

use regex_automata::nfa::thompson::NFA;

// 300KB isn't enough!
NFA::compiler()
    .configure(NFA::config().nfa_size_limit(Some(300_000)))
    .build(r"\w{20}")
    .unwrap_err();

// ... but 400KB probably is.
let nfa = NFA::compiler()
    .configure(NFA::config().nfa_size_limit(Some(400_000)))
    .build(r"\w{20}")?;

assert_eq!(nfa.pattern_len(), 1);

source

pub fn shrink(self, yes: bool) -> Config

Apply best effort heuristics to shrink the NFA at the expense of more time/memory.

Generally speaking, if one is using an NFA to compile a DFA, then the extra time used to shrink the NFA will be more than made up for during DFA construction (potentially by a lot). In other words, enabling this can substantially decrease the overall amount of time it takes to build a DFA.

A reason to keep this disabled is if you want to compile an NFA and start using it as quickly as possible without needing to build a DFA, and you don’t mind using a bit of extra memory for the NFA. e.g., for an NFA simulation or for a lazy DFA.

NFA shrinking is currently most useful when compiling a reverse NFA with large Unicode character classes. In particular, it trades additional CPU time during NFA compilation in favor of generating fewer NFA states.

This is disabled by default because it can increase compile times quite a bit if you aren’t building a full DFA.

§Example

This example shows that NFA shrinking can lead to substantial space savings in some cases. Notice that, as noted above, we build a reverse DFA and use a pattern with a large Unicode character class.

use regex_automata::nfa::thompson::{NFA, WhichCaptures};

// Currently we have to disable captures when enabling reverse NFA.
let config = NFA::config()
    .which_captures(WhichCaptures::None)
    .reverse(true);
let not_shrunk = NFA::compiler()
    .configure(config.clone().shrink(false))
    .build(r"\w")?;
let shrunk = NFA::compiler()
    .configure(config.clone().shrink(true))
    .build(r"\w")?;

// While a specific shrink factor is not guaranteed, the savings can be
// considerable in some cases.
assert!(shrunk.states().len() * 2 < not_shrunk.states().len());

source

pub fn captures(self, yes: bool) -> Config

👎Deprecated since 0.3.5: use which_captures instead

Whether to include ‘Capture’ states in the NFA.

Currently, enabling this setting requires disabling the reverse setting. If both are enabled, then the compiler will return an error. It is expected that this limitation will be lifted in the future.

This is enabled by default.

§Example

This example demonstrates that some regex engines, like the Pike VM, require capturing states to be present in the NFA to report match offsets.

(Note that since this method is deprecated, the example below uses Config::which_captures to disable capture states.)

use regex_automata::nfa::thompson::{
    pikevm::PikeVM,
    NFA,
    WhichCaptures,
};

let re = PikeVM::builder()
    .thompson(NFA::config().which_captures(WhichCaptures::None))
    .build(r"[a-z]+")?;
let mut cache = re.create_cache();

assert!(re.is_match(&mut cache, "abc"));
assert_eq!(None, re.find(&mut cache, "abc"));

source

pub fn which_captures(self, which_captures: WhichCaptures) -> Config

Configures what kinds of capture groups are compiled into State::Capture states in a Thompson NFA.

Currently, using any option except for WhichCaptures::None requires disabling the reverse setting. If both are enabled, then the compiler will return an error. It is expected that this limitation will be lifted in the future.

This is set to WhichCaptures::All by default. Callers may wish to use WhichCaptures::Implicit in cases where one wants avoid the overhead of capture states for explicit groups. Usually this occurs when one wants to use the PikeVM only for determining the overall match. Otherwise, the PikeVM could use much more memory than is necessary.

§Example

This example demonstrates that some regex engines, like the Pike VM, require capturing states to be present in the NFA to report match offsets.

use regex_automata::nfa::thompson::{
    pikevm::PikeVM,
    NFA,
    WhichCaptures,
};

let re = PikeVM::builder()
    .thompson(NFA::config().which_captures(WhichCaptures::None))
    .build(r"[a-z]+")?;
let mut cache = re.create_cache();

assert!(re.is_match(&mut cache, "abc"));
assert_eq!(None, re.find(&mut cache, "abc"));

The same applies to the bounded backtracker:

use regex_automata::nfa::thompson::{
    backtrack::BoundedBacktracker,
    NFA,
    WhichCaptures,
};

let re = BoundedBacktracker::builder()
    .thompson(NFA::config().which_captures(WhichCaptures::None))
    .build(r"[a-z]+")?;
let mut cache = re.create_cache();

assert!(re.try_is_match(&mut cache, "abc")?);
assert_eq!(None, re.try_find(&mut cache, "abc")?);

source

pub fn look_matcher(self, m: LookMatcher) -> Config

Sets the look-around matcher that should be used with this NFA.

A look-around matcher determines how to match look-around assertions. In particular, some assertions are configurable. For example, the (?m:^) and (?m:$) assertions can have their line terminator changed from the default of \n to any other byte.

§Example

This shows how to change the line terminator for multi-line assertions.

use regex_automata::{
    nfa::thompson::{self, pikevm::PikeVM},
    util::look::LookMatcher,
    Match, Input,
};

let mut lookm = LookMatcher::new();
lookm.set_line_terminator(b'\x00');

let re = PikeVM::builder()
    .thompson(thompson::Config::new().look_matcher(lookm))
    .build(r"(?m)^[a-z]+$")?;
let mut cache = re.create_cache();

// Multi-line assertions now use NUL as a terminator.
assert_eq!(
    Some(Match::must(0, 1..4)),
    re.find(&mut cache, b"\x00abc\x00"),
);
// ... and \n is no longer recognized as a terminator.
assert_eq!(
    None,
    re.find(&mut cache, b"\nabc\n"),
);