Struct regex_automata::meta::Config

pub struct Config { /* private fields */ }

An object describing the configuration of a Regex.

This configuration only includes options for the non-syntax behavior of a Regex, and can be applied via the Builder::configure method. For configuring the syntax options, see util::syntax::Config.

Example: lower the NFA size limit

In some cases, the default size limit might be too big. The size limit can be lowered, which will prevent large regex patterns from compiling.

use regex_automata::meta::Regex;

let result = Regex::builder()
    .configure(Regex::config().nfa_size_limit(Some(20 * (1<<10))))
    // Not even 20KB is enough to build a single large Unicode class!
    .build(r"\pL");
assert!(result.is_err());

Implementations

impl Config

pub fn new() -> Config

Create a new configuration object for a Regex.

pub fn match_kind(self, kind: MatchKind) -> Config

Set the match semantics for a Regex.

The default value is MatchKind::LeftmostFirst.

Example

use regex_automata::{meta::Regex, Match, MatchKind};

// By default, leftmost-first semantics are used, which
// disambiguates matches at the same position by selecting
// the one that appears earlier in the pattern.
let re = Regex::new("sam|samwise")?;
assert_eq!(Some(Match::must(0, 0..3)), re.find("samwise"));

// But with 'all' semantics, match priority is ignored
// and all match states are included. When coupled with
// a leftmost search, the search will report the last
// possible match.
let re = Regex::builder()
    .configure(Regex::config().match_kind(MatchKind::All))
    .build("sam|samwise")?;
assert_eq!(Some(Match::must(0, 0..7)), re.find("samwise"));
// Beware that this can lead to skipping matches!
// Usually 'all' is used for anchored reverse searches
// only, or for overlapping searches.
assert_eq!(Some(Match::must(0, 4..11)), re.find("sam samwise"));

Ok::<(), Box<dyn std::error::Error>>(())

pub fn utf8_empty(self, yes: bool) -> Config

Toggles whether empty matches are permitted to occur between the code units of a UTF-8 encoded codepoint.

This should generally be enabled when searching a &str or anything that you otherwise know is valid UTF-8. It should be disabled in all other cases. Namely, if the haystack is not valid UTF-8 and this is enabled, then behavior is unspecified.

By default, this is enabled.

Example

use regex_automata::{meta::Regex, Match};

let re = Regex::new("")?;
let got: Vec<Match> = re.find_iter("☃").collect();
// Matches only occur at the beginning and end of the snowman.
assert_eq!(got, vec![
    Match::must(0, 0..0),
    Match::must(0, 3..3),
]);

let re = Regex::builder()
    .configure(Regex::config().utf8_empty(false))
    .build("")?;
let got: Vec<Match> = re.find_iter("☃").collect();
// Matches now occur at every position!
assert_eq!(got, vec![
    Match::must(0, 0..0),
    Match::must(0, 1..1),
    Match::must(0, 2..2),
    Match::must(0, 3..3),
]);

Ok::<(), Box<dyn std::error::Error>>(())
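
The rule behind this option can be sketched with the standard library alone. The following is a minimal illustration, not the crate's implementation: the helper empty_match_offsets is a hypothetical name, and it models an empty pattern by listing every offset, then filtering out the offsets that split a codepoint when the option is on.

```rust
/// Returns the offsets at which an empty pattern could match in `hay`.
///
/// When `utf8_empty` is true, offsets that fall between the code units of
/// a multi-byte UTF-8 sequence are skipped. (Hypothetical helper; it only
/// models the rule described in the text.)
fn empty_match_offsets(hay: &str, utf8_empty: bool) -> Vec<usize> {
    (0..=hay.len())
        .filter(|&i| !utf8_empty || hay.is_char_boundary(i))
        .collect()
}

fn main() {
    // The snowman is encoded as three bytes, so offsets 1 and 2 fall
    // inside the codepoint.
    assert_eq!(empty_match_offsets("☃", true), vec![0, 3]);
    assert_eq!(empty_match_offsets("☃", false), vec![0, 1, 2, 3]);
}
```

The offsets agree with the spans in the example above: two matches with the option enabled, four with it disabled.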

pub fn auto_prefilter(self, yes: bool) -> Config

Toggles whether automatic prefilter support is enabled.

If this is disabled and Config::prefilter is not set, then the meta regex engine will not use any prefilters. This can sometimes be beneficial in cases where you know (or have measured) that the prefilter leads to overall worse search performance.

By default, this is enabled.

Example

use regex_automata::{meta::Regex, Match};

let re = Regex::builder()
    .configure(Regex::config().auto_prefilter(false))
    .build(r"Bruce \w+")?;
let hay = "Hello Bruce Springsteen!";
assert_eq!(Some(Match::must(0, 6..23)), re.find(hay));

Ok::<(), Box<dyn std::error::Error>>(())

pub fn prefilter(self, pre: Option<Prefilter>) -> Config

Overrides and sets the prefilter to use inside a Regex.

This permits one to forcefully set a prefilter in cases where the caller knows better than whatever the automatic prefilter logic is capable of.

By default, this is set to None and an automatic prefilter will be used if one could be built. (Assuming Config::auto_prefilter is enabled, which it is by default.)

Example

This example shows how to set your own prefilter. For a pattern like Bruce \w+, the automatic prefilter will likely look for occurrences of the literal "Bruce " (including the trailing space). In most cases, this is the best choice. But in some cases, running memchr on B may be faster. One can achieve that by overriding the automatic prefilter logic and providing a prefilter that just matches B.

use regex_automata::{
    meta::Regex,
    util::prefilter::Prefilter,
    Match, MatchKind,
};

let pre = Prefilter::new(MatchKind::LeftmostFirst, &["B"])
    .expect("a prefilter");
let re = Regex::builder()
    .configure(Regex::config().prefilter(Some(pre)))
    .build(r"Bruce \w+")?;
let hay = "Hello Bruce Springsteen!";
assert_eq!(Some(Match::must(0, 6..23)), re.find(hay));

Ok::<(), Box<dyn std::error::Error>>(())

Example: incorrect prefilters can lead to incorrect results!

Be warned that setting an incorrect prefilter can lead to missed matches. So if you use this option, ensure your prefilter can never report false negatives. (A false positive is, on the other hand, quite okay and generally unavoidable.)

use regex_automata::{
    meta::Regex,
    util::prefilter::Prefilter,
    Match, MatchKind,
};

let pre = Prefilter::new(MatchKind::LeftmostFirst, &["Z"])
    .expect("a prefilter");
let re = Regex::builder()
    .configure(Regex::config().prefilter(Some(pre)))
    .build(r"Bruce \w+")?;
let hay = "Hello Bruce Springsteen!";
// Oops! No match found, but there should be one!
assert_eq!(None, re.find(hay));

Ok::<(), Box<dyn std::error::Error>>(())
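
Why a false negative is fatal while a false positive is harmless can be seen in a small standalone model. This is a sketch under stated assumptions, not how the meta regex engine is written: match_at is a hypothetical hand-written matcher for Bruce \w+ (with an ASCII-only \w), and the prefilter is modeled as a scan for a single byte.

```rust
/// Toy matcher for `Bruce \w+` (ASCII-only `\w`), anchored at `start`.
/// Returns the end offset of the match, if any. (Hypothetical helper.)
fn match_at(hay: &[u8], start: usize) -> Option<usize> {
    const LITERAL: &[u8] = b"Bruce ";
    let rest = hay.get(start..)?;
    if !rest.starts_with(LITERAL) {
        return None;
    }
    let word_len = rest[LITERAL.len()..]
        .iter()
        .take_while(|b| b.is_ascii_alphanumeric() || **b == b'_')
        .count();
    if word_len == 0 {
        return None;
    }
    Some(start + LITERAL.len() + word_len)
}

/// Searches by jumping to each candidate reported by the prefilter and
/// confirming it with the real matcher.
fn find_with_prefilter(hay: &[u8], prefilter_byte: u8) -> Option<(usize, usize)> {
    let mut at = 0;
    while at < hay.len() {
        // The prefilter: skip ahead to the next occurrence of the byte.
        // If there is none, the search gives up without running the matcher.
        let offset = hay[at..].iter().position(|&b| b == prefilter_byte)?;
        let candidate = at + offset;
        if let Some(end) = match_at(hay, candidate) {
            return Some((candidate, end));
        }
        // False positive: the matcher rejected it, so resume just past it.
        at = candidate + 1;
    }
    None
}

fn main() {
    let hay = b"Hello Bruce Springsteen!";
    // A correct prefilter finds the match.
    assert_eq!(find_with_prefilter(hay, b'B'), Some((6, 23)));
    // A prefilter with false negatives never hands the matcher a candidate.
    assert_eq!(find_with_prefilter(hay, b'Z'), None);
}
```

A false positive only costs one rejected call to the matcher. A false negative means the matcher is never consulted at the position where the match begins.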

pub fn which_captures(self, which_captures: WhichCaptures) -> Config

Configures what kinds of groups are compiled as “capturing” in the underlying regex engine.

This is set to WhichCaptures::All by default. Callers may wish to use WhichCaptures::Implicit in cases where one wants to avoid the overhead of capture states for explicit groups.

Note that another approach to avoiding the overhead of capture groups is by using non-capturing groups in the regex pattern. That is, (?:a) instead of (a). This option is useful when you can’t control the concrete syntax but know that you don’t need the underlying capture states. For example, using WhichCaptures::Implicit will behave as if all explicit capturing groups in the pattern were non-capturing.

Setting this to WhichCaptures::None is usually not the right thing to do. When no capture states are compiled, some regex engines (such as the PikeVM) won’t be able to report match offsets. This will manifest as no match being found.

Example

This example demonstrates how the results of capture groups can change based on this option. First we show the default (all capture groups in the pattern are capturing):

use regex_automata::{meta::Regex, Match, Span};

let re = Regex::new(r"foo([0-9]+)bar")?;
let hay = "foo123bar";

let mut caps = re.create_captures();
re.captures(hay, &mut caps);
assert_eq!(Some(Span::from(0..9)), caps.get_group(0));
assert_eq!(Some(Span::from(3..6)), caps.get_group(1));

Ok::<(), Box<dyn std::error::Error>>(())

And now we show the behavior when we only include implicit capture groups. In this case, we can only find the overall match span, but the spans of any other explicit group don’t exist because they are treated as non-capturing. (In effect, when WhichCaptures::Implicit is used, there is no real point in using Regex::captures since it will never be able to report more information than Regex::find.)

use regex_automata::{
    meta::Regex,
    nfa::thompson::WhichCaptures,
    Match,
    Span,
};

let re = Regex::builder()
    .configure(Regex::config().which_captures(WhichCaptures::Implicit))
    .build(r"foo([0-9]+)bar")?;
let hay = "foo123bar";

let mut caps = re.create_captures();
re.captures(hay, &mut caps);
assert_eq!(Some(Span::from(0..9)), caps.get_group(0));
assert_eq!(None, caps.get_group(1));

Ok::<(), Box<dyn std::error::Error>>(())

pub fn nfa_size_limit(self, limit: Option<usize>) -> Config

Sets the size limit, in bytes, to enforce on the construction of every NFA built by the meta regex engine.

Setting it to None disables the limit. This is not recommended if you’re compiling untrusted patterns.

Note that this limit is applied to each NFA built, and if any of them exceed the limit, then construction will fail. This limit does not correspond to the total memory used by all NFAs in the meta regex engine.

This defaults to some reasonable number that permits most reasonable patterns.

Example

use regex_automata::meta::Regex;

let result = Regex::builder()
    .configure(Regex::config().nfa_size_limit(Some(20 * (1<<10))))
    // Not even 20KB is enough to build a single large Unicode class!
    .build(r"\pL");
assert!(result.is_err());

// But notice that building such a regex with the exact same limit
// can succeed depending on other aspects of the configuration. For
// example, a single *forward* NFA will (at time of writing) fit into
// the 20KB limit, but a *reverse* NFA of the same pattern will not.
// So if one configures a meta regex such that a reverse NFA is never
// needed and thus never built, then the 20KB limit will be enough for
// a pattern like \pL!
let result = Regex::builder()
    .configure(Regex::config()
        .nfa_size_limit(Some(20 * (1<<10)))
        // The DFAs are the only thing that (currently) need a reverse
        // NFA. So if both are disabled, the meta regex engine will
        // skip building the reverse NFA. Note that this isn't an API
        // guarantee. A future semver compatible version may introduce
        // new use cases for a reverse NFA.
        .hybrid(false)
        .dfa(false)
    )
    // With no reverse NFA to build, 20KB is now enough for this pattern.
    .build(r"\pL");
assert!(result.is_ok());
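
The point that the limit applies to each NFA separately, rather than to their sum, can be modeled in a few lines. This is only a sketch: check_nfa_sizes is a hypothetical function, and the byte sizes below are invented for illustration, not measurements of any real pattern.

```rust
/// Returns an error naming the first NFA whose size exceeds `limit`.
/// A limit of `None` disables the check. (Hypothetical model.)
fn check_nfa_sizes(nfas: &[(&str, usize)], limit: Option<usize>) -> Result<(), String> {
    let limit = match limit {
        Some(limit) => limit,
        None => return Ok(()),
    };
    for &(name, size) in nfas {
        if size > limit {
            return Err(format!("{} NFA needs {} bytes, limit is {}", name, size, limit));
        }
    }
    Ok(())
}

fn main() {
    // 20 * (1 << 10) is 20 KiB, i.e. 20,480 bytes.
    let limit = Some(20 * (1 << 10));
    // Invented sizes: the forward NFA fits, the reverse NFA does not.
    let forward: (&str, usize) = ("forward", 18 * 1024);
    let reverse: (&str, usize) = ("reverse", 30 * 1024);

    // One oversized NFA is enough to fail construction.
    assert!(check_nfa_sizes(&[forward, reverse], limit).is_err());
    // If the reverse NFA is never built, the same limit suffices.
    assert!(check_nfa_sizes(&[forward], limit).is_ok());
    // The limit is per NFA: two 15 KiB NFAs pass even though they total 30 KiB.
    assert!(check_nfa_sizes(&[("a", 15 * 1024), ("b", 15 * 1024)], limit).is_ok());
}
```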

pub fn onepass_size_limit(self, limit: Option<usize>) -> Config

Sets the size limit, in bytes, for the one-pass DFA.

Setting it to None disables the limit. Disabling the limit is strongly discouraged when compiling untrusted patterns. Even if the patterns are trusted, it still may not be a good idea, since a one-pass DFA can use a lot of memory. With that said, as the size of a regex increases, the likelihood of it being one-pass decreases.

This defaults to some reasonable number that permits most reasonable one-pass patterns.

Example

This shows how to set the one-pass DFA size limit. Note that since a one-pass DFA is an optional component of the meta regex engine, this size limit only impacts what is built internally and will never determine whether a Regex itself fails to build.

use regex_automata::meta::Regex;

let result = Regex::builder()
    .configure(Regex::config().onepass_size_limit(Some(2 * (1<<20))))
    .build(r"\pL{5}");
assert!(result.is_ok());

pub fn hybrid_cache_capacity(self, limit: usize) -> Config

Set the cache capacity, in bytes, for the lazy DFA.

The cache capacity of the lazy DFA determines approximately how much heap memory it is allowed to use to store its state transitions. The state transitions are computed at search time, and if the cache fills up, it is cleared. At this point, any previously generated state transitions are lost and are re-generated if they’re needed again.

This sort of cache filling and clearing works quite well so long as cache clearing happens infrequently. If it happens too often, then the meta regex engine will stop using the lazy DFA and switch over to a different regex engine.

In cases where the cache is cleared too often, it may be possible to give the cache more space and reduce (or eliminate) how often it is cleared. Similarly, sometimes a regex is so big that the lazy DFA isn’t used at all if its cache capacity isn’t big enough.

The capacity set here is a limit on how much memory is used. The actual memory used is only allocated as it’s needed.

Determining the right value for this is a little tricky and will likely require some profiling. Enabling the logging feature and setting the log level to trace will also tell you how often the cache is being cleared.

Example

use regex_automata::meta::Regex;

let result = Regex::builder()
    .configure(Regex::config().hybrid_cache_capacity(20 * (1<<20)))
    .build(r"\pL{5}");
assert!(result.is_ok());
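
The fill-and-clear behavior described above can be modeled with a small standalone cache. This is a sketch, not the lazy DFA itself: LazyCache and count_clears are hypothetical names, capacity is counted in entries rather than bytes, and the transition function is an arbitrary stand-in for real determinization.

```rust
use std::collections::HashMap;

/// A toy transition cache. Capacity is counted in entries rather than
/// bytes to keep the sketch small.
struct LazyCache {
    capacity: usize,
    transitions: HashMap<(u32, u8), u32>,
    clear_count: usize,
}

impl LazyCache {
    fn new(capacity: usize) -> LazyCache {
        LazyCache { capacity, transitions: HashMap::new(), clear_count: 0 }
    }

    /// Looks up a transition, computing and caching it on a miss. If the
    /// cache is already full, everything in it is discarded first.
    fn next_state(&mut self, state: u32, byte: u8, compute: impl Fn(u32, u8) -> u32) -> u32 {
        if let Some(&next) = self.transitions.get(&(state, byte)) {
            return next;
        }
        if self.transitions.len() >= self.capacity {
            self.transitions.clear();
            self.clear_count += 1;
        }
        let next = compute(state, byte);
        self.transitions.insert((state, byte), next);
        next
    }
}

/// Runs a toy four-state automaton over `hay` and returns how many times
/// the cache had to be cleared.
fn count_clears(hay: &[u8], capacity: usize) -> usize {
    let mut cache = LazyCache::new(capacity);
    let mut state = 0u32;
    for &byte in hay {
        // Arbitrary transition function, chosen only for illustration.
        state = cache.next_state(state, byte, |s, b| (s + u32::from(b)) % 4);
    }
    cache.clear_count
}

fn main() {
    // This input touches eight distinct transitions.
    let hay = b"abababab";
    // With room for all eight, the cache is never cleared.
    assert_eq!(count_clears(hay, 8), 0);
    // With room for only two, it is cleared three times.
    assert_eq!(count_clears(hay, 2), 3);
}
```

In this model a larger capacity removes the clears entirely, which is the effect the text suggests looking for when profiling.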

pub fn dfa_size_limit(self, limit: Option<usize>) -> Config

Sets the size limit, in bytes, for heap memory used for a fully compiled DFA.

NOTE: If you increase this, you’ll likely also need to increase Config::dfa_state_limit.

In contrast to the lazy DFA, building a full DFA requires computing all of its state transitions up front. This can be a very expensive process, and runs in worst case 2^n time and space (where n is proportional to the size of the regex). However, a full DFA unlocks some additional optimization opportunities.

Because full DFAs can be so expensive, the default limits for them are incredibly small. Generally speaking, if your regex is moderately big or if you’re using Unicode features (\w is Unicode-aware by default for example), then you can expect that the meta regex engine won’t even attempt to build a DFA for it.

If this and Config::dfa_state_limit are set to None, then the meta regex will not use any sort of limits when deciding whether to build a DFA. This in turn makes construction of a Regex take worst case exponential time and space. Even short patterns can result in huge space blow ups. So it is strongly recommended to keep some kind of limit set!

The default is set to a small number that permits some simple regexes to get compiled into DFAs in reasonable time.

Example

use regex_automata::meta::Regex;

let result = Regex::builder()
    // 100MB is much bigger than the default.
    .configure(Regex::config()
        .dfa_size_limit(Some(100 * (1<<20)))
        // We don't care about size too much here, so just
        // remove the NFA state limit altogether.
        .dfa_state_limit(None))
    .build(r"\pL{5}");
assert!(result.is_ok());
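
The exponential worst case mentioned above can be observed with a textbook subset construction. The sketch below is not the crate's determinizer. It hard-codes the NFA for the pattern [ab]*a[ab]{n}, which has only n + 2 states, and counts the DFA states reachable from it. The function name dfa_state_count is hypothetical.

```rust
use std::collections::HashSet;

/// Counts the DFA states produced by subset construction for the pattern
/// `[ab]*a[ab]{n}`. Sets of NFA states are stored as bitmasks, so `n`
/// must be at most 62.
fn dfa_state_count(n: u32) -> usize {
    // NFA state i is bit i. State 0 loops on both bytes and also moves to
    // state 1 on 'a'. State i (1 <= i <= n) moves to state i + 1 on either
    // byte. State n + 1 is the match state and has no transitions.
    let last = n + 1;
    let step = |set: u64, is_a: bool| -> u64 {
        let mut next = 0u64;
        for i in 0..=last {
            if set & (1u64 << i) == 0 {
                continue;
            }
            if i == 0 {
                next |= 1u64;
                if is_a {
                    next |= 1u64 << 1;
                }
            } else if i < last {
                next |= 1u64 << (i + 1);
            }
        }
        next
    };

    // Explore every set of NFA states reachable from the start set {0}.
    let mut seen: HashSet<u64> = HashSet::new();
    let mut stack = vec![1u64];
    seen.insert(1u64);
    while let Some(set) = stack.pop() {
        for is_a in [true, false] {
            let next = step(set, is_a);
            if seen.insert(next) {
                stack.push(next);
            }
        }
    }
    seen.len()
}

fn main() {
    // The NFA grows by one state per step, but the DFA doubles.
    assert_eq!(dfa_state_count(0), 2);
    assert_eq!(dfa_state_count(1), 4);
    assert_eq!(dfa_state_count(3), 16);
    assert_eq!(dfa_state_count(10), 2048);
}
```

The DFA has to remember which of the last n + 1 bytes were a, so it needs 2^(n+1) states. This is why a short pattern can still produce a very large DFA.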

pub fn dfa_state_limit(self, limit: Option<usize>) -> Config

Sets a limit on the total number of NFA states, beyond which no attempt is made to compile a full DFA.

This limit works in concert with Config::dfa_size_limit. Namely, whereas Config::dfa_size_limit is applied by attempting to construct a DFA, this limit is used to avoid the attempt in the first place. This is useful to avoid hefty initialization costs associated with building a DFA for cases where it is obvious the DFA will ultimately be too big.

By default, this is set to a very small number.

Example

use regex_automata::meta::Regex;

let result = Regex::builder()
    .configure(Regex::config()
        // Sometimes the default state limit rejects DFAs even
        // if they would fit in the size limit. Here, we disable
        // the check on the number of NFA states and just rely on
        // the size limit.
        .dfa_state_limit(None))
    .build(r"(?-u)\w{30}");
assert!(result.is_ok());
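
How the two limits cooperate can be summarized as a two-stage decision. The following is a model of the description above rather than the engine's actual code: DfaOutcome and decide are hypothetical names, and the state counts and byte sizes are invented inputs.

```rust
/// Outcome of deciding whether a full DFA can be used.
#[derive(Debug, PartialEq)]
enum DfaOutcome {
    /// The NFA had too many states, so no build was attempted.
    Skipped,
    /// A build was attempted but exceeded the size limit.
    TooBig,
    /// The DFA fit within both limits.
    Built,
}

/// Models the order in which the two limits are consulted.
fn decide(
    nfa_states: usize,
    dfa_bytes: usize,
    state_limit: Option<usize>,
    size_limit: Option<usize>,
) -> DfaOutcome {
    // Cheap check first: no construction work has happened yet.
    if state_limit.map_or(false, |limit| nfa_states > limit) {
        return DfaOutcome::Skipped;
    }
    // Expensive step: in a real engine, determinization runs here and
    // fails once the size limit is crossed.
    if size_limit.map_or(false, |limit| dfa_bytes > limit) {
        return DfaOutcome::TooBig;
    }
    DfaOutcome::Built
}

fn main() {
    // A state limit can reject a DFA that would have fit the size limit.
    assert_eq!(decide(100, 1_000, Some(30), Some(2_000)), DfaOutcome::Skipped);
    // Removing the state limit leaves only the size limit in play.
    assert_eq!(decide(100, 1_000, None, Some(2_000)), DfaOutcome::Built);
    assert_eq!(decide(100, 5_000, None, Some(2_000)), DfaOutcome::TooBig);
}
```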

pub fn byte_classes(self, yes: bool) -> Config

Whether to attempt to shrink the size of the alphabet for the regex pattern or not. When enabled, the alphabet is shrunk into a set of equivalence classes, where every byte in the same equivalence class cannot discriminate between a match or non-match.

WARNING: This is only useful for debugging DFAs. Disabling this does not yield any speed advantages. Indeed, disabling it can result in much higher memory usage. Disabling byte classes is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.

This option is enabled by default and should never be disabled unless one is debugging the meta regex engine’s internals.

Example

use regex_automata::{meta::Regex, Match};

let re = Regex::builder()
    .configure(Regex::config().byte_classes(false))
    .build(r"[a-z]+")?;
let hay = "!!quux!!";
assert_eq!(Some(Match::must(0, 2..6)), re.find(hay));

Ok::<(), Box<dyn std::error::Error>>(())
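
To make the idea of byte equivalence classes concrete, here is a minimal sketch that derives them from a list of byte ranges. It illustrates the concept only. The function byte_classes below is a hypothetical helper, not the crate's API, and a real pattern contributes many more ranges than the single one used here.

```rust
/// Assigns each of the 256 byte values to an equivalence class such that
/// bytes in the same class are never told apart by any of `ranges`.
/// Each range is inclusive: (start, end).
fn byte_classes(ranges: &[(u8, u8)]) -> [u8; 256] {
    // boundary[b] is true when a new class must begin right after byte b.
    let mut boundary = [false; 256];
    for &(start, end) in ranges {
        if start > 0 {
            boundary[usize::from(start) - 1] = true;
        }
        boundary[usize::from(end)] = true;
    }
    let mut classes = [0u8; 256];
    let mut class = 0u8;
    for b in 0..256 {
        classes[b] = class;
        if boundary[b] && b < 255 {
            class += 1;
        }
    }
    classes
}

fn main() {
    // For [a-z], every byte is either below 'a', inside a-z, or above 'z'.
    let classes = byte_classes(&[(b'a', b'z')]);
    assert_eq!(classes[usize::from(b'a')], classes[usize::from(b'q')]);
    assert_ne!(classes[usize::from(b'!')], classes[usize::from(b'a')]);
    // The alphabet shrinks from 256 symbols to 3.
    assert_eq!(usize::from(classes[255]) + 1, 3);
}
```

With three classes instead of 256 bytes, each state in a transition table needs far fewer entries, which is where the memory savings come from.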

pub fn line_terminator(self, byte: u8) -> Config

Set the line terminator to be used by the ^ and $ anchors in multi-line mode.

This option has no effect when CRLF mode is enabled. That is, regardless of this setting, (?Rm:^) and (?Rm:$) will always treat \r and \n as line terminators (and will never match between a \r and a \n).

By default, \n is the line terminator.

Warning: This does not change the behavior of the dot (.). To do that, you’ll need to configure the syntax option syntax::Config::line_terminator in addition to this. Otherwise, . will continue to match any character other than \n.

Example

use regex_automata::{meta::Regex, util::syntax, Match};

let re = Regex::builder()
    .syntax(syntax::Config::new().multi_line(true))
    .configure(Regex::config().line_terminator(b'\x00'))
    .build(r"^foo$")?;
let hay = "\x00foo\x00";
assert_eq!(Some(Match::must(0, 1..4)), re.find(hay));

Ok::<(), Box<dyn std::error::Error>>(())
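
What the anchors check in multi-line mode can be written out directly. This is a standalone sketch of the assertions only, assuming CRLF mode is off. The functions is_line_start, is_line_end and find_foo_line are hypothetical helpers, with the last one hand-coding the pattern ^foo$.

```rust
/// Whether a multi-line `^` would match at offset `at`.
fn is_line_start(hay: &[u8], at: usize, terminator: u8) -> bool {
    at == 0 || hay[at - 1] == terminator
}

/// Whether a multi-line `$` would match at offset `at`.
fn is_line_end(hay: &[u8], at: usize, terminator: u8) -> bool {
    at == hay.len() || hay[at] == terminator
}

/// Toy search for `(?m)^foo$` using the given line terminator.
fn find_foo_line(hay: &[u8], terminator: u8) -> Option<(usize, usize)> {
    (0..hay.len()).find_map(|start| {
        let end = start + 3;
        // The literal is checked first, so `end` is in bounds by the time
        // the end-of-line assertion is evaluated.
        let matched = hay[start..].starts_with(b"foo")
            && is_line_start(hay, start, terminator)
            && is_line_end(hay, end, terminator);
        if matched {
            Some((start, end))
        } else {
            None
        }
    })
}

fn main() {
    let hay = b"\x00foo\x00";
    // With NUL as the terminator, foo sits on its own line.
    assert_eq!(find_foo_line(hay, b'\x00'), Some((1, 4)));
    // With the default \n, the NUL bytes are ordinary bytes.
    assert_eq!(find_foo_line(hay, b'\n'), None);
}
```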

pub fn hybrid(self, yes: bool) -> Config

Toggle whether the hybrid NFA/DFA (also known as the “lazy DFA”) should be available for use by the meta regex engine.

Enabling this does not necessarily mean that the lazy DFA will definitely be used. It just means that it will be available for use if the meta regex engine thinks it will be useful.

When the hybrid crate feature is enabled, then this is enabled by default. Otherwise, if the crate feature is disabled, then this is always disabled, regardless of its setting by the caller.

pub fn dfa(self, yes: bool) -> Config

Toggle whether a fully compiled DFA should be available for use by the meta regex engine.

Enabling this does not necessarily mean that a DFA will definitely be used. It just means that it will be available for use if the meta regex engine thinks it will be useful.

When the dfa-build crate feature is enabled, then this is enabled by default. Otherwise, if the crate feature is disabled, then this is always disabled, regardless of its setting by the caller.

pub fn onepass(self, yes: bool) -> Config

Toggle whether a one-pass DFA should be available for use by the meta regex engine.

Enabling this does not necessarily mean that a one-pass DFA will definitely be used. It just means that it will be available for use if the meta regex engine thinks it will be useful. (Indeed, a one-pass DFA can only be used when the regex is one-pass. See the dfa::onepass module for more details.)

When the dfa-onepass crate feature is enabled, then this is enabled by default. Otherwise, if the crate feature is disabled, then this is always disabled, regardless of its setting by the caller.

pub fn backtrack(self, yes: bool) -> Config

Toggle whether a bounded backtracking regex engine should be available for use by the meta regex engine.

Enabling this does not necessarily mean that a bounded backtracker will definitely be used. It just means that it will be available for use if the meta regex engine thinks it will be useful.

When the nfa-backtrack crate feature is enabled, then this is enabled by default. Otherwise, if the crate feature is disabled, then this is always disabled, regardless of its setting by the caller.
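
The same rule governs all four engine toggles (hybrid, dfa, onepass and backtrack): the crate feature decides whether the engine exists at all, and the configuration can only turn it off. A minimal model of that rule follows. The function engine_available is a hypothetical helper, and the feature is passed in as a plain boolean instead of being read from a real cfg.

```rust
/// Whether an optional engine may be used by the meta regex engine.
///
/// `compiled_in` stands for the crate feature, and `requested` for the
/// value passed to the matching toggle, with `None` meaning the caller
/// never set it.
fn engine_available(compiled_in: bool, requested: Option<bool>) -> bool {
    compiled_in && requested.unwrap_or(true)
}

fn main() {
    // Feature on, toggle untouched: enabled by default.
    assert!(engine_available(true, None));
    // Feature on, toggle off: disabled.
    assert!(!engine_available(true, Some(false)));
    // Feature off: disabled regardless of what the caller asks for.
    assert!(!engine_available(false, Some(true)));
}
```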