pub struct PikeVM { /* private fields */ }
Expand description

A virtual machine for executing regex searches with capturing groups.

Infallible APIs

Unlike most other regex engines in this crate, a PikeVM never returns an error at search time. It supports all Anchored configurations, never quits and works on haystacks of arbitrary length.

There are two caveats to mention though:

  • If an invalid pattern ID is given to a search via Anchored::Pattern, then the PikeVM will report “no match.” This is consistent with all other regex engines in this crate.
  • When using PikeVM::which_overlapping_matches with a PatternSet that has insufficient capacity to store all valid pattern IDs, then if a match occurs for a PatternID that cannot be inserted, it is silently dropped as if it did not match.

Advice

The PikeVM is generally the most “powerful” regex engine in this crate. “Powerful” in this context means that it can handle any regular expression that is parseable by regex-syntax and any size haystack. Regretably, the PikeVM is also simultaneously often the slowest regex engine in practice. This results in an annoying situation where one generally tries to pick any other regex engine (or perhaps none at all) before being forced to fall back to a PikeVM.

For example, a common strategy for dealing with capturing groups is to actually look for the overall match of the regex using a faster regex engine, like a lazy DFA. Once the overall match is found, one can then run the PikeVM on just the match span to find the spans of the capturing groups. In this way, the faster regex engine does the majority of the work, while the PikeVM only lends its power in a more limited role.

Unfortunately, this isn’t always possible because the faster regex engines don’t support all of the regex features in regex-syntax. This notably includes (and is currently limited to) Unicode word boundaries. So if your pattern has Unicode word boundaries, you typically can’t use a DFA-based regex engine at all (unless you enable heuristic support for it). (The one-pass DFA can handle Unicode word boundaries for anchored searches only, but in a cruel sort of joke, many Unicode features tend to result in making the regex not one-pass.)

Example

This example shows that the PikeVM implements Unicode word boundaries correctly by default.

use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re = PikeVM::new(r"\b\w+\b")?;
let mut cache = re.create_cache();

let mut it = re.find_iter(&mut cache, "Шерлок Холмс");
assert_eq!(Some(Match::must(0, 0..12)), it.next());
assert_eq!(Some(Match::must(0, 13..23)), it.next());
assert_eq!(None, it.next());

Implementations§

source§

impl PikeVM

source

pub fn new(pattern: &str) -> Result<PikeVM, BuildError>

Parse the given regular expression using the default configuration and return the corresponding PikeVM.

If you want a non-default configuration, then use the Builder to set your own configuration.

Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re = PikeVM::new("foo[0-9]+bar")?;
let mut cache = re.create_cache();
assert_eq!(
    Some(Match::must(0, 3..14)),
    re.find_iter(&mut cache, "zzzfoo12345barzzz").next(),
);
source

pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<PikeVM, BuildError>

Like new, but parses multiple patterns into a single “multi regex.” This similarly uses the default regex configuration.

Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re = PikeVM::new_many(&["[a-z]+", "[0-9]+"])?;
let mut cache = re.create_cache();

let mut it = re.find_iter(&mut cache, "abc 1 foo 4567 0 quux");
assert_eq!(Some(Match::must(0, 0..3)), it.next());
assert_eq!(Some(Match::must(1, 4..5)), it.next());
assert_eq!(Some(Match::must(0, 6..9)), it.next());
assert_eq!(Some(Match::must(1, 10..14)), it.next());
assert_eq!(Some(Match::must(1, 15..16)), it.next());
assert_eq!(Some(Match::must(0, 17..21)), it.next());
assert_eq!(None, it.next());
source

pub fn new_from_nfa(nfa: NFA) -> Result<PikeVM, BuildError>

Like new, but builds a PikeVM directly from an NFA. This is useful if you already have an NFA, or even if you hand-assembled the NFA.

Example

This shows how to hand assemble a regular expression via its HIR, compile an NFA from it and build a PikeVM from the NFA.

use regex_automata::{nfa::thompson::{NFA, pikevm::PikeVM}, Match};
use regex_syntax::hir::{Hir, Class, ClassBytes, ClassBytesRange};

let hir = Hir::class(Class::Bytes(ClassBytes::new(vec![
    ClassBytesRange::new(b'0', b'9'),
    ClassBytesRange::new(b'A', b'Z'),
    ClassBytesRange::new(b'_', b'_'),
    ClassBytesRange::new(b'a', b'z'),
])));

let config = NFA::config().nfa_size_limit(Some(1_000));
let nfa = NFA::compiler().configure(config).build_from_hir(&hir)?;

let re = PikeVM::new_from_nfa(nfa)?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let expected = Some(Match::must(0, 3..4));
re.captures(&mut cache, "!@#A#@!", &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn always_match() -> Result<PikeVM, BuildError>

Create a new PikeVM that matches every input.

Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re = PikeVM::always_match()?;
let mut cache = re.create_cache();

let expected = Match::must(0, 0..0);
assert_eq!(Some(expected), re.find_iter(&mut cache, "").next());
assert_eq!(Some(expected), re.find_iter(&mut cache, "foo").next());
source

pub fn never_match() -> Result<PikeVM, BuildError>

Create a new PikeVM that never matches any input.

Example
use regex_automata::nfa::thompson::pikevm::PikeVM;

let re = PikeVM::never_match()?;
let mut cache = re.create_cache();

assert_eq!(None, re.find_iter(&mut cache, "").next());
assert_eq!(None, re.find_iter(&mut cache, "foo").next());
source

pub fn config() -> Config

Return a default configuration for a PikeVM.

This is a convenience routine to avoid needing to import the Config type when customizing the construction of a PikeVM.

Example

This example shows how to disable UTF-8 mode. When UTF-8 mode is disabled, zero-width matches that split a codepoint are allowed. Otherwise they are never reported.

In the code below, notice that "" is permitted to match positions that split the encoding of a codepoint.

use regex_automata::{nfa::thompson::{self, pikevm::PikeVM}, Match};

let re = PikeVM::builder()
    .thompson(thompson::Config::new().utf8(false))
    .build(r"")?;
let mut cache = re.create_cache();

let haystack = "a☃z";
let mut it = re.find_iter(&mut cache, haystack);
assert_eq!(Some(Match::must(0, 0..0)), it.next());
assert_eq!(Some(Match::must(0, 1..1)), it.next());
assert_eq!(Some(Match::must(0, 2..2)), it.next());
assert_eq!(Some(Match::must(0, 3..3)), it.next());
assert_eq!(Some(Match::must(0, 4..4)), it.next());
assert_eq!(Some(Match::must(0, 5..5)), it.next());
assert_eq!(None, it.next());
source

pub fn builder() -> Builder

Return a builder for configuring the construction of a PikeVM.

This is a convenience routine to avoid needing to import the Builder type in common cases.

Example

This example shows how to use the builder to disable UTF-8 mode everywhere.

use regex_automata::{
    nfa::thompson::{self, pikevm::PikeVM},
    util::syntax,
    Match,
};

let re = PikeVM::builder()
    .syntax(syntax::Config::new().utf8(false))
    .thompson(thompson::Config::new().utf8(false))
    .build(r"foo(?-u:[^b])ar.*")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

let haystack = b"\xFEfoo\xFFarzz\xE2\x98\xFF\n";
let expected = Some(Match::must(0, 1..9));
re.captures(&mut cache, haystack, &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn create_captures(&self) -> Captures

Create a new empty set of capturing groups that is guaranteed to be valid for the search APIs on this PikeVM.

A Captures value created for a specific PikeVM cannot be used with any other PikeVM.

This is a convenience function for Captures::all. See the Captures documentation for an explanation of its alternative constructors that permit the PikeVM to do less work during a search, and thus might make it faster.

source

pub fn create_cache(&self) -> Cache

Create a new cache for this PikeVM.

The cache returned should only be used for searches for this PikeVM. If you want to reuse the cache for another PikeVM, then you must call Cache::reset with that PikeVM (or, equivalently, PikeVM::reset_cache).

source

pub fn reset_cache(&self, cache: &mut Cache)

Reset the given cache such that it can be used for searching with the this PikeVM (and only this PikeVM).

A cache reset permits reusing memory already allocated in this cache with a different PikeVM.

Example

This shows how to re-purpose a cache for use with a different PikeVM.

use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re1 = PikeVM::new(r"\w")?;
let re2 = PikeVM::new(r"\W")?;

let mut cache = re1.create_cache();
assert_eq!(
    Some(Match::must(0, 0..2)),
    re1.find_iter(&mut cache, "Δ").next(),
);

// Using 'cache' with re2 is not allowed. It may result in panics or
// incorrect results. In order to re-purpose the cache, we must reset
// it with the PikeVM we'd like to use it with.
//
// Similarly, after this reset, using the cache with 're1' is also not
// allowed.
re2.reset_cache(&mut cache);
assert_eq!(
    Some(Match::must(0, 0..3)),
    re2.find_iter(&mut cache, "☃").next(),
);
source

pub fn pattern_len(&self) -> usize

Returns the total number of patterns compiled into this PikeVM.

In the case of a PikeVM that contains no patterns, this returns 0.

Example

This example shows the pattern length for a PikeVM that never matches:

use regex_automata::nfa::thompson::pikevm::PikeVM;

let re = PikeVM::never_match()?;
assert_eq!(re.pattern_len(), 0);

And another example for a PikeVM that matches at every position:

use regex_automata::nfa::thompson::pikevm::PikeVM;

let re = PikeVM::always_match()?;
assert_eq!(re.pattern_len(), 1);

And finally, a PikeVM that was constructed from multiple patterns:

use regex_automata::nfa::thompson::pikevm::PikeVM;

let re = PikeVM::new_many(&["[0-9]+", "[a-z]+", "[A-Z]+"])?;
assert_eq!(re.pattern_len(), 3);
source

pub fn get_config(&self) -> &Config

Return the config for this PikeVM.

source

pub fn get_nfa(&self) -> &NFA

Returns a reference to the underlying NFA.

source§

impl PikeVM

source

pub fn is_match<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I ) -> bool

Returns true if and only if this PikeVM matches the given haystack.

This routine may short circuit if it knows that scanning future input will never lead to a different result. In particular, if the underlying NFA enters a match state, then this routine will return true immediately without inspecting any future input. (Consider how this might make a difference given the regex a+ on the haystack aaaaaaaaaaaaaaa. This routine can stop after it sees the first a, but routines like find need to continue searching because + is greedy by default.)

Example

This shows basic usage:

use regex_automata::nfa::thompson::pikevm::PikeVM;

let re = PikeVM::new("foo[0-9]+bar")?;
let mut cache = re.create_cache();

assert!(re.is_match(&mut cache, "foo12345bar"));
assert!(!re.is_match(&mut cache, "foobar"));
Example: consistency with search APIs

is_match is guaranteed to return true whenever find returns a match. This includes searches that are executed entirely within a codepoint:

use regex_automata::{nfa::thompson::pikevm::PikeVM, Input};

let re = PikeVM::new("a*")?;
let mut cache = re.create_cache();

assert!(!re.is_match(&mut cache, Input::new("☃").span(1..2)));

Notice that when UTF-8 mode is disabled, then the above reports a match because the restriction against zero-width matches that split a codepoint has been lifted:

use regex_automata::{nfa::thompson::{pikevm::PikeVM, NFA}, Input};

let re = PikeVM::builder()
    .thompson(NFA::config().utf8(false))
    .build("a*")?;
let mut cache = re.create_cache();

assert!(re.is_match(&mut cache, Input::new("☃").span(1..2)));
source

pub fn find<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I ) -> Option<Match>

Executes a leftmost forward search and returns a Match if one exists.

This routine only includes the overall match span. To get access to the individual spans of each capturing group, use PikeVM::captures.

Example

Leftmost first match semantics corresponds to the match with the smallest starting offset, but where the end offset is determined by preferring earlier branches in the original regular expression. For example, Sam|Samwise will match Sam in Samwise, but Samwise|Sam will match Samwise in Samwise.

Generally speaking, the “leftmost first” match is how most backtracking regular expressions tend to work. This is in contrast to POSIX-style regular expressions that yield “leftmost longest” matches. Namely, both Sam|Samwise and Samwise|Sam match Samwise when using leftmost longest semantics. (This crate does not currently support leftmost longest semantics.)

use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re = PikeVM::new("foo[0-9]+")?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..8);
assert_eq!(Some(expected), re.find(&mut cache, "foo12345"));

// Even though a match is found after reading the first byte (`a`),
// the leftmost first match semantics demand that we find the earliest
// match that prefers earlier parts of the pattern over later parts.
let re = PikeVM::new("abc|a")?;
let mut cache = re.create_cache();
let expected = Match::must(0, 0..3);
assert_eq!(Some(expected), re.find(&mut cache, "abc"));
source

pub fn captures<'h, I: Into<Input<'h>>>( &self, cache: &mut Cache, input: I, caps: &mut Captures )

Executes a leftmost forward search and writes the spans of capturing groups that participated in a match into the provided Captures value. If no match was found, then Captures::is_match is guaranteed to return false.

Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};

let re = PikeVM::new(r"^([0-9]{4})-([0-9]{2})-([0-9]{2})$")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());

re.captures(&mut cache, "2010-03-14", &mut caps);
assert!(caps.is_match());
assert_eq!(Some(Span::from(0..4)), caps.get_group(1));
assert_eq!(Some(Span::from(5..7)), caps.get_group(2));
assert_eq!(Some(Span::from(8..10)), caps.get_group(3));
source

pub fn find_iter<'r, 'c, 'h, I: Into<Input<'h>>>( &'r self, cache: &'c mut Cache, input: I ) -> FindMatches<'r, 'c, 'h>

Returns an iterator over all non-overlapping leftmost matches in the given bytes. If no match exists, then the iterator yields no elements.

Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};

let re = PikeVM::new("foo[0-9]+")?;
let mut cache = re.create_cache();

let text = "foo1 foo12 foo123";
let matches: Vec<Match> = re.find_iter(&mut cache, text).collect();
assert_eq!(matches, vec![
    Match::must(0, 0..4),
    Match::must(0, 5..10),
    Match::must(0, 11..17),
]);
source

pub fn captures_iter<'r, 'c, 'h, I: Into<Input<'h>>>( &'r self, cache: &'c mut Cache, input: I ) -> CapturesMatches<'r, 'c, 'h>

Returns an iterator over all non-overlapping Captures values. If no match exists, then the iterator yields no elements.

This yields the same matches as PikeVM::find_iter, but it includes the spans of all capturing groups that participate in each match.

Tip: See util::iter::Searcher for how to correctly iterate over all matches in a haystack while avoiding the creation of a new Captures value for every match. (Which you are forced to do with an Iterator.)

Example
use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};

let re = PikeVM::new("foo(?P<numbers>[0-9]+)")?;
let mut cache = re.create_cache();

let text = "foo1 foo12 foo123";
let matches: Vec<Span> = re
    .captures_iter(&mut cache, text)
    // The unwrap is OK since 'numbers' matches if the pattern matches.
    .map(|caps| caps.get_group_by_name("numbers").unwrap())
    .collect();
assert_eq!(matches, vec![
    Span::from(3..4),
    Span::from(8..10),
    Span::from(14..17),
]);
source§

impl PikeVM

source

pub fn search(&self, cache: &mut Cache, input: &Input<'_>, caps: &mut Captures)

Executes a leftmost forward search and writes the spans of capturing groups that participated in a match into the provided Captures value. If no match was found, then Captures::is_match is guaranteed to return false.

This is like PikeVM::captures, but it accepts a concrete &Input instead of an Into<Input>.

This example shows how to build a multi-PikeVM that permits searching for specific patterns.

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Anchored, Match, PatternID, Input,
};

let re = PikeVM::new_many(&["[a-z0-9]{6}", "[a-z][a-z0-9]{5}"])?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "foo123";

// Since we are using the default leftmost-first match and both
// patterns match at the same starting position, only the first pattern
// will be returned in this case when doing a search for any of the
// patterns.
let expected = Some(Match::must(0, 0..6));
re.search(&mut cache, &Input::new(haystack), &mut caps);
assert_eq!(expected, caps.get_match());

// But if we want to check whether some other pattern matches, then we
// can provide its pattern ID.
let expected = Some(Match::must(1, 0..6));
let input = Input::new(haystack)
    .anchored(Anchored::Pattern(PatternID::must(1)));
re.search(&mut cache, &input, &mut caps);
assert_eq!(expected, caps.get_match());

This example shows how providing the bounds of a search can produce different results than simply sub-slicing the haystack.

use regex_automata::{nfa::thompson::pikevm::PikeVM, Match, Input};

let re = PikeVM::new(r"\b[0-9]{3}\b")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "foo123bar";

// Since we sub-slice the haystack, the search doesn't know about
// the larger context and assumes that `123` is surrounded by word
// boundaries. And of course, the match position is reported relative
// to the sub-slice as well, which means we get `0..3` instead of
// `3..6`.
let expected = Some(Match::must(0, 0..3));
re.search(&mut cache, &Input::new(&haystack[3..6]), &mut caps);
assert_eq!(expected, caps.get_match());

// But if we provide the bounds of the search within the context of the
// entire haystack, then the search can take the surrounding context
// into account. (And if we did find a match, it would be reported
// as a valid offset into `haystack` instead of its sub-slice.)
let expected = None;
let input = Input::new(haystack).range(3..6);
re.search(&mut cache, &input, &mut caps);
assert_eq!(expected, caps.get_match());
source

pub fn search_slots( &self, cache: &mut Cache, input: &Input<'_>, slots: &mut [Option<NonMaxUsize>] ) -> Option<PatternID>

Executes a leftmost forward search and writes the spans of capturing groups that participated in a match into the provided slots, and returns the matching pattern ID. The contents of the slots for patterns other than the matching pattern are unspecified. If no match was found, then None is returned and the contents of slots is unspecified.

This is like PikeVM::search, but it accepts a raw slots slice instead of a Captures value. This is useful in contexts where you don’t want or need to allocate a Captures.

It is legal to pass any number of slots to this routine. If the regex engine would otherwise write a slot offset that doesn’t fit in the provided slice, then it is simply skipped. In general though, there are usually three slice lengths you might want to use:

  • An empty slice, if you only care about which pattern matched.
  • A slice with pattern_len() * 2 slots, if you only care about the overall match spans for each matching pattern.
  • A slice with slot_len() slots, which permits recording match offsets for every capturing group in every pattern.
Example

This example shows how to find the overall match offsets in a multi-pattern search without allocating a Captures value. Indeed, we can put our slots right on the stack.

use regex_automata::{nfa::thompson::pikevm::PikeVM, PatternID, Input};

let re = PikeVM::new_many(&[
    r"\pL+",
    r"\d+",
])?;
let mut cache = re.create_cache();
let input = Input::new("!@#123");

// We only care about the overall match offsets here, so we just
// allocate two slots for each pattern. Each slot records the start
// and end of the match.
let mut slots = [None; 4];
let pid = re.search_slots(&mut cache, &input, &mut slots);
assert_eq!(Some(PatternID::must(1)), pid);

// The overall match offsets are always at 'pid * 2' and 'pid * 2 + 1'.
// See 'GroupInfo' for more details on the mapping between groups and
// slot indices.
let slot_start = pid.unwrap().as_usize() * 2;
let slot_end = slot_start + 1;
assert_eq!(Some(3), slots[slot_start].map(|s| s.get()));
assert_eq!(Some(6), slots[slot_end].map(|s| s.get()));
source

pub fn which_overlapping_matches( &self, cache: &mut Cache, input: &Input<'_>, patset: &mut PatternSet )

Writes the set of patterns that match anywhere in the given search configuration to patset. If multiple patterns match at the same position and this PikeVM was configured with MatchKind::All semantics, then all matching patterns are written to the given set.

Unless all of the patterns in this PikeVM are anchored, then generally speaking, this will visit every byte in the haystack.

This search routine does not clear the pattern set. This gives some flexibility to the caller (e.g., running multiple searches with the same pattern set), but does make the API bug-prone if you’re reusing the same pattern set for multiple searches but intended them to be independent.

If a pattern ID matched but the given PatternSet does not have sufficient capacity to store it, then it is not inserted and silently dropped.

Example

This example shows how to find all matching patterns in a haystack, even when some patterns match at the same position as other patterns.

use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Input, MatchKind, PatternSet,
};

let patterns = &[
    r"\w+", r"\d+", r"\pL+", r"foo", r"bar", r"barfoo", r"foobar",
];
let re = PikeVM::builder()
    .configure(PikeVM::config().match_kind(MatchKind::All))
    .build_many(patterns)?;
let mut cache = re.create_cache();

let input = Input::new("foobar");
let mut patset = PatternSet::new(re.pattern_len());
re.which_overlapping_matches(&mut cache, &input, &mut patset);
let expected = vec![0, 2, 3, 4, 6];
let got: Vec<usize> = patset.iter().map(|p| p.as_usize()).collect();
assert_eq!(expected, got);

Trait Implementations§

source§

impl Clone for PikeVM

source§

fn clone(&self) -> PikeVM

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for PikeVM

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.