Module aho_corasick::packed
source · Expand description
Provides packed multiple substring search, principally for a small number of patterns.
This sub-module provides vectorized routines for quickly finding
matches of a small number of patterns. In general, users of this crate
shouldn’t need to interface with this module directly, as the primary
AhoCorasick
searcher will use these routines
automatically as a prefilter when applicable. However, in some cases, callers
may want to bypass the Aho-Corasick machinery entirely and use this vectorized
searcher directly.
Overview
The primary types in this sub-module are:
Searcher
executes the actual search algorithm to report matches in a haystack.Builder
accumulates patterns incrementally and can construct aSearcher
.Config
permits tuning the searcher, and itself will produce aBuilder
(which can then be used to build aSearcher
). Currently, the only tuneable knob are the match semantics, but this may be expanded in the future.
Examples
This example shows how to create a searcher from an iterator of patterns.
By default, leftmost-first match semantics are used. (See the top-level
MatchKind
type for more details about match semantics, which apply
similarly to packed substring search.)
use aho_corasick::{packed::{MatchKind, Searcher}, PatternID};
let searcher = Searcher::new(["foobar", "foo"].iter().cloned())?;
let matches: Vec<PatternID> = searcher
.find_iter("foobar")
.map(|mat| mat.pattern())
.collect();
assert_eq!(vec![PatternID::ZERO], matches);
This example shows how to use Config
to change the match semantics to
leftmost-longest:
use aho_corasick::{packed::{Config, MatchKind}, PatternID};
let searcher = Config::new()
.match_kind(MatchKind::LeftmostLongest)
.builder()
.add("foo")
.add("foobar")
.build()?;
let matches: Vec<PatternID> = searcher
.find_iter("foobar")
.map(|mat| mat.pattern())
.collect();
assert_eq!(vec![PatternID::must(1)], matches);
Packed substring searching
Packed substring searching refers to the use of SIMD (Single Instruction, Multiple Data) to accelerate the detection of matches in a haystack. Unlike conventional algorithms, such as Aho-Corasick, SIMD algorithms for substring search tend to do better with a small number of patterns, where as Aho-Corasick generally maintains reasonably consistent performance regardless of the number of patterns you give it. Because of this, the vectorized searcher in this sub-module cannot be used as a general purpose searcher, since building the searcher may fail even when given a small number of patterns. However, in exchange, when searching for a small number of patterns, searching can be quite a bit faster than Aho-Corasick (sometimes by an order of magnitude).
The key take away here is that constructing a searcher from a list of patterns is a fallible operation with no clear rules for when it will fail. While the precise conditions under which building a searcher can fail is specifically an implementation detail, here are some common reasons:
- Too many patterns were given. Typically, the limit is on the order of 100 or so, but this limit may fluctuate based on available CPU features.
- The available packed algorithms require CPU features that aren’t available.
For example, currently, this crate only provides packed algorithms for
x86_64
andaarch64
. Therefore, constructing a packed searcher on any other target will always fail. - Zero patterns were given, or one of the patterns given was empty. Packed searchers require at least one pattern and that all patterns are non-empty.
- Something else about the nature of the patterns (typically based on heuristics) suggests that a packed searcher would perform very poorly, so no searcher is built.
Structs
- A builder for constructing a packed searcher from a collection of patterns.
- The configuration for a packed multiple pattern searcher.
- An iterator over non-overlapping matches from a packed searcher.
- A packed searcher for quickly finding occurrences of multiple patterns.
Enums
- A knob for controlling the match semantics of a packed multiple string searcher.