Struct regex_automata::Input
pub struct Input<'h> { /* private fields */ }
The parameters for a regex search including the haystack to search.
Regex searches have a few parameters, and those parameters have defaults
that work in the vast majority of cases. The Input type exists to make that
common case seamless while also providing an avenue for changing the
parameters of a search. In particular, this type enables doing so without a
combinatorial explosion of different methods and/or superfluous parameters
in the common cases.
An Input permits configuring the following things:
- Searching only a substring of a haystack, while taking the broader context into account for resolving look-around assertions.
- Whether to search for all patterns in a regex, or to only search for one pattern in particular.
- Whether to perform an anchored or unanchored search.
- Whether to report a match as early as possible.
All of these parameters, except for the haystack, have sensible default
values. This means that the minimal search configuration is simply a call
to Input::new with your haystack. Setting any other parameter is optional.
Moreover, for any H that implements AsRef<[u8]>, there exists a
From<H> for Input implementation. This is useful because many of the
search APIs in this crate accept an Into<Input>. This means you can
provide strings or byte strings to these routines directly, and they’ll
automatically get converted into an Input for you.
The lifetime parameter 'h refers to the lifetime of the haystack.
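To see why accepting an Into<Input> lets callers pass haystacks directly, here is a minimal std-only sketch of the conversion pattern. SearchInput and haystack_len are hypothetical stand-ins and not part of this crate. The sketch assumes the conversion is implemented for borrowed haystacks, in line with the signature of Input::new.

```rust
/// Hypothetical stand-in for `Input`: it only borrows the haystack bytes.
struct SearchInput<'h> {
    haystack: &'h [u8],
}

// One blanket impl covers &str, &String, &[u8], &Vec<u8> and so on.
impl<'h, H: ?Sized + AsRef<[u8]>> From<&'h H> for SearchInput<'h> {
    fn from(haystack: &'h H) -> SearchInput<'h> {
        SearchInput { haystack: haystack.as_ref() }
    }
}

/// Hypothetical routine that accepts anything convertible into a
/// `SearchInput`, similar in spirit to an `Into<Input>` parameter.
fn haystack_len<'h, I: Into<SearchInput<'h>>>(input: I) -> usize {
    let input: SearchInput<'h> = input.into();
    input.haystack.len()
}

fn main() {
    let owned = String::from("foobar");
    let bytes: Vec<u8> = vec![b'a', 0xFF, b'z'];
    // Strings and byte strings are all accepted without manual conversion.
    assert_eq!(haystack_len("foobar"), 6);
    assert_eq!(haystack_len(&owned), 6);
    assert_eq!(haystack_len(&bytes), 3);
    assert_eq!(haystack_len(&b"xy"[..]), 2);
}
```

The lifetime 'h ties the converted value to the borrowed haystack, so no bytes are copied by the conversion.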
Organization
The API of Input is split into a few different parts:
- A builder-like API that transforms an Input by value. Examples: Input::span and Input::anchored.
- A setter API that permits mutating parameters in place. Examples: Input::set_span and Input::set_anchored.
- A getter API that permits retrieving any of the search parameters. Examples: Input::get_span and Input::get_anchored.
- A few convenience getter routines that don’t conform to the above naming pattern due to how common they are. Examples: Input::haystack, Input::start and Input::end.
- Miscellaneous predicates and other helper routines that are useful in some contexts. Examples: Input::is_char_boundary.
An Input exposes so much because it is meant to be used by both callers of
regex engines and implementors of regex engines. A constraining factor is
that regex engines should accept a &Input as their lowest level API, which
means that implementors should only use the “getter” APIs of an Input.
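The split between builder, setter and getter styles can be illustrated with a small std-only model. Config and searched_bytes are hypothetical names chosen for this sketch, and the struct is far simpler than the real Input. The point is only to show which receiver type each style needs.

```rust
use std::ops::Range;

/// Hypothetical, pared-down search configuration (not the real `Input`).
#[derive(Debug)]
struct Config<'h> {
    haystack: &'h [u8],
    span: Range<usize>,
}

impl<'h> Config<'h> {
    fn new(haystack: &'h [u8]) -> Config<'h> {
        Config { haystack, span: 0..haystack.len() }
    }
    // Builder style: consume the value and return the updated value.
    fn span(mut self, span: Range<usize>) -> Config<'h> {
        self.set_span(span);
        self
    }
    // Setter style: mutate in place through `&mut self`.
    fn set_span(&mut self, span: Range<usize>) {
        self.span = span;
    }
    // Getter style: read-only access through `&self`.
    fn get_span(&self) -> Range<usize> {
        self.span.clone()
    }
    fn haystack(&self) -> &'h [u8] {
        self.haystack
    }
}

/// A hypothetical engine entry point. Holding only `&Config`, it can call
/// getters, but not builders (need ownership) or setters (need `&mut`).
fn searched_bytes<'h>(config: &Config<'h>) -> &'h [u8] {
    &config.haystack()[config.get_span()]
}

fn main() {
    let built = Config::new(b"foobar").span(2..4);
    let mut mutated = Config::new(b"foobar");
    mutated.set_span(2..4);
    // Both styles produce the same configuration.
    assert_eq!(searched_bytes(&built), &b"ob"[..]);
    assert_eq!(searched_bytes(&mutated), &b"ob"[..]);
}
```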
Valid bounds and search termination
An Input permits setting the bounds of a search via either Input::span or
Input::range. The bounds set must be valid, or else a panic will occur.
Bounds are valid if and only if:
- The bounds represent a valid range into the input’s haystack.
- or the end bound is a valid ending bound for the haystack and the start bound is exactly one greater than the end bound.
In the latter case, Input::is_done will return true, which indicates that
any search receiving such an input should immediately return with no match.
Note that while Input is used for reverse searches in this crate, the
Input::is_done predicate assumes a forward search. Because unsigned
offsets are used internally, there is no way to tell from only the offsets
whether a reverse search is done or not.
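The validity rule above can be written down as a pair of small predicates. This is a sketch under the assumption that the rule is exactly as stated in this section. The functions bounds_are_valid and is_done are hypothetical free functions for illustration and are not the crate's implementation.

```rust
/// Hypothetical helper mirroring the validity rule described above.
fn bounds_are_valid(haystack_len: usize, start: usize, end: usize) -> bool {
    // `end` must be a valid exclusive end bound for the haystack.
    let end_ok = end <= haystack_len;
    // `start` may be at most one greater than `end`. `checked_sub` yields
    // `None` when `start < end`, which is always acceptable.
    let start_ok = start.checked_sub(end).map_or(true, |diff| diff <= 1);
    end_ok && start_ok
}

/// Hypothetical helper mirroring the forward-search termination check.
fn is_done(start: usize, end: usize) -> bool {
    start > end
}

fn main() {
    let len = "foobar".len();
    // Ordinary ranges, including the empty range at the very end.
    assert!(bounds_are_valid(len, 0, 6) && !is_done(0, 6));
    assert!(bounds_are_valid(len, 6, 6) && !is_done(6, 6));
    // Start exactly one past the end: valid, and the search is done.
    assert!(bounds_are_valid(len, 7, 6) && is_done(7, 6));
    assert!(bounds_are_valid(len, 3, 2) && is_done(3, 2));
    // Start two past the end, or an end beyond the haystack: invalid.
    assert!(!bounds_are_valid(len, 8, 6));
    assert!(!bounds_are_valid(len, 0, 7));
}
```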
Regex engine support
Any regex engine accepting an Input must support at least the following
things:
- Searching a &[u8] for matches.
- Searching a substring of &[u8] for a match, such that any match reported must appear entirely within that substring.
- For a forwards search, a match should never be reported when Input::is_done returns true. (For reverse searches, termination should be handled outside of Input.)
Supporting other aspects of an Input is optional, but regex engines
should handle aspects they don’t support gracefully. How this is done is
generally up to the regex engine. For example, this crate generally treats
unsupported anchored modes as an error to report, but for simplicity, in
the meta regex engine, trying to search with an invalid pattern ID just
results in no match being reported.
Implementations
impl<'h> Input<'h>
pub fn new<H: ?Sized + AsRef<[u8]>>(haystack: &'h H) -> Input<'h>
Create a new search configuration for the given haystack.
pub fn span<S: Into<Span>>(self, span: S) -> Input<'h>
Set the span for this search.
The span given must correspond to valid bounds for this search’s haystack, as described under Panics below.
This routine is generic over how a span is provided. While a Span may be
given directly, one may also provide a std::ops::Range<usize>. To provide
anything supported by range syntax, use the Input::range method.
The default span is the entire haystack.
Note that Input::range overrides this method and vice versa.
Panics
This panics if the given span does not correspond to valid bounds in the haystack or the termination of a search.
Example
This example shows how the span of the search can impact whether a match is reported or not. This is particularly relevant for look-around operators, which might take things outside of the span into account when determining whether they match.
use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Match, Input,
};
// Look for 'at', but as a distinct word.
let re = PikeVM::new(r"\bat\b")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
// Our haystack contains 'at', but not as a distinct word.
let haystack = "batter";
// A standard search finds nothing, as expected.
let input = Input::new(haystack);
re.search(&mut cache, &input, &mut caps);
assert_eq!(None, caps.get_match());
// But if we wanted to search starting at position '1', we might
// slice the haystack. If we do this, it's impossible for the \b
// anchors to take the surrounding context into account! And thus,
// a match is produced.
let input = Input::new(&haystack[1..3]);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..2)), caps.get_match());
// But if we specify the span of the search instead of slicing the
// haystack, then the regex engine can "see" outside of the span
// and resolve the anchors correctly.
let input = Input::new(haystack).span(1..3);
re.search(&mut cache, &input, &mut caps);
assert_eq!(None, caps.get_match());
This may seem a little ham-fisted, but this scenario tends to come up if some other regex engine found the match span and now you need to re-process that span to look for capturing groups. (e.g., Run a faster DFA first, find a match, then run the PikeVM on just the match span to resolve capturing groups.) In order to implement that sort of logic correctly, you need to set the span on the search instead of slicing the haystack directly.
The other advantage of using this routine to specify the bounds of the
search is that the match offsets are still reported in terms of the
original haystack. For example, the second search in the example above
reported a match at position 0, even though at starts at offset 1, because
we sliced the haystack.
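To make the effect of surrounding context concrete, here is a std-only sketch of an ASCII word boundary check. The functions is_word_byte and is_word_boundary are hypothetical helpers written for this illustration. They do not reproduce the crate's Unicode-aware handling of \b.

```rust
/// ASCII-only notion of a word byte, used here for illustration.
fn is_word_byte(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// Hypothetical check for an ASCII `\b` at offset `at`: true when exactly
/// one of the two neighboring bytes is a word byte.
fn is_word_boundary(haystack: &[u8], at: usize) -> bool {
    let before = at > 0 && is_word_byte(haystack[at - 1]);
    let after = at < haystack.len() && is_word_byte(haystack[at]);
    before != after
}

fn main() {
    let haystack = b"batter";

    // Sliced haystack "at": the neighbors 'b' and 't' are gone, so both
    // ends of "at" look like word boundaries.
    let sliced = &haystack[1..3];
    assert!(is_word_boundary(sliced, 0));
    assert!(is_word_boundary(sliced, 2));

    // Full haystack with offsets 1 and 3: the neighbors are visible, so
    // neither offset is a word boundary.
    assert!(!is_word_boundary(haystack, 1));
    assert!(!is_word_boundary(haystack, 3));
}
```

Offsets 1 and 3 in the second half are expressed in terms of the original haystack, which is also how a search configured with a span reports its matches.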
pub fn range<R: RangeBounds<usize>>(self, range: R) -> Input<'h>
Like Input::span, but accepts any range instead.
The range given must correspond to valid bounds for this search’s haystack, as described under Panics below.
The default range is the entire haystack.
Note that Input::span overrides this method and vice versa.
Panics
This routine will panic if the given range could not be converted to a
valid Range. For example, this would panic when given 0..=usize::MAX since
it cannot be represented using a half-open interval in terms of usize.
This also panics if the given range does not correspond to valid bounds in the haystack or the termination of a search.
Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
let input = Input::new("foobar").range(2..=4);
assert_eq!(2..5, input.get_range());
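The conversion from arbitrary range syntax to a half-open range can be sketched with the standard RangeBounds trait. The helper to_half_open is hypothetical. It returns None where the routine documented above would panic, and it assumes that an unbounded end means the haystack length.

```rust
use std::ops::{Bound, Range, RangeBounds};

/// Hypothetical helper: convert any range into a half-open `Range<usize>`,
/// returning `None` when the result is not representable in `usize`.
fn to_half_open<R: RangeBounds<usize>>(
    range: R,
    haystack_len: usize,
) -> Option<Range<usize>> {
    let start = match range.start_bound() {
        Bound::Included(&i) => i,
        Bound::Excluded(&i) => i.checked_add(1)?,
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        // An inclusive end of `usize::MAX` has no exclusive equivalent.
        Bound::Included(&i) => i.checked_add(1)?,
        Bound::Excluded(&i) => i,
        // Assumption: an open end means "up to the end of the haystack".
        Bound::Unbounded => haystack_len,
    };
    Some(start..end)
}

fn main() {
    assert_eq!(to_half_open(2..=4, 6), Some(2..5));
    assert_eq!(to_half_open(2.., 6), Some(2..6));
    assert_eq!(to_half_open(..=3, 6), Some(0..4));
    assert_eq!(to_half_open(.., 6), Some(0..6));
    assert_eq!(to_half_open(0..=usize::MAX, 6), None);
}
```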
pub fn anchored(self, mode: Anchored) -> Input<'h>
Sets the anchor mode of a search.
When a search is anchored (so that’s Anchored::Yes or Anchored::Pattern),
a match must begin at the start of a search. When a search is not anchored
(that’s Anchored::No), regex engines will behave as if the pattern started
with a (?s-u:.)*?. This prefix permits a match to appear anywhere.
By default, the anchored mode is Anchored::No.
WARNING: this is subtly different than using a ^ at the start of your
regex. A ^ forces a regex to match exclusively at the start of a haystack,
regardless of where you begin your search. In contrast, anchoring a search
will allow your regex to match anywhere in your haystack, but the match
must start at the beginning of a search.
For example, consider the haystack aba and the following searches:
- The regex ^a is run with Anchored::No and searches aba starting at position 2. Since ^ requires the match to start at the beginning of the haystack and 2 > 0, no match is found.
- The regex a is run with Anchored::Yes and searches aba starting at position 2. This reports a match at [2, 3] since the match starts where the search started. Since there is no ^, there is no requirement for the match to start at the beginning of the haystack.
- The regex a is run with Anchored::Yes and searches aba starting at position 1. Since b corresponds to position 1 and since the search is anchored, it finds no match. While the regex matches at other positions, configuring the search to be anchored requires that it only report a match that begins at the same offset as the beginning of the search.
- The regex a is run with Anchored::No and searches aba starting at position 1. Since the search is not anchored and the regex does not start with ^, the search executes as if there is a (?s:.)*? prefix that permits it to match anywhere. Thus, it reports a match at [2, 3].
Note that the Anchored::Pattern mode is like Anchored::Yes, except it only
reports matches for a particular pattern.
Example
This demonstrates the differences between an anchored search and a pattern
that begins with ^ (as described in the above warning message).
use regex_automata::{
    nfa::thompson::pikevm::PikeVM,
    Anchored, Match, Input,
};
let haystack = "aba";
let re = PikeVM::new(r"^a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(2..3).anchored(Anchored::No);
re.search(&mut cache, &input, &mut caps);
// No match is found because 2 is not the beginning of the haystack,
// which is what ^ requires.
assert_eq!(None, caps.get_match());
let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(2..3).anchored(Anchored::Yes);
re.search(&mut cache, &input, &mut caps);
// An anchored search can still match anywhere in the haystack, it just
// must begin at the start of the search which is '2' in this case.
assert_eq!(Some(Match::must(0, 2..3)), caps.get_match());
let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(1..3).anchored(Anchored::Yes);
re.search(&mut cache, &input, &mut caps);
// No match is found since we start searching at offset 1 which
// corresponds to 'b'. Since there is no '(?s:.)*?' prefix, no match
// is found.
assert_eq!(None, caps.get_match());
let re = PikeVM::new(r"a")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let input = Input::new(haystack).span(1..3).anchored(Anchored::No);
re.search(&mut cache, &input, &mut caps);
// Since anchored=no, an implicit '(?s:.)*?' prefix was added to the
// pattern. Even though the search starts at 'b', the 'match anything'
// prefix allows the search to match 'a'.
let expected = Some(Match::must(0, 2..3));
assert_eq!(expected, caps.get_match());
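The difference between anchored and unanchored searching can also be modeled without a regex engine. The following std-only sketch uses a hypothetical find_literal function that searches for a literal needle, so it covers the three searches above that use the regex a but not the ^a case. It assumes the span lies within the haystack.

```rust
use std::ops::Range;

/// Hypothetical literal "engine": find `needle` within `span` of `haystack`.
/// When `anchored` is true, only a match beginning at `span.start` counts.
/// Assumes `span.end <= haystack.len()`.
fn find_literal(
    haystack: &[u8],
    needle: &[u8],
    span: Range<usize>,
    anchored: bool,
) -> Option<Range<usize>> {
    // Candidate start offsets: exactly one when anchored, otherwise every
    // offset in the span (the effect of an implicit "match anything" prefix).
    let last_start = if anchored { span.start } else { span.end };
    (span.start..=last_start).find_map(|at| {
        let end = at.checked_add(needle.len())?;
        // The whole match must fit inside the span.
        if end <= span.end && &haystack[at..end] == needle {
            Some(at..end)
        } else {
            None
        }
    })
}

fn main() {
    let haystack = b"aba";
    // Anchored at 2: the match starts exactly where the search starts.
    assert_eq!(find_literal(haystack, b"a", 2..3, true), Some(2..3));
    // Anchored at 1: 'b' is at offset 1, so nothing is reported.
    assert_eq!(find_literal(haystack, b"a", 1..3, true), None);
    // Unanchored from 1: the search may skip ahead to offset 2.
    assert_eq!(find_literal(haystack, b"a", 1..3, false), Some(2..3));
}
```

Match offsets are reported relative to the full haystack, not relative to the start of the span.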
pub fn earliest(self, yes: bool) -> Input<'h>
Whether to execute an “earliest” search or not.
When running a non-overlapping search, an “earliest” search will return
the match location as early as possible. For example, given a pattern of
foo[0-9]+ and a haystack of foo12345, a normal leftmost search will return
foo12345 as a match. But an “earliest” search for regex engines that
support “earliest” semantics will return foo1 as a match, since as soon as
the first digit following foo is seen, the engine knows it has found a
match.
Note that “earliest” semantics generally depend on the regex engine. Different regex engines may determine there is a match at different points. So there is no guarantee that “earliest” matches will always return the same offsets for all regex engines. The “earliest” notion is really about when the particular regex engine determines there is a match rather than a consistent semantic unto itself. This is often useful for implementing “did a match occur or not” predicates, but sometimes the offset is useful as well.
This is disabled by default.
Example
This example shows the difference between “earliest” searching and normal searching.
use regex_automata::{nfa::thompson::pikevm::PikeVM, Match, Input};
let re = PikeVM::new(r"foo[0-9]+")?;
let mut cache = re.create_cache();
let mut caps = re.create_captures();
// A normal search implements greediness like you expect.
let input = Input::new("foo12345");
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..8)), caps.get_match());
// When 'earliest' is enabled and the regex engine supports
// it, the search will bail once it knows a match has been
// found.
let input = Input::new("foo12345").earliest(true);
re.search(&mut cache, &input, &mut caps);
assert_eq!(Some(Match::must(0, 0..4)), caps.get_match());
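As a rough model of why the two searches stop at different offsets, here is a hand-written matcher for the specific pattern foo[0-9]+ that only matches at offset 0. The function match_foo_digits is hypothetical and is not how any engine in this crate is implemented. It only illustrates that one digit is enough to know a match exists.

```rust
use std::ops::Range;

/// Hypothetical hand-written matcher for `foo[0-9]+` at offset 0.
fn match_foo_digits(haystack: &[u8], earliest: bool) -> Option<Range<usize>> {
    if !haystack.starts_with(b"foo") {
        return None;
    }
    // Count the run of ASCII digits that follows the literal prefix.
    let digits = haystack[3..]
        .iter()
        .take_while(|b| b.is_ascii_digit())
        .count();
    if digits == 0 {
        return None;
    }
    // One digit already proves a match; a normal search keeps consuming.
    let taken = if earliest { 1 } else { digits };
    Some(0..(3 + taken))
}

fn main() {
    assert_eq!(match_foo_digits(b"foo12345", false), Some(0..8));
    assert_eq!(match_foo_digits(b"foo12345", true), Some(0..4));
    assert_eq!(match_foo_digits(b"foo", true), None);
}
```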
pub fn set_span<S: Into<Span>>(&mut self, span: S)
Set the span for this search configuration.
This is like the Input::span method, except this mutates the span in place.
This routine is generic over how a span is provided. While a Span may be
given directly, one may also provide a std::ops::Range<usize>.
Panics
This panics if the given span does not correspond to valid bounds in the haystack or the termination of a search.
Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_span(2..4);
assert_eq!(2..4, input.get_range());
pub fn set_range<R: RangeBounds<usize>>(&mut self, range: R)
Set the span for this search configuration given any range.
This is like the Input::range method, except this mutates the span in place.
The range given must correspond to valid bounds for this search’s haystack, as described under Panics below.
Panics
This routine will panic if the given range could not be converted to a
valid Range. For example, this would panic when given 0..=usize::MAX since
it cannot be represented using a half-open interval in terms of usize.
This also panics if the given span does not correspond to valid bounds in the haystack or the termination of a search.
Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_range(2..=4);
assert_eq!(2..5, input.get_range());
pub fn set_start(&mut self, start: usize)
Set the starting offset for the span for this search configuration.
This is a convenience routine for only mutating the start of a span without having to set the entire span.
Panics
This panics if the span resulting from the new start position does not correspond to valid bounds in the haystack or the termination of a search.
Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_start(5);
assert_eq!(5..6, input.get_range());
pub fn set_end(&mut self, end: usize)
Set the ending offset for the span for this search configuration.
This is a convenience routine for only mutating the end of a span without having to set the entire span.
Panics
This panics if the span resulting from the new end position does not correspond to valid bounds in the haystack or the termination of a search.
Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
input.set_end(5);
assert_eq!(0..5, input.get_range());
pub fn set_anchored(&mut self, mode: Anchored)
Set the anchor mode of a search.
This is like Input::anchored, except it mutates the search configuration
in place.
Example
use regex_automata::{Anchored, Input, PatternID};
let mut input = Input::new("foobar");
assert_eq!(Anchored::No, input.get_anchored());
let pid = PatternID::must(5);
input.set_anchored(Anchored::Pattern(pid));
assert_eq!(Anchored::Pattern(pid), input.get_anchored());
pub fn set_earliest(&mut self, yes: bool)
Set whether the search should execute in “earliest” mode or not.
This is like Input::earliest, except it mutates the search configuration
in place.
Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert!(!input.get_earliest());
input.set_earliest(true);
assert!(input.get_earliest());
pub fn haystack(&self) -> &[u8]
Return a borrow of the underlying haystack as a slice of bytes.
Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(b"foobar", input.haystack());
pub fn start(&self) -> usize
Return the start position of this search.
This is a convenience routine for search.get_span().start.
When Input::is_done is false, this is guaranteed to return an offset that
is less than or equal to Input::end. Otherwise, the offset is one greater
than Input::end.
Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(0, input.start());
let input = Input::new("foobar").span(2..4);
assert_eq!(2, input.start());
pub fn end(&self) -> usize
Return the end position of this search.
This is a convenience routine for search.get_span().end.
This is guaranteed to return an offset that is a valid exclusive end bound for this input’s haystack.
Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(6, input.end());
let input = Input::new("foobar").span(2..4);
assert_eq!(4, input.end());
pub fn get_span(&self) -> Span
Return the span for this search configuration.
If one was not explicitly set, then the span corresponds to the entire range of the haystack.
When Input::is_done is false, the span returned is guaranteed to
correspond to valid bounds for this input’s haystack.
Example
use regex_automata::{Input, Span};
let input = Input::new("foobar");
assert_eq!(Span { start: 0, end: 6 }, input.get_span());
pub fn get_range(&self) -> Range<usize>
Return the span as a range for this search configuration.
If one was not explicitly set, then the span corresponds to the entire range of the haystack.
When Input::is_done is false, the range returned is guaranteed to
correspond to valid bounds for this input’s haystack.
Example
use regex_automata::Input;
let input = Input::new("foobar");
assert_eq!(0..6, input.get_range());
pub fn get_anchored(&self) -> Anchored
Return the anchored mode for this search configuration.
If no anchored mode was set, then it defaults to Anchored::No.
Example
use regex_automata::{Anchored, Input, PatternID};
let mut input = Input::new("foobar");
assert_eq!(Anchored::No, input.get_anchored());
let pid = PatternID::must(5);
input.set_anchored(Anchored::Pattern(pid));
assert_eq!(Anchored::Pattern(pid), input.get_anchored());
pub fn get_earliest(&self) -> bool
Return whether this search should execute in “earliest” mode.
Example
use regex_automata::Input;
let input = Input::new("foobar");
assert!(!input.get_earliest());
pub fn is_done(&self) -> bool
Return true if and only if this search can never return any other matches.
This occurs when the start position of this search is greater than the end position of the search.
Example
use regex_automata::Input;
let mut input = Input::new("foobar");
assert!(!input.is_done());
input.set_start(6);
assert!(!input.is_done());
input.set_start(7);
assert!(input.is_done());
pub fn is_char_boundary(&self, offset: usize) -> bool
Returns true if and only if the given offset in this search’s haystack falls on a valid UTF-8 encoded codepoint boundary.
If the haystack is not valid UTF-8, then the behavior of this routine is unspecified.
Example
This shows where codepoint boundaries do and don’t exist in valid UTF-8.
use regex_automata::Input;
let input = Input::new("☃");
assert!(input.is_char_boundary(0));
assert!(!input.is_char_boundary(1));
assert!(!input.is_char_boundary(2));
assert!(input.is_char_boundary(3));
assert!(!input.is_char_boundary(4));
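For readers wondering how such a check can work on raw bytes, here is a std-only sketch. The free function is_char_boundary below is a hypothetical helper, not the crate's implementation, and it assumes the haystack is valid UTF-8. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 0b10xxxxxx.

```rust
/// Hypothetical byte-level boundary check for a haystack that is assumed
/// to be valid UTF-8.
fn is_char_boundary(haystack: &[u8], offset: usize) -> bool {
    match haystack.get(offset) {
        // The end of the haystack is a boundary; anything past it is not.
        None => offset == haystack.len(),
        // Any byte that is not a continuation byte starts a codepoint.
        Some(&b) => (b & 0b1100_0000) != 0b1000_0000,
    }
}

fn main() {
    // The snowman is encoded as three bytes: 0xE2 0x98 0x83.
    let haystack = "☃".as_bytes();
    assert!(is_char_boundary(haystack, 0));
    assert!(!is_char_boundary(haystack, 1));
    assert!(!is_char_boundary(haystack, 2));
    assert!(is_char_boundary(haystack, 3));
    assert!(!is_char_boundary(haystack, 4));

    // Cross-check against the standard library on a mixed string.
    let text = "a☃b";
    for offset in 0..=text.len() + 1 {
        assert_eq!(
            is_char_boundary(text.as_bytes(), offset),
            text.is_char_boundary(offset),
        );
    }
}
```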