Enum regex_automata::nfa::thompson::State
source · pub enum State {
ByteRange {
trans: Transition,
},
Sparse(SparseTransitions),
Dense(DenseTransitions),
Look {
look: Look,
next: StateID,
},
Union {
alternates: Box<[StateID]>,
},
BinaryUnion {
alt1: StateID,
alt2: StateID,
},
Capture {
next: StateID,
pattern_id: PatternID,
group_index: SmallIndex,
slot: SmallIndex,
},
Fail,
Match {
pattern_id: PatternID,
},
}
Expand description
A state in an NFA.
In theory, it can help to conceptualize an NFA
as a graph consisting of
State
s. Each State
contains its complete set of outgoing transitions.
In practice, it can help to conceptualize an NFA
as a sequence of
instructions for a virtual machine. Each State
says what to do and where
to go next.
Strictly speaking, the practical interpretation is the most correct one,
because of the Capture
state. Namely, a Capture
state always forwards execution to another state unconditionally. Its only
purpose is to cause a side effect: the recording of the current input
position at a particular location in memory. In this sense, an NFA
has more power than a theoretical non-deterministic finite automaton.
For most uses of this crate, it is likely that one may never even need to
be aware of this type at all. The main use cases for looking at State
s
directly are if you need to write your own search implementation or if you
need to do some kind of analysis on the NFA.
Variants§
ByteRange
A state with a single transition that can only be taken if the current input symbol is in a particular range of bytes.
Fields
trans: Transition
The transition from this state to the next.
Sparse(SparseTransitions)
A state with possibly many transitions represented in a sparse fashion. Transitions are non-overlapping and ordered lexicographically by input range.
In practice, this is used for encoding UTF-8 automata. Its presence is
primarily an optimization that avoids many additional unconditional
epsilon transitions (via Union
states), and thus
decreases the overhead of traversing the NFA. This can improve both
matching time and DFA construction time.
Dense(DenseTransitions)
A dense representation of a state with multiple transitions.
Look
A conditional epsilon transition satisfied via some sort of look-around. Look-around is limited to anchor and word boundary assertions.
Look-around states are meant to be evaluated while performing epsilon closure (computing the set of states reachable from a particular state via only epsilon transitions). If the current position in the haystack satisfies the look-around assertion, then you’re permitted to follow that epsilon transition.
Fields
Union
An alternation such that there exists an epsilon transition to all
states in alternates
, where matches found via earlier transitions
are preferred over later transitions.
Fields
BinaryUnion
An alternation such that there exists precisely two unconditional
epsilon transitions, where matches found via alt1
are preferred over
matches found via alt2
.
This state exists as a common special case of Union where there are only two alternates. In this case, we don’t need any allocations to represent the state. This saves a bit of memory and also saves an additional memory access when traversing the NFA.
Fields
Capture
An empty state that records a capture location.
From the perspective of finite automata, this is precisely equivalent to an unconditional epsilon transition, but serves the purpose of instructing NFA simulations to record additional state when the finite state machine passes through this epsilon transition.
slot
in this context refers to the specific capture group slot
offset that is being recorded. Each capturing group has two slots
corresponding to the start and end of the matching portion of that
group.
The pattern ID and capture group index are also included in this state
in case they are useful. But mostly, all you’ll need is next
and
slot
.
Fields
group_index: SmallIndex
The capture group index that this capture belongs to. Capture group
indices are local to each pattern. For example, when capturing
groups are enabled, every pattern has a capture group at index
0
.
slot: SmallIndex
The slot index for this capture. Every capturing group has two
slots: one for the start haystack offset and one for the end
haystack offset. Unlike capture group indices, slot indices are
global across all patterns in this NFA. That is, each slot belongs
to a single pattern, but there is only one slot at index i
.
Fail
A state that cannot be transitioned out of. This is useful for cases
where you want to prevent matching from occurring. For example, if your
regex parser permits empty character classes, then one could choose
a Fail
state to represent them. (An empty character class can be
thought of as an empty set. Since nothing is in an empty set, they can
never match anything.)
Match
A match state. There is at least one such occurrence of this state for each regex that can match that is in this NFA.
Implementations§
source§impl State
impl State
sourcepub fn is_epsilon(&self) -> bool
pub fn is_epsilon(&self) -> bool
Returns true if and only if this state contains one or more epsilon transitions.
In practice, a state has no outgoing transitions (like Match
), has
only non-epsilon transitions (like ByteRange
) or has only epsilon
transitions (like Union
).
§Example
use regex_automata::{
nfa::thompson::{State, Transition},
util::primitives::{PatternID, StateID, SmallIndex},
};
// Capture states are epsilon transitions.
let state = State::Capture {
next: StateID::ZERO,
pattern_id: PatternID::ZERO,
group_index: SmallIndex::ZERO,
slot: SmallIndex::ZERO,
};
assert!(state.is_epsilon());
// ByteRange states are not.
let state = State::ByteRange {
trans: Transition { start: b'a', end: b'z', next: StateID::ZERO },
};
assert!(!state.is_epsilon());