Struct regex_syntax::hir::Hir
source · pub struct Hir { /* private fields */ }
Expand description
A high-level intermediate representation (HIR) for a regular expression.
An HIR value is a combination of a HirKind
and a set of Properties
.
An HirKind
indicates what kind of regular expression it is (a literal,
a repetition, a look-around assertion, etc.), where as a Properties
describes various facts about the regular expression. For example, whether
it matches UTF-8 or if it matches the empty string.
The HIR of a regular expression represents an intermediate step between
its abstract syntax (a structured description of the concrete syntax) and
an actual regex matcher. The purpose of HIR is to make regular expressions
easier to analyze. In particular, the AST is much more complex than the
HIR. For example, while an AST supports arbitrarily nested character
classes, the HIR will flatten all nested classes into a single set. The HIR
will also “compile away” every flag present in the concrete syntax. For
example, users of HIR expressions never need to worry about case folding;
it is handled automatically by the translator (e.g., by translating
(?i:A)
to [aA]
).
The specific type of an HIR expression can be accessed via its kind
or into_kind
methods. This extra level of indirection exists for two
reasons:
- Construction of an HIR expression must use the constructor methods on
this
Hir
type instead of building theHirKind
values directly. This permits construction to enforce invariants like “concatenations always consist of two or more sub-expressions.” - Every HIR expression contains attributes that are defined inductively, and can be computed cheaply during the construction process. For example, one such attribute is whether the expression must match at the beginning of the haystack.
In particular, if you have an HirKind
value, then there is intentionally
no way to build an Hir
value from it. You instead need to do case
analysis on the HirKind
value and build the Hir
value using its smart
constructors.
§UTF-8
If the HIR was produced by a translator with
TranslatorBuilder::utf8
enabled,
then the HIR is guaranteed to match UTF-8 exclusively for all non-empty
matches.
For empty matches, those can occur at any position. It is the responsibility of the regex engine to determine whether empty matches are permitted between the code units of a single codepoint.
§Stack space
This type defines its own destructor that uses constant stack space and heap space proportional to the size of the HIR.
Also, an Hir
’s fmt::Display
implementation prints an HIR as a regular
expression pattern string, and uses constant stack space and heap space
proportional to the size of the Hir
. The regex it prints is guaranteed to
be semantically equivalent to the original concrete syntax, but it may
look very different. (And potentially not practically readable by a human.)
An Hir
’s fmt::Debug
implementation currently does not use constant
stack space. The implementation will also suppress some details (such as
the Properties
inlined into every Hir
value to make it less noisy).
Implementations§
source§impl Hir
impl Hir
Methods for accessing the underlying HirKind
and Properties
.
sourcepub fn into_kind(self) -> HirKind
pub fn into_kind(self) -> HirKind
Consumes ownership of this HIR expression and returns its underlying
HirKind
.
sourcepub fn properties(&self) -> &Properties
pub fn properties(&self) -> &Properties
Returns the properties computed for this Hir
.
source§impl Hir
impl Hir
Smart constructors for HIR values.
These constructors are called “smart” because they do inductive work or
simplifications. For example, calling Hir::repetition
with a repetition
like a{0}
will actually return a Hir
with a HirKind::Empty
kind
since it is equivalent to an empty regex. Another example is calling
Hir::concat(vec![expr])
. Instead of getting a HirKind::Concat
, you’ll
just get back the original expr
since it’s precisely equivalent.
Smart constructors enable maintaining invariants about the HIR data type while also simulanteously keeping the representation as simple as possible.
sourcepub fn empty() -> Hir
pub fn empty() -> Hir
Returns an empty HIR expression.
An empty HIR expression always matches, including the empty string.
sourcepub fn fail() -> Hir
pub fn fail() -> Hir
Returns an HIR expression that can never match anything. That is,
the size of the set of strings in the language described by the HIR
returned is 0
.
This is distinct from Hir::empty
in that the empty string matches
the HIR returned by Hir::empty
. That is, the set of strings in the
language describe described by Hir::empty
is non-empty.
Note that currently, the HIR returned uses an empty character class to
indicate that nothing can match. An equivalent expression that cannot
match is an empty alternation, but all such “fail” expressions are
normalized (via smart constructors) to empty character classes. This is
because empty character classes can be spelled in the concrete syntax
of a regex (e.g., \P{any}
or (?-u:[^\x00-\xFF])
or [a&&b]
), but
empty alternations cannot.
sourcepub fn literal<B: Into<Box<[u8]>>>(lit: B) -> Hir
pub fn literal<B: Into<Box<[u8]>>>(lit: B) -> Hir
Creates a literal HIR expression.
This accepts anything that can be converted into a Box<[u8]>
.
Note that there is no mechanism for storing a char
or a Box<str>
in an HIR. Everything is “just bytes.” Whether a Literal
(or
any HIR node) matches valid UTF-8 exclusively can be queried via
Properties::is_utf8
.
§Example
This example shows that concatenations of Literal
HIR values will
automatically get flattened and combined together. So for example, even
if you concat multiple Literal
values that are themselves not valid
UTF-8, they might add up to valid UTF-8. This also demonstrates just
how “smart” Hir’s smart constructors are.
use regex_syntax::hir::{Hir, HirKind, Literal};
let literals = vec![
Hir::literal([0xE2]),
Hir::literal([0x98]),
Hir::literal([0x83]),
];
// Each literal, on its own, is invalid UTF-8.
assert!(literals.iter().all(|hir| !hir.properties().is_utf8()));
let concat = Hir::concat(literals);
// But the concatenation is valid UTF-8!
assert!(concat.properties().is_utf8());
// And also notice that the literals have been concatenated into a
// single `Literal`, to the point where there is no explicit `Concat`!
let expected = HirKind::Literal(Literal(Box::from("☃".as_bytes())));
assert_eq!(&expected, concat.kind());
§Example: building a literal from a char
This example shows how to build a single Hir
literal from a char
value. Since a Literal
is just bytes, we just need to UTF-8
encode a char
value:
use regex_syntax::hir::{Hir, HirKind, Literal};
let ch = '☃';
let got = Hir::literal(ch.encode_utf8(&mut [0; 4]).as_bytes());
let expected = HirKind::Literal(Literal(Box::from("☃".as_bytes())));
assert_eq!(&expected, got.kind());
sourcepub fn class(class: Class) -> Hir
pub fn class(class: Class) -> Hir
Creates a class HIR expression. The class may either be defined over ranges of Unicode codepoints or ranges of raw byte values.
Note that an empty class is permitted. An empty class is equivalent to
Hir::fail()
.
sourcepub fn repetition(rep: Repetition) -> Hir
pub fn repetition(rep: Repetition) -> Hir
Creates a repetition HIR expression.
sourcepub fn capture(capture: Capture) -> Hir
pub fn capture(capture: Capture) -> Hir
Creates a capture HIR expression.
Note that there is no explicit HIR value for a non-capturing group. Since a non-capturing group only exists to override precedence in the concrete syntax and since an HIR already does its own grouping based on what is parsed, there is no need to explicitly represent non-capturing groups in the HIR.
sourcepub fn concat(subs: Vec<Hir>) -> Hir
pub fn concat(subs: Vec<Hir>) -> Hir
Returns the concatenation of the given expressions.
This attempts to flatten and simplify the concatenation as appropriate.
§Example
This shows a simple example of basic flattening of both concatenations and literals.
use regex_syntax::hir::Hir;
let hir = Hir::concat(vec![
Hir::concat(vec![
Hir::literal([b'a']),
Hir::literal([b'b']),
Hir::literal([b'c']),
]),
Hir::concat(vec![
Hir::literal([b'x']),
Hir::literal([b'y']),
Hir::literal([b'z']),
]),
]);
let expected = Hir::literal("abcxyz".as_bytes());
assert_eq!(expected, hir);
sourcepub fn alternation(subs: Vec<Hir>) -> Hir
pub fn alternation(subs: Vec<Hir>) -> Hir
Returns the alternation of the given expressions.
This flattens and simplifies the alternation as appropriate. This may include factoring out common prefixes or even rewriting the alternation as a character class.
Note that an empty alternation is equivalent to Hir::fail()
. (It
is not possible for one to write an empty alternation, or even an
alternation with a single sub-expression, in the concrete syntax of a
regex.)
§Example
This is a simple example showing how an alternation might get simplified.
use regex_syntax::hir::{Hir, Class, ClassUnicode, ClassUnicodeRange};
let hir = Hir::alternation(vec![
Hir::literal([b'a']),
Hir::literal([b'b']),
Hir::literal([b'c']),
Hir::literal([b'd']),
Hir::literal([b'e']),
Hir::literal([b'f']),
]);
let expected = Hir::class(Class::Unicode(ClassUnicode::new([
ClassUnicodeRange::new('a', 'f'),
])));
assert_eq!(expected, hir);
And another example showing how common prefixes might get factored out.
use regex_syntax::hir::{Hir, Class, ClassUnicode, ClassUnicodeRange};
let hir = Hir::alternation(vec![
Hir::concat(vec![
Hir::literal("abc".as_bytes()),
Hir::class(Class::Unicode(ClassUnicode::new([
ClassUnicodeRange::new('A', 'Z'),
]))),
]),
Hir::concat(vec![
Hir::literal("abc".as_bytes()),
Hir::class(Class::Unicode(ClassUnicode::new([
ClassUnicodeRange::new('a', 'z'),
]))),
]),
]);
let expected = Hir::concat(vec![
Hir::literal("abc".as_bytes()),
Hir::alternation(vec![
Hir::class(Class::Unicode(ClassUnicode::new([
ClassUnicodeRange::new('A', 'Z'),
]))),
Hir::class(Class::Unicode(ClassUnicode::new([
ClassUnicodeRange::new('a', 'z'),
]))),
]),
]);
assert_eq!(expected, hir);
Note that these sorts of simplifications are not guaranteed.
sourcepub fn dot(dot: Dot) -> Hir
pub fn dot(dot: Dot) -> Hir
Returns an HIR expression for .
.
Dot::AnyChar
maps to(?su-R:.)
.Dot::AnyByte
maps to(?s-Ru:.)
.Dot::AnyCharExceptLF
maps to(?u-Rs:.)
.Dot::AnyCharExceptCRLF
maps to(?Ru-s:.)
.Dot::AnyByteExceptLF
maps to(?-Rsu:.)
.Dot::AnyByteExceptCRLF
maps to(?R-su:.)
.
§Example
Note that this is a convenience routine for constructing the correct
character class based on the value of Dot
. There is no explicit “dot”
HIR value. It is just an abbreviation for a common character class.
use regex_syntax::hir::{Hir, Dot, Class, ClassBytes, ClassBytesRange};
let hir = Hir::dot(Dot::AnyByte);
let expected = Hir::class(Class::Bytes(ClassBytes::new([
ClassBytesRange::new(0x00, 0xFF),
])));
assert_eq!(expected, hir);
Trait Implementations§
source§impl Display for Hir
impl Display for Hir
Print a display representation of this Hir.
The result of this is a valid regular expression pattern string.
This implementation uses constant stack space and heap space proportional
to the size of the Hir
.
source§impl Drop for Hir
impl Drop for Hir
A custom Drop
impl is used for HirKind
such that it uses constant stack
space but heap space proportional to the depth of the total Hir
.