Struct regex_syntax::hir::Hir

source ·
pub struct Hir { /* private fields */ }
Expand description

A high-level intermediate representation (HIR) for a regular expression.

An HIR value is a combination of a HirKind and a set of Properties. An HirKind indicates what kind of regular expression it is (a literal, a repetition, a look-around assertion, etc.), where as a Properties describes various facts about the regular expression. For example, whether it matches UTF-8 or if it matches the empty string.

The HIR of a regular expression represents an intermediate step between its abstract syntax (a structured description of the concrete syntax) and an actual regex matcher. The purpose of HIR is to make regular expressions easier to analyze. In particular, the AST is much more complex than the HIR. For example, while an AST supports arbitrarily nested character classes, the HIR will flatten all nested classes into a single set. The HIR will also “compile away” every flag present in the concrete syntax. For example, users of HIR expressions never need to worry about case folding; it is handled automatically by the translator (e.g., by translating (?i:A) to [aA]).

The specific type of an HIR expression can be accessed via its kind or into_kind methods. This extra level of indirection exists for two reasons:

  1. Construction of an HIR expression must use the constructor methods on this Hir type instead of building the HirKind values directly. This permits construction to enforce invariants like “concatenations always consist of two or more sub-expressions.”
  2. Every HIR expression contains attributes that are defined inductively, and can be computed cheaply during the construction process. For example, one such attribute is whether the expression must match at the beginning of the haystack.

In particular, if you have an HirKind value, then there is intentionally no way to build an Hir value from it. You instead need to do case analysis on the HirKind value and build the Hir value using its smart constructors.

§UTF-8

If the HIR was produced by a translator with TranslatorBuilder::utf8 enabled, then the HIR is guaranteed to match UTF-8 exclusively for all non-empty matches.

For empty matches, those can occur at any position. It is the responsibility of the regex engine to determine whether empty matches are permitted between the code units of a single codepoint.

§Stack space

This type defines its own destructor that uses constant stack space and heap space proportional to the size of the HIR.

Also, an Hir’s fmt::Display implementation prints an HIR as a regular expression pattern string, and uses constant stack space and heap space proportional to the size of the Hir. The regex it prints is guaranteed to be semantically equivalent to the original concrete syntax, but it may look very different. (And potentially not practically readable by a human.)

An Hir’s fmt::Debug implementation currently does not use constant stack space. The implementation will also suppress some details (such as the Properties inlined into every Hir value to make it less noisy).

Implementations§

source§

impl Hir

Methods for accessing the underlying HirKind and Properties.

source

pub fn kind(&self) -> &HirKind

Returns a reference to the underlying HIR kind.

source

pub fn into_kind(self) -> HirKind

Consumes ownership of this HIR expression and returns its underlying HirKind.

source

pub fn properties(&self) -> &Properties

Returns the properties computed for this Hir.

source§

impl Hir

Smart constructors for HIR values.

These constructors are called “smart” because they do inductive work or simplifications. For example, calling Hir::repetition with a repetition like a{0} will actually return a Hir with a HirKind::Empty kind since it is equivalent to an empty regex. Another example is calling Hir::concat(vec![expr]). Instead of getting a HirKind::Concat, you’ll just get back the original expr since it’s precisely equivalent.

Smart constructors enable maintaining invariants about the HIR data type while also simulanteously keeping the representation as simple as possible.

source

pub fn empty() -> Hir

Returns an empty HIR expression.

An empty HIR expression always matches, including the empty string.

source

pub fn fail() -> Hir

Returns an HIR expression that can never match anything. That is, the size of the set of strings in the language described by the HIR returned is 0.

This is distinct from Hir::empty in that the empty string matches the HIR returned by Hir::empty. That is, the set of strings in the language describe described by Hir::empty is non-empty.

Note that currently, the HIR returned uses an empty character class to indicate that nothing can match. An equivalent expression that cannot match is an empty alternation, but all such “fail” expressions are normalized (via smart constructors) to empty character classes. This is because empty character classes can be spelled in the concrete syntax of a regex (e.g., \P{any} or (?-u:[^\x00-\xFF]) or [a&&b]), but empty alternations cannot.

source

pub fn literal<B: Into<Box<[u8]>>>(lit: B) -> Hir

Creates a literal HIR expression.

This accepts anything that can be converted into a Box<[u8]>.

Note that there is no mechanism for storing a char or a Box<str> in an HIR. Everything is “just bytes.” Whether a Literal (or any HIR node) matches valid UTF-8 exclusively can be queried via Properties::is_utf8.

§Example

This example shows that concatenations of Literal HIR values will automatically get flattened and combined together. So for example, even if you concat multiple Literal values that are themselves not valid UTF-8, they might add up to valid UTF-8. This also demonstrates just how “smart” Hir’s smart constructors are.

use regex_syntax::hir::{Hir, HirKind, Literal};

let literals = vec![
    Hir::literal([0xE2]),
    Hir::literal([0x98]),
    Hir::literal([0x83]),
];
// Each literal, on its own, is invalid UTF-8.
assert!(literals.iter().all(|hir| !hir.properties().is_utf8()));

let concat = Hir::concat(literals);
// But the concatenation is valid UTF-8!
assert!(concat.properties().is_utf8());

// And also notice that the literals have been concatenated into a
// single `Literal`, to the point where there is no explicit `Concat`!
let expected = HirKind::Literal(Literal(Box::from("☃".as_bytes())));
assert_eq!(&expected, concat.kind());
§Example: building a literal from a char

This example shows how to build a single Hir literal from a char value. Since a Literal is just bytes, we just need to UTF-8 encode a char value:

use regex_syntax::hir::{Hir, HirKind, Literal};

let ch = '☃';
let got = Hir::literal(ch.encode_utf8(&mut [0; 4]).as_bytes());

let expected = HirKind::Literal(Literal(Box::from("☃".as_bytes())));
assert_eq!(&expected, got.kind());
source

pub fn class(class: Class) -> Hir

Creates a class HIR expression. The class may either be defined over ranges of Unicode codepoints or ranges of raw byte values.

Note that an empty class is permitted. An empty class is equivalent to Hir::fail().

source

pub fn look(look: Look) -> Hir

Creates a look-around assertion HIR expression.

source

pub fn repetition(rep: Repetition) -> Hir

Creates a repetition HIR expression.

source

pub fn capture(capture: Capture) -> Hir

Creates a capture HIR expression.

Note that there is no explicit HIR value for a non-capturing group. Since a non-capturing group only exists to override precedence in the concrete syntax and since an HIR already does its own grouping based on what is parsed, there is no need to explicitly represent non-capturing groups in the HIR.

source

pub fn concat(subs: Vec<Hir>) -> Hir

Returns the concatenation of the given expressions.

This attempts to flatten and simplify the concatenation as appropriate.

§Example

This shows a simple example of basic flattening of both concatenations and literals.

use regex_syntax::hir::Hir;

let hir = Hir::concat(vec![
    Hir::concat(vec![
        Hir::literal([b'a']),
        Hir::literal([b'b']),
        Hir::literal([b'c']),
    ]),
    Hir::concat(vec![
        Hir::literal([b'x']),
        Hir::literal([b'y']),
        Hir::literal([b'z']),
    ]),
]);
let expected = Hir::literal("abcxyz".as_bytes());
assert_eq!(expected, hir);
source

pub fn alternation(subs: Vec<Hir>) -> Hir

Returns the alternation of the given expressions.

This flattens and simplifies the alternation as appropriate. This may include factoring out common prefixes or even rewriting the alternation as a character class.

Note that an empty alternation is equivalent to Hir::fail(). (It is not possible for one to write an empty alternation, or even an alternation with a single sub-expression, in the concrete syntax of a regex.)

§Example

This is a simple example showing how an alternation might get simplified.

use regex_syntax::hir::{Hir, Class, ClassUnicode, ClassUnicodeRange};

let hir = Hir::alternation(vec![
    Hir::literal([b'a']),
    Hir::literal([b'b']),
    Hir::literal([b'c']),
    Hir::literal([b'd']),
    Hir::literal([b'e']),
    Hir::literal([b'f']),
]);
let expected = Hir::class(Class::Unicode(ClassUnicode::new([
    ClassUnicodeRange::new('a', 'f'),
])));
assert_eq!(expected, hir);

And another example showing how common prefixes might get factored out.

use regex_syntax::hir::{Hir, Class, ClassUnicode, ClassUnicodeRange};

let hir = Hir::alternation(vec![
    Hir::concat(vec![
        Hir::literal("abc".as_bytes()),
        Hir::class(Class::Unicode(ClassUnicode::new([
            ClassUnicodeRange::new('A', 'Z'),
        ]))),
    ]),
    Hir::concat(vec![
        Hir::literal("abc".as_bytes()),
        Hir::class(Class::Unicode(ClassUnicode::new([
            ClassUnicodeRange::new('a', 'z'),
        ]))),
    ]),
]);
let expected = Hir::concat(vec![
    Hir::literal("abc".as_bytes()),
    Hir::alternation(vec![
        Hir::class(Class::Unicode(ClassUnicode::new([
            ClassUnicodeRange::new('A', 'Z'),
        ]))),
        Hir::class(Class::Unicode(ClassUnicode::new([
            ClassUnicodeRange::new('a', 'z'),
        ]))),
    ]),
]);
assert_eq!(expected, hir);

Note that these sorts of simplifications are not guaranteed.

source

pub fn dot(dot: Dot) -> Hir

Returns an HIR expression for ..

§Example

Note that this is a convenience routine for constructing the correct character class based on the value of Dot. There is no explicit “dot” HIR value. It is just an abbreviation for a common character class.

use regex_syntax::hir::{Hir, Dot, Class, ClassBytes, ClassBytesRange};

let hir = Hir::dot(Dot::AnyByte);
let expected = Hir::class(Class::Bytes(ClassBytes::new([
    ClassBytesRange::new(0x00, 0xFF),
])));
assert_eq!(expected, hir);

Trait Implementations§

source§

impl Clone for Hir

source§

fn clone(&self) -> Hir

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl Debug for Hir

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Display for Hir

Print a display representation of this Hir.

The result of this is a valid regular expression pattern string.

This implementation uses constant stack space and heap space proportional to the size of the Hir.

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
source§

impl Drop for Hir

A custom Drop impl is used for HirKind such that it uses constant stack space but heap space proportional to the depth of the total Hir.

source§

fn drop(&mut self)

Executes the destructor for this type. Read more
source§

impl PartialEq for Hir

source§

fn eq(&self, other: &Hir) -> bool

This method tests for self and other values to be equal, and is used by ==.
1.0.0 · source§

fn ne(&self, other: &Rhs) -> bool

This method tests for !=. The default implementation is almost always sufficient, and should not be overridden without very good reason.
source§

impl Eq for Hir

source§

impl StructuralPartialEq for Hir

Auto Trait Implementations§

§

impl RefUnwindSafe for Hir

§

impl Send for Hir

§

impl Sync for Hir

§

impl Unpin for Hir

§

impl UnwindSafe for Hir

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T> ToString for T
where T: Display + ?Sized,

source§

default fn to_string(&self) -> String

Converts the given value to a String. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.