Module regex::bytes

Search for regex matches in &[u8] haystacks.

This module provides, through its own Regex type, an API nearly identical to the one found at the top level of this crate. There are two important differences:

  1. Matching is done on &[u8] instead of &str. Additionally, Vec<u8> is used where String would have been used in the top-level API.
  2. Unicode support can be disabled even when disabling it would result in matching invalid UTF-8 bytes.

Example: match null-terminated strings

This shows how to find all null-terminated strings in a slice of bytes. This works even if a C string contains invalid UTF-8.

use regex::bytes::Regex;

let re = Regex::new(r"(?-u)(?<cstr>[^\x00]+)\x00").unwrap();
let hay = b"foo\x00qu\xFFux\x00baz\x00";

// Extract all of the strings without the NUL terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
let cstrs: Vec<&[u8]> =
    re.captures_iter(hay)
      .map(|c| c.name("cstr").unwrap().as_bytes())
      .collect();
assert_eq!(cstrs, vec![&b"foo"[..], &b"qu\xFFux"[..], &b"baz"[..]]);

Example: selectively enable Unicode support

This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded string (e.g., to extract a title from a Matroska file):

use regex::bytes::Regex;

let re = Regex::new(
    r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
).unwrap();
let hay = b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65";

// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes regardless of whether they were
// valid UTF-8.
let (_, [title]) = re.captures(hay).unwrap().extract();
assert_eq!(title, b"\xE2\x98\x83");
// We can UTF-8 decode the title now. And the unwrap here
// is correct because the existence of a match guarantees
// that `title` is valid UTF-8.
let title = std::str::from_utf8(title).unwrap();
assert_eq!(title, "☃");

In general, if the Unicode flag is enabled in a capture group and that capture is part of the overall match, then the capture is guaranteed to be valid UTF-8.

Syntax

The supported syntax is nearly the same as the syntax for Unicode regular expressions, with a few changes that make sense for matching arbitrary bytes:

  1. The u flag can be disabled even when disabling it might cause the regex to match invalid UTF-8. When the u flag is disabled, the regex is said to be in “ASCII compatible” mode.
  2. In ASCII compatible mode, Unicode character classes are not allowed. Literal Unicode scalar values outside of character classes are allowed.
  3. In ASCII compatible mode, Perl character classes (\w, \d and \s) revert to their typical ASCII definition. \w maps to [[:word:]], \d maps to [[:digit:]] and \s maps to [[:space:]].
  4. In ASCII compatible mode, word boundaries use the ASCII compatible \w to determine whether a byte is a word byte or not.
  5. Hexadecimal notation can be used to specify arbitrary bytes instead of Unicode codepoints. For example, in ASCII compatible mode, \xFF matches the literal byte \xFF, while in Unicode mode, \xFF is the Unicode codepoint U+00FF, which matches its UTF-8 encoding, \xC3\xBF. The same applies to octal notation when it is enabled.
  6. In ASCII compatible mode, . matches any byte except for \n. When the s flag is additionally enabled, . matches any byte.

Performance

In general, one should expect performance on &[u8] to be roughly similar to performance on &str.

Structs

  • CaptureLocations: A low-level representation of the byte offsets of each capture group.
  • CaptureMatches: An iterator over all non-overlapping capture matches in a haystack.
  • CaptureNames: An iterator over the names of all capture groups in a regex.
  • Captures: Represents the capture groups for a single match.
  • Match: Represents a single match of a regex in a haystack.
  • Matches: An iterator over all non-overlapping matches in a haystack.
  • NoExpand: A helper type for forcing literal string replacement.
  • Regex: A compiled regular expression for searching &[u8] haystacks.
  • RegexBuilder: A configurable builder for a Regex.
  • RegexSet: Match multiple, possibly overlapping, regexes in a single search.
  • RegexSetBuilder: A configurable builder for a RegexSet.
  • ReplacerRef: A by-reference adaptor for a Replacer.
  • SetMatches: A set of matches returned by a regex set.
  • SetMatchesIntoIter: An owned iterator over the set of matches from a regex set.
  • SetMatchesIter: A borrowed iterator over the set of matches from a regex set.
  • Split: An iterator over all substrings delimited by a regex match.
  • SplitN: An iterator over at most N substrings delimited by a regex match.
  • SubCaptureMatches: An iterator over all group matches in a Captures value.

Traits

  • Replacer: A trait for types that can be used to replace matches in a haystack.