edlang/regex/bytes/index.html
2024-05-05 09:43:20 +00:00

78 lines
14 KiB
HTML
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html><html lang="en"><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><meta name="generator" content="rustdoc"><meta name="description" content="Search for regex matches in `&amp;[u8]` haystacks."><title>regex::bytes - Rust</title><script> if (window.location.protocol !== "file:") document.write(`<link rel="preload" as="font" type="font/woff2" crossorigin href="../../static.files/SourceSerif4-Regular-46f98efaafac5295.ttf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../static.files/FiraSans-Regular-018c141bf0843ffd.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../static.files/FiraSans-Medium-8f9a781e4970d388.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../static.files/SourceCodePro-Regular-562dcc5011b6de7d.ttf.woff2"><link rel="preload" as="font" type="font/woff2" crossorigin href="../../static.files/SourceCodePro-Semibold-d899c5a5c4aeb14a.ttf.woff2">`)</script><link rel="stylesheet" href="../../static.files/normalize-76eba96aa4d2e634.css"><link rel="stylesheet" href="../../static.files/rustdoc-e935ef01ae1c1829.css"><meta name="rustdoc-vars" data-root-path="../../" data-static-root-path="../../static.files/" data-current-crate="regex" data-themes="" data-resource-suffix="" data-rustdoc-version="1.78.0 (9b00956e5 2024-04-29)" data-channel="1.78.0" data-search-js="search-42d8da7a6b9792c2.js" data-settings-js="settings-4313503d2e1961c2.js" ><script src="../../static.files/storage-4c98445ec4002617.js"></script><script defer src="../sidebar-items.js"></script><script defer src="../../static.files/main-12cf3b4f4f9dc36d.js"></script><noscript><link rel="stylesheet" href="../../static.files/noscript-04d5337699b92874.css"></noscript><link rel="alternate icon" type="image/png" href="../../static.files/favicon-16x16-8b506e7a72182f1c.png"><link rel="alternate icon" type="image/png" href="../../static.files/favicon-32x32-422f7d1d52889060.png"><link rel="icon" type="image/svg+xml" href="../../static.files/favicon-2c020d218678b618.svg"></head><body class="rustdoc mod"><!--[if lte IE 11]><div class="warning">This old browser is unsupported and will most likely display funky things.</div><![endif]--><nav class="mobile-topbar"><button class="sidebar-menu-toggle" title="show sidebar"></button></nav><nav class="sidebar"><div class="sidebar-crate"><h2><a href="../../regex/index.html">regex</a><span class="version">1.10.4</span></h2></div><h2 class="location"><a href="#">Module bytes</a></h2><div class="sidebar-elems"><section><ul class="block"><li><a href="#structs">Structs</a></li><li><a href="#traits">Traits</a></li></ul></section><h2><a href="../index.html">In crate regex</a></h2></div></nav><div class="sidebar-resizer"></div>
<main><div class="width-limiter"><nav class="sub"><form class="search-form"><span></span><div id="sidebar-button" tabindex="-1"><a href="../../regex/all.html" title="show sidebar"></a></div><input class="search-input" name="search" aria-label="Run search in the documentation" autocomplete="off" spellcheck="false" placeholder="Click or press S to search, ? for more options…" type="search"><div id="help-button" tabindex="-1"><a href="../../help.html" title="help">?</a></div><div id="settings-menu" tabindex="-1"><a href="../../settings.html" title="settings"><img width="22" height="22" alt="Change settings" src="../../static.files/wheel-7b819b6101059cd0.svg"></a></div></form></nav><section id="main-content" class="content"><div class="main-heading"><h1>Module <a href="../index.html">regex</a>::<wbr><a class="mod" href="#">bytes</a><button id="copy-path" title="Copy item path to clipboard"><img src="../../static.files/clipboard-7571035ce49a181d.svg" width="19" height="18" alt="Copy item path"></button></h1><span class="out-of-band"><a class="src" href="../../src/regex/bytes.rs.html#1-91">source</a> · <button id="toggle-all-docs" title="collapse all docs">[<span>&#x2212;</span>]</button></span></div><details class="toggle top-doc" open><summary class="hideme"><span>Expand description</span></summary><div class="docblock"><p>Search for regex matches in <code>&amp;[u8]</code> haystacks.</p>
<p>This module provides a nearly identical API via <a href="struct.Regex.html" title="struct regex::bytes::Regex"><code>Regex</code></a> to the one found in
the top-level of this crate. There are two important differences:</p>
<ol>
<li>Matching is done on <code>&amp;[u8]</code> instead of <code>&amp;str</code>. Additionally, <code>Vec&lt;u8&gt;</code>
is used where <code>String</code> would have been used in the top-level API.</li>
<li>Unicode support can be disabled even when disabling it would result in
matching invalid UTF-8 bytes.</li>
</ol>
<h2 id="example-match-null-terminated-string"><a class="doc-anchor" href="#example-match-null-terminated-string">§</a>Example: match null terminated string</h2>
<p>This shows how to find all null-terminated strings in a slice of bytes. This
works even if a C string contains invalid UTF-8.</p>
<div class="example-wrap"><pre class="rust rust-example-rendered"><code><span class="kw">use </span>regex::bytes::Regex;
<span class="kw">let </span>re = Regex::new(<span class="string">r"(?-u)(?&lt;cstr&gt;[^\x00]+)\x00"</span>).unwrap();
<span class="kw">let </span>hay = <span class="string">b"foo\x00qu\xFFux\x00baz\x00"</span>;
<span class="comment">// Extract all of the strings without the NUL terminator from each match.
// The unwrap is OK here since a match requires the `cstr` capture to match.
</span><span class="kw">let </span>cstrs: Vec&lt;<span class="kw-2">&amp;</span>[u8]&gt; =
re.captures_iter(hay)
.map(|c| c.name(<span class="string">"cstr"</span>).unwrap().as_bytes())
.collect();
<span class="macro">assert_eq!</span>(cstrs, <span class="macro">vec!</span>[<span class="kw-2">&amp;</span><span class="string">b"foo"</span>[..], <span class="kw-2">&amp;</span><span class="string">b"qu\xFFux"</span>[..], <span class="kw-2">&amp;</span><span class="string">b"baz"</span>[..]]);</code></pre></div>
<h2 id="example-selectively-enable-unicode-support"><a class="doc-anchor" href="#example-selectively-enable-unicode-support">§</a>Example: selectively enable Unicode support</h2>
<p>This shows how to match an arbitrary byte pattern followed by a UTF-8 encoded
string (e.g., to extract a title from a Matroska file):</p>
<div class="example-wrap"><pre class="rust rust-example-rendered"><code><span class="kw">use </span>regex::bytes::Regex;
<span class="kw">let </span>re = Regex::new(
<span class="string">r"(?-u)\x7b\xa9(?:[\x80-\xfe]|[\x40-\xff].)(?u:(.*))"
</span>).unwrap();
<span class="kw">let </span>hay = <span class="string">b"\x12\xd0\x3b\x5f\x7b\xa9\x85\xe2\x98\x83\x80\x98\x54\x76\x68\x65"</span>;
<span class="comment">// Notice that despite the `.*` at the end, it will only match valid UTF-8
// because Unicode mode was enabled with the `u` flag. Without the `u` flag,
// the `.*` would match the rest of the bytes regardless of whether they were
// valid UTF-8.
</span><span class="kw">let </span>(<span class="kw">_</span>, [title]) = re.captures(hay).unwrap().extract();
<span class="macro">assert_eq!</span>(title, <span class="string">b"\xE2\x98\x83"</span>);
<span class="comment">// We can UTF-8 decode the title now. And the unwrap here
// is correct because the existence of a match guarantees
// that `title` is valid UTF-8.
</span><span class="kw">let </span>title = std::str::from_utf8(title).unwrap();
<span class="macro">assert_eq!</span>(title, <span class="string">"☃"</span>);</code></pre></div>
<p>In general, if the Unicode flag is enabled in a capture group and that capture
is part of the overall match, then the capture is <em>guaranteed</em> to be valid
UTF-8.</p>
<h2 id="syntax"><a class="doc-anchor" href="#syntax">§</a>Syntax</h2>
<p>The supported syntax is pretty much the same as the syntax for Unicode
regular expressions with a few changes that make sense for matching arbitrary
bytes:</p>
<ol>
<li>The <code>u</code> flag can be disabled even when disabling it might cause the regex to
match invalid UTF-8. When the <code>u</code> flag is disabled, the regex is said to be in
“ASCII compatible” mode.</li>
<li>In ASCII compatible mode, Unicode character classes are not allowed. Literal
Unicode scalar values outside of character classes are allowed.</li>
<li>In ASCII compatible mode, Perl character classes (<code>\w</code>, <code>\d</code> and <code>\s</code>)
revert to their typical ASCII definition. <code>\w</code> maps to <code>[[:word:]]</code>, <code>\d</code> maps
to <code>[[:digit:]]</code> and <code>\s</code> maps to <code>[[:space:]]</code>.</li>
<li>In ASCII compatible mode, word boundaries use the ASCII compatible <code>\w</code> to
determine whether a byte is a word byte or not.</li>
<li>Hexadecimal notation can be used to specify arbitrary bytes instead of
Unicode codepoints. For example, in ASCII compatible mode, <code>\xFF</code> matches the
literal byte <code>\xFF</code>, while in Unicode mode, <code>\xFF</code> is the Unicode codepoint
<code>U+00FF</code> that matches its UTF-8 encoding of <code>\xC3\xBF</code>. Similarly for octal
notation when enabled.</li>
<li>In ASCII compatible mode, <code>.</code> matches any <em>byte</em> except for <code>\n</code>. When the
<code>s</code> flag is additionally enabled, <code>.</code> matches any byte.</li>
</ol>
<h2 id="performance"><a class="doc-anchor" href="#performance">§</a>Performance</h2>
<p>In general, one should expect performance on <code>&amp;[u8]</code> to be roughly similar to
performance on <code>&amp;str</code>.</p>
</div></details><h2 id="structs" class="section-header">Structs<a href="#structs" class="anchor">§</a></h2><ul class="item-table"><li><div class="item-name"><a class="struct" href="struct.CaptureLocations.html" title="struct regex::bytes::CaptureLocations">CaptureLocations</a></div><div class="desc docblock-short">A low level representation of the byte offsets of each capture group.</div></li><li><div class="item-name"><a class="struct" href="struct.CaptureMatches.html" title="struct regex::bytes::CaptureMatches">CaptureMatches</a></div><div class="desc docblock-short">An iterator over all non-overlapping capture matches in a haystack.</div></li><li><div class="item-name"><a class="struct" href="struct.CaptureNames.html" title="struct regex::bytes::CaptureNames">CaptureNames</a></div><div class="desc docblock-short">An iterator over the names of all capture groups in a regex.</div></li><li><div class="item-name"><a class="struct" href="struct.Captures.html" title="struct regex::bytes::Captures">Captures</a></div><div class="desc docblock-short">Represents the capture groups for a single match.</div></li><li><div class="item-name"><a class="struct" href="struct.Match.html" title="struct regex::bytes::Match">Match</a></div><div class="desc docblock-short">Represents a single match of a regex in a haystack.</div></li><li><div class="item-name"><a class="struct" href="struct.Matches.html" title="struct regex::bytes::Matches">Matches</a></div><div class="desc docblock-short">An iterator over all non-overlapping matches in a haystack.</div></li><li><div class="item-name"><a class="struct" href="struct.NoExpand.html" title="struct regex::bytes::NoExpand">NoExpand</a></div><div class="desc docblock-short">A helper type for forcing literal string replacement.</div></li><li><div class="item-name"><a class="struct" href="struct.Regex.html" title="struct regex::bytes::Regex">Regex</a></div><div class="desc docblock-short">A compiled regular expression for searching Unicode haystacks.</div></li><li><div class="item-name"><a class="struct" href="struct.RegexBuilder.html" title="struct regex::bytes::RegexBuilder">RegexBuilder</a></div><div class="desc docblock-short">A configurable builder for a <a href="struct.Regex.html" title="struct regex::bytes::Regex"><code>Regex</code></a>.</div></li><li><div class="item-name"><a class="struct" href="struct.RegexSet.html" title="struct regex::bytes::RegexSet">RegexSet</a></div><div class="desc docblock-short">Match multiple, possibly overlapping, regexes in a single search.</div></li><li><div class="item-name"><a class="struct" href="struct.RegexSetBuilder.html" title="struct regex::bytes::RegexSetBuilder">RegexSetBuilder</a></div><div class="desc docblock-short">A configurable builder for a <a href="struct.RegexSet.html" title="struct regex::bytes::RegexSet"><code>RegexSet</code></a>.</div></li><li><div class="item-name"><a class="struct" href="struct.ReplacerRef.html" title="struct regex::bytes::ReplacerRef">ReplacerRef</a></div><div class="desc docblock-short">A by-reference adaptor for a <a href="trait.Replacer.html" title="trait regex::bytes::Replacer"><code>Replacer</code></a>.</div></li><li><div class="item-name"><a class="struct" href="struct.SetMatches.html" title="struct regex::bytes::SetMatches">SetMatches</a></div><div class="desc docblock-short">A set of matches returned by a regex set.</div></li><li><div class="item-name"><a class="struct" href="struct.SetMatchesIntoIter.html" title="struct regex::bytes::SetMatchesIntoIter">SetMatchesIntoIter</a></div><div class="desc docblock-short">An owned iterator over the set of matches from a regex set.</div></li><li><div class="item-name"><a class="struct" href="struct.SetMatchesIter.html" title="struct regex::bytes::SetMatchesIter">SetMatchesIter</a></div><div class="desc docblock-short">A borrowed iterator over the set of matches from a regex set.</div></li><li><div class="item-name"><a class="struct" href="struct.Split.html" title="struct regex::bytes::Split">Split</a></div><div class="desc docblock-short">An iterator over all substrings delimited by a regex match.</div></li><li><div class="item-name"><a class="struct" href="struct.SplitN.html" title="struct regex::bytes::SplitN">SplitN</a></div><div class="desc docblock-short">An iterator over at most <code>N</code> substrings delimited by a regex match.</div></li><li><div class="item-name"><a class="struct" href="struct.SubCaptureMatches.html" title="struct regex::bytes::SubCaptureMatches">SubCaptureMatches</a></div><div class="desc docblock-short">An iterator over all group matches in a <a href="struct.Captures.html" title="struct regex::bytes::Captures"><code>Captures</code></a> value.</div></li></ul><h2 id="traits" class="section-header">Traits<a href="#traits" class="anchor">§</a></h2><ul class="item-table"><li><div class="item-name"><a class="trait" href="trait.Replacer.html" title="trait regex::bytes::Replacer">Replacer</a></div><div class="desc docblock-short">A trait for types that can be used to replace matches in a haystack.</div></li></ul></section></div></main></body></html>