/*!
Provides types for dealing with capturing groups.
Capturing groups refer to sub-patterns of regexes that some regex engines can
report matching offsets for. For example, matching `[a-z]([0-9]+)` against
`a789` would give `a789` as the overall match (for the implicit capturing group
at index `0`) and `789` as the match for the capturing group `([0-9]+)` (an
explicit capturing group at index `1`).
Not all regex engines can report match offsets for capturing groups. Indeed,
to a first approximation, regex engines that can report capturing group offsets
tend to be quite a bit slower than regex engines that can't. This is because
tracking capturing groups at search time usually requires more "power," which
in turn adds overhead.
Other regex implementations might call capturing groups "submatches."
# Overview
The main types in this module are:
* [`Captures`] records the capturing group offsets found during a search. It
provides convenience routines for looking up capturing group offsets by either
index or name.
* [`GroupInfo`] records the mapping between capturing groups and "slots,"
where the latter are how capturing groups are recorded during a regex search.
This also keeps a mapping from capturing group name to index, and capture
group index to name. A `GroupInfo` is used by `Captures` internally to
provide a convenient API. It is unlikely that you'll use a `GroupInfo`
directly, but for example, if you've compiled a Thompson NFA, then you can use
[`thompson::NFA::group_info`](crate::nfa::thompson::NFA::group_info) to get its
underlying `GroupInfo`.
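
To make the notion of "slots" concrete, here is a minimal sketch of the
single-pattern case, in which group `i` occupies slots `2 * i` and `2 * i + 1`.
The `group_span` helper is hypothetical and is not part of this crate. It only
mirrors the arithmetic that a lookup by group index performs, and it uses plain
`usize` offsets rather than the crate's internal slot type.

```rust
// Hypothetical helper (an assumption for illustration, not a crate API):
// look up the span of group `index` in a flat slot sequence that belongs
// to a single pattern.
fn group_span(slots: &[Option<usize>], index: usize) -> Option<(usize, usize)> {
    // Group `index` uses slot `2 * index` for its start offset and slot
    // `2 * index + 1` for its end offset.
    let slot_start = index.checked_mul(2)?;
    let slot_end = slot_start.checked_add(1)?;
    // An out-of-range slot or an unset slot means there is no span.
    let start = slots.get(slot_start).copied()??;
    let end = slots.get(slot_end).copied()??;
    Some((start, end))
}

fn main() {
    // Slots as they would look after matching `[a-z]([0-9]+)` against `a789`:
    // group 0 spans 0..4 and group 1 spans 1..4.
    let slots: [Option<usize>; 4] = [Some(0), Some(4), Some(1), Some(4)];
    assert_eq!(group_span(&slots, 0), Some((0, 4)));
    assert_eq!(group_span(&slots, 1), Some((1, 4)));
    // There is no group 2, so there are no slots for it.
    assert_eq!(group_span(&slots, 2), None);
}
```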
*/
use alloc::{string::String, sync::Arc, vec, vec::Vec};
use crate::util::{
interpolate,
primitives::{
NonMaxUsize, PatternID, PatternIDError, PatternIDIter, SmallIndex,
},
search::{Match, Span},
};
/// The span offsets of capturing groups after a match has been found.
///
/// This type represents the output of regex engines that can report the
/// offsets at which capturing group matches, or "submatches," occur. One such
/// engine is the [`PikeVM`](crate::nfa::thompson::pikevm::PikeVM). When a match
/// occurs, it will at minimum contain the [`PatternID`] of the pattern that
/// matched. Depending upon how it was constructed, it may also contain the
/// start/end offsets of the entire match of the pattern and the start/end
/// offsets of each capturing group that participated in the match.
///
/// Values of this type are always created for a specific [`GroupInfo`]. It is
/// unspecified behavior to use a `Captures` value in a search with any regex
/// engine that has a different `GroupInfo` than the one the `Captures` were
/// created with.
///
/// # Constructors
///
/// There are three constructors for this type that control what kind of
/// information is available upon a match:
///
/// * [`Captures::all`]: Will store overall pattern match offsets in addition
/// to the offsets of capturing groups that participated in the match.
/// * [`Captures::matches`]: Will store only the overall pattern
/// match offsets. The offsets of capturing groups (even ones that participated
/// in the match) are not available.
/// * [`Captures::empty`]: Will only store the pattern ID that matched. No
/// match offsets are available at all.
///
/// If you aren't sure which to choose, then pick the first one. The first one
/// is what convenience routines like
/// [`PikeVM::create_captures`](crate::nfa::thompson::pikevm::PikeVM::create_captures),
/// will use automatically.
///
/// The main difference between these choices is performance. Namely, if you
/// ask for _less_ information, then the execution of regex search may be able
/// to run more quickly.
///
/// # Notes
///
/// It is worth pointing out that this type is not coupled to any one specific
/// regex engine. Instead, its coupling is with [`GroupInfo`], which is the
/// thing that is responsible for mapping capturing groups to "slot" offsets.
/// Slot offsets are indices into a single sequence of memory at which matching
/// haystack offsets for the corresponding group are written by regex engines.
///
/// # Example
///
/// This example shows how to parse a simple date and extract the components of
/// the date via capturing groups:
///
/// ```
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};
///
/// let re = PikeVM::new(r"^([0-9]{4})-([0-9]{2})-([0-9]{2})$")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "2010-03-14", &mut caps);
/// assert!(caps.is_match());
/// assert_eq!(Some(Span::from(0..4)), caps.get_group(1));
/// assert_eq!(Some(Span::from(5..7)), caps.get_group(2));
/// assert_eq!(Some(Span::from(8..10)), caps.get_group(3));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// # Example: named capturing groups
///
/// This example is like the one above, but leverages the ability to name
/// capturing groups in order to make the code a bit clearer:
///
/// ```
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};
///
/// let re = PikeVM::new(r"^(?P<y>[0-9]{4})-(?P<m>[0-9]{2})-(?P<d>[0-9]{2})$")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "2010-03-14", &mut caps);
/// assert!(caps.is_match());
/// assert_eq!(Some(Span::from(0..4)), caps.get_group_by_name("y"));
/// assert_eq!(Some(Span::from(5..7)), caps.get_group_by_name("m"));
/// assert_eq!(Some(Span::from(8..10)), caps.get_group_by_name("d"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[derive(Clone)]
pub struct Captures {
/// The group info that these capture groups are coupled to. This is what
/// gives the "convenience" of the `Captures` API. Namely, it provides the
/// slot mapping and the name|-->index mapping for capture lookups by name.
group_info: GroupInfo,
/// The ID of the pattern that matched. Regex engines must set this to
/// None when no match occurs.
pid: Option<PatternID>,
/// The slot values, i.e., submatch offsets.
///
/// In theory, the smallest sequence of slots would be something like
/// `max(groups(pattern) for pattern in regex) * 2`, but instead, we use
/// `sum(groups(pattern) for pattern in regex) * 2`. Why?
///
/// Well, the former could be used in theory, because we don't generally
/// have any overlapping APIs that involve capturing groups. Therefore,
/// there's technically never any need to have slots set for multiple
/// patterns. However, this might change some day, in which case, we would
/// need to have slots available.
///
/// The other reason is that during the execution of some regex engines,
/// there exists a point in time where multiple slots for different
/// patterns may be written to before knowing which pattern has matched.
/// Therefore, the regex engines themselves, in order to support multiple
/// patterns correctly, must have all slots available. If `Captures`
/// doesn't have all slots available, then regex engines can't write
/// directly into the caller provided `Captures` and must instead write
/// into some other storage and then copy the slots involved in the match
/// at the end of the search.
///
/// So overall, at least as of the time of writing, it seems like the path
/// of least resistance is to just require allocating all possible slots
/// instead of the conceptual minimum. Another way to justify this is that
/// the most common case is a single pattern, in which case, there is no
/// inefficiency here since the 'max' and 'sum' calculations above are
/// equivalent in that case.
///
/// N.B. The mapping from group index to slot is maintained by `GroupInfo`
/// and is considered an API guarantee. See `GroupInfo` for more details on
/// that mapping.
///
/// N.B. `Option<NonMaxUsize>` has the same size as a `usize`.
slots: Vec<Option<NonMaxUsize>>,
}
impl Captures {
/// Create new storage for the offsets of all matching capturing groups.
///
/// This routine provides the most information for matches---namely, the
/// spans of matching capturing groups---but also requires the regex search
/// routines to do the most work.
///
/// It is unspecified behavior to use the returned `Captures` value in a
/// search with a `GroupInfo` other than the one that is provided to this
/// constructor.
///
/// # Example
///
/// This example shows that all capturing groups---but only ones that
/// participated in a match---are available to query after a match has
/// been found:
///
/// ```
/// use regex_automata::{
/// nfa::thompson::pikevm::PikeVM,
/// util::captures::Captures,
/// Span, Match,
/// };
///
/// let re = PikeVM::new(
/// r"^(?:(?P<lower>[a-z]+)|(?P<upper>[A-Z]+))(?P<digits>[0-9]+)$",
/// )?;
/// let mut cache = re.create_cache();
/// let mut caps = Captures::all(re.get_nfa().group_info().clone());
///
/// re.captures(&mut cache, "ABC123", &mut caps);
/// assert!(caps.is_match());
/// assert_eq!(Some(Match::must(0, 0..6)), caps.get_match());
/// // The 'lower' group didn't match, so it won't have any offsets.
/// assert_eq!(None, caps.get_group_by_name("lower"));
/// assert_eq!(Some(Span::from(0..3)), caps.get_group_by_name("upper"));
/// assert_eq!(Some(Span::from(3..6)), caps.get_group_by_name("digits"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn all(group_info: GroupInfo) -> Captures {
let slots = group_info.slot_len();
Captures { group_info, pid: None, slots: vec![None; slots] }
}
/// Create new storage for only the full match spans of a pattern. This
/// does not include any capturing group offsets.
///
/// It is unspecified behavior to use the returned `Captures` value in a
/// search with a `GroupInfo` other than the one that is provided to this
/// constructor.
///
/// # Example
///
/// This example shows that only overall match offsets are reported when
/// this constructor is used. Accessing any capturing groups other than
/// the 0th will always return `None`.
///
/// ```
/// use regex_automata::{
/// nfa::thompson::pikevm::PikeVM,
/// util::captures::Captures,
/// Match,
/// };
///
/// let re = PikeVM::new(
/// r"^(?:(?P<lower>[a-z]+)|(?P<upper>[A-Z]+))(?P<digits>[0-9]+)$",
/// )?;
/// let mut cache = re.create_cache();
/// let mut caps = Captures::matches(re.get_nfa().group_info().clone());
///
/// re.captures(&mut cache, "ABC123", &mut caps);
/// assert!(caps.is_match());
/// assert_eq!(Some(Match::must(0, 0..6)), caps.get_match());
/// // We didn't ask for capturing group offsets, so they aren't available.
/// assert_eq!(None, caps.get_group_by_name("lower"));
/// assert_eq!(None, caps.get_group_by_name("upper"));
/// assert_eq!(None, caps.get_group_by_name("digits"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn matches(group_info: GroupInfo) -> Captures {
// This is OK because we know there are at least this many slots,
// and GroupInfo construction guarantees that the number of slots fits
// into a usize.
let slots = group_info.pattern_len().checked_mul(2).unwrap();
Captures { group_info, pid: None, slots: vec![None; slots] }
}
/// Create new storage for only tracking which pattern matched. No offsets
/// are stored at all.
///
/// It is unspecified behavior to use the returned `Captures` value in a
/// search with a `GroupInfo` other than the one that is provided to this
/// constructor.
///
/// # Example
///
/// This example shows that only the pattern that matched can be accessed
/// from a `Captures` value created via this constructor.
///
/// ```
/// use regex_automata::{
/// nfa::thompson::pikevm::PikeVM,
/// util::captures::Captures,
/// PatternID,
/// };
///
/// let re = PikeVM::new_many(&[r"[a-z]+", r"[A-Z]+"])?;
/// let mut cache = re.create_cache();
/// let mut caps = Captures::empty(re.get_nfa().group_info().clone());
///
/// re.captures(&mut cache, "aABCz", &mut caps);
/// assert!(caps.is_match());
/// assert_eq!(Some(PatternID::must(0)), caps.pattern());
/// // We didn't ask for any offsets, so they aren't available.
/// assert_eq!(None, caps.get_match());
///
/// re.captures(&mut cache, &"aABCz"[1..], &mut caps);
/// assert!(caps.is_match());
/// assert_eq!(Some(PatternID::must(1)), caps.pattern());
/// // We didn't ask for any offsets, so they aren't available.
/// assert_eq!(None, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn empty(group_info: GroupInfo) -> Captures {
Captures { group_info, pid: None, slots: vec![] }
}
/// Returns true if and only if this `Captures` value represents a match.
///
/// This is a convenience routine for `caps.pattern().is_some()`.
///
/// # Example
///
/// When using the PikeVM (for example), the lightest weight way of
/// detecting whether a match exists is to create a `Captures` value that
/// only tracks the ID of the pattern that matched (if any):
///
/// ```
/// use regex_automata::{
/// nfa::thompson::pikevm::PikeVM,
/// util::captures::Captures,
/// };
///
/// let re = PikeVM::new(r"[a-z]+")?;
/// let mut cache = re.create_cache();
/// let mut caps = Captures::empty(re.get_nfa().group_info().clone());
///
/// re.captures(&mut cache, "aABCz", &mut caps);
/// assert!(caps.is_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn is_match(&self) -> bool {
self.pid.is_some()
}
/// Returns the identifier of the pattern that matched when this
/// `Captures` value represents a match. If no match was found, then this
/// always returns `None`.
///
/// This returns a pattern ID in precisely the cases in which `is_match`
/// returns `true`. Similarly, the pattern ID returned is always the
/// same pattern ID found in the `Match` returned by `get_match`.
///
/// # Example
///
/// When using the PikeVM (for example), the lightest weight way of
/// detecting which pattern matched is to create a `Captures` value that only
/// tracks the ID of the pattern that matched (if any):
///
/// ```
/// use regex_automata::{
/// nfa::thompson::pikevm::PikeVM,
/// util::captures::Captures,
/// PatternID,
/// };
///
/// let re = PikeVM::new_many(&[r"[a-z]+", r"[A-Z]+"])?;
/// let mut cache = re.create_cache();
/// let mut caps = Captures::empty(re.get_nfa().group_info().clone());
///
/// re.captures(&mut cache, "ABC", &mut caps);
/// assert_eq!(Some(PatternID::must(1)), caps.pattern());
/// // Recall that offsets are only available when using a non-empty
/// // Captures value. So even though a match occurred, this returns None!
/// assert_eq!(None, caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn pattern(&self) -> Option<PatternID> {
self.pid
}
/// Returns the pattern ID and the span of the match, if one occurred.
///
/// This always returns `None` when `Captures` was created with
/// [`Captures::empty`], even if a match was found.
///
/// If this routine returns a non-`None` value, then `is_match` is
/// guaranteed to return `true` and `pattern` is also guaranteed to return
/// a non-`None` value.
///
/// # Example
///
/// This example shows how to get the full match from a search:
///
/// ```
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Match};
///
/// let re = PikeVM::new_many(&[r"[a-z]+", r"[A-Z]+"])?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "ABC", &mut caps);
/// assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn get_match(&self) -> Option<Match> {
Some(Match::new(self.pattern()?, self.get_group(0)?))
}
/// Returns the span of a capturing group match corresponding to the group
/// index given, only if both the overall pattern matched and the capturing
/// group participated in that match.
///
/// This returns `None` if `index` is invalid. `index` is valid if and only
/// if it's less than [`Captures::group_len`] for the matching pattern.
///
/// This always returns `None` when `Captures` was created with
/// [`Captures::empty`], even if a match was found. This also always
/// returns `None` for any `index > 0` when `Captures` was created with
/// [`Captures::matches`].
///
/// If this routine returns a non-`None` value, then `is_match` is
/// guaranteed to return `true`, `pattern` is guaranteed to return a
/// non-`None` value and `get_match` is guaranteed to return a non-`None`
/// value.
///
/// By convention, the 0th capture group will always return the same
/// span as the span returned by `get_match`. This is because the 0th
/// capture group always corresponds to the entirety of the pattern's
/// match. (It is similarly always unnamed because it is implicit.) This
/// isn't necessarily true of all regex engines. For example, one can
/// hand-compile a [`thompson::NFA`](crate::nfa::thompson::NFA) via a
/// [`thompson::Builder`](crate::nfa::thompson::Builder), which isn't
/// technically forced to make the 0th capturing group always correspond to
/// the entire match.
///
/// # Example
///
/// This example shows how to get the capturing groups, by index, from a
/// match:
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Span, Match};
///
/// let re = PikeVM::new(r"^(?P<first>\pL+)\s+(?P<last>\pL+)$")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "Bruce Springsteen", &mut caps);
/// assert_eq!(Some(Match::must(0, 0..17)), caps.get_match());
/// assert_eq!(Some(Span::from(0..5)), caps.get_group(1));
/// assert_eq!(Some(Span::from(6..17)), caps.get_group(2));
/// // Looking for a non-existent capturing group will return None:
/// assert_eq!(None, caps.get_group(3));
/// # // literals are too big for 32-bit usize: #1039
/// # #[cfg(target_pointer_width = "64")]
/// assert_eq!(None, caps.get_group(9944060567225171988));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn get_group(&self, index: usize) -> Option<Span> {
let pid = self.pattern()?;
// There's a little bit of work needed to map captures to slots in the
// fully general case. But in the overwhelmingly common case of a single
// pattern, we can just do some simple arithmetic.
let (slot_start, slot_end) = if self.group_info().pattern_len() == 1 {
(index.checked_mul(2)?, index.checked_mul(2)?.checked_add(1)?)
} else {
self.group_info().slots(pid, index)?
};
let start = self.slots.get(slot_start).copied()??;
let end = self.slots.get(slot_end).copied()??;
Some(Span { start: start.get(), end: end.get() })
}
/// Returns the span of a capturing group match corresponding to the group
/// name given, only if both the overall pattern matched and the capturing
/// group participated in that match.
///
/// This returns `None` if `name` does not correspond to a valid capturing
/// group for the pattern that matched.
///
/// This always returns `None` when `Captures` was created with
/// [`Captures::empty`], even if a match was found. This also always
/// returns `None` when `Captures` was created with [`Captures::matches`],
/// since the 0th group is always unnamed and every named group therefore
/// has an index greater than `0`.
///
/// If this routine returns a non-`None` value, then `is_match` is
/// guaranteed to return `true`, `pattern` is guaranteed to return a
/// non-`None` value and `get_match` is guaranteed to return a non-`None`
/// value.
///
/// # Example
///
/// This example shows how to get the capturing groups, by name, from a
/// match:
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Span, Match};
///
/// let re = PikeVM::new(r"^(?P<first>\pL+)\s+(?P<last>\pL+)$")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "Bruce Springsteen", &mut caps);
/// assert_eq!(Some(Match::must(0, 0..17)), caps.get_match());
/// assert_eq!(Some(Span::from(0..5)), caps.get_group_by_name("first"));
/// assert_eq!(Some(Span::from(6..17)), caps.get_group_by_name("last"));
/// // Looking for a non-existent capturing group will return None:
/// assert_eq!(None, caps.get_group_by_name("middle"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn get_group_by_name(&self, name: &str) -> Option<Span> {
let index = self.group_info().to_index(self.pattern()?, name)?;
self.get_group(index)
}
/// Returns an iterator of possible spans for every capturing group in the
/// matching pattern.
///
/// If this `Captures` value does not correspond to a match, then the
/// iterator returned yields no elements.
///
/// Note that the iterator returned yields elements of type `Option<Span>`.
/// A span is present if and only if it corresponds to a capturing group
/// that participated in a match.
///
/// # Example
///
/// This example shows how to collect all capturing groups:
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};
///
/// let re = PikeVM::new(
/// // Matches first/last names, with an optional middle name.
/// r"^(?P<first>\pL+)\s+(?:(?P<middle>\pL+)\s+)?(?P<last>\pL+)$",
/// )?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "Harry James Potter", &mut caps);
/// assert!(caps.is_match());
/// let groups: Vec<Option<Span>> = caps.iter().collect();
/// assert_eq!(groups, vec![
/// Some(Span::from(0..18)),
/// Some(Span::from(0..5)),
/// Some(Span::from(6..11)),
/// Some(Span::from(12..18)),
/// ]);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// This example uses the same regex as the previous example, but with a
/// haystack that omits the middle name. This results in a capturing group
/// that is present in the elements yielded by the iterator but without a
/// match:
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, Span};
///
/// let re = PikeVM::new(
/// // Matches first/last names, with an optional middle name.
/// r"^(?P<first>\pL+)\s+(?:(?P<middle>\pL+)\s+)?(?P<last>\pL+)$",
/// )?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "Harry Potter", &mut caps);
/// assert!(caps.is_match());
/// let groups: Vec<Option<Span>> = caps.iter().collect();
/// assert_eq!(groups, vec![
/// Some(Span::from(0..12)),
/// Some(Span::from(0..5)),
/// None,
/// Some(Span::from(6..12)),
/// ]);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn iter(&self) -> CapturesPatternIter<'_> {
let names = self
.pattern()
.map_or(GroupInfoPatternNames::empty().enumerate(), |pid| {
self.group_info().pattern_names(pid).enumerate()
});
CapturesPatternIter { caps: self, names }
}
/// Return the total number of capturing groups for the matching pattern.
///
/// If this `Captures` value does not correspond to a match, then this
/// always returns `0`.
///
/// This is always equal to the number of elements yielded by
/// [`Captures::iter`]. That is, the count includes capturing groups even
/// if they don't participate in the match.
///
/// # Example
///
/// This example shows how to count the total number of capturing groups
/// associated with a pattern. Notice that it includes groups that did not
/// participate in a match (just like `Captures::iter` does).
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new(
/// // Matches first/last names, with an optional middle name.
/// r"^(?P<first>\pL+)\s+(?:(?P<middle>\pL+)\s+)?(?P<last>\pL+)$",
/// )?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "Harry Potter", &mut caps);
/// assert_eq!(4, caps.group_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn group_len(&self) -> usize {
let pid = match self.pattern() {
None => return 0,
Some(pid) => pid,
};
self.group_info().group_len(pid)
}
/// Returns a reference to the underlying group info on which these
/// captures are based.
///
/// The difference between `GroupInfo` and `Captures` is that the former
/// defines the structure of capturing groups whereas the latter is what
/// stores the actual match information. So whereas `Captures` only gives
/// you access to the current match, `GroupInfo` lets you query any
/// information about all capturing groups, even ones for patterns that
/// weren't involved in a match.
///
/// Note that a `GroupInfo` uses reference counting internally, so it may
/// be cloned cheaply.
///
/// # Example
///
/// This example shows how to get all capturing group names from the
/// underlying `GroupInfo`. Notice that we don't even need to run a
/// search.
///
/// ```
/// use regex_automata::{nfa::thompson::pikevm::PikeVM, PatternID};
///
/// let re = PikeVM::new_many(&[
/// r"(?P<foo>a)",
/// r"(a)(b)",
/// r"ab",
/// r"(?P<bar>a)(?P<quux>a)",
/// r"(?P<foo>z)",
/// ])?;
/// let caps = re.create_captures();
///
/// let expected = vec![
/// (PatternID::must(0), 0, None),
/// (PatternID::must(0), 1, Some("foo")),
/// (PatternID::must(1), 0, None),
/// (PatternID::must(1), 1, None),
/// (PatternID::must(1), 2, None),
/// (PatternID::must(2), 0, None),
/// (PatternID::must(3), 0, None),
/// (PatternID::must(3), 1, Some("bar")),
/// (PatternID::must(3), 2, Some("quux")),
/// (PatternID::must(4), 0, None),
/// (PatternID::must(4), 1, Some("foo")),
/// ];
/// // We could also just use 're.get_nfa().group_info()'.
/// let got: Vec<(PatternID, usize, Option<&str>)> =
/// caps.group_info().all_names().collect();
/// assert_eq!(expected, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn group_info(&self) -> &GroupInfo {
&self.group_info
}
/// Interpolates the capture references in `replacement` with the
/// corresponding substrings in `haystack` matched by each reference. The
/// interpolated string is returned.
///
/// See the [`interpolate` module](interpolate) for documentation on the
/// format of the replacement string.
///
/// # Example
///
/// This example shows how to use interpolation, and also shows how it
/// can work with multi-pattern regexes.
///
/// ```
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new_many(&[
/// r"(?<day>[0-9]{2})-(?<month>[0-9]{2})-(?<year>[0-9]{4})",
/// r"(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})",
/// ])?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// let replacement = "year=$year, month=$month, day=$day";
///
/// // This matches the first pattern.
/// let hay = "On 14-03-2010, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// let result = caps.interpolate_string(hay, replacement);
/// assert_eq!("year=2010, month=03, day=14", result);
///
/// // And this matches the second pattern.
/// let hay = "On 2010-03-14, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// let result = caps.interpolate_string(hay, replacement);
/// assert_eq!("year=2010, month=03, day=14", result);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn interpolate_string(
&self,
haystack: &str,
replacement: &str,
) -> String {
let mut dst = String::new();
self.interpolate_string_into(haystack, replacement, &mut dst);
dst
}
/// Interpolates the capture references in `replacement` with the
/// corresponding substrings in `haystack` matched by each reference. The
/// interpolated string is written to `dst`.
///
/// See the [`interpolate` module](interpolate) for documentation on the
/// format of the replacement string.
///
/// # Example
///
/// This example shows how to use interpolation, and also shows how it
/// can work with multi-pattern regexes.
///
/// ```
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new_many(&[
/// r"(?<day>[0-9]{2})-(?<month>[0-9]{2})-(?<year>[0-9]{4})",
/// r"(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})",
/// ])?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// let replacement = "year=$year, month=$month, day=$day";
///
/// // This matches the first pattern.
/// let hay = "On 14-03-2010, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// let mut dst = String::new();
/// caps.interpolate_string_into(hay, replacement, &mut dst);
/// assert_eq!("year=2010, month=03, day=14", dst);
///
/// // And this matches the second pattern.
/// let hay = "On 2010-03-14, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// let mut dst = String::new();
/// caps.interpolate_string_into(hay, replacement, &mut dst);
/// assert_eq!("year=2010, month=03, day=14", dst);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn interpolate_string_into(
&self,
haystack: &str,
replacement: &str,
dst: &mut String,
) {
interpolate::string(
replacement,
|index, dst| {
let span = match self.get_group(index) {
None => return,
Some(span) => span,
};
dst.push_str(&haystack[span]);
},
|name| self.group_info().to_index(self.pattern()?, name),
dst,
);
}
/// Interpolates the capture references in `replacement` with the
/// corresponding substrings in `haystack` matched by each reference. The
/// interpolated byte string is returned.
///
/// See the [`interpolate` module](interpolate) for documentation on the
/// format of the replacement string.
///
/// # Example
///
/// This example shows how to use interpolation, and also shows how it
/// can work with multi-pattern regexes.
///
/// ```
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new_many(&[
/// r"(?<day>[0-9]{2})-(?<month>[0-9]{2})-(?<year>[0-9]{4})",
/// r"(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})",
/// ])?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// let replacement = b"year=$year, month=$month, day=$day";
///
/// // This matches the first pattern.
/// let hay = b"On 14-03-2010, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// let result = caps.interpolate_bytes(hay, replacement);
/// assert_eq!(&b"year=2010, month=03, day=14"[..], result);
///
/// // And this matches the second pattern.
/// let hay = b"On 2010-03-14, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// let result = caps.interpolate_bytes(hay, replacement);
/// assert_eq!(&b"year=2010, month=03, day=14"[..], result);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn interpolate_bytes(
&self,
haystack: &[u8],
replacement: &[u8],
) -> Vec<u8> {
let mut dst = vec![];
self.interpolate_bytes_into(haystack, replacement, &mut dst);
dst
}
/// Interpolates the capture references in `replacement` with the
/// corresponding substrings in `haystack` matched by each reference. The
/// interpolated byte string is written to `dst`.
///
/// See the [`interpolate` module](interpolate) for documentation on the
/// format of the replacement string.
///
/// # Example
///
/// This example shows how to use interpolation, and also shows how it
/// can work with multi-pattern regexes.
///
/// ```
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new_many(&[
/// r"(?<day>[0-9]{2})-(?<month>[0-9]{2})-(?<year>[0-9]{4})",
/// r"(?<year>[0-9]{4})-(?<month>[0-9]{2})-(?<day>[0-9]{2})",
/// ])?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// let replacement = b"year=$year, month=$month, day=$day";
///
/// // This matches the first pattern.
/// let hay = b"On 14-03-2010, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// let mut dst = vec![];
/// caps.interpolate_bytes_into(hay, replacement, &mut dst);
/// assert_eq!(&b"year=2010, month=03, day=14"[..], dst);
///
/// // And this matches the second pattern.
/// let hay = b"On 2010-03-14, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// let mut dst = vec![];
/// caps.interpolate_bytes_into(hay, replacement, &mut dst);
/// assert_eq!(&b"year=2010, month=03, day=14"[..], dst);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn interpolate_bytes_into(
&self,
haystack: &[u8],
replacement: &[u8],
dst: &mut Vec<u8>,
) {
interpolate::bytes(
replacement,
|index, dst| {
let span = match self.get_group(index) {
None => return,
Some(span) => span,
};
dst.extend_from_slice(&haystack[span]);
},
|name| self.group_info().to_index(self.pattern()?, name),
dst,
);
}
/// This is a convenience routine for extracting the substrings
/// corresponding to matching capture groups in the given `haystack`. The
/// `haystack` should be the same substring used to find the match spans in
/// this `Captures` value.
///
/// This is identical to [`Captures::extract_bytes`], except it works with
/// `&str` instead of `&[u8]`.
///
/// # Panics
///
/// This panics if the number of explicit matching groups in this
/// `Captures` value is less than `N`. This also panics if this `Captures`
/// value does not correspond to a match.
///
/// Note that this does *not* panic if the number of explicit matching
/// groups is bigger than `N`. In that case, only the first `N` matching
/// groups are extracted.
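///
/// The word "matching" is significant: groups that did not participate in
/// the match are skipped, so `N` only counts groups that have a span. The
/// following standalone sketch models that selection with plain ranges. The
/// name `extract_model` is hypothetical and is not part of this crate:
///
/// ```rust
/// use std::ops::Range;
///
/// // `groups[0]` stands for the overall match and the remaining entries
/// // stand for the explicit groups, in order.
/// fn extract_model<'h, const N: usize>(
///     haystack: &'h str,
///     groups: &[Option<Range<usize>>],
/// ) -> (&'h str, [&'h str; N]) {
///     // Skip every group that has no span.
///     let mut matched = groups.iter().flatten().cloned();
///     let whole = &haystack[matched.next().expect("a match")];
///     let rest = [(); N].map(|_| {
///         &haystack[matched.next().expect("too few matching groups")]
///     });
///     (whole, rest)
/// }
/// ```
///
/// With the groups `[Some(0..12), Some(0..5), None, Some(6..12)]` and
/// `N = 2`, the sketch returns the substrings for the second and fourth
/// entries, since the third entry has no span.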
///
/// # Example
///
/// ```
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// let hay = "On 2010-03-14, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// assert!(caps.is_match());
/// let (full, [year, month, day]) = caps.extract(hay);
/// assert_eq!("2010-03-14", full);
/// assert_eq!("2010", year);
/// assert_eq!("03", month);
/// assert_eq!("14", day);
///
/// // We can also ask for fewer than all capture groups.
/// let (full, [year]) = caps.extract(hay);
/// assert_eq!("2010-03-14", full);
/// assert_eq!("2010", year);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn extract<'h, const N: usize>(
&self,
haystack: &'h str,
) -> (&'h str, [&'h str; N]) {
let mut matched = self.iter().flatten();
let whole_match = &haystack[matched.next().expect("a match")];
let group_matches = [0; N].map(|_| {
let sp = matched.next().expect("too few matching groups");
&haystack[sp]
});
(whole_match, group_matches)
}
/// This is a convenience routine for extracting the substrings
/// corresponding to matching capture groups in the given `haystack`. The
/// `haystack` should be the same substring used to find the match spans in
/// this `Captures` value.
///
/// This is identical to [`Captures::extract`], except it works with
/// `&[u8]` instead of `&str`.
///
/// # Panics
///
/// This panics if the number of explicit matching groups in this
/// `Captures` value is less than `N`. This also panics if this `Captures`
/// value does not correspond to a match.
///
/// Note that this does *not* panic if the number of explicit matching
/// groups is bigger than `N`. In that case, only the first `N` matching
/// groups are extracted.
///
/// # Example
///
/// ```
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")?;
/// let mut cache = re.create_cache();
/// let mut caps = re.create_captures();
///
/// let hay = b"On 2010-03-14, I became a Tennessee lamb.";
/// re.captures(&mut cache, hay, &mut caps);
/// assert!(caps.is_match());
/// let (full, [year, month, day]) = caps.extract_bytes(hay);
/// assert_eq!(b"2010-03-14", full);
/// assert_eq!(b"2010", year);
/// assert_eq!(b"03", month);
/// assert_eq!(b"14", day);
///
/// // We can also ask for fewer than all capture groups.
/// let (full, [year]) = caps.extract_bytes(hay);
/// assert_eq!(b"2010-03-14", full);
/// assert_eq!(b"2010", year);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn extract_bytes<'h, const N: usize>(
&self,
haystack: &'h [u8],
) -> (&'h [u8], [&'h [u8]; N]) {
let mut matched = self.iter().flatten();
let whole_match = &haystack[matched.next().expect("a match")];
let group_matches = [0; N].map(|_| {
let sp = matched.next().expect("too few matching groups");
&haystack[sp]
});
(whole_match, group_matches)
}
}
/// Lower level "slot" oriented APIs. One does not typically need to use these
/// when executing a search. They are instead mostly intended for folks that
/// are writing their own regex engine while reusing this `Captures` type.
impl Captures {
/// Clear this `Captures` value.
///
/// After clearing, all slots inside this `Captures` value will be set to
/// `None`. Similarly, any pattern ID that it was previously associated
/// with (for a match) is erased.
///
/// It is not usually necessary to call this routine. A `Captures` value
/// only provides high level access to the capturing groups of the pattern
/// that matched, and only low level access to individual slots. Thus, even
/// if slots corresponding to groups that aren't associated with the
/// matching pattern are set, the higher level APIs are unaffected. In
/// particular, higher level APIs like [`Captures::get_group`] will return
/// `None` if no pattern ID is present, even if there are spans set in the
/// underlying slots.
///
/// Thus, to "clear" a `Captures` value of a match, it is usually only
/// necessary to call [`Captures::set_pattern`] with `None`.
///
/// # Example
///
/// This example shows what happens when a `Captures` value is cleared.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new(r"^(?P<first>\pL+)\s+(?P<last>\pL+)$")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "Bruce Springsteen", &mut caps);
/// assert!(caps.is_match());
/// let slots: Vec<Option<usize>> =
/// caps.slots().iter().map(|s| s.map(|x| x.get())).collect();
/// // Note that the following ordering is considered an API guarantee.
/// assert_eq!(slots, vec![
/// Some(0),
/// Some(17),
/// Some(0),
/// Some(5),
/// Some(6),
/// Some(17),
/// ]);
///
/// // Now clear the slots. Everything is gone and it is no longer a match.
/// caps.clear();
/// assert!(!caps.is_match());
/// let slots: Vec<Option<usize>> =
/// caps.slots().iter().map(|s| s.map(|x| x.get())).collect();
/// assert_eq!(slots, vec![
/// None,
/// None,
/// None,
/// None,
/// None,
/// None,
/// ]);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn clear(&mut self) {
self.pid = None;
for slot in self.slots.iter_mut() {
*slot = None;
}
}
/// Set the pattern on this `Captures` value.
///
/// When the pattern ID is `None`, then this `Captures` value does not
/// correspond to a match (`is_match` will return `false`). Otherwise, it
/// corresponds to a match.
///
/// This is useful in search implementations where you might want to
/// initially call `set_pattern(None)` in order to avoid the cost of
/// calling `clear()` if it turns out to not be necessary.
///
/// # Example
///
/// This example shows that `set_pattern` merely overwrites the pattern ID.
/// It does not actually change the underlying slot values.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::nfa::thompson::pikevm::PikeVM;
///
/// let re = PikeVM::new(r"^(?P<first>\pL+)\s+(?P<last>\pL+)$")?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "Bruce Springsteen", &mut caps);
/// assert!(caps.is_match());
/// assert!(caps.pattern().is_some());
/// let slots: Vec<Option<usize>> =
/// caps.slots().iter().map(|s| s.map(|x| x.get())).collect();
/// // Note that the following ordering is considered an API guarantee.
/// assert_eq!(slots, vec![
/// Some(0),
/// Some(17),
/// Some(0),
/// Some(5),
/// Some(6),
/// Some(17),
/// ]);
///
/// // Now set the pattern to None. Note that the slot values remain.
/// caps.set_pattern(None);
/// assert!(!caps.is_match());
/// assert!(caps.pattern().is_none());
/// let slots: Vec<Option<usize>> =
/// caps.slots().iter().map(|s| s.map(|x| x.get())).collect();
/// // Note that the following ordering is considered an API guarantee.
/// assert_eq!(slots, vec![
/// Some(0),
/// Some(17),
/// Some(0),
/// Some(5),
/// Some(6),
/// Some(17),
/// ]);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn set_pattern(&mut self, pid: Option<PatternID>) {
self.pid = pid;
}
/// Returns the underlying slots, where each slot stores a single offset.
///
/// Every matching capturing group generally corresponds to two slots: one
/// slot for the starting position and another for the ending position.
/// Typically, either both are present or neither is. (The weasel word
/// "typically" is used here because it really depends on the regex engine
/// implementation. Every sensible regex engine likely adheres to this
/// invariant, and every regex engine in this crate is sensible.)
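///
/// As a rough illustration of this pairing for a single-pattern regex, here
/// is a standalone sketch that uses plain `usize` offsets in place of
/// `NonMaxUsize`. The name `pair_slots` is hypothetical and is not part of
/// this crate, and treating a half-set pair as absent is an assumption of
/// the sketch:
///
/// ```rust
/// // Groups consecutive slots into `(start, end)` pairs. In this sketch, a
/// // group only has a span when both of its slots are set.
/// fn pair_slots(slots: &[Option<usize>]) -> Vec<Option<(usize, usize)>> {
///     slots
///         .chunks_exact(2)
///         .map(|pair| match (pair[0], pair[1]) {
///             (Some(start), Some(end)) => Some((start, end)),
///             _ => None,
///         })
///         .collect()
/// }
/// ```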
///
/// Generally speaking, callers should prefer to use higher level routines
/// like [`Captures::get_match`] or [`Captures::get_group`].
///
/// An important note here is that a regex engine may not reset all of the
/// slots to `None` values when no match occurs, or even when a match of
/// a different pattern occurs. But this depends on how the regex engine
/// implementation deals with slots.
///
/// # Example
///
/// This example shows how to get the underlying slots from a regex match.
///
/// ```
/// use regex_automata::{
/// nfa::thompson::pikevm::PikeVM,
/// util::primitives::{PatternID, NonMaxUsize},
/// };
///
/// let re = PikeVM::new_many(&[
/// r"[a-z]+",
/// r"[0-9]+",
/// ])?;
/// let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
///
/// re.captures(&mut cache, "123", &mut caps);
/// assert_eq!(Some(PatternID::must(1)), caps.pattern());
/// // Note that the only guarantee we have here is that slots 2 and 3
/// // are set to correct values. The contents of the first two slots are
/// // unspecified since the 0th pattern did not match.
/// let expected = &[
/// None,
/// None,
/// NonMaxUsize::new(0),
/// NonMaxUsize::new(3),
/// ];
/// assert_eq!(expected, caps.slots());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn slots(&self) -> &[Option<NonMaxUsize>] {
&self.slots
}
/// Returns the underlying slots as a mutable slice, where each slot stores
/// a single offset.
///
/// This is mostly useful to regex engine implementations, which can use it
/// to write the offsets of matching capturing groups into slots.
///
/// See [`Captures::slots`] for more information about slots.
#[inline]
pub fn slots_mut(&mut self) -> &mut [Option<NonMaxUsize>] {
&mut self.slots
}
}
impl core::fmt::Debug for Captures {
fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
let mut dstruct = f.debug_struct("Captures");
dstruct.field("pid", &self.pid);
if let Some(pid) = self.pid {
dstruct.field("spans", &CapturesDebugMap { pid, caps: self });
}
dstruct.finish()
}
}
/// A little helper type to provide a nice map-like debug representation for
/// our capturing group spans.
struct CapturesDebugMap<'a> {
pid: PatternID,
caps: &'a Captures,
}
impl<'a> core::fmt::Debug for CapturesDebugMap<'a> {
fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
struct Key<'a>(usize, Option<&'a str>);
impl<'a> core::fmt::Debug for Key<'a> {
fn fmt(&self, f: &mut core::fmt::Formatter) -> core::fmt::Result {
write!(f, "{}", self.0)?;
if let Some(name) = self.1 {
write!(f, "/{:?}", name)?;
}
Ok(())
}
}
let mut map = f.debug_map();
let names = self.caps.group_info().pattern_names(self.pid);
for (group_index, maybe_name) in names.enumerate() {
let key = Key(group_index, maybe_name);
match self.caps.get_group(group_index) {
None => map.entry(&key, &None::<()>),
Some(span) => map.entry(&key, &span),
};
}
map.finish()
}
}
/// An iterator over all capturing groups in a `Captures` value.
///
/// This iterator includes capturing groups that did not participate in a
/// match. See the [`Captures::iter`] method documentation for more details
/// and examples.
///
/// The lifetime parameter `'a` refers to the lifetime of the underlying
/// `Captures` value.
#[derive(Clone, Debug)]
pub struct CapturesPatternIter<'a> {
caps: &'a Captures,
names: core::iter::Enumerate<GroupInfoPatternNames<'a>>,
}
impl<'a> Iterator for CapturesPatternIter<'a> {
type Item = Option<Span>;
fn next(&mut self) -> Option<Option<Span>> {
let (group_index, _) = self.names.next()?;
Some(self.caps.get_group(group_index))
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.names.size_hint()
}
fn count(self) -> usize {
self.names.count()
}
}
impl<'a> ExactSizeIterator for CapturesPatternIter<'a> {}
impl<'a> core::iter::FusedIterator for CapturesPatternIter<'a> {}
/// Represents information about capturing groups in a compiled regex.
///
/// The information encapsulated by this type consists of the following. For
/// each pattern:
///
/// * A map from every capture group name to its corresponding capture group
/// index.
/// * A map from every capture group index to its corresponding capture group
/// name.
/// * A map from capture group index to its corresponding slot index. A slot
/// refers to one half of a capturing group. That is, a capture slot is either
/// the start or end of a capturing group. A slot is usually the mechanism
/// by which a regex engine records offsets for each capturing group during a
/// search.
///
/// A `GroupInfo` uses reference counting internally and is thus cheap to
/// clone.
///
/// # Mapping from capture groups to slots
///
/// One of the main responsibilities of a `GroupInfo` is to build a mapping
/// from `(PatternID, u32)` (where the `u32` is a capture index) to something
/// called a "slot." As mentioned above, a slot refers to one half of a
/// capturing group. Both combined provide the start and end offsets of
/// a capturing group that participated in a match.
///
/// **The mapping between group indices and slots is an API guarantee.** That
/// is, the mapping won't change within a semver compatible release.
///
/// Slots exist primarily because this is a convenient mechanism by which
/// regex engines report group offsets at search time. For example, the
/// [`nfa::thompson::State::Capture`](crate::nfa::thompson::State::Capture)
/// NFA state includes the slot index. When a regex engine transitions through
/// this state, it will likely use the slot index to write the current haystack
/// offset to some region of memory. When a match is found, those slots are
/// then reported to the caller, typically via a convenient abstraction like a
/// [`Captures`] value.
///
/// Because this crate provides first class support for multi-pattern regexes,
/// and because of some performance related reasons, the mapping between
/// capturing groups and slots is a little complex. However, in the case of a
/// single pattern, the mapping can be described very simply: for all capture
/// group indices `i`, its corresponding slots are at `i * 2` and `i * 2 + 1`.
/// Notice that the pattern ID isn't involved at all here: since this only
/// applies to a single-pattern regex, the pattern ID is always `0`.
///
/// In the multi-pattern case, the mapping is a bit more complicated. To talk
/// about it, we must define what we mean by "implicit" vs "explicit"
/// capturing groups:
///
/// * An **implicit** capturing group refers to the capturing group that is
/// present for every pattern automatically, and corresponds to the overall
/// match of a pattern. Every pattern has precisely one implicit capturing
/// group. It is always unnamed and it always corresponds to the capture group
/// index `0`.
/// * An **explicit** capturing group refers to any capturing group that
/// appears in the concrete syntax of the pattern. (Or, if an NFA was hand
/// built without any concrete syntax, it refers to any capturing group with an
/// index greater than `0`.)
///
/// Some examples:
///
/// * `\w+` has one implicit capturing group and zero explicit capturing
/// groups.
/// * `(\w+)` has one implicit group and one explicit group.
/// * `foo(\d+)(?:\pL+)(\d+)` has one implicit group and two explicit groups.
///
/// Turning back to the slot mapping, we can now state it as follows:
///
/// * Given a pattern ID `pid`, the slots for its implicit group are always
/// at `pid * 2` and `pid * 2 + 1`.
/// * Given a pattern ID `0`, the slots for its explicit groups start
/// at `group_info.pattern_len() * 2`.
/// * Given a pattern ID `pid > 0`, the slots for its explicit groups start
/// immediately following where the slots for the explicit groups of `pid - 1`
/// end.
///
/// In particular, while there is a concrete formula one can use to determine
/// where the slots for the implicit group of any pattern are, there is no
/// general formula for determining where the slots for explicit capturing
/// groups are. This is because each pattern can contain a different number
/// of groups.
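///
/// The rules above can be sketched as a small standalone function. This is
/// only an illustration that uses plain `usize` values in place of
/// `PatternID`. The name `slot_table` is hypothetical and is not part of
/// this crate, and the sketch assumes that every pattern has at least its
/// implicit group:
///
/// ```rust
/// // Given the total number of groups (implicit plus explicit) for each
/// // pattern, returns the `(start, end)` slot pair for every group.
/// fn slot_table(group_lens: &[usize]) -> Vec<Vec<(usize, usize)>> {
///     // Explicit groups start right after all of the implicit groups.
///     let mut next = group_lens.len() * 2;
///     let mut table = Vec::with_capacity(group_lens.len());
///     for (pid, &len) in group_lens.iter().enumerate() {
///         // The implicit group of pattern `pid` is always at `pid * 2`.
///         let mut slots = vec![(pid * 2, pid * 2 + 1)];
///         // Explicit groups continue where the previous pattern stopped.
///         for _ in 1..len {
///             slots.push((next, next + 1));
///             next += 2;
///         }
///         table.push(slots);
///     }
///     table
/// }
/// ```
///
/// For the per-pattern group counts `[2, 1, 5, 3]`, the sketch assigns
/// `(8, 9)` to group `1` of pattern `0` and `(20, 21)` to group `2` of
/// pattern `3`.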
///
/// The intended way of getting the slots for a particular capturing group
/// (whether implicit or explicit) is via the [`GroupInfo::slot`] or
/// [`GroupInfo::slots`] method.
///
/// See below for a concrete example of how capturing groups get mapped to
/// slots.
///
/// # Example
///
/// This example shows how to build a new `GroupInfo` and query it for
/// information.
///
/// ```
/// use regex_automata::util::{captures::GroupInfo, primitives::PatternID};
///
/// let info = GroupInfo::new(vec![
/// vec![None, Some("foo")],
/// vec![None],
/// vec![None, None, None, Some("bar"), None],
/// vec![None, None, Some("foo")],
/// ])?;
/// // The number of patterns being tracked.
/// assert_eq!(4, info.pattern_len());
/// // We can query the number of groups for any pattern.
/// assert_eq!(2, info.group_len(PatternID::must(0)));
/// assert_eq!(1, info.group_len(PatternID::must(1)));
/// assert_eq!(5, info.group_len(PatternID::must(2)));
/// assert_eq!(3, info.group_len(PatternID::must(3)));
/// // An invalid pattern always has zero groups.
/// assert_eq!(0, info.group_len(PatternID::must(999)));
/// // 2 slots per group
/// assert_eq!(22, info.slot_len());
///
/// // We can map a group index for a particular pattern to its name, if
/// // one exists.
/// assert_eq!(Some("foo"), info.to_name(PatternID::must(3), 2));
/// assert_eq!(None, info.to_name(PatternID::must(2), 4));
/// // Or map a name to its group index.
/// assert_eq!(Some(1), info.to_index(PatternID::must(0), "foo"));
/// assert_eq!(Some(2), info.to_index(PatternID::must(3), "foo"));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// # Example: mapping from capture groups to slots
///
/// This example shows the specific mapping from capture group indices for
/// each pattern to their corresponding slots. The slot values shown in this
/// example are considered an API guarantee.
///
/// ```
/// use regex_automata::util::{captures::GroupInfo, primitives::PatternID};
///
/// let info = GroupInfo::new(vec![
/// vec![None, Some("foo")],
/// vec![None],
/// vec![None, None, None, Some("bar"), None],
/// vec![None, None, Some("foo")],
/// ])?;
///
/// // We first show the slots for each pattern's implicit group.
/// assert_eq!(Some((0, 1)), info.slots(PatternID::must(0), 0));
/// assert_eq!(Some((2, 3)), info.slots(PatternID::must(1), 0));
/// assert_eq!(Some((4, 5)), info.slots(PatternID::must(2), 0));
/// assert_eq!(Some((6, 7)), info.slots(PatternID::must(3), 0));
///
/// // And now we show the slots for each pattern's explicit group.
/// assert_eq!(Some((8, 9)), info.slots(PatternID::must(0), 1));
/// assert_eq!(Some((10, 11)), info.slots(PatternID::must(2), 1));
/// assert_eq!(Some((12, 13)), info.slots(PatternID::must(2), 2));
/// assert_eq!(Some((14, 15)), info.slots(PatternID::must(2), 3));
/// assert_eq!(Some((16, 17)), info.slots(PatternID::must(2), 4));
/// assert_eq!(Some((18, 19)), info.slots(PatternID::must(3), 1));
/// assert_eq!(Some((20, 21)), info.slots(PatternID::must(3), 2));
///
/// // Asking for the slots for an invalid pattern ID or even for an invalid
/// // group index for a specific pattern will return None. So for example,
/// // you're guaranteed to not get the slots for a different pattern than the
/// // one requested.
/// assert_eq!(None, info.slots(PatternID::must(5), 0));
/// assert_eq!(None, info.slots(PatternID::must(1), 1));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[derive(Clone, Debug, Default)]
pub struct GroupInfo(Arc<GroupInfoInner>);
impl GroupInfo {
/// Creates a new group info from a sequence of patterns, where each
/// pattern yields a sequence of optional group names. The index of each
/// pattern in the sequence corresponds to its `PatternID`, and the index
/// of each group in a pattern's sequence corresponds to its group index.
///
/// While this constructor is very generic and therefore perhaps hard to
/// chew on, an example of a valid concrete type that can be passed to
/// this constructor is `Vec<Vec<Option<String>>>`. The outer `Vec`
/// corresponds to the patterns, i.e., one `Vec<Option<String>>` per
/// pattern. The inner `Vec` corresponds to the capturing groups for
/// each pattern. The `Option<String>` corresponds to the name of the
/// capturing group, if present.
///
/// It is legal to pass an empty iterator to this constructor. It will
/// return an empty group info with zero slots. An empty group info is
/// useful for cases where you have no patterns or for cases where slots
/// aren't being used at all (e.g., for most DFAs in this crate).
///
/// # Errors
///
/// This constructor returns an error if the given capturing groups are
/// invalid in some way. Those reasons include, but are not necessarily
/// limited to:
///
/// * Too many patterns (i.e., `PatternID` would overflow).
/// * Too many capturing groups (e.g., `u32` would overflow).
/// * A pattern is given that has no capturing groups. (All patterns must
/// have at least an implicit capturing group at index `0`.)
/// * The capturing group at index `0` has a name. It must be unnamed.
/// * There are duplicate capturing group names within the same pattern.
/// (Multiple capturing groups with the same name may exist, but they
/// must be in different patterns.)
///
/// An example below shows how to trigger some of the above error
/// conditions.
///
/// # Example
///
/// This example shows how to build a new `GroupInfo` and query it for
/// information.
///
/// ```
/// use regex_automata::util::captures::GroupInfo;
///
/// let info = GroupInfo::new(vec![
/// vec![None, Some("foo")],
/// vec![None],
/// vec![None, None, None, Some("bar"), None],
/// vec![None, None, Some("foo")],
/// ])?;
/// // The number of patterns being tracked.
/// assert_eq!(4, info.pattern_len());
/// // 2 slots per group
/// assert_eq!(22, info.slot_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// # Example: empty `GroupInfo`
///
/// This example shows how to build an empty `GroupInfo` and confirm that
/// it tracks no patterns and no slots.
///
/// ```
/// use regex_automata::util::captures::GroupInfo;
///
/// let info = GroupInfo::empty();
/// // Everything is zero.
/// assert_eq!(0, info.pattern_len());
/// assert_eq!(0, info.slot_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// # Example: error conditions
///
/// This example shows how to provoke some of the ways in which building
/// a `GroupInfo` can fail.
///
/// ```
/// use regex_automata::util::captures::GroupInfo;
///
/// // Either the group info is empty, or all patterns must have at least
/// // one capturing group.
/// assert!(GroupInfo::new(vec![
/// vec![None, Some("a")], // ok
/// vec![None], // ok
/// vec![], // not ok
/// ]).is_err());
/// // Note that building an empty group info is OK.
/// assert!(GroupInfo::new(Vec::<Vec<Option<String>>>::new()).is_ok());
///
/// // The first group in each pattern must correspond to an implicit
/// // anonymous group, i.e., one that is not named. By convention, this
/// // group corresponds to the overall match of a regex. Every other group
/// // in a pattern is explicit and optional.
/// assert!(GroupInfo::new(vec![vec![Some("foo")]]).is_err());
///
/// // There must not be duplicate group names within the same pattern.
/// assert!(GroupInfo::new(vec![
/// vec![None, Some("foo"), Some("foo")],
/// ]).is_err());
/// // But duplicate names across distinct patterns are OK.
/// assert!(GroupInfo::new(vec![
/// vec![None, Some("foo")],
/// vec![None, Some("foo")],
/// ]).is_ok());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// There are other ways for building a `GroupInfo` to fail, but they are
/// difficult to show. For example, construction fails if the number of
/// patterns given would overflow `PatternID`.
pub fn new<P, G, N>(pattern_groups: P) -> Result<GroupInfo, GroupInfoError>
where
P: IntoIterator<Item = G>,
G: IntoIterator<Item = Option<N>>,
N: AsRef<str>,
{
let mut group_info = GroupInfoInner {
slot_ranges: vec![],
name_to_index: vec![],
index_to_name: vec![],
memory_extra: 0,
};
for (pattern_index, groups) in pattern_groups.into_iter().enumerate() {
// If we can't convert the pattern index to an ID, then the caller
// tried to build capture info for too many patterns.
let pid = PatternID::new(pattern_index)
.map_err(GroupInfoError::too_many_patterns)?;
let mut groups_iter = groups.into_iter().enumerate();
match groups_iter.next() {
None => return Err(GroupInfoError::missing_groups(pid)),
Some((_, Some(_))) => {
return Err(GroupInfoError::first_must_be_unnamed(pid))
}
Some((_, None)) => {}
}
group_info.add_first_group(pid);
// Now iterate over the rest, which correspond to all of the
// (conventionally) explicit capture groups in a regex pattern.
for (group_index, maybe_name) in groups_iter {
// Just like for patterns, if the group index can't be
// converted to a "small" index, then the caller has given too
// many groups for a particular pattern.
let group = SmallIndex::new(group_index).map_err(|_| {
GroupInfoError::too_many_groups(pid, group_index)
})?;
group_info.add_explicit_group(pid, group, maybe_name)?;
}
}
group_info.fixup_slot_ranges()?;
Ok(GroupInfo(Arc::new(group_info)))
}
/// This creates an empty `GroupInfo`.
///
/// This is a convenience routine for calling `GroupInfo::new` with an
/// iterator that yields no elements.
///
/// # Example
///
/// This example shows how to build a new empty `GroupInfo` and query it
/// for information.
///
/// ```
/// use regex_automata::util::captures::GroupInfo;
///
/// let info = GroupInfo::empty();
/// // Everything is zero.
/// assert_eq!(0, info.pattern_len());
/// assert_eq!(0, info.all_group_len());
/// assert_eq!(0, info.slot_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
pub fn empty() -> GroupInfo {
GroupInfo::new(core::iter::empty::<[Option<&str>; 0]>())
.expect("empty group info is always valid")
}
/// Return the capture group index corresponding to the given name in the
/// given pattern. If no such capture group name exists in the given
/// pattern, then this returns `None`.
///
/// If the given pattern ID is invalid, then this returns `None`.
///
/// This also returns `None` for all inputs if this `GroupInfo` is empty.
/// To check whether capture groups are present for a specific pattern,
/// use [`GroupInfo::group_len`].
///
/// # Example
///
/// This example shows how to find the capture index for the given pattern
/// and group name.
///
/// Remember that capture indices are relative to the pattern, such that
/// the same capture index value may refer to different capturing groups
/// for distinct patterns.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let (pid0, pid1) = (PatternID::must(0), PatternID::must(1));
///
/// let nfa = NFA::new_many(&[
/// r"a(?P<quux>\w+)z(?P<foo>\s+)",
/// r"a(?P<foo>\d+)z",
/// ])?;
/// let groups = nfa.group_info();
/// assert_eq!(Some(2), groups.to_index(pid0, "foo"));
/// // Recall that capture index 0 is always unnamed and refers to the
/// // entire pattern. So the first capturing group present in the pattern
/// // itself always starts at index 1.
/// assert_eq!(Some(1), groups.to_index(pid1, "foo"));
///
/// // And if a name does not exist for a particular pattern, None is
/// // returned.
/// assert!(groups.to_index(pid0, "quux").is_some());
/// assert!(groups.to_index(pid1, "quux").is_none());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn to_index(&self, pid: PatternID, name: &str) -> Option<usize> {
let indices = self.0.name_to_index.get(pid.as_usize())?;
indices.get(name).cloned().map(|i| i.as_usize())
}
/// Return the capture name for the given index and given pattern. If the
/// corresponding group does not have a name, then this returns `None`.
///
/// If the pattern ID is invalid, then this returns `None`.
///
/// If the group index is invalid for the given pattern, then this returns
/// `None`. A group `index` is valid for a pattern `pid` if and only if
/// `index < group_info.group_len(pid)`.
///
/// This also returns `None` for all inputs if this `GroupInfo` is empty.
/// To check whether capture groups are present for a specific pattern,
/// use [`GroupInfo::group_len`].
///
/// # Example
///
/// This example shows how to find the capture group name for the given
/// pattern and group index.
///
/// ```
/// # if cfg!(miri) { return Ok(()); } // miri takes too long
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let (pid0, pid1) = (PatternID::must(0), PatternID::must(1));
///
/// let nfa = NFA::new_many(&[
/// r"a(?P<foo>\w+)z(\s+)x(\d+)",
/// r"a(\d+)z(?P<foo>\s+)",
/// ])?;
/// let groups = nfa.group_info();
/// assert_eq!(None, groups.to_name(pid0, 0));
/// assert_eq!(Some("foo"), groups.to_name(pid0, 1));
/// assert_eq!(None, groups.to_name(pid0, 2));
/// assert_eq!(None, groups.to_name(pid0, 3));
///
/// assert_eq!(None, groups.to_name(pid1, 0));
/// assert_eq!(None, groups.to_name(pid1, 1));
/// assert_eq!(Some("foo"), groups.to_name(pid1, 2));
/// // '3' is not a valid capture index for the second pattern.
/// assert_eq!(None, groups.to_name(pid1, 3));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn to_name(&self, pid: PatternID, group_index: usize) -> Option<&str> {
let pattern_names = self.0.index_to_name.get(pid.as_usize())?;
pattern_names.get(group_index)?.as_deref()
}
/// Return an iterator of all capture groups and their names (if present)
/// for a particular pattern.
///
/// If the given pattern ID is invalid or if this `GroupInfo` is empty,
/// then the iterator yields no elements.
///
/// The number of elements yielded by this iterator is always equal to
/// the result of calling [`GroupInfo::group_len`] with the same
/// `PatternID`.
///
/// # Example
///
/// This example shows how to get a list of all capture group names for
/// a particular pattern.
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let nfa = NFA::new(r"(a)(?P<foo>b)(c)(d)(?P<bar>e)")?;
/// // The first is the implicit group that is always unnamed. The next
/// // 5 groups are the explicit groups found in the concrete syntax above.
/// let expected = vec![None, None, Some("foo"), None, None, Some("bar")];
/// let got: Vec<Option<&str>> =
/// nfa.group_info().pattern_names(PatternID::ZERO).collect();
/// assert_eq!(expected, got);
///
/// // Using an invalid pattern ID will result in nothing yielded.
/// let got = nfa.group_info().pattern_names(PatternID::must(999)).count();
/// assert_eq!(0, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn pattern_names(&self, pid: PatternID) -> GroupInfoPatternNames<'_> {
GroupInfoPatternNames {
it: self
.0
.index_to_name
.get(pid.as_usize())
.map(|indices| indices.iter())
.unwrap_or([].iter()),
}
}
/// Return an iterator of all capture groups for all patterns supported by
/// this `GroupInfo`. Each item yielded is a triple of the group's pattern
/// ID, index in the pattern and the group's name, if present.
///
/// # Example
///
/// This example shows how to get a list of all capture groups found in
/// one NFA, potentially spanning multiple patterns.
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let nfa = NFA::new_many(&[
/// r"(?P<foo>a)",
/// r"a",
/// r"(a)",
/// ])?;
/// let expected = vec![
/// (PatternID::must(0), 0, None),
/// (PatternID::must(0), 1, Some("foo")),
/// (PatternID::must(1), 0, None),
/// (PatternID::must(2), 0, None),
/// (PatternID::must(2), 1, None),
/// ];
/// let got: Vec<(PatternID, usize, Option<&str>)> =
/// nfa.group_info().all_names().collect();
/// assert_eq!(expected, got);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
///
/// Unlike other capturing group related routines, this routine doesn't
/// panic even if captures aren't enabled on this NFA:
///
/// ```
/// use regex_automata::nfa::thompson::{NFA, WhichCaptures};
///
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// .build_many(&[
/// r"(?P<foo>a)",
/// r"a",
/// r"(a)",
/// ])?;
/// // When captures aren't enabled, there's nothing to return.
/// assert_eq!(0, nfa.group_info().all_names().count());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn all_names(&self) -> GroupInfoAllNames<'_> {
GroupInfoAllNames {
group_info: self,
pids: PatternID::iter(self.pattern_len()),
current_pid: None,
names: None,
}
}
/// Returns the starting and ending slot corresponding to the given
/// capturing group for the given pattern. The ending slot is always one
/// more than the starting slot returned.
///
/// Note that this is like [`GroupInfo::slot`], except that it also returns
/// the ending slot value for convenience.
///
/// If either the pattern ID or the capture index is invalid, then this
/// returns `None`.
///
/// # Example
///
/// This example shows that the starting slots for the first capturing
/// group of each pattern are distinct.
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let nfa = NFA::new_many(&["a", "b"])?;
/// assert_ne!(
/// nfa.group_info().slots(PatternID::must(0), 0),
/// nfa.group_info().slots(PatternID::must(1), 0),
/// );
///
/// // Also, the start and end slot values are never equivalent.
/// let (start, end) = nfa.group_info().slots(PatternID::ZERO, 0).unwrap();
/// assert_ne!(start, end);
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn slots(
&self,
pid: PatternID,
group_index: usize,
) -> Option<(usize, usize)> {
// Since 'slot' only ever returns valid starting slots, we know that
// there must also be an end slot and that end slot is always one more
// than the start slot.
self.slot(pid, group_index).map(|start| (start, start + 1))
}
/// Returns the starting slot corresponding to the given capturing group
/// for the given pattern. The ending slot is always one more than the
/// value returned.
///
/// If either the pattern ID or the capture index is invalid, then this
/// returns `None`.
///
/// # Example
///
/// This example shows that the starting slots for the first capturing
/// group of each pattern are distinct.
///
/// ```
/// use regex_automata::{nfa::thompson::NFA, PatternID};
///
/// let nfa = NFA::new_many(&["a", "b"])?;
/// assert_ne!(
/// nfa.group_info().slot(PatternID::must(0), 0),
/// nfa.group_info().slot(PatternID::must(1), 0),
/// );
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn slot(&self, pid: PatternID, group_index: usize) -> Option<usize> {
if group_index >= self.group_len(pid) {
return None;
}
// At this point, we know that 'pid' refers to a real pattern and that
// 'group_index' refers to a real group. We therefore also know that
// the pattern and group can be combined to return a correct slot.
// That's why we don't need to use checked arithmetic below.
if group_index == 0 {
Some(pid.as_usize() * 2)
} else {
// As above, we don't need to check that our slot is less than the
// end of our range since we already know the group index is a
// valid index for the given pattern.
let (start, _) = self.0.slot_ranges[pid];
Some(start.as_usize() + ((group_index - 1) * 2))
}
}
/// Returns the total number of patterns in this `GroupInfo`.
///
/// This may return zero if the `GroupInfo` was constructed with no
/// patterns.
///
/// This is guaranteed to be no bigger than [`PatternID::LIMIT`] because
/// `GroupInfo` construction will fail if too many patterns are added.
///
/// # Example
///
/// ```
/// use regex_automata::nfa::thompson::NFA;
///
/// let nfa = NFA::new_many(&["[0-9]+", "[a-z]+", "[A-Z]+"])?;
/// assert_eq!(3, nfa.group_info().pattern_len());
///
/// let nfa = NFA::never_match();
/// assert_eq!(0, nfa.group_info().pattern_len());
///
/// let nfa = NFA::always_match();
/// assert_eq!(1, nfa.group_info().pattern_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn pattern_len(&self) -> usize {
self.0.pattern_len()
}
/// Return the number of capture groups in a pattern.
///
/// If the pattern ID is invalid, then this returns `0`.
///
/// # Example
///
/// This example shows how the values returned by this routine may vary
/// for different patterns and NFA configurations.
///
/// ```
/// use regex_automata::{nfa::thompson::{NFA, WhichCaptures}, PatternID};
///
/// let nfa = NFA::new(r"(a)(b)(c)")?;
/// // There are 3 explicit groups in the pattern's concrete syntax and
/// // 1 unnamed and implicit group spanning the entire pattern.
/// assert_eq!(4, nfa.group_info().group_len(PatternID::ZERO));
///
/// let nfa = NFA::new(r"abc")?;
/// // There is just the unnamed implicit group.
/// assert_eq!(1, nfa.group_info().group_len(PatternID::ZERO));
///
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// .build(r"abc")?;
/// // We disabled capturing groups, so there are none.
/// assert_eq!(0, nfa.group_info().group_len(PatternID::ZERO));
///
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// .build(r"(a)(b)(c)")?;
/// // We disabled capturing groups, so there are none, even if there are
/// // explicit groups in the concrete syntax.
/// assert_eq!(0, nfa.group_info().group_len(PatternID::ZERO));
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn group_len(&self, pid: PatternID) -> usize {
self.0.group_len(pid)
}
/// Return the total number of capture groups across all patterns.
///
/// This includes implicit groups that represent the entire match of a
/// pattern.
///
/// # Example
///
/// This example shows how the values returned by this routine may vary
/// for different patterns and NFA configurations.
///
/// ```
/// use regex_automata::{nfa::thompson::{NFA, WhichCaptures}, PatternID};
///
/// let nfa = NFA::new(r"(a)(b)(c)")?;
/// // There are 3 explicit groups in the pattern's concrete syntax and
/// // 1 unnamed and implicit group spanning the entire pattern.
/// assert_eq!(4, nfa.group_info().all_group_len());
///
/// let nfa = NFA::new(r"abc")?;
/// // There is just the unnamed implicit group.
/// assert_eq!(1, nfa.group_info().all_group_len());
///
/// let nfa = NFA::new_many(&["(a)", "b", "(c)"])?;
/// // Each pattern has one implicit group, and two
/// // patterns have one explicit group each.
/// assert_eq!(5, nfa.group_info().all_group_len());
///
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// .build(r"abc")?;
/// // We disabled capturing groups, so there are none.
/// assert_eq!(0, nfa.group_info().all_group_len());
///
/// let nfa = NFA::compiler()
/// .configure(NFA::config().which_captures(WhichCaptures::None))
/// .build(r"(a)(b)(c)")?;
/// // We disabled capturing groups, so there are none, even if there are
/// // explicit groups in the concrete syntax.
/// assert_eq!(0, nfa.group_info().all_group_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn all_group_len(&self) -> usize {
self.slot_len() / 2
}
/// Returns the total number of slots in this `GroupInfo` across all
/// patterns.
///
/// The total number of slots is always twice the total number of capturing
/// groups, including both implicit and explicit groups.
///
/// # Example
///
/// This example shows the relationship between the number of capturing
/// groups and slots.
///
/// ```
/// use regex_automata::util::captures::GroupInfo;
///
/// // There are 11 total groups here.
/// let info = GroupInfo::new(vec![
/// vec![None, Some("foo")],
/// vec![None],
/// vec![None, None, None, Some("bar"), None],
/// vec![None, None, Some("foo")],
/// ])?;
/// // 2 slots per group gives us 11*2=22 slots.
/// assert_eq!(22, info.slot_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn slot_len(&self) -> usize {
self.0.small_slot_len().as_usize()
}
/// Returns the total number of slots for implicit capturing groups.
///
/// This is like [`GroupInfo::slot_len`], except it doesn't include the
/// explicit slots for each pattern. Since there are always exactly 2
/// implicit slots for each pattern, the number of implicit slots is always
/// equal to twice the number of patterns.
///
/// # Example
///
/// This example shows the relationship between the number of capturing
/// groups, implicit slots and explicit slots.
///
/// ```
/// use regex_automata::util::captures::GroupInfo;
///
/// // There are 3 total groups here: 1 implicit and 2 explicit.
/// let info = GroupInfo::new(vec![vec![None, Some("foo"), Some("bar")]])?;
/// // 2 slots per group gives us 3*2=6 slots.
/// assert_eq!(6, info.slot_len());
/// // 2 implicit slots per pattern gives us 2 implicit slots since there
/// // is 1 pattern.
/// assert_eq!(2, info.implicit_slot_len());
/// // 2 explicit capturing groups gives us 2*2=4 explicit slots.
/// assert_eq!(4, info.explicit_slot_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn implicit_slot_len(&self) -> usize {
self.pattern_len() * 2
}
/// Returns the total number of slots for explicit capturing groups.
///
/// This is like [`GroupInfo::slot_len`], except it doesn't include the
/// implicit slots for each pattern. (There are always 2 implicit slots for
/// each pattern.)
///
/// For a non-empty `GroupInfo`, it is always the case that `slot_len` is
/// strictly greater than `explicit_slot_len`. For an empty `GroupInfo`,
/// both the total number of slots and the number of explicit slots are
/// `0`.
///
/// # Example
///
/// This example shows the relationship between the number of capturing
/// groups, implicit slots and explicit slots.
///
/// ```
/// use regex_automata::util::captures::GroupInfo;
///
/// // There are 3 total groups here: 1 implicit and 2 explicit.
/// let info = GroupInfo::new(vec![vec![None, Some("foo"), Some("bar")]])?;
/// // 2 slots per group gives us 3*2=6 slots.
/// assert_eq!(6, info.slot_len());
/// // 2 implicit slots per pattern gives us 2 implicit slots since there
/// // is 1 pattern.
/// assert_eq!(2, info.implicit_slot_len());
/// // 2 explicit capturing groups gives us 2*2=4 explicit slots.
/// assert_eq!(4, info.explicit_slot_len());
///
/// # Ok::<(), Box<dyn std::error::Error>>(())
/// ```
#[inline]
pub fn explicit_slot_len(&self) -> usize {
self.slot_len().saturating_sub(self.implicit_slot_len())
}
/// Returns the memory usage, in bytes, of this `GroupInfo`.
///
/// This does **not** include the stack size used up by this `GroupInfo`.
/// To compute that, use `std::mem::size_of::<GroupInfo>()`.
#[inline]
pub fn memory_usage(&self) -> usize {
use core::mem::size_of as s;
s::<GroupInfoInner>()
+ self.0.slot_ranges.len() * s::<(SmallIndex, SmallIndex)>()
+ self.0.name_to_index.len() * s::<CaptureNameMap>()
+ self.0.index_to_name.len() * s::<Vec<Option<Arc<str>>>>()
+ self.0.memory_extra
}
}
/// A map from capture group name to its corresponding capture group index.
///
/// This type is actually wrapped inside a Vec indexed by pattern ID on a
/// `GroupInfo`, since multiple patterns may have the same capture group name.
/// That is, each pattern gets its own namespace of capture group names.
///
/// Perhaps a more memory efficient representation would be
/// `HashMap<(PatternID, Arc<str>), usize>`, but this makes it difficult to
/// look up a capture index by name without producing an `Arc<str>`, which
/// requires an allocation. To fix this, I think we'd need to define our own
/// unsized type or something? Anyway, I didn't give this much thought since
/// it probably doesn't matter much in the grand scheme of things. But it did
/// stand out to me as mildly wasteful.
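/**
The lookup trade-off described above can be illustrated with a standalone
sketch. It is not part of this crate: it uses plain `usize` values in place
of `PatternID` and `SmallIndex`, and the names `per_pattern_lookup` and
`tuple_key_lookup` are hypothetical. It assumes only the standard library's
`HashMap` and the fact that `Arc<str>` implements `Borrow<str>`.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// One map per pattern. Since `Arc<str>: Borrow<str>`, a `&str` can be used
// directly as the lookup key and no allocation is needed.
fn per_pattern_lookup(
    maps: &[HashMap<Arc<str>, usize>],
    pid: usize,
    name: &str,
) -> Option<usize> {
    maps.get(pid)?.get(name).copied()
}

// A single map keyed by `(pattern, name)`. The tuple key only borrows as
// itself, so the lookup must first build an `Arc<str>`, which allocates.
fn tuple_key_lookup(
    map: &HashMap<(usize, Arc<str>), usize>,
    pid: usize,
    name: &str,
) -> Option<usize> {
    map.get(&(pid, Arc::<str>::from(name))).copied()
}
```

Both functions return the same answers for the same data. The difference is
only in whether a lookup has to allocate a key.
*/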
#[cfg(feature = "std")]
type CaptureNameMap = std::collections::HashMap<Arc<str>, SmallIndex>;
#[cfg(not(feature = "std"))]
type CaptureNameMap = alloc::collections::BTreeMap<Arc<str>, SmallIndex>;
/// The inner guts of `GroupInfo`. This type only exists so that it can
/// be wrapped in an `Arc` to make `GroupInfo` reference counted.
#[derive(Debug, Default)]
struct GroupInfoInner {
slot_ranges: Vec<(SmallIndex, SmallIndex)>,
name_to_index: Vec<CaptureNameMap>,
index_to_name: Vec<Vec<Option<Arc<str>>>>,
memory_extra: usize,
}
impl GroupInfoInner {
/// This adds the first unnamed group for the given pattern ID. The given
/// pattern ID must be zero if this is the first time this method is
/// called, or must be exactly one more than the pattern ID supplied to the
/// previous call to this method. (This method panics if this rule is
/// violated.)
///
/// This can be thought of as initializing the GroupInfo state for the
/// given pattern and closing off the state for any previous pattern.
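/**
Since this type is private, the build phase is easiest to illustrate with a
standalone sketch. The model below is an assumption-laden simplification: it
uses plain `usize` values instead of `PatternID` and `SmallIndex`, it omits
names and error handling, and `Builder` and `build` are hypothetical names.

```rust
// Build-phase slot ranges for explicit groups, one range per pattern.
struct Builder {
    slot_ranges: Vec<(usize, usize)>,
}

impl Builder {
    // Start a new pattern. Patterns must be added in order: 0, 1, 2, ...
    fn add_first_group(&mut self, pid: usize) {
        assert_eq!(pid, self.slot_ranges.len());
        // The new (empty) range begins where the previous pattern's ended.
        let slot_start = self.slot_ranges.last().map_or(0, |&(_, end)| end);
        self.slot_ranges.push((slot_start, slot_start));
    }

    // Add one explicit group to pattern `pid`, which takes two more slots.
    fn add_explicit_group(&mut self, pid: usize) {
        self.slot_ranges[pid].1 += 2;
    }
}

// Build ranges for patterns with the given total group counts. Each count
// includes the implicit group and must therefore be at least 1.
fn build(group_counts: &[usize]) -> Vec<(usize, usize)> {
    let mut b = Builder { slot_ranges: vec![] };
    for (pid, &count) in group_counts.iter().enumerate() {
        b.add_first_group(pid);
        for _ in 1..count {
            b.add_explicit_group(pid);
        }
    }
    b.slot_ranges
}
```

In this model, `build(&[2, 1, 5, 3])` yields the ranges `(0, 2)`, `(2, 2)`,
`(2, 10)` and `(10, 14)`. These do not yet account for the implicit slots,
which is why the ranges are fixed up once all patterns have been added.
*/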
fn add_first_group(&mut self, pid: PatternID) {
assert_eq!(pid.as_usize(), self.slot_ranges.len());
assert_eq!(pid.as_usize(), self.name_to_index.len());
assert_eq!(pid.as_usize(), self.index_to_name.len());
// This is the start of our slots for the explicit capturing groups.
// Note that since the slots for the 0th group for every pattern appear
// before any slots for the nth group (where n > 0) in any pattern, we
// will have to fix up the slot ranges once we know how many patterns
// we've added capture groups for.
let slot_start = self.small_slot_len();
self.slot_ranges.push((slot_start, slot_start));
self.name_to_index.push(CaptureNameMap::new());
self.index_to_name.push(vec![None]);
self.memory_extra += core::mem::size_of::<Option<Arc<str>>>();
}
/// Add an explicit capturing group for the given pattern with the given
/// index. If the group has a name, then that must be given as well.
///
/// Note that every capturing group except for the first or zeroth group is
/// explicit.
///
/// This returns an error if adding this group would result in overflowing
/// slot indices or if a capturing group with the same name for this
/// pattern has already been added.
fn add_explicit_group<N: AsRef<str>>(
&mut self,
pid: PatternID,
group: SmallIndex,
maybe_name: Option<N>,
) -> Result<(), GroupInfoError> {
// We also need to check that the slot index generated for
// this group is also valid. Although, this is a little weird
// because we offset these indices below, at which point, we'll
// have to recheck them. Gosh this is annoying. Note that
// the '+2' below is OK because 'end' is guaranteed to be less
// than isize::MAX.
let end = &mut self.slot_ranges[pid].1;
*end = SmallIndex::new(end.as_usize() + 2).map_err(|_| {
GroupInfoError::too_many_groups(pid, group.as_usize())
})?;
if let Some(name) = maybe_name {
let name = Arc::<str>::from(name.as_ref());
if self.name_to_index[pid].contains_key(&*name) {
return Err(GroupInfoError::duplicate(pid, &name));
}
let len = name.len();
self.name_to_index[pid].insert(Arc::clone(&name), group);
self.index_to_name[pid].push(Some(name));
// Adds the memory used by the Arc<str> in both maps.
self.memory_extra +=
2 * (len + core::mem::size_of::<Option<Arc<str>>>());
// And also the value entry for the 'name_to_index' map.
// This is probably an underestimate for 'name_to_index' since
// hashmaps/btrees likely have some non-zero overhead, but we
// assume here that they have zero overhead.
self.memory_extra += core::mem::size_of::<SmallIndex>();
} else {
self.index_to_name[pid].push(None);
self.memory_extra += core::mem::size_of::<Option<Arc<str>>>();
}
// This is a sanity assert that checks that our group index
// is in line with the number of groups added so far for this
// pattern.
assert_eq!(group.one_more(), self.group_len(pid));
// And is also in line with the 'index_to_name' map.
assert_eq!(group.one_more(), self.index_to_name[pid].len());
Ok(())
}
/// This corrects the slot ranges to account for the slots corresponding
/// to the zeroth group of each pattern. That is, every slot range is
/// offset by 'pattern_len() * 2', since each pattern uses two slots to
/// represent the zeroth group.
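/**
The offset arithmetic can be worked through with a standalone sketch. It is
a simplified model and not this crate's code: it uses plain `usize` values
instead of `SmallIndex`, reports overflow as `None` instead of a
`GroupInfoError`, and the names `fixup` and `ranges_before_fixup` are
hypothetical.

```rust
// Shift every explicit-group slot range past the implicit slots. Each
// pattern owns 2 implicit slots, so the shift is twice the pattern count.
// Returns `None` if the arithmetic would overflow `usize`.
fn fixup(ranges: &mut [(usize, usize)]) -> Option<()> {
    let offset = ranges.len().checked_mul(2)?;
    for (start, end) in ranges.iter_mut() {
        *end = end.checked_add(offset)?;
        *start = start.checked_add(offset)?;
    }
    Some(())
}

// Ranges as they look before the fix-up for four patterns that have
// 2, 1, 5 and 3 groups respectively (counting the implicit group).
fn ranges_before_fixup() -> Vec<(usize, usize)> {
    vec![(0, 2), (2, 2), (2, 10), (10, 14)]
}
```

With four patterns the offset is 8, so applying `fixup` to the ranges above
gives `(8, 10)`, `(10, 10)`, `(10, 18)` and `(18, 22)`. The final end value
of 22 is the total number of slots for 11 groups.
*/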
fn fixup_slot_ranges(&mut self) -> Result<(), GroupInfoError> {
use crate::util::primitives::IteratorIndexExt;
// Since we know the number of patterns fits in PatternID and
// PatternID::MAX < isize::MAX, it follows that multiplying by 2 will
// never overflow usize.
let offset = self.pattern_len().checked_mul(2).unwrap();
for (pid, &mut (ref mut start, ref mut end)) in
self.slot_ranges.iter_mut().with_pattern_ids()
{
let group_len = 1 + ((end.as_usize() - start.as_usize()) / 2);
let new_end = match end.as_usize().checked_add(offset) {
Some(new_end) => new_end,
None => {
return Err(GroupInfoError::too_many_groups(
pid, group_len,
))
}
};
*end = SmallIndex::new(new_end).map_err(|_| {
GroupInfoError::too_many_groups(pid, group_len)
})?;
// Since start <= end, if end is valid then start must be too.
*start = SmallIndex::new(start.as_usize() + offset).unwrap();
}
Ok(())
}
/// Return the total number of patterns represented by this capture slot
/// info.
fn pattern_len(&self) -> usize {
self.slot_ranges.len()
}
/// Return the total number of capturing groups for the given pattern. If
/// the given pattern isn't valid for this capture slot info, then 0 is
/// returned.
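/**
As a worked illustration, the sketch below models how a group count and a
starting slot can be derived from fixed-up slot ranges. It is a simplified
stand-in and not this crate's code: it uses plain `usize` values instead of
`PatternID` and `SmallIndex`, and the free functions `group_len` and `slot`
are hypothetical helpers that take the ranges as an argument.

```rust
// Number of groups for pattern `pid`: the implicit group plus one group
// per pair of explicit slots. An unknown pattern has 0 groups.
fn group_len(ranges: &[(usize, usize)], pid: usize) -> usize {
    match ranges.get(pid) {
        None => 0,
        Some(&(start, end)) => 1 + (end - start) / 2,
    }
}

// Starting slot of `group` in pattern `pid`, or `None` if either the
// pattern or the group index is invalid.
fn slot(ranges: &[(usize, usize)], pid: usize, group: usize) -> Option<usize> {
    if group >= group_len(ranges, pid) {
        return None;
    }
    if group == 0 {
        // Implicit groups come first, two slots per pattern.
        Some(pid * 2)
    } else {
        // Explicit groups are laid out inside the pattern's own range.
        Some(ranges[pid].0 + (group - 1) * 2)
    }
}
```

For the ranges `(8, 10)`, `(10, 10)`, `(10, 18)` and `(18, 22)`, this model
reports 2, 1, 5 and 3 groups, and 0 groups for any other pattern index.
*/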
fn group_len(&self, pid: PatternID) -> usize {
let (start, end) = match self.slot_ranges.get(pid.as_usize()) {
None => return 0,
Some(range) => range,
};
// The difference between any two SmallIndex values always fits in a
// usize since we know that SmallIndex::MAX <= isize::MAX-1. We also
// know that start<=end by construction and that the number of groups
// never exceeds SmallIndex::MAX and thus never overflows usize.
1 + ((end.as_usize() - start.as_usize()) / 2)
}
/// Return the total number of slots in this capture slot info as a
/// "small index."
fn small_slot_len(&self) -> SmallIndex {
// Since slots are allocated in order of pattern (starting at 0) and
// then in order of capture group, it follows that the number of slots
// is the end of the range of slots for the last pattern. This is
// true even when the last pattern has no capturing groups, since
// 'slot_ranges' will still represent it explicitly with an empty
// range.
self.slot_ranges.last().map_or(SmallIndex::ZERO, |&(_, end)| end)
}
}
/// An error that may occur when building a `GroupInfo`.
///
/// Building a `GroupInfo` does a variety of checks to make sure the
/// capturing groups satisfy a number of invariants. This includes, but is not
/// limited to, ensuring that the first capturing group is unnamed and that
/// there are no duplicate capture group names within a specific pattern.
#[derive(Clone, Debug)]
pub struct GroupInfoError {
kind: GroupInfoErrorKind,
}
/// The kind of error that occurs when building a `GroupInfo` fails.
///
/// We keep this un-exported because it's not clear how useful it is to
/// export it.
#[derive(Clone, Debug)]
enum GroupInfoErrorKind {
/// This occurs when too many patterns have been added. i.e., It would
/// otherwise overflow a `PatternID`.
TooManyPatterns { err: PatternIDError },
/// This occurs when too many capturing groups have been added for a
/// particular pattern.
TooManyGroups {
/// The ID of the pattern that had too many groups.
pattern: PatternID,
/// The minimum number of groups that the caller has tried to add for
/// a pattern.
minimum: usize,
},
/// An error that occurs when a pattern has no capture groups. Either the
/// group info must be empty, or all patterns must have at least one group
/// (corresponding to the unnamed group for the entire pattern).
MissingGroups {
/// The ID of the pattern that had no capturing groups.
pattern: PatternID,
},
/// An error that occurs when one tries to provide a name for the capture
/// group at index 0. This capturing group must currently always be
/// unnamed.
FirstMustBeUnnamed {
/// The ID of the pattern that was found to have a named first
/// capturing group.
pattern: PatternID,
},
/// An error that occurs when duplicate capture group names for the same
/// pattern are added.
///
/// NOTE: At time of writing, this error can never occur if you're using
/// regex-syntax, since the parser itself will reject patterns with
/// duplicate capture group names. This error can only occur when the
/// builder is used to hand construct NFAs.
Duplicate {
/// The pattern in which the duplicate capture group name was found.
pattern: PatternID,
/// The duplicate name.
name: String,
},
}
impl GroupInfoError {
fn too_many_patterns(err: PatternIDError) -> GroupInfoError {
GroupInfoError { kind: GroupInfoErrorKind::TooManyPatterns { err } }
}
fn too_many_groups(pattern: PatternID, minimum: usize) -> GroupInfoError {
GroupInfoError {
kind: GroupInfoErrorKind::TooManyGroups { pattern, minimum },
}
}
fn missing_groups(pattern: PatternID) -> GroupInfoError {
GroupInfoError { kind: GroupInfoErrorKind::MissingGroups { pattern } }
}
fn first_must_be_unnamed(pattern: PatternID) -> GroupInfoError {
GroupInfoError {
kind: GroupInfoErrorKind::FirstMustBeUnnamed { pattern },
}
}
fn duplicate(pattern: PatternID, name: &str) -> GroupInfoError {
GroupInfoError {
kind: GroupInfoErrorKind::Duplicate {
pattern,
name: String::from(name),
},
}
}
}
#[cfg(feature = "std")]
impl std::error::Error for GroupInfoError {
fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
match self.kind {
GroupInfoErrorKind::TooManyPatterns { .. }
| GroupInfoErrorKind::TooManyGroups { .. }
| GroupInfoErrorKind::MissingGroups { .. }
| GroupInfoErrorKind::FirstMustBeUnnamed { .. }
| GroupInfoErrorKind::Duplicate { .. } => None,
}
}
}
impl core::fmt::Display for GroupInfoError {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
use self::GroupInfoErrorKind::*;
match self.kind {
TooManyPatterns { ref err } => {
write!(f, "too many patterns to build capture info: {}", err)
}
TooManyGroups { pattern, minimum } => {
write!(
f,
"too many capture groups (at least {}) were \
found for pattern {}",
minimum,
pattern.as_usize()
)
}
MissingGroups { pattern } => write!(
f,
"no capturing groups found for pattern {} \
(either all patterns have zero groups or all patterns have \
at least one group)",
pattern.as_usize(),
),
FirstMustBeUnnamed { pattern } => write!(
f,
"first capture group (at index 0) for pattern {} has a name \
(it must be unnamed)",
pattern.as_usize(),
),
Duplicate { pattern, ref name } => write!(
f,
"duplicate capture group name '{}' found for pattern {}",
name,
pattern.as_usize(),
),
}
}
}
/// An iterator over capturing groups and their names for a specific pattern.
///
/// This iterator is created by [`GroupInfo::pattern_names`].
///
/// The lifetime parameter `'a` refers to the lifetime of the `GroupInfo`
/// from which this iterator was created.
#[derive(Clone, Debug)]
pub struct GroupInfoPatternNames<'a> {
it: core::slice::Iter<'a, Option<Arc<str>>>,
}
impl GroupInfoPatternNames<'static> {
fn empty() -> GroupInfoPatternNames<'static> {
GroupInfoPatternNames { it: [].iter() }
}
}
impl<'a> Iterator for GroupInfoPatternNames<'a> {
type Item = Option<&'a str>;
fn next(&mut self) -> Option<Option<&'a str>> {
self.it.next().map(|x| x.as_deref())
}
fn size_hint(&self) -> (usize, Option<usize>) {
self.it.size_hint()
}
fn count(self) -> usize {
self.it.count()
}
}
impl<'a> ExactSizeIterator for GroupInfoPatternNames<'a> {}
impl<'a> core::iter::FusedIterator for GroupInfoPatternNames<'a> {}
/// An iterator over capturing groups and their names for a `GroupInfo`.
///
/// This iterator is created by [`GroupInfo::all_names`].
///
/// The lifetime parameter `'a` refers to the lifetime of the `GroupInfo`
/// from which this iterator was created.
#[derive(Debug)]
pub struct GroupInfoAllNames<'a> {
group_info: &'a GroupInfo,
pids: PatternIDIter,
current_pid: Option<PatternID>,
names: Option<core::iter::Enumerate<GroupInfoPatternNames<'a>>>,
}
impl<'a> Iterator for GroupInfoAllNames<'a> {
type Item = (PatternID, usize, Option<&'a str>);
fn next(&mut self) -> Option<(PatternID, usize, Option<&'a str>)> {
// If the group info has no captures, then we never have anything
// to yield. We need to consider this case explicitly (at time of
// writing) because 'pattern_names' will panic if captures aren't
// enabled.
if self.group_info.0.index_to_name.is_empty() {
return None;
}
if self.current_pid.is_none() {
self.current_pid = Some(self.pids.next()?);
}
let pid = self.current_pid.unwrap();
if self.names.is_none() {
self.names = Some(self.group_info.pattern_names(pid).enumerate());
}
let (group_index, name) = match self.names.as_mut().unwrap().next() {
Some((group_index, name)) => (group_index, name),
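None => {
// The names for the current pattern are exhausted, so reset the
// per-pattern state and advance to the next pattern.
self.current_pid = None;
self.names = None;
return self.next();
}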
};
Some((pid, group_index, name))
}
}