#
# Secret Labs' Regular Expression Engine
#
# re-compatible interface for the sre matching engine
#
# Copyright (c) 1998-2001 by Secret Labs AB.  All rights reserved.
#
# This version of the SRE library can be redistributed under CNRI's
# Python 1.6 license.  For any other use, please contact Secret Labs
# AB ([email protected]).
#
# Portions of this engine have been developed in cooperation with
# CNRI.  Hewlett-Packard provided funding for 1.6 integration and
# other compatibility work.
#

r"""Support for regular expressions (RE).

This module provides regular expression matching operations similar to
those found in Perl.  It supports both 8-bit and Unicode strings; both
the pattern and the strings being processed can contain null bytes and
characters outside the US ASCII range.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves.  You can
concatenate ordinary characters, so last matches the string 'last'.

The special characters are:
    "."      Matches any character except a newline.
    "^"      Matches the start of the string.
    "$"      Matches the end of the string or just before the newline at
             the end of the string.
    "*"      Matches 0 or more (greedy) repetitions of the preceding RE.
             Greedy means that it will match as many repetitions as possible.
    "+"      Matches 1 or more (greedy) repetitions of the preceding RE.
    "?"      Matches 0 or 1 (greedy) repetitions of the preceding RE.
    *?,+?,?? Non-greedy versions of the previous three special characters.
    {m,n}    Matches from m to n repetitions of the preceding RE.
    {m,n}?   Non-greedy version of the above.
    "\\"     Either escapes special characters or signals a special sequence.
    []       Indicates a set of characters.
             A "^" as the first character indicates a complementing set.
    "|"      A|B creates an RE that will match either A or B.
    (...)    Matches the RE inside the parentheses.
             The contents can be retrieved or matched later in the string.
    (?aiLmsux) The letters set the corresponding flags defined below.
    (?:...)  Non-grouping version of regular parentheses.
    (?P<name>...) The substring matched by the group is accessible by name.
    (?P=name)     Matches the text matched earlier by the group named name.
    (?#...)  A comment; ignored.
    (?=...)  Matches if ... matches next, but doesn't consume the string.
    (?!...)  Matches if ... doesn't match next.
    (?<=...) Matches if preceded by ... (must be fixed length).
    (?<!...) Matches if not preceded by ... (must be fixed length).
    (?(id/name)yes|no) Matches yes pattern if the group with id/name matched,
                       the (optional) no pattern otherwise.
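
A few of the constructs listed above, shown as a small sketch. The sample
strings and variable names are made up for illustration and are not part
of this module:

```python
import re

# Greedy "*" takes as much as it can; "*?" stops at the first opportunity.
greedy = re.search(r"<.*>", "<a><b>").group()    # "<a><b>"
lazy = re.search(r"<.*?>", "<a><b>").group()     # "<a>"

# A named group plus a backreference to it: find a doubled word.
doubled = re.search(r"(?P<word>\w+) (?P=word)", "it was was late")

# A lookahead checks what follows without consuming it.
price = re.search(r"\d+(?= USD)", "cost: 42 USD")

print(greedy, lazy, doubled.group("word"), price.group())
```

Here doubled.group("word") is "was" and price.group() is "42"; the
" USD" suffix is required by the lookahead but stays out of the match.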

The special sequences consist of "\\" and a character from the list
below.  If the ordinary character is not on the list, then the
resulting RE will match the second character.
    \number  Matches the contents of the group of the same number.
    \A       Matches only at the start of the string.
    \Z       Matches only at the end of the string.
    \b       Matches the empty string, but only at the start or end of a word.
    \B       Matches the empty string, but not at the start or end of a word.
    \d       Matches any decimal digit; equivalent to the set [0-9] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode digits.
    \D       Matches any non-digit character; equivalent to [^\d].
    \s       Matches any whitespace character; equivalent to [ \t\n\r\f\v] in
             bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the whole
             range of Unicode whitespace characters.
    \S       Matches any non-whitespace character; equivalent to [^\s].
    \w       Matches any alphanumeric character; equivalent to [a-zA-Z0-9_]
             in bytes patterns or string patterns with the ASCII flag.
             In string patterns without the ASCII flag, it will match the
             range of Unicode alphanumeric characters (letters plus digits
             plus underscore).
             With LOCALE, it will match the set [0-9_] plus characters defined
             as letters for the current locale.
    \W       Matches the complement of \w.
    \\       Matches a literal backslash.
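
The difference between the Unicode and ASCII readings of \w and \d, and
the behaviour of \b and \number, can be sketched as follows. The sample
strings and variable names are illustrative assumptions only:

```python
import re

word = "caf\u00e9"   # "café"; the final letter lies outside ASCII
digit = "\u0663"     # ARABIC-INDIC DIGIT THREE

# str patterns default to the Unicode categories for \w and \d.
unicode_word = re.fullmatch(r"\w+", word)
unicode_digit = re.fullmatch(r"\d", digit)

# With re.ASCII the same classes shrink to [a-zA-Z0-9_] and [0-9].
ascii_word = re.fullmatch(r"\w+", word, re.ASCII)
ascii_digit = re.fullmatch(r"\d", digit, re.ASCII)

# \b matches only at a word edge, so "bar" is not found inside "bars".
inside = re.search(r"\bbar\b", "foo bars")
alone = re.search(r"\bbar\b", "foo bar baz")

# \1 refers back to whatever group 1 matched.
repeated = re.search(r"(\w)\1", "hello")
```

The two Unicode matches succeed, the two ASCII ones return None, and
repeated.group() is "ll".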

This module exports the following functions:
    match     Match a regular expression pattern to the beginning of a string.
    fullmatch Match a regular expression pattern to all of a string.
    search    Search a string for the presence of a pattern.
    sub       Substitute occurrences of a pattern found in a string.
    subn      Same as sub, but also return the number of substitutions made.
    split     Split a string by the occurrences of a pattern.
    findall   Find all occurrences of a pattern in a string.
    finditer  Return an iterator yielding a Match object for each match.
    compile   Compile a pattern into a Pattern object.
    purge     Clear the regular expression cache.
    escape    Backslash the special characters in a string.
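
A brief tour of the functions listed above, as a sketch. The input string
and variable names are made up for this example:

```python
import re

text = "a1,b22;c333"

first = re.match(r"[a-z]\d+", text).group()     # anchored at the start
whole = re.fullmatch(r"[a-z]\d+", text)         # must span the whole string
found = re.search(r"\d{2,}", text).group()      # first match anywhere
parts = re.split(r"[,;]", text)
digits = re.findall(r"\d+", text)
spans = [m.span() for m in re.finditer(r"\d+", text)]
masked, count = re.subn(r"\d", "#", text)
escaped = re.escape("a.b*c")

# Calling compile once gives a Pattern whose methods mirror the functions.
number = re.compile(r"\d+")
same_digits = number.findall(text)
```

match only looks at the start of the string, so first is "a1", while
fullmatch returns None here because the pattern does not cover all of text.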

Each function other than purge and escape can take an optional 'flags' argument
consisting of one or more of the following module constants, joined by "|".
A, L, and U are mutually exclusive.
    A  ASCII       For string patterns, make \w, \W, \b, \B, \d, \D
                   match the corresponding ASCII character categories
                   (rather than the whole Unicode categories, which is the
                   default).
                   For bytes patterns, this flag is the only available
                   behaviour and needn't be specified.
    I  IGNORECASE  Perform case-insensitive matching.
    L  LOCALE      Make \w, \W, \b, \B dependent on the current locale.
    M  MULTILINE   "^" matches the beginning of lines (after a newline)
                   as well as the beginning of the string.
                   "$" matches the end of lines (before a newline) as well
                   as the end of the string.
    S  DOTALL      "." matches any character at all, including the newline.
    X  VERBOSE     Ignore whitespace and comments for nicer looking REs.
    U  UNICODE     For compatibility only. Ignored for string patterns (it
                   is the default), and forbidden for bytes patterns.
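
How the flags combine and what they change, as a small sketch. The sample
text and variable names are assumptions made for this example:

```python
import re

text = "Alpha\nbeta\nGAMMA"

# Flags are joined with "|".
lines = re.findall(r"^[a-z]+$", text, re.IGNORECASE | re.MULTILINE)

# Without DOTALL "." stops at a newline; with it "." crosses lines.
one_line = re.match(r".+", text).group()
all_lines = re.match(r".+", text, re.DOTALL).group()

# VERBOSE ignores whitespace and "#" comments inside the pattern.
version = re.compile(r'''
    (\d+)   # major
    \.      # literal dot
    (\d+)   # minor
''', re.VERBOSE)
major_minor = version.match("3.11").groups()
```

With both IGNORECASE and MULTILINE set, all three lines are found;
one_line is just "Alpha", and major_minor is ("3", "11").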

This module also defines an exception 'error'.

"""

import enum
from . import _compiler, _parser
import functools

# public symbols
__all__ = [
    "match", "fullmatch", "search", "sub", "subn", "split",
    "findall", "finditer", "compile", "purge", "template", "escape",
    "error", "Pattern", "Match", "A", "I", "L", "M", "S", "X", "U",
    "ASCII", "IGNORECASE", "LOCALE", "MULTILINE", "DOTALL", "VERBOSE",
    "UNICODE", "NOFLAG", "RegexFlag",
]

__version__ = "2.2.1"

# global_enum exports the members (A, I, T, DEBUG, ...) into the module
# namespace; the rest of this file refers to them by their bare names.
@enum.global_enum
@enum._simple_enum(enum.IntFlag, boundary=enum.KEEP)
class RegexFlag:
    NOFLAG = 0
    ASCII = A = _compiler.SRE_FLAG_ASCII # assume ascii "locale"
    IGNORECASE = I = _compiler.SRE_FLAG_IGNORECASE # ignore case
    LOCALE = L = _compiler.SRE_FLAG_LOCALE # assume current 8-bit locale
    UNICODE = U = _compiler.SRE_FLAG_UNICODE # assume unicode "locale"
    MULTILINE = M = _compiler.SRE_FLAG_MULTILINE # make anchors look for newline
    DOTALL = S = _compiler.SRE_FLAG_DOTALL # make dot match newline
    VERBOSE = X = _compiler.SRE_FLAG_VERBOSE # ignore whitespace and comments
    # sre extensions (experimental, don't rely on these)
    TEMPLATE = T = _compiler.SRE_FLAG_TEMPLATE # unknown purpose, deprecated
    DEBUG = _compiler.SRE_FLAG_DEBUG # dump pattern after compilation
    __str__ = object.__str__
    _numeric_repr_ = hex

# sre exception
error = _compiler.error

# --------------------------------------------------------------------
# public interface

def match(pattern, string, flags=0):
    """Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).match(string)

def fullmatch(pattern, string, flags=0):
    """Try to apply the pattern to all of the string, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).fullmatch(string)

def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)

def sub(pattern, repl, string, count=0, flags=0):
    """Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used."""
    return _compile(pattern, flags).sub(repl, string, count)

def subn(pattern, repl, string, count=0, flags=0):
    """Return a 2-tuple containing (new_string, number).
    new_string is the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in the source
    string by the replacement repl.  number is the number of
    substitutions that were made. repl can be either a string or a
    callable; if a string, backslash escapes in it are processed.
    If it is a callable, it's passed the Match object and must
    return a replacement string to be used."""
    return _compile(pattern, flags).subn(repl, string, count)

def split(pattern, string, maxsplit=0, flags=0):
    """Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern is also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list."""
    return _compile(pattern, flags).split(string, maxsplit)

def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def finditer(pattern, string, flags=0):
    """Return an iterator over all non-overlapping matches in the
    string.  For each match, the iterator returns a Match object.

    Empty matches are included in the result."""
    return _compile(pattern, flags).finditer(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a Pattern object."
    return _compile(pattern, flags)

def purge():
    "Clear the regular expression caches"
    _cache.clear()
    _compile_repl.cache_clear()

def template(pattern, flags=0):
    "Compile a template pattern, returning a Pattern object, deprecated"
    import warnings
    warnings.warn("The re.template() function is deprecated "
                  "as it is an undocumented function "
                  "without an obvious purpose. "
                  "Use re.compile() instead.",
                  DeprecationWarning)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)  # warn just once
        return _compile(pattern, flags|T)

# SPECIAL_CHARS
# closing ')', '}' and ']'
# '-' (a range in character set)
# '&', '~', (extended character set operations)
# '#' (comment) and WHITESPACE (ignored) in verbose mode
_special_chars_map = {i: '\\' + chr(i) for i in b'()[]{}?*+-|^$\\.&~# \t\n\r\v\f'}

def escape(pattern):
    """
    Escape special characters in a string.
    """
    if isinstance(pattern, str):
        return pattern.translate(_special_chars_map)
    else:
        pattern = str(pattern, 'latin1')
        return pattern.translate(_special_chars_map).encode('latin1')

Pattern = type(_compiler.compile('', 0))
Match = type(_compiler.compile('', 0).match(''))

# --------------------------------------------------------------------
# internals

_cache = {}  # ordered!

_MAXCACHE = 512
def _compile(pattern, flags):
    # internal: compile pattern
    if isinstance(flags, RegexFlag):
        flags = flags.value
    try:
        return _cache[type(pattern), pattern, flags]
    except KeyError:
        pass
    if isinstance(pattern, Pattern):
        if flags:
            raise ValueError(
                "cannot process flags argument with a compiled pattern")
        return pattern
    if not _compiler.isstring(pattern):
        raise TypeError("first argument must be string or compiled pattern")
    if flags & T:
        import warnings
        warnings.warn("The re.TEMPLATE/re.T flag is deprecated "
                      "as it is an undocumented flag "
                      "without an obvious purpose. "
                      "Don't use it.",
                      DeprecationWarning)
    p = _compiler.compile(pattern, flags)
    if not (flags & DEBUG):
        if len(_cache) >= _MAXCACHE:
            # Drop the oldest item
            try:
                del _cache[next(iter(_cache))]
            except (StopIteration, RuntimeError, KeyError):
                pass
        _cache[type(pattern), pattern, flags] = p
    return p

# purge() calls _compile_repl.cache_clear(), so this must be an lru_cache.
@functools.lru_cache(_MAXCACHE)
def _compile_repl(repl, pattern):
    # internal: compile replacement pattern
    return _parser.parse_template(repl, pattern)

def _expand(pattern, match, template):
    # internal: Match.expand implementation hook
    template = _parser.parse_template(template, pattern)
    return _parser.expand_template(template, match)

def _subx(pattern, template):
    # internal: Pattern.sub/subn implementation helper
    template = _compile_repl(template, pattern)
    if not template[0] and len(template[1]) == 1:
        # literal replacement
        return template[1][0]
    def filter(match, template=template):
        return _parser.expand_template(template, match)
    return filter

# register myself for pickling

import copyreg

def _pickle(p):
    return _compile, (p.pattern, p.flags)

copyreg.pickle(Pattern, _pickle, _compile)

# --------------------------------------------------------------------
# experimental stuff (see python-dev discussions for details)

class Scanner:
    def __init__(self, lexicon, flags=0):
        from ._constants import BRANCH, SUBPATTERN
        if isinstance(flags, RegexFlag):
            flags = flags.value
        self.lexicon = lexicon
        # combine phrases into a compound pattern
        p = []
        s = _parser.State()
        s.flags = flags
        for phrase, action in lexicon:
            gid = s.opengroup()
            p.append(_parser.SubPattern(s, [
                (SUBPATTERN, (gid, 0, 0, _parser.parse(phrase, flags))),
                ]))
            s.closegroup(gid, p[-1])
        p = _parser.SubPattern(s, [(BRANCH, (None, p))])
        self.scanner = _compiler.compile(p)
    def scan(self, string):
        result = []
        append = result.append
        match = self.scanner.scanner(string).match
        i = 0
        while True:
            m = match()
            if not m:
                break
            j = m.end()
            if i == j:
                break
            action = self.lexicon[m.lastindex-1][1]
            if callable(action):
                self.match = m
                action = action(self, m.group())
            if action is not None:
                append(action)
            i = j
        return result, string[i:]