Python Regular Expressions
By Han Xiaoyang ([email protected])
Regular expressions are a powerful tool for working with strings, with their own syntax and a separate processing engine.
When searching a large text for a string, the methods that come with str (such as find, or the in operator) are sometimes all we need. Other cases are more complicated, for example finding every "mailbox-like" string or every sentence related to julyedu. For those we need a tool that describes patterns, and this is where regular expressions come in handy.
Regular expressions may not be as efficient as str's own methods, but their matching ability is far more powerful. They are also not unique to Python: if you have already used regular expressions in another language, a quick look through the notes here will be enough.
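To make the contrast concrete, here is a small sketch. The sample text, the addresses in it, and the simplified "mailbox-like" pattern are all made up for illustration; the pattern is nowhere near a complete email validator.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Contact alice@example.com or bob@example.org; visit julyedu for courses."

# For a fixed substring, the built-in str tools are enough.
has_julyedu = "julyedu" in text
julyedu_index = text.find("julyedu")

# For "anything shaped like a mailbox" we need a pattern.
# This is a deliberately simplified pattern (an assumption, not a validator).
email_like = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
emails = email_like.findall(text)

print(has_julyedu)
print(emails)  # ['alice@example.com', 'bob@example.org']
```

The str methods answer "is this exact text here?", while the compiled pattern answers "is anything of this shape here?".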
1. Syntax
Enough talk, let's go straight to the practical material.
The figure here is one that many readers will recognize, commonly known as the Python regular expression cheat sheet. Writing regular expressions with it at hand is like taking an open-book exam: obviously much easier.
Whenever you want to match one, several, or any number of digits, letters, non-digits, non-letters, specific characters, or arbitrary characters, want a greedy or a non-greedy match, or want to capture the first match or all matches, remember that this small manual is there for reference.
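A few of those cheat-sheet entries, sketched on a made-up string (the sample text and variable names are only for illustration):

```python
import re

text = "<b>7 apples</b> and <b>12 pears</b>"

# \d+ matches one or more digits; \D+ matches runs of non-digits.
digits = re.findall(r"\d+", text)            # ['7', '12']
non_digits = re.findall(r"\D+", "a1bc22d")   # ['a', 'bc', 'd']

# Greedy .* takes as much as it can; non-greedy .*? takes as little as it can.
greedy = re.search(r"<b>.*</b>", text).group()
lazy = re.search(r"<b>.*?</b>", text).group()  # '<b>7 apples</b>'

# search() gives the first capture, findall() gives every capture.
first_item = re.search(r"<b>(.*?)</b>", text).group(1)  # '7 apples'
all_items = re.findall(r"<b>(.*?)</b>", text)           # ['7 apples', '12 pears']
```

Here the greedy match swallows the whole string, because the last `</b>` is as far as `.*` can stretch.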
2. Verification Tools
One of our favorite regular expression online validation tools is http://regexr.com/
Those who have used it know: try it once and you are hooked.
3. Challenges and improvements
Anyone who has worked in natural language processing for a long time is very familiar with regular expressions. After one half-year stretch of writing them in large numbers, colleagues joked that any string following some rule or pattern could surely be matched within minutes.
For readers who want to practice regular expressions, pick up more advanced skills quickly, or take on more complex patterns, please see the step-by-step regular expression exercises.
So go ahead and enjoy yourselves.
4. Python example: the re module
Python provides support for regular expressions through the re module.
The general steps for using re are:
- 1. Compile the string form of a regular expression into a Pattern instance.
- 2. Use the Pattern instance to process the text and get a match result (a Match instance).
- 3. Use the Match instance to get information and do other things.
In [13]:
# encoding: utf-8
import re
# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello.*\!')
# Match the text against the pattern; the result is None when nothing matches
match = pattern.match('hello, hanxiaoyang! How are you?')
if match:
    # Use the Match object to get the matched text
    print(match.group())

hello, hanxiaoyang!
re.compile(strPattern[, flag]):
This is the factory method of the Pattern class; it compiles a regular expression given as a string into a Pattern object.
The second parameter, flag, is the matching mode. Several values can be combined with the bitwise OR operator '|' so that they take effect together, for example re.I | re.M.
You can also specify the mode inside the regex string itself: re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').
The available flag values are:
- re.I (re.IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
- re.M (re.MULTILINE): multiline mode, which changes the behavior of '^' and '$' (see the cheat sheet)
- re.S (re.DOTALL): dot-matches-all mode, which changes the behavior of '.'
- re.L (re.LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale setting (in Python 3 this flag is only allowed with bytes patterns)
- re.U (re.UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (in Python 3 this is already the default for str patterns)
- re.X (re.VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added. The following two regular expressions are equivalent:
In [ ]:
import re
a = re.compile(r"""\d +  # the integer part
                   \.    # the decimal point
                   \d *  # the fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
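The flag rules above can be checked with a short sketch. The sample text and variable names are invented for illustration.

```python
import re

text = "Hello\nhello world"

# Flags combined with the bitwise OR operator ...
by_flags = re.compile(r"^hello", re.I | re.M)
# ... and the same modes written inline at the start of the pattern.
inline = re.compile(r"(?im)^hello")

matches_flags = by_flags.findall(text)    # ['Hello', 'hello']
matches_inline = inline.findall(text)

# Without re.S the dot does not match a newline; with re.S it does.
dot_default = re.search(r"Hello.hello", text)
dot_all = re.search(r"Hello.hello", text, re.S)

# The verbose and the compact spelling accept the same strings.
verbose = re.compile(r"""\d +  # integer part
                         \.    # decimal point
                         \d *  # fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")
```

With re.M, '^' matches at the start of every line, which is why both "Hello" and "hello" are found.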
Match
A Match object is the result of one match. It holds a lot of information about that match, which can be read through the read-only attributes and methods that Match provides.
Match attributes:
- string: The text used for matching.
- re: The Pattern object used for matching.
- pos: The index in the text at which the regular expression starts searching. The value is the same as the parameter of the same name of the Pattern.match() and Pattern.search() methods.
- endpos: The index in the text at which the regular expression stops searching. The value is the same as the parameter of the same name of the Pattern.match() and Pattern.search() methods.
- lastindex: The number of the last captured group. If no group was captured, it is None.
- lastgroup: The name of the last captured group. If that group has no name, or no group was captured, it is None.
Methods:
- group([group1, ...]):
Returns the string captured by one or more groups; when several arguments are given, the result is a tuple. group1 can be a number or a name; number 0 stands for the whole matched substring, and group() with no arguments returns group(0). A group that captured nothing returns None; a group that captured several times returns the last substring it captured.
- groups([default]):
Returns the strings captured by all groups as a tuple, equivalent to calling group(1, 2, ..., last). default is the value used in place of any group that captured nothing; it defaults to None.
- groupdict([default]):
Returns a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; groups without a name are not included. default has the same meaning as above.
- start([group]):
Returns the start index in string of the substring captured by the given group (the index of the first character of the substring). group defaults to 0.
- end([group]):
Returns the end index in string of the substring captured by the given group (the index of the last character of the substring + 1). group defaults to 0.
- span([group]):
Returns (start(group), end(group)).
- expand(template):
Substitutes the matched groups into template and returns the result. template can reference groups with \id or with \g<id> and \g<name>. \0 is not a group reference; write \g<0> for the whole match. \id and \g<id> are equivalent, but \10 is read as the 10th group, so to express \1 followed by the character '0' you must write \g<1>0.
In [14]:
import re
m = re.match(r'(\w+) (\w+)(?P<sign>.*)', 'hello hanxiaoyang!')
print("m.string:", m.string)
print("m.re:", m.re)
print("m.pos:", m.pos)
print("m.endpos:", m.endpos)
print("m.lastindex:", m.lastindex)
print("m.lastgroup:", m.lastgroup)
print("m.group(1,2):", m.group(1, 2))
print("m.groups():", m.groups())
print("m.groupdict():", m.groupdict())
print("m.start(2):", m.start(2))
print("m.end(2):", m.end(2))
print("m.span(2):", m.span(2))
print(r"m.expand(r'\2 \1\3'):", m.expand(r'\2 \1\3'))

m.string: hello hanxiaoyang!
m.re: re.compile('(\\w+) (\\w+)(?P<sign>.*)')
m.pos: 0
m.endpos: 18
m.lastindex: 3
m.lastgroup: sign
m.group(1,2): ('hello', 'hanxiaoyang')
m.groups(): ('hello', 'hanxiaoyang', '!')
m.groupdict(): {'sign': '!'}
m.start(2): 6
m.end(2): 17
m.span(2): (6, 17)
m.expand(r'\2 \1\3'): hanxiaoyang hello!
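Some of the corner cases described for the Match methods (groups that capture nothing, the default argument, \g<1>0, repeated captures) are easier to see in a small sketch. The pattern and strings below are invented for illustration.

```python
import re

# The optional named group (?P<unit>kg) takes no part in this match.
m = re.match(r"(\d+)(?P<unit>kg)?", "42 apples")

whole = m.group()            # same as m.group(0): '42'
unit = m.group("unit")       # None, because the group captured nothing
filled = m.groups("n/a")     # ('42', 'n/a')
named = m.groupdict("n/a")   # {'unit': 'n/a'}
span_1 = m.span(1)           # (0, 2)

# \g<1>0 means "group 1 followed by a literal 0"; r"\10" would mean group 10.
times_ten = m.expand(r"\g<1>0")   # '420'

# A group that captures several times keeps only its last capture.
last_only = re.match(r"(?:(\w)-)+", "a-b-c-").group(1)   # 'c'
```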
Pattern
A Pattern object is a compiled regular expression; the methods it provides are used to match and search text.
Pattern cannot be instantiated directly and must be constructed with re.compile().
Pattern provides several read-only attributes for getting information about the expression:
- pattern: The expression string used at compile time.
- flags: The matching mode used at compile time, as a number.
- groups: The number of groups in the expression.
- groupindex: A dictionary whose keys are the names of the named groups in the expression and whose values are the numbers of those groups; groups without a name are not included.
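In [15]:
import re
p = re.compile(r'(\w+) (\w+)(?P<sign>.*)', re.DOTALL)
print("p.pattern:", p.pattern)
# On Python 3, str patterns always carry re.UNICODE (32), so re.DOTALL (16) shows up as 48
print("p.flags:", p.flags)
print("p.groups:", p.groups)
# groupindex is a read-only mapping on Python 3; dict() gives the plain dictionary
print("p.groupindex:", dict(p.groupindex))

p.pattern: (\w+) (\w+)(?P<sign>.*)
p.flags: 48
p.groups: 3
p.groupindex: {'sign': 3}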
Using Pattern
- match(string[, pos[, endpos]]) | re.match(pattern, string[, flags]):
This method tries to match pattern starting at index pos of string:
- It returns a Match object if the end of pattern is reached and everything matched.
- It returns None if pattern fails to match along the way, or if endpos is reached before pattern has been matched completely.
- The default values of pos and endpos are 0 and len(string) respectively.
Note: This method does not require a full match. If string still has characters left over when pattern has been matched completely, the match still counts as successful. If you want a full match, add the boundary '$' at the end of the expression.
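A minimal sketch of the match() rules above, using invented sample strings. It shows the effect of pos and endpos, and how '$' rejects leftover characters.

```python
import re

p = re.compile(r"\d+")

prefix_ok = p.match("123abc")        # matches '123'; the trailing 'abc' is ignored
not_at_start = p.match("abc123")     # None: match() only tries at index pos
from_pos = p.match("abc123", 3)      # starts at index 3 and matches '123'
cut_short = p.match("abc123", 3, 5)  # endpos=5 limits the text to 'abc12'

# endpos is reached before a three-digit pattern can complete, so this is None.
too_short = re.compile(r"\d{3}").match("abc123", 3, 5)

# Appending '$' rejects leftover characters after the match.
strict = re.compile(r"\d+$")
strict_fail = strict.match("123abc")   # None
strict_ok = strict.match("123")        # matches '123'
```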
- search(string[, pos[, endpos]]) | re.search(pattern, string[, flags]):
This method looks for a substring of string that matches, starting at index pos:
- It returns a Match object if the end of pattern is reached and everything matched.
- If the match fails at the current position, pos is increased by 1 and the match is tried again; if nothing has matched by the time pos = endpos, it returns None.
- The default values of pos and endpos are 0 and len(string) respectively.
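In [18]:
import re
pattern = re.compile(r'H.*g')
# search() finds the substring wherever it starts; match() would fail here
match = pattern.search('hello Hanxiaoyang!')
if match:
    print(match.group())

Hanxiaoyang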
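The difference between match() and search(), and the effect of pos and endpos on search(), can be sketched as follows (the sample string is made up):

```python
import re

p = re.compile(r"\d+")
text = "abc123def45"

at_start = p.match(text)          # None: index 0 is not a digit
found = p.search(text)            # retries at later positions, finds '123'
later = p.search(text, 6)         # starts looking at index 6, finds '45'
none_found = p.search(text, 0, 3) # only 'abc' is examined, so None
```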
- split(string[, maxsplit]) | re.split(pattern, string[, maxsplit]):
- Splits string at every substring that matches and returns the pieces as a list.
- maxsplit specifies the maximum number of splits; if it is not given, the string is split at every match.
In [19]:
import re
p = re.compile(r'\d+')
print(p.split('one1two2three3four4'))

['one', 'two', 'three', 'four', '']
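As a small illustration of maxsplit, reusing the string from the cell above (the variable names are only for this sketch):

```python
import re

p = re.compile(r"\d+")
s = "one1two2three3four4"

# Without maxsplit every match is a split point; the trailing '' appears
# because the string ends with a match.
all_parts = p.split(s)               # ['one', 'two', 'three', 'four', '']

# With maxsplit=2 only the first two matches split; the rest stays intact.
two_splits = p.split(s, maxsplit=2)  # ['one', 'two', 'three3four4']
```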
- findall(string[, pos[, endpos]]) | re.findall(pattern, string[, flags]):
- Searches string and returns all matching substrings as a list.
In [21]:
import re
p = re.compile(r'\d+')
print(p.findall('one1two2three3four4'))

['1', '2', '3', '4']
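Two details of findall() that are easy to miss, sketched on the same string. The second pattern, with two groups, is an assumption added for illustration.

```python
import re

p = re.compile(r"\d+")
s = "one1two2three3four4"

# pos and endpos restrict the part of the string that is searched.
from_four = p.findall(s, 4)      # ['2', '3', '4']
first_eight = p.findall(s, 0, 8) # ['1', '2']

# When the pattern contains several groups, findall() returns a tuple of
# the captured groups per match instead of the whole matched substring.
pairs = re.findall(r"([a-z]+)(\d)", s)
```

Here pairs is [('one', '1'), ('two', '2'), ('three', '3'), ('four', '4')].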
- finditer(string[, pos[, endpos]]) | re.finditer(pattern, string[, flags]):
- Searches string and returns an iterator that yields each match result (a Match object) in order.
In [23]:
import re
p = re.compile(r'\d+')
for m in p.finditer('one1two2three3four4'):
    print(m.group(), end=' ')

1 2 3 4
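Because finditer() yields Match objects rather than plain strings, position information is available for every match. A brief sketch (variable names are illustrative):

```python
import re

p = re.compile(r"\d+")
s = "one1two2three3four4"

# Collect the matched text together with its (start, end) span.
spans = [(m.group(), m.span()) for m in p.finditer(s)]

# The result is a lazy iterator: matches are produced one at a time.
it = p.finditer(s)
first = next(it).group()   # '1'
second = next(it).group()  # '2'
```

Here spans is [('1', (3, 4)), ('2', (7, 8)), ('3', (13, 14)), ('4', (18, 19))].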
- sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
- Replaces every matched substring in string with repl and returns the resulting string.
- When repl is a string, it can reference groups with \id or with \g<id> and \g<name>. \0 is not a group reference; write \g<0> for the whole match.
- When repl is a function, it should accept a single argument (the Match object) and return the string to substitute (group references inside the returned string are not expanded). count specifies the maximum number of replacements; if it is not given, every match is replaced.
In [26]:
import re
p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello hanxiaoyang!'
print(p.sub(r'\2 \1', s))
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()
print(p.sub(func, s))

say i, hanxiaoyang hello!
I Say, Hello Hanxiaoyang!
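A further sketch of the sub() options described above: named group references, count, \g<1>0, and a function as repl. The group names and the shout helper are hypothetical, chosen only for this example.

```python
import re

p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "i say, hello hanxiaoyang!"

# Named references, with count=1 limiting the work to the first match.
swapped_once = p.sub(r"\g<second> \g<first>", s, count=1)

# \g<1>0 is "group 1 followed by the character 0".
appended = re.sub(r"(\d)", r"\g<1>0", "a1b2")   # 'a10b20'

# A function receives the Match object and returns the replacement text.
def shout(m):
    return m.group().upper()

shouted = p.sub(shout, s)
```

With count=1 only "i say" is swapped, giving 'say i, hello hanxiaoyang!'.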
- subn(repl, string[, count]) | re.subn(pattern, repl, string[, count]):
- Returns (sub(repl, string[, count]), number of replacements).
In [28]:
import re
p = re.compile(r'(\w+) (\w+)')
s = 'i say, hello hanxiaoyang!'
print(p.subn(r'\2 \1', s))
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()
print(p.subn(func, s))

('say i, hanxiaoyang hello!', 2)
('I Say, Hello Hanxiaoyang!', 2)
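The count returned by subn() is handy when you need to know whether anything changed. A minimal sketch with invented inputs:

```python
import re

# Collapse runs of whitespace and learn how many runs were replaced.
new_text, n = re.subn(r"\s+", " ", "a  b \t c")

# When nothing matches, the string comes back unchanged with a count of 0.
unchanged, zero = re.subn(r"\d", "#", "abc")
```

Here new_text is 'a b c' with n equal to 2, and the second call returns ('abc', 0).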