Advanced usage of the re module: search
re.search(pattern, string[, flags])
If the pattern matches anywhere in the string, a match object is returned; otherwise None is returned. Note that if the pattern matches at several places in the string, only the first match is returned.
re.search() scans the string and extracts the first match; the matched content is read through the match object's group() method:
group(0): the entire match, returned as a string;
group(1): the text captured by the first pair of parentheses, without the rest of the match;
group(2): the text captured by the second pair of parentheses;
group(0, 1, 2): a tuple containing the three strings above.
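As a minimal sketch of the group() variants described above (the sample text and variable names are made up for illustration and are not part of the original examples):

```python
import re

# Hypothetical sample text: two parenthesized groups capture a name and a count.
text = "python = 9999"
m = re.search(r"(\w+) = (\d+)", text)

whole = m.group(0)            # the entire match
first = m.group(1)            # text captured by the first group
second = m.group(2)           # text captured by the second group
combined = m.group(0, 1, 2)   # tuple of the three strings above

print(whole, first, second, combined)

# search returns None when nothing matches, so check before calling group().
missing = re.search(r"\d+", "no digits here")
print(missing is None)
```

Because search can return None, calling group() directly on the result is only safe when a match is guaranteed.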
Requirement: extract the number of times an article has been read
# coding=utf-8
>>> import re
>>> # "阅读次数为" means "the read count is"
>>> ret = re.search(r"\d+", "阅读次数为 9999")
>>> ret.group()
'9999'
Two cases for search
With groups:
>>> origin = "hello alex alix bcd dsfa lefg abc 199"
>>> ret = re.search(r"a(\w+)", origin)
>>> ret
<_sre.SRE_Match object; span=(6, 10), match='alex'>
>>> ret.group()
'alex'
>>> ret.groups()
('lex',)
>>> ret = re.search(r"(?P<key1>a)(?P<key2>(\w+))", origin)
>>> ret.groupdict()
{'key1': 'a', 'key2': 'lex'}
Without groups:
>>> origin = "hello alex alix bcd dsfa lefg abc 199"
>>> ret = re.search(r"ali\w+", origin)
>>> ret.group()
'alix'
>>> ret.groups()
()
>>> ret.groupdict()
{}
findall
re.findall(pattern, string[, flags])
Returns all non-overlapping substrings of string that match pattern, as a list.
- group(): the substring matched by the whole pattern;
- group(0): the same result as group();
- groups(): a tuple of all captured groups. group(1) is the substring matched by the first group in the pattern, group(2) the second, and so on; an index beyond the number of groups raises IndexError;
- findall(): returns a list with one element per match. When the pattern contains several groups, each element is the tuple of groups for that match, so the result is a list of tuples. When the pattern contains only one group, the result is a list of substrings, not a list of tuples.
Requirement: collect the read counts of the Python, C, and C++ articles
# coding=utf-8
>>> ret = re.findall(r"\d+", "python = 9999, c = 7890, c++ = 12345")
>>> ret
['9999', '7890', '12345']
>>> re.findall("", 'abc')
['', '', '', '']
# If the pattern can match the empty string, one extra empty match is found at the end of the string.
1. When the pattern contains several pairs of parentheses, each element of the list is a tuple of strings, one string per group, in the order in which the groups appear in the pattern.
2. When the pattern contains exactly one pair of parentheses, each element of the list is a string holding what that group matched (not the match for the entire regular expression).
3. When the pattern contains no parentheses, each element of the list is the string matched by the entire regular expression.
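The three cases above can be sketched as follows. The sample text and variable names are assumptions chosen for illustration:

```python
import re

text = "python = 9999, c = 7890"

# Several groups: each element is a tuple, one string per group.
pairs = re.findall(r"(\w+) = (\d+)", text)

# One group: each element is the string captured by that group.
counts = re.findall(r"\w+ = (\d+)", text)

# No groups: each element is the whole match.
whole = re.findall(r"\w+ = \d+", text)

print(pairs)
print(counts)
print(whole)
```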
sub: replace the matched data
re.sub(pattern, repl, string, count=0, flags=0)
- pattern: the regular expression pattern string.
- repl: the replacement. It can be a string or a function.
  When repl is a string, backslash escapes in it are processed: \n becomes a newline, \r becomes a carriage return, and \g<name> (backslash, g, and a name in angle brackets) refers to the substring matched by the group named name.
  When repl is a function, it is called with the match object for each match, and its return value is inserted into the text, as in Method 2.
- string: the string to be processed, in which the replacements are made.
- count: the maximum number of matches to replace; it must be a non-negative integer. The default value 0 replaces all matches.
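A short sketch of two of the parameters described above, the \g<name> reference in a string repl and the count argument. The group names lang and count and the sample text are hypothetical:

```python
import re

text = "python = 997, c = 42"

# \g<name> in the replacement string refers to a named group of the pattern.
swapped = re.sub(
    r"(?P<lang>\w+) = (?P<count>\d+)",
    r"\g<count> reads for \g<lang>",
    text,
)

# count=1 replaces only the first match; the default count=0 replaces all.
first_only = re.sub(r"\d+", "0", text, count=1)
all_matches = re.sub(r"\d+", "0", text)

print(swapped)
print(first_only)
print(all_matches)
```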
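Requirement: add 1 to the matched read count
Method 1:
# coding=utf-8
>>> ret = re.sub(r"\d+", '998', "python = 997")
>>> ret
'python = 998'
Method 2:
# coding=utf-8
>>> def add(temp):
...     str_num = temp.group()      # the matched digits, as a string
...     num = int(str_num) + 1      # convert to int and add 1
...     return str(num)             # repl functions must return a string
...
>>> ret = re.sub(r"\d+", add, "python = 997")
>>> ret
'python = 998'
>>> ret = re.sub(r"\d+", add, "python = 99")
>>> ret
'python = 100'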
split: split the string at each match and return a list
re.split(pattern, string, maxsplit=0, flags=0)
Splits string at every occurrence of pattern. If maxsplit is nonzero, at most maxsplit splits are made and the remainder of the string is returned as the last element of the list.
- pattern: the regular expression that acts as the delimiter. Unlike str.split, which without arguments splits on runs of whitespace (spaces, line breaks (\n), tabs (\t), and so on), re.split always requires a pattern.
- maxsplit: the maximum number of splits.
Requirement: split the string "info:xiaoZhang 33 shandong"
# coding=utf-8
>>> ret = re.split(r":| ", "info:xiaoZhang 33 shandong")
>>> ret
['info', 'xiaoZhang', '33', 'shandong']
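To illustrate limiting the number of splits, here is a small sketch that uses the maxsplit parameter of re.split on the same input (the variable names are chosen for illustration):

```python
import re

text = "info:xiaoZhang 33 shandong"

# Split on either a colon or a space, with no limit on the number of splits.
parts = re.split(r":| ", text)

# maxsplit=1 performs at most one split; the rest of the string stays whole.
limited = re.split(r":| ", text, maxsplit=1)

print(parts)
print(limited)
```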
Python greedy and non-greedy
Quantifiers in Python are greedy by default (in a few languages the default may be non-greedy): they always try to match as many characters as possible.
Non-greedy matching is the opposite: it always tries to match as few characters as possible.
Appending "?" to "*", "?", "+", or "{m,n}" turns the greedy quantifier into a non-greedy one.
>>> s = "This is a number 234-235-22-423"
>>> r = re.match(r".+(\d+-\d+-\d+-\d+)", s)
>>> r.group(1)
'4-235-22-423'
>>> r = re.match(r".+?(\d+-\d+-\d+-\d+)", s)
>>> r.group(1)
'234-235-22-423'
When a pattern uses wildcards, it is evaluated from left to right and tries to grab the longest possible string. In the example above, ".+" consumes as many characters as it can from the beginning of the string, including most of the first integer field we want. "\d+" needs only one character to match, so it is left with the digit "4", and ".+" matches everything from the beginning of the string up to the character just before that 4 ("This is a number 23").
Workaround: use the non-greedy operator "?". It can be placed after "*", "+", or "?", and asks the regular expression to match as few characters as possible.
>>> re.match(r"aa(\d+)", "aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)", "aa2343ddd").group(1)
'2'
>>> re.match(r"aa(\d+)ddd", "aa2343ddd").group(1)
'2343'
>>> re.match(r"aa(\d+?)ddd", "aa2343ddd").group(1)
'2343'
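The same contrast can be sketched for "*" and "{m,n}". The sample strings below are hypothetical and only meant to show how the trailing "?" changes the result:

```python
import re

html = "<b>bold</b>"

# Greedy ".*" runs to the last ">", non-greedy ".*?" stops at the first one.
greedy = re.match(r"<.*>", html).group()
lazy = re.match(r"<.*?>", html).group()

digits = "12345"

# Greedy "{2,4}" takes up to four digits, non-greedy "{2,4}?" takes only two.
greedy_range = re.match(r"\d{2,4}", digits).group()
lazy_range = re.match(r"\d{2,4}?", digits).group()

print(greedy, lazy, greedy_range, lazy_range)
```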