1. Regular expressions
A regular expression is a logical formula for operating on strings: a "rule string", built from a predefined set of special characters and their combinations, that expresses filtering logic over a string.
2. The re module
2.1, Steps for using the re module:
- Use the compile() function to compile the string form of a regular expression into a Pattern object.
- Match the text with the series of methods provided by the Pattern object to obtain the matching result, a Match object.
- Finally, use the properties and methods provided by the Match object to get information and perform other actions as needed.
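The three steps above can be sketched roughly as follows. The sample pattern and text are illustrative assumptions, and the code is written for Python 3.

```python
import re

# Step 1: compile the string form of the regular expression into a Pattern object.
pattern = re.compile(r'\d+')

# Step 2: match text with a Pattern method; search() returns a Match object or None.
m = pattern.search('one12twothree34four')

# Step 3: read information from the Match object.
if m is not None:
    print(m.group())  # the matched substring
    print(m.span())   # (start, end) indices of the match
```

Checking for None before touching the Match object matters because every single-match method returns None when nothing matches.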
2.2, compile()
The compile function compiles a regular expression and returns a Pattern object. It is typically used as follows:
    import re
    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')
Some common methods of Pattern objects include:
- match method: matches at the start position, returns one match
- search method: searches from any position, returns one match
- findall method: matches everything, returns a list
- finditer method: matches everything, returns an iterator
- split method: splits the string, returns a list
- sub method: replaces matches
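As a rough side-by-side sketch of the methods listed above, the following applies each one to a single sample string. The sample text and the variable names are illustrative assumptions, and the code is written for Python 3.

```python
import re

pattern = re.compile(r'\d+')
text = 'one12twothree34four'

# match() only tries at the start position, so it finds nothing here.
at_start = pattern.match(text)

# search() scans forward and stops at the first match.
first = pattern.search(text)

# findall() collects every matched substring into a list.
all_numbers = pattern.findall(text)

# finditer() yields one Match object per match.
spans = [m.span() for m in pattern.finditer(text)]

# split() cuts the string at each match; sub() replaces each match.
pieces = pattern.split(text)
masked = pattern.sub('#', text)

print(at_start)       # None
print(first.group())  # 12
print(all_numbers)    # ['12', '34']
print(spans)          # [(3, 5), (13, 15)]
print(pieces)         # ['one', 'twothree', 'four']
print(masked)         # one#twothree#four
```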
2.2.1, match method:
The match method looks for a match at the head of a string (or at a start position you specify). It performs a single match: it returns as soon as one result is found, rather than finding all matching results.
Its general form is: match(string[, pos[, endpos]])
where string is the string to be matched, and pos and endpos are optional parameters specifying the start and end positions within the string; their default values are 0 and len(string), the string length. So when pos and endpos are not specified, the match method starts at the head of the string.
When the match succeeds, a Match object is returned; if there is no match, None is returned.
Example 1:
    >>> import re
    >>> pattern = re.compile(r'\d+')  # matches at least one digit
    >>> m = pattern.match('one12twothree34four')  # looks at the head: no match
    >>> print(m)
    None
    >>> m = pattern.match('one12twothree34four', 2, 10)  # starts matching at 'e': no match
    >>> print(m)
    None
    >>> m = pattern.match('one12twothree34four', 3, 10)  # starts matching at '1': matches
    >>> print(m)  # a Match object is returned
    <re.Match object; span=(3, 5), match='12'>
    >>> m.group(0)  # the 0 can be omitted
    '12'
    >>> m.start(0)  # the 0 can be omitted
    3
    >>> m.end(0)  # the 0 can be omitted
    5
    >>> m.span(0)  # the 0 can be omitted
    (3, 5)
In the above:
- The group([group1, ...]) method returns the strings matched by one or more groups; to get the entire matched substring, use group() or group(0).
- The start([group]) method returns the start position, within the whole string, of the substring matched by the group (the index of the substring's first character); the parameter defaults to 0.
- The end([group]) method returns the end position, within the whole string, of the substring matched by the group (the index of the substring's last character + 1); the parameter defaults to 0.
- The span([group]) method returns (start(group), end(group)).
Example 2:
    >>> import re
    >>> pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I means ignore case
    >>> m = pattern.match('Hello World Wide Web')
    >>> print(m)  # the match succeeds and a Match object is returned
    <re.Match object; span=(0, 11), match='Hello World'>
    >>> m.group(0)  # the entire matched substring
    'Hello World'
    >>> m.span(0)  # the indices of the entire matched substring
    (0, 11)
    >>> m.group(1)  # the substring matched by the first group
    'Hello'
    >>> m.span(1)  # the indices of the substring matched by the first group
    (0, 5)
    >>> m.group(2)  # the substring matched by the second group
    'World'
    >>> m.span(2)  # the indices of the substring matched by the second group
    (6, 11)
    >>> m.groups()  # equivalent to (m.group(1), m.group(2), ...)
    ('Hello', 'World')
    >>> m.group(3)  # there is no third group
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: no such group
2.2.2, search method
The search method looks for a match at any position in a string. It also performs a single match: it returns as soon as one result is found, rather than finding all matching results.
Its general form is: search(string[, pos[, endpos]])
where string is the string to be matched, and pos and endpos are optional parameters specifying the start and end positions within the string; their default values are 0 and len(string), the string length.
When the match succeeds, a Match object is returned; if there is no match, None is returned.
Example 1:
    >>> import re
    >>> pattern = re.compile(r'\d+')
    >>> m = pattern.search('one12twothree34four')  # the match method would not match here
    >>> m
    <re.Match object; span=(3, 5), match='12'>
    >>> m.group()
    '12'
    >>> m = pattern.search('one12twothree34four', 10)  # specify where the search range starts
    >>> m
    <re.Match object; span=(13, 15), match='34'>
    >>> m.group()
    '34'
    >>> m.span()
    (13, 15)
Example 2:
    import re

    # Compile the regular expression into a Pattern object
    pattern = re.compile(r'\d+')

    # search() looks for a matching substring and returns None if there is none;
    # match() would not succeed here
    m = pattern.search('hello 123456 789')
    if m:
        # Use the Match object to get group information
        print('matching string:', m.group())
        # Start and end positions
        print('position:', m.span())
Execution result:
    matching string: 123456
    position: (6, 12)
2.2.3, findall method
The findall method searches the entire string and collects all matching results.
It is used in the following form: findall(string[, pos[, endpos]])
where string is the string to be matched, and pos and endpos are optional parameters specifying the start and end positions within the string; their default values are 0 and len(string), the string length.
findall returns all matched substrings as a list, and returns an empty list if there is no match.
Example 1:
    import re

    pattern = re.compile(r'\d+')  # find digits
    result1 = pattern.findall('hello 123456 789')
    result2 = pattern.findall('one1two2three3four4', 0, 10)
    print(result1)
    print(result2)
Execution result:
    ['123456', '789']
    ['1', '2']
Example 2:
    # re_test.py
    import re

    # The re module provides compile(), which takes a matching rule
    # and returns a Pattern instance that we then use to match strings
    pattern = re.compile(r'\d+\.\d*')

    # pattern.findall() finds every match in the string
    result = pattern.findall("123.141593, 'bigcat', 232312, 3.15")

    # findall returns all matched substrings as a list
    for item in result:
        print(item)
Run result:
    123.141593
    3.15
2.2.4, finditer method
The finditer method behaves like findall: it also searches the entire string for all matching results. But it returns an iterator that yields each match result (a Match object) in order.
    import re

    pattern = re.compile(r'\d+')
    result_iter1 = pattern.finditer('hello 123456 789')
    result_iter2 = pattern.finditer('one1two2three3four4', 0, 10)

    print(type(result_iter1))
    print(type(result_iter2))

    print('result1...')
    for m1 in result_iter1:  # m1 is a Match object
        print('matching string: {}, position: {}'.format(m1.group(), m1.span()))

    print('result2...')
    for m2 in result_iter2:
        print('matching string: {}, position: {}'.format(m2.group(), m2.span()))
Execution result:
    <class 'callable_iterator'>
    <class 'callable_iterator'>
    result1...
    matching string: 123456, position: (6, 12)
    matching string: 789, position: (13, 16)
    result2...
    matching string: 1, position: (3, 4)
    matching string: 2, position: (7, 8)
2.2.5, split method
The split method splits the string at every substring that matches and returns the resulting list.
It is used in the following form: split(string[, maxsplit])
where maxsplit specifies the maximum number of splits; if it is not specified, the string is split at every match.
    import re

    p = re.compile(r'[\s\,\;]+')
    print(p.split('a,b;; c   d'))
Execution result:
    ['a', 'b', 'c', 'd']
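The effect of maxsplit can be seen in a small sketch like the one below. It assumes a separator pattern equivalent to the one above and is written for Python 3; the sample text is illustrative.

```python
import re

p = re.compile(r'[\s,;]+')
text = 'a,b;; c   d'

# Without maxsplit, the string is cut at every run of separators.
print(p.split(text))              # ['a', 'b', 'c', 'd']

# With maxsplit=1, only the first separator is used; the rest stays intact.
print(p.split(text, maxsplit=1))  # ['a', 'b;; c   d']
```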
2.2.6, sub method
The sub method is used for substitution.
It is used in the following form: sub(repl, string[, count])
where repl can be a string or a function:
- If repl is a string, it replaces each matched substring of string, and the substituted string is returned. repl can also refer to groups by number, as \id or \g<id>, but \0 does not refer to the whole match (write \g<0> for that).
- If repl is a function, it should accept a single argument (the Match object) and return the string used for substitution (group references in the returned string are not expanded).
- count specifies the maximum number of replacements; if it is not specified, all matches are replaced.
    import re

    p = re.compile(r'(\w+) (\w+)')  # \w matches a word character: letter, digit or underscore
    s = 'hello 123, hello 456'

    print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
    print(p.sub(r'\2 \1', s))  # reference the groups

    def func(m):
        return 'hi' + ' ' + m.group(2)

    print(p.sub(func, s))
    print(p.sub(func, s, 1))  # replace at most once
Execution result:
    hello world, hello world
    123 hello, 456 hello
    hi 123, hi 456
    hi 123, hello 456
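The group-reference forms of repl can be sketched as below, reusing the same pattern and sample string. The \g<id> spelling is the re module's documented syntax; the helper name shout is a hypothetical one introduced only for illustration, and the code is written for Python 3.

```python
import re

p = re.compile(r'(\w+) (\w+)')
s = 'hello 123, hello 456'

# \g<2> and \g<1> are the explicit spellings of \2 and \1.
swapped = p.sub(r'\g<2> \g<1>', s)

# \g<0> stands for the whole match; here each match is wrapped in brackets.
wrapped = p.sub(r'[\g<0>]', s)

# A function receives the Match object and returns the replacement text.
def shout(m):  # hypothetical helper, not part of the original example
    return m.group(1).upper() + ' ' + m.group(2)

# count=1 limits the substitution to the first match.
upper_first = p.sub(shout, s, count=1)

print(swapped)      # 123 hello, 456 hello
print(wrapped)      # [hello 123], [hello 456]
print(upper_first)  # HELLO 123, hello 456
```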
2.2.7, Matching Chinese
Suppose we want to extract the Chinese text from the string title = '你好，hello，世界'. We can do this:
    import re

    title = '你好，hello，世界'
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    result = pattern.findall(title)
    print(result)
The regular expression is written with the r prefix, which marks a raw string. In Python 2 the pattern needed the two-letter prefix ur, where u marks a Unicode string; in Python 3 every str is already Unicode, so only r is used.
Execution result:
    ['你好', '世界']
3. Greedy mode and non-greedy mode
- Greedy mode: match as much as possible, provided the entire expression still matches successfully (*);
- Non-greedy mode: match as little as possible, provided the entire expression still matches successfully (?);
- Quantifiers in Python are greedy by default.
Example 1: Source string:
abbbc
The regular expression ab*, which uses a greedy quantifier, matches abbb: * matches as many b as possible, so every b after the a is included. The regular expression ab*?, which uses a non-greedy quantifier, matches a: even though * is present, the ? makes it match as few b as possible, so no b is included.
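Example 1 can be checked with a short sketch like this one, written for Python 3; the variable names are illustrative.

```python
import re

source = 'abbbc'

# Greedy: * takes as many 'b' characters as it can.
greedy = re.search(r'ab*', source).group()

# Non-greedy: *? takes as few 'b' characters as it can, which is none.
lazy = re.search(r'ab*?', source).group()

print(greedy)  # abbb
print(lazy)    # a
```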
Example 2: Source string:
aa<div>test1</div>bb<div>test2</div>cc
Regular expression using a greedy quantifier: <div>.*</div>
Match result: <div>test1</div>bb<div>test2</div>
The greedy mode is used here. The entire expression could already match successfully at the first "</div>", but because the quantifier is greedy, the engine keeps trying further to the right to see whether a longer substring can be matched. After it reaches the second "</div>", nothing further to the right can be matched, so the match ends and the result is "<div>test1</div>bb<div>test2</div>".
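For contrast, here is a small sketch that runs the greedy pattern next to a non-greedy counterpart. The non-greedy pattern <div>.*?</div> is not part of the example above; it is added here as an assumption about what the comparison would look like. The code is written for Python 3.

```python
import re

source = 'aa<div>test1</div>bb<div>test2</div>cc'

# Greedy: .* runs on to the last "</div>" in the string.
greedy = re.findall(r'<div>.*</div>', source)

# Non-greedy: .*? stops at the nearest "</div>", so each block matches separately.
lazy = re.findall(r'<div>.*?</div>', source)

print(greedy)  # ['<div>test1</div>bb<div>test2</div>']
print(lazy)    # ['<div>test1</div>', '<div>test2</div>']
```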
4. Regular expression test URL
Python crawler development "Part 1" "Regular expressions"