What is a regular expression
A regular expression is a logical formula for operating on strings: a set of predefined special characters, and combinations of them, make up a "pattern string", and this pattern string expresses a filtering rule that is applied to other strings.
Regular expressions are not unique to Python; other languages support them as well.
In Python, regular expressions are provided by the re module.
Common matching patterns in Python regular expressions
\w        matches letters, digits, and the underscore
\W        matches any character that is not a letter, digit, or underscore
\s        matches any whitespace character, equivalent to [ \t\n\r\f\v]
\S        matches any non-whitespace character
\d        matches any digit
\D        matches any non-digit
\A        matches the start of the string
\Z        matches the end of the string (in Python's re, only at the very end, even if the string ends with a line break)
\z        matches the end of the string (Perl-style notation; Python's re has traditionally used \Z for this)
\G        matches the position where the previous match finished (found in other regex flavors; Python's re module does not support it)
\n        matches a newline character
\t        matches a tab character
^         matches the start of the string
$         matches the end of the string, or just before a line break at the end of the string
.         matches any character except a line break; when the re.DOTALL flag is specified, it also matches line breaks
[...]     represents a set of characters listed individually: [amk] matches a, m, or k
[^...]    matches characters not in the set: [^abc] matches any character other than a, b, or c
*         matches 0 or more of the preceding expression
+         matches 1 or more of the preceding expression
?         matches 0 or 1 of the preceding expression; placed after another quantifier, it makes that quantifier non-greedy
{n}       matches exactly n of the preceding expression
{n,m}     matches n to m repetitions of the preceding expression, greedily
a|b       matches a or b
( )       matches the expression inside the parentheses and also defines a group
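As a quick check of a few entries in the list above, here is a minimal sketch using re.findall. The sample strings are made up for illustration and are not from the original examples.

```python
import re

# Hypothetical sample text, chosen only to exercise the character classes.
sample = "user_42 paid 15 USD\ton 2024-01-05"

words = re.findall(r'\w+', sample)          # runs of letters, digits, underscore
digits = re.findall(r'\d+', sample)         # runs of digits
non_space = re.findall(r'\S+', sample)      # runs of non-whitespace characters
in_set = re.findall(r'[amk]', "mask")       # characters from the set a, m, k
not_in_set = re.findall(r'[^abc]', "abcd")  # characters outside the set a, b, c

print(words)
print(digits)
print(non_space)
print(in_set)
print(not_in_set)
```

Printing the lists side by side makes it easy to see where \w+ splits on the hyphens and the tab while \S+ only splits on whitespace.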
re.match()
re.match() tries to match the pattern at the beginning of the string. If the pattern does not match starting at the first character, match() returns None.
Syntax:
re.match(pattern, string, flags=0)
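The behaviour described above can be sketched as follows. The sample text is an assumption for illustration; re.I (ignore case) is used only as one example of a value for the flags parameter.

```python
import re

text = "Hello World"

# The pattern occurs in the string, but not at position 0, so match() fails.
not_at_start = re.match(r'World', text)
print(not_at_start)

# At position 0 the same kind of call succeeds and returns a match object.
at_start = re.match(r'Hello', text)
print(at_start)

# flags changes how the pattern is interpreted; re.I ignores case.
ignore_case = re.match(r'hello', text, flags=re.I)
print(ignore_case.group())
```

Because match() can return None, code that calls .group() on the result usually checks the result first.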
The most basic match
    import re

    content = "Hello 123 4567 world_this is a regex Demo"
    # Spell out every part of the string: digits, whitespace, word characters.
    result = re.match(r'^Hello\s\d\d\d\s\d{4}\s\w{10}.*Demo$', content)
    print(result)
    print(result.group())
    print(result.span())
In the output, result.group() returns the text that was matched, and result.span() returns the (start, end) index range of the match within the string.
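A minimal sketch of how group() and span() relate to each other, assuming a shortened pattern on the same kind of sample string: slicing the original string with the span gives back the matched text.

```python
import re

content = "Hello 123 4567 world_this is a regex Demo"
result = re.match(r'^Hello\s\d{3}', content)

start, end = result.span()
matched = result.group()

print(matched)                         # the matched text
print(start, end)                      # index range of the match
print(content[start:end] == matched)   # the span slices out the same text
```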
Generic match
The pattern above is tedious to write. The same match can be expressed with a much simpler pattern:
    import re

    content = "Hello 123 4567 world_this is a regex Demo"
    # .* stands in for everything between the first and the last word.
    result = re.match(r'^Hello.*Demo$', content)
    print(result)
    print(result.group())
    print(result.span())
This code gives the same result as the previous match, but the pattern is much easier to write.
Matching a target
To extract a specific part of the string, enclose the corresponding part of the pattern in parentheses (), as in the following example:
    import re

    content = "Hello 1234567 world_this is a regex Demo"
    # The parentheses capture the run of digits as group 1.
    result = re.match(r'^Hello\s(\d+)\sworld.*Demo$', content)
    print(result)
    print(result.group())
    print(result.group(1))
    print(result.span())
In the results, result.group() returns the whole match, while result.group(1) returns the text matched by the first pair of parentheses in the pattern, here 1234567.
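With more than one pair of parentheses the groups are numbered from left to right. The sketch below assumes a second group (\w+) added to the pattern purely for illustration.

```python
import re

content = "Hello 1234567 world_this is a regex Demo"
result = re.match(r'^Hello\s(\d+)\s(\w+)', content)

print(result.group(0))   # whole match, same as result.group()
print(result.group(1))   # first parenthesized group
print(result.group(2))   # second parenthesized group
print(result.groups())   # all groups as a tuple
```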
Greedy match
Let's look at the following code:
    import re

    content = "Hello 1234567 world_this is a regex Demo"
    result = re.match(r'^Hello.*(\d+).*Demo', content)
    print(result)
    print(result.group(1))
The result shows that group(1) captured only 7, not 1234567. The cause is the .* in front of the group: .* matches as much content as it can, which is called greedy matching, so it consumes 123456 and leaves \d+ only the single digit it requires.
To capture 1234567, change the regular expression to:
    result = re.match(r'^He.*?(\d+).*Demo', content)
With the non-greedy .*?, group(1) now matches 1234567.
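The two patterns can be compared side by side. This is a small sketch on the same sample string; the variable names greedy and lazy are chosen here for illustration.

```python
import re

content = "Hello 1234567 world_this is a regex Demo"

greedy = re.match(r'^He.*(\d+).*Demo', content)
lazy = re.match(r'^He.*?(\d+).*Demo', content)

print(greedy.group(1))   # .* takes as much as it can, leaving one digit
print(lazy.group(1))     # .*? stops as early as it can
```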
Matching mode
Often the content to be matched contains line breaks. In that case, pass the matching mode re.S so that . also matches line breaks.
    import re

    content = """Hello 123456 world_this
    my name is zhaofan"""
    # re.S lets . match the line break inside content.
    result = re.match(r'^He.*?(\d+).*?zhaofan$', content, re.S)
    print(result)
    print(result.group())
    print(result.group(1))
With re.S the match succeeds across the line break, and result.group(1) is 123456.
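To see what the flag changes, the sketch below runs one pattern with and without re.S on a two-line string. The variable names are assumptions for illustration.

```python
import re

content = """Hello 123456 world_this
my name is zhaofan"""

pattern = r'^He.*?(\d+).*?zhaofan$'

without_flag = re.match(pattern, content)        # . stops at the line break
with_flag = re.match(pattern, content, re.S)     # . also matches the line break

print(without_flag)
print(with_flag.group(1))
print(re.S == re.DOTALL)   # re.S is the short name for re.DOTALL
```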
Escape
When the content to match contains special characters, escape them with the backslash \, as in the following example:
    import re

    content = "price is $5.00"
    # \$ and \. match a literal dollar sign and a literal dot.
    result = re.match(r'price is \$5\.00', content)
    print(result)
    print(result.group())
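A short sketch of why the escaping matters, and of re.escape(), the standard-library helper that escapes a literal string for you. The variable names are chosen here for illustration.

```python
import re

content = "price is $5.00"

# Unescaped, $ is an end-of-string anchor and . matches any character.
unescaped = re.match(r'price is $5.00', content)
print(unescaped)

# Escaped by hand.
by_hand = re.match(r'price is \$5\.00', content)
print(by_hand.group())

# re.escape() builds the escaped pattern from a literal string.
literal = "$5.00"
escaped = re.escape(literal)
print(escaped)
print(re.search(escaped, content).group())
```

re.escape() is mainly useful when the literal text comes from somewhere else, such as user input, and cannot be escaped by hand in the source.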
Summary of the above:
Use generic patterns where possible, use parentheses to capture the target, prefer non-greedy matching, and use re.S when the text contains line breaks.
Keep in mind that re.match only matches a pattern at the starting position of the string.
re.search
re.search scans the entire string and returns the first successful match.
    import re

    content = "extra things Hello 123455 world_this is a Re extra things"
    # No ^ or $ needed: search looks for the pattern anywhere in the string.
    result = re.search(r'Hello.*?(\d+).*?Re', content)
    print(result)
    print(result.group())
    print(result.group(1))
The search succeeds even though the string starts with extra text, and result.group(1) is 123455. This time we do not need to write ^ and $, because search scans the entire string.
Note: for convenience, prefer search over match. match must match from the very start of the string, which is often inconvenient.
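The difference between the two functions can be sketched on a single string and a single pattern. The variable names below are assumptions for illustration.

```python
import re

content = "extra things Hello 123455 world_this is a Re extra things"
pattern = r'Hello.*?(\d+).*?Re'

matched = re.match(pattern, content)     # fails: the string does not start with Hello
searched = re.search(pattern, content)   # succeeds: the pattern is found further in

print(matched)
print(searched.group(1))
print(searched.span())
```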
Matching exercises
Example 1:
    import re

    # html holds the page markup to search; only its opening <div> is shown.
    html = '''<div id="songs-list">'''
    result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
    print(result)
    print(result.groups())
    print(result.group(1))
    print(result.group(2))
The result is:
re.findall
re.findall searches the string and returns all matching substrings as a list.
The code example is as follows:
    import re

    # html holds the page markup to search; only its opening <div> is shown.
    html = '''<div id="songs-list">'''
    results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
    print(results)
    print(type(results))
    # With several groups, each result is a tuple with one item per group.
    for result in results:
        print(result)
        print(result[0], result[1], result[2])
The results are as follows:
Example 2:
    import re

    # html holds the page markup to search; only its opening <div> is shown.
    html = '''<div id="songs-list">'''
    results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
    print(results)
    for result in results:
        print(result[1])
The results are as follows:
Two things are worth noting here:
\s*? handles the fact that some items contain a line break and some do not.
(<a.*?>)? handles the fact that some items have an <a> tag and some do not; the ? means the group matches zero or one time.
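The two points above can be tried on a small page fragment. The HTML below is a hypothetical snippet written for this sketch, with made-up titles and attributes; it is not the page used in the examples above.

```python
import re

# Hypothetical markup: one item without <a>, one with line breaks, one without.
html = '''<ul>
<li>Alpha</li>
<li>
<a href="/2.mp3" singer="B">Beta</a>
</li>
<li><a href="/3.mp3" singer="C">Gamma</a></li>
</ul>'''

results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
titles = [result[1] for result in results]

print(results)   # one tuple per <li>; unmatched optional groups are ''
print(titles)
```

Each tuple has three items because the pattern has three groups; for an item without an <a> tag, the first and third items are empty strings.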
re.sub
re.sub replaces every substring that matches the pattern and returns the resulting string.
re.sub(pattern, replacement, original string)
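Example 1
    import re

    content = "Extra things Hello 123455 world_this is a regex Demo Extra things"
    # Replace every run of digits with an empty string.
    content = re.sub(r'\d+', '', content)
    print(content)
As a result, the digits are replaced with an empty string, that is, removed from the text.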
Example 2: sometimes, when replacing, we also want to keep the matched text and add something after it. This can be done as follows:
    import re

    content = "Extra things Hello 123455 world_this is a regex Demo Extra things"
    # \1 in the replacement stands for the text captured by group 1.
    content = re.sub(r'(\d+)', r'\1 7890', content)
    print(content)
In the result, 123455 becomes 123455 7890. Note that \1 refers to the text matched by the first group, and the replacement string needs the r prefix so that the backslash is not treated as an escape character by Python.
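The role of the r prefix can be checked directly. In this sketch the variable names are chosen for illustration; the second call shows the equivalent replacement written without a raw string.

```python
import re

content = "Extra things Hello 123455 world_this is a regex Demo Extra things"

# Raw string: \1 reaches re.sub unchanged and refers to group 1.
with_raw = re.sub(r'(\d+)', r'\1 7890', content)
print(with_raw)

# Without the r prefix, '\1' is the single control character chr(1),
# so the backslash has to be doubled to get the same replacement.
print('\1' == chr(1))
with_doubled = re.sub(r'(\d+)', '\\1 7890', content)
print(with_doubled == with_raw)
```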
re.compile
re.compile compiles a regular expression into a pattern object so that the expression can be reused conveniently.
    import re

    content = """Hello 12345 world_this
    123 fan"""
    # Compile once; re.S is needed because content contains a line break.
    pattern = re.compile(r'Hello.*fan', re.S)
    result = re.match(pattern, content)
    print(result)
    print(result.group())
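A compiled pattern object also has its own match() and search() methods, so it can be applied to several strings without repeating the expression. The strings below are assumptions for illustration.

```python
import re

# Compile once, then reuse the pattern object's own methods.
pattern = re.compile(r'Hello.*fan', re.S)

first = """Hello 12345 world_this
123 fan"""
second = "say Hello to fan"

print(pattern.match(first).group())
print(pattern.match(second))             # None: second does not start with Hello
print(pattern.search(second).group())
```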
A comprehensive regex exercise
Get the book information from the Douban Books home page using regular expressions.
    import re
    import requests

    content = requests.get('https://book.douban.com/').text
    # Groups: link, title, author, year; re.S lets .*? span line breaks.
    pattern = re.compile(
        r'<li.*?cover.*?href="(.*?)".*?title="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>',
        re.S)
    results = re.findall(pattern, content)
    print(results)
    for result in results:
        url, name, author, date = result
        # Strip the whitespace that surrounds the text inside the tags.
        author = re.sub(r'\s', '', author)
        date = re.sub(r'\s', '', date)
        print(url, name, author, date)
The results are as follows:
Basic use of Python crawler from beginner to discard (v)