Metacharacters: the metacharacter is the soul of the regular expression. There are a great many metacharacters, and they are used constantly.
1. Character groups
A character group is simple: it is enclosed in []. Any character that appears inside [] can be matched. Example: [abc] matches a, b, or c.
If a character group would contain too many characters, you can use a range with -. For example: [a-z] matches every letter from a to z, and [0-9]
matches every Arabic digit.
Think about it: what does [a-zA-Z0-9] match?
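A minimal sketch to check the answer with Python's re module (the sample strings are made up for illustration):
import re
# [a-zA-Z0-9] matches any single upper-case letter, lower-case letter, or digit
print(re.findall('[a-zA-Z0-9]', 'Hi_07!'))   # ['H', 'i', '0', '7']
print(re.findall('[abc]', 'cabbage'))        # ['c', 'a', 'b', 'b', 'a']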
2. Simple metacharacters
These are the basic metacharacters. Search online and you will find a huge pile of them, but only a handful are used often (a short sketch follows the list):
. matches any character except a newline
\w matches a letter, digit, or underscore
\s matches any whitespace character
\d matches a digit
\n matches a newline
\t matches a tab
\b matches a word boundary
^ matches the start of the string
$ matches the end of the string
\W matches anything that is not a letter, digit, or underscore
\D matches a non-digit
\S matches a non-whitespace character
a|b matches character a or character b
() matches the expression inside the parentheses and also marks a group
[...] matches any character in the character group
[^...] matches any character not in the character group
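A minimal sketch of a few of these metacharacters (the sample string is invented):
import re
s = 'order 66 was issued\nat dawn'
print(re.findall(r'\d', s))       # ['6', '6']  -- single digits
print(re.findall(r'\w+', s))      # ['order', '66', 'was', 'issued', 'at', 'dawn']
print(re.findall(r'^order', s))   # ['order']   -- only at the start of the string
print(re.findall(r'\S+', s))      # runs of non-whitespace characters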
3. Quantifiers
So far everything we have matched has been a single character. What if we want to match many characters at once?
That is what quantifiers are for (see the sketch after this list):
* repeats 0 or more times
+ repeats 1 or more times
? repeats 0 or 1 time
{n} repeats exactly n times
{n,} repeats n or more times
{n,m} repeats n to m times
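A minimal sketch of the quantifiers in Python (sample strings invented for illustration):
import re
print(re.findall(r'\d+', 'room 404, floor 12'))    # ['404', '12']  -- one or more digits
print(re.findall(r'\d{3}', 'room 404, floor 12'))  # ['404']        -- exactly three digits
print(re.findall(r'ab?c', 'ac abc abbc'))          # ['ac', 'abc']  -- the b is optional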
4. Lazy matching and greedy matching
The quantifiers *, +, and {} are greedy by default: they match as many characters as possible.
str: Mahuateng was shut down by League of Legends yesterday
reg: Mahuateng.*
Result: the whole sentence.
If a ? is added after the .*, the match becomes lazy, meaning it matches as few characters as possible:
str: Mahuateng was shut down by League of Legends yesterday
reg: Mahuateng.*?
Result: Mahuateng
str: <div>spicy soup</div>
reg: <.*>
Result: <div>spicy soup</div>
str: <div>spicy soup</div>
reg: <.*?>
Result:
<div>
</div>
str: <div>spicy soup</div>
reg: <(div|/div)*?>
Result:
<div>
</div>
.*?x has a special meaning: match as little as possible until the next x, then stop.
str: abcdefgxhijklmn
reg: .*?x
Result: abcdefgx
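A quick sketch of greedy versus lazy matching in Python, mirroring the examples above:
import re
s = '<div>spicy soup</div>'
print(re.findall(r'<.*>', s))                  # ['<div>spicy soup</div>']  -- greedy: as much as possible
print(re.findall(r'<.*?>', s))                 # ['<div>', '</div>']        -- lazy: as little as possible
print(re.findall(r'.*?x', 'abcdefgxhijklmn'))  # ['abcdefgx']               -- stop at the first x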
5. Grouping
Grouping in a regular expression is done with (). For example, suppose we want to match a fairly complex ID number. ID numbers come
in two kinds: the old kind is 15 digits long, the new kind is 18 digits long, and a new ID number may end with x.
The following regular expressions are given:
^[1-9]\d{13,16}[0-9x]$
^[1-9]\d{14}(\d{2}[0-9x])?$
^([1-9]\d{16}[0-9x]|[1-9]\d{14})$
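A minimal sketch checking the third pattern against made-up values (these ID numbers are invented for illustration, not real ones):
import re
pattern = r'^([1-9]\d{16}[0-9x]|[1-9]\d{14})$'
print(re.match(pattern, '110105880101234'))     # 15 digits: matches the second branch
print(re.match(pattern, '11010519880101123x'))  # 18 characters ending in x: matches the first branch
print(re.match(pattern, '12345'))               # None -- neither branch fits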
6. Escaping
In regular expressions, many characters with special meanings are metacharacters, for example \n and \s. If in a regular expression you want
to match an ordinary "\n" rather than a "newline", you have to escape the backslash, turning it into "\\n". In Python, both the regular expression
and the content to be matched are given as strings, and in a string \ also has a special meaning, so it needs escaping as well.
So to match the text "\n", the string would be written as '\\n' and the regular expression as "\\\\n", which is far too much trouble.
At this point we use raw strings: the string becomes r'\n' and the regular expression becomes r'\\n'.
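A minimal sketch of the escaping rules just described:
import re
s = r'a\nb'                       # the string holds a literal backslash followed by n
print(re.findall('\\\\n', s))     # ['\\n'] -- four backslashes when raw strings are not used
print(re.findall(r'\\n', s))      # ['\\n'] -- the same pattern written as a raw string
print(re.findall(r'\n', 'a\nb'))  # ['\n']  -- this pattern matches an actual newline instead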
Practice:
1. Match email addresses
2. Match phone numbers
3. Match birth dates in the format YYYY-MM-DD
4. Match phone numbers
5. Match IP addresses
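A possible sketch for exercise 3 (one of several reasonable answers; the pattern and test dates here are our own, not from the original notes):
import re
date_re = r'^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$'   # YYYY-MM-DD, months 01-12, days 01-31
print(re.match(date_re, '1990-05-17'))   # matches
print(re.match(date_re, '1990-13-01'))   # None -- month 13 is rejected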
Two. The re module
The re module is the set of tools Python provides for working with regular expressions. It has four core functions:
1. findall: find all matches and return them as a list.
lst = re.findall("m", "mai le fo len, mai ni mei!")
print(lst)  # ['m', 'm', 'm']
lst = re.findall(r"\d+", "before 5 o'clock, you must give me 5000万")
print(lst)  # ['5', '5000']
2. search: scan the string for a match; as soon as the first result is found, it is returned. If nothing matches,
search returns None.
ret = re.search(r'\d', "before 5 o'clock, you must give me 5000万").group()
print(ret)  # 5
3. match: matches only from the beginning of the string.
ret = re.match('a', 'abc').group()
print(ret)  # a
4. finditer: similar to findall, except that it returns an iterator.
it = re.finditer("m", "mai le fo len, mai ni mei!")
for el in it:
    print(el.group())  # you still need to call group()
5. Other operations
ret = re.split('[ab]', 'qwerafjbcd')  # first split on 'a' to get 'qwer' and 'fjbcd', then split each of those on 'b'
print(ret)  # ['qwer', 'fj', 'cd']
ret = re.sub(r"\d+", "_sb_", "alex250taibai250wusir250ritian38")  # replace the digits in the string with _sb_
print(ret)  # alex_sb_taibai_sb_wusir_sb_ritian_sb_
ret = re.subn(r"\d+", "_sb_", "alex250taibai250wusir250ritian38")  # replace the digits with _sb_ and return a tuple (result, number of replacements)
print(ret)  # ('alex_sb_taibai_sb_wusir_sb_ritian_sb_', 4)
obj = re.compile(r'\d{3}')  # compile the regular expression into a regex object; the rule is to match 3 digits
ret = obj.search('abc123eeee')  # call search on the regex object; the argument is the string to be matched
print(ret.group())  # result: 123
The key for web crawlers:
obj = re.compile(r'(?P<id>\d+)(?P<name>e+)')  # give each group in the regular expression a name
ret = obj.search('abc123eeee')  # search
print(ret.group())        # result: 123eeee
print(ret.group("id"))    # result: 123   # get the contents of the id group
print(ret.group("name"))  # result: eeee  # get the contents of the name group
6. Two pitfalls
Note: the results from the re module and from our online testing tool may not be the same.
ret = re.findall('www.(baidu|oldboy).com', 'www.oldboy.com')
print(ret)  # ['oldboy']  This happens because findall gives priority to the group's result; if you want the whole match, cancel the group's priority:
ret = re.findall('www.(?:baidu|oldboy).com', 'www.oldboy.com')
print(ret)  # ['www.oldboy.com']
split has a pitfall too:
ret = re.split("\d+", "eva3egon4yuan")
print(ret)  # result: ['eva', 'egon', 'yuan']
ret = re.split("(\d+)", "eva3egon4yuan")
print(ret)  # result: ['eva', '3', 'egon', '4', 'yuan']
# Adding () around the matched part changes the result:
# without () the matched items are not kept, but with () the matched items are retained.
# This matters a lot in cases where the matched part needs to be kept.
This group-priority behaviour can sometimes help us do a lot of things. Let's look at a slightly more complex example:
import re
from urllib.request import urlopen
import ssl

# ignore the digital signature certificate
ssl._create_default_https_context = ssl._create_unverified_context

def getPage(url):
    response = urlopen(url)
    return response.read().decode('utf-8')

def parsePage(s):
    ret = re.findall(
        '<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?'
        '<span class="title">(?P<title>.*?)</span>'
        '.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?'
        '<span>(?P<comment_num>.*?)评价</span>', s, re.S)
    return ret

def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    print(ret)

count = 0
for i in range(10):  # 10 pages
    main(count)
    count += 25
Named groups are used here. After a successful match, the result of each group can be fetched by its name. With (?P<id>\d+), the
data matched by that group is stored under the name id. The program can then be rewritten as:
import ssl
import re
from urllib.request import urlopen

# ignore the digital signature certificate
ssl._create_default_https_context = ssl._create_unverified_context

def getPage(url):
    response = urlopen(url)
    return response.read().decode('utf-8')

def parsePage(s):
    com = re.compile(
        '<div class="item">.*?<div class="pic">.*?<em .*?>(?P<id>\d+).*?'
        '<span class="title">(?P<title>.*?)</span>'
        '.*?<span class="rating_num" .*?>(?P<rating_num>.*?)</span>.*?'
        '<span>(?P<comment_num>.*?)评价</span>', re.S)
    ret = com.finditer(s)
    for i in ret:
        yield {
            "id": i.group("id"),
            "title": i.group("title"),
            "rating_num": i.group("rating_num"),
            "comment_num": i.group("comment_num"),
        }

def main(num):
    url = 'https://movie.douban.com/top250?start=%s&filter=' % num
    response_html = getPage(url)
    ret = parsePage(response_html)
    # print(ret)
    f = open("move_info7", "a", encoding="utf8")
    for obj in ret:
        print(obj)
        data = str(obj)
        f.write(data + "\n")

count = 0
for i in range(10):
    main(count)
    count += 25
That is all there is to say about regular expressions and the re module. If you want to understand everything about regular expressions thoroughly,
plan on at least a week of study. For our everyday use, the points above are enough. If you run into some extreme case, it is better to
process it in stages: split the string into lines first, and only then consider using a regular expression.