Why learn regular expressions: (1) they are a basic skill every programmer needs, (2) they are used for data extraction, (3) they are the foundation for writing crawlers.
A regular expression is independent of Python and works the same way in all languages: it is a rule that matches the contents of a string.
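1 Character sets and metacharacters
A character set matches exactly one character.
[0123456789] or [0-9]   any single digit; a range must run from the smaller character to the larger one
[a-zA-Z]   any single letter; write the two ranges separately, because a single range such as [A-z] also covers the other characters that sit between Z and a
.    any character except the line break '\n'
\w   word character (letter, digit or underscore); \W is its negation
\s   whitespace; \S is its negation
\d   digit; \D is its negation
^    start of the string; inside a character set, [^...] matches any character not listed
$    end of the string
()   group
a|b  a or b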
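The character sets and metacharacters above can be tried out with a short sketch like the following. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical sample text, chosen only to exercise each metacharacter.
text = "User_42 logged in at 09:15\nbye"

# [0-9] and \d both match a single digit.
digits_class = re.findall(r"[0-9]", text)
digits_short = re.findall(r"\d", text)

# \w matches a letter, digit or underscore; + repeats it.
words = re.findall(r"\w+", text)

# [^...] negates a set: here, anything that is neither a word character nor whitespace.
punctuation = re.findall(r"[^\w\s]", text)

# . matches any character except '\n', so the match stops at the line break.
first_line = re.match(r".*", text).group()

# ^ anchors to the start of the string, $ to the end.
starts_with_user = re.search(r"^User", text) is not None
ends_with_bye = re.search(r"bye$", text) is not None

# a|b matches either alternative.
alternation = re.findall(r"logged|bye", text)

print(digits_class, words, punctuation, first_line, alternation)
```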
2 Quantifiers
A quantifier specifies how many times the item immediately before it may occur. Quantifiers are greedy by default; a ? placed after a quantifier makes it non-greedy.
*       0 or more times, greedy;   *?  0 or more times, as few as possible
+       1 or more times, greedy;   +?  1 or more times, as few as possible
?       0 or 1 time, greedy;       ??  0 or 1 time, preferring 0
{n}     exactly n times
{n,}    n or more times, greedy
{n,m}   n to m times, greedy;      {n,m}?  n to m times, preferring n
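A minimal sketch of how greedy and non-greedy quantifiers differ in practice. The input strings are invented for the demonstration.

```python
import re

tags = "<a><b><c>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", tags)

# Non-greedy: .+? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r"<.+?>", tags)

digits = "1234567"
exactly_three = re.findall(r"\d{3}", digits)     # exactly 3 digits per match
two_or_more = re.findall(r"\d{2,}", digits)      # 2 or more, as many as possible
two_to_three = re.findall(r"\d{2,3}", digits)    # 2 to 3, preferring 3
two_to_three_lazy = re.findall(r"\d{2,3}?", digits)  # 2 to 3, preferring 2

sample = "a ab abbb"
star = re.findall(r"ab*", sample)      # 'a' followed by 0 or more 'b'
plus = re.findall(r"ab+", sample)      # 'a' followed by 1 or more 'b'
optional = re.findall(r"ab?", sample)  # 'a' followed by 0 or 1 'b'

print(greedy, lazy)
print(exactly_three, two_or_more, two_to_three, two_to_three_lazy)
print(star, plus, optional)
```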
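Example 1: names
The names all start with the surname 李 (Li) and are joined by 和 ("and"). Two patterns work: 李[杰莲英二棍子]{1,3} and 李[^和]*

    import re

    my_str = '李杰和李莲英和李二棍子'
    # pattern = re.compile(r'李[杰莲英二棍子]{1,3}')
    pattern = re.compile(r'李[^和]{1,3}')
    result = pattern.findall(my_str)
    print(result)  # ['李杰', '李莲英', '李二棍子']

Example 2: ID number
An ID number has either 18 characters (the last may be x) or 15 digits, and does not start with 0.

    import re

    id_str = input('ID Number:')
    pattern = re.compile(r'[1-9]\d{16}[0-9x]|[1-9]\d{14}')
    result = pattern.findall(id_str)
    print(result)

An equivalent pattern is [1-9]\d{14}(\d{2}[0-9x])?. When it is used with findall, make the group non-capturing, (?:\d{2}[0-9x])?, otherwise findall returns only the content of the group.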
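As a hedged sketch of how the ID number pattern could be used for validation rather than searching: the helper name looks_like_id and the sample numbers are made up here, and the check covers only the shape of the string (15 digits, or 17 digits plus a digit or x), not any checksum.

```python
import re

# Non-capturing group (?:...) so the optional tail does not change what is returned.
ID_PATTERN = re.compile(r"[1-9]\d{14}(?:\d{2}[0-9x])?")


def looks_like_id(candidate):
    """Return True if the whole string has the shape of a 15- or 18-character ID."""
    # fullmatch requires the pattern to cover the entire string.
    return ID_PATTERN.fullmatch(candidate) is not None


# With a capturing group, findall returns only the group's content.
capturing = re.findall(r"[1-9]\d{14}(\d{2}[0-9x])?", "110105199001011234")

print(looks_like_id("110105199001011234"))  # 18 characters
print(looks_like_id("1101051990010112"))    # 16 characters
print(capturing)
```

Using fullmatch avoids accepting a valid ID that is merely embedded in a longer string, which findall or search would not catch.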
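3 Escape character \
The backslash \ is the escape character both in regular expressions and in Python string literals, so each layer needs its own escaping. To match the literal text \d, the regular expression must be \\d, and in an ordinary Python string literal that regular expression has to be written as '\\\\d'.
Prefixing the literal with r makes it a raw string, in which Python leaves backslashes alone: r'\\d\n' reaches the regular expression engine exactly as \\d\n.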
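The two layers of escaping can be checked with a small sketch. The sample text is invented and contains a literal backslash followed by d.

```python
import re

# A raw string: the text really contains the two characters '\' and 'd'.
text = r"price \d 42"

# The regular expression \\d means "a literal backslash, then d".
# In an ordinary string literal every backslash has to be doubled again.
normal = re.findall("\\\\d", text)

# In a raw string the regular expression can be written as it is.
raw = re.findall(r"\\d", text)

# Both literals produce the same three-character pattern.
same_pattern = "\\\\d" == r"\\d"

# Without the extra escaping, \d keeps its usual meaning: a digit.
digits = re.findall(r"\d", text)

print(normal, raw, same_pattern, digits)
```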
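Greedy matching is implemented with a backtracking algorithm: the quantifier first takes as many characters as it can, then gives them back one at a time until the rest of the pattern matches. .*?x is the non-greedy form: it takes as few characters as possible and stops as soon as the first x is matched.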
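A short sketch of the difference, using a made-up string with two x characters:

```python
import re

text = "aaxbbxcc"

# Greedy: .* first consumes the whole string, then backtracks
# until an 'x' can follow, so it ends at the LAST 'x'.
greedy = re.match(r".*x", text).group()

# Non-greedy: .*? extends one character at a time
# and stops at the FIRST 'x'.
lazy = re.match(r".*?x", text).group()

print(greedy)
print(lazy)
```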
The re module
re is Python's regular expression module.

    import re

Methods of the re module

    import re

    pattern = re.compile(r'<.*?>')
    string = '<script>xxxxx</script>'

    # findall returns all matches as a list; [] if nothing is found
    result = re.findall(pattern, string)
    print(result)

    # search returns the first match as a match object, or None;
    # call group() on it to get the matched text
    result1 = re.search(pattern, string)
    print(result1)
    if result1:  # guard against None when nothing is found
        print(result1.group())

    # match only matches at the start of the string, as if the pattern
    # began with ^; it returns None if the start does not match
    result3 = re.match(pattern, string)
    print(result3)
    re.split('[ab]', string)     # split on 'a' first, then split the pieces on 'b'
    re.sub(r'\d', 'H', string)   # replace every digit (\d) with 'H'
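A runnable sketch of split and sub on invented sample strings. It also shows the difference between splitting on the character set [ab] and on the literal text ab.

```python
import re

# [ab] splits on every 'a' and every 'b'; empty strings mark adjacent separators.
by_either = re.split(r"[ab]", "abcd")

# Without brackets the separator is the two-character text 'ab'.
by_sequence = re.split(r"ab", "abcd")

# sub replaces every match; count limits the number of replacements.
all_replaced = re.sub(r"\d", "H", "eva3egon4yuan4")
first_replaced = re.sub(r"\d", "H", "eva3egon4yuan4", count=1)

# subn also reports how many replacements were made.
replaced, how_many = re.subn(r"\d", "H", "eva3egon4yuan4")

print(by_either, by_sequence)
print(all_replaced, first_replaced, how_many)
```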
re.finditer() returns an iterator of match objects; call group() on each one to get its value.
re.compile() compiles a regular expression into a pattern object: pattern = re.compile(regular expression). The compiled object can be used directly without recompiling, which saves time when the same expression is used many times. It offers the same methods: pattern.findall, pattern.search, pattern.match.
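A minimal sketch of a compiled pattern used with finditer and the other methods. The sample text is made up.

```python
import re

# Compile once, reuse many times.
pattern = re.compile(r"\d+")
text = "order 12, item 345, qty 6"

# finditer yields match objects lazily instead of building a full list.
found = [(m.group(), m.start()) for m in pattern.finditer(text)]

# The compiled object has the same methods as the module-level functions.
all_numbers = pattern.findall(text)
first_number = pattern.search(text).group()
at_start = pattern.match(text)  # the text starts with a letter, so no match

print(found)
print(all_numbers, first_number, at_start)
```

finditer is the better fit when the input is large or when positions (start, end) are needed along with the matched text.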
Group priority in findall
When the pattern contains a group, findall gives the group priority and returns its content instead of the whole match. Write (?:...) to turn this off.

    import re

    ret = re.findall(r'www\.(baidu|oldboy)\.com', 'www.oldboy.com')
    print(ret)  # ['oldboy']

    ret = re.findall(r'www\.(?:baidu|oldboy)\.com', 'www.oldboy.com')
    print(ret)  # ['www.oldboy.com']

Group priority in split
Adding () around the separator keeps the separators in the result.

    import re

    ret = re.split(r'\d+', '123gg6gg4ds45fff')
    print(ret)  # ['', 'gg', 'gg', 'ds', 'fff']

    ret = re.split(r'(\d+)', '123gg6gg4ds45fff')
    print(ret)  # ['', '123', 'gg', '6', 'gg', '4', 'ds', '45', 'fff']

Named groups: (?P<name>...)
Groups serve two purposes in a regular expression: they let a quantifier apply to several characters at once, and they pick out just the part of the match that is needed.

    import re

    s = '<h1>hello</h1>'  # illustrative sample input

    ret = re.search(r'<\w+>\w+</\w+>', s)

    # (?P=t) requires the same text that the group named t matched
    ret1 = re.search(r'<(?P<t>\w+)>\w+</(?P=t)>', s)
    print(ret1.group(), ret1.group('t'))

    ret2 = re.search(r'<\w+>(?P<content>\w+)</\w+>', s)
    print(ret2.group(), ret2.group('content'))

    ret = re.search(r'<(\w+)>\w+</\1>', s)
    print(ret.group(), ret.group(1))

If a group is not named, \number refers to it by position and requires the text at that point to be identical to what the earlier group matched. The matched value can then be read with group(number).
Regular expressions: the Python re module