Regular Expressions
- Introduction to the re module
- Defining a rule: patternname = r"ABC ..."
1. Concept
- A regular expression (RE) is a small, highly specialized language.
- It is embedded in Python and implemented through the re module.
2. Role: processing strings
- Match
- Replace
- Split
3. Character matching
- Ordinary characters
- Metacharacters
. ^ $ * + ? {} [] \ | ()
# Match ordinary characters
>>> import re
>>> pattern = r"ab"
>>> re.findall(pattern, "123abc")
['ab']
4. Metacharacters
4.0 .
- Matches any character except a newline (unless the DOTALL flag is set)
4.1 []
- Matches one character out of the listed set
- Commonly used to specify a character set: [abc], [0-9], [a-zA-Z]
- Most metacharacters are treated as ordinary characters inside a set: [abc$]
- Complement (negated) set: [^a-z]
>>> import re
# Character set
>>> pattern = "[a-z]"
>>> re.findall(pattern, "abc")
['a', 'b', 'c']
# Complement set
>>> pattern = "[^a-z]"
>>> re.findall(pattern, "abc")
[]
# Special characters
>>> pattern = "[a^$]"
>>> re.findall(pattern, "abc^$")
['a', '^', '$']
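The set rules above can be checked with a short script. This is a minimal sketch in Python 3 syntax (the transcripts in these notes come from Python 2.7); the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample text containing several metacharacters.
text = "a-b^c$d]9"

# Inside a set, $ is literal, and ^ is literal when it is not the first character.
literal_hits = re.findall(r"[a^$]", text)

# A leading ^ negates the set: match anything that is not a lowercase letter.
negated_hits = re.findall(r"[^a-z]", text)

# A backslash still escapes inside a set, so \] and \- are matched literally.
escaped_hits = re.findall(r"[\]\-]", text)

print(literal_hits)
print(negated_hits)
print(escaped_hits)
```

The third pattern covers the exceptions to the "treated as ordinary characters" rule: ], - and \ keep a special role inside a set unless they are escaped.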
4.2 ^
- Matches the beginning of the line
>>> pattern = "^a"
>>> re.findall(pattern, "baaa")
[]
>>> re.findall(pattern, "abbb")
['a']
4.3 $
- Matches the end of the line
>>> pattern = "a$"
>>> re.findall(pattern, "aaab")
[]
>>> re.findall(pattern, "bbba")
['a']
4.4 \
- Escape character
- Cancels the special meaning of a metacharacter so that it is treated as an ordinary character
- Gives some ordinary characters a special meaning:
- \d <==> [0-9], matches any decimal digit
- \D <==> [^0-9], matches any non-digit character
- \s <==> [ \t\n\r\f\v], matches any whitespace character
- \S <==> [^ \t\n\r\f\v], matches any non-whitespace character
- \w <==> [a-zA-Z0-9_], matches any alphanumeric character or underscore
- \W <==> [^a-zA-Z0-9_], matches any character that is not alphanumeric or underscore
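A minimal sketch of the shorthand classes listed above, in Python 3 syntax. The sample text is invented for illustration. Note that in Python 3, str patterns match Unicode digits and letters by default; on plain ASCII input, as here, the results agree with the bracket equivalents in the list.

```python
import re

# Hypothetical sample text: letters, digits, an underscore, spaces and a newline.
text = "id_42 costs 7 USD\n"

digits = re.findall(r"\d", text)        # every decimal digit
non_digits = re.findall(r"\D", text)    # everything that is not a digit
whitespace = re.findall(r"\s", text)    # spaces and the trailing newline
words = re.findall(r"\w+", text)        # runs of letters, digits and underscores
separators = re.findall(r"\W+", text)   # runs of everything else

print(digits)
print(len(non_digits))
print(whitespace)
print(words)
print(separators)
```

Each uppercase class is the complement of its lowercase counterpart, so the \d and \D matches together account for every character of the text.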
4.5 Repetition
4.5.1 *
- Number of repetitions: [0, +infinity)
>>> pattern = r"ab*"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abb")
['abb']
>>> re.findall(pattern, "abbbbbbbbbb")
['abbbbbbbbbb']
4.5.2 +
- Number of repetitions: [1, +infinity)
>>> pattern = r"ab+"
>>> re.findall(pattern, "a")
[]
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbbb")
['abbbbbb']
4.5.3 ?
- Number of repetitions: [0, 1], that is, the preceding item is optional
>>> pattern = r"ab?"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbb")
['ab']
4.5.4 {m,n}
- {m,n} number of repetitions: [m, n]
- {m} number of repetitions: exactly m
- {m,} number of repetitions: [m, +infinity)
- m defaults to 0 when it is omitted, as in {,n}
>>> pattern = r"\d{1,3}"
>>> re.findall(pattern, "1234")
['123', '4']
>>> pattern = r"\d{1,}"
>>> re.findall(pattern, "1234")
['1234']
>>> pattern = r"\d{1}"
>>> re.findall(pattern, "1234")
['1', '2', '3', '4']
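The four repetition forms are related: *, + and ? are shorthand for {0,}, {1,} and {0,1}. A small sketch in Python 3 syntax, using an invented sample string, compares them side by side and also exercises the omitted lower bound in {,n}.

```python
import re

# Hypothetical sample: "a" followed by zero, one, two and three "b" characters.
text = "a ab abb abbb"

star = re.findall(r"ab*", text)            # zero or more b
plus = re.findall(r"ab+", text)            # one or more b
optional = re.findall(r"ab?", text)        # zero or one b
up_to_two = re.findall(r"ab{,2}", text)    # lower bound omitted, so it is 0

# Each shorthand gives the same result as its explicit {m,n} form.
same_as_star = re.findall(r"ab{0,}", text)
same_as_plus = re.findall(r"ab{1,}", text)
same_as_optional = re.findall(r"ab{0,1}", text)

print(star)
print(plus)
print(optional)
print(up_to_two)
```

All of these quantifiers are greedy: each takes as many b characters as its upper bound allows, which is why "abbb" yields 'abb' under {,2} and the remaining b is left unmatched.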
5. Compiling regular expressions
5.1 Compiling
- The re module provides an interface to the regular expression engine;
  a pattern string can be compiled into a pattern object
>>> import re
>>> telpatternstring = r"\d{3}"
>>> telpattern = re.compile(telpatternstring)
>>> telpattern
<_sre.SRE_Pattern object at 0x01806170>
>>> telpattern.findall("1")
[]
>>> telpattern.findall("123")
['123']
>>> telpattern.findall("1234")
['123']
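A compiled pattern is mostly useful when the same rule is applied to many strings. The following Python 3 sketch reuses one pattern object over a list of inputs; the names tel_pattern and inputs are illustrative and not taken from the original notes.

```python
import re

# Compile once, then reuse the pattern object for every input.
tel_pattern = re.compile(r"\d{3}")

inputs = ["1", "123", "1234", "123456"]
results = {text: tel_pattern.findall(text) for text in inputs}

# The module-level function accepts the same pattern string and gives the same result.
module_level = re.findall(r"\d{3}", "123456")

for text, hits in results.items():
    print(text, hits)
```

The pattern object also keeps the source string in its pattern attribute, which helps when debugging which rule produced a match.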
5.2 Using flags at compile time
- Ignore case
>>> import re
>>> namepatternstring = r"[a-z]{3}"
>>> namepattern = re.compile(namepatternstring, re.IGNORECASE)
>>> namepattern.findall("abc")
['abc']
>>> namepattern.findall("AbC")
['AbC']
5.3 The backslash problem
- With the r prefix before a string literal, Python does not treat backslashes in it specially
>>> pattern = r"\\"
>>> re.findall(pattern, "C:\dirA")
['\\']
>>> pattern = "\\"
>>> re.findall(pattern, "C:\dirA")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python\install-2.7\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "D:\Python\install-2.7\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
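The failure above comes from two layers of escaping: Python processes the string literal first, and the regex engine then processes whatever characters remain. A minimal Python 3 sketch of the same situation follows; the variable names are illustrative, and the exact error message differs between Python versions, so it is printed rather than assumed.

```python
import re

path = "C:\\dirA"        # the string holds a single backslash after C:

raw_pattern = r"\\"      # two characters: the regex for one literal backslash
plain_pattern = "\\\\"   # the same two characters written without the r prefix

hits = re.findall(raw_pattern, path)

# "\\" is a single backslash, which is an incomplete escape for the regex engine.
error_raised = False
try:
    re.compile("\\")
except re.error as exc:
    error_raised = True
    print("invalid pattern:", exc)

print(hits)
```

Writing patterns as raw strings keeps the literal identical to what the regex engine sees, which is why the r prefix is the usual habit for anything containing a backslash.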
6. Some methods of the regex object
- match() matches the rule only at the beginning of the string; returns None on failure
- search() matches the rule at any position in the string
- findall() returns all the strings that match the rule as a list
- finditer() returns the matches as an iterator
- sub() replaces matches and returns the resulting string
- subn() replaces matches and returns (resulting string, number of substitutions)
- split() splits the string
6.1 match()
>>> import re
>>> pattern = r"a"
>>> re.match(pattern, "abc")
<_sre.SRE_Match object at 0x01407fa8>
>>> re.match(pattern, "ba")
>>> re.match(pattern, "bac")
>>>
6.2 search()
>>> import re
>>> pattern = r"a"
>>> re.search(pattern, "abc")
<_sre.SRE_Match object at 0x01419330>
>>> re.search(pattern, "bac")
<_sre.SRE_Match object at 0x01407fa8>
>>> re.search(pattern, "bca")
<_sre.SRE_Match object at 0x01419330>
>>>
6.3 findall()
>>> import re
>>> pattern = r"a"
>>> re.findall(pattern, "abacad")
['a', 'a', 'a']
6.4 finditer()
>>> import re
>>> pattern = r"[0-9]"
>>> re.finditer(pattern, "1234")
<callable-iterator object at 0x017fe990>
>>> for x in re.finditer(pattern, "1234"):
...     print x
...
<_sre.SRE_Match object at 0x01407fa8>
<_sre.SRE_Match object at 0x01419330>
<_sre.SRE_Match object at 0x01407fa8>
<_sre.SRE_Match object at 0x01419330>
>>>
6.5 sub() and subn()
- subn(pattern, repl, string, count=0, flags=0)
>>> re.sub(r"a", "x", "abca")
'xbcx'
>>> re.subn(r"a", "x", "abca")
('xbcx', 2)
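The count parameter in the signature limits how many matches are replaced, and repl may also be a function that receives each match object. A short Python 3 sketch with invented sample strings:

```python
import re

# count=1 replaces only the first match; the default 0 means "replace all".
first_only = re.sub(r"a", "x", "abca", count=1)

# subn() reports how many substitutions were made.
masked, n_subs = re.subn(r"\d+", "#", "a1b22c333")

# repl can be a callable taking the match object; here every number is doubled.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "a1b22c333")

print(first_only)
print(masked, n_subs)
print(doubled)
```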
6.6 split()
- split(pattern, string, maxsplit=0, flags=0)
>>> re.split("[^\d]", "1999-09/19 23:34:59")
['1999', '09', '19', '23', '34', '59']
>>> re.split("[^\d]", "1+2+3-4*5")
['1', '2', '3', '4', '5']
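Splitting on a single non-digit character produces empty strings whenever two separators are adjacent, for example when the operators are surrounded by spaces. A hedged Python 3 sketch of that case, and of the maxsplit parameter from the signature, with invented inputs:

```python
import re

spaced = "1 + 2 + 3-4 * 5"

# One separator character per split: adjacent separators leave empty strings.
single_char = re.split(r"[^\d]", "1 + 2")

# \D+ treats a whole run of non-digits as one separator.
runs = re.split(r"\D+", spaced)

# maxsplit=2 stops after two splits and returns the rest unsplit.
limited = re.split(r"\D+", "1999-09/19 23:34:59", maxsplit=2)

print(single_char)
print(runs)
print(limited)
```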
7. Some methods of the Match object
- group() returns the matched string: obj.group()
- start() returns the starting position of the match
- end() returns the ending position of the match
- span() returns (starting position, ending position)
- Check whether the returned Match object is None to determine whether the match succeeded.
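The Match object methods above can be tried on a single search result. This is a minimal Python 3 sketch; the sample strings are made up for illustration.

```python
import re

m = re.search(r"\d+", "order 1234 shipped")

# search() returns None when nothing matches, so test before using the result.
if m is not None:
    matched_text = m.group()   # the matched string
    start = m.start()          # index of the first matched character
    end = m.end()              # index just past the last matched character
    span = m.span()            # (start, end) as a tuple
    print(matched_text, start, end, span)

miss = re.search(r"\d+", "no digits here")
print(miss is None)
```

The end position is exclusive, so slicing the original string with the span gives back exactly the matched text.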
8. re attributes
- Compilation flags
- DOTALL / S: makes . match every character, including a newline
- IGNORECASE / I: ignores case
- LOCALE / L: locale-aware matching
- MULTILINE / M: multi-line matching, affects ^ and $
- VERBOSE / X: ignores whitespace (including newlines) in the pattern, so a rule can be written across several lines
# re.S
>>> re.findall(r"a.b", "a\nb")
[]
>>> re.findall(r"a.b", "a\nb", re.S)
['a\nb']
# re.M
>>> s = """
... line1:a1
... line2:a2
... line3:a3
... """
>>> s
'\nline1:a1\nline2:a2\nline3:a3\n'
>>> re.findall(r"^line[0-9]", s)
[]
>>> re.findall(r"^line[0-9]", s, re.M)
['line1', 'line2', 'line3']
# re.X
>>> telpatternstr = r"""
... \d{3,4}
... -?
... \d{7}
... """
>>> telpatternstr
'\n\\d{3,4}\n-?\n\\d{7}\n'
>>> re.findall(telpatternstr, "011-1234567")
[]
>>> re.findall(telpatternstr, "011-1234567", re.X)
['011-1234567']
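Flags can be combined with the bitwise OR operator, and in verbose mode a # starts a comment inside the pattern. The following Python 3 sketch illustrates both; the sample text and the comment labels inside the pattern are illustrative assumptions, not part of the original notes.

```python
import re

# Hypothetical text whose line prefixes differ only in case.
text = "Line1:a1\nLINE2:a2\nline3:a3"

only_multiline = re.findall(r"^line[0-9]", text, re.M)
only_ignorecase = re.findall(r"^line[0-9]", text, re.I)
combined = re.findall(r"^line[0-9]", text, re.I | re.M)

# With re.VERBOSE, whitespace and # comments in the pattern are ignored.
tel_pattern = re.compile(r"""
    \d{3,4}   # leading digits (label is illustrative)
    -?        # optional dash
    \d{7}     # remaining seven digits
""", re.VERBOSE)
tel_hits = tel_pattern.findall("011-1234567")

print(only_multiline)
print(only_ignorecase)
print(combined)
print(tel_hits)
```

Neither flag alone finds all three lines: re.M lets ^ match at every line start but stays case-sensitive, while re.I ignores case but only anchors at the start of the string.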
9. Grouping
- ()
- (pattern1|pattern2): matches either one of the two alternatives
- Groups take precedence: when the pattern contains a group, findall() returns the group's content rather than the whole match
# Extract the URLs
>>> s = """
... <a href="www.baidu.com">baidu</a>
... <a href="www.sina.com.cn">sina</a>
... """
>>> print s
<a href="www.baidu.com">baidu</a>
<a href="www.sina.com.cn">sina</a>
>>> re.findall(r"<a href=\".+\">.+</a>", s)
['<a href="www.baidu.com">baidu</a>', '<a href="www.sina.com.cn">sina</a>']
>>> re.findall(r"<a href=\"(.+)\">.+</a>", s)
['www.baidu.com', 'www.sina.com.cn']
>>>
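The alternation form (pattern1|pattern2) is not exercised in the transcript above, so here is a small Python 3 sketch of it, together with how groups change what findall() returns. The sample strings are invented, and the (?:...) non-capturing group and the non-greedy .+? are standard re syntax that these notes do not otherwise cover.

```python
import re

# Plain alternation without a group: findall returns the whole match.
either = re.findall(r"cat|dog", "cat dog bird cat")

# With a capturing group, findall returns only the group's content.
captured = re.findall(r"(cat|dog)s?", "cats dog")

# A non-capturing group (?:...) groups without changing what is returned.
whole = re.findall(r"(?:cat|dog)s?", "cats dog")

# With two groups, findall returns one tuple per match.
html = '<a href="www.baidu.com">baidu</a>\n<a href="www.sina.com.cn">sina</a>'
links = re.findall(r'<a href="(.+?)">(.+?)</a>', html)

print(either)
print(captured)
print(whole)
print(links)
```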
10. A small crawler
- Downloads all the pictures from a Tieba post or a QQ Zone page
- grappicture.py
'''
Created on 2013-10-4
@author: Wuqinfei
'''
import re
import urllib

# url: web site
# return: get src code from the url
def gethtml(url):
    page = urllib.urlopen(url)  # connect to the url
    html = page.read()          # read it
    return html                 # return src code

# html: html src code
# return: a list of jpg urls
def getimg(html):
    reg = r'src="(http://[^\s]*\.jpg)" width'
    imgre = re.compile(reg)
    imgurllist = re.findall(imgre, html)
    return imgurllist

# url: download by this url
# name: saved by this name in the current dir
def downbyurl(url, name):
    urllib.urlretrieve(url, name)

################################################
if __name__ == "__main__":
    html = gethtml("http://tieba.baidu.com/p/2306540022")
    imgurllist = getimg(html)
    count = 1
    stopnum = 10
    for imgurl in imgurllist:
        print "Download ...", imgurl
        picturename = "e:\\desktop\\python\\py_src\\jpg\\%s.jpg" % count
        downbyurl(imgurl, picturename)
        count += 1
        if count > stopnum:
            break
    print "The number of pictures =", count - 1
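The extraction step of the script can be tried without any network access. The sketch below is written for Python 3 (where the download calls live in urllib.request as urlopen and urlretrieve) and runs the same regular expression over a hand-written HTML fragment. The sample HTML, the example.com URLs and the helper name get_img_urls are assumptions made for illustration, not part of the original script.

```python
import re

# Same rule as in the script: capture the jpg URL that precedes a width attribute.
IMG_PATTERN = re.compile(r'src="(http://[^\s]*\.jpg)" width')

def get_img_urls(html):
    """Return the list of jpg URLs found in an HTML string."""
    return IMG_PATTERN.findall(html)

# Hypothetical page source; a real page would come from urllib.request.urlopen().
sample_html = (
    '<img src="http://example.com/a.jpg" width="100">\n'
    '<img src="http://example.com/b.png" width="100">\n'
    '<img src="http://example.com/c.jpg" width="50">\n'
)

urls = get_img_urls(sample_html)

# Pair each URL with a numbered file name, stopping after stop_num pictures.
stop_num = 10
planned = [(url, "%s.jpg" % n) for n, url in enumerate(urls[:stop_num], start=1)]

for url, name in planned:
    print("Download ...", url, "->", name)
print("The number of pictures =", len(planned))
```

Because [^\s]* cannot cross whitespace, the png entry is skipped rather than swallowed into a longer match that ends at the next jpg.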