17 - Python: Regular Expressions

Source: Internet
Author: User
Regular Expressions
- Introduction to the re module
- Defining a rule: patternname = r"ABC ..."

1. Concept
- A regular expression (RE) is a small, highly specialized language.
- It is embedded in Python and implemented through the re module.

2. Role
Process strings:
- Match
- Replace
- Split

3. Character Matching
- Ordinary characters
- Metacharacters
  . ^ $ * + ? {} [] \ | ()

# Match ordinary characters
>>> import re
>>> pattern = r"ab"
>>> re.findall(pattern, "123abc")
['ab']

4. Metacharacters

4.0 .
- Matches any single character except a newline (a newline is matched too when re.S is set, see section 8)
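
The original gives no session for the dot, so here is a minimal sketch of its behaviour. The sample strings are made up for illustration.

```python
import re

# "." stands for exactly one character, whatever it is (except "\n").
dot_matches = re.findall(r"a.c", "abc a-c a c ac")

# A newline between "a" and "c" is not matched by "." by default.
newline_matches = re.findall(r"a.c", "a\nc")

print(dot_matches)      # ['abc', 'a-c', 'a c']
print(newline_matches)  # []
```

Note that "ac" is not matched: the dot must consume one character.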

4.1 []
- Matches one character out of the set
- Commonly used to specify a character set: [abc], [0-9], [a-zA-Z]
- Most metacharacters are treated as ordinary characters inside a set: [abc$]
- Complement set: [^a-z]

>>> import re
# Character set
>>> pattern = "[a-z]"
>>> re.findall(pattern, "abc")
['a', 'b', 'c']
# Complement set
>>> pattern = "[^a-z]"
>>> re.findall(pattern, "abc")
[]
# Special characters
>>> pattern = "[a^$]"
>>> re.findall(pattern, "abc^$")
['a', '^', '$']

4.2 ^
- Matches the beginning of the string (and the beginning of every line when re.M is set)

>>> pattern = "^a"
>>> re.findall(pattern, "baaa")
[]
>>> re.findall(pattern, "abbb")
['a']

4.3 $
- Matches the end of the string (and the end of every line when re.M is set)

>>> pattern = "a$"
>>> re.findall(pattern, "aaab")
[]
>>> re.findall(pattern, "bbba")
['a']

4.4 \
- Escape character
- Cancels the special meaning of a metacharacter so that it is treated as an ordinary character
- Special sequences (the equivalences below hold for ASCII text)
  - \d <==> [0-9], matches a decimal digit
  - \D <==> [^0-9], matches a non-digit character
  - \s <==> [ \t\n\r\f\v], matches a whitespace character
  - \S <==> [^ \t\n\r\f\v], matches a non-whitespace character
  - \w <==> [a-zA-Z0-9_], matches a letter, digit, or underscore
  - \W <==> [^a-zA-Z0-9_], matches any character that is not a letter, digit, or underscore
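
The escape sequences above come without a session in the original. The following sketch tries a few of them on a made-up string; the variable names are only illustrative.

```python
import re

text = "tel: 555_0199\tok"

digits = re.findall(r"\d", text)         # decimal digits
non_word = re.findall(r"\W", text)       # anything that is not a letter, digit or "_"
whitespace = re.findall(r"\s", text)     # space, tab, newline, ...
words = re.findall(r"\w+", text)         # runs of letters, digits and "_"
literal_dot = re.findall(r"\.", "a.b")   # "\." is a real dot, not "any character"

print(digits)       # ['5', '5', '5', '0', '1', '9', '9']
print(non_word)     # [':', ' ', '\t']
print(whitespace)   # [' ', '\t']
print(words)        # ['tel', '555_0199', 'ok']
print(literal_dot)  # ['.']
```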

4.5 Repetition

4.5.1 *
- Number of repetitions: [0, +infinity)

>>> pattern = r"ab*"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abb")
['abb']
>>> re.findall(pattern, "abbbbbbbbbb")
['abbbbbbbbbb']

4.5.2 +
- Number of repetitions: [1, +infinity)

>>> pattern = r"ab+"
>>> re.findall(pattern, "a")
[]
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbbb")
['abbbbbb']

4.5.3 ?
- Number of repetitions: [0, 1], that is, the preceding item is optional

>>> pattern = r"ab?"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbb")
['ab']

4.5.4 {m,n}
- {m,n} repeat count: [m, n]
- {m} repeat count: exactly m
- {m,} repeat count: [m, +infinity)
- m defaults to 0 when it is omitted, as in {,n}

>>> pattern = r"\d{1,3}"
>>> re.findall(pattern, "1234")
['123', '4']
>>> pattern = r"\d{1,}"
>>> re.findall(pattern, "1234")
['1234']
>>> pattern = r"\d{1}"
>>> re.findall(pattern, "1234")
['1', '2', '3', '4']
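
The sessions above do not cover the case where m is left out. A small sketch, using a made-up test string, suggests that {,2} behaves like {0,2}:

```python
import re

text = "a ab abb abbb"

omitted_m = re.findall(r"ab{,2}", text)    # lower bound omitted
explicit_m = re.findall(r"ab{0,2}", text)  # lower bound written out

print(omitted_m)   # ['a', 'ab', 'abb', 'abb']
print(explicit_m)  # ['a', 'ab', 'abb', 'abb']
```

In "abbb" only the first two b characters are consumed, because the upper bound is 2.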

5. Compiling Regular Expressions

5.1 Compiling
- The re module provides an interface to the regular expression engine;
  you can compile a pattern string into a pattern object

>>> import re
>>> telpatternstring = r"\d{3}"
>>> telpattern = re.compile(telpatternstring)
>>> telpattern
<_sre.SRE_Pattern object at 0x01806170>
>>> telpattern.findall("1")
[]
>>> telpattern.findall("123")
['123']
>>> telpattern.findall("1234")
['123']

5.2 Using flags at compile time
- Ignore case

>>> import re
>>> namepatternstring = r"[a-z]{3}"
>>> namepattern = re.compile(namepatternstring, re.IGNORECASE)
>>> namepattern.findall("abc")
['abc']
>>> namepattern.findall("AbC")
['AbC']

5.3 The backslash problem
- When the string is prefixed with "r" (a raw string), Python does not treat backslashes in it specially, so they reach the regular expression engine unchanged

>>> pattern = r"\\"
>>> re.findall(pattern, r"C:\dirA")
['\\']
>>> pattern = "\\"
>>> re.findall(pattern, r"C:\dirA")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python\install-2.7\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "D:\Python\install-2.7\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
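
To make the two layers of escaping in the session above explicit, here is a minimal sketch. It assumes nothing beyond the re module; the variable names are illustrative, and the exact error message differs between Python versions, so it is not reproduced here.

```python
import re

raw_pattern = r"\\"      # two characters: backslash, backslash
plain_pattern = "\\\\"   # the same two characters without the r prefix

path = r"C:\dirA"

same = raw_pattern == plain_pattern    # both spell the same pattern
found = re.findall(raw_pattern, path)  # the regex "\\" matches one backslash

# A pattern consisting of a single backslash is an incomplete escape.
failed = False
try:
    re.compile("\\")
except re.error:
    failed = True

print(same, len(raw_pattern))  # True 2
print(found)                   # ['\\']
print(failed)                  # True
```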

6. Some Methods of Regex Objects
These are also available as module-level functions such as re.match(), which the sessions below use.
- match() matches the rule only at the beginning of the string; returns None on failure
- search() matches the rule at any position in the string
- findall() returns all strings that match the rule as a list
- finditer() returns the matches as an iterator of match objects
- sub() substitutes and returns the resulting string
- subn() substitutes and returns (resulting string, number of substitutions)
- split() splits the string

6.1 match()
>>> import re
>>> pattern = r"a"
>>> re.match(pattern, "abc")
<_sre.SRE_Match object at 0x01407fa8>
>>> re.match(pattern, "ba")
>>> re.match(pattern, "bac")
>>>

6.2 search()
>>> import re
>>> pattern = r"a"
>>> re.search(pattern, "abc")
<_sre.SRE_Match object at 0x01419330>
>>> re.search(pattern, "bac")
<_sre.SRE_Match object at 0x01407fa8>
>>> re.search(pattern, "bca")
<_sre.SRE_Match object at 0x01419330>
>>>

6.3 findall()
>>> import re
>>> pattern = r"a"
>>> re.findall(pattern, "abacad")
['a', 'a', 'a']

6.4 finditer()
>>> import re
>>> pattern = r"[0-9]"
>>> re.finditer(pattern, "1234")
<callable-iterator object at 0x017fe990>
>>> for x in re.finditer(pattern, "1234"):
...     print x
...
<_sre.SRE_Match object at 0x01407fa8>
<_sre.SRE_Match object at 0x01419330>
<_sre.SRE_Match object at 0x01407fa8>
<_sre.SRE_Match object at 0x01419330>
>>>

6.5 sub() and subn()
- subn(pattern, repl, string, count=0, flags=0)

>>> re.sub(r"a", "x", "abca")
'xbcx'
>>> re.subn(r"a", "x", "abca")
('xbcx', 2)
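
The count parameter in the signature above is not exercised in the session. A minimal sketch of what it does, reusing the same input string:

```python
import re

# count=1 replaces only the first occurrence; the default 0 means "all".
first_only = re.sub(r"a", "x", "abca", count=1)

# subn() returns the new string together with the number of replacements.
replaced, how_many = re.subn(r"a", "x", "abca")

print(first_only)          # xbca
print(replaced, how_many)  # xbcx 2
```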

6.6 split()
- split(pattern, string, maxsplit=0, flags=0)

>>> re.split(r"[^\d]", "1999-09/19 23:34:59")
['1999', '09', '19', '23', '34', '59']
>>> re.split(r"[^\d]", "1+2+3-4*5")
['1', '2', '3', '4', '5']
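
Two details of split() are easy to miss: the maxsplit parameter, and the empty strings produced when separators sit next to each other. The sketch below illustrates both with sample inputs chosen for this purpose.

```python
import re

stamp = "1999-09/19 23:34:59"

# maxsplit=3 stops after three cuts; the rest of the string is kept whole.
date_then_time = re.split(r"\D", stamp, maxsplit=3)

# Adjacent separators yield empty strings ...
with_gaps = re.split(r"\D", "1 + 2")
# ... unless the pattern consumes a whole run of separators at once.
without_gaps = re.split(r"\D+", "1 + 2")

print(date_then_time)  # ['1999', '09', '19', '23:34:59']
print(with_gaps)       # ['1', '', '', '2']
print(without_gaps)    # ['1', '2']
```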

7. Some Methods of Match Objects
- group() returns the matched string: obj.group()
- start() returns the start position of the match
- end() returns the end position of the match
- span() returns (start position, end position)
- Check whether the match object is None to determine whether the match succeeded.
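
The methods listed above can be tried out as follows. This is only a sketch on a made-up input string, with illustrative variable names.

```python
import re

m = re.search(r"\d+", "abc123def")
if m is not None:  # search() returns None when nothing matches
    info = (m.group(), m.start(), m.end(), m.span())

no_match = re.search(r"\d+", "abcdef")

print(info)      # ('123', 3, 6, (3, 6))
print(no_match)  # None
```

The end position is exclusive, so the matched text is the slice [3:6] of the input.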

8. re Module Attributes
- Compilation flags
  - DOTALL / S: makes . match every character, including newline
  - IGNORECASE / I: ignores case
  - LOCALE / L: locale-dependent matching
  - MULTILINE / M: multi-line matching, affects ^ and $
  - VERBOSE / X: ignores whitespace and newlines in the pattern, so a regular expression can be written across several lines

# re.S
>>> re.findall(r"a.b", "a\nb")
[]
>>> re.findall(r"a.b", "a\nb", re.S)
['a\nb']

# re.M
>>> s = """
... line1:a1
... line2:a2
... line3:a3
... """
>>> s
'\nline1:a1\nline2:a2\nline3:a3\n'
>>> re.findall(r"^line[0-9]", s)
[]
>>> re.findall(r"^line[0-9]", s, re.M)
['line1', 'line2', 'line3']

# re.X
>>> telpatternstr = r"""
... \d{3,4}
... -?
... \d{7}
... """
>>> telpatternstr
'\n\\d{3,4}\n-?\n\\d{7}\n'
>>> re.findall(telpatternstr, "011-1234567")
[]
>>> re.findall(telpatternstr, "011-1234567", re.X)
['011-1234567']

9. Grouping
- ()
- (pattern1|pattern2) matches either one of the two
- When the pattern contains a group, findall() returns the content of the group rather than the whole match

# Extract URLs from links
>>> s = """
... <a href="www.baidu.com">baidu</a>
... <a href="www.sina.com.cn">sina</a>
... """
>>> print s

<a href="www.baidu.com">baidu</a>
<a href="www.sina.com.cn">sina</a>

>>> re.findall(r"<a href=\".+\">.+</a>", s)
['<a href="www.baidu.com">baidu</a>', '<a href="www.sina.com.cn">sina</a>']
>>> re.findall(r"<a href=\"(.+)\">.+</a>", s)
['www.baidu.com', 'www.sina.com.cn']
>>>
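
The (pattern1|pattern2) form mentioned at the start of this section has no session of its own. A minimal sketch, using a made-up sentence, shows the alternation and how the group changes what is returned:

```python
import re

text = "cats dogs birds"

# findall() returns only what the group captured, not the trailing "s".
animals = re.findall(r"(cat|dog)s", text)

# On a match object, group(0) is the whole match and group(1) the first group.
m = re.search(r"(cat|dog)s", text)
whole, inner = m.group(0), m.group(1)

print(animals)       # ['cat', 'dog']
print(whole, inner)  # cats cat
```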

10. A Small Crawler
- Download all the pictures from a Tieba post or a QQ Zone page

- grappicture.py

'''
Created on 2013-10-4
@author: Wuqinfei
'''
# Written for Python 2.7: urllib.urlopen and urllib.urlretrieve
# live in urllib.request in Python 3.
import re
import urllib

# url: web site
# return: the source code fetched from the url
def gethtml(url):
    page = urllib.urlopen(url)  # connect to the url
    html = page.read()          # read it
    return html                 # return the source code

# html: html source code
# return: a list of jpg urls
def getimg(html):
    reg = r'src="(http://[^\s]*\.jpg)" width'
    imgre = re.compile(reg)
    imgurllist = re.findall(imgre, html)
    return imgurllist

# url: download from this url
# name: save the file under this name
def downbyurl(url, name):
    urllib.urlretrieve(url, name)

################################################
if __name__ == "__main__":
    html = gethtml("http://tieba.baidu.com/p/2306540022")
    imgurllist = getimg(html)
    count = 1
    stopnum = 10  # stop after this many pictures
    for imgurl in imgurllist:
        print "Download ...", imgurl
        picturename = "e:\\desktop\\python\\py_src\\jpg\\%s.jpg" % count
        downbyurl(imgurl, picturename)
        count += 1
        if count > stopnum:
            break
    print "The number of pictures =", count - 1
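
The extraction step of the script can be checked without any network access. The sketch below applies the same pattern to a small HTML fragment; the fragment and the example.com addresses are invented for illustration and are not taken from the page the script downloads.

```python
import re

# Same pattern as in getimg(): capture the address of each .jpg image
# whose src attribute is directly followed by a width attribute.
IMG_RE = re.compile(r'src="(http://[^\s]*\.jpg)" width')

def getimg(html):
    """Return the list of jpg URLs found in the given HTML text."""
    return IMG_RE.findall(html)

# Hypothetical sample input.
sample_html = (
    '<img src="http://example.com/a.jpg" width="100">\n'
    '<img src="http://example.com/b.png" width="100">\n'
    '<img src="http://example.com/c.jpg" width="50">\n'
)

urls = getimg(sample_html)
print(urls)  # ['http://example.com/a.jpg', 'http://example.com/c.jpg']
```

The .png image is skipped because the pattern requires the address to end in .jpg.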

