17 - Python: Regular Expressions

Last Update:2018-07-28 Source: Internet

Author: User


Regular Expressions
-Introduction to the re module
-Defining a pattern: patternname = r"abc ..."

1. Concept
-A regular expression (RE) is a small, highly specialized language.
-It is embedded in Python and implemented through the re module.

2. Role
Process strings:
-Match
-Replace
-Split

3. Character matching
-Ordinary characters
-Metacharacters
. ^ $ * + ? {} [] \ | ()

# Match ordinary characters
>>> import re
>>> pattern = r"ab"
>>> re.findall(pattern, "123abc")
['ab']

4. Metacharacters

4.0 .
-Matches any single character (except a newline, by default)

4.1 []
-Matches any one character from the set
-Commonly used to specify a character set: [abc], [0-9], [a-zA-Z]
-Most metacharacters are treated as ordinary characters inside a set: [abc$]
-Complement (negated) set: [^a-z]

>>> import re
# Character set
>>> pattern = "[a-z]"
>>> re.findall(pattern, "abc")
['a', 'b', 'c']
# Complement set
>>> pattern = "[^a-z]"
>>> re.findall(pattern, "abc")
[]
# Special characters
>>> pattern = "[a^$]"
>>> re.findall(pattern, "abc^$")
['a', '^', '$']
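
A further sketch of character sets on a slightly longer input. The sample strings are made up for illustration; only re.findall() from the standard library is used.

```python
import re

text = "Room 4B, floor 12"  # hypothetical sample string

# Any ASCII letter, upper or lower case
letters = re.findall(r"[a-zA-Z]", text)

# Complement set: anything that is neither a letter nor a digit
non_alnum = re.findall(r"[^a-zA-Z0-9]", text)

# Inside a set, . and $ are ordinary characters
literal = re.findall(r"[.$]", "a.b$c")

print(letters)
print(non_alnum)
print(literal)
```

Note that ^ only negates the set when it is the first character after [; elsewhere in the set it stands for itself.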

4.2 ^
-Matches the beginning of the line

>>> pattern = "^a"
>>> re.findall(pattern, "baaa")
[]
>>> re.findall(pattern, "abbb")
['a']

4.3 $
-Matches the end of the line

>>> pattern = "a$"
>>> re.findall(pattern, "aaab")
[]
>>> re.findall(pattern, "bbba")
['a']
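
The two anchors are often combined to test a whole string. A minimal sketch follows; the helper names are hypothetical and chosen only for this example.

```python
import re

def starts_with_digit(s):
    # ^ anchors the pattern to the start of the string
    return re.search(r"^\d", s) is not None

def ends_with_py(s):
    # $ anchors the pattern to the end; \. is a literal dot
    return re.search(r"\.py$", s) is not None

def is_all_digits(s):
    # ^ and $ together: the whole string must be digits
    return re.search(r"^\d+$", s) is not None

print(starts_with_digit("1abc"))
print(ends_with_py("main.pyc"))
print(is_all_digits("2018"))
```

One caveat: $ also matches just before a single trailing newline, so is_all_digits("2018\n") is true as well.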

4.4 \
-Escape character
-Cancels the special meaning of a metacharacter so that it is treated as an ordinary character
-Gives some ordinary characters a special meaning:
  -\d <==> [0-9], matches a decimal digit
  -\D <==> [^0-9], matches a non-digit character
  -\s <==> [ \t\n\r\f\v], matches a whitespace character
  -\S <==> [^ \t\n\r\f\v], matches a non-whitespace character
  -\w <==> [a-zA-Z0-9_], matches a letter, digit, or underscore
  -\W <==> [^a-zA-Z0-9_], matches any other character
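
The shorthand classes above can be tried out as follows. This is a minimal sketch on a made-up ASCII string. Note that in Python 3 the classes \d, \s and \w also match non-ASCII digits, whitespace and letters in str patterns unless re.ASCII is passed, so the equivalences above hold exactly only for ASCII text.

```python
import re

text = "id_7 cost: 42\n"  # hypothetical sample string

digits = re.findall(r"\d", text)       # single decimal digits
non_digits = re.findall(r"\D+", text)  # runs of non-digit characters
spaces = re.findall(r"\s", text)       # single whitespace characters
words = re.findall(r"\w+", text)       # runs of letters, digits, underscore
non_words = re.findall(r"\W+", text)   # runs of everything else

print(digits)
print(words)
```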

4.5 Repeat

4.5.1 *
-Number of repetitions: [0, +infinity)

>>> pattern = r"ab*"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abb")
['abb']
>>> re.findall(pattern, "abbbbbbbbbb")
['abbbbbbbbbb']

4.5.2 +
-Number of repetitions: [1, +infinity)

>>> pattern = r"ab+"
>>> re.findall(pattern, "a")
[]
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbbb")
['abbbbbb']

4.5.3 ?
-Number of repetitions: [0, 1], that is, the item is optional

>>> pattern = r"ab?"
>>> re.findall(pattern, "a")
['a']
>>> re.findall(pattern, "ab")
['ab']
>>> re.findall(pattern, "abbbbb")
['ab']

4.5.4 {m,n}
-{m,n} number of repetitions: [m, n]
-{m} number of repetitions: exactly m
-{m,} number of repetitions: [m, +infinity)
-If m is omitted, as in {,n}, it defaults to 0

>>> pattern = r"\d{1,3}"
>>> re.findall(pattern, "1234")
['123', '4']
>>> pattern = r"\d{1,}"
>>> re.findall(pattern, "1234")
['1234']
>>> pattern = r"\d{1}"
>>> re.findall(pattern, "1234")
['1', '2', '3', '4']
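
The repetition operators above are greedy: each one takes as many characters as it can while still letting the whole pattern match. The sketch below shows this, together with {m,n} used on made-up phone-like numbers; the sample text and the number format are assumptions for illustration only.

```python
import re

# * is greedy: every match takes all the b characters that follow the a
greedy = re.findall(r"ab*", "abbb ab a")

# 3 or 4 digits, a hyphen, then 7 or 8 digits (hypothetical number format)
text = "tel: 010-12345678, 0755-1234567, 12-34"
numbers = re.findall(r"\d{3,4}-\d{7,8}", text)

# {2} means exactly two; the leftover "5" cannot form a match
exact = re.findall(r"\d{2}", "12345")

print(greedy)
print(numbers)
print(exact)
```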

5. Compiling regular expressions

5.1 Compiling
-The re module provides an interface to the regular expression engine;
you can compile a pattern string into a pattern object

>>> import re
>>> telpatternstring = r"\d{3}"
>>> telpattern = re.compile(telpatternstring)
>>> telpattern
<_sre.SRE_Pattern object at 0x01806170>
>>> telpattern.findall("1")
[]
>>> telpattern.findall("123")
['123']
>>> telpattern.findall("1234")
['123']
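
Compiling pays off mainly when the same pattern is applied to many strings. A minimal sketch, with a made-up list of inputs:

```python
import re

# Compile once, reuse for every input
tel_pattern = re.compile(r"\d{3}")

lines = ["1", "123", "1234", "12 345 6789"]  # hypothetical inputs
results = [tel_pattern.findall(line) for line in lines]

print(tel_pattern.pattern)  # the original pattern string
print(results)
```

The compiled object offers the same methods as the module-level functions (findall(), search(), sub(), and so on), minus the pattern argument.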

5.2 Using flags at compile time
-Ignore case

>>> import re
>>> namepatternstring = r"[a-z]{3}"
>>> namepattern = re.compile(namepatternstring, re.IGNORECASE)
>>> namepattern.findall("ABC")
['ABC']
>>> namepattern.findall("AbC")
['AbC']

5.3 Backslash trouble
-With the r prefix before a string literal (a raw string), Python does not treat backslashes specially, so they reach the regular expression engine unchanged

>>> pattern = r"\\"
>>> re.findall(pattern, "C:\dirA")
['\\']
>>> pattern = "\\"
>>> re.findall(pattern, "C:\dirA")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\Python\install-2.7\lib\re.py", line 177, in findall
    return _compile(pattern, flags).findall(string)
  File "D:\Python\install-2.7\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: bogus escape (end of line)
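
The two layers of escaping (Python string literal first, regular expression second) can be checked directly. This is a minimal sketch; the variable names are made up, and re.error is the exception class the re module raises for an invalid pattern.

```python
import re

path = r"C:\dirA"        # raw literal: the backslash is kept as typed
same_path = "C:\\dirA"   # same string, written with an escaped backslash

raw_pattern = r"\\"      # regex \\ : matches one literal backslash
plain_pattern = "\\\\"   # without the r prefix, four backslashes are needed

found = re.findall(raw_pattern, path)

try:
    re.compile("\\")     # the regex is a lone backslash, which is invalid
    lone_backslash_ok = True
except re.error:
    lone_backslash_ok = False

print(found)
print(lone_backslash_ok)
```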

6. Some methods of the regex object
-match(): matches the pattern only at the beginning of the string; returns None on failure
-search(): matches the pattern at any position in the string
-findall(): returns all substrings that match the pattern as a list
-finditer(): returns the matches as an iterator of match objects
-sub(): substitutes and returns the resulting string
-subn(): substitutes and returns (resulting string, number of substitutions)
-split(): splits the string

6.1 match()
>>> import re
>>> pattern = r"a"
>>> re.match(pattern, "abc")
<_sre.SRE_Match object at 0x01407fa8>
>>> re.match(pattern, "ba")
>>> re.match(pattern, "bac")
>>>

6.2 search()
>>> import re
>>> pattern = r"a"
>>> re.search(pattern, "abc")
<_sre.SRE_Match object at 0x01419330>
>>> re.search(pattern, "bac")
<_sre.SRE_Match object at 0x01407fa8>
>>> re.search(pattern, "bca")
<_sre.SRE_Match object at 0x01419330>
>>>

6.3 findall()
>>> import re
>>> pattern = r"a"
>>> re.findall(pattern, "abacad")
['a', 'a', 'a']

6.4 finditer()
>>> import re
>>> pattern = r"[0-9]"
>>> re.finditer(pattern, "1234")
<callable-iterator object at 0x017fe990>
>>> for x in re.finditer(pattern, "1234"):
...     print x
...
<_sre.SRE_Match object at 0x01407fa8>
<_sre.SRE_Match object at 0x01419330>
<_sre.SRE_Match object at 0x01407fa8>
<_sre.SRE_Match object at 0x01419330>
>>>
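
Since finditer() yields match objects rather than strings, it is the method to use when positions are needed as well as text. A minimal sketch on a made-up string, using the match object's group(), start() and end() methods:

```python
import re

text = "a12b345"  # hypothetical sample string

found = []
for m in re.finditer(r"[0-9]+", text):
    # Each m is a match object: collect the text and where it was found
    found.append((m.group(), m.start(), m.end()))

print(found)
```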

6.5 sub() and subn()
-subn(pattern, repl, string, count=0, flags=0)

>>> re.sub(r"a", "x", "abca")
'xbcx'
>>> re.subn(r"a", "x", "abca")
('xbcx', 2)

6.6 split()
-split(pattern, string, maxsplit=0, flags=0)

>>> re.split(r"[^\d]", "1999-09/19 23:34:59")
['1999', '09', '19', '23', '34', '59']
>>> re.split(r"[^\d]", "1+2+3-4*5")
['1', '2', '3', '4', '5']
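
The count and maxsplit parameters in the signatures of sub(), subn() and split() limit how many replacements or splits are made; 0, the default, means no limit. A minimal sketch with made-up inputs:

```python
import re

# Replace only the first occurrence
limited = re.sub(r"a", "x", "abca", count=1)

# subn() also reports how many replacements were made
pair = re.subn(r"\d", "#", "a1b22")

# Split at most twice; the rest of the string is left intact
parts = re.split(r"[^\d]", "1999-09/19 23:34:59", maxsplit=2)

print(limited)
print(pair)
print(parts)
```

Passing count and maxsplit by keyword, as here, keeps the call readable and avoids confusing them with the flags argument.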

7. Some methods of the match object
-group(): returns the matched string, e.g. obj.group()
-start(): returns the start position of the match
-end(): returns the end position of the match
-span(): returns (start position, end position)
-Check whether the match object is None to determine whether the match succeeded.
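
The methods listed above can be sketched as follows, including the None check. The sample string is made up for illustration.

```python
import re

m = re.search(r"\d+", "abc123def")
if m is not None:            # always check before using the match object
    matched = m.group()      # the matched text
    start, end = m.start(), m.end()
    span = m.span()          # (start, end) as a tuple
else:
    matched, start, end, span = None, None, None, None

# match() only tries the beginning of the string, so this one fails
at_start = re.match(r"\d+", "abc123def")

print(matched)
print(span)
print(at_start)
```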

8. re attributes
-Compilation flags
  -DOTALL/S: makes . match every character, including a newline
  -IGNORECASE/I: ignores case
  -LOCALE/L: locale-dependent matching
  -MULTILINE/M: multi-line matching; affects ^ and $
  -VERBOSE/X: ignores whitespace and newlines in the pattern, so a regular expression can be written across several lines

# re.S
>>> re.findall(r"a.b", "a\nb")
[]
>>> re.findall(r"a.b", "a\nb", re.S)
['a\nb']

# re.M
>>> s = """
... line1:a1
... line2:a2
... line3:a3
... """
>>> s
'\nline1:a1\nline2:a2\nline3:a3\n'
>>> re.findall(r"^line[0-9]", s)
[]
>>> re.findall(r"^line[0-9]", s, re.M)
['line1', 'line2', 'line3']

# re.X
>>> telpatternstr = r"""
... \d{3,4}
... -?
... \d{7}
... """
>>> telpatternstr
'\n\\d{3,4}\n-?\n\\d{7}\n'
>>> re.findall(telpatternstr, "011-1234567")
[]
>>> re.findall(telpatternstr, "011-1234567", re.X)
['011-1234567']
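
Two details worth adding to the flag examples: with re.X a pattern can also carry # comments, and several flags can be combined with the | operator. A minimal sketch; the comments inside the pattern describe an assumed phone number layout, not a real standard.

```python
import re

tel_pattern = re.compile(r"""
    \d{3,4}   # area code (assumed layout)
    -?        # optional hyphen
    \d{7}     # local number
""", re.X)

with_hyphen = tel_pattern.findall("011-1234567")
without_hyphen = tel_pattern.findall("0111234567")

# re.I and re.M together: ignore case and let ^ match at each line start
combined = re.findall(r"^line[0-9]", "LINE1:a1\nline2:a2", re.I | re.M)

print(with_hyphen)
print(without_hyphen)
print(combined)
```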

9. Grouping
-()
-(pattern1|pattern2): alternation, matches either one of the two
-When the pattern contains a group, findall() returns the content of the group instead of the whole match

# Grab the URLs
>>> s = """
... <a href="www.baidu.com">baidu</a>
... <a href="www.sina.com.cn">sina</a>
... """
>>> print s

<a href="www.baidu.com">baidu</a>
<a href="www.sina.com.cn">sina</a>

>>> re.findall(r"<a href=\".+\">.+</a>", s)
['<a href="www.baidu.com">baidu</a>', '<a href="www.sina.com.cn">sina</a>']
>>> re.findall(r"<a href=\"(.+)\">.+</a>", s)
['www.baidu.com', 'www.sina.com.cn']
>>>
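
Two related behaviours, sketched on made-up inputs: with more than one group, findall() returns a tuple per match, and on a match object group(1) gives the first group while group() gives the whole match.

```python
import re

html = ('<a href="www.baidu.com">baidu</a>\n'
        '<a href="www.sina.com.cn">sina</a>')

# Two groups: each result is a (url, link text) tuple
pairs = re.findall(r'<a href="(.+)">(.+)</a>', html)

# Alternation inside a group: either "cat" or "dog", then an optional s
m = re.search(r"(cat|dog)s?", "two dogs")
whole = m.group()    # the entire match
animal = m.group(1)  # only what the first group captured

# With one group, findall() returns just the group content
animals = re.findall(r"(cat|dog)s", "cats and dogs")

print(pairs)
print(whole, animal)
print(animals)
```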

10. A small crawler
-Download all the pictures from a Tieba post or a QQ Zone page

-grappicture.py

'''
Created on 2013-10-4

@author: Wuqinfei
'''
import re
import urllib


# url: web site
# return: get src code from the url
def gethtml(url):
    page = urllib.urlopen(url)  # connect to the url
    html = page.read()          # read it
    return html                 # return src code


# html: html src code
# return: a list of jpg urls
def getimg(html):
    reg = r'src="(http://[^\s]*\.jpg)" width'
    imgre = re.compile(reg)
    imgurllist = re.findall(imgre, html)
    return imgurllist


# url: download by this url
# name: saved by this name in the current dir
def downbyurl(url, name):
    urllib.urlretrieve(url, name)


################################################

if __name__ == "__main__":
    html = gethtml("http://tieba.baidu.com/p/2306540022")
    imgurllist = getimg(html)
    count = 1
    stopnum = 10
    for imgurl in imgurllist:
        print "Download ...", imgurl
        picturename = "e:\\desktop\\python\\py_src\\jpg\\%s.jpg" % count
        downbyurl(imgurl, picturename)
        count += 1
        if count > stopnum:
            break
    print "The number of pictures =", count - 1
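
The script above is written for Python 2. The extraction step can be tried without any network access, as in the sketch below. It assumes the image pattern is src="(http://[^\s]*\.jpg)" width; the function name get_img_urls and the sample HTML are made up for illustration.

```python
import re

# Assumed pattern: capture the .jpg URL of an img tag that is followed by a width attribute
IMG_RE = re.compile(r'src="(http://[^\s]*\.jpg)" width')

def get_img_urls(html):
    # findall() returns only the group, i.e. the URL itself
    return IMG_RE.findall(html)

# Hypothetical page fragment: one .jpg and one .png image
sample = ('<img src="http://example.com/a.jpg" width="560">'
          '<img src="http://example.com/b.png" width="1">')

urls = get_img_urls(sample)
print(urls)
```

In Python 3 the download functions used by the script live in urllib.request (urlopen() and urlretrieve()), and the bytes returned by read() would need to be decoded before being passed to a str pattern like this one.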

