Python Crawler Learning, Chapter 5: Regular Expressions

Source: Internet
Author: User
Tags: modifier, printable characters

Regular expression basics
  1. Atoms
    Atoms are the most basic building blocks of a regular expression; every regular expression contains at least one atom. Common kinds of atoms are:
    • Ordinary characters as atoms
    • Non-printable characters as atoms
    • Universal characters as atoms
    • Atom tables
    • Ordinary characters as atoms
      Ordinary characters such as digits, uppercase and lowercase letters, and underscores can all be used as atoms. For example, the following program uses "yue" as the pattern, which is made up of three atoms: 'y', 'u', and 'e'.
      import re
      pattern="yue"
      string="http://yum.iqianyue.com"
      result1=re.search(pattern,string)
      print(result1)
    • Non-printable characters as atoms
      Non-printable characters are the symbols used for formatting control in a string, such as the line break.
      import re
      pattern="\n"
      string='''http://yum.iqianyue.com
      http://baidu.com'''
      result1=re.search(pattern,string)
      print(result1)
    • Universal characters as atoms
      A universal character is an atom that matches a whole class of characters. Atoms of this kind are used often in real projects.
      | Symbol | Meaning |
      | \w | Matches any letter, digit, or underscore |
      | \W | Matches any character except letters, digits, and underscores |
      | \d | Matches any decimal digit |
      | \D | Matches any character except a decimal digit |
      | \s | Matches any whitespace character |
      | \S | Matches any character except whitespace |
      import re
      pattern=r"\w\dpython\w"
      string="Abcdfphp345python_py"
      result1=re.search(pattern,string)
      print(result1)
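To make the lowercase/uppercase pairing in the table concrete, here is a minimal sketch. The sample string is made up for illustration; each uppercase class collects exactly the characters its lowercase counterpart leaves out.

```python
import re

# Hypothetical sample: letters, an underscore, digits, a space and a tab.
sample = "id_42 ok\t7"

# Lowercase classes match members of the class.
word_chars = re.findall(r"\w", sample)
digits = re.findall(r"\d", sample)
spaces = re.findall(r"\s", sample)

# Uppercase classes match everything outside the class.
non_word = re.findall(r"\W", sample)
non_digits = re.findall(r"\D", sample)
non_spaces = re.findall(r"\S", sample)

print(digits)      # the three digit characters
print(spaces)      # the space and the tab
print(non_word)    # same characters as `spaces` for this sample
```

Writing the patterns as raw strings (the `r` prefix) keeps Python from treating the backslash as a string escape.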
    • Atom tables
      An atom table defines a set of atoms of equal status and matches any one of them. In Python, an atom table is written with [], for example [xyz], which defines three atoms of equal status. If we define the regular expression "[xyz]py" and the source string is "xpython", then matching with re.search() yields "xpy", because the match succeeds as long as the character before "py" is one of the letters x, y, or z.
      Similarly, [^] matches any character other than the atoms in the brackets; for example, "[^xyz]py" can match "apy" but not "xpy".
      import re
      pattern1=r"\w\dpython[xyz]\w"
      pattern2=r"\w\dpython[^xyz]\w"
      pattern3=r"\w\dpython[xyz]\W"
      string="abcdfphp345pythony_py"
      result1=re.search(pattern1,string)
      result2=re.search(pattern2,string)
      result3=re.search(pattern3,string)
      print(result1)
      print(result2)
      print(result3)
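The "[xyz]py" and "[^xyz]py" cases described in the prose can be checked directly. This is a small sketch; the word list is an assumption chosen to cover both outcomes.

```python
import re

# Hypothetical test words.
words = ["xpython", "apython", "ypy", "zzz"]

# Words where "py" is preceded by one of x, y, z.
in_table = [w for w in words if re.search(r"[xyz]py", w)]

# Words where "py" is preceded by some character other than x, y, z.
not_in_table = [w for w in words if re.search(r"[^xyz]py", w)]

print(in_table)
print(not_in_table)
```

Note that "[^xyz]" still has to consume one character, so it cannot match at the very start of a string that begins with "py".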
  2. Metacharacters
    Metacharacters are characters that carry a special meaning in a regular expression, such as repeating the preceding character n times.
    | Symbol | Meaning |
    | . | Matches any character except a line break |
    | ^ | Matches the start of the string |
    | $ | Matches the end of the string |
    | * | Matches the preceding atom 0, 1, or more times |
    | ? | Matches the preceding atom 0 or 1 times |
    | + | Matches the preceding atom 1 or more times |
    | {n} | The preceding atom appears exactly n times |
    | {n,} | The preceding atom appears at least n times |
    | {n,m} | The preceding atom appears at least n times and at most m times |
    | "|" | Pattern selector |
    | () | Pattern unit |
    Specifically, metacharacters can be divided into arbitrary-match metacharacters, boundary-restricting metacharacters, qualifiers, pattern selectors, pattern units, and so on.
    • Arbitrary-match metacharacter
      The arbitrary-match metacharacter "." matches any single character except a line break. For example, the regular expression ".python..." matches the string "python" preceded by one character and followed by three characters, where each of those characters can be anything except a newline.
      import re
      pattern=".python..."
      string="abcdfphp345pythony_py"
      result1=re.search(pattern,string)
      print(result1)
    • Boundary-restricted meta-characters
      You can use "^" to match the start of the string and "$" to match the end of the string.
      import re
      pattern1="^abd"
      pattern2="^abc"
      pattern3="py$"
      pattern4="ay$"
      string="abcdfphp345pythony_py"
      result1=re.search(pattern1,string)
      result2=re.search(pattern2,string)
      result3=re.search(pattern3,string)
      result4=re.search(pattern4,string)
      print(result1)
      print(result2)
      print(result3)
      print(result4)
    • Qualifier
      Common qualifiers include *, ?, +, {n}, {n,}, and {n,m}.
      import re
      pattern1="py.*n"
      pattern2="cd{2}"
      pattern3="cd{3}"
      pattern4="cd{2,}"
      string="abcdddfphp345pythony_py"
      result1=re.search(pattern1,string)
      result2=re.search(pattern2,string)
      result3=re.search(pattern3,string)
      result4=re.search(pattern4,string)
      print(result1)
      print(result2)
      print(result3)
      print(result4)
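The example above covers `*` and the brace forms; the sketch below puts all six qualifiers side by side. It uses `re.fullmatch()` so that each candidate must match in its entirety; the helper name `accepts` and the candidate list are illustrative assumptions, not part of the original text.

```python
import re

def accepts(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# "c" followed by zero to three "d" characters.
candidates = ["c", "cd", "cdd", "cddd"]

star = accepts(r"cd*", candidates)           # d repeated 0 or more times
optional = accepts(r"cd?", candidates)       # d repeated 0 or 1 times
plus = accepts(r"cd+", candidates)           # d repeated 1 or more times
exactly_two = accepts(r"cd{2}", candidates)  # d repeated exactly 2 times
at_least_two = accepts(r"cd{2,}", candidates)  # d repeated 2 or more times
one_to_two = accepts(r"cd{1,2}", candidates)   # d repeated 1 to 2 times

for name, result in [("*", star), ("?", optional), ("+", plus),
                     ("{2}", exactly_two), ("{2,}", at_least_two),
                     ("{1,2}", one_to_two)]:
    print(name, result)
```

In every pattern the qualifier applies only to the atom directly before it, here the single character "d".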
    • Pattern selector
      The pattern selector "|" lets you set several patterns, any one of which may match. For example, with the regular expression "python|php", both the string "python" and the string "php" satisfy the match.
      import re
      pattern="python|php"
      string="abcdfphp345pythony_py"
      result1=re.search(pattern,string)
      print(result1)
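One detail worth seeing in action: `re.search()` returns only the leftmost match, so with "python|php" it reports whichever alternative occurs first in the string. A short sketch using the same string, with `re.findall()` added to list every alternative that occurs:

```python
import re

pattern = "python|php"
string = "abcdfphp345pythony_py"

# search() stops at the first position where either alternative matches.
first = re.search(pattern, string).group()

# findall() keeps scanning and returns every non-overlapping match.
every = re.findall(pattern, string)

print(first)
print(every)
```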
    • Pattern unit
      The pattern unit "()" combines several atoms into one larger atom; the part enclosed in parentheses is treated as a whole.
      import re
      pattern1="(cd){1,}"
      pattern2="cd{1,}"
      string="abcdcdcdcdfphp345pythony_py"
      result1=re.search(pattern1,string)
      result2=re.search(pattern2,string)
      print(result1)
      print(result2)
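A compact way to see the effect of grouping is to require the whole string to match. The sketch below also shows something the text above does not mention: in Python's `re` module, each pair of parentheses additionally captures the text it matched, which can be read back with `groups()`. The sample strings are assumptions for illustration.

```python
import re

# Grouped: the whole unit "ab" repeats.
grouped = re.fullmatch(r"(ab)+", "ababab")

# Ungrouped: only the "b" repeats, so "ababab" is rejected and "abbb" accepted.
ungrouped = re.fullmatch(r"ab+", "ababab")
ungrouped_b = re.fullmatch(r"ab+", "abbb")

# Parentheses also capture: each group's text is available afterwards.
phone = re.search(r"(\d{3})-(\d{8})", "tel: 021-67282636")
area_code, number = phone.groups()

print(bool(grouped), bool(ungrouped), bool(ungrouped_b))
print(area_code, number)
```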
  3. Pattern modifiers
    A pattern modifier changes how a regular expression is interpreted without changing the expression itself, which makes it possible to adjust the matching result. For example, the pattern modifier I makes matching case-insensitive.
    | Symbol | Meaning |
    | I | Ignore case when matching |
    | M | Multi-line matching |
    | L | Locale-aware matching |
    | U | Parse characters according to Unicode |
    | S | Make "." match any character, including the newline |
    import re
    pattern1="python"
    pattern2="python"
    string="abcdfphp34Pythony_py"
    result1=re.search(pattern1,string)
    result2=re.search(pattern2,string,re.I)
    print(result1)
    print(result2)
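The example above exercises only `re.I`. As a complementary sketch, the following shows `re.M` and `re.S` on a two-line string, and how modifiers can be combined with `|`. The sample text is an assumption.

```python
import re

text = "first line\nsecond line"

# Without re.M, "^" matches only at the very start of the string.
default_starts = re.findall(r"^\w+", text)
# With re.M, "^" also matches right after each newline.
multiline_starts = re.findall(r"^\w+", text, re.M)

# Without re.S, "." stops at the newline, so this cannot span both lines.
dot_default = re.search(r"first.*second", text)
# With re.S, "." matches the newline as well.
dot_all = re.search(r"first.*second", text, re.S)

# Modifiers combine with the bitwise OR operator.
combined = re.search(r"^SECOND", text, re.I | re.M)

print(default_starts, multiline_starts)
print(dot_default, dot_all is not None)
print(combined.group())
```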
  4. Greedy mode and lazy mode
    The core of greedy mode is to match as much as possible, while the core of lazy mode is to match as little as possible.
    import re
    pattern1="p.*y"  # greedy mode
    pattern2="p.*?y"  # lazy mode
    string="abcdfphp345pythony_py"
    result1=re.search(pattern1,string)
    result2=re.search(pattern2,string)
    print(result1)
    print(result2)
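The difference matters most when scraping markup. The sketch below, on a made-up HTML fragment, shows the greedy pattern swallowing everything between the first opening tag and the last closing tag, while the lazy pattern stops at the nearest closing tag.

```python
import re

# Hypothetical HTML fragment with two bold spans.
html = "<b>one</b> and <b>two</b>"

# Greedy: ".*" runs to the last "</b>" in the string.
greedy = re.findall(r"<b>.*</b>", html)

# Lazy: ".*?" stops at the first "</b>" it can reach.
lazy = re.findall(r"<b>.*?</b>", html)

print(greedy)
print(lazy)
```

For a crawler that extracts repeated elements from a page, the lazy form is usually the one that gives one result per element.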
Common regular expression functions
  1. re.match() function
    To match a pattern starting from the beginning of the string, use the re.match() function, whose format is:
    re.match(pattern,string,flag)
    The first parameter is the regular expression, the second is the source string, and the third is optional and represents the flags.
    import re
    string="apythonhellomypythonhispythonourpythonend"
    pattern=".python."
    result=re.match(pattern,string)
    result2=re.match(pattern,string).span()
    print(result)
    print(result2)
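One caveat about chaining `.span()` directly onto `re.match()`: when the pattern does not match at the start, `re.match()` returns `None` and the chained call raises `AttributeError`. A minimal guard might look like this; the helper name `span_at_start` is an assumption for illustration.

```python
import re

def span_at_start(pattern, text):
    """Return the (start, end) span of a match at the start of text, or None."""
    m = re.match(pattern, text)
    return m.span() if m else None

# Matches at position 0: "apythonh".
hit = span_at_start(".python.", "apythonhellomypythonhispythonourpythonend")

# "python" is present, but not one character after the start.
miss = span_at_start(".python.", "hellomypythonhispythonourpythonend")

print(hit)
print(miss)
```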
  2. re.search() function
    The re.search() function scans the whole string and returns the first match it finds.
    import re
    string="hellomypythonhispythonourpythonend"
    pattern=".python."
    result=re.match(pattern,string)
    result2=re.search(pattern,string)
    print(result)
    print(result2)
  3. Global matching function
    Finds every match of a pattern in a string, not just the first one.
    • Use re.compile() to precompile the regular expression.
    • After compiling, use findall() to find all matching results in the source string.
      import re
      string="hellomypythonhispythonourpythonend"
      pattern=re.compile(".python.")  # precompile
      result=pattern.findall(string)  # find all results that fit the pattern
      print(result)

      import re
      string="hellomypythonhistpythonourpythonend"
      pattern=".python."
      result=re.compile(pattern).findall(string)
      print(result)
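When the position of each match matters as well as its text, a compiled pattern's `finditer()` method yields match objects instead of plain strings. A short sketch on the same source string:

```python
import re

string = "hellomypythonhispythonourpythonend"
pattern = re.compile(".python.")

# finditer() yields one match object per non-overlapping match.
found = [(m.group(), m.start()) for m in pattern.finditer(string)]

for text, start in found:
    print(start, text)
```

The texts are the same ones `findall()` returns; the match objects simply carry extra information such as `start()`, `end()`, and `span()`.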
  4. re.sub() function
    To replace the parts of a string that match a regular expression, use the re.sub() function.
    re.sub(pattern,rep,string,max)
    The first parameter is the regular expression, the second is the replacement string, the third is the source string, and the fourth is optional and represents the maximum number of replacements (Python's re module names it count); if it is omitted, every match is replaced.
    import re
    string="hellomypythonhispythonourpythonend"
    pattern="python."
    result1=re.sub(pattern,"php",string)  # replace all
    result2=re.sub(pattern,"php",string,count=2)  # replace at most twice
    print(result1)
    print(result2)
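As a variation on the replacement example, the sketch below replaces only the word "python" itself (without the trailing "."), passes the limit as the `count` keyword, and uses `re.subn()` to find out how many replacements were made. The source string is assumed to contain three occurrences.

```python
import re

text = "hellomypythonhispythonourpythonend"

# Replace every occurrence.
all_replaced = re.sub(r"python", "php", text)

# Replace at most two occurrences, counting from the left.
first_two = re.sub(r"python", "php", text, count=2)

# subn() returns the new string together with the number of replacements.
new_text, replacements = re.subn(r"python", "php", text)

print(all_replaced)
print(first_two)
print(replacements)
```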
Common examples
    • Match URLs with a .com or .cn suffix
      import re
      pattern=r"[a-zA-Z]+://[^\s]*\.(?:com|cn)"  # group the suffixes; [.com|.cn] would be an atom table matching a single character
      string="<a href='http://www.baidu.com'>百度首页</a>"
      result=re.search(pattern,string)
      print(result)
    • Match phone number
      import re
      pattern=r"\d{4}-\d{7}|\d{3}-\d{8}"  # regular expression for matching phone numbers
      string="021-6728263682382265236"
      result=re.search(pattern,string)
      print(result)
    • Match e-mail address
      import re
      pattern=r"\w+([.+-]\w+)*@\w+([.-]\w+)*\.\w+([.-]\w+)*"  # regular expression for matching e-mail addresses
      string="<a href='http://www.baidu.com'>百度首页</a><br><a href='mailto:[email protected]'>电子邮箱地址</a>"
      result=re.search(pattern,string)
      print(result)
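A point that is easy to get wrong when matching ".com" or ".cn" suffixes: square brackets define an atom table, so "[.com|.cn]" matches one single character out of ".", "c", "o", "m", "|", "n", whereas a group with "|" matches a whole suffix. The sketch below compares the two forms; the example.org and example.cn addresses are placeholders chosen for illustration.

```python
import re

# Atom table: the URL may end in any ONE of the listed characters.
atom_table = r"[a-zA-Z]+://[^\s]*[.com|.cn]"
# Group: the URL must end in the whole suffix ".com" or ".cn".
grouped = r"[a-zA-Z]+://[^\s]*\.(?:com|cn)"

text = "see http://example.org for details"

loose = re.search(atom_table, text)   # accepts a truncated .org address
strict = re.search(grouped, text)     # rejects it
good = re.search(grouped, "see http://example.cn/ for details")

print(loose.group())
print(strict)
print(good.group())
```

The `(?:...)` form groups without capturing, which keeps `findall()` returning whole matches rather than just the suffix.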
What is a cookie

HTTP is a stateless protocol, which means that state cannot be maintained between sessions. Cookies make it possible to save that state.
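
To make the idea concrete without any network access, the sketch below parses a made-up `Set-Cookie` header value with the standard library's `http.cookies.SimpleCookie` and rebuilds the `Cookie` header a client would send back on later requests. The cookie name and value are assumptions for illustration.

```python
from http.cookies import SimpleCookie

# Hypothetical Set-Cookie header value, as a server might send after login.
set_cookie_header = "sessionid=abc123; Path=/; HttpOnly"

cookie = SimpleCookie()
cookie.load(set_cookie_header)

session = cookie["sessionid"]

# The client returns the name=value pairs in the Cookie header of later
# requests, which is how the server recognises the same session again.
cookie_header = "; ".join(
    f"{name}={morsel.value}" for name, morsel in cookie.items()
)

print(session.value, session["path"])
print(cookie_header)
```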

CookieJar in practice

import urllib.request
import urllib.parse
import http.cookiejar
url="POST request URL"
postdata=urllib.parse.urlencode({
"username":"user name",
"password":"password"
}).encode("utf-8")
req=urllib.request.Request(url,postdata)
req.add_header('User-Agent','User-Agent value')
cjar=http.cookiejar.CookieJar()  # create a CookieJar object with http.cookiejar.CookieJar()
opener=urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))  # create a cookie processor with HTTPCookieProcessor and build the opener from it
urllib.request.install_opener(opener)  # install the opener globally
file=opener.open(req)
data=file.read()
file=open("file storage path","wb")
file.write(data)
file.close()
url2="web page URL"
data2=urllib.request.urlopen(url2).read()
fh=open("file path",'wb')
fh.write(data2)
fh.close()
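
The opener setup can be isolated into a small helper and inspected without sending any request. This is a sketch under that assumption; the function name `build_cookie_opener` is hypothetical, and the jar stays empty until a response carrying cookies has actually been received.

```python
import http.cookiejar
import urllib.request

def build_cookie_opener():
    """Return an opener that stores received cookies, plus the jar it uses."""
    jar = http.cookiejar.CookieJar()
    processor = urllib.request.HTTPCookieProcessor(jar)
    opener = urllib.request.build_opener(processor)
    return opener, jar

opener, jar = build_cookie_opener()

# No request has been made yet, so the jar holds no cookies.
print(len(jar))

# After a call such as opener.open(req), the stored cookies could be listed:
for stored in jar:
    print(stored.name, stored.value)
```

Keeping a reference to the jar makes it possible to check which cookies a login request produced before relying on them for later pages.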
