In Python 3, the old urllib and urllib2 modules were merged into a single package, urllib.
(1) Fetching a page

import urllib.request
content = urllib.request.urlopen(url).read().decode("utf-8")
(2) Adding headers

import urllib.request
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0')
req.add_header('Referer', 'http://www.***.com')
my_page = urllib.request.urlopen(req).read().decode("utf-8")
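The headers live on the Request object itself, so the snippet above can be checked without sending any network traffic (http://example.com below is a placeholder URL, not one from the original text):

```python
import urllib.request

# Build a request and attach headers; no request is actually sent.
req = urllib.request.Request("http://example.com")
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) '
               'Gecko/20100101 Firefox/31.0')
req.add_header('Referer', 'http://example.com/referring-page')

# urllib normalizes header names to capitalized form internally,
# so the stored key is 'User-agent'.
print(req.get_header('User-agent'))
print(req.has_header('Referer'))   # True
```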
(3) Setting cookies

import urllib.request
import http.cookiejar

cj = http.cookiejar.LWPCookieJar()
cookie_support = urllib.request.HTTPCookieProcessor(cj)
opener = urllib.request.build_opener(cookie_support, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
FAQ:
1. A request URL containing Chinese characters raises an exception.
Workaround: run the Chinese part of the URL through urllib.parse.quote.
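For example, with a made-up URL whose path contains Chinese (the URL and the word below are illustrative, not from the original text), quoting only the non-ASCII segment keeps the rest of the URL intact:

```python
from urllib.parse import quote

# Percent-encode only the Chinese path segment; ASCII parts stay as-is.
word = "中文"
url = "http://example.com/tag/" + quote(word)
print(url)   # http://example.com/tag/%E4%B8%AD%E6%96%87
```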
About urllib.parse.quote:
It percent-encodes special characters, such as spaces, which are not allowed in a URL.
The usage in Python 2.x is:
urllib.quote(text)
In Python 3.x it is:
urllib.parse.quote(text)
By the standard, URLs may contain only a subset of ASCII characters (letters, digits, and some symbols); other characters (such as Chinese) do not conform to the URL standard, so using them in a URL requires URL encoding.
The part of the URL that passes parameters (the query string) has the format name1=value1&name2=value2. If a name or value itself contains an "&" or "=" character, it would be misread as a separator, so "&" and "=" inside parameter values must also be encoded.
URL encoding converts each character that needs encoding into %xx form. The encoding is usually based on UTF-8 (though this can depend on the browser and platform).
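As a sketch of both points above, urllib.parse.urlencode builds a query string and percent-encodes "&" and "=" inside values in one step, while quote_plus does the same for a single value:

```python
from urllib.parse import urlencode, quote_plus

# Values containing '&' or '=' are escaped so they cannot be
# mistaken for parameter separators in the query string.
params = {"name": "a&b", "value": "x=y"}
print(urlencode(params))     # name=a%26b&value=x%3Dy
print(quote_plus("a&b=c"))   # a%26b%3Dc
```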
2. Page parsing raises a decoding exception.
Workaround: urllib.request.urlopen(url).read().decode("utf-8", "ignore"), which skips bytes that cannot be decoded.
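The effect of the "ignore" error handler can be seen on bytes that are not valid UTF-8 (the byte string below is a made-up sample, not real page data):

```python
# A byte string with an invalid UTF-8 byte (\xff) in the middle.
data = b"hello \xff world"

# Strict decoding raises UnicodeDecodeError:
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# 'ignore' silently drops the undecodable byte instead:
print(data.decode("utf-8", "ignore"))   # hello  world
```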
Useful Links:
http://blog.csdn.net/pi9nc/article/details/9734437
http://www.pythonclub.org/python-network-application/observer-spider
Writing Python crawlers with urllib