Writing Python crawlers using urllib

In Python 3, urllib and urllib2 were merged into a single urllib package; the request functionality now lives in urllib.request.
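For reference, a minimal sketch of the renaming (the URL is a placeholder):

# Python 2:
#   import urllib2
#   urllib2.urlopen("http://www.example.com")
# Python 3: the same call now lives in urllib.request
import urllib.request
urllib.request.urlopen("http://www.example.com")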

(1) Fetching a page

import urllib.request

url = "http://www.example.com"  # placeholder URL
content = urllib.request.urlopen(url).read().decode("utf-8")
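Since urlopen raises on HTTP error statuses and unreachable hosts, it can help to catch urllib's exception types; a minimal sketch, again with a placeholder URL:

import urllib.request
import urllib.error

try:
    content = urllib.request.urlopen("http://www.example.com").read().decode("utf-8")
except urllib.error.HTTPError as e:
    print("server returned an error status:", e.code)
except urllib.error.URLError as e:
    print("could not reach the server:", e.reason)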

(2) Adding request headers

import urllib.request

req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0')
req.add_header('Referer', 'http://www.***.com')
my_page = urllib.request.urlopen(req).read().decode("utf-8")
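The same headers can also be supplied when constructing the Request; a small equivalent sketch (the URL and Referer here are placeholders):

import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0',
    'Referer': 'http://www.example.com',
}
req = urllib.request.Request("http://www.example.com", headers=headers)
my_page = urllib.request.urlopen(req).read().decode("utf-8")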

(3) Setting cookies

import urllib.request
import http.cookiejar

cj = http.cookiejar.LWPCookieJar()
cookie_support = urllib.request.HTTPCookieProcessor(cj)
opener = urllib.request.build_opener(cookie_support, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
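Once the opener is installed, every later urlopen call goes through it, so cookies the server sets are stored in cj and sent back on subsequent requests. A minimal sketch with a placeholder URL:

# Plain urlopen now carries the cookie jar installed above
urllib.request.urlopen("http://www.example.com")
for cookie in cj:
    print(cookie.name, cookie.value)  # cookies collected from the response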

FAQ:

1. A request URL containing Chinese characters raises an exception

Workaround: pass the Chinese part through urllib.parse.quote first.

About urllib.parse.quote:

quote escapes special characters that are not allowed in a URL, such as spaces.
In Python 2.x the call is urllib.quote(text); in Python 3.x it is urllib.parse.quote(text).
By the standard, URLs may contain only a subset of ASCII characters (letters, digits, and a few symbols); other characters, such as Chinese, do not comply with the URL standard, so using them in a URL requires URL encoding.
The part of the URL that passes parameters (the query string) has the format name1=value1&name2=value2.

If a name or value itself contains "&" or "=", the query string becomes ambiguous, so those characters must be encoded as well. URL encoding converts each character that needs escaping into a %XX sequence; the bytes are usually UTF-8, although this depends on the browser and platform.
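A short sketch of both cases: quote escapes a single value (UTF-8 based), and urllib.parse.urlencode builds a whole query string, escaping "&" and "=" inside names and values automatically (the search URL is a placeholder):

import urllib.parse

urllib.parse.quote("你好")   # -> '%E4%BD%A0%E5%A5%BD'

# urlencode escapes every name and value, including '&' and '='
params = urllib.parse.urlencode({"q": "a&b=c", "lang": "中文"})
url = "http://www.example.com/search?" + params   # q=a%26b%3Dc&lang=%E4%B8%AD%E6%96%87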

2. Exception when decoding a Web page

Workaround: call urllib.request.urlopen(url).read().decode("utf-8", "ignore") so that bytes that cannot be decoded are skipped.
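For comparison, a tiny sketch of the errors argument to decode: "ignore" silently drops undecodable bytes, while "replace" substitutes U+FFFD so the damage stays visible:

raw = b"abc\xff\xfedef"            # not valid UTF-8
raw.decode("utf-8", "ignore")      # -> 'abcdef'
raw.decode("utf-8", "replace")     # -> 'abc\ufffd\ufffddef'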

Useful Links:

http://blog.csdn.net/pi9nc/article/details/9734437

http://www.pythonclub.org/python-network-application/observer-spider
