The urllib library in Python 2/3

Source: Internet
Author: User
Tags: urlencode, website, server

Summary: Introduces how the urllib library changed between Python versions, and explains common usage of the urllib library under Python 3.x.

urllib library quick-reference table

python2.x                      python3.x
urllib                         urllib.request, urllib.error, urllib.parse
urllib2                        urllib.request, urllib.error
urllib2.urlopen                urllib.request.urlopen
urllib.urlencode               urllib.parse.urlencode
urllib.quote                   urllib.request.quote
urllib2.Request                urllib.request.Request
urlparse                       urllib.parse
urllib.urlretrieve             urllib.request.urlretrieve
urllib2.URLError               urllib.error.URLError
cookielib.CookieJar            http.cookiejar.CookieJar

The urllib library is a Python standard library module for working with URLs and fetching web pages; libraries with similar functionality include requests and httplib2.

In Python 2.x this functionality is split between urllib and urllib2; in Python 3.x it is consolidated into the urllib package. The table above lists the most common changes, so you can quickly write the equivalent code for either version.

By comparison, Python 3.x handles Chinese text better than Python 2.x, so this post uses Python 3.x to introduce some common uses of the urllib library.

Send Request
import urllib.request

r = urllib.request.urlopen("http://www.python.org/")

First, import the urllib.request module, then use urlopen() to send a request to the URL given as its parameter; it returns an http.client.HTTPResponse object.

In urlopen(), the timeout parameter sets the number of seconds to wait before giving up on a response. You can also use r.info(), r.getcode(), and r.geturl() to obtain the response headers, the status code, and the URL of the current page.
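
For example, a minimal sketch (not from the original article) combining timeout with these helpers:

import urllib.request

r = urllib.request.urlopen("http://www.python.org/", timeout=10)  # give up after 10 seconds
print(r.info())     # response headers
print(r.getcode())  # HTTP status code, e.g. 200
print(r.geturl())   # final URL of the page (after any redirects)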

Read response content
import urllib.request

url = "http://www.python.org/"
with urllib.request.urlopen(url) as r:
    r.read()

Use r.read() to read the response body into memory; this is the page's source code (the same thing the browser's "view page source" function shows). The returned bytes can then be decoded with decode() as needed.
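
A small sketch, assuming the page is UTF-8 encoded:

import urllib.request

with urllib.request.urlopen("http://www.python.org/") as r:
    html = r.read().decode('utf-8')  # bytes -> str
print(html[:100])  # first 100 characters of the page source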

Passing URL parameters
import urllib.request
import urllib.parse

params = urllib.parse.urlencode({'q': 'urllib', 'check_keywords': 'yes', 'area': 'default'})
url = 'https://docs.python.org/3/search.html?{}'.format(params)
r = urllib.request.urlopen(url)

The data is passed as a dictionary of strings and encoded with urlencode() into the URL's query string.

The encoded params is a string in which each key-value pair of the dictionary is joined with '&': 'q=urllib&check_keywords=yes&area=default'

The resulting URL: https://docs.python.org/3/search.html?q=urllib&check_keywords=yes&area=default

Of course, urlopen() also accepts a directly constructed URL, so a simple GET request can be built by hand without urlencode(). The approach above simply makes the code more modular and elegant.

Pass Chinese parameters
import urllib.request

searchword = urllib.request.quote(input("Enter the keyword to query:"))
url = "https://cn.bing.com/images/async?q={}&first=0&mmasync=1".format(searchword)
r = urllib.request.urlopen(url)

The URL uses the Bing image API to query images for the keyword q. If the Chinese keyword is put into the URL directly, the request fails with an encoding error, so we use quote() to URL-encode it; unquote() performs the reverse decoding.
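
A quick sketch of quote() and unquote(); both are also available in urllib.parse, and the encoded form in the comments assumes the default UTF-8 encoding:

import urllib.parse

encoded = urllib.parse.quote("爬虫")     # '%E7%88%AC%E8%99%AB'
decoded = urllib.parse.unquote(encoded)  # back to '爬虫'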

Custom Request Headers
import urllib.request

url = 'https://docs.python.org/3/library/urllib.request.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Referer': 'https://docs.python.org/3/library/urllib.html'
}
req = urllib.request.Request(url, headers=headers)
r = urllib.request.urlopen(req)

Sometimes when you crawl a page you get a 403 (Forbidden) error, meaning access is denied. This happens because the site's server checks the request headers: a request sent through the urllib library uses "Python-urllib/x.y" as the default User-Agent, where x is Python's major version number and y is the minor version number. So we build a request object with urllib.request.Request(), pass the headers in as a dictionary, and thereby simulate a browser.

The corresponding headers can be viewed in the browser's developer tools (the "Network" tab of the "Inspect" panel) or with packet-capture tools such as Fiddler or Wireshark.

In addition to the above method, you can also use urllib.request.build_opener() or req.add_header() to customize the request headers; see the official examples, or the sketch below.
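
A rough sketch of both alternatives (the User-Agent string here is just a placeholder):

import urllib.request

# 1. add_header() on an existing Request object
req = urllib.request.Request('https://docs.python.org/3/library/urllib.request.html')
req.add_header('User-Agent', 'Mozilla/5.0')
r = urllib.request.urlopen(req)

# 2. build_opener() with default headers applied to every request it sends
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
r = opener.open('https://docs.python.org/3/library/urllib.request.html')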

In Python 2.x, the urllib and urllib2 modules were commonly used together: urllib.urlencode() encodes URL parameters, while urllib2.Request() builds the request object and customizes headers; requests are then sent uniformly with urllib2.urlopen().

Passing a POST request
import urllib.request
import urllib.parse

url = 'https://passport.cnblogs.com/user/signin?'
post = {
    'username': 'xxx',
    'password': 'xxxx'
}
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)
r = urllib.request.urlopen(req)

We use POST forms to submit information when registering, logging in, and so on.

Here we need to analyze the page structure, build the form data post, encode it with urlencode() (which returns a string), and then encode that string as 'utf-8' bytes, because postdata must be a bytes or file object. Finally, pass postdata into the Request() object and send the request with urlopen().

Download Remote Data to local
import urllib.requesturl = "https://www.python.org/static/img/python-logo.png" Urllib.request.urlretrieve (URL, "python-logo.png")

You can use urlretrieve() to download remote data, such as images and videos, to the local machine.

The first parameter is the URL to download, and the second is the local path to store it at.

This sample downloads the Python website's logo to the current directory; urlretrieve() returns the tuple (filename, headers).
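
A small sketch showing the returned tuple being unpacked:

import urllib.request

url = "https://www.python.org/static/img/python-logo.png"
filename, headers = urllib.request.urlretrieve(url, "python-logo.png")
print(filename)                 # local path: python-logo.png
print(headers['Content-Type'])  # e.g. image/png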

Set Proxy IP
import urllib.requesturl = "https://www.cnblogs.com/" proxy_ip = "180.106.16.132:8118" Proxy = Urllib.request.ProxyHandler ({' http ': proxy_ip}) opener = Urllib.request.build_opener (proxy , Urllib.request.HTTPHandler) Urllib.request.install_opener (opener) R = Urllib.request.urlopen (URL)

Sometimes, when a site is crawled too frequently, the server blocks your IP. In that case, a proxy IP can be set using the method above.

First, find a usable IP on a web proxy listing site and build a ProxyHandler() object, passing a dictionary that maps 'http' to the proxy IP to set the proxy server information. Then build the opener object, passing in the proxy handler and the HTTPHandler class. Install the opener globally with install_opener(), so that requests sent with urlopen() go through the proxy.

Exception handling
import urllib.request
import urllib.error

url = "http://www.balabalabala.org"
try:
    r = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)

You can use the URLError class to handle URL-related exceptions: import urllib.error and catch the URLError exception. Only an HTTPError exception (a subclass of URLError) carries a status code e.code, so you need to check whether the caught exception has a code attribute.
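
An equivalent sketch that catches HTTPError (which carries the status code) before the more general URLError:

import urllib.request
import urllib.error

url = "http://www.balabalabala.org"
try:
    r = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print(e.code, e.reason)  # e.g. 404 Not Found
except urllib.error.URLError as e:
    print(e.reason)          # e.g. connection refused or name not resolved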

Use of cookies
import urllib.request
import http.cookiejar

url = "http://www.balabalabala.org/"
cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)
r = urllib.request.urlopen(url)

Cookies maintain state between sessions when web pages are accessed over the stateless HTTP protocol. For example, some websites require a login: after logging in the first time by submitting a POST form, cookies let you stay signed in while crawling other pages on the site, instead of having to submit the login form every time.

First, build a CookieJar() object cjar, then wrap it in an HTTPCookieProcessor() handler, pass the handler to build_opener() to construct the opener object, install it globally, and send requests with urlopen().
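
A rough sketch, with a hypothetical login URL and form fields, showing how the cookie-enabled opener keeps you signed in across requests:

import urllib.request
import urllib.parse
import http.cookiejar

cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)

# hypothetical login endpoint and form fields
login_data = urllib.parse.urlencode({'username': 'xxx', 'password': 'xxxx'}).encode('utf-8')
urllib.request.urlopen("http://www.balabalabala.org/login", login_data)  # cookies stored in cjar
r = urllib.request.urlopen("http://www.balabalabala.org/profile")        # sent with stored cookies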

Official documentation

Python 2.x: urllib, urllib2

Python 3.x: urllib.request

