The urllib library in Python 2/3

Source: Internet
Author: User
Tags: urlencode, website, server

Summary: Introduces how the urllib library changed between Python versions, and explains common usage of the urllib library under Python 3.x.

urllib library quick-reference table

python2.x                      python3.x
urllib                         urllib.request, urllib.error, urllib.parse
urllib2                        urllib.request, urllib.error
urllib2.urlopen                urllib.request.urlopen
urllib.urlencode               urllib.parse.urlencode
urllib.quote                   urllib.request.quote
urllib2.Request                urllib.request.Request
urlparse                       urllib.parse
urllib.urlretrieve             urllib.request.urlretrieve
urllib2.URLError               urllib.error.URLError
cookielib.CookieJar            http.cookiejar.CookieJar

The urllib library is a Python standard library module for working with URLs and fetching web pages; libraries with similar functionality include requests and httplib2.

In Python 2.x this functionality is split between urllib and urllib2; in Python 3.x it is consolidated into the urllib package. The table above lists the most common changes, so you can quickly write the equivalent code for either version.

By comparison, Python 3.x handles Chinese text better than Python 2.x, so this post uses Python 3.x to introduce some common uses of the urllib library.

Send Request
import urllib.request

r = urllib.request.urlopen("http://www.python.org/")

First, import the urllib.request module, then use urlopen() to send a request to the URL given as its parameter; it returns an http.client.HTTPResponse object.

In urlopen(), the timeout parameter sets the number of seconds to wait before giving up on a response. You can also use r.info(), r.getcode(), and r.geturl() to obtain the response headers, the status code, and the URL of the current page.
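
For example, a minimal sketch (not from the original article) combining timeout with these helpers:

import urllib.request

r = urllib.request.urlopen("http://www.python.org/", timeout=10)  # give up after 10 seconds
print(r.info())     # response headers
print(r.getcode())  # HTTP status code, e.g. 200
print(r.geturl())   # final URL of the page (after any redirects)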

Read response content
import urllib.request

url = "http://www.python.org/"
with urllib.request.urlopen(url) as r:
    r.read()

Use r.read() to read the response body into memory; this is the page's source code (the same thing the browser's "view page source" function shows). The returned bytes can then be decoded with decode() as needed.
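
A small sketch, assuming the page is UTF-8 encoded:

import urllib.request

with urllib.request.urlopen("http://www.python.org/") as r:
    html = r.read().decode('utf-8')  # bytes -> str
print(html[:100])  # first 100 characters of the page source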

Passing URL parameters
import urllib.request
import urllib.parse

params = urllib.parse.urlencode({'q': 'urllib', 'check_keywords': 'yes', 'area': 'default'})
url = 'https://docs.python.org/3/search.html?{}'.format(params)
r = urllib.request.urlopen(url)

The data is passed as a dictionary of strings and encoded with urlencode() into the URL's query string.

The encoded params is a string in which each key-value pair of the dictionary is joined with '&': 'q=urllib&check_keywords=yes&area=default'

The resulting URL: https://docs.python.org/3/search.html?q=urllib&check_keywords=yes&area=default

Of course, urlopen() also accepts a directly constructed URL, so a simple GET request can be built by hand without urlencode(). The approach above simply makes the code more modular and elegant.

Pass Chinese parameters
import urllib.request

searchword = urllib.request.quote(input("Enter the keyword to query:"))
url = "https://cn.bing.com/images/async?q={}&first=0&mmasync=1".format(searchword)
r = urllib.request.urlopen(url)

The URL uses the Bing image API to query images for the keyword q. If the Chinese keyword is put into the URL directly, the request fails with an encoding error, so we use quote() to URL-encode it; unquote() performs the reverse decoding.
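
A quick sketch of quote() and unquote(); both are also available in urllib.parse, and the encoded form in the comments assumes the default UTF-8 encoding:

import urllib.parse

encoded = urllib.parse.quote("爬虫")     # '%E7%88%AC%E8%99%AB'
decoded = urllib.parse.unquote(encoded)  # back to '爬虫'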

Custom Request Headers
import urllib.request

url = 'https://docs.python.org/3/library/urllib.request.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Referer': 'https://docs.python.org/3/library/urllib.html'
}
req = urllib.request.Request(url, headers=headers)
r = urllib.request.urlopen(req)

Sometimes when you crawl a page you get a 403 (Forbidden) error, meaning access is denied. This happens because the site's server checks the request headers: a request sent through the urllib library uses "Python-urllib/x.y" as the default User-Agent, where x is Python's major version number and y is the minor version number. So we build a request object with urllib.request.Request(), pass the headers in as a dictionary, and thereby simulate a browser.

The corresponding headers can be viewed in the browser's developer tools (the "Network" tab of the "Inspect" panel) or with packet-capture tools such as Fiddler or Wireshark.

In addition to the above method, you can also use urllib.request.build_opener() or req.add_header() to customize the request headers; see the official examples, or the sketch below.
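
A rough sketch of both alternatives (the User-Agent string here is just a placeholder):

import urllib.request

# 1. add_header() on an existing Request object
req = urllib.request.Request('https://docs.python.org/3/library/urllib.request.html')
req.add_header('User-Agent', 'Mozilla/5.0')
r = urllib.request.urlopen(req)

# 2. build_opener() with default headers applied to every request it sends
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'Mozilla/5.0')]
r = opener.open('https://docs.python.org/3/library/urllib.request.html')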

In Python 2.x, the urllib and urllib2 modules were commonly used together: urllib.urlencode() encodes URL parameters, while urllib2.Request() builds the request object and customizes headers; requests are then sent uniformly with urllib2.urlopen().

Passing a POST request
import urllib.request
import urllib.parse

url = 'https://passport.cnblogs.com/user/signin?'
post = {
    'username': 'xxx',
    'password': 'xxxx'
}
postdata = urllib.parse.urlencode(post).encode('utf-8')
req = urllib.request.Request(url, postdata)
r = urllib.request.urlopen(req)

We use POST forms to submit information when registering, logging in, and so on.

Here we need to analyze the page structure, build the form data post, encode it with urlencode() (which returns a string), and then encode that string as 'utf-8' bytes, because postdata must be a bytes or file object. Finally, pass postdata into the Request() object and send the request with urlopen().

Download Remote Data to local
import urllib.requesturl = "https://www.python.org/static/img/python-logo.png" Urllib.request.urlretrieve (URL, "python-logo.png")

You can use urlretrieve() to download remote data, such as images and videos, to the local machine.

The first parameter is the URL to download, and the second is the local path to store it at.

This sample downloads the Python website's logo to the current directory; urlretrieve() returns the tuple (filename, headers).
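
A small sketch showing the returned tuple being unpacked:

import urllib.request

url = "https://www.python.org/static/img/python-logo.png"
filename, headers = urllib.request.urlretrieve(url, "python-logo.png")
print(filename)                 # local path: python-logo.png
print(headers['Content-Type'])  # e.g. image/png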

Set Proxy IP
import urllib.requesturl = "https://www.cnblogs.com/" proxy_ip = "180.106.16.132:8118" Proxy = Urllib.request.ProxyHandler ({' http ': proxy_ip}) opener = Urllib.request.build_opener (proxy , Urllib.request.HTTPHandler) Urllib.request.install_opener (opener) R = Urllib.request.urlopen (URL)

Sometimes, when a site is crawled too frequently, the server blocks your IP. In that case, a proxy IP can be set using the method above.

First, find a usable IP on a web proxy listing site and build a ProxyHandler() object, passing a dictionary that maps 'http' to the proxy IP to set the proxy server information. Then build the opener object, passing in the proxy handler and the HTTPHandler class. Install the opener globally with install_opener(), so that requests sent with urlopen() go through the proxy.

Exception handling
import urllib.request
import urllib.error

url = "http://www.balabalabala.org"
try:
    r = urllib.request.urlopen(url)
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)

You can use the URLError class to handle URL-related exceptions: import urllib.error and catch the URLError exception. Only an HTTPError exception (a subclass of URLError) carries a status code e.code, so you need to check whether the caught exception has a code attribute.
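
An equivalent sketch that catches HTTPError (which carries the status code) before the more general URLError:

import urllib.request
import urllib.error

url = "http://www.balabalabala.org"
try:
    r = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    print(e.code, e.reason)  # e.g. 404 Not Found
except urllib.error.URLError as e:
    print(e.reason)          # e.g. connection refused or name not resolved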

Use of cookies
import urllib.request
import http.cookiejar

url = "http://www.balabalabala.org/"
cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)
r = urllib.request.urlopen(url)

Cookies maintain state between sessions when web pages are accessed over the stateless HTTP protocol. For example, some websites require a login: after logging in the first time by submitting a POST form, cookies let you stay signed in while crawling other pages on the site, instead of having to submit the login form every time.

First, build a CookieJar() object cjar, then wrap it in an HTTPCookieProcessor() handler, pass the handler to build_opener() to construct the opener object, install it globally, and send requests with urlopen().
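
A rough sketch, with a hypothetical login URL and form fields, showing how the cookie-enabled opener keeps you signed in across requests:

import urllib.request
import urllib.parse
import http.cookiejar

cjar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cjar))
urllib.request.install_opener(opener)

# hypothetical login endpoint and form fields
login_data = urllib.parse.urlencode({'username': 'xxx', 'password': 'xxxx'}).encode('utf-8')
urllib.request.urlopen("http://www.balabalabala.org/login", login_data)  # cookies stored in cjar
r = urllib.request.urlopen("http://www.balabalabala.org/profile")        # sent with stored cookies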

Official documentation

Python 2.x: urllib, urllib2

Python 3.x: urllib.request

