A web crawler reads network resources from the network stream at a specified URL and saves them locally. Python has many libraries for fetching web pages; let's start with urllib.request (urllib2 in Python 2.x).
urlopen
Let's start by reading the following code:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'Mayi'

# Import the urllib.request library
import urllib.request

# Send a request to the specified URL; urlopen() returns a file-like
# object representing the server's response
response = urllib.request.urlopen("http://www.baidu.com/")

# The response object supports file-object methods such as read(),
# which reads the entire response body and returns it as bytes
html = response.read()

# Print the result
print(html)
In fact, if we open the Baidu homepage in a browser, right-click, and choose "View Source", we will find that the output of the program above is exactly the same. In other words, those few lines of code have already fetched the entire source of the Baidu homepage for us.
Request
In the example above, the parameter to urlopen() was a URL string.
However, if you need to perform more complex operations, such as adding HTTP headers, you must create a Request instance to pass to urlopen(), with the URL you want to access passed as a parameter to the Request constructor.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'Mayi'

# Import the urllib.request library
import urllib.request

# Pass the URL to Request() to construct a Request object
request = urllib.request.Request("http://www.baidu.com/")

# Send the request to the server
response = urllib.request.urlopen(request)
html = response.read()
print(html)
The result is exactly the same as before.
Besides the url parameter, a new Request instance can be given two additional parameters:
1. data (empty by default): the data to submit along with the URL (e.g. POST data); supplying it changes the HTTP request method from GET to POST.
2. headers (empty by default): a dictionary containing the key-value pairs of the HTTP headers to send.
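As a minimal sketch of the data parameter (the URL is just a placeholder; the requests here are only constructed, not sent), supplying data is what switches the method from GET to POST:

```python
import urllib.parse
import urllib.request

# Form fields to submit; urlencode() turns them into "key=value&..." form,
# and encode() converts the string to bytes, as the data parameter requires
form_data = urllib.parse.urlencode({"kw": "python"}).encode("utf-8")

# With data supplied, the request method becomes POST
post_request = urllib.request.Request("http://www.example.com/", data=form_data)
print(post_request.get_method())  # POST

# Without data, the same constructor yields a GET request
get_request = urllib.request.Request("http://www.example.com/")
print(get_request.get_method())  # GET
```

Note that data must be bytes, which is why the urlencoded string is encoded before being passed in.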
User-Agent
If we want our crawler to look more like a real user, the first step is to pretend to be a recognized browser. Different browsers send different User-Agent headers with their requests. The default User-Agent of urllib.request is Python-urllib/x.y (where x and y are the Python major and minor version numbers, e.g. Python-urllib/3.5).
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'Mayi'

# Import the urllib.request library
import urllib.request

# A Chrome User-Agent, carried in the header dict
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

# Construct the Request from the URL together with headers; the request
# will carry the Chrome browser User-Agent
request = urllib.request.Request("http://www.baidu.com/", headers=header)

# Send the request to the server
response = urllib.request.urlopen(request)
html = response.read()
print(html)
Adding more header information
A complete HTTP request message is constructed by adding specific headers to the HTTP request.
You can add or modify a specific header by calling Request.add_header(), and inspect an existing header by calling Request.get_header().
Add a specific header
#!/usr/bin/python3
# -*- coding: utf-8 -*-
__author__ = 'Mayi'

# Import the urllib.request library
import urllib.request

# A Chrome User-Agent, carried in the header dict
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

# Construct the Request from the URL together with headers; the request
# will carry the Chrome browser User-Agent
request = urllib.request.Request("http://www.baidu.com/", headers=header)

# You can also add/modify a specific header via Request.add_header()
request.add_header("Connection", "keep-alive")

# Send the request to the server
response = urllib.request.urlopen(request)
html = response.read()
print(html)
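To round this out, here is a small sketch of Request.get_header() (no request is actually sent). One wrinkle worth knowing: Request stores header names with only the first letter capitalized, so the matching spelling to look up is "User-agent", not "User-Agent":

```python
import urllib.request

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64)'}
request = urllib.request.Request("http://www.baidu.com/", headers=header)
request.add_header("Connection", "keep-alive")

# Request stores header names capitalized: "User-agent", "Connection"
print(request.get_header("User-agent"))   # the Chrome UA string set above
print(request.get_header("Connection"))   # keep-alive

# get_header() takes an optional default, returned when the header is absent
print(request.get_header("Accept", "not set"))  # not set
```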