Basic tutorial on using the Python Urllib Library

Last Update:2018-07-18 Source: Internet

Author: User


This article introduces the basics of the Python urllib library, which is essential knowledge for writing crawlers in Python. The examples use urllib2 and the print statement, so they target Python 2.
1. Fetch a webpage in minutes

How do we fetch a webpage? A page is retrieved by its URL. What looks like a polished picture in the browser is just the browser's rendering of HTML code, together with JS and CSS. If you compare a webpage to a person, HTML is the skeleton, JS is the muscle, and CSS is the clothing. The most important content therefore lives in the HTML. Below is an example that downloads a webpage.

import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()

Yes, you read that correctly: apart from the import, the program is only two lines. Save it as demo.py, change to the directory containing the file, and run the following command to see the result.

python demo.py

Look, we have fetched the source code of the page. Satisfying, isn't it?
2. Analyzing how the page is fetched

Let's analyze those two lines of code, starting with the first:

response = urllib2.urlopen("http://www.baidu.com")

First, we call the urlopen function of the urllib2 library and pass in a URL. Here the URL is the Baidu homepage and the protocol is HTTP. You can also replace HTTP with FTP, FILE, HTTPS, and so on; the scheme only indicates which access protocol is used. urlopen generally accepts three parameters:

urlopen(url, data, timeout)

The first parameter, url, is the address to open; the second, data, is the data to send when accessing the URL; the third, timeout, sets the timeout.

The second and third parameters are optional. data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.

The first parameter, url, is required. In this example we pass the URL of Baidu. urlopen returns a response object, which holds the returned information.
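
The defaults described above can be checked by inspecting the function signature. This is a minimal sketch that assumes Python 3, where urllib2.urlopen lives at urllib.request.urlopen; the article's own code targets Python 2. It makes no network request.

```python
import inspect
import socket
import urllib.request

# Python 3 moved urllib2.urlopen to urllib.request.urlopen.
params = inspect.signature(urllib.request.urlopen).parameters

# The first three parameters, in order.
first_three = list(params)[:3]
print(first_three)

# url has no default value, so it is required.
url_is_required = params["url"].default is inspect.Parameter.empty
# data defaults to None, meaning no request body.
data_default = params["data"].default
# timeout defaults to the sentinel object named in the text.
timeout_is_global_default = (
    params["timeout"].default is socket._GLOBAL_DEFAULT_TIMEOUT
)

print(url_is_required, data_default, timeout_is_global_default)
```

Note that socket._GLOBAL_DEFAULT_TIMEOUT is a private sentinel, so it is best used only for inspection like this rather than passed around in application code.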

print response.read()

The response object has a read method that returns the obtained webpage content.

What is printed if you leave out read? Only a description of the response object is printed, not the page content, so remember to call read. Otherwise, don't blame me when the page does not show up!
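
The difference between printing the response object and printing the result of read can be seen without network access by opening a local file through a file:// URL. This is a sketch that assumes Python 3 (urllib.request instead of urllib2); the file name demo.html and its content are made up for illustration.

```python
import pathlib
import tempfile
import urllib.request

# Write a small HTML file so the example runs without network access.
with tempfile.TemporaryDirectory() as tmp_dir:
    page = pathlib.Path(tmp_dir) / "demo.html"
    page.write_text("<html><body>hello</body></html>", encoding="utf-8")

    # as_uri() produces a file:// URL, which urlopen accepts like an http:// one.
    with urllib.request.urlopen(page.as_uri()) as response:
        description = repr(response)  # what printing the object itself shows
        body = response.read()        # the actual content, as bytes

print(description)
print(body.decode("utf-8"))
```

The first print shows only an object description, while the second shows the HTML itself. In Python 3, read returns bytes, so the content is decoded before printing.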
3. Constructing a Request

In fact, urlopen also accepts a Request object, an instance of the Request class, in place of the URL string. When constructing it you pass in the URL, the data, and other content. For example, the two lines above can be rewritten as follows.

import urllib2

request = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(request)
print response.read()

The result is exactly the same, with one extra Request object in between. This style is recommended, because a request often needs a lot of additional content. You build a request, and the server responds to it, which keeps the logic clear.
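
To illustrate what "additional content" on a request can look like, the sketch below builds Request objects and inspects them without sending anything. It assumes Python 3 (urllib.request.Request); the URL and the User-Agent value are hypothetical.

```python
import urllib.request

# Hypothetical target URL and header value, used only for illustration.
url = "http://www.example.com/"
request = urllib.request.Request(url, headers={"User-Agent": "demo-client/0.1"})

# Nothing is sent yet: the Request object only describes what will be sent.
print(request.full_url)
print(request.get_method())              # GET, because no data is attached
print(request.get_header("User-agent"))  # header names are stored capitalized

# Attaching a body switches the default method to POST.
post_request = urllib.request.Request(url, data=b"key=value")
print(post_request.get_method())
```

Keeping the description of the request separate from the act of sending it is what makes it easy to add headers or data before calling urlopen.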
4. POST and GET Data Transmission

The program above demonstrates the most basic page fetch. Most websites today are dynamic, however: you pass parameters to them and they respond accordingly. We therefore need to send data along with the request. What is the most common case? Logging in and registering.

You send the username and password to a URL and receive the server's response after it processes them. How is this done? Read on.

Data transmission can be divided into two methods: POST and GET. What is the difference between the two methods?

The most important difference is that a GET request can be made directly as a link, because the link contains all the parameters. If a password is included this is insecure, but you can see at a glance what was submitted. POST does not show the parameters in the URL, which makes it less convenient to inspect what was submitted. Choose whichever suits the situation.
POST method:

Remember the data parameter mentioned above? This is where it is used: the data we send goes in that parameter. The following shows the POST method.

import urllib
import urllib2

values = {"username": "1016903103@qq.com", "password": "XXXX"}
data = urllib.urlencode(values)
url = "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

We import the urllib library and simulate a login to CSDN. This code may well fail to log in, because headers may need to be set or some parameters may be missing; those details are not covered here, and the point is only to explain the principle. We define a dictionary named values containing username and password, encode it with urllib's urlencode function, and name the result data. The request is then constructed with two arguments, url and data. Running the program submits the login, and the returned content is the page shown after login. You can also set up your own server to test it.

Note that there is another way to define the dictionary above. The following statements are equivalent.

import urllib
import urllib2

values = {}
values['username'] = "1016903103@qq.com"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

The above method implements POST transmission.
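
As suggested earlier, a local server is a convenient way to test a POST. The following is a rough sketch under several assumptions: it uses Python 3 (urllib.request and urllib.parse instead of urllib2 and urllib), the EchoFormHandler class and the /login path are made up for this example, and the credentials are placeholders. It is not how CSDN or any real login endpoint behaves.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoFormHandler(BaseHTTPRequestHandler):
    """Hypothetical test handler: echoes back the username it receives."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        fields = urllib.parse.parse_qs(self.rfile.read(length).decode("ascii"))
        reply = "received username={}".format(fields["username"][0]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port on the loopback interface.
server = HTTPServer(("127.0.0.1", 0), EchoFormHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

values = {"username": "user@example.com", "password": "XXXX"}  # placeholders
# In Python 3 the request body must be bytes, so the encoded string is
# converted to bytes before it is attached to the request.
data = urllib.parse.urlencode(values).encode("ascii")

url = "http://127.0.0.1:{}/login".format(server.server_port)
request = urllib.request.Request(url, data)

# An opener with an empty ProxyHandler ignores any system proxy settings,
# so the loopback request goes directly to the test server.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with opener.open(request, timeout=5) as response:
    reply_text = response.read().decode("utf-8")

server.shutdown()
server.server_close()
print(reply_text)
```

Because data is attached to the Request, the request is sent as a POST, and the server reads the form fields from the body rather than from the URL.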
GET method:

As for the GET method, we can directly write the parameters to the URL and directly construct a URL with parameters.

import urllib
import urllib2

values = {}
values['username'] = "1016903103@qq.com"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login"
geturl = url + "?" + data
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()

If you print geturl, you will see that it is simply the original URL, followed by ? and the encoded parameters:

http://passport.csdn.net/account/login?username=1016903103%40qq.com&password=XXXX

This is the same as an ordinary GET request, so the data is transmitted by GET.
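
The URL construction can be tried on its own, without sending a request. This sketch assumes Python 3, where urlencode lives in urllib.parse; the username is a placeholder rather than a real account.

```python
import urllib.parse

values = {"username": "user@example.com", "password": "XXXX"}  # placeholders
query = urllib.parse.urlencode(values)

base_url = "http://passport.csdn.net/account/login"
get_url = base_url + "?" + query
print(get_url)

# Splitting the URL again recovers the original parameters.
parsed = urllib.parse.urlsplit(get_url)
decoded = dict(urllib.parse.parse_qsl(parsed.query))
print(decoded)
```

The @ in the username is percent-encoded as %40 in the query string and is restored when the query is parsed back.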

This section covered basic usage that is enough to fetch basic webpage information.

This article is an English version of an article originally published in Chinese on aliyun.com.