Python Crawler Primer (3): Basic Use of the urllib2 Library


1. Crawling a Web Page in Minutes

How do we crawl a page? Essentially, we fetch the page's information from its URL. Although what we see in the browser is a nicely rendered page, that is only the browser's interpretation of what is, underneath, a chunk of HTML plus JS and CSS. If the page were a person, the HTML would be the skeleton, the JS the muscle, and the CSS the clothes. The most important part is the HTML, so let's write an example to pull a page down.

import urllib2

response = urllib2.urlopen('https://www.baidu.com/')
print response.read()

Save it as case1.py, change into the file's directory, and execute the command: python case1.py

You will see that the source code of the web page has been pulled down.

2. Analyzing the Code

Let's analyze the code above. The first line is:

response = urllib2.urlopen('https://www.baidu.com/')

First we call the urlopen method from the urllib2 library, passing in a URL. Here the URL is the Baidu home page and the scheme is HTTPS; you could also change https to ftp, file, http, and so on — the scheme simply indicates the protocol used to access the resource. urlopen generally accepts three parameters:

urlopen(url, data, timeout)

The first parameter, url, is the address to fetch; the second, data, is the data to transmit when accessing the URL; the third, timeout, sets a timeout period.

The second and third parameters are optional: data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.

The first parameter, url, is required. In this example we passed Baidu's URL. After the urlopen method executes, it returns a response object in which the returned information is stored.

print response.read()

The response object has a read() method that returns the fetched web page content, which we then print out with print.
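If you are on Python 3, note that urllib2 was merged into urllib.request; the same urlopen/read() pattern looks like this. As a minimal sketch that runs without touching the network, we fetch a data: URL whose body is embedded in the URL itself (the "page" content here is made up for illustration):

```python
from urllib.request import urlopen

# A data: URL embeds the response body directly, so no network is needed.
response = urlopen('data:text/plain,hello%20crawler')
body = response.read()  # returns bytes, just like urllib2's response.read()
print(body.decode('ascii'))  # -> hello crawler
```

The only difference from the Python 2 example in the text is the module name and that read() returns bytes rather than a str.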

3. Constructing a Request

In fact, the url argument to urlopen above can also be a Request object — an instance of the urllib2.Request class, constructed with the URL, data, and other content to send. The code above can therefore be rewritten like this:

import urllib2

request = urllib2.Request('https://www.baidu.com/')
response = urllib2.urlopen(request)
print response.read()

The result is exactly the same; the only difference is the Request object in the middle. This style is recommended, because building a request often involves adding more content (headers, for example), and having the server respond to a request we explicitly constructed keeps the logic clear.
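A Request object can be built and inspected without sending anything over the network. Here is a Python 3 sketch (where urllib2.Request lives at urllib.request.Request):

```python
from urllib.request import Request

# Nothing is sent until the request is passed to urlopen().
request = Request('https://www.baidu.com/')
print(request.full_url)      # the URL we passed in
print(request.get_method())  # GET, since no data was attached
```

Inspecting the object this way is handy when debugging what a crawler is about to send.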

4. POST and GET Data Transmission

The program above demonstrates the most basic fetch. However, most sites today are dynamic pages: you pass parameters to them, and they respond accordingly. So when visiting, we need to send data along. What is the most common case? Logging in, of course.

We send a username and password to a URL, and after the server processes them we get its response. How is that done? Let me walk you through it.

Data transmission comes in two forms, POST and GET. What is the difference between them?

The most important difference is that with GET the parameters are carried directly in the link: the URL contains all of them. That is insecure if a password is included, but you can see at a glance what you submitted. POST does not show the parameters in the URL, which is safer but makes it less convenient to see directly what was submitted. Choose whichever suits the situation.
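This difference can be seen without any network traffic by building both kinds of request as Python 3 Request objects (the URL and parameters below are placeholders, not a real endpoint):

```python
from urllib.request import Request
from urllib.parse import urlencode

payload = urlencode({'q': 'test'})

# GET: the parameters ride in the URL itself.
get_req = Request('https://example.com/search?' + payload)
# POST: the parameters ride in the request body (data must be bytes).
post_req = Request('https://example.com/search', data=payload.encode('ascii'))

print(get_req.get_method(), get_req.full_url)  # GET, parameters visible in the URL
print(post_req.get_method(), post_req.data)    # POST, parameters carried in the body
```

Note that simply attaching data flips the method from GET to POST — that is exactly the distinction the paragraph above describes.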

POST method:

Remember the data parameter mentioned above? This is where it comes in: the data we transmit is this very parameter. The following demonstrates the POST method.

import urllib
import urllib2

values = {'username': '[email protected]', 'password': 'XXXX'}
data = urllib.urlencode(values)
url = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

We imported the urllib library here and simulate logging in to CSDN. Of course, the code above may not actually get you in, because CSDN also checks headers and some parameters we have not set; it is only meant to illustrate the principle of a login. We define a dictionary named values with the parameters username and password, then use urllib's urlencode method to encode the dictionary into data. The Request is built with two arguments, url and data. Run the program, and the response returned is the content of the page rendered after login. You can also set up your own server to test this.

Note that there is another way to define the dictionary above; the following notation is equivalent:

import urllib
import urllib2

values = {}
values['username'] = '[email protected]'
values['password'] = 'XXXX'
data = urllib.urlencode(values)
url = 'http://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

Either of the above implements POST-mode transmission.
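The urlencode step itself can be tried offline. In Python 3 it lives in urllib.parse, and it turns a dictionary into the application/x-www-form-urlencoded string that gets sent as the POST body (the credentials below are placeholders):

```python
from urllib.parse import urlencode

# Placeholder credentials, just to show the encoding.
values = {'username': 'user@example.com', 'password': 'XXXX'}
data = urlencode(values)
print(data)  # username=user%40example.com&password=XXXX
```

Note how the '@' is percent-encoded as %40 — that is why the raw dictionary cannot be sent as-is.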

GET method:

As for GET, we can write the parameters directly into the URL — simply build a URL that carries the parameters:

import urllib
import urllib2

values = {}
values['username'] = '[email protected]'
values['password'] = 'XXXX'
data = urllib.urlencode(values)
url = 'https://passport.csdn.net/account/login'
geturl = url + '?' + data
print geturl
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()

Print geturl and you will find that it is the original URL with a '?' appended, followed by the encoded parameters:

https://passport.csdn.net/account/login?username=1016903103%40qq.com&password=XXXX

This is the same as an ordinary GET access in the browser, and that is how the data is sent.
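To check that a hand-built GET URL carries the right parameters, Python 3's urllib.parse can take it apart again. A sketch with placeholder values (nothing is actually fetched):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

url = 'https://passport.csdn.net/account/login'
values = {'username': 'user@example.com', 'password': 'XXXX'}
geturl = url + '?' + urlencode(values)

# Split the URL and decode the query string back into a dict of lists.
query = parse_qs(urlsplit(geturl).query)
print(query['username'])  # ['user@example.com']
```

parse_qs reverses the urlencode step, which makes it a convenient sanity check before pointing the crawler at a real server.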
