Python Crawler Primer (3): Basic Use of the urllib2 Library


1. Crawling a Web Page in Minutes

How do we crawl a page? Essentially, we fetch the page's information from its URL. Although what we see in the browser is a nicely rendered page, that is only the browser's interpretation of what is, underneath, a chunk of HTML plus JS and CSS. If the page were a person, the HTML would be the skeleton, the JS the muscle, and the CSS the clothes. The most important part is the HTML, so let's write an example to pull a page down.

import urllib2

response = urllib2.urlopen('https://www.baidu.com/')
print response.read()

Save it as case1.py, change into the file's directory, and execute the command: python case1.py

You will see that the source code of the web page has been pulled down.

2. Analyzing the Code

Let's analyze the code above. The first line is:

response = urllib2.urlopen('https://www.baidu.com/')

First we call the urlopen method from the urllib2 library, passing in a URL. Here the URL is the Baidu home page and the scheme is HTTPS; you could also change https to ftp, file, http, and so on — the scheme simply indicates the protocol used to access the resource. urlopen generally accepts three parameters:

urlopen(url, data, timeout)

The first parameter, url, is the address to fetch; the second, data, is the data to transmit when accessing the URL; the third, timeout, sets a timeout period.

The second and third parameters are optional: data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.

The first parameter, url, is required. In this example we passed Baidu's URL. After the urlopen method executes, it returns a response object in which the returned information is stored.

print response.read()

The response object has a read() method that returns the fetched web page content, which we then print out with print.
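If you are on Python 3, note that urllib2 was merged into urllib.request; the same urlopen/read() pattern looks like this. As a minimal sketch that runs without touching the network, we fetch a data: URL whose body is embedded in the URL itself (the "page" content here is made up for illustration):

```python
from urllib.request import urlopen

# A data: URL embeds the response body directly, so no network is needed.
response = urlopen('data:text/plain,hello%20crawler')
body = response.read()  # returns bytes, just like urllib2's response.read()
print(body.decode('ascii'))  # -> hello crawler
```

The only difference from the Python 2 example in the text is the module name and that read() returns bytes rather than a str.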

3. Constructing a Request

In fact, the url argument to urlopen above can also be a Request object — an instance of the urllib2.Request class, constructed with the URL, data, and other content to send. The code above can therefore be rewritten like this:

import urllib2

request = urllib2.Request('https://www.baidu.com/')
response = urllib2.urlopen(request)
print response.read()

The result is exactly the same; the only difference is the Request object in the middle. This style is recommended, because building a request often involves adding more content (headers, for example), and having the server respond to a request we explicitly constructed keeps the logic clear.
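A Request object can be built and inspected without sending anything over the network. Here is a Python 3 sketch (where urllib2.Request lives at urllib.request.Request):

```python
from urllib.request import Request

# Nothing is sent until the request is passed to urlopen().
request = Request('https://www.baidu.com/')
print(request.full_url)      # the URL we passed in
print(request.get_method())  # GET, since no data was attached
```

Inspecting the object this way is handy when debugging what a crawler is about to send.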

4. POST and GET Data Transmission

The program above demonstrates the most basic fetch. However, most sites today are dynamic pages: you pass parameters to them, and they respond accordingly. So when visiting, we need to send data along. What is the most common case? Logging in, of course.

We send a username and password to a URL, and after the server processes them we get its response. How is that done? Let me walk you through it.

Data transmission comes in two forms, POST and GET. What is the difference between them?

The most important difference is that with GET the parameters are carried directly in the link: the URL contains all of them. That is insecure if a password is included, but you can see at a glance what you submitted. POST does not show the parameters in the URL, which is safer but makes it less convenient to see directly what was submitted. Choose whichever suits the situation.
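This difference can be seen without any network traffic by building both kinds of request as Python 3 Request objects (the URL and parameters below are placeholders, not a real endpoint):

```python
from urllib.request import Request
from urllib.parse import urlencode

payload = urlencode({'q': 'test'})

# GET: the parameters ride in the URL itself.
get_req = Request('https://example.com/search?' + payload)
# POST: the parameters ride in the request body (data must be bytes).
post_req = Request('https://example.com/search', data=payload.encode('ascii'))

print(get_req.get_method(), get_req.full_url)  # GET, parameters visible in the URL
print(post_req.get_method(), post_req.data)    # POST, parameters carried in the body
```

Note that simply attaching data flips the method from GET to POST — that is exactly the distinction the paragraph above describes.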

POST method:

Remember the data parameter mentioned above? This is where it comes in: the data we transmit is this very parameter. The following demonstrates the POST method.

import urllib
import urllib2

values = {'username': '[email protected]', 'password': 'XXXX'}
data = urllib.urlencode(values)
url = 'https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

We imported the urllib library here and simulate logging in to CSDN. Of course, the code above may not actually get you in, because CSDN also checks headers and some parameters we have not set; it is only meant to illustrate the principle of a login. We define a dictionary named values with the parameters username and password, then use urllib's urlencode method to encode the dictionary into data. The Request is built with two arguments, url and data. Run the program, and the response returned is the content of the page rendered after login. You can also set up your own server to test this.

Note that there is another way to define the dictionary above; the following notation is equivalent:

import urllib
import urllib2

values = {}
values['username'] = '[email protected]'
values['password'] = 'XXXX'
data = urllib.urlencode(values)
url = 'http://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn'
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

Either of the above implements POST-mode transmission.
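The urlencode step itself can be tried offline. In Python 3 it lives in urllib.parse, and it turns a dictionary into the application/x-www-form-urlencoded string that gets sent as the POST body (the credentials below are placeholders):

```python
from urllib.parse import urlencode

# Placeholder credentials, just to show the encoding.
values = {'username': 'user@example.com', 'password': 'XXXX'}
data = urlencode(values)
print(data)  # username=user%40example.com&password=XXXX
```

Note how the '@' is percent-encoded as %40 — that is why the raw dictionary cannot be sent as-is.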

GET method:

As for GET, we can write the parameters directly into the URL — simply build a URL that carries the parameters:

import urllib
import urllib2

values = {}
values['username'] = '[email protected]'
values['password'] = 'XXXX'
data = urllib.urlencode(values)
url = 'https://passport.csdn.net/account/login'
geturl = url + '?' + data
print geturl
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()

Print geturl and you will find that it is the original URL with a '?' appended, followed by the encoded parameters:

https://passport.csdn.net/account/login?username=1016903103%40qq.com&password=XXXX

This is the same as an ordinary GET access in the browser, and that is how the data is sent.
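To check that a hand-built GET URL carries the right parameters, Python 3's urllib.parse can take it apart again. A sketch with placeholder values (nothing is actually fetched):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

url = 'https://passport.csdn.net/account/login'
values = {'username': 'user@example.com', 'password': 'XXXX'}
geturl = url + '?' + urlencode(values)

# Split the URL and decode the query string back into a dict of lists.
query = parse_qs(urlsplit(geturl).query)
print(query['username'])  # ['user@example.com']
```

parse_qs reverses the urlencode step, which makes it a convenient sanity check before pointing the crawler at a real server.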
