Sesame HTTP Python crawler introduction: basic use of the urllib2 library

Source: Internet
Author: User
Tags: set, time, urlencode

1. Grab a web page in minutes

How do we grab a web page? A crawler simply uses a URL to fetch that page's information. What we see in the browser is a nicely rendered page, but in essence it is a piece of HTML code, plus JS and CSS, interpreted and presented by the browser. If we compare a page to a person, HTML is the skeleton, JS is the muscle, and CSS is the clothes. The most important part is the HTML, so let's write an example to pull a page down.

import urllib2
response = urllib2.urlopen("http://www.baidu.com")
print response.read()

Yes, you read that right: the real program is just two lines plus an import. Save it as demo.py, enter the file's directory, and execute the following command to see the result:

python demo.py

See? The source of this web page has been pulled down by us. Isn't that satisfying?

2. How to analyze the Web page

Now let's analyze these lines of code. The first line is:

response = urllib2.urlopen("http://www.baidu.com")

First we call the urlopen method inside the urllib2 library, passing in a URL. This URL is the Baidu home page, and its protocol is HTTP. Of course you can also replace HTTP with FTP, file, HTTPS, and so on; each just represents a kind of access protocol. urlopen generally accepts three parameters, with the following signature:

urlopen(url, data, timeout)

The first parameter, url, is the URL to fetch; the second parameter, data, is the data to be transmitted when accessing the URL; and the third, timeout, is the timeout setting.

The second and third parameters are optional: data defaults to None, and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.
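These defaults can be checked directly. Note that urllib2 exists only in Python 2; in Python 3 the same urlopen lives in urllib.request. A quick sketch (Python 3) that inspects the signature:

```python
import inspect
import socket
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

# Read urlopen's signature: url is mandatory, data and timeout are optional.
sig = inspect.signature(urlopen)

print(sig.parameters["data"].default)  # None
print(sig.parameters["timeout"].default is socket._GLOBAL_DEFAULT_TIMEOUT)  # True
```

The timeout default is a module-level sentinel object rather than a number, which is why it is compared with is.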

The first parameter, the URL, must always be supplied. In this example we passed the URL of Baidu. After executing the urlopen method, a response object is returned, and the returned information is saved in it.

print response.read()

The response object has a read method that returns the content of the fetched web page.

What if we print the response directly, without calling read? The answer looks like this:

<addinfourl at 139728495260376 whose fp = <socket._fileobject object at 0x7f1513fb3ad0>>

It prints the description of the object directly, so remember to add the read method; otherwise, when no content comes out, don't blame me!

3. Constructing a Request

In fact, the url parameter of urlopen above can also be a Request object, an instance of the Request class, constructed with the URL, data, and other content. The two lines of code above can be rewritten like this:

import urllib2
request = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(request)
print response.read()

The result is exactly the same, except there is a Request object in the middle. This style is recommended: when building a request you often need to add a lot of content, and modelling the exchange as a request that the server responds to keeps the logic clear.
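Because a Request can be built without being sent, it is easy to examine on its own. A small Python 3 sketch (where urllib2.Request became urllib.request.Request):

```python
from urllib.request import Request  # Python 3 name for urllib2.Request

# Build the request without opening it; nothing touches the network here.
request = Request("http://www.baidu.com")

print(request.get_full_url())  # http://www.baidu.com
print(request.get_method())    # GET, since no data was supplied
print(request.host)            # www.baidu.com
```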

4. POST and GET data transfer

The programs above demonstrate the most basic web fetching. However, most websites today are dynamic pages that require you to pass parameters to them dynamically, and they respond accordingly. So when visiting them, we need to send data along. What is the most common such situation? Right: logging in.

We send a username and password to a URL, and after the server processes them we receive its response. How do we do that? Let me explain.

Data transmission comes in two forms, POST and GET. What is the difference between them?

The most important difference is that with GET the parameters ride directly in the link you access: the URL contains them all. That is of course an unsafe choice if a password is included, but it does let you see at a glance what you submitted. POST does not display the parameters on the URL, which is safer, though it is less convenient when you want to inspect directly what was submitted. Choose whichever fits the situation.
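The same distinction shows up in code: a Request with no data is sent as GET, while attaching data turns it into POST. A Python 3 sketch, using example.com as a placeholder URL:

```python
from urllib.parse import urlencode
from urllib.request import Request

params = urlencode({"q": "hello"})

# GET: the parameters ride in the URL itself.
get_req = Request("http://example.com/search?" + params)

# POST: the parameters travel in the request body instead.
post_req = Request("http://example.com/search", data=params.encode("ascii"))

print(get_req.get_method())    # GET
print(post_req.get_method())   # POST
print(get_req.get_full_url())  # http://example.com/search?q=hello
```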

POST mode:

Remember the data parameter mentioned above? This is where it comes in: the data we transmit is this parameter, and supplying it makes the request a POST.

import urllib
import urllib2

values = {"username": "[email protected]", "password": "XXXX"}
data = urllib.urlencode(values)
url = "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

Here we also import the urllib library, and we simulate logging in to CSDN. Of course, the code above may not actually get you in, because CSDN also requires an extra hidden field that we have not set; that is more complex, so I won't write it out here. This only illustrates the principle of logging in, and login pages are generally written this way.

We define a dictionary named values, in which I set the parameters username and password. Then we use urllib's urlencode method to encode the dictionary into a string named data. When building the Request we pass two parameters, url and data. Running the program returns the content of the page rendered after the POST.

Note that there is another way to define the dictionary above; the following notation is equivalent:

import urllib
import urllib2

values = {}
values['username'] = "[email protected]"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

Either of the above achieves POST-mode transmission.

GET mode:

For GET mode, we can write the parameters directly into the URL, building a URL that carries the parameters itself.

import urllib
import urllib2

values = {}
values['username'] = "[email protected]"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login"
geturl = url + "?" + data
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()

You can print geturl to see the resulting URL: it is the original URL with a "?" appended, followed by the encoded parameters:

http://passport.csdn.net/account/login?username=1016903103%40qq.com&password=xxxx

This is exactly the kind of URL we normally access, which is how data is transmitted in GET mode.

