Python Crawler (1)


Basic use of the urllib library

So now, let's set off together on our crawler journey.

1. Grab a web page in no time

How do we grab a web page? Essentially, we use its URL to fetch the page's information. What we see in the browser is a nice picture, but that is only the browser's interpretation of what is, at its core, a piece of HTML plus JS and CSS. If a page were a person, HTML would be the skeleton, JS the muscle, and CSS the clothes. The most important part is the HTML, so let's write an example that pulls a page down.

import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.read()

Yes, you read that right: the real program is only two lines. Save it as demo.py, go to the file's directory, and run the following command to see the result. Feel it.

python demo.py

See? The source of the page has been pulled down, just like that. Isn't that satisfying?

2. Analyzing the code

Now let's analyze these two lines of code. The first line:

response = urllib2.urlopen("http://www.baidu.com")

First we call urllib2's urlopen method and pass in a URL. This URL is Baidu's home page, and the protocol is HTTP; of course you can also replace HTTP with FTP, FILE, HTTPS and so on, each of which simply denotes a kind of access protocol. urlopen generally accepts three parameters, with the following signature:

urlopen(url, data, timeout)

The first parameter url is the address to fetch, the second parameter data is the data to send when accessing the URL, and the third, timeout, is the timeout setting.

The second and third parameters are optional: data defaults to None and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.

The first parameter, the URL, is required. In this example we pass Baidu's URL. After the urlopen method executes, it returns a response object, and the returned data is stored in it.

print response.read()

The response object has a read() method that returns the content of the fetched page.

What if we print it directly without calling read()? The result looks like this:

<addinfourl at 139728495260376 whose FP = <socket._fileobject object at 0x7f1513fb3ad0>>

It prints only a description of the object, so remember to add the read() method, otherwise you will get no content, and don't blame me!
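As a side note (not in the original text), the response object also offers a few other handy accessors besides read(); a minimal sketch, assuming Python 2 and urllib2:

import urllib2

response = urllib2.urlopen("http://www.baidu.com")
print response.getcode()    # HTTP status code, e.g. 200
print response.geturl()     # the URL actually fetched (after any redirects)
print len(response.read())  # read() returns the page body as a string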

3. Constructing a Request

In fact, the argument passed to urlopen above can also be a Request object: an instance of the Request class, constructed by passing in the url, data and so on. The two lines of code above can be rewritten like this:

import urllib2

request = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(request)
print response.read()

The result is exactly the same, except that there is a Request object in the middle. Writing it this way is recommended, because a lot of content often needs to be added when building a request, and having the server respond to a request we constructed explicitly is also logically clearer.

4. POST and GET data transfer

The program above demonstrates the most basic page fetch. However, most sites today are dynamic pages that require you to pass parameters to them, and they respond accordingly. So when visiting, we need to send data to them. What is the most common such situation? Right: logging in and registering.

We send the username and password to a URL and receive the server's response after it processes them. How do we do that? Let me explain.

Data can be transmitted in two ways, POST and GET. What is the difference between them?

The most important difference is that with GET you access a URL directly, and that URL contains all the parameters; if a password is included this is of course insecure, but you can see at a glance what you submitted. POST does not show the parameters in the URL, which is safer, but it is less convenient if you want to see directly what was submitted. Choose whichever is appropriate.

POST method:

Remember the data parameter mentioned above? This is where it comes in: the data we transmit is that parameter, and using it illustrates the POST method.

import urllib
import urllib2

values = {"username": "[email protected]", "password": "XXXX"}
data = urllib.urlencode(values)
url = "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

We also imported the urllib library, and here we simulate logging in to CSDN. Of course, the code above may not actually get you in, because CSDN also requires some header fields to be set and some parameters are missing; they have not been covered yet, so this only illustrates the principle of a login. We define a dictionary named values, in which I set a username and a password, then use urllib's urlencode method to encode the dictionary into data. The request is built with two arguments, url and data; run the program, and the returned content is the page rendered after login. Of course, you can also build a server of your own to test it.
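A minimal sketch of such a test server (not part of the original; Python 2's BaseHTTPServer, the local address 127.0.0.1:8000 and the handler name are all illustrative assumptions). It simply echoes back whatever form fields it receives:

import BaseHTTPServer
import urlparse

class EchoHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    # echo back the urlencoded form fields sent in a POST request
    def do_POST(self):
        length = int(self.headers.getheader('content-length', 0))
        fields = urlparse.parse_qs(self.rfile.read(length))
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write('received: %r\n' % fields)

if __name__ == '__main__':
    BaseHTTPServer.HTTPServer(('127.0.0.1', 8000), EchoHandler).serve_forever()

Pointing the url in the POST example at http://127.0.0.1:8000 instead of the CSDN address lets you see exactly what the crawler sent.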

Note that there is another way to define the dictionary above; the following notation is equivalent:

import urllib
import urllib2

values = {}
values['username'] = "[email protected]"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url, data)
response = urllib2.urlopen(request)
print response.read()

The above implements data transmission with POST.

GET method:

As for GET, we can simply write the parameters into the URL, that is, build a URL that carries the parameters directly.

import urllib
import urllib2

values = {}
values['username'] = "[email protected]"
values['password'] = "XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login"
geturl = url + "?" + data
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()

You can print geturl to see the resulting URL: it is the original URL with a ? appended, followed by the encoded parameters.

http://passport.csdn.net/account/login?username=1016903103%40qq.com&password=XXXX

This is exactly the same as how we normally access a URL with GET; that is how data is transmitted the GET way.

This section covered some basic usage that lets you crawl basic page information. Keep it up!

Python Crawler Primer (4): Advanced usage of the urllib library

1. Set Headers

Some sites do not allow programs to access them directly in this way; if they detect a problem, they simply will not respond. So to fully simulate the browser's behavior, we need to set some Headers properties.

First, open your browser's debugging tools (F12; I use Chrome) and turn on network monitoring. Take a login as an example: after logging in you will find that the interface has changed and a new page appears. This page actually contains a lot of content that is not loaded all at once; in essence many requests are made, usually first a request for the HTML file, then for JS, CSS and so on. After many requests, the skeleton and muscles of the page are in place and the whole page takes shape.

Splitting these requests apart, look at just the first one. You can see there is a request URL and headers, followed by the response (the screenshot is not complete; try it yourself). The headers contain a lot of information: the file encoding, compression, the request's User-Agent, and so on.
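As a side note (not in the original), you can also inspect a response's headers from Python itself rather than in the browser; a minimal sketch with urllib2:

import urllib2

response = urllib2.urlopen("http://www.baidu.com")
# info() returns the response headers, e.g. Content-Type and Content-Encoding
print response.info()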

The User-Agent is the identity of the request. If no request identity is supplied, the server will not necessarily respond, so you can set a User-Agent in the headers, as in the example below. This example only explains how to set headers; just look at the format.

import urllib
import urllib2

url = 'http://www.server.com/login'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'username': 'cqc', 'password': 'XXXX'}
headers = {'User-Agent': user_agent}
data = urllib.urlencode(values)
request = urllib2.Request(url, data, headers)
response = urllib2.urlopen(request)
page = response.read()

In this way we build a headers dictionary and pass it in when constructing the Request. When the request is made the headers are sent along, and the server, identifying it as a request from a browser, will respond.

In addition, some servers guard against "hotlinking": the server checks whether the Referer in the headers points to one of its own pages, and if not, some servers will not respond. To deal with this, we can also add a Referer to the headers.

For example, we can build headers like this:

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
           'Referer': 'http://www.zhihu.com/articles'}

Pass these headers into the Request as in the method above, and the anti-hotlinking check can be handled.

Among the headers properties, the following deserve special attention:

User-Agent: some servers or proxies use this value to determine whether the request was made by a browser.

Content-Type: when using a REST interface, the server checks this value to determine how the content in the HTTP body should be parsed.

application/xml: used in XML RPC, such as RESTful/SOAP, calls

application/json: used in JSON RPC calls

application/x-www-form-urlencoded: used when a web form is submitted by the browser

When using RESTful or SOAP services provided by a server, setting Content-Type incorrectly will cause the server to refuse the request.

For other headers, inspect the browser's request headers and write the same values when building the request.
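As an illustration (not from the original), here is a minimal sketch of setting Content-Type when sending JSON to a REST-style interface with urllib2; the URL http://www.server.com/api and the field names are placeholders:

import json
import urllib2

url = 'http://www.server.com/api'
payload = json.dumps({'username': 'cqc', 'password': 'XXXX'})  # JSON body
headers = {'Content-Type': 'application/json'}
request = urllib2.Request(url, payload, headers)
response = urllib2.urlopen(request)
print response.read()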

2. Proxy settings

By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy. Some websites detect how many times a given IP visits within a certain period and block you if there are too many visits. So you can set up proxy servers to help with the work, switching to a different proxy from time to time, and the site won't know who is behind it. How satisfying!

The following code illustrates the use of proxy settings

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})
if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)
urllib2.install_opener(opener)
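The code above installs a single proxy globally. If you want to switch proxies per request, as described above, here is a minimal sketch (not from the original; the proxy addresses are placeholders):

import random
import urllib2

# placeholder proxy addresses for illustration only
proxies = ['http://proxy1.example.com:8080', 'http://proxy2.example.com:8080']

def open_with_random_proxy(url):
    # pick a proxy at random and build an opener that uses it for this request
    proxy_handler = urllib2.ProxyHandler({'http': random.choice(proxies)})
    opener = urllib2.build_opener(proxy_handler)
    return opener.open(url)

# response = open_with_random_proxy('http://www.baidu.com')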

3. Timeout settings

The urlopen method was covered earlier; its third parameter is the timeout setting. You can set how long to wait before timing out, to avoid being held up by sites that respond too slowly.

For example, in the first snippet below the second parameter data is omitted, so the timeout has to be specified by name as a keyword argument; in the second, data has been passed, so there is no need to name it.

import urllib2

response = urllib2.urlopen('http://www.baidu.com', timeout=10)

import urllib2

response = urllib2.urlopen('http://www.baidu.com', data, 10)

4. Using the HTTP PUT and DELETE methods

The HTTP protocol has six kinds of request methods: GET, HEAD, PUT, DELETE, POST, and OPTIONS. Sometimes we need to make a request with the PUT or DELETE method.

PUT: this method is relatively rare, and HTML forms do not support it. In essence, PUT and POST are very similar; both send data to the server, but there is an important difference: PUT usually specifies the location of the resource, whereas POST does not, and the storage location of POSTed data is decided by the server itself.

DELETE: deletes a resource. It is also mostly rare, but some services, such as Amazon's S3 cloud service, use this method to delete resources.

If you want to use HTTP PUT and DELETE, you normally have to use the lower-level httplib library. Even so, we can make urllib2 send a PUT or DELETE request in the following way, although the occasions to use it are really few; it is just mentioned here.

import urllib2

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
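Since httplib was mentioned as the lower-level route, here is a minimal sketch of a PUT request using httplib (not from the original; the host www.example.com, the path /resource/1 and the body are placeholders):

import httplib

# connect to the target host (placeholder) and issue a PUT with a small body
conn = httplib.HTTPConnection("www.example.com")
conn.request("PUT", "/resource/1", body="some data",
             headers={"Content-Type": "text/plain"})
resp = conn.getresponse()
print resp.status, resp.reason
conn.close()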

5. Using DebugLog

You can use the following method to turn on the debug log, so that the content sent and received is printed to the screen, which is handy for debugging. It is not used very often; it is just mentioned here.

import urllib2

http_handler = urllib2.HTTPHandler(debuglevel=1)
https_handler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(http_handler, https_handler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
