[Python] Web crawler (5): urllib2 usage details and site-scraping techniques


http://blog.csdn.net/pleasecallmewhy/article/details/8925978

The previous posts gave a brief introduction to urllib2; below is a collection of details on how to use it.


1. Proxy settings

By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy.

If you want to explicitly control the proxy in your program without being affected by environment variables, you can use a ProxyHandler.

Create a new test14 to implement a simple proxy demo:

    import urllib2

    enable_proxy = True
    proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
    null_proxy_handler = urllib2.ProxyHandler({})

    if enable_proxy:
        opener = urllib2.build_opener(proxy_handler)
    else:
        opener = urllib2.build_opener(null_proxy_handler)

    urllib2.install_opener(opener)

One detail to note here is that urllib2.install_opener() sets urllib2's global opener.

This is convenient later on, but it does not allow fine-grained control, for example when you want to use two different proxy settings in the same program.

A better practice is not to use install_opener to change the global setting, but simply to call the opener's open() method instead of the global urlopen().
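For example, a minimal sketch (the proxy addresses are placeholders) of using two openers with different proxy settings in the same program, without touching the global opener:

    import urllib2

    # Two independent openers, each with its own (placeholder) proxy.
    opener_a = urllib2.build_opener(urllib2.ProxyHandler({'http': 'http://proxy-a.example.com:8080'}))
    opener_b = urllib2.build_opener(urllib2.ProxyHandler({'http': 'http://proxy-b.example.com:8080'}))

    response_a = opener_a.open('http://www.baidu.com')  # goes through proxy A
    response_b = opener_b.open('http://www.baidu.com')  # goes through proxy B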

2. Timeout settings
In old versions of Python (before Python 2.6), the urllib2 API did not expose a timeout setting; to set a timeout value, you could only change the socket module's global timeout.
    import urllib2
    import socket

    socket.setdefaulttimeout(10)  # time out after 10 seconds
    urllib2.socket.setdefaulttimeout(10)  # another way
Since Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen().
    import urllib2

    response = urllib2.urlopen('http://www.google.com', timeout=10)


3. Adding specific headers to an HTTP request

To add a header, you need to use the Request object:
    import urllib2

    request = urllib2.Request('http://www.baidu.com/')
    request.add_header('User-Agent', 'fake-client')
    response = urllib2.urlopen(request)
    print response.read()
Pay special attention to some of the headers, since the server checks them:
User-Agent: some servers or proxies use this value to determine whether the request is coming from a browser
Content-Type: when using a REST interface, the server checks this value to determine how to parse the content of the HTTP body. Common values are:
application/xml: used for XML-RPC calls, such as RESTful/SOAP
application/json: used for JSON-RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When calling a RESTful or SOAP service provided by a server, a wrong Content-Type setting can cause the server to refuse service.
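For instance, a minimal sketch (the endpoint URL and JSON body are placeholders) of setting Content-Type explicitly for a JSON call:

    import urllib2

    body = '{"method": "ping", "id": 1}'  # placeholder JSON body
    request = urllib2.Request('http://example.com/api', data=body)  # placeholder endpoint
    request.add_header('Content-Type', 'application/json')
    response = urllib2.urlopen(request)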



4. Redirect
By default, urllib2 automatically performs the redirect action for HTTP 3xx return codes, without manual configuration. To detect whether a redirect has occurred, just check whether the Response's URL and the Request's URL are consistent.
    import urllib2

    my_url = 'http://www.google.cn'
    response = urllib2.urlopen(my_url)
    redirected = response.geturl() != my_url  # True if a redirect happened
    print redirected

    my_url = 'http://rrurl.cn/b1UZuP'
    response = urllib2.urlopen(my_url)
    redirected = response.geturl() != my_url
    print redirected
If you do not want automatic redirects, then besides using the lower-level httplib library, you can also customize the HTTPRedirectHandler class:
    import urllib2

    class RedirectHandler(urllib2.HTTPRedirectHandler):
        def http_error_301(self, req, fp, code, msg, headers):
            print "301"
            pass
        def http_error_302(self, req, fp, code, msg, headers):
            print "302"
            pass

    opener = urllib2.build_opener(RedirectHandler)
    opener.open('http://rrurl.cn/b1UZuP')


5. Cookie

urllib2 also handles cookies automatically. If you need to get the value of a particular cookie item, you can do this:
    import urllib2
    import cookielib

    cookie = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    response = opener.open('http://www.baidu.com')
    for item in cookie:
        print 'Name = ' + item.name
        print 'Value = ' + item.value

After running, this prints the values of the cookies received from visiting Baidu.
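Because the handling is automatic, the same opener also sends the stored cookies back on later requests; a minimal sketch:

    import urllib2
    import cookielib

    cookie = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    opener.open('http://www.baidu.com')  # first request stores the cookies
    opener.open('http://www.baidu.com')  # second request sends them back automatically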



6. Using HTTP's PUT and DELETE methods

urllib2 only supports the HTTP GET and POST methods; if you want to use HTTP PUT and DELETE, you have to use the lower-level httplib library. Even so, we can make urllib2 send a PUT or DELETE request in the following way:
    import urllib2

    # uri and data are defined elsewhere; overriding get_method changes
    # the HTTP verb that urllib2 sends.
    request = urllib2.Request(uri, data=data)
    request.get_method = lambda: 'PUT'  # or 'DELETE'
    response = urllib2.urlopen(request)
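For example, a minimal runnable sketch (the URI is a placeholder) sending a DELETE request:

    import urllib2

    request = urllib2.Request('http://example.com/resource/1')  # placeholder URI
    request.get_method = lambda: 'DELETE'
    try:
        response = urllib2.urlopen(request)
        print response.getcode()
    except urllib2.HTTPError, e:
        print e.code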


7. Getting the HTTP return code

For 200 OK, you can get the HTTP return code simply by using the getcode() method of the response object returned by urlopen(). But for other return codes, urlopen raises an exception. In that case, you need to check the code attribute of the exception object:
    import urllib2

    try:
        response = urllib2.urlopen('http://bbs.csdn.net/why')
    except urllib2.HTTPError, e:
        print e.code

8. Debug log

When using urllib2, you can turn on the debug log with the following method; the contents of the request and response packets are then printed to the screen, which is convenient for debugging and can sometimes save you the work of capturing packets:
    import urllib2

    httpHandler = urllib2.HTTPHandler(debuglevel=1)
    httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
    opener = urllib2.build_opener(httpHandler, httpsHandler)
    urllib2.install_opener(opener)
    response = urllib2.urlopen('http://www.google.com')

This lets you see the contents of the packets being transmitted.




9. Handling forms

When you need to log in, you have to fill out a form.


First, use a tool to capture the content of the form you need to fill in.
For example, I usually use Firefox with the HttpFox plugin to see what packets I have sent.
Taking VeryCD as an example, first find the POST request that you sent, and its POST form items.
You can see that for VeryCD you need to fill in username, password, continueURI, fk, and login_submit. Of these, fk is randomly generated (actually not that random; it looks like it is generated by simply encoding the epoch time), and it has to be obtained from the web page. That means you must first visit the page and use a tool such as a regular expression to extract the fk item from the returned data (see the sketch after the code below). continueURI, as the name suggests, can be filled in casually, while login_submit is fixed, as can be seen from the page source. And username and password are obvious:

    # -*- coding: utf-8 -*-
    import urllib
    import urllib2

    postdata = urllib.urlencode({
        'username': 'Wang Xiaoguang',
        'password': 'why888',
        'continueURI': 'http://www.verycd.com/',
        'fk': '',
        'login_submit': 'Login'
    })
    req = urllib2.Request(
        url='http://secure.verycd.com/signin',
        data=postdata
    )
    result = urllib2.urlopen(req)
    print result.read()
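The snippet above leaves fk empty. A minimal sketch of fetching the page first and extracting fk with a regular expression (the pattern is an assumption; adapt it to the actual markup of the login page):

    import re
    import urllib2

    page = urllib2.urlopen('http://www.verycd.com/').read()
    # Hypothetical pattern: look for an <input> named fk and capture its value.
    match = re.search(r"name=['\"]fk['\"][^>]*value=['\"]([^'\"]+)['\"]", page)
    fk = match.group(1) if match else ''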


10. Disguising as a browser
Some websites resent visits from crawlers and refuse all requests coming from them.
In that case we need to disguise ourselves as a browser, which can be done by modifying a header in the HTTP packet:
    # ...
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(
        url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
        data=postdata,
        headers=headers
    )
    # ...

11. Dealing with "anti-hotlinking"
Some sites have so-called anti-hotlinking settings. These are actually very simple: the server checks the headers of the request you send to see whether the Referer is its own site. So we just need to change the Referer in the headers to that site. Taking cnbeta as an example:

#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
# ...

headers is a dict data structure; you can put any header you want into it, in order to do some disguising.

For example, some websites like to read the X-Forwarded-For header to find the client's real IP, and you can change X-Forwarded-For directly.
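A minimal sketch (the IP address is a placeholder) combining several disguise headers in one request:

    import urllib2

    headers = {
        'Referer': 'http://www.cnbeta.com/articles',
        'X-Forwarded-For': '1.2.3.4',  # placeholder address
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
    }
    req = urllib2.Request(url='http://www.cnbeta.com/', headers=headers)
    response = urllib2.urlopen(req)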
