[Python] Web Crawler (V): urllib2 Usage Details and Site-Scraping Tips

Earlier posts gave a brief introduction to urllib2; what follows covers some details of its use.


1. Proxy settings

By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy.

If you want to control the proxy explicitly in your program, unaffected by environment variables, you can use ProxyHandler.

Create a new test14.py to implement a simple proxy demo:

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({'http': 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)

One detail to note here is that urllib2.install_opener() sets urllib2's global opener.

Afterwards all requests use that opener, which is convenient, but it does not allow finer control, for example when the program needs to use two different proxy settings.

A better practice is not to use install_opener() to change the global settings, but simply to call the opener's open() method instead of the global urlopen().
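As a sketch of that practice, two openers with different proxy settings can coexist, each used per request through its own open() method. The proxy address is a placeholder, and the import shim only keeps the snippet runnable on Python 3, where urllib2 became urllib.request:

```python
try:
    import urllib2 as urlreq          # Python 2, as used in this article
except ImportError:
    import urllib.request as urlreq   # Python 3 equivalent

# One opener that goes through a proxy (placeholder address), one that does not.
proxy_opener = urlreq.build_opener(
    urlreq.ProxyHandler({'http': 'http://some-proxy.com:8080'}))
direct_opener = urlreq.build_opener(urlreq.ProxyHandler({}))

# Each request picks its opener explicitly; no global state is touched:
# proxy_opener.open('http://example.com/')
# direct_opener.open('http://example.com/')
```

Because install_opener() is never called, the global urlopen() behavior stays untouched.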


2. Timeout settings
In older Python (before Python 2.6), the urllib2 API did not expose a timeout setting; to set a timeout you could only change the socket module's global timeout value.

import urllib2
import socket

socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way

Since Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen().

import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)

3. Add a specific Header to the HTTP Request

To add a header, use the Request object:

import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()

Pay special attention to some of the headers, which are checked by the server:
User-Agent: some servers or proxies use this value to determine whether the request was made by a browser.
Content-Type: when using a REST interface, the server checks this value to decide how the content in the HTTP body should be parsed. Common values are:
application/xml: used in XML-RPC, e.g. RESTful/SOAP calls
application/json: used in JSON-RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When using RESTful or SOAP services provided by a server, a wrong Content-Type setting will cause the server to refuse the request.
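For instance, a JSON-RPC style request would set the header like this. The URL and body are placeholders, and the import shim only keeps the snippet runnable on Python 3, where urllib2 became urllib.request:

```python
import json
try:
    from urllib2 import Request            # Python 2, as in this article
except ImportError:
    from urllib.request import Request     # Python 3 equivalent

# Placeholder JSON-RPC body; a real service defines its own fields.
body = json.dumps({'method': 'ping', 'id': 1})
req = Request('http://example.com/api', data=body.encode('utf-8'))
req.add_header('Content-Type', 'application/json')
# Supplying data makes urllib2 send a POST rather than a GET.
```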



4. Redirect
By default, urllib2 automatically handles redirects for HTTP 3xx return codes; no manual configuration is needed. To detect whether a redirect has occurred, just check whether the URL of the response and the URL of the request are the same.

import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() == my_url
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() == my_url
print redirected

If you do not want redirects to happen automatically, besides dropping down to the lower-level httplib library, you can also customize an HTTPRedirectHandler class.

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')

5. Cookie

urllib2 handles cookies automatically as well. If you need to get the value of a particular cookie entry, you can do this:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value

After running it, the values of the cookies set by Baidu will be printed.

6. PUT and DELETE methods using HTTP

urllib2 only supports HTTP's GET and POST methods; to use HTTP PUT and DELETE you would normally have to use the lower-level httplib library. Nonetheless, we can make urllib2 send a PUT or DELETE request in the following way:

import urllib2

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)

7. Get the HTTP return code

For 200 OK, the HTTP return code can be obtained with the getcode() method of the response object returned by urlopen(). For other return codes, however, urlopen raises an exception. In that case, you should check the code attribute of the exception object:

import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code

8. Debug log

When using urllib2, the debug log can be turned on as follows. The contents sent and received are then printed to the screen for easy debugging, which sometimes saves you the work of capturing packets:

import urllib2

http_handler = urllib2.HTTPHandler(debuglevel=1)
https_handler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(http_handler, https_handler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')

This lets you see the contents of the transmitted packets.


9. Processing of forms

Logging in usually requires a form. How do you fill it out?

First, use a tool to capture the content you need to fill in.
For example, I usually use Firefox with the HttpFox plugin to see what packets I sent.
Taking VeryCD as an example, first find the POST request you issued and its POST form items.
You can see that VeryCD requires username, password, continueURI, fk and login_submit to be filled in, where fk is generated randomly (actually not that randomly; it looks like it is produced by simply encoding the epoch time) and has to be obtained from the web page. That is, you must first visit a page and use a regular expression or similar tool to extract the fk item from the returned data. continueURI, as its name implies, can be written casually, and login_submit is fixed, as can be seen from the page source. username and password are obvious:

# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'Wang Xiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',  # fill in the fk value extracted from the page
    'login_submit': 'Login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin',
    data=postdata
)
result = urllib2.urlopen(req)
print result.read()
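To obtain the fk value mentioned above, you would first fetch the sign-in page and pull the field out with a regular expression. A minimal sketch, where the page fragment, field markup and value are made up for illustration (the real VeryCD page may differ):

```python
import re

# Hypothetical page fragment containing the hidden fk field.
page = '<input type="hidden" name="fk" value="1330915200abc"/>'

# Capture whatever sits in the value attribute of the fk input.
m = re.search(r'name="fk"\s+value="([^"]+)"', page)
fk = m.group(1) if m else ''
```

The extracted fk string would then be dropped into the postdata dict before submitting the form.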

10. Disguising as a browser
Some websites resent being visited by crawlers, so they reject all requests from crawlers.
In that case we need to disguise ourselves as a browser, which can be done by modifying the headers in the HTTP packet:

# ...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers
)
# ...

11. Dealing with "anti-hotlinking"
Some sites have so-called anti-hotlinking settings, which are actually very simple:

the server checks whether the Referer in the headers of the request you sent points to its own site.

So we just need to set the Referer in the headers to that site. Taking cnBeta as an example:

# ...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
# ...


headers is a dict data structure; you can put in any header you want, in order to do some disguising.

For example, some websites like to read the X-Forwarded-For header to find the visitor's real IP; you can change X-Forwarded-For directly.
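Putting the pieces of this section together, a headers dict carrying User-Agent, Referer and X-Forwarded-For might look like this. All values are illustrative, and the import shim only keeps the snippet runnable on Python 3, where urllib2 became urllib.request:

```python
try:
    from urllib2 import Request            # Python 2, as in this article
except ImportError:
    from urllib.request import Request     # Python 3 equivalent

headers = {
    # Pretend to be a regular Firefox browser.
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; '
                  'rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    # Pretend the request came from the site's own article list.
    'Referer': 'http://www.cnbeta.com/articles',
    # Spoofed client IP, illustrative only.
    'X-Forwarded-For': '8.8.8.8',
}
req = Request('http://www.cnbeta.com/articles', headers=headers)
```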

