[Python] Web Crawler (V): urllib2 Usage Details and Website Capturing Tricks


A simple introduction to urllib2 was given earlier. The following describes some details and tricks of using urllib2.


1. Proxy Settings

By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy.

If you want to control the proxy explicitly in your program, unaffected by environment variables, you can use a ProxyHandler.

Create test14 to implement a simple proxy demo:

import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)

Note that urllib2.install_opener() sets the global opener of urllib2. That makes subsequent calls convenient, but it does not allow finer-grained control, for example when you want to use two different proxy settings in the same program. A better way is to call the opener's open method directly instead of the global urlopen, as in the following sketch.
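A minimal sketch (the proxy addresses are hypothetical):

import urllib2

# Two independent openers, each with its own proxy; neither is installed
# globally, so they can coexist in the same program.
opener_a = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://proxy-a.example.com:8080'}))
opener_b = urllib2.build_opener(
    urllib2.ProxyHandler({'http': 'http://proxy-b.example.com:8080'}))

# Each request goes through its own proxy via opener.open(),
# leaving the global urlopen untouched.
response_a = opener_a.open('http://www.example.com')
response_b = opener_b.open('http://www.example.com')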


2. Timeout Settings

In older versions of Python (before Python 2.6), the urllib2 API did not expose a timeout setting. To set a timeout, you could only change the global timeout of the socket module.

import urllib2
import socket

socket.setdefaulttimeout(10)  # timeout after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way

From Python 2.6 onwards, the timeout can be set directly through the timeout parameter of urllib2.urlopen:

import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)


3. Add a Specific Header to the HTTP Request

To add a header, use a Request object:

import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()

Pay special attention to certain headers, because the server checks them:

User-Agent: some servers or proxies use this value to determine whether the request was sent by a real browser.
Content-Type: when a REST interface is used, the server checks this value to decide how to parse the content of the HTTP body (see the sketch after this list). Common values include:

application/xml: used in XML RPC calls, such as RESTful/SOAP
application/json: used in JSON RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form

When using a RESTful or SOAP service provided by the server, a wrong Content-Type setting may cause the server to refuse the request.
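For example, a minimal sketch of posting a JSON body with the matching Content-Type (the endpoint URL is made up for illustration):

import json
import urllib2

# Hypothetical endpoint; the point is matching Content-Type to the body.
data = json.dumps({'key': 'value'})
request = urllib2.Request('http://www.example.com/api', data=data)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)
print response.read()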



4. Redirect
By default, urllib2 automatically follows redirects on HTTP 3xx return codes, without any manual configuration. To check whether a redirect has happened, simply check whether the response URL and the request URL are the same.

import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if a redirect happened
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url
print redirected

If you do not want automatic redirects, then besides using the lower-level httplib library, you can also customize the HTTPRedirectHandler class:

import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')


5. Cookie

urllib2 handles cookies automatically. To obtain the value of a cookie, you can do this:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value

After running this, the cookies set while accessing Baidu are printed.


6. Using the HTTP PUT and DELETE Methods

urllib2 supports only the HTTP GET and POST methods. To use HTTP PUT and DELETE you would normally have to fall back on the lower-level httplib library. Even so, we can make urllib2 issue a PUT or DELETE request in the following way:

import urllib2

# uri and data are assumed to have been defined beforehand
request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)


7. Get the HTTP return code

For 200 OK, you only need to call the getcode() method of the response object returned by urlopen to get the HTTP return code. For other return codes, however, urlopen raises an exception, and you then need to check the code attribute of the exception object:

import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
    print response.getcode()  # 200 when the request succeeds
except urllib2.HTTPError, e:
    print e.code  # the return code of the failed request

8. Debug Log

When using urllib2, you can turn on the debug log as follows. The content of the packets sent and received is then printed to the screen, which is convenient for debugging and can sometimes save you the work of capturing packets.

import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)

urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')

In this way, we can see the content of the transmitted packets.

9. Form Processing

What if you need to fill in a form to log on?

First, use a tool to capture the content of the form you need to fill in. For example, I usually use Firefox with the HttpFox plug-in to see which packets were actually sent. Taking verycd as an example, first find your POST request and the POST form items.

You can see that for verycd you need to fill in the username, password, continueURI, fk, and login_submit items, where fk is generated dynamically (it is not really random; it looks as if it is generated from the epoch time by a simple encoding). You have to obtain fk from the web page itself, which means you must first request the page and use a regular expression or a similar tool to extract the fk item from the returned data. As the name suggests, continueURI can be filled in arbitrarily, while login_submit is fixed, as can be seen from the page source. Then there are username and password, which are obvious:

# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'wangxiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': 'login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin',
    data=postdata
)
result = urllib2.urlopen(req)
print result.read()
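How fk is actually obtained is site-specific. A rough sketch, assuming the value can be pulled out of the sign-in page with a regular expression (the pattern below is an assumption for illustration, not verycd's real markup):

import re
import urllib2

# Fetch the sign-in page first, then extract fk from the returned HTML.
# The pattern is hypothetical; inspect the real page source to find the
# actual markup surrounding fk.
page = urllib2.urlopen('http://secure.verycd.com/signin').read()
match = re.search(r"name=['\"]fk['\"]\s+value=['\"]([^'\"]+)['\"]", page)
fk = match.group(1) if match else ''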


10. Disguising as Browser Access

Some websites dislike visits from crawlers and therefore reject their requests. In this case, we need to pretend to be a browser, which can be done by modifying the headers in the HTTP packet:

#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
#...

11. Dealing with "Anti-Leeching"

Some sites have so-called anti-leeching settings. Put simply, the server checks the Referer header of the request you send to see whether it points to the site itself. So we only need to set the Referer in headers to that website. Take cnbeta as an example:

#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
#...

headers is a dict data structure, and you can put in any header you want for disguise.

For example, some websites like to read X-Forwarded-For from the headers to find out the client's real IP address, so you can change X-Forwarded-For directly, as in the minimal sketch below.
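A minimal sketch (the IP address is made up for illustration):

#...
headers = {
    'X-Forwarded-For': '8.8.8.8'  # made-up address for illustration
}
#...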
