Writing crawlers in Python with urllib2
This post collects some of the details of using urllib2.
1. Proxy settings
By default, urllib2 uses the environment variable http_proxy to set the HTTP proxy.
If you want to control the proxy explicitly in your program, without being affected by environment variables, you can use a ProxyHandler.
Create a new test14 to implement a simple proxy demo:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)
One detail worth noting: urllib2.install_opener() sets urllib2's global opener.
This is convenient later on, but it does not allow fine-grained control, for example when you want to use two different proxy settings in the same program.
A better practice is not to change the global settings with install_opener(), but simply to call the opener's open() method instead of the global urlopen().
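As a minimal sketch of that approach (the target URL is a placeholder), the proxy-enabled opener can be used directly while the global urlopen() stays untouched:

import urllib2

# Build a dedicated opener instead of installing it globally
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
opener = urllib2.build_opener(proxy_handler)

# Call the opener's open() directly; urllib2.urlopen() keeps its default behavior
response = opener.open('http://www.example.com')
print response.read()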
2. Timeout settings
In older versions of Python (before Python 2.6), the urllib2 API did not expose a timeout setting. To set a timeout, you could only change the global timeout value of the socket module.
import urllib2
import socket

socket.setdefaulttimeout(10)  # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way to do the same thing
Since Python 2.6, the timeout can be set directly via the timeout parameter of urllib2.urlopen().
import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)
3. Adding specific headers to an HTTP request
To add headers, you need to use the Request object:
import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()
Pay special attention to certain headers, because the server checks them:
User-Agent: some servers or proxies use this value to determine whether the request was made by a browser.
Content-Type: when calling a REST interface, the server checks this value to determine how the content in the HTTP body should be parsed.
Common values are:
application/xml: used in XML-RPC calls, such as RESTful/SOAP
application/json: used in JSON-RPC calls
application/x-www-form-urlencoded: used when a browser submits a web form
When using a RESTful or SOAP service provided by a server, a wrong Content-Type setting will cause the server to refuse the request.
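As a minimal sketch of the Content-Type point, assuming a hypothetical JSON endpoint (the URL and payload are placeholders, not from the original article), a JSON body could be posted like this:

import json
import urllib2

# Hypothetical endpoint and payload, shown only to illustrate setting Content-Type
payload = json.dumps({'name': 'test'})
request = urllib2.Request('http://www.example.com/api', data=payload)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)
print response.read()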
4. Redirects
By default, urllib2 automatically follows redirects for HTTP 3XX return codes; no manual configuration is needed.
To detect whether a redirect has occurred, simply check whether the URL of the response matches the URL of the request.
import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if a redirect occurred
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if a redirect occurred
print redirected
If you do not want redirects to be followed automatically, besides using the lower-level httplib library, you can also define your own HTTPRedirectHandler subclass.
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass

    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')
5. Cookies
urllib2 also handles cookies automatically. If you need to get the value of a particular cookie entry, you can do the following:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
After running it, the cookies set by Baidu will be printed.
6. Using the HTTP PUT and DELETE methods
urllib2 only supports the HTTP GET and POST methods; if you need HTTP PUT or DELETE, you would normally have to use the lower-level httplib library. Still, we can make urllib2 issue a PUT or DELETE request in the following way:
import urllib2

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
7. Getting the HTTP return code
For 200 OK, just use the getcode() method of the response object returned by urlopen() to get the HTTP return code. For other return codes, however, urlopen raises an exception, and you should then check the code attribute of the exception object:
import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code
8. Debug log
When using urllib2, the debug log can be turned on as follows. The request and response contents are then printed to the screen, which makes debugging easier and can sometimes save you the trouble of capturing packets:
import urllib2

http_handler = urllib2.HTTPHandler(debuglevel=1)
https_handler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(http_handler, https_handler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
This lets you see the contents of the transmitted packets.
9. Handling forms
How do you fill out a form that is required for logging in?
First, use a tool to capture the content of the form you need to fill in.
For example, I usually use Firefox with the HttpFox plugin to see what packets I send.
Taking VeryCD as an example, first find the POST request you send and its form fields.
For VeryCD you need to fill in username, password, continueURI, fk and login_submit. The fk field is generated randomly (actually not that randomly; it looks like it is derived from the epoch time by a simple algorithm) and has to be retrieved from the web page, which means you must visit the page first and extract the fk value from the returned data, for example with a regular expression. continueURI, as the name suggests, can be set to anything, and login_submit is fixed, as can be seen from the page source. username and password are obvious:
# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'Wang Xiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': 'login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin',
    data=postdata
)
result = urllib2.urlopen(req)
print result.read()
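The fk value mentioned above has to be scraped from the sign-in page before building postdata. A minimal sketch of that step, assuming fk appears as a hidden form field (the URL and the regular expression are guesses, not the actual page structure):

import re
import urllib2

# Hypothetical extraction of the fk field from the sign-in page.
# The regex below is only a guess at the page markup, not the real pattern.
page = urllib2.urlopen('http://www.verycd.com/signin').read()
match = re.search(r'name="fk"\s+value="([^"]+)"', page)
fk = match.group(1) if match else ''
print fk

The extracted value would then replace the empty 'fk' entry in postdata above.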
10. Masquerading as a browser
Some sites dislike being visited by crawlers, so they reject all requests from them.
In that case we need to disguise ourselves as a browser, which can be done by modifying the headers in the HTTP packet:
# ...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers
)
# ...
11. Dealing with "anti-hotlinking"
Some sites have so-called anti-hotlinking measures, which are in fact very simple: the server checks the Referer header of your request to see whether it refers to the site itself. So we just need to change the Referer in the headers to that site. Taking cnBeta as an example:
# ...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
# ...
headers is a dict; you can put in whatever headers you want in order to disguise the request.
For example, some sites like to read the X-Forwarded-For header to find out the visitor's real IP, so you can change X-Forwarded-For directly.
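As a minimal sketch of that kind of camouflage, assuming we spoof both Referer and X-Forwarded-For (the IP address and target URL are placeholders, not from the original article):

import urllib2

# Placeholder values, shown only to illustrate header camouflage
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8'
}
req = urllib2.Request(url='http://www.cnbeta.com/', headers=headers)
response = urllib2.urlopen(req)
print response.read()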
Source:
0 Basic Write Python Crawler: urllib2 Usage Guide
http://www.lai18.com/content/384669.html