Using urllib2 to Write Python Crawlers
This post collects the usage details of urllib2.
1. Proxy Settings
By default, urllib2 uses the environment variable http_proxy to set its HTTP proxy.
If you want to control the proxy explicitly in your program, without being affected by environment variables, you can use a ProxyHandler.
Create test14 to implement a simple proxy demo:
import urllib2

enable_proxy = True
proxy_handler = urllib2.ProxyHandler({"http": 'http://some-proxy.com:8080'})
null_proxy_handler = urllib2.ProxyHandler({})

if enable_proxy:
    opener = urllib2.build_opener(proxy_handler)
else:
    opener = urllib2.build_opener(null_proxy_handler)

urllib2.install_opener(opener)
Note that urllib2.install_opener() sets urllib2's global opener.
Later calls become convenient this way, but the control is not fine-grained, for example when you want to use two different proxy settings within the same program.
A better practice is not to install a global opener and instead call the opener's open method directly in place of the global urlopen method.
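For example, a minimal sketch with two independent openers (the proxy addresses are made-up placeholders):

import urllib2

# two independent openers with different proxy settings (placeholder addresses)
opener_a = urllib2.build_opener(urllib2.ProxyHandler({"http": 'http://proxy-a.example.com:8080'}))
opener_b = urllib2.build_opener(urllib2.ProxyHandler({"http": 'http://proxy-b.example.com:8080'}))

# call each opener's open() directly; no global opener is installed
response_a = opener_a.open('http://www.baidu.com')
response_b = opener_b.open('http://www.baidu.com')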
2. Timeout settings
In old versions of Python (before Python 2.6), the urllib2 API did not expose a timeout setting. To set a timeout, you could only change the global timeout of the socket module.
import urllib2
import socket

socket.setdefaulttimeout(10)          # time out after 10 seconds
urllib2.socket.setdefaulttimeout(10)  # another way to do the same thing
Since Python 2.6, the timeout can be set directly through the timeout parameter of urllib2.urlopen.
import urllib2

response = urllib2.urlopen('http://www.google.com', timeout=10)
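If the timeout expires, urlopen raises an exception instead of returning, so it is worth catching it; a minimal sketch, with the same example URL:

import socket
import urllib2

try:
    response = urllib2.urlopen('http://www.google.com', timeout=10)
    print response.read()
except urllib2.URLError, e:
    # a connection that times out usually surfaces as a URLError wrapping socket.timeout
    print 'request failed:', e.reason
except socket.timeout:
    # a timeout while reading the response body may raise socket.timeout directly
    print 'request timed out'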
3. Add a specific Header to the HTTP Request
To add a header, you must use the Request object:
import urllib2

request = urllib2.Request('http://www.baidu.com/')
request.add_header('User-Agent', 'fake-client')
response = urllib2.urlopen(request)
print response.read()
Pay special attention to some of the headers, because the server checks them:
User-Agent: some servers or proxies use this value to determine whether the request was sent by a browser.
Content-Type: when a REST interface is used, the server checks this value to decide how to parse the content in the HTTP body. Common values include:
application/xml: used for XML RPC, such as RESTful/SOAP calls
application/json: used for JSON RPC calls
application/x-www-form-urlencoded: used when the browser submits a web form
When calling a RESTful or SOAP service provided by a server, a wrong Content-Type setting may cause the server to refuse the request.
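As an illustration, here is a minimal sketch of a JSON POST with an explicit Content-Type; the endpoint URL and payload are made up for the example:

import json
import urllib2

# hypothetical endpoint and payload, for illustration only
payload = json.dumps({'name': 'test', 'value': 1})
request = urllib2.Request('http://example.com/api/items', data=payload)
request.add_header('Content-Type', 'application/json')
response = urllib2.urlopen(request)
print response.read()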
4. Redirect
By default, urllib2 automatically follows redirects for HTTP 3XX status codes, without any manual configuration. To check whether a redirect has occurred, you only need to check whether the response URL and the request URL are the same.
import urllib2

my_url = 'http://www.google.cn'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if the URL changed, i.e. a redirect happened
print redirected

my_url = 'http://rrurl.cn/b1UZuP'
response = urllib2.urlopen(my_url)
redirected = response.geturl() != my_url  # True if the URL changed, i.e. a redirect happened
print redirected
If you do not want redirects to be followed automatically, then besides using the lower-level httplib library, you can also customize an HTTPRedirectHandler class.
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_301(self, req, fp, code, msg, headers):
        print "301"
        pass
    def http_error_302(self, req, fp, code, msg, headers):
        print "302"
        pass

opener = urllib2.build_opener(RedirectHandler)
opener.open('http://rrurl.cn/b1UZuP')
5. Cookie
urllib2 can also handle cookies for you automatically, through an HTTPCookieProcessor. To obtain the value of a cookie, you can do the following:
import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('http://www.baidu.com')
for item in cookie:
    print 'Name = ' + item.name
    print 'Value = ' + item.value
After running this, the cookie values set when visiting Baidu are printed.
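Because the CookieJar is attached to the opener, later requests made through the same opener send the stored cookies back automatically; a minimal sketch, again using Baidu as the example:

import urllib2
import cookielib

cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))

opener.open('http://www.baidu.com')             # the first request stores the cookies set by the server
response = opener.open('http://www.baidu.com')  # the second request sends them back automatically
print response.getcode()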
6. Use the PUT and DELETE methods of HTTP
urllib2 only supports the HTTP GET and POST methods. To use HTTP PUT or DELETE, you would normally have to fall back to the lower-level httplib library. Even so, we can make urllib2 send a PUT or DELETE request in the following way:
import urllib2

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # or 'DELETE'
response = urllib2.urlopen(request)
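A small usage sketch of the same trick (the resource URL and request body are hypothetical placeholders):

import urllib2

# hypothetical resource URL and request body, for illustration only
uri = 'http://example.com/api/items/1'
data = 'name=test&value=1'

request = urllib2.Request(uri, data=data)
request.get_method = lambda: 'PUT'  # override the method urllib2 would otherwise pick
response = urllib2.urlopen(request)
print response.getcode()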
7. Get the HTTP return code
For 200 OK, you only need to call the getcode() method of the response object returned by urlopen to obtain the HTTP status code. For other status codes, however, urlopen raises an exception, and you need to check the code attribute of the exception object:
import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
except urllib2.HTTPError, e:
    print e.code
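Combining both cases in one minimal sketch, with the same example URL:

import urllib2

try:
    response = urllib2.urlopen('http://bbs.csdn.net/why')
    print response.getcode()   # 200 for a successful request
except urllib2.HTTPError, e:
    print e.code               # the status code of the error response, e.g. 404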
8. Debug Log
When using urllib2, you can turn on the debug log with the following method. The content of the packets sent and received is then printed to the screen, which is convenient for debugging and sometimes saves you the work of capturing packets.
import urllib2

httpHandler = urllib2.HTTPHandler(debuglevel=1)
httpsHandler = urllib2.HTTPSHandler(debuglevel=1)
opener = urllib2.build_opener(httpHandler, httpsHandler)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.google.com')
In this way, we can see the content of the packets being transmitted.
9. Form Processing
What if you need to fill in a form in order to log in?
First, use a tool to capture the content of the form you need to fill in.
For example, I usually use Firefox with the HttpFox plug-in to see which packets I have actually sent.
Take verycd as an example: first find the POST request you send and the POST form fields.
You can see that for verycd you need to submit username, password, continueURI, fk, and login_submit, where fk is generated randomly (actually it is not that random; it looks as if it is produced from the epoch time by a simple encoding). This means you must obtain fk from the web page first: fetch the page and use a regular expression or a similar tool to extract the fk field from the returned data (a sketch of this step follows the login code below). continueURI, as the name suggests, can be anything, while login_submit is fixed, as can be seen from the page source. Then there are username and password, which are obvious:
# -*- coding: utf-8 -*-
import urllib
import urllib2

postdata = urllib.urlencode({
    'username': 'wangxiaoguang',
    'password': 'why888',
    'continueURI': 'http://www.verycd.com/',
    'fk': '',
    'login_submit': 'login'
})
req = urllib2.Request(
    url='http://secure.verycd.com/signin',
    data=postdata
)
result = urllib2.urlopen(req)
print result.read()
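A minimal sketch of the "fetch the page first, then extract fk" step described above; the regular expression is only a hypothetical pattern and must be adjusted to whatever the real page source actually contains:

import re
import urllib2

# fetch the login page first
page = urllib2.urlopen('http://secure.verycd.com/signin').read()

# hypothetical pattern, for illustration only; inspect the real page source to find the right one
match = re.search(r"fk['\"]?\s*[:=]\s*['\"]([^'\"]+)['\"]", page)
fk = match.group(1) if match else ''
print fk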
10. Disguising as a Browser
Some websites dislike being visited by crawlers and reject all requests from them.
In this case, we need to pretend to be a browser, which can be done by modifying the headers in the HTTP packet:
#...
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url='http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data=postdata,
    headers=headers
)
#...
11. Dealing with "Anti-Leeching"
Some sites have so-called anti-leeching settings. Put simply, the server checks whether the Referer in the headers of the request you sent points to the site itself, so we only need to set the Referer in the headers to that website. Take cnbeta as an example:
#...
headers = {
    'Referer': 'http://www.cnbeta.com/articles'
}
#...
headers is a dict data structure, and you can put in any header you want in order to disguise the request.
For example, some websites like to read X-Forwarded-For from the headers to find out the client's real IP address, so you can change X-Forwarded-For directly.
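For example (a minimal sketch; the IP address and URL are placeholders):

import urllib2

# placeholder header values, for illustration only
headers = {
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8',
}
req = urllib2.Request('http://www.cnbeta.com/articles', headers=headers)
response = urllib2.urlopen(req)
print response.getcode()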
Reference source:
Writing Python Crawlers from Scratch: A Guide to Using urllib2
http://www.lai18.com/content/384669.html