A summary of tips for scraping websites with Python crawlers.
I have been using Python for a little more than three months, and the things I write most often are crawler scripts: a script that grabs proxies and verifies them locally, scripts for automatic login and automatic posting on a Discuz forum, a script for automatically receiving mail, a script for simple CAPTCHA recognition, and a script for grabbing Google Music; that last one turned out to be unnecessary, since the excellent gmbox already exists.
These scripts have one thing in common: they are all web-related and always need some way of fetching pages. On top of that, the half-crawler, half-site simplecd project has given me plenty of site-scraping experience, so I am summarizing it here so the work never has to be repeated.
1. The most basic page fetch
import urllib2
content = urllib2.urlopen('http://XXXX').read()
2. Use a Proxy Server
This is useful in some situations, for example when your IP address has been blocked, or when the number of requests allowed per IP is limited.
import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
3. Pages that require login
Login is more troublesome, so let me break the problem into parts:
3.1 Cookie processing
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
That's it. If you want to use a proxy and cookies at the same time, add proxy_support and change the opener to:
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
3.2 Form Processing
Do you need to fill in a form to log in? First use a tool to capture the form fields you are expected to submit.
For example, I usually use Firefox with the HttpFox plug-in to see exactly what I sent.
Take VeryCD as an example. First find your own POST request and the fields of the POST form:
You can see that VeryCD requires username, password, continueURI, fk, and login_submit. Of these, fk is generated randomly (actually not so randomly; it looks like it is derived from the epoch time by some simple encoding) and has to be obtained from the web page, which means you must first fetch the page and extract the fk field from the returned data with a regular expression or a similar tool. continueURI, as the name suggests, can be anything; login_submit is fixed, as the page source shows; and then there are username and password, which speak for themselves.
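As a rough illustration of that first step, here is a minimal sketch of fetching the signin page and pulling out fk with a regular expression. The pattern used here (an fk input written as name="fk" value="...") is only a guess at the page's markup and would have to be adjusted to the real HTML:

import re
import urllib2

# fetch the signin page first, because fk is embedded somewhere in it
html = urllib2.urlopen('http://secure.verycd.com/signin/*/http://www.verycd.com/').read()

# hypothetical pattern; check the actual page source and adapt it
match = re.search(r'name="fk"\s+value="([^"]+)"', html)
if match:
    fk = match.group(1)
else:
    raise ValueError('fk not found, the regular expression needs adjusting')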
Good. With the data to fill in settled, we need to generate the postdata:
import urllib
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': 'login'
})
Then build an HTTP request and send it:
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata
)
result = urllib2.urlopen(req).read()
3.3 Disguising as a browser
Some websites dislike visits from crawlers, so they simply reject all crawler requests. In that case we need to pretend to be a browser, which can be done by modifying the headers of the HTTP request:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
3.4 Anti-leeching
Some sites have so-called anti-leeching protection, which in fact is very simple: they check whether the Referer in the headers of your request is their own site. So, just as in 3.3, all we need to do is set the Referer in the headers to that site. Take cnbeta, a site known for its exposés, as an example:
headers = { 'Referer':'http://www.cnbeta.com/articles'}
headers is a dict, so you can put in whatever headers you want for disguise. For example, some clever websites always like to pry into people's privacy: if someone visits through a proxy, they insist on reading X-Forwarded-For from the headers to learn the visitor's real IP. Fine, then let's just set X-Forwarded-For ourselves; change it to something amusing and have a little fun at their expense.
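For instance, a minimal sketch combining the Referer above with a spoofed X-Forwarded-For and the User-Agent from 3.3 (the IP address is of course made up):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '1.2.3.4',   # a made-up "real IP" for nosy sites to log
}
req = urllib2.Request(url='http://www.cnbeta.com/articles', headers=headers)
content = urllib2.urlopen(req).read()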
3.5 The ultimate trick
Sometimes, even after doing 3.1-3.4, access is still blocked, and there is nothing for it but to faithfully copy every header you saw in httpfox into the request; that usually settles it. If even that fails, there is only the ultimate trick: selenium, which drives a real browser to visit the site, so whatever the browser can do, the crawler can do too. Similar tools include pamie, watir, and so on.
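As a taste of the selenium route, here is a minimal sketch. Note that it uses the later WebDriver-style API, which may differ from the selenium of this article's era, and it assumes Firefox plus its driver are installed:

from selenium import webdriver

browser = webdriver.Firefox()           # a real browser handles headers, cookies and JavaScript itself
browser.get('http://www.verycd.com/')   # whatever the browser can open, we can read
content = browser.page_source           # the rendered HTML
browser.quit()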
4. Multithreaded concurrent fetching
If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template. This program simply prints 1-10, but you can see that it does so concurrently.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is how many jobs there are
q = Queue()
NUM = 2
JOBS = 10

# the handler function, responsible for processing a single task
def do_somthing_using(arguments):
    print arguments

# this is the worker: get data from the queue and process it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads to wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to complete
q.join()
5. Handling CAPTCHAs
What should you do when you run into a CAPTCHA? There are two situations:
- Google-style CAPTCHAs: nothing to be done; give up.
- Simple CAPTCHAs: a limited character set, only simple translation or rotation plus some noise, and no distortion. These can still be handled. The general idea is to rotate the characters back, remove the noise, segment the individual characters, then use a feature-extraction method (such as PCA) to reduce dimensionality and build a feature library, and finally match the CAPTCHA against the library. This is fairly involved and will not fit in one blog post, so I won't expand on it here; consult a proper textbook for details (a rough sketch of the simplest variant follows after this list).
- In fact, some CAPTCHAs are still very weak (I won't name names here), and I have recognized them with very high accuracy using the second approach, so it really is feasible.
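For the weak-CAPTCHA case, here is a very rough sketch of the simplest possible variant: binarize the image, cut it into fixed-width character cells, and match each cell against a small hand-built library of labelled templates by pixel difference. It deliberately skips the rotation, denoising and PCA steps described above, assumes PIL is available, and assumes the characters sit at fixed positions; templates is a hypothetical dict mapping each character to a binarized template image of the same cell size:

from PIL import Image

def binarize(img, threshold=140):
    # grayscale, then pure black/white
    return img.convert('L').point(lambda p: 255 if p > threshold else 0)

def cell_distance(a, b):
    # count differing pixels between two equally sized black/white images
    return sum(1 for x, y in zip(a.getdata(), b.getdata()) if x != y)

def recognize(path, templates, chars=4, cell_w=20):
    img = binarize(Image.open(path))
    h = img.size[1]
    result = ''
    for i in range(chars):
        cell = img.crop((i * cell_w, 0, (i + 1) * cell_w, h))
        # pick the template whose pixels differ the least from this cell
        best = min(templates, key=lambda c: cell_distance(cell, templates[c]))
        result += best
    return result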
6. gzip/deflate support
Web pages nowadays generally support gzip compression, which often cuts transfer time dramatically. Take the VeryCD home page as an example: the uncompressed version is 247 KB and the compressed version 45 KB, about 1/5 of the original, which means fetching is roughly 5 times faster.
However, Python's urllib/urllib2 does not support compression by default. To get a compressed response you have to declare 'Accept-Encoding' in the request headers yourself, and after reading the response you have to check for a 'Content-Encoding' header to decide whether decoding is needed. Tedious and trivial. How can we make urllib2 support gzip and deflate automatically?
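For reference, this is roughly what the tedious manual version looks like: declare Accept-Encoding yourself and decompress only when the response says it was gzipped. It is just a sketch of the manual approach described above, not the handler that follows:

import urllib2
from gzip import GzipFile
from StringIO import StringIO

request = urllib2.Request('http://www.verycd.com/')
request.add_header('Accept-Encoding', 'gzip')        # ask the server for a compressed response

response = urllib2.urlopen(request)
data = response.read()
if response.headers.get('Content-Encoding') == 'gzip':
    data = GzipFile(fileobj=StringIO(data)).read()   # decode only if the server actually compressed it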
In fact, you can inherit from the BaseHandler class and then let the build_opener method take care of it:
import urllib2
from gzip import GzipFile
from StringIO import StringIO

class ContentEncodingProcessor(urllib2.BaseHandler):
    """A handler to add gzip capabilities to urllib2 requests"""

    # add headers to requests
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    # decode
    def http_response(self, req, resp):
        old_resp = resp
        # gzip
        if resp.headers.get("content-encoding") == "gzip":
            gz = GzipFile(fileobj=StringIO(resp.read()), mode="r")
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        # deflate
        if resp.headers.get("content-encoding") == "deflate":
            gz = StringIO(deflate(resp.read()))
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)  # 'class to add info() and geturl() methods to an open file.'
            resp.msg = old_resp.msg
        return resp

# deflate support
import zlib
def deflate(data):
    # zlib only provides the zlib compress format, not the deflate format;
    try:                            # so on top of all there's this workaround:
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        return zlib.decompress(data)
Then it's easy:
encoding_support = ContentEncodingProcessor
opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)

# open the page directly with this opener; if the server supports gzip/deflate it is decompressed automatically
content = opener.open(url).read()
7. More convenient Multithreading
The previous section did give a simple multithreading template, but plugging that thing into a real program only makes the code fragmented and ugly, so I put some thought into how to make multithreading more convenient. First, what would the most convenient way to make multithreaded calls look like?
1. Asynchronous I/O fetching with twisted
In fact, you don't need multithreading to fetch more efficiently; you can also use asynchronous I/O: call twisted's getPage method directly and attach callback and errback methods to run when the asynchronous I/O completes. For example, you can do this:
from twisted.web.client import getPage
from twisted.internet import reactor

links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]

def parse_page(data, url):
    print len(data), url

def fetch_error(error, url):
    print error.getErrorMessage(), url

# fetch the links in batch
for url in links:
    d = getPage(url, timeout=5)
    d.addCallback(parse_page, url)    # called if the fetch succeeds
    d.addErrback(fetch_error, url)    # called if the fetch fails

reactor.callLater(5, reactor.stop)    # tell the reactor to stop the program after 5 seconds
reactor.run()
Twisted lives up to its name: code written with it is simply too twisted for a normal mind to accept. Although this simple example looks all right, every time I write a twisted program I come out of it contorted and exhausted; the documentation is as good as nonexistent, and you have to read the source to figure out how to do anything. Forget it.
And if you want gzip/deflate support, or even some login extension, you have to write a new HTTPClientFactory class or the like for twisted. My brow furrows just thinking about it, so I gave up. If you have the perseverance, please try it yourself.
There is an article describing how to use twisted to fetch batches of web pages; it is quite readable if you are interested.
2. Designing a simple multithreaded fetching class
I still feel more comfortable staying inside Python's "native" territory, things like urllib. Think about it: if there were a Fetcher class, you could call it like this:
f = Fetcher(threads=10)    # set the number of download threads to 10
for url in urls:
    f.push(url)            # push all the urls into the download queue
while f.taskleft():        # while there are still unfinished download tasks
    content = f.pop()      # take a result out of the completed-download queue
    do_with(content)       # process the content
A multithreaded interface like this is simple and clear, so let's design it that way. First we need two queues, both handled with Queue; the basic multithreading structure is similar to the thread-pool template in section 4. The push and pop methods are easy to handle, since they simply wrap the corresponding Queue methods, and taskleft returns true if there is any task "in flight" or "still in the queue". The code is as follows:
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time

class Fetcher:
    def __init__(self, threads):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.lock = Lock()       # thread lock
        self.q_req = Queue()     # task queue
        self.q_ans = Queue()     # completion queue
        self.threads = threads
        for i in range(threads):
            t = Thread(target=self.threadget)
            t.setDaemon(True)
            t.start()
        self.running = 0

    def __del__(self):           # wait for the two queues to complete
        time.sleep(0.5)
        self.q_req.join()
        self.q_ans.join()

    def taskleft(self):
        return self.q_req.qsize() + self.q_ans.qsize() + self.running

    def push(self, req):
        self.q_req.put(req)

    def pop(self):
        ans = self.q_ans.get()
        self.q_ans.task_done()   # mark the answer as consumed so the join() in __del__ can return
        return ans

    def threadget(self):
        while True:
            req = self.q_req.get()
            with self.lock:      # keep the counter update atomic: enter the critical section
                self.running += 1
            try:
                ans = self.opener.open(req).read()
            except Exception, what:
                ans = ''
                print what
            self.q_ans.put((req, ans))
            with self.lock:
                self.running -= 1
            self.q_req.task_done()
            time.sleep(0.1)      # don't spam

if __name__ == "__main__":
    links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]
    f = Fetcher(threads=10)
    for url in links:
        f.push(url)
    while f.taskleft():
        url, content = f.pop()
        print url, len(content)
8. Some miscellaneous experience
1. Connection Pool:
Each call to opener.open, like urllib2.urlopen, creates a new HTTP request. Usually this is not a problem, because in a single-threaded setting you might issue one request per second, but in a multithreaded setting you can issue hundreds of requests per second, and after just a few minutes any normally tempered server will have blocked you.
In ordinary HTML browsing, however, keeping dozens of simultaneous connections to a server is perfectly normal, so you can maintain an HttpConnection pool by hand and pick a connection from the pool for each fetch.
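A bare-bones sketch of that idea, with httplib connections parked in a Queue; error handling, detection of connections the server has closed, and compressed responses are all ignored here, and the host name is just the running example:

import httplib
from Queue import Queue

class ConnectionPool:
    def __init__(self, host, size=10):
        self.pool = Queue()
        for i in range(size):
            self.pool.put(httplib.HTTPConnection(host))

    def fetch(self, path):
        conn = self.pool.get()       # borrow a connection from the pool
        try:
            conn.request('GET', path)
            data = conn.getresponse().read()
        finally:
            self.pool.put(conn)      # always hand the connection back
        return data

pool = ConnectionPool('www.verycd.com')
print pool.fetch('/')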
An even lazier approach is to use squid as the proxy server for crawling: squid automatically maintains the connection pool for you and throws in a data cache as well, and squid is something I install on every server anyway, so why reinvent the connection-pool wheel?
2. Set the stack size of the thread.
The stack-size setting has a significant effect on Python's memory usage. If this value is not set when Python spawns many threads, the program occupies a huge amount of memory, which is fatal on a memory-constrained openvz VPS. stack_size must be greater than 32768; in practice it needs to be at least 32768*2.
from threading import stack_size
stack_size(32768*16)
3. Automatic retry on failure
def get(self, req, retries=3):
    try:
        response = self.opener.open(req)
        data = response.read()
    except Exception, what:
        print what, req
        if retries > 0:
            return self.get(req, retries - 1)
        else:
            print 'GET Failed', req
            return ''
    return data
4. Set timeout
import socket
socket.setdefaulttimeout(10)  # time out if the connection takes more than 10 seconds
5. Login
Login is easier to simplify. First, build_opener must include cookie support, as in section 3.1 above. Then, to log in to VeryCD, add an empty login method to Fetcher and call it in __init__, then subclass Fetcher and override the login method:
def login(self, username, password):
    import urllib
    data = urllib.urlencode({
        'username': username,
        'password': password,
        'continue': 'http://www.verycd.com/',
        'login_submit': u'登录'.encode('utf-8'),
        'save_cookies': 1,
    })
    url = 'http://www.verycd.com/signin'
    self.opener.open(url, data).read()
That way, Fetcher logs in to the VeryCD site automatically as soon as it is initialized.
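A sketch of what that arrangement could look like. The base class here is a trimmed-down stand-in for the Fetcher of section 7 (no threads, just the cookie-aware opener and the empty login hook), and VeryCDFetcher is a hypothetical subclass reusing the form fields from the login method above:

import urllib, urllib2, cookielib

class Fetcher(object):
    def __init__(self, threads=10):
        cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
        self.opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
        self.threads = threads
        self.login()                      # empty hook, overridden by subclasses

    def login(self):
        pass

class VeryCDFetcher(Fetcher):
    def __init__(self, username, password, threads=10):
        self.username = username
        self.password = password
        Fetcher.__init__(self, threads)   # __init__ ends by calling self.login()

    def login(self):
        data = urllib.urlencode({
            'username': self.username,
            'password': self.password,
            'continue': 'http://www.verycd.com/',
            'login_submit': u'登录'.encode('utf-8'),
            'save_cookies': 1,
        })
        self.opener.open('http://www.verycd.com/signin', data).read()

# f = VeryCDFetcher('someuser', 'somepass')   # logged in as soon as it is constructed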
9. Summary
Putting together every trick mentioned above gets you pretty close to the Fetcher class in the final version of my private library. It supports multithreading, gzip/deflate compression, timeouts, automatic retry, stack-size tuning, automatic login, and more; the code is simple, easy to use, and performs well: an indispensable companion at home, on the road and, ahem, for all kinds of mischief.
The reason it is only "pretty close" to the final version is that the final version keeps one more feature in reserve, "sockpuppetry": automatic selection among multiple proxies. It looks like nothing more than a random.choice, but it actually involves proxy harvesting, proxy validation, proxy speed testing, and several other steps; that is another story.
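Just to illustrate the random.choice half of that story, a minimal sketch; the proxy list itself, and the harvesting, validation and speed testing that produce it, are exactly the parts being skipped, so the addresses here are placeholders:

import random
import urllib2

# a hand-maintained list of proxies that have already been verified elsewhere
proxies = ['http://XX.XX.XX.XX:XXXX', 'http://YY.YY.YY.YY:YYYY']

def random_opener(proxies):
    # pick one proxy at random for this opener
    proxy_support = urllib2.ProxyHandler({'http': random.choice(proxies)})
    return urllib2.build_opener(proxy_support, urllib2.HTTPHandler)

content = random_opener(proxies).open('http://XXXX').read()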