Preface
The scripts I have written share one thing in common: they are all web-related and always rely on a few methods for fetching links, and along the way I have accumulated a fair amount of crawling and scraping experience. I am summarizing it here so that future projects won't have to repeat the same work.
1. The most basic page fetching
import urllib2
content = urllib2.urlopen('http://XXXX').read()
2. Using a proxy server
This is useful in some situations, for example when your IP has been blocked, or when the number of requests per IP is limited, and so on.
import urllib2
proxy_support = urllib2.ProxyHandler({'http': 'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
3. Situations that require logging in
Logging in is more troublesome, so I'll break the problem apart:
3.1 Handling cookies
import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
Yes, and if you want to use a proxy and cookies at the same time, add proxy_support and change the opener to:
opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)
3.2 Handling forms
Logging in requires submitting a form. How do you fill it in? First, use a tool to capture the content of the form you need to submit.
For example, I usually use Firefox with the HttpFox plugin to see exactly what packets I sent.
Let me illustrate with VeryCD as an example: find the POST request you sent and its form data.
You can see that VeryCD requires username, password, continueURI, fk and login_submit. fk is randomly generated (actually not all that random; it looks like it is produced by a simple encoding of the epoch time) and has to be obtained from the web page, which means you must first visit a page and use a tool such as a regular expression to extract the fk field from the returned data. continueURI, as the name implies, can be anything; login_submit is fixed, as the page source shows. username and password are obvious.
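A minimal sketch of extracting fk with a regular expression; the pattern and the hidden-field markup here are my own assumptions, so the real page may need a different pattern:

import re
import urllib2

# hypothetical markup: assumes fk is rendered as a hidden input named "fk"
page = urllib2.urlopen('http://www.verycd.com/').read()
match = re.search(r'name="fk"\s+value="([^"]+)"', page)
fk = match.group(1) if match else ''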
Well, with the data to fill in settled, we generate the postdata:
import urllib
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': 'Login'
})
Then generate the HTTP request and send it:
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata
)
result = urllib2.urlopen(req).read()
3.3 Masquerading as a browser
Some websites resent visits from crawlers and reject every request they make. In that case we need to pretend to be a browser, which can be done by modifying the headers in the HTTP packet:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/http://www.verycd.com/',
    data = postdata,
    headers = headers
)
3.4 Anti-"anti-hotlinking"
Some sites have so-called anti-hotlinking settings. In fact it is very simple: they check the Referer in the headers of the request you send and verify that it points to their own site. So, just as in 3.3, we only need to set the Referer in the headers to that site. Take the well-known cnbeta as an example:
headers = {'Referer': 'http://www.cnbeta.com/articles'}
headers is a dict, so you can put in any header you like to do a bit of camouflage. For example, some clever sites just love to pry into people's privacy: when someone visits through a proxy, they read X-Forwarded-For from the headers to see that person's real IP. Fine, then just change X-Forwarded-For directly; set it to anything amusing to tease them, hehe.
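For example (the X-Forwarded-For value below is an arbitrary made-up address, purely to illustrate the idea):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8',   # arbitrary illustrative value
}
req = urllib2.Request(url = 'http://XXXX', data = postdata, headers = headers)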
3.5 Ultimate Trick
Sometimes even after doing 3.1-3.4 the access is still blocked. Then there is nothing for it but to honestly copy all the headers seen in HttpFox into the request, which usually does the trick. If even that fails, only the ultimate trick is left: use selenium to drive a real browser directly; whatever the browser can do, it can do. There are similar tools such as PAMIE, Watir, and so on.
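As a rough sketch of the selenium route (this assumes a current selenium WebDriver install plus a local Firefox driver, which is a newer API than the one from this article's era, so treat it purely as an illustration):

from selenium import webdriver

driver = webdriver.Firefox()            # a real browser, so headers, cookies and JS are all "real"
driver.get('http://www.verycd.com/')    # placeholder URL
html = driver.page_source               # the fully rendered page
driver.quit()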
4. Multithreaded concurrent fetching
A single thread is too slow, so you need multiple threads. Here is a simple thread-pool template. The program just prints out ten numbers, but you can see that they are handled concurrently.
from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# the concrete handler, responsible for a single task
def do_somthing_using(arguments):
    print arguments

# the worker: keeps taking data off the queue and processing it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# queue up the JOBS
for i in range(JOBS):
    q.put(i)

# wait for all JOBS to finish
q.join()
5. Handling CAPTCHAs
What if you run into a CAPTCHA? There are two situations to handle:
1. CAPTCHAs like Google's: forget it, there is nothing to be done.
2. Simple CAPTCHAs: a limited number of characters, only simple translation or rotation plus noise, no distortion. These can still be dealt with. The general idea is to rotate the image back, remove the noise, segment it into individual characters, then use a feature-extraction method (such as PCA) to reduce the dimensionality and build a feature library, and finally compare the CAPTCHA against that library. This is fairly involved and won't fit in a single blog post, so I won't go into it here; for the specifics, please study the relevant textbooks.
In fact some CAPTCHAs are still very weak. I won't name names, but using method 2 I have extracted CAPTCHAs with very high accuracy, so method 2 is indeed feasible.
6. Gzip/deflate support
Today's web pages generally support gzip compression, which can often cut transfer time dramatically. Take the VeryCD home page as an example: the uncompressed version is 247K, the compressed version 45K, about 1/5 of the original. That means the page is fetched roughly five times faster.
However, Python's urllib/urllib2 does not support compression by default. To get a compressed response you must write 'Accept-Encoding' into the request headers, then check the response headers for 'Content-Encoding' to decide whether it needs decoding, which is tedious and fiddly. So how do we get urllib2 to support gzip and deflate automatically?
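For comparison, here is a minimal sketch of the manual approach just described, assuming a gzip-encoded response (deflate handling left out):

import urllib2
from gzip import GzipFile
from StringIO import StringIO

req = urllib2.Request('http://www.verycd.com/')
req.add_header('Accept-Encoding', 'gzip, deflate')
resp = urllib2.urlopen(req)
data = resp.read()
if resp.headers.get('Content-Encoding') == 'gzip':   # check whether decoding is needed
    data = GzipFile(fileobj=StringIO(data)).read()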
In fact, you can inherit from the BaseHandler class and handle it through build_opener:
import urllib2
from gzip import GzipFile
from StringIO import StringIO

class ContentEncodingProcessor(urllib2.BaseHandler):
    """A handler to add gzip capabilities to urllib2 requests """

    # add headers to requests
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    # decode
    def http_response(self, req, resp):
        old_resp = resp
        # gzip
        if resp.headers.get("content-encoding") == "gzip":
            gz = GzipFile(fileobj=StringIO(resp.read()), mode="r")
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        # deflate
        if resp.headers.get("content-encoding") == "deflate":
            gz = StringIO(deflate(resp.read()))
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        return resp

# deflate support
import zlib
def deflate(data):      # zlib only provides the zlib compress format, not the deflate format;
    try:                # so on top of all there's this workaround:
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        return zlib.decompress(data)
And then it's simple:
encoding_support = ContentEncodingProcessor
opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)

# open pages directly with the opener; if the server supports gzip/deflate, the content is decompressed automatically
content = opener.open(url).read()
7. More Convenient multithreading
The summary above does mention a simple multithreading template, but once you apply that thing to a real program it only makes the program fragmented and ugly. I have also racked my brain over how to make multithreading more convenient. First of all: what would the most convenient way to make multithreaded calls look like?
1. Asynchronous I/O fetching with twisted
In fact, more efficient fetching does not necessarily require multithreading; you can also use asynchronous I/O: simply call twisted's getPage method and attach callback and errback methods to run when the asynchronous I/O finishes. For example, it can be done like this:
from twisted.web.client import getPage
from twisted.internet import reactor

links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]

def parse_page(data, url):
    print len(data), url

def fetch_error(error, url):
    print error.getErrorMessage(), url

# fetch the links in bulk: on success call parse_page, on failure call fetch_error
for url in links:
    getPage(url, timeout=5) \
        .addCallback(parse_page, url) \
        .addErrback(fetch_error, url)

reactor.callLater(5, reactor.stop)   # tell the reactor to stop the program after 5 seconds
reactor.run()
Twisted people, true to the name: the code is written in too twisted a way for normal people to accept. Although this simple example looks nice, every time I write a twisted program my whole self ends up twisted; it's exhausting, the documentation is as good as nonexistent, and you have to read the source to figure out what is going on. Let's not even go there.
If you want gzip/deflate support, or even login and other extensions, you have to write a new HTTPClientFactory class for twisted, and so on. At that point my brow really did furrow, so I gave up. If you have the perseverance, give it a try yourself.
2. Design a simple multithreaded fetching class
I still feel more comfortable tinkering with Python's "native" pieces such as urllib. Imagine: if there were a Fetcher class, you could call it like this:
f = Fetcher(threads=10)    # set the number of download threads to 10
for url in urls:
    f.push(url)            # push all the urls into the download queue
while f.taskleft():        # while there are still unfinished downloads
    content = f.pop()      # take a result from the completed queue
    do_with(content)       # process the content
Such a multithreaded call is simple and clear, so let's design it that way. First there are two queues, both handled with Queue; the basic multithreaded architecture is similar to the "techniques summary" part above. The push and pop methods are easy, since they just use the Queue methods directly. taskleft is true if there are any "running tasks" or "tasks in the queue". The code is as follows:
import urllib2
from threading import Thread, Lock
from Queue import Queue
import time

class Fetcher:
    def __init__(self, threads):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.lock = Lock()        # thread lock
        self.q_req = Queue()      # task queue
        self.q_ans = Queue()      # completion queue
        self.threads = threads
        for i in range(threads):
            t = Thread(target=self.threadget)
            t.setDaemon(True)
            t.start()
        self.running = 0

    def __del__(self):            # on destruction, wait for the two queues to finish
        time.sleep(0.5)
        self.q_req.join()
        self.q_ans.join()

    def taskleft(self):
        return self.q_req.qsize() + self.q_ans.qsize() + self.running

    def push(self, req):
        self.q_req.put(req)

    def pop(self):
        return self.q_ans.get()

    def threadget(self):
        while True:
            req = self.q_req.get()
            with self.lock:       # the operation must be atomic: enter the critical section
                self.running += 1
            try:
                ans = self.opener.open(req).read()
            except Exception, what:
                ans = ''
                print what
            self.q_ans.put((req, ans))
            with self.lock:
                self.running -= 1
            self.q_req.task_done()
            time.sleep(0.1)       # don't spam

if __name__ == "__main__":
    links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]
    f = Fetcher(threads=10)
    for url in links:
        f.push(url)
    while f.taskleft():
        url, content = f.pop()
        print url, len(content)
8. A few odds and ends of experience
1. Connection pool:
opener.open, like urllib2.urlopen, creates a new HTTP request each time. Usually this is not a problem, because in a single-threaded setting perhaps one request is generated per second; in a multithreaded setting, however, there can be dozens or hundreds of requests every second, so after just a few minutes any normal, sensible server is bound to ban you.
In ordinary HTML browsing, though, keeping dozens of connections to the server open at the same time is perfectly normal, so you can manually maintain a pool of HTTPConnections and pick a connection from the pool for every fetch.
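A minimal sketch of such a pool, assuming Python 2's httplib and a Queue to recycle connections; error handling and dead-connection detection are omitted:

import httplib
from Queue import Queue

class ConnectionPool:
    """Keep a fixed set of HTTPConnections and reuse them across fetches."""
    def __init__(self, host, size=10):
        self.pool = Queue()
        for i in range(size):
            self.pool.put(httplib.HTTPConnection(host))

    def fetch(self, path):
        conn = self.pool.get()                # borrow a connection
        try:
            conn.request('GET', path)
            return conn.getresponse().read()  # read fully so the connection can be reused
        finally:
            self.pool.put(conn)               # give it back

# usage sketch:
# pool = ConnectionPool('www.verycd.com')
# html = pool.fetch('/')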
There is also a lazier trick: use squid as a proxy server for the crawl. Squid will then maintain the connection pool for you automatically and comes with a data cache as well; and since squid is something I install on every server anyway, why bother writing a connection pool yourself?
2. Set the stack size of the thread
The stack-size setting significantly affects Python's memory footprint. If this value is not set for Python's threads, the program occupies a large amount of memory, which is deadly for an OpenVZ VPS. stack_size must be greater than 32768; in practice it should always be at least 32768*2:
from threading import stack_size
stack_size(32768*16)
3. Set up automatic retries after failure
def get(self, req, retries=3):
    try:
        response = self.opener.open(req)
        data = response.read()
    except Exception, what:
        print what, req
        if retries > 0:
            return self.get(req, retries-1)
        else:
            print 'GET Failed', req
            return ''
    return data
4. Set timeout
import socket
socket.setdefaulttimeout(10)   # connections time out after 10 seconds
5. Login
Logging in is simplified even further. First add cookie support to build_opener; then, to log in to VeryCD for example, add an empty login method to Fetcher and call it in __init__(), then inherit from the Fetcher class and override the login method:
def login(self, username, password):
    import urllib
    data = urllib.urlencode({
        'username': username,
        'password': password,
        'continue': 'http://www.verycd.com/',
        'login_submit': u'login'.encode('utf-8'),
        'save_cookie': 1,
    })
    url = 'http://www.verycd.com/signin'
    self.opener.open(url, data).read()
Then, whenever a Fetcher is initialized, it automatically logs in to the VeryCD site.
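To make the wiring explicit, here is a sketch of how the empty hook and the override could fit together; the class name VeryCDFetcher and the trimmed __init__ are my own illustration, not the article's code:

import urllib, urllib2, cookielib

class Fetcher:
    # only the login-related pieces are shown; queues and worker threads as in section 7
    def __init__(self, threads, username='', password=''):
        cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
        self.opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
        # ... queue and thread setup omitted ...
        self.login(username, password)     # empty hook, called during initialization

    def login(self, username, password):
        pass                               # the base class does nothing

class VeryCDFetcher(Fetcher):
    def login(self, username, password):   # override with the method shown above
        data = urllib.urlencode({'username': username, 'password': password,
                                 'continue': 'http://www.verycd.com/',
                                 'login_submit': 'login', 'save_cookie': 1})
        self.opener.open('http://www.verycd.com/signin', data).read()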
9. Summary
So, that is the whole of this summary of practical Python crawling techniques. The code in this article is simple and easy to use, and its performance is good too; I believe it will be a great help to those of you working with Python.