Practical skills for crawling websites using Python

Many people use Python, and the crawler-style scripts most commonly written are things like a script that fetches proxies and verifies them locally, and a script that automatically retrieves mail; I have also written a simple verification-code recognition script. So today let us summarize some practical techniques for crawling websites with Python.

Preface

The scripts I have written share a common trait: they are all web-related, they always use a few methods for fetching pages, and quite a bit of crawling experience has accumulated along the way. I will summarize some of it here so the work does not have to be repeated in the future.

1. the most basic page fetch

import urllib2
content = urllib2.urlopen('http://XXXX').read()

2. use a proxy server

This is useful in some situations, for example when your IP address has been blocked, or when the number of requests allowed from a single IP address is limited.

import urllib2
proxy_support = urllib2.ProxyHandler({'http':'http://XX.XX.XX.XX:XXXX'})
opener = urllib2.build_opener(proxy_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

3. sites that require login

Login is more troublesome, so let me split the problem up:

3.1 cookie processing

import urllib2, cookielib
cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

If you want to use both a proxy and cookies, just add proxy_support and change the opener to:

opener = urllib2.build_opener(proxy_support, cookie_support, urllib2.HTTPHandler)

3.2 form processing

Do you need to fill in a form to log in? First, use a tool to capture the content of the form that has to be submitted.

For example, I usually use Firefox with the HttpFox plug-in to see which packets were actually sent.

Take verycd as an example. First, find the POST request you send and the POST form items:

For verycd, you need to fill in username, password, continueURI, fk and login_submit. Among these, fk is randomly generated (actually not very random; it looks like it is produced by simply encoding the epoch time) and has to be obtained from the web page, which means you must visit the page first and use a regular expression or a similar tool to extract the fk item from the returned data. As the name suggests, continueURI can be anything, while login_submit is fixed, which can be seen from the page source. Then there are username and password, which speak for themselves.
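For example, a minimal sketch of extracting fk from the login page with a regular expression might look like the following; the signin URL and the exact pattern are assumptions of mine and need to be checked against the real page source with HttpFox or "view source".

import re
import urllib2

# fetch the login page first (URL assumed for illustration)
signin_page = urllib2.urlopen('http://secure.verycd.com/signin/').read()

# the markup around fk is hypothetical -- inspect the real page and adjust the pattern
match = re.search(r"name=['\"]fk['\"][^>]*value=['\"]([^'\"]+)['\"]", signin_page)
if match:
    fk = match.group(1)
else:
    fk = ''
    print 'fk not found, check the page source'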

With the data to be filled in ready, we can generate the postdata:

import urllib
postdata = urllib.urlencode({
    'username': 'XXXXX',
    'password': 'XXXXX',
    'continueURI': 'http://www.verycd.com/',
    'fk': fk,
    'login_submit': 'login'
})

Then build the HTTP request and send it:

req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/#',
    data = postdata
)
result = urllib2.urlopen(req).read()

3.3 disguised as browser access

Some websites dislike being visited by crawlers and reject all such requests. In that case we need to pretend to be a browser, which can be done by modifying the headers of the HTTP request:

headers = {
    'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
req = urllib2.Request(
    url = 'http://secure.verycd.com/signin/*/#',
    data = postdata,
    headers = headers
)

3.4 dealing with anti-leeching (hotlink protection)

Some sites have so-called anti-leeching settings. In fact, it is very simple: they check whether the Referer in your request header points to their own site. So, just as in 3.3, we only need to set the Referer in headers to that website. Take cnbeta, known for its "black screen", as an example:

headers = { 'Referer':'http://www.cnbeta.com/articles'}

headers is a dict, so you can put in any header you want for the sake of disguise. For example, some clever websites always like to pry into people's privacy: if someone visits through a proxy, they insist on reading X-Forwarded-For from the headers to find the visitor's real IP address. In that case we can simply set X-Forwarded-For ourselves, and we might as well set it to something amusing to tease them.
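For example, a minimal sketch (the header values here are arbitrary examples of mine) would be:

import urllib2

headers = {
    'Referer': 'http://www.cnbeta.com/articles',
    'X-Forwarded-For': '8.8.8.8',   # spoofed "real" client IP; pick anything you like
}
req = urllib2.Request(url='http://www.cnbeta.com/', headers=headers)
content = urllib2.urlopen(req).read()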

3.5 the ultimate trick

Sometimes, even after doing 3.1-3.4, access is still rejected. Then there is nothing for it but to honestly write out all the headers you saw in HttpFox, and that usually does it. If even that fails, you can only resort to the ultimate trick: use selenium to drive a real browser to visit the site; as long as a browser can do it, this can do it too. There are also similar tools such as pamie and watir.
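For reference, a minimal selenium sketch along these lines might look like the following (assuming selenium and a Firefox driver are installed; the URL is only a placeholder):

from selenium import webdriver

driver = webdriver.Firefox()            # launch a real browser
driver.get('http://www.example.com/')   # placeholder URL
html = driver.page_source               # the fully rendered page, as the browser sees it
driver.quit()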

4. multi-thread concurrent capturing

If a single thread is too slow, you need multiple threads. Here is a simple thread-pool template. The program simply prints 1-10, but you can see that it runs concurrently.

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is how many jobs there are
q = Queue()
NUM = 2
JOBS = 10

# the worker function, responsible for handling a single task
def do_somthing_using(arguments):
    print arguments

# this is the worker thread: keep getting data from the queue and processing it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# fork NUM threads waiting on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# put JOBS jobs into the queue
for i in range(JOBS):
    q.put(i)

# wait for all jobs to complete
q.join()

5. process the verification code

What if you run into a verification code? There are two cases:

1. verification codes like Google's: there is basically nothing you can do about them.

2. simple verification codes: a limited number of characters, only simple translation or rotation plus some noise, without distortion. This kind can still be handled. The general idea is to rotate the characters back, remove the noise, segment the individual characters, then use a feature-extraction method (such as PCA) to reduce dimensionality and build a feature library, and finally compare the verification code against the feature library. This is rather involved and cannot be covered in one blog post, so I will not expand on it here; please consult the relevant textbooks for details.

In fact, some verification codes are still very weak. I will not name names here, but I have extracted characters with very high accuracy using approach 2, so approach 2 is indeed feasible.
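As a rough illustration of that pipeline, here is a minimal sketch assuming PIL and numpy are available and that the characters sit in equal-width slots; the threshold value, the slot-based segmentation, and the nearest-neighbour match against a prebuilt feature library are simplifications of my own, not a ready-made recognizer.

import numpy as np
from PIL import Image

def preprocess(path, threshold=140):
    # grayscale, then binarize to strip light background noise
    img = Image.open(path).convert('L')
    return img.point(lambda p: 255 if p > threshold else 0)

def split_chars(img, n_chars=4):
    # naive segmentation: assume the characters occupy equal-width slots
    w, h = img.size
    step = w // n_chars
    return [img.crop((i * step, 0, (i + 1) * step, h)) for i in range(n_chars)]

def to_feature(char_img, size=(16, 16)):
    # shrink to a fixed size and flatten; a real system would apply PCA here
    small = char_img.resize(size)
    return np.asarray(small, dtype=float).flatten()

def recognize(char_img, library):
    # library: list of (label, feature_vector) pairs built from known samples
    feat = to_feature(char_img)
    label, _ = min(library, key=lambda item: np.linalg.norm(item[1] - feat))
    return label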

6. gzip/deflate support

Web pages these days generally support gzip compression, which can greatly reduce transmission time. Taking the VeryCD home page as an example, the uncompressed page is 247 KB and compresses down to 45 KB, roughly 1/5 of the original. This means crawling can be about five times faster.

However, Python's urllib/urllib2 does not support compression by default. To get a compressed response you have to specify 'Accept-Encoding' in the request header, and after reading the response you have to check the 'Content-Encoding' header to decide whether decoding is needed. This is tedious and trivial. How can we make urllib2 support gzip and deflate automatically?

In fact, you can inherit from the BaseHandler class and then pass it to the build_opener method:

import urllib2
from gzip import GzipFile
from StringIO import StringIO

class ContentEncodingProcessor(urllib2.BaseHandler):
    """A handler to add gzip capabilities to urllib2 requests"""

    # add headers to requests
    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    # decode
    def http_response(self, req, resp):
        old_resp = resp
        # gzip
        if resp.headers.get("content-encoding") == "gzip":
            gz = GzipFile(
                fileobj=StringIO(resp.read()),
                mode="r"
            )
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        # deflate
        if resp.headers.get("content-encoding") == "deflate":
            gz = StringIO(deflate(resp.read()))
            resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
            resp.msg = old_resp.msg
        return resp

# deflate support
import zlib
def deflate(data):   # zlib only provides the zlib compress format, not the deflate format;
    try:             # so on top of all there's this workaround:
        return zlib.decompress(data, -zlib.MAX_WBITS)
    except zlib.error:
        return zlib.decompress(data)

Then it's easy,

encoding_support = ContentEncodingProcessor
opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)
# open the page directly with this opener; if the server supports gzip/deflate it is decompressed automatically
content = opener.open(url).read()

7. more convenient multithreading

The simple multithreaded template in section 4 is fine as a template, but dropping it straight into a real program only makes the program fragmented and hard to look at. I also put some thought into how to make multithreading more convenient. First, think about what the most convenient way to make multithreaded calls would be:

1. asynchronous I/O capturing with twisted

In fact, more efficient fetching does not necessarily require multithreading; you can also use asynchronous I/O: call twisted's getPage method directly and attach callback and errback methods to run when the asynchronous I/O completes. For example, you can do this:

from twisted.web.client import getPage
from twisted.internet import reactor

links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]

def parse_page(data, url):      # called when a fetch succeeds
    print len(data), url

def fetch_error(error, url):    # called when a fetch fails
    print error.getErrorMessage(), url

# fetch the links in batch
for url in links:
    getPage(url, timeout=5).addCallback(parse_page, url).addErrback(fetch_error, url)

reactor.callLater(5, reactor.stop)  # tell the reactor to stop after 5 seconds
reactor.run()

Twisted, true to its name, produces code that is really too twisted for a normal person to accept. Although this simple example looks fine, every time I write a twisted program I end up contorted and exhausted; the documentation might as well not exist, and you have to read the source to figure out how to get things done.

If you want to support gzip/deflate, or even some login extensions, you have to write a new HTTPClientFactory class for twisted, and so on. That made me frown hard, so I gave up. If you have the perseverance, please try it yourself.

2. design a simple multi-thread crawling class

I still feel more comfortable with Python's "native" tools such as urllib. Think about it: if there were a Fetcher class, you could call it like this:

f = Fetcher(threads=10)      # set the number of download threads to 10
for url in urls:
    f.push(url)              # push all URLs into the download queue
while f.taskleft():          # as long as there are unfinished tasks
    content = f.pop()        # take a result from the completed queue
    do_with(content)         # process the content

Such a multithreaded call is simple and clear, so let's design it that way. First, we need two queues, both handled with Queue. The basic multithreading architecture is similar to the thread-pool template in section 4: the push and pop methods are easy, both just wrapping the corresponding Queue methods, and taskleft is true if there is any "task in progress" or "task in the queue". The code is as follows:

import urllib2
from threading import Thread, Lock
from Queue import Queue
import time

class Fetcher:
    def __init__(self, threads):
        self.opener = urllib2.build_opener(urllib2.HTTPHandler)
        self.lock = Lock()     # thread lock
        self.q_req = Queue()   # task queue
        self.q_ans = Queue()   # completed queue
        self.threads = threads
        for i in range(threads):
            t = Thread(target=self.threadget)
            t.setDaemon(True)
            t.start()
        self.running = 0

    def __del__(self):  # wait for the two queues to complete
        time.sleep(0.5)
        self.q_req.join()
        self.q_ans.join()

    def taskleft(self):
        return self.q_req.qsize() + self.q_ans.qsize() + self.running

    def push(self, req):
        self.q_req.put(req)

    def pop(self):
        return self.q_ans.get()

    def threadget(self):
        while True:
            req = self.q_req.get()
            with self.lock:  # to ensure atomicity of the operation, enter the critical area
                self.running += 1
            try:
                ans = self.opener.open(req).read()
            except Exception, what:
                ans = ''
                print what
            self.q_ans.put((req, ans))
            with self.lock:
                self.running -= 1
            self.q_req.task_done()
            time.sleep(0.1)  # don't spam

if __name__ == "__main__":
    links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]
    f = Fetcher(threads=10)
    for url in links:
        f.push(url)
    while f.taskleft():
        url, content = f.pop()
        print url, len(content)

8. some trivial experiences

1. connection pool:

opener.open, like urllib2.urlopen, creates a new HTTP request each time. Usually this is not a problem, because in a single-threaded environment perhaps one request is generated per second. In a multithreaded environment, however, hundreds of requests can be generated every second, and after just a few minutes any normal, sensible server will block you.

However, for normal HTML browsing it is also normal to keep dozens of connections to the server open at the same time, so you can manually maintain a pool of HttpConnection objects and take a connection from the pool for each fetch.
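A minimal sketch of such a pool, built on httplib and a thread-safe Queue (the ConnectionPool class and its fetch method are names I made up for illustration), might look like this:

import httplib
from Queue import Queue

class ConnectionPool:
    """A tiny pool of keep-alive connections to a single host (illustrative only)."""

    def __init__(self, host, size=10):
        self.pool = Queue()
        for _ in range(size):
            self.pool.put(httplib.HTTPConnection(host))

    def fetch(self, path):
        conn = self.pool.get()          # borrow a connection (blocks if none is free)
        try:
            conn.request('GET', path)
            data = conn.getresponse().read()
        except Exception:
            # on error, discard the broken connection and replace it
            conn.close()
            conn = httplib.HTTPConnection(conn.host)
            data = ''
        self.pool.put(conn)             # return the connection to the pool
        return data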

An even cleverer method is to use squid as a proxy server for crawling: squid automatically maintains the connection pool for you and also provides data caching, and since squid is something I install on every server anyway, there is no need to go to the trouble of writing a connection pool.

2. set the stack size of the thread.

The stack size setting significantly affects Python's memory usage. If this value is not set, a Python program with many threads will use a large amount of memory, which is critical for memory-limited VPSes such as OpenVZ. stack_size must be greater than 32768; in practice it should always be at least 32768*2.

from threading import stack_size
stack_size(32768*16)

3. automatic retry on failure

def get(self, req, retries=3):
    try:
        response = self.opener.open(req)
        data = response.read()
    except Exception, what:
        print what, req
        if retries > 0:
            return self.get(req, retries-1)
        else:
            print 'GET Failed', req
            return ''
    return data

4. set timeout

import socket
socket.setdefaulttimeout(10)  # give up on a connection after 10 seconds

5. login

Logging in becomes even simpler. First, cookie support has to be added in build_opener. To log in to VeryCD, add an empty login method to Fetcher and call it in __init__, then inherit from the Fetcher class and override the login method:

def login(self, username, password):
    import urllib
    data = urllib.urlencode({
        'username': username,
        'password': password,
        'continue': 'http://www.verycd.com/',
        'login_submit': u'登录'.encode('utf-8'),
        'save_cookie': 1,
    })
    url = 'http://www.verycd.com/signin'
    self.opener.open(url, data).read()

This way, logging in to VeryCD happens automatically when the Fetcher is initialized.
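To make the idea concrete, here is a minimal sketch of that arrangement; the VeryCDFetcher name is hypothetical, and the Fetcher shown here only contains the parts relevant to login (cookie support in build_opener and the empty login hook), with the thread and queue setup from section 7 omitted:

import urllib2, cookielib

class Fetcher:
    def __init__(self, threads, username='', password=''):
        # build_opener now gains cookie support so the login session persists
        cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
        self.opener = urllib2.build_opener(cookie_support, urllib2.HTTPHandler)
        self.login(username, password)   # empty hook, overridden in subclasses
        # ... thread and queue setup as in section 7 ...

    def login(self, username, password):
        pass                             # the base class does nothing

class VeryCDFetcher(Fetcher):            # hypothetical subclass name
    def login(self, username, password):
        import urllib
        data = urllib.urlencode({'username': username, 'password': password})
        # ... add the remaining form fields shown above (continue, login_submit, save_cookie) ...
        self.opener.open('http://www.verycd.com/signin', data).read()

Constructing f = VeryCDFetcher(threads=10, username='xxx', password='xxx') would then perform the login before any URLs are pushed.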

9. Summary

That's it: the above is a summary of practical techniques for crawling websites with Python. The content of this article is simple, easy to use, and performs well. I believe it will be of great help when you work with Python.
