Python crawler: A summary of some common crawler techniques

Source: Internet
Author: User
Tags: http, cookie

I have been working with Python for a bit more than a year. The scenarios where Python is applied most are rapid web development, crawlers, and automated operations: writing a simple website, writing automatic posting scripts, writing scripts to send and receive email, writing simple CAPTCHA recognition scripts.

Crawler development also involves many reusable steps. They are summarized here, so that some effort can be saved in the future.

1. Basic page fetching: the GET method
import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

The POST method:
import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
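If a GET request needs query parameters, urllib.urlencode can also build the query string. A minimal sketch, with a made-up endpoint and parameters:

import urllib
import urllib2

# Hypothetical endpoint and parameters, for illustration only
base_url = "http://abcde.com/search"
params = {'q': 'python', 'page': '1'}
url = base_url + "?" + urllib.urlencode(params)
response = urllib2.urlopen(url)
print response.read()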

2. Using a proxy IP

During crawler development you will often find your IP blocked, in which case a proxy IP is needed.

The urllib2 package has a ProxyHandler class that lets you set a proxy for accessing web pages, as in the following code snippet:

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
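If you prefer not to install the proxy globally with install_opener, the opener object can also be used directly for individual requests; a small variation on the snippet above:

import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
# Use the opener for this request only, without installing it globally
response = opener.open('http://www.baidu.com')
print response.read()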

3. Cookie Processing

Cookies are data (usually encrypted) that a site stores on the user's local machine in order to identify the user and track sessions. Python provides the cookielib module for handling cookies; its main role is to provide objects that store cookies, so that they can be used together with the urllib2 module to access Internet resources.

Code snippet:

import urllib2, cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The whole cookie store is kept in memory, and the cookies are lost once the CookieJar instance is garbage collected; none of this needs to be handled manually.
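If the cookies need to survive between runs instead of living only in memory, cookielib also offers file-backed cookie jars. A minimal sketch using MozillaCookieJar (the file name cookies.txt is arbitrary, the URL is the same placeholder as above):

import urllib2, cookielib

# Save cookies to a file so they outlive the process (file name is arbitrary)
cookie_jar = cookielib.MozillaCookieJar('cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
content = opener.open('http://XXXX').read()
cookie_jar.save(ignore_discard=True, ignore_expires=True)

# On a later run, reload them before making requests
cookie_jar.load(ignore_discard=True, ignore_expires=True)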

Adding cookies manually

cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)
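The snippet above assumes a request object already exists; put together, a minimal sketch might look like this (the URL is a placeholder):

import urllib2

cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request = urllib2.Request('http://XXXX')
# Attach the cookie header by hand instead of using a CookieJar
request.add_header("Cookie", cookie)
print urllib2.urlopen(request).read()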

4. Disguising as a browser

Some websites dislike crawler visits and refuse all requests from crawlers, so accessing such a site directly with urllib2 will often result in HTTP Error 403: Forbidden.

Special attention should be paid to certain headers, since the server side checks them:

1. User-Agent: some servers or proxies check this value to decide whether the request was initiated by a browser.

2. Content-Type: when a REST interface is used, the server checks this value to determine how the content in the HTTP body should be parsed.

This can be handled by modifying the headers in the HTTP request, as in the following snippet:

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url='http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers=headers
)
print urllib2.urlopen(request).read()
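For the Content-Type check mentioned above, here is a sketch of a REST-style POST that declares a JSON body; the endpoint and payload are made up for illustration:

import json
import urllib2

# Hypothetical REST endpoint and payload, for illustration only
data = json.dumps({'name': 'abc', 'password': '1234'})
request = urllib2.Request(
    url='http://abcde.com/api/login',
    data=data,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
        'Content-Type': 'application/json',
    }
)
# Passing data makes urllib2 issue a POST instead of a GET
print urllib2.urlopen(request).read()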

5. Page parsing

For page parsing, the most powerful tool is of course the regular expression. Regular expressions differ for different sites and different users, so they need little explanation here; two good URLs are:

Getting Started with regular expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html

Online Regular expression test: http://tool.oschina.net/regex/
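As a small illustration of the idea (not tied to any particular site), extracting the links from a fetched page with a regular expression might look like this:

import re
import urllib2

html = urllib2.urlopen('http://www.baidu.com').read()
# A rough pattern that grabs the href value of double-quoted <a> links;
# good enough for a quick look, not a full HTML parser
links = re.findall(r'<a[^>]+href="([^"]*)"', html)
for link in links:
    print link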

The second option is a parsing library. Two are commonly used, lxml and BeautifulSoup; here are two fairly good sites introducing their use:

lxml: http://my.oschina.net/jhao104/blog/639448

BeautifulSoup: http://cuiqingcai.com/1319.html

My evaluation of the two libraries: both are HTML/XML processing libraries. BeautifulSoup is a pure Python implementation; it is less efficient but full-featured, for example you can obtain the source code of an HTML node from a search result. lxml is implemented in C, is efficient, and supports XPath.
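As a rough sketch of the difference in flavor, assuming the bs4 and lxml packages are installed and using a toy HTML snippet:

from bs4 import BeautifulSoup
from lxml import etree

html = '<html><body><div class="title"><a href="/post/1">Hello</a></div></body></html>'

# BeautifulSoup: pure Python, convenient search/attribute API
soup = BeautifulSoup(html, 'html.parser')
print soup.find('a')['href']     # /post/1
print soup.find('a').get_text()  # Hello

# lxml: C implementation, XPath support
tree = etree.HTML(html)
print tree.xpath('//div[@class="title"]/a/@href')[0]   # /post/1
print tree.xpath('//div[@class="title"]/a/text()')[0]  # Hello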

6. CAPTCHA processing

For some simple CAPTCHAs, simple recognition is possible. I have only done some simple CAPTCHA recognition myself. Some CAPTCHAs, however, are downright anti-human, such as 12306's; those can be handled through a manual captcha-solving platform, which of course costs money.

7. gzip compression

Have you ever encountered web pages that remain garbled no matter how you transcode them? Haha, that means you did not know that many web services can send compressed data, which can reduce the amount of data transmitted over the network by more than 60%. This is especially true for XML web services, because XML data can achieve a very high compression ratio.

However, a server will generally not send you compressed data unless you tell it that you can handle compressed data.

So you need to modify the code like this:

import urllib2, httplib

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)

This is the key: create a Request object and add an Accept-Encoding header to tell the server that you can accept gzip-compressed data.

Then decompress the data:

import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
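In practice it can help to check the response's Content-Encoding header before decompressing, since the server is free to ignore the Accept-Encoding hint; a small sketch along those lines (the URL is the same placeholder as above):

import StringIO
import gzip
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
f = urllib2.urlopen(request)

data = f.read()
# Only gunzip if the server actually compressed the response
if f.info().get('Content-Encoding') == 'gzip':
    data = gzip.GzipFile(fileobj=StringIO.StringIO(data)).read()
print data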

8. Multi-threaded concurrent crawling

A single thread is too slow, so multithreading is needed. Here is a simple thread pool template. The program simply prints ten numbers, but you can see that they are handled concurrently.

Although Python's multithreading is rather weak, for network-heavy tasks such as crawling it can still improve efficiency to some extent.

from threading import Thread
from Queue import Queue
from time import sleep

# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10

# The actual handler, responsible for processing a single task
def do_somthing_using(arguments):
    print arguments

# The worker, which keeps pulling tasks from the queue and processing them
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()

# Fork NUM threads to wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()

# Put the JOBS into the queue
for i in range(JOBS):
    q.put(i)

# Wait for all JOBS to finish
q.join()
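The same template can be pointed at real crawling work by swapping the handler; a sketch with made-up URLs and a hypothetical fetch function:

from threading import Thread
from Queue import Queue
import urllib2

q = Queue()
NUM = 4                                               # number of worker threads
URLS = ['http://www.baidu.com', 'http://abcde.com']   # made-up list of pages to fetch

# Hypothetical handler: download one page and report its size
def fetch(url):
    try:
        print url, len(urllib2.urlopen(url, timeout=10).read())
    except Exception as e:
        print url, 'failed:', e

def worker():
    while True:
        url = q.get()
        fetch(url)
        q.task_done()

for i in range(NUM):
    t = Thread(target=worker)
    t.setDaemon(True)
    t.start()

for url in URLS:
    q.put(url)

q.join()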

Please indicate the source when reprinting: Open Source China http://my.oschina.net/jhao104/blog/647308
