Python Crawler Basics and Tricks

Based on Python 2.7
1 Basic page crawling
GET method
import urllib2

url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()
POST method
import urllib
import urllib2

url = "http://abcde.com"
form = {'name': 'abc', 'password': '1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url, form_data)
response = urllib2.urlopen(request)
print response.read()
2 Using proxy IPs
During crawler development you will often run into your IP being blocked; the remedy is to use proxy IPs.
The urllib2 package provides the ProxyHandler class, which lets you route page access through a proxy, as follows:
import urllib2

proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://XXXX')
print response.read()
3 Cookie processing
Cookies are data (usually encrypted) stored on the user's local machine so that websites can identify the user and track sessions. Python provides the cookielib module for handling cookies; its main job is to supply objects that store cookies, so they can be used together with the urllib2 module to access Internet resources.
import urllib2
import cookielib

cookie_support = urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()
The key is CookieJar(), which manages HTTP cookie values: it stores the cookies generated in the course of HTTP requests and adds them to outgoing HTTP requests.
All cookies are kept in memory and are lost once the CookieJar instance is garbage collected; none of this has to be managed by hand.
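If you want cookies to survive between runs, cookielib also provides file-backed cookie jars. Below is a minimal sketch using MozillaCookieJar; the filename cookies.txt and the URL are only examples.

import urllib2
import cookielib

# A file-backed cookie jar, so cookies survive garbage collection and restarts
cookie_jar = cookielib.MozillaCookieJar('cookies.txt')
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))
content = opener.open('http://XXXX').read()

# Persist the cookies, including session cookies that would otherwise be discarded
cookie_jar.save(ignore_discard=True, ignore_expires=True)

# Later, reload them before making new requests
cookie_jar.load('cookies.txt', ignore_discard=True, ignore_expires=True)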
4 Masquerading as a browser
Some websites dislike crawler visits and simply refuse their requests, so fetching such a site directly with urllib2 often ends in HTTP Error 403: Forbidden.
Pay particular attention to some of the headers, because the server side checks them:
1. User-Agent: some servers or proxies check this value to decide whether the request was initiated by a browser.
2. Content-Type: when calling a REST interface, the server checks this value to determine how the content in the HTTP body should be parsed.
This can be handled by modifying the headers in the HTTP request, as follows:
import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
    url='http://my.oschina.net/jhao104/blog?catalog=3463517',
    headers=headers
)
print urllib2.urlopen(request).read()
5 Page parsing
For page parsing, regular expressions are of course the most powerful tool; they vary from site to site and user to user, so here are two good references:
Getting Started with regular expressions: http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html
Online Regular expression test: http://tool.oschina.net/regex/
Next come the parsing libraries; the two in common use are lxml and BeautifulSoup. Two good sites introducing how to use them:
lxml: http://my.oschina.net/jhao104/blog/639448
BeautifulSoup: http://cuiqingcai.com/1319.html
Both are HTML/XML processing libraries:
BeautifulSoup is a pure Python implementation; it is less efficient but fully featured, for example you can retrieve the source code of an HTML node by searching the parse result.
lxml is implemented in C; it is efficient and supports XPath.
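As an illustration, here is a minimal sketch of both approaches extracting the links from a page; the URL is a placeholder, and BeautifulSoup 4 (bs4) and lxml are assumed to be installed.

import urllib2
from bs4 import BeautifulSoup
from lxml import etree

html = urllib2.urlopen('http://XXXX').read()

# BeautifulSoup: find all links in the page
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    print a.get('href')

# lxml: the same thing with an XPath expression
tree = etree.HTML(html)
for href in tree.xpath('//a/@href'):
    print href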
6 Verification code processing
Some simple verification codes (CAPTCHAs) can be recognized automatically; I have only ever done a few simple recognitions myself.
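As a rough sketch of what simple recognition can look like, assuming the third-party Pillow and pytesseract packages (plus the Tesseract OCR engine) are installed, and that captcha.jpg is a hypothetical downloaded image:

from PIL import Image
import pytesseract

# Load the downloaded CAPTCHA image and convert it to greyscale,
# which often helps the OCR engine on simple, low-noise codes
image = Image.open('captcha.jpg').convert('L')

# Let Tesseract try to read the characters
code = pytesseract.image_to_string(image)
print code.strip()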
7 Gzip compression
Have you ever run into web pages that stay garbled no matter how you transcode them? That usually means the server is sending compressed data. Many web services can do this, which can cut the amount of data transferred over the wire by more than 60%. This is especially true for XML web services, since XML data compresses very well.
However, a server generally will not send you compressed data unless you tell it that you can handle compressed data.
So you need to modify the code like this:
import urllib2

request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)
This is the key:
Create a Request object and add an Accept-encoding header to tell the server you can accept gzip-compressed data.
Then decompress the data:
import StringIO
import gzip

compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
8 Multi-threaded concurrent fetching
If a single thread is too slow, use multiple threads.
Although Python's multithreading is rather feeble, for a network-bound workload like crawling it can still improve efficiency to some extent.
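A minimal sketch of a multi-threaded fetcher using the standard threading and Queue modules; the URLs and the pool size of 5 are placeholders.

import threading
import Queue
import urllib2

task_queue = Queue.Queue()

def worker():
    while True:
        url = task_queue.get()
        try:
            # Fetch the page and report its size
            print url, len(urllib2.urlopen(url).read())
        except Exception, e:
            print url, 'failed:', e
        finally:
            task_queue.task_done()

# Start a small pool of daemon worker threads
for _ in range(5):
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

# Feed the queue and wait until every page has been fetched
for url in ['http://XXXX/page/%d' % i for i in range(20)]:
    task_queue.put(url)
task_queue.join()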
For multi-threaded crawling, a more practical option is Scrapy. Scrapy is built on Twisted and fetches pages asynchronously, which can greatly speed up crawling. However, it also increases the likelihood of being refused by the server, which then has to be handled separately.
9 Defeating anti-hotlinking
Some sites have so-called anti-hotlinking protection, which is actually very simple: the server checks the Referer in the headers of the request you send to see whether it points to the site itself. So all we need to do is set the Referer in the headers to that site:
headers = {'Referer': 'http://www.baidu.com'}
headers is a dict, so you can put in any header you like to do a bit of camouflage.
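For example, combining a Referer with the User-Agent from section 4 into one request; the URL here is a placeholder.

import urllib2

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
    'Referer': 'http://www.baidu.com'
}
request = urllib2.Request('http://XXXX', headers=headers)
print urllib2.urlopen(request).read()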
10 The ultimate trick
Sometimes access is still refused. Then there is nothing for it but to honestly copy every header you see in HttpFox and send them all; that usually does the job. If even that fails, there is only the ultimate trick: use Selenium to drive a real browser directly, and whatever the browser can do, this can do. Similar tools include PAMIE, Watir, and so on.
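A minimal Selenium sketch, assuming the selenium package and a Firefox driver are installed; the URL is a placeholder.

from selenium import webdriver

# Launch a real browser; everything it renders (including JavaScript) is available
driver = webdriver.Firefox()
driver.get('http://XXXX')

# The fully rendered page source, ready for your usual parsing
print driver.page_source.encode('utf-8')

driver.quit()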
11 Some experience
1. Connection pool:
Both opener.open and urllib2.urlopen create a brand-new HTTP request every time they are called. Usually this is not a problem, because in a single-threaded program you generate perhaps one new request per second; in a multithreaded program, however, that can become dozens or hundreds of requests per second, and within a few minutes any normal, rational server will have banned you.
For ordinary HTML requests, though, keeping dozens of simultaneous connections to the server is perfectly normal, so you can maintain an HTTPConnection pool by hand and pick a connection from the pool for each fetch.
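A minimal sketch of such a pool, built on the standard httplib and Queue modules; the host, the pool size, and the reconnect handling are simplified assumptions, and a production pool would be more careful.

import httplib
import Queue

class ConnectionPool(object):
    def __init__(self, host, size=10):
        self.host = host
        self.pool = Queue.Queue()
        for _ in range(size):
            self.pool.put(httplib.HTTPConnection(host))

    def fetch(self, path):
        conn = self.pool.get()  # block until a connection is free
        try:
            conn.request('GET', path)
            data = conn.getresponse().read()
        except (httplib.HTTPException, IOError):
            # The server may have dropped a kept-alive connection; replace it
            conn = httplib.HTTPConnection(self.host)
            conn.request('GET', path)
            data = conn.getresponse().read()
        finally:
            self.pool.put(conn)  # hand the connection back to the pool
        return data

pool = ConnectionPool('www.baidu.com')
print len(pool.fetch('/'))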
A sneaky shortcut here is to crawl through Squid as a proxy server: Squid then maintains the connection pool for you automatically, comes with data caching as well, and is something I install on every one of my servers anyway.
2. Set up automatic retry after failure:
def get(self, req, retries=3):
    try:
        response = self.opener.open(req)
        data = response.read()
    except Exception, what:
        print what, req
        if retries > 0:
            return self.get(req, retries - 1)
        else:
            print 'GET Failed', req
            return ''
    return data
3. Set a timeout:
import socket
socket.setdefaulttimeout(10)  # time out connections after 10 seconds
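Alternatively, since Python 2.6 urlopen accepts a per-request timeout, which avoids changing the global socket default; a small sketch with a placeholder URL:

import urllib2

# Give up on this request if the server does not respond within 10 seconds
response = urllib2.urlopen('http://XXXX', timeout=10)
print response.read()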