Tips for fetching web pages with Python

Writing a web page fetcher in Python is very quick. Here's an example:

import urllib2

html = urllib2.urlopen('http://blog.raphaelzhang.com').read()

But in real work this is far from enough. You will run into at least the following problems:

    • The network can fail, and any error is possible: the remote machine is down, the network cable is broken, the domain name is wrong, the request times out, the page no longer exists, the site redirects, the service is forbidden, the host is overloaded...
    • The server restricts access to ordinary browsers only
    • The server adds anti-hotlinking restrictions
    • Some obnoxious websites always send you gzip-compressed content, no matter whether your HTTP request carries an Accept-Encoding header or what your headers look like
    • URLs come in all sorts of strange forms, with Chinese characters, and some even contain carriage returns and line feeds
    • Some websites declare one Content-Type in the HTTP header and several more Content-Type declarations inside the page, and they may all differ; worst of all, the body may not actually use any of the declared encodings, producing garbled text
    • The network link is slow; multiply that by thousands of pages to analyze and you have time for a nice long meal
    • Python's own interfaces are a bit rough

Well, let's deal with these problems one by one.

Error handling and server restrictions

First, error handling. Since urlopen itself raises an exception for most errors, including HTTP responses with 4XX and 5XX status codes, we just need to catch the exception. At the same time, we can get the response object returned by urlopen and read its HTTP status code. In addition, when calling urlopen we should also set the timeout parameter so that timeouts are handled properly. Here is a code example:

import urllib2
import socket

try:
    f = urllib2.urlopen('http://blog.raphaelzhang.com', timeout=10)
    code = f.getcode()
    if code < 200 or code >= 300:
        # your own HTTP error handling
        pass
except Exception, e:
    if isinstance(e, urllib2.HTTPError):
        print 'HTTP Error: {0}'.format(e.code)
    elif isinstance(e, urllib2.URLError) and isinstance(e.reason, socket.timeout):
        print 'URL Error: socket timeout {0}'.format(e.__str__())
    else:
        print 'Misc Error: ' + e.__str__()

If the server imposes restrictions, in general we can look at a real browser's request and set the corresponding HTTP headers. For example, against browser checks we can set the User-Agent header, and against anti-hotlinking restrictions we can set the Referer header. Here is the sample code:

import urllib2

req = urllib2.Request('http://blog.raphaelzhang.com', headers={
    "Referer": "http://www.baidu.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.68 Safari/534.24"
})
html = urllib2.urlopen(url=req, timeout=10).read()

Some websites use cookies to impose restrictions, mainly for login and rate limiting. There is no universal solution here; you can only check whether automatic login is feasible, or analyze the cookies case by case.
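For the common case where the site only needs session cookies to be carried across requests, here is a minimal sketch using cookielib; the login URL and form fields are hypothetical, and real sites may need more than this:

import urllib
import urllib2
import cookielib

# a cookie-aware opener keeps cookies across requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# hypothetical login form; the URL and field names depend on the actual site
login_data = urllib.urlencode({'user': 'me', 'password': 'secret'})
opener.open('http://example.com/login', login_data, 10)

# later requests through the same opener carry the session cookie
html = opener.open('http://example.com/members/page.html', None, 10).read()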

URL and content processing

Strangely-shaped URLs can only be analyzed and handled one by one, case by case. For example, a URL may contain Chinese characters, relative paths, carriage returns, and so on. We can first use the urljoin function from the urlparse module to resolve relative paths, then strip out the stray carriage returns and similar junk, and finally use urllib2's quote function to handle the encoding and escaping of special characters, which yields a real URL.

Of course, in practice you will find that Python's urljoin and urlopen both have some problems, so the concrete code is given later.

For those sites that send gzip-compressed content no matter what, we simply examine the content format itself instead of the HTTP headers. The code looks like this:

import urllib2
import gzip
import cStringIO

html = urllib2.urlopen('http://blog.raphaelzhang.com').read()
# check the gzip magic bytes instead of trusting the headers
if html[:6] == '\x1f\x8b\x08\x00\x00\x00':
    html = gzip.GzipFile(fileobj=cStringIO.StringIO(html)).read()

Now comes the painful part: encoding. Since we are Chinese and use Chinese characters, we have to figure out what encoding a piece of text actually uses, or the legendary mojibake will appear.

Following the general browser processing flow, the page encoding is determined first from the Content-Type field in the HTTP response header sent by the server. For example, text/html; charset=utf-8 means this is an HTML page encoded in UTF-8. If there is no such HTTP header, look in the head area of the page for a meta element that either has a charset attribute or has an http-equiv attribute equal to Content-Type. In the former case, read the charset attribute directly; in the latter, read the element's content attribute, whose format is similar to the header described above.

In theory, if everyone played by the rules, the encoding problem would be easy to settle. The trouble is that people don't necessarily follow the rules and take their own shortcuts. First, the HTTP response does not necessarily carry a Content-Type header; then some pages may have no Content-Type at all, or several Content-Type declarations (for example, Baidu's search cache pages). At that point we have to take the risk of guessing, so the general process for handling the encoding is:

    1. Read the headers.dict['content-type'] attribute of the HTTP response object returned by urlopen and extract the charset
    2. Use a regular expression to parse the charset value of the Content-Type meta element in the head area of the HTML and read it
    3. If the encoding obtained in the previous two steps is only gb2312 or GBK, you can assume the page encoding is GBK (GBK is a superset of gb2312 anyway)
    4. If more than one encoding is found and they are incompatible, for example UTF-8 and ISO-8859-1, then we have to fall back to the chardet module: call chardet's detect function and read the encoding property of the result
    5. Once the encoding is known, the page can be converted into a Python unicode string with the decode method, which makes subsequent unified processing much easier

A few things to note. My practice is that as long as a page declares gb2312, I treat it as GBK, because many sites that declare gb2312 are actually GBK-encoded; after all, gb2312 covers far too few characters, and many characters, such as those in names like Zhu Di and David Tao, are outside its range.

Second, chardet is not a cure-all. If a ready-made Content-Type/charset is available, use it. chardet works by guessing: it mainly follows the encoding-guessing algorithm of old versions of Firefox, which relies on statistical characteristics of each encoding, such as certain byte patterns appearing with relatively high frequency in gb2312 text. In my actual use, its error rate is around 10%, and because it has to analyze and guess, it also takes rather long.

The detection code is a bit long; you can see the implementation of the GetEncoding function in this code.
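Since that implementation is not reproduced here, below is a minimal sketch of steps 1-4 above; the function name and the regular expressions are my own, and a real implementation needs more edge-case handling:

import re
import chardet

def guess_encoding(response, html):
    # step 1: charset from the HTTP Content-Type header, if any
    encodings = []
    content_type = response.headers.dict.get('content-type', '')
    m = re.search(r'charset=([\w-]+)', content_type, re.I)
    if m:
        encodings.append(m.group(1).lower())

    # step 2: charset declared in meta elements inside the page itself
    for m in re.finditer(r'<meta[^>]+charset=["\']?([\w-]+)', html, re.I):
        encodings.append(m.group(1).lower())

    # step 3: treat gb2312 as GBK, since GBK is a superset of gb2312
    encodings = ['gbk' if e == 'gb2312' else e for e in encodings]

    if len(set(encodings)) == 1:
        return encodings[0]

    # step 4: nothing found, or conflicting declarations -- fall back to chardet
    return chardet.detect(html)['encoding']

With the encoding in hand, step 5 is simply html.decode(encoding, 'replace').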

Improve performance

The main performance bottleneck of a web crawler is network handling. Here you can use cProfile.run and pstats.Stats to verify how much time each function call takes. In general, there are several ways to attack the problem:

    • Use threading or multiprocessing to fetch several pages in parallel. Usage is very simple (see the Python documentation); a sketch follows this list
    • Add an Accept-Encoding header to the HTTP request and set it to gzip, meaning that you accept gzip-compressed data. Most websites support gzip compression, which saves 70 to 80% of the traffic
    • If you do not need to read the whole page, you can add a Range header to the HTTP request (the same header used for HTTP resumable downloads). For example, setting the Range header to bytes=0-1023 asks for only the first 1024 bytes of content, which can save a lot of bandwidth. A few sites do not support this header, though, so watch out
    • When calling urlopen, be sure to set the timeout parameter, or the program may wait there forever
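As a concrete illustration of the first two points, here is a minimal sketch that fetches a couple of pages in parallel with threading and asks for gzip-compressed data; the URL list is made up, and the error handling discussed earlier is omitted for brevity:

import gzip
import cStringIO
import threading
import urllib2

def fetch(url, results, index):
    req = urllib2.Request(url, headers={
        "Accept-Encoding": "gzip",   # accept compressed responses
        # "Range": "bytes=0-1023",   # alternatively, ask only for the first 1 KB
    })
    data = urllib2.urlopen(req, timeout=10).read()
    # decompress only if the server actually sent gzip
    if data[:2] == '\x1f\x8b':
        data = gzip.GzipFile(fileobj=cStringIO.StringIO(data)).read()
    results[index] = data

urls = ['http://blog.raphaelzhang.com/', 'http://www.example.com/']  # made up
results = [None] * len(urls)
threads = [threading.Thread(target=fetch, args=(url, results, i))
           for i, url in enumerate(urls)]
for t in threads:
    t.start()
for t in threads:
    t.join()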

As for whether to use multithreading or multiprocessing for the parallel part, my tests so far show little difference; after all, the main bottleneck is the network link.
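To check where the time really goes, the cProfile and pstats modules mentioned above are enough; a minimal sketch, assuming a hypothetical fetch_all function that drives the crawl:

import cProfile
import pstats

# profile the (hypothetical) crawl entry point and dump the raw stats to a file
cProfile.run('fetch_all(urls)', 'crawl.prof')

# print the ten calls with the largest cumulative time
pstats.Stats('crawl.prof').sort_stats('cumulative').print_stats(10)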

In fact, besides the methods above, there are other options in Python that might improve performance even more, for example tools with better multithreading/multiprocessing support such as greenlet, Stackless, or PyPy, or asynchronous I/O with Twisted or pycurl. But I am not familiar with greenlet, I find Twisted too twisted and pycurl not Pythonic enough, and I worry that Stackless and PyPy would break other Python programs, so I stick with the urllib2 + threading approach. Of course, because of the GIL, Python multithreading is still not that fast, but compared with the single-threaded case it is still several times faster.

Python's little problems

Python exposes a few minor problems of its own when fetching web pages, mainly in the urlopen and urljoin functions.

The urljoin function in the urlparse module converts a relative URL, for example /img/mypic.png, together with the absolute URL of the current page, for example http://blog.raphaelzhang.com/apk/index.html, into an absolute URL, in this case http://blog.raphaelzhang.com/img/mypic.png. However, the result produced by urljoin needs further processing: removing redundant ../ path segments and stripping special characters such as carriage returns and line feeds. The code looks like this:

from urlparse import urljoin

relurl = '../../img/\nmypic.png'
absurl = 'http://blog.raphaelzhang.com/2012/'
url = urljoin(absurl, relurl)
# url is now http://blog.raphaelzhang.com/../img/\nmypic.png
url = reduce(lambda r, x: r.replace(x[0], x[1]),
             [('/../', '/'), ('\n', ''), ('\r', '')],
             url)
# url is now the normal http://blog.raphaelzhang.com/img/mypic.png

The urlopen function also has some problems; it actually has its own requirements for the URL string.

First, the URL you hand to urlopen must have special characters such as Chinese characters percent-encoded, using urllib2's quote function, like this:

import urllib2

# here url is of course the well-formed absolute URL produced by the
# urljoin and reduce processing above
url = urllib2.quote(url.split('#')[0].encode('utf8'), "%/:=&?~#+!$,;'@()*[]")

Second, the URL cannot contain a #. In theory, what follows the # is a fragment, used to locate a position inside the document rather than to fetch the document, and urlopen really should be able to handle this by itself, but it leaves the task to the developer. I deal with it with url.split('#')[0], as shown above.

Finally, Python 2.6 and earlier cannot handle a URL like http://blog.raphaelzhang.com?id=123 and will raise a runtime error; we have to manually turn such a URL into the normal form http://blog.raphaelzhang.com/?id=123 before it works. This bug has been fixed in Python 2.7.
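One possible workaround for the older versions (my own sketch, not from the original text) is to insert the missing '/' path before calling urlopen:

from urlparse import urlsplit, urlunsplit

def fix_empty_path(url):
    # http://blog.raphaelzhang.com?id=123 -> http://blog.raphaelzhang.com/?id=123
    parts = urlsplit(url)
    if not parts.path:
        return urlunsplit((parts.scheme, parts.netloc, '/', parts.query, parts.fragment))
    return url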

Well, that covers the problems of fetching web pages with Python. The problems of parsing the downloaded pages remain; stay tuned for the next installment.
