Transferred from: http://www.cnblogs.com/rzhang/archive/2011/12/29/python-html-parsing.html
Python has two commonly used page-parsing libraries: BeautifulSoup and lxml.html. The former is probably the better known of the two; I started out with BeautifulSoup, but ran into a few problems I could not get around, so I ended up switching to lxml:
1. BeautifulSoup is too slow. My program extracts text from web pages, which means doing a great deal of DOM parsing, and testing showed that BS is on average about 10x slower than lxml. The reason is presumably that libxml2 and libxslt's native C code is faster than pure Python.
2. BS relies on Python's built-in sgmllib, and sgmllib has at least two problems. First, it fails to parse a string such as class=我的CSS类 (an unquoted, non-ASCII attribute value), as the following code shows.
```python
from BeautifulSoup import BeautifulSoup
html = u'<div class=我的CSS类>hello</div>'
print BeautifulSoup(html).find('div')['class']
```
The printed result is a zero-length string, not 我的CSS类.
This particular problem can at least be worked around from outside the library: just rewrite attrfind, the regular expression sgmllib uses to find element attributes, like so:
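For comparison, lxml.html copes with the same unquoted non-ASCII attribute value. A minimal check, assuming the lxml package is installed:

```python
from lxml import html

# The same markup that defeats sgmllib's attrfind regex.
doc = html.fromstring(u'<div class=我的CSS类>hello</div>')
print(doc.get('class'))
```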
```python
import re
import sgmllib

sgmllib.attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*(\'[^\']*\'|"[^"]*"|[^\s^\'^\"^>]*))?')
```
You could argue that this one is the web page's fault for being non-standard, not sgmllib's, but it runs contrary to BS's stated goal of handling badly formed HTML.
The second problem is more fatal; see the example code below.
```python
from BeautifulSoup import BeautifulSoup
html = u'<a onclick="if(x>10) alert(x);" href="javascript:void(0)">hello</a>'
print BeautifulSoup(html).find('a').attrs
```
The printed result is:
```python
[(u'onclick', u'if(x>10) alert(x);')]
```
Obviously the href attribute has been discarded. The reason is that sgmllib stops parsing attributes as soon as it meets a special character such as >. The only way to solve this is to modify SGMLParser.parse_starttag in sgmllib itself: find line 292, k = match.end(0), and add the following code after it:
```python
if k > j:
    match = endbracket.search(rawdata, k + 1)
    if not match:
        return -1
    j = match.start(0)
```
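Rather than patching sgmllib, the same markup can simply be handed to lxml.html, which keeps both attributes intact. A quick sketch, assuming lxml is installed:

```python
from lxml import html

# The anchor tag whose href BeautifulSoup/sgmllib drops.
doc = html.fromstring(
    u'<a onclick="if(x>10) alert(x);" href="javascript:void(0)">hello</a>')

# Both onclick and href survive parsing.
print(sorted(doc.attrib.items()))
```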
By contrast, lxml is much better. It may also have trouble with some HTML, but in my experience it has held up very well. And lxml's XPath support really is great: a few years ago, while tinkering with ASP.NET and web services, I learned some XPath/XSLT but rarely got to use it in practice; with lxml's XPath, many otherwise cumbersome element lookups can be written much faster. It's so cool. For example, to find all meta elements that have both a name attribute and a content attribute:
```python
dom.xpath('.//meta[@name][@content]')
```
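As a self-contained illustration (the page content here is made up), only the meta element carrying both attributes matches:

```python
from lxml import html

dom = html.fromstring(
    '<html><head>'
    '<meta charset="utf-8">'
    '<meta name="description" content="a demo page">'
    '<meta name="keywords">'
    '</head><body></body></html>')

# charset has no name, keywords has no content: only description matches.
hits = dom.xpath('.//meta[@name][@content]')
print([m.get('name') for m in hits])
```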
And here is code that determines whether element x is an ancestor of element y:
```python
x in y.xpath('ancestor-or-self::*')
```
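A runnable sketch of that ancestor test, on a made-up snippet:

```python
from lxml import html

tree = html.fromstring('<div id="outer"><p><span id="inner">hi</span></p></div>')
x = tree                      # the outer div
y = tree.xpath('.//span')[0]  # the inner span

# The span's ancestor-or-self axis includes the div, but not vice versa.
print(x in y.xpath('ancestor-or-self::*'))
print(y in x.xpath('ancestor-or-self::*'))
```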
In addition, lxml supports XPath 1.0 functions such as string-length and count (see "XPath and XSLT with lxml"). XPath 2.0 functions, such as those for sequence operations, are not supported; that would require upgrading the underlying libxml2 and libxslt libraries.
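For instance, count() and string-length() (both XPath 1.0) can be called directly from a path expression; the list markup below is invented for the demo:

```python
from lxml import html

doc = html.fromstring('<ul><li>a</li><li>bb</li><li>ccc</li></ul>')

# XPath 1.0 numeric results come back as Python floats.
print(doc.xpath('count(.//li)'))
print(doc.xpath('string-length(.//li[3])'))
```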
Of course, lxml has a problem of its own: it appears to have re-entrancy issues when used from multiple threads. If you need to parse a large number of pages, the only workaround is to start several processes instead.
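One way to structure that is the standard multiprocessing module; the helper name and the sample pages below are made up for illustration:

```python
from multiprocessing import Pool
from lxml import html

def extract_title(source):
    """Parse one page in a worker process and pull out its <title> text."""
    return html.fromstring(source).findtext('.//title')

if __name__ == '__main__':
    pages = ['<html><head><title>page %d</title></head><body></body></html>' % i
             for i in range(4)]
    # Each worker process gets its own libxml2 state, sidestepping the
    # thread re-entrancy concern described above.
    with Pool(processes=2) as pool:
        print(pool.map(extract_title, pages))
```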