Transferred from: http://www.cnblogs.com/rzhang/archive/2011/12/29/python-html-parsing.html
Python has two commonly used page-parsing libraries: BeautifulSoup and lxml.html. The former is probably the better known of the two; I started out with BeautifulSoup, but ran into a few problems I could not get around, so I ended up switching to lxml:
1. BeautifulSoup is too slow. My program extracts text from web pages, which means doing a great deal of DOM parsing, and testing showed that BS is on average about 10x slower than lxml. The reason is presumably that libxml2 and libxslt's native C code is faster than pure Python.
2. BS relies on Python's built-in sgmllib, and sgmllib has at least two problems. First, it fails to parse a string such as class=我的CSS类 (an unquoted, non-ASCII attribute value), as the following code shows.
```python
from BeautifulSoup import BeautifulSoup
html = u'<div class=我的CSS类>hello</div>'
print BeautifulSoup(html).find('div')['class']
```
The printed result is a zero-length string, not 我的CSS类.
This particular problem can at least be worked around from outside the library: just rewrite attrfind, the regular expression sgmllib uses to find element attributes, like so:
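For comparison, lxml.html copes with the same unquoted non-ASCII attribute value. A minimal check, assuming the lxml package is installed:

```python
from lxml import html

# The same markup that defeats sgmllib's attrfind regex.
doc = html.fromstring(u'<div class=我的CSS类>hello</div>')
print(doc.get('class'))
```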
```python
import re
import sgmllib

sgmllib.attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*(\'[^\']*\'|"[^"]*"|[^\s^\'^\"^>]*))?')
```
You could argue that this one is the web page's fault for being non-standard, not sgmllib's, but it runs contrary to BS's stated goal of handling badly formed HTML.
The second problem is more fatal; see the example code below.
```python
from BeautifulSoup import BeautifulSoup
html = u'<a onclick="if(x>10) alert(x);" href="javascript:void(0)">hello</a>'
print BeautifulSoup(html).find('a').attrs
```
The printed result is:
```python
[(u'onclick', u'if(x>10) alert(x);')]
```
Obviously the href attribute has been discarded. The reason is that sgmllib stops parsing attributes as soon as it meets a special character such as >. The only way to solve this is to modify SGMLParser.parse_starttag in sgmllib itself: find line 292, k = match.end(0), and add the following code after it:
```python
if k > j:
    match = endbracket.search(rawdata, k + 1)
    if not match:
        return -1
    j = match.start(0)
```
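Rather than patching sgmllib, the same markup can simply be handed to lxml.html, which keeps both attributes intact. A quick sketch, assuming lxml is installed:

```python
from lxml import html

# The anchor tag whose href BeautifulSoup/sgmllib drops.
doc = html.fromstring(
    u'<a onclick="if(x>10) alert(x);" href="javascript:void(0)">hello</a>')

# Both onclick and href survive parsing.
print(sorted(doc.attrib.items()))
```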
By contrast, lxml is much better. It may also have trouble with some HTML, but in my experience it has held up very well. And lxml's XPath support really is great: a few years ago, while tinkering with ASP.NET and web services, I learned some XPath/XSLT but rarely got to use it in practice; with lxml's XPath, many otherwise cumbersome element lookups can be written much faster. It's so cool. For example, to find all meta elements that have both a name attribute and a content attribute:
```python
dom.xpath('.//meta[@name][@content]')
```
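As a self-contained illustration (the page content here is made up), only the meta element carrying both attributes matches:

```python
from lxml import html

dom = html.fromstring(
    '<html><head>'
    '<meta charset="utf-8">'
    '<meta name="description" content="a demo page">'
    '<meta name="keywords">'
    '</head><body></body></html>')

# charset has no name, keywords has no content: only description matches.
hits = dom.xpath('.//meta[@name][@content]')
print([m.get('name') for m in hits])
```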
And here is code that determines whether element x is an ancestor of element y:
```python
x in y.xpath('ancestor-or-self::*')
```
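A runnable sketch of that ancestor test, on a made-up snippet:

```python
from lxml import html

tree = html.fromstring('<div id="outer"><p><span id="inner">hi</span></p></div>')
x = tree                      # the outer div
y = tree.xpath('.//span')[0]  # the inner span

# The span's ancestor-or-self axis includes the div, but not vice versa.
print(x in y.xpath('ancestor-or-self::*'))
print(y in x.xpath('ancestor-or-self::*'))
```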
In addition, lxml supports XPath 1.0 functions such as string-length and count (see "XPath and XSLT with lxml"). XPath 2.0 functions, such as those for sequence operations, are not supported; that would require upgrading the underlying libxml2 and libxslt libraries.
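For instance, count() and string-length() (both XPath 1.0) can be called directly from a path expression; the list markup below is invented for the demo:

```python
from lxml import html

doc = html.fromstring('<ul><li>a</li><li>bb</li><li>ccc</li></ul>')

# XPath 1.0 numeric results come back as Python floats.
print(doc.xpath('count(.//li)'))
print(doc.xpath('string-length(.//li[3])'))
```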
Of course, lxml has a problem of its own: it appears to have re-entrancy issues when used from multiple threads. If you need to parse a large number of pages, the only workaround is to start several processes instead.
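One way to structure that is the standard multiprocessing module; the helper name and the sample pages below are made up for illustration:

```python
from multiprocessing import Pool
from lxml import html

def extract_title(source):
    """Parse one page in a worker process and pull out its <title> text."""
    return html.fromstring(source).findtext('.//title')

if __name__ == '__main__':
    pages = ['<html><head><title>page %d</title></head><body></body></html>' % i
             for i in range(4)]
    # Each worker process gets its own libxml2 state, sidestepping the
    # thread re-entrancy concern described above.
    with Pool(processes=2) as pool:
        print(pool.map(extract_title, pages))
```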