Repost: Python page parsing: BeautifulSoup vs lxml.html

Tags: xpath, xslt

Reposted from: http://www.cnblogs.com/rzhang/archive/2011/12/29/python-html-parsing.html

The two page parsing libraries commonly used in Python are BeautifulSoup and lxml.html. The former is probably better known, and I started out with BeautifulSoup, but I ran into several problems I could not work around, so in the end I switched to lxml:

1. BeautifulSoup is too slow. The program I was writing extracts the main text of web pages, which means a lot of DOM parsing. In my tests BS was on average about 10 times slower than lxml. The likely reason is that the native C code of libxml2 + libxslt is faster than Python.

2. BS relies on Python's own sgmllib, and sgmllib has at least two problems. First, it has trouble parsing an unquoted, non-ASCII attribute such as class=我的CSS类 ("my CSS class"), as the following code shows.

    from BeautifulSoup import BeautifulSoup
    html = u'<div class=我的CSS类>hello</div>'
    print BeautifulSoup(html).find('div')['class']

The printed result is a zero-length string, not 我的CSS类.

This problem can be worked around from your own code, though: just override sgmllib's attrfind, the regular expression used to locate element attributes, for example:

    import re
    import sgmllib
    sgmllib.attrfind = re.compile(r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*(\'[^\']*\'|"[^"]*"|[^\s^\'^\"^>]*))?')
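
The effect of the relaxed pattern can be checked in isolation with only the re module. This is a minimal Python 3 sketch: ascii_attrfind is a stand-in written for illustration (it assumes the stock pattern restricts unquoted values to ASCII characters, and is not copied from sgmllib), while relaxed_attrfind is the pattern given above. The helper attr_value is likewise hypothetical.

```python
import re

# Stand-in for the stock pattern (an assumption for illustration, not
# sgmllib's source): unquoted values may only contain ASCII characters.
ascii_attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$()_#=~@]*))?')

# Relaxed pattern from the article: an unquoted value is anything up to
# whitespace, a quote, or '>'.
relaxed_attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s^\'^\"^>]*))?')

html = '<div class=我的CSS类>hello</div>'
start = len('<div')  # attribute scanning begins right after the tag name


def attr_value(pattern):
    """Return the value captured for the first attribute of the tag."""
    return pattern.match(html, start).group(3)


ascii_value = attr_value(ascii_attrfind)      # empty: no ASCII character to match
relaxed_value = attr_value(relaxed_attrfind)  # the full non-ASCII class name
print(repr(ascii_value), repr(relaxed_value))
```

With the ASCII-only character class the value group matches zero characters, which is consistent with the empty string BS printed. The relaxed class captures everything up to the closing bracket.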

Arguably this is the fault of sloppily written web pages rather than of sgmllib, but it runs counter to BS's original goal of parsing badly formed HTML.

The second problem is more serious; see the example code below.

    from BeautifulSoup import BeautifulSoup
    html = u'<a onclick="if(x>10) alert(x);" href="javascript:void(0)">hello</a>'
    print BeautifulSoup(html).find('a').attrs

The printed results are:

    [(u'onclick', u'if(x>10) alert(x);')]

Obviously the href attribute has been dropped. The reason is that when sgmllib parses attributes, a special character such as > ends attribute parsing. The only way to fix this is to modify the parse_starttag method of SGMLParser in sgmllib: find line 292, k = match.end(0), and add the following code after it:

    if k > j:
        match = endbracket.search(rawdata, k + 1)
        if not match: return -1
        j = match.start(0)
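
To see why those four lines help, here is a simplified, standalone Python 3 sketch of the attribute loop. It is not sgmllib's actual code: the function parse_attrs, its patched flag, and the three regular expressions are assumptions made for illustration. The sketch only mirrors the idea described above: the end of the tag (j) is first guessed as the first bracket found, and the patch moves that guess forward whenever an attribute match (k) runs past it.

```python
import re

# Simplified patterns, assumed for this sketch (not copied from sgmllib).
endbracket = re.compile(r'[<>]')
tagfind = re.compile(r'[a-zA-Z][-_.a-zA-Z0-9]*')
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s\'">]*))?')


def parse_attrs(rawdata, patched):
    """Return (name, value) pairs for the start tag at the start of rawdata."""
    match = endbracket.search(rawdata, 1)
    if not match:
        return None
    j = match.start(0)                    # first guess at the end of the tag
    k = tagfind.match(rawdata, 1).end(0)  # position just after the tag name
    attrs = []
    while k < j:
        match = attrfind.match(rawdata, k)
        if not match:
            break
        name, rest, value = match.group(1, 2, 3)
        if not rest:
            value = name                  # bare attribute without '=value'
        elif len(value) >= 2 and value[0] == value[-1] and value[0] in '\'"':
            value = value[1:-1]           # strip the surrounding quotes
        attrs.append((name.lower(), value))
        k = match.end(0)
        if patched and k > j:
            # The attribute ran past the guessed end of the tag, so the
            # bracket found earlier was inside a value: look for the next one.
            match = endbracket.search(rawdata, k + 1)
            if not match:
                return None
            j = match.start(0)
    return attrs


html = '<a onclick="if(x>10) alert(x);" href="javascript:void(0)">hello</a>'
unpatched_attrs = parse_attrs(html, patched=False)
patched_attrs = parse_attrs(html, patched=True)
print(unpatched_attrs)
print(patched_attrs)
```

Without the patch the loop stops as soon as k passes the > inside the onclick value, so href is never reached. With it, j is moved to the next bracket and the loop goes on to the second attribute.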

By comparison, lxml fares much better. It may well have trouble with some HTML, but in my use so far it has worked very well. And lxml's XPath support feels really great. I learned XPath/XSLT a few years ago while tinkering with ASP.NET/Web services, but hardly ever had a practical use for it. Now, with lxml's XPath, many otherwise cumbersome element lookups become much quicker to write. For example, to find all meta elements that have both a name attribute and a content attribute:

    dom.xpath('.//meta[@name][@content]')

Here is the code that determines whether element x is an ancestor of element y (or is y itself):

    x in y.xpath('ancestor-or-self::*')
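
Both expressions can be tried on a small page. The following is a minimal Python 3 sketch that assumes lxml is installed; the sample markup and the ids outer and inner are made up for illustration.

```python
import lxml.html

# Made-up sample page: only the first meta element has both attributes.
page = '''<html><head>
<meta name="description" content="demo page">
<meta http-equiv="refresh" content="30">
<meta name="keywords">
</head><body><div id="outer"><p id="inner">hello</p></div></body></html>'''

dom = lxml.html.fromstring(page)

# meta elements carrying both a name and a content attribute
metas = dom.xpath('.//meta[@name][@content]')
meta_names = [m.get('name') for m in metas]

# Ancestor test: x is the outer div, y is the paragraph nested inside it.
x = dom.get_element_by_id('outer')
y = dom.get_element_by_id('inner')
x_contains_y = x in y.xpath('ancestor-or-self::*')
y_contains_x = y in x.xpath('ancestor-or-self::*')

print(meta_names, x_contains_y, y_contains_x)
```

Because the axis is ancestor-or-self, the test also returns True when x and y are the same element; use ancestor::* if only strict ancestors should count.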

In addition, lxml supports XPath 1.0 functions such as string-length, count, and so on (see "XPath and XSLT with lxml"). XPath 2.0 functions, such as those for sequence operations, are not available, however; that would require the underlying libxml2 and libxslt libraries to support them.
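
As a small illustration of the XPath 1.0 functions mentioned here, the following Python 3 sketch (assuming lxml is installed, with made-up sample markup) uses count and string-length both as top-level expressions and inside a predicate.

```python
import lxml.html

# Made-up fragment with three paragraphs of different lengths.
dom = lxml.html.fromstring(
    '<div><p>short</p><p>a much longer paragraph</p><p></p></div>')

# Numeric XPath results come back as Python floats.
paragraph_count = dom.xpath('count(.//p)')
first_length = dom.xpath('string-length(.//p[1])')

# Functions can also be used inside predicates to filter elements.
long_paragraphs = dom.xpath('.//p[string-length(normalize-space(.)) > 10]')
long_texts = [p.text for p in long_paragraphs]

print(paragraph_count, first_length, long_texts)
```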

Of course, lxml has problems of its own: it seems to have re-entrancy issues under multithreading, so if you need to parse a large number of pages, the only option is to run several processes instead.
