Python Web page parsing

Source: Internet
Author: User
Tags: xpath, server, memory

Continuing from the previous article on crawling web pages, this one covers the next step: parsing the pages.

There are many libraries for parsing pages in Python. I started with BeautifulSoup, which is probably the best-known HTML parsing library for Python. Its main strength is fault tolerance: it can cope with all kinds of messy real-world web pages, and its API is flexible and rich.

But in my own body-text extraction project I gradually found BeautifulSoup unbearable, mainly for the following reasons:

    • BeautifulSoup 3 (the current version) relies on Python's built-in sgmllib.py, and sgmllib.py has several intolerable problems, so it does not produce correct results when parsing certain web pages.
    • BeautifulSoup 3 is implemented in pure Python from top to bottom, so its speed is simply intolerable; sometimes it is even slower than fetching the page over the network.

Let's look at these problems in turn.

First, the parsing problem. Look at the following Python code:

from BeautifulSoup import BeautifulSoup

html = u'<div class=my-css>hello</div>'
print BeautifulSoup(html).find('div')['class']

html = u'<div class=我的CSS类>hello</div>'
print BeautifulSoup(html).find('div')['class']

html = u'<div class="我的CSS类">hello</div>'
print BeautifulSoup(html).find('div')['class']

The first print outputs my-css, the second prints nothing at all, and only the third prints the non-ASCII class name we expected.

It turns out that the regular expression sgmllib uses to parse attributes does not take non-ASCII characters into account. This is a relatively easy problem to fix: import the sgmllib module at the start of the program and overwrite its attribute-matching regular expression (the attrfind variable), like this:

import re
import sgmllib

sgmllib.attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s^\'^\"^>]*))?')

The second problem is a bit more troublesome. Again, the code:

from BeautifulSoup import BeautifulSoup

html = u'<a onclick="if(x>10) alert(x);" href="javascript:void(0)">hello</a>'
print BeautifulSoup(html).find('a').attrs

The printed result is [(u'onclick', u'if(x>10) alert(x);')].

Obviously, the href attribute of the <a> element has been lost. The reason is that when sgmllib parses attributes, it stops parsing them as soon as it meets a special character such as >. The only way to fix this is to modify the parse_starttag method of SGMLParser in sgmllib: find line 292, which reads k = match.end(0), and add the following code after it:

if k > j:
    match = endbracket.search(rawdata, k+1)
    if not match:
        return -1
    j = match.start(0)

As for BeautifulSoup being slow, that cannot be fixed with a simple code change; the only cure is to switch to another HTML parsing library. I now use lxml, which is built on the C libraries libxml2 and libxslt. In my own tests it is on average about ten times faster than BeautifulSoup 3.
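If you want a rough sense of the difference on your own pages, a minimal timing sketch like the one below can be used. It is only illustrative: the file name sample.html and the iteration count are made up, and the numbers will vary with the page and the machine.

import time

from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3
import lxml.html

# hypothetical test page, assumed to be UTF-8 encoded
html = open('sample.html').read().decode('utf-8')

start = time.time()
for _ in range(100):
    BeautifulSoup(html)
print 'BeautifulSoup 3:', time.time() - start

start = time.time()
for _ in range(100):
    lxml.html.fromstring(html)
print 'lxml:', time.time() - start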

Parsing HTML with the lxml library is very simple and its compatibility is quite good; most pages on real web sites can be parsed correctly. lxml also supports the very convenient XPath syntax for querying elements, including XPath 1.0 functions such as string-length() and count() (see "XPath and XSLT with lxml"). XPath 2.0 functions, however, such as those for sequence operations, are not available; that would require the developers of the underlying libxml2 and libxslt libraries to upgrade them and add XPath 2.0 support.
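For example, XPath 1.0 functions can be used directly inside queries. A small sketch (the HTML snippet is made up for illustration):

import lxml.html

html = u'<div><p>short</p><p>a longer paragraph</p><a href="/x">x</a></div>'
dom = lxml.html.fromstring(html)

# XPath 1.0 functions such as count() and string-length() work in lxml
print dom.xpath('count(.//p)')                        # prints 2.0
print dom.xpath('.//p[string-length(text()) > 10]')   # list with the longer <p> element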

Suppose we have fetched a page as a Unicode string and want to get all the links that have an href attribute; the code is as follows:

import lxml.html

dom = lxml.html.fromstring(html)
links = dom.xpath('.//a[@href]')
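Each entry returned by that XPath query is an element object; if what you actually want is the URL strings and the link text, they can be read off each element, roughly like this:

for a in links:
    print a.get('href')        # the raw attribute value
    print a.text_content()     # the visible link text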

There are some caveats to using lxml:

    1. You cannot pass a zero-length string to lxml.html.fromstring, or it will throw a parse exception. You need to check the length in advance and skip (or special-case) empty strings (see the sketch after this list).
    2. Some pages, for whatever reason, contain \x00 in their content, i.e. the ASCII NUL character. Since lxml is written in C, and in C \x00 marks the end of a string, these characters must be replaced first ( html.replace(u'\x00', u'') ).
    3. In general, to reduce encoding-guessing errors, the string we pass to lxml.html.fromstring should be a Unicode string, i.e. one whose encoding has already been detected and decoded. But if the page starts with <?xml and declares an encoding (like <?xml version="1.0" encoding="UTF-8" ?>), that is, XML-wrapped HTML, then we must pass lxml the original raw string, otherwise lxml will also raise an exception, because for this kind of document lxml insists on using its own decoding mechanism.
    4. lxml's fault tolerance is limited; it is nowhere near as forgiving as mainstream browsers. So there are pages that browsers can more or less render normally but that lxml still cannot parse.
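A minimal wrapper covering the first two precautions might look like the sketch below. The function name safe_fromstring is just illustrative, and the input is assumed to be an already-decoded Unicode string.

import lxml.html

def safe_fromstring(html):
    # illustrative helper, not part of lxml itself
    if not html:                       # caveat 1: never pass a zero-length string
        return None
    html = html.replace(u'\x00', u'')  # caveat 2: strip NUL characters that upset libxml2
    return lxml.html.fromstring(html)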

The first two problems are easy to solve as described above, but the third one is the most troublesome, because you cannot know in advance whether a page is an XML document with an encoding declaration; you only find out from the exception. And lxml uses ValueError to report all kinds of errors, so to be precise you have to inspect the exception's message string to tell whether it was caused by this particular parsing problem. If it was, pass the undecoded page content to lxml again; if not, re-raise the error.
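One way to handle this, sketched under the assumption that the raw (undecoded) bytes are still available and with the caveat that the exact wording of lxml's error message may differ between versions:

import lxml.html

def parse_page(unicode_html, raw_bytes):
    # hypothetical helper: prefer the decoded text, fall back to raw bytes
    # when lxml complains about an XML encoding declaration
    try:
        return lxml.html.fromstring(unicode_html)
    except ValueError as e:
        # lxml raises ValueError for many different errors, so inspect the message
        if 'encoding declaration' in str(e):
            return lxml.html.fromstring(raw_bytes)
        raise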

The fourth problem is annoying, but frankly there is not much we can do about it; lxml's tolerance is already quite good. The practical choice is to give up on those pages, or to switch tools and stop using lxml for page parsing, but it is hard to find an HTML parsing library in Python that is better than lxml.

For an example of parsing HTML with lxml, see the __create_dom method in the body-extraction program.

There is also one unresolved problem: memory pressure is high when parsing pages, and lxml seems to have a memory-overflow issue.

I have a program that scans tens of thousands of pages every day and parses four or five thousand of them. Roughly every month or month and a half, this program fills the server's memory completely, no matter how much memory there is. I used to think it might be a re-entrancy problem in some underlying code (the program used multithreading), but the problem persisted after switching to a multi-process, single-threaded model; tracing finally showed that the overflow happens in calls to lxml.html.fromstring.

But this bug is extremely hard to reproduce, and I have many other programs that call lxml constantly for HTML parsing (some daily, some hourly, some parsing hundreds of pages at a time), yet only this one program occasionally overflows. Very frustrating.

There are similar reports on the web, but it is still not possible to reproduce or pin down the bug reliably. So all I can do is write a shell script that limits how much memory the Python program may use and launch the program through that script, like this:

#!/bin/bash
ulimit -m 1536000 -v 1536000
python my-prog.py
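If you would rather set the limit from inside the Python program itself, the standard resource module can do something similar on Unix systems; a sketch (the 1.5 GB figure simply mirrors the script above):

import resource

# cap the process's virtual address space (what ulimit -v limits, in kilobytes)
limit = 1536000 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))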

Parsing a page also involves some other work, such as converting non-standard (site-specific custom) HTML tags into span or div so that the text inside them is still recognized. The rest is debugging.
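A rough sketch of that kind of tag normalization with lxml, assuming a hand-maintained whitelist of tags we care about (the whitelist here is just an incomplete example):

import lxml.html

KNOWN_TAGS = set(['html', 'body', 'div', 'span', 'p', 'a', 'img',
                  'ul', 'ol', 'li', 'table', 'tr', 'td', 'br'])

def normalize_tags(dom):
    # rewrite site-specific custom tags into <span> so their text is not lost
    for el in dom.iter():
        # comments and processing instructions have a non-string tag; skip them
        if isinstance(el.tag, basestring) and el.tag.lower() not in KNOWN_TAGS:
            el.tag = 'span'
    return dom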

Finally, as this article was being written, BeautifulSoup 4 was about to be released. It promises to support several different HTML parsing engines, including Python's built-in parser, lxml, and html5lib, so perhaps the new version of BeautifulSoup will also be a good choice.
