Parsing HTML with Python

Source: Internet
Author: User
Tags: tagname in python

Python has three libraries that can parse HTML text: HTMLParser, sgmllib, and htmllib. Their implementations differ, but their functionality is similar. The HTML-parsing classes these libraries provide are base classes and do nothing useful on their own. When they find a component of the document (a tag, a comment, a declaration, and so on), they call the corresponding handler method, which you must override because the base class does no processing in it.

For example:

"""<p>the <a href="http://ietf.org">IETF admonishes:
<i>be strict in what <b>send</b>.</i></a></p>
<form>
<input type=submit> <input type=text name=start size=4></form>
</body>"""

When this data is processed, HTMLParser calls the handle_starttag function each time it encounters a start tag.

Each library is introduced in detail below.

1, HTMLParser

#------------------htmlparser_stack.py------------------#
# -*- coding: gbk -*-
import HTMLParser, sys, os, string
html = ""

The output of this program:

/html/body/p >> the
/html/body/p/a >> IETF admonishes:
/html/body/p/a/i >> be strict in what
/html/body/p/a/i/b >> send
/html/body/p/a/i >> .
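
The following is a minimal sketch of how output in this "path >> text" form could be produced. It is written for Python 3, where the HTMLParser module is named html.parser, and it is not the original listing: the class name PathParser, the VOID_TAGS set, and the sample string are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Tags that never get an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}


class PathParser(HTMLParser):
    """Records each piece of text together with the path of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.lines = []   # collected "path >> text" lines

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("/" + "/".join(self.stack) + " >> " + text)


sample = (
    '<html><body><p>the <a href="http://ietf.org">IETF admonishes: '
    "<i>be strict in what <b>send</b>.</i></a></p>"
    "<form><input type=submit> <input type=text name=start size=4></form>"
    "</body></html>"
)

parser = PathParser()
parser.feed(sample)
parser.close()
for line in parser.lines:
    print(line)
```

The base class only detects the components; everything useful happens in the three overridden handlers.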

Some pages do not have strictly matched start and end tags. In that case some tags can be ignored, and you can write your own stack to handle them.

#*---------------TagStack class example-----------------#
class TagStack:
    def __init__(self, lst=None):
        # Use a fresh list per instance instead of a shared mutable default
        self.lst = lst if lst is not None else []

    def __getitem__(self, pos):
        return self.lst[pos]

    def append(self, tag):
        # Remove every paragraph-level tag if this is one
        if tag.lower() in ('p', 'blockquote'):
            self.lst = [t for t in self.lst if t not in ('p', 'blockquote')]
        self.lst.append(tag)

    def pop(self, tag):
        # "Pop" by tag from nearest pos, not only last item
        self.lst.reverse()
        try:
            pos = self.lst.index(tag)
        except ValueError:
            raise HTMLParser.HTMLParseError("Tag not on stack")
        del self.lst[pos]
        self.lst.reverse()

tagstack = TagStack()

HTMLParser has a bug: it cannot handle attribute values written in Chinese. For example, suppose the page contains the following line, where the value (translated here as "Jump to") is Chinese text:

<input type=submit value= Jump to >

Parsing then raises an error when it reaches this line.


The cause of the error is, once again, a regular expression:

attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

attrfind does not match Chinese characters.

You can change this pattern to fix the error. sgmllib does not have this problem.
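
A small sketch of the failure and of one possible change, runnable on its own with only the re module. The replacement pattern, the names attrfind_old and attrfind_new, and the Chinese sample value are assumptions chosen for illustration; this is not necessarily the fix the author had in mind.

```python
import re

# The attribute pattern as used by Python 2's HTMLParser module.
attrfind_old = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical variant: an unquoted value may be any run of characters
# other than whitespace or '>', which includes Chinese characters.
attrfind_new = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^\s>]*))?')

fragment = ' value=跳转 >'  # sample unquoted Chinese attribute value

old_value = attrfind_old.match(fragment).group(3)
new_value = attrfind_new.match(fragment).group(3)

print(repr(old_value))  # the old pattern captures an empty value
print(repr(new_value))  # the variant captures the Chinese text
```

With the old pattern the match stops right after the "=" sign, so the parser is left facing characters it does not expect inside a start tag.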

2, sgmllib


HTML is an application of SGML, so sgmllib can handle a good deal of it. The following code snippet demonstrates how sgmllib is used.

#------------------htmlparser_stack.py------------------#
# -*- coding: gbk -*-
import sgmllib, sys, os, string
html = ""

Output:


Start tag:
Start tag:<title>
/lala >> Advice
End Tag:</title>
End Tag:
Start tag:<body>
Start tag:<p>
/lala >> the
Start tag:<a>
/lala >> IETF admonishes:
Start tag:<i>
/lala >> be strict in what
Start tag:<b>
/lala >> send
End Tag:</b>
/lala >>.
End Tag:</i>
End Tag:</a>
End Tag:</p>
Start tag:<form>
Start tag:<input>
/lala >>υ
Start tag:<input>
End Tag:</form>
End Tag:</body>
End Tag:</lala>

As with HTMLParser, to parse HTML with sgmllib you subclass the sgmllib.SGMLParser class. The handler functions in this class are empty, and the user needs to override them. What the class provides is the dispatching: it calls the corresponding function in each particular situation.

For example, when the parser encounters a start tag, it calls the matching start_tagname function.

SGML tags are customizable: you can define a start_lala function, and it will then be used to process the <lala> tag.

One point needs explaining. If both the start_tagname function and the handle_starttag function are defined, only handle_starttag is run, so it does not matter if start_tagname is an empty function. If handle_starttag is not defined, the start_tagname function is run when the <tagname> tag is encountered. If no start_tagname function is defined for a tag, the tag is treated as unknown and the unknown_starttag function is called.
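
The dispatch order described above can be modelled with a small standalone class. This is a toy sketch, not sgmllib's own code (sgmllib is not available in Python 3): MiniDispatcher, feed_starttag, and the two subclasses are hypothetical names, and only the start_tagname lookup is modelled.

```python
class MiniDispatcher:
    """Toy model of the start-tag dispatch order; not sgmllib itself."""

    def __init__(self):
        self.log = []

    def feed_starttag(self, tag, attrs=()):
        # Look for a tag-specific handler such as start_lala.
        method = getattr(self, "start_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attrs)
        else:
            self.handle_starttag(tag, method, attrs)

    def handle_starttag(self, tag, method, attrs):
        # Default behaviour: simply run the tag-specific handler.
        method(attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append("unknown:" + tag)


class WithStartOnly(MiniDispatcher):
    def start_lala(self, attrs):
        self.log.append("start_lala")


class WithBoth(WithStartOnly):
    def handle_starttag(self, tag, method, attrs):
        # Overriding this means start_lala is found but never run.
        self.log.append("handle_starttag:" + tag)


a = WithStartOnly()
a.feed_starttag("lala")
a.feed_starttag("p")
print(a.log)

b = WithBoth()
b.feed_starttag("lala")
b.feed_starttag("p")
print(b.log)
```

In this model, overriding handle_starttag replaces the step that would otherwise run start_lala, which is why only one of the two ends up being executed.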
