Parsing HTML with Python

Last Update:2018-07-31 Source: Internet

Author: User

In Python, three libraries can parse HTML text: HTMLParser, sgmllib, and htmllib. Their implementations differ, but they work in a similar way. The HTML parsing classes that these libraries provide are base classes and do nothing useful on their own. When they find a component of the document (such as a tag, a comment, or a declaration), they call the corresponding handler function. You must override these handlers, because the base class does not process anything itself.

For example:

"""<p>the <a href="http://ietf.org">IETF admonishes:
<i>be strict in what <b>send</b>.</i></a></p>
<form>
<input type=submit> <input type=text name=start size=4></form>
</body>"""

When this data is processed, HTMLParser invokes the handle_starttag function when the

The following sections introduce these libraries in detail.

1, HTMLParser

#------------------htmlparser_stack.py------------------#
# -*- coding: GBK -*-
import HTMLParser, sys, os, string
html = ""

The output of this program:

/html/body/p >> the
/html/body/p/a >> IETF admonishes:
/html/body/p/a/i >> be strict in what
/html/body/p/a/i/b >> Send
/html/body/p/a/i >>.
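The listing itself is incomplete in this copy. As a rough sketch of how path output of this kind could be produced, the following assumes Python 3, where the module is html.parser rather than HTMLParser. The class name PathPrinter, the SAMPLE string, and the assumption that the input is well formed are illustrative choices, not taken from the original program.

```python
from html.parser import HTMLParser

# Hypothetical sample, adapted from the fragment shown earlier.
SAMPLE = """<html><body>
<p>the <a href="http://ietf.org">IETF admonishes:
<i>be strict in what <b>send</b>.</i></a></p>
</body></html>"""


class PathPrinter(HTMLParser):
    """Track open tags on a stack and record each text run with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.lines = []  # collected "path >> text" lines

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed input: only pop when the tag matches the top.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append("/" + "/".join(self.stack) + " >> " + text)


parser = PathPrinter()
parser.feed(SAMPLE)
parser.close()
for line in parser.lines:
    print(line)
```

The base class does the tokenizing. The subclass only decides what to do with each start tag, end tag, and text run.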

Some pages do not have strictly paired start and end tags. In that case we can choose to ignore some tags. You can write a stack of your own to handle them.

#*---------------TagStack class example-----------------#
class TagStack:
    def __init__(self, lst=None):
        # Avoid a shared mutable default argument
        self.lst = lst if lst is not None else []

    def __getitem__(self, pos):
        return self.lst[pos]

    def append(self, tag):
        # Remove every paragraph-level tag if this is one
        if tag.lower() in ('p', 'blockquote'):
            self.lst = [t for t in self.lst
                        if t not in ('p', 'blockquote')]
        self.lst.append(tag)

    def pop(self, tag):
        # "Pop" by tag from nearest pos, not only last item
        self.lst.reverse()
        try:
            pos = self.lst.index(tag)
        except ValueError:
            raise HTMLParser.HTMLParseError("Tag not on stack")
        del self.lst[pos]
        self.lst.reverse()

tagstack = TagStack()
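A tag stack along these lines might be wired into a parser as follows. This is a minimal sketch assuming Python 3's html.parser. The names LenientPathParser and SLOPPY, the sample markup, and the choice to raise ValueError and ignore unmatched end tags are assumptions made for illustration.

```python
from html.parser import HTMLParser

PARAGRAPH_TAGS = ("p", "blockquote")


class TagStack:
    """Stack of open tags that tolerates unclosed paragraph-level tags."""

    def __init__(self, tags=None):
        self.tags = list(tags) if tags else []

    def append(self, tag):
        # A new paragraph-level tag implicitly closes any open one.
        if tag.lower() in PARAGRAPH_TAGS:
            self.tags = [t for t in self.tags if t not in PARAGRAPH_TAGS]
        self.tags.append(tag)

    def pop(self, tag):
        # Remove the nearest matching open tag, searching from the top.
        for pos in range(len(self.tags) - 1, -1, -1):
            if self.tags[pos] == tag:
                del self.tags[pos]
                return
        raise ValueError("Tag not on stack")

    def path(self):
        return "/" + "/".join(self.tags)


class LenientPathParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = TagStack()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        try:
            self.stack.pop(tag)
        except ValueError:
            pass  # ignore end tags that were never opened

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.lines.append(self.stack.path() + " >> " + text)


# Hypothetical markup: the first <p> and the <b> are never closed.
SLOPPY = "<body><p>first<p>second <b>bold</p></body>"

lenient = LenientPathParser()
lenient.feed(SLOPPY)
lenient.close()
for line in lenient.lines:
    print(line)
```

Because pop removes by tag name rather than from the top only, the unclosed b element stays on the stack after the closing p tag is seen. Whether that is acceptable depends on the pages being parsed.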

HTMLParser has a bug: it cannot handle attributes that contain Chinese characters. For example, if the page contains a passage such as:

then parsing fails with an error when it reaches that line.

The cause of the error is, once again, a regular expression.

attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

attrfind does not match Chinese characters.

You can change this pattern to fix the error. sgmllib does not have this problem.
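To illustrate the kind of change meant here, the sketch below compares two simplified patterns for unquoted attribute values. Both ASCII_ONLY and PERMISSIVE are hypothetical patterns written for this example. They are not the library's actual expression, and the sample attribute string is made up.

```python
import re

# Hypothetical, simplified patterns covering unquoted attribute values only.
ASCII_ONLY = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)\s*=\s*'
    r'([-a-zA-Z0-9./,:;+*%?!&$()_#=~@]*)')
PERMISSIVE = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)\s*=\s*'
    r'([^\s>"\']*)')

# Made-up attribute text with an unquoted Chinese value.
attr = ' title=中文标题>'

strict = ASCII_ONLY.match(attr)
loose = PERMISSIVE.match(attr)

print(repr(strict.group(2)))  # the ASCII-only class captures nothing
print(repr(loose.group(2)))   # the permissive class captures the whole value
```

The ASCII-only character class stops at the first Chinese character, so the value comes back empty and the rest of the tag no longer parses as expected. Accepting any character other than whitespace, quotes, and the closing angle bracket is one possible way to widen the match.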

2, sgmllib

HTML is an application of SGML, so sgmllib can handle a great deal of it. The following code snippet illustrates how sgmllib is used.

#------------------htmlparser_stack.py------------------#
# -*- coding: GBK -*-
import sgmllib, sys, os, string
html = "" ;lala>

Output:

Start tag:Start tag:<title>
/lala >> Advice
End Tag:</title>
End Tag:Start tag:<body>
Start tag:<p>
/lala >> the
Start tag:<a>
/lala >> IETF admonishes:
Start tag:<i>
/lala >> be strict in what
Start tag:<b>
/lala >> Send
End Tag:</b>
/lala >>.
End Tag:</i>
End Tag:</a>
End Tag:</p>
Start tag:<form>
Start tag:<input>
/lala >>υ
Start tag:<input>
End Tag:</form>
End Tag:</body>
End Tag:</lala>

As with HTMLParser, to parse HTML with sgmllib you subclass sgmllib.SGMLParser. The handler functions in this class are empty, and the user needs to override them. What the class provides is the dispatching: it calls the corresponding function in each particular situation.

For example, when the

SGML tags are customizable. For example, if you define a start_lala function, it will handle the <lala> tag.

One point needs explaining. If both a start_tagname function and the handle_starttag function are defined, only handle_starttag runs, so it does not matter if start_tagname is an empty function. If handle_starttag is not defined, the start_tagname function runs when the <tagname> tag is encountered. If no start function is defined for tagname, the tag is treated as unknown and the unknown_starttag function is called.
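The dispatch rules described above can be imitated in a few lines. sgmllib is not available in Python 3, so this sketch builds the same behavior on top of html.parser. The class names DispatchingParser, LalaParser, and OverridingParser, the log attribute, and the sample document are all hypothetical. This is an illustration of the rules as described here, not sgmllib's own code.

```python
from html.parser import HTMLParser


class DispatchingParser(HTMLParser):
    """Route each start tag to start_<tagname> if defined, else unknown_starttag."""

    def __init__(self):
        super().__init__()
        self.log = []

    def handle_starttag(self, tag, attrs):
        # Look for a tag-specific method such as start_lala.
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(attrs)
        else:
            self.unknown_starttag(tag, attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append("unknown: " + tag)


class LalaParser(DispatchingParser):
    def start_lala(self, attrs):
        self.log.append("start_lala called")


class OverridingParser(LalaParser):
    def handle_starttag(self, tag, attrs):
        # Overriding handle_starttag bypasses the start_<tagname> lookup.
        self.log.append("handle_starttag: " + tag)


doc = "<lala><foo></foo></lala>"

dispatching = LalaParser()
dispatching.feed(doc)
dispatching.close()

overriding = OverridingParser()
overriding.feed(doc)
overriding.close()

print(dispatching.log)
print(overriding.log)
```

In the first parser the lala tag reaches start_lala and the foo tag falls through to unknown_starttag. In the second, the overridden handle_starttag receives every start tag, and start_lala is never called.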

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
