(reproduced):
Because to use Python to do the school network authentication process, the need to parse the server back to the HTML, I thought it would be as simple as JavaScript operation DOM, the results found that this is not the case, was engaged a bit.
In fact, Python has a xml.dom module, but this time it can not be used, why? Because the server returns HTML from the XML perspective is not well-formed, there is no closed tags, not commented out JavaScript and css,xml.dom can not handle, this time to use Sgmllib.
The sgmllib.py contains an important class: Sgmlparser. Sgmlparser breaks HTML into useful fragments, such as start and end tags. Once it has successfully decomposed a piece of data into a useful fragment, it invokes an internal method based on the data found. In order to use this parser, you need to subclass the Sgml-parser class and override these methods.
Sgmlparser class contains a lot of internal methods, after the beginning of reading HTML, encountered the corresponding data will call its corresponding method, the most important method is three:
- Start_tagname(self, attrs)
- End_tagname(self)
- Handle_data (self, text)
TagName is the name of the tag, such as when encountering <pre>, will call Start_pre, encounter </pre>, will call End_pre,attrs is the parameter of the tag, to [(attribute, value), ( attribute, value), ...] In the form of the return, we have to do is in its subclasses overloaded their interest in the corresponding function of the tag.
A classic example:
- From sgmllib import sgmlparser
- Class Urllister(Sgmlparser):
- Self . urls = []
- def start_a(self, attrs):
- href = [v for K, V in attrs if k==' href ']
- if href:
- Self . URLs. extend(href)
As the name implies, the function of this class is to extract all the links in the HTML (<a> tags) in the address (the value of the href attribute), put it in a list, very useful function. ^^
For example, the following HTML is processed:
<tr>
<TD height="207"colspan="2"align="Left"valign="Top"class="Normal">
<p>Damien Rice-"0"</p>
<a href="Http://galeki.xy568.net/music/Delicate.mp3">1. Delicate</a><br/>
<a href="Http://galeki.xy568.net/music/Volcano.mp3">2. Volcano</a><br/>
<a href="Http://galeki.xy568.net/music/The Blower ' s Daughter.mp3">3. The blower ' s daughter</a><br/>
<a href="Http://galeki.xy568.net/music/Cannonball.mp3">4. Cannonball</a><br/>
<a href="Http://galeki.xy568.net/music/Older Chests.mp3">5. Order Chests</a><br/>
<a href="Http://galeki.xy568.net/music/Amie.mp3">6. Amie</a><br/>
<a href="Http://galeki.xy568.net/music/Cheers darlin '. mp3">7. Cheers Darling</a><br/>
<a href="Http://galeki.xy568.net/music/Cold Water.mp3">8. Cold Water</a><br/>
<a href="Http://galeki.xy568.net/music/I Remember.mp3">9. I Remember</a> <br />
<a href= "Http://galeki.xy568.net/music/Eskimo.mp3" >10. Eskimo</a></td>
</tr>
It's messy, right? Here is an example of using Urllister to extract the above MP3 download address:
- Date="Above the pile of ......"
- Lister=urllister()
- Lister. Feed(date)
Use the feed () to pass the HTML you want to process to the object entity, and then we'll take a look at the processing results:
- Print Lister. URLs
Show:
[ ' Http://galeki.xy568.net/music/Delicate.mp3 ',
Span style= "color: #483d8b;" > ' Http://galeki.xy568.net/music/Volcano.mp3 ',
"http://galeki.xy568.net/music/ The blower ' s Daughter.mp3 ",
' Http://galeki.xy568.net/music/Cannonball.mp3 ',
' Http://galeki.xy568.net/music/Amie.mp3 ',
" http ://galeki.xy568.net/music/cheers darlin '. mp3 ',
' http://galeki.xy568.net/music/ Cold Water.mp3 ',
' Http://galeki.xy568.net/music/I Remember.mp3 ',
' Http://galeki.xy568.net/music/Eskimo.mp3 '
Well, isn't it convenient? Now that we know how to handle the attributes in a tag, how do we handle the text that the label contains? Is the above listed handle_data (self, text), when encountered in the contents of the tag, will call this function, the text passed in naturally is the contents of the tag, but how to filter out the contents of the tag of interest? For example, the list of songs above, it is necessary to cooperate with Start_tagname, End_tagname, used as a marker method to achieve this goal:
- class ListName(Sgmlparser):
- Is_a=""
- Name=[]
- def start_a(self, attrs):
- Self . is_a=1
- def end_a(self):
- Self . is_a=""
- def handle_data(self, text):
- if self . Is_a:
- Self . name. Append(text)
A is_a tag is added here, and an if is added to the handle_date, that is, the contents of the tag will be added to the name[] only within the a tag.
Look at the results:
- Listname=listname()
- ListName. Feed(date)
- print ListName. name
Show:
[' 1.Delicate ', ' 2.Volcano ', ' 3.The blower ' s daughter ',
' 4.Cannonball ', ' 5.Order chests ', ' 6.Amie ',
' 7.Cheers Darling ', ' 8.Cold water ', ' 9.I remember ',
' 10.Eskimo ']
OK, get it done ~
Sgmlparser built-in methods not only these three, there are processing comments handle_comment, as well as handling the handle_decl of the Declaration and so on, but the use of the same method as above, no longer write more.
Python Learning Notes