Parsing and processing of XML and HTML

Last Update:2017-11-15 Source: Internet

Author: User

Tags processing text


First, processing XML with Python

XML stands for eXtensible Markup Language. It is designed to transmit and store data. XML is a set of rules for defining semantic markup that divides a document into parts and identifies those parts. It is also a meta-markup language: it defines the syntax used to create other semantic, structured markup languages that are specific to a particular domain.

XML parsing in Python: the common XML programming interfaces are DOM and SAX. The two interfaces handle XML files in different ways and therefore suit different scenarios.

1. SAX (Simple API for XML)

The Python standard library includes a SAX parser. SAX uses an event-driven model: while it parses the XML file it triggers events and invokes user-defined callback functions, and those callbacks process the file.
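The event-driven model described above can be sketched with the standard library's xml.sax module. This is only a minimal illustration: the TitleHandler class, the events list and the SAMPLE document are hypothetical names and data made up for this sketch, not part of the original article.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Hypothetical handler that records SAX events instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called each time the parser meets an opening tag.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        # Called each time the parser meets a closing tag.
        self.events.append(("end", name))

    def characters(self, content):
        # Text may arrive in several chunks; whitespace-only chunks are skipped.
        if content.strip():
            self.events.append(("text", content))


# Hypothetical sample document (bytes, as accepted by xml.sax.parseString)
SAMPLE = b"<bookstore><book><title lang='eng'>Learning XML</title></book></bookstore>"

handler = TitleHandler()
xml.sax.parseString(SAMPLE, handler)
for event in handler.events:
    print(event)
```

The parser never builds a tree here: the handler only sees one event at a time, which is why any state that must survive between events (such as the events list) has to be kept by the handler itself.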

2. DOM (Document Object Model)

DOM parses the XML data into a tree in memory; the XML is then manipulated by manipulating that tree.
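To make "manipulating the XML by manipulating the tree" concrete, here is a small sketch using xml.dom.minidom from the standard library. The SAMPLE document, the category attribute and the author element are hypothetical values chosen only for illustration.

```python
from xml.dom import minidom

# Hypothetical sample document
SAMPLE = (
    "<bookstore><book><title>Learning XML</title>"
    "<price>39.95</price></book></bookstore>"
)

doc = minidom.parseString(SAMPLE)
root = doc.documentElement

# Read: walk the tree to reach the text node inside <price>.
price = root.getElementsByTagName("price")[0]
old_price = price.firstChild.data

# Modify: change the text node, then add an attribute and a new element.
price.firstChild.data = "35.00"
book = root.getElementsByTagName("book")[0]
book.setAttribute("category", "web")
author = doc.createElement("author")
author.appendChild(doc.createTextNode("Example Author"))
book.appendChild(author)

# Serialize the modified tree back to XML text.
result = root.toxml()
print(old_price)
print(result)
```

Because the whole document lives in memory as a tree, any node can be reached and changed in any order, which is the main convenience DOM offers over an event stream.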

Note: because DOM has to map the XML data onto a tree in memory, it is slower and consumes more memory. SAX streams through the XML file, so it is faster and consumes less memory, but it requires the user to implement callback functions (handlers).

Examples:

cat book.xml

<?xml version="1.0" encoding="iso-8859-1"?>
<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>

The Python code that processes this file is as follows:

from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def start_element(self, name, attrs):
        # Remember the current element so char_data can report it
        self.name = name
        # print('element: %s, attrs: %s' % (name, str(attrs)))
        print("<" + name + ">")

    def end_element(self, name):
        # print('end element: %s' % name)
        print("</" + name + ">")

    def char_data(self, text):
        # Skip whitespace-only text between elements
        if text.strip():
            print("%s's text is %s" % (self.name, text))

handler = DefaultSaxHandler()
parser = ParserCreate()
# Register the callbacks that expat invokes while parsing
parser.StartElementHandler = handler.start_element
parser.EndElementHandler = handler.end_element
parser.CharacterDataHandler = handler.char_data
with open('book.xml') as f:
    parser.Parse(f.read())

An example that scrapes the postal code entry for each province in the country:


import requests
from xml.parsers.expat import ParserCreate

class DefaultSaxHandler(object):
    def __init__(self, provinces):
        self.provinces = provinces

    def start_element(self, name, attrs):
        # Every element inside <map> carries a province title and a link
        if name != 'map':
            name = attrs['title']
            number = attrs['href']
            self.provinces.append((name, number))

    def end_element(self, name):
        pass

    def char_data(self, text):
        pass

def get_province_entry(url):
    content = requests.get(url).content.decode('gb2312')
    # Cut out only the <map> block so that expat receives a well-formed fragment
    start = content.find('<map name="map_86" id="map_86">')
    end = content.find('</map>')
    content = content[start:end + len('</map>')].strip()
    # print(content)
    provinces = []
    handler = DefaultSaxHandler(provinces)
    parser = ParserCreate()
    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = handler.end_element
    parser.CharacterDataHandler = handler.char_data
    parser.Parse(content)
    return provinces

provinces = get_province_entry('http://www.ip138.com/post')
print(provinces)
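The scraping example depends on a live web page, so the extraction step can also be tried offline. The sketch below assumes the page's map block looks roughly like the hypothetical FRAGMENT string; the area elements, the titles and the href values are invented for illustration and are not taken from the real site. The helper name extract_entries is likewise hypothetical.

```python
from xml.parsers.expat import ParserCreate

# Hypothetical fragment standing in for the <map> block cut out of the page
FRAGMENT = (
    '<map name="map_86" id="map_86">'
    '<area title="Province A" href="/10/"/>'
    '<area title="Province B" href="/20/"/>'
    '</map>'
)


def extract_entries(fragment):
    """Return (title, href) pairs for every child element of <map>."""
    entries = []

    def start_element(name, attrs):
        # attrs is a dict mapping attribute names to values
        if name != "map" and "title" in attrs and "href" in attrs:
            entries.append((attrs["title"], attrs["href"]))

    parser = ParserCreate()
    parser.StartElementHandler = start_element
    # The second argument marks this as the final chunk of input
    parser.Parse(fragment, True)
    return entries


print(extract_entries(FRAGMENT))
```

Note that expat is an XML parser and raises an error on markup that is not well formed, which is why the original code slices out just the map block before parsing.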

A small example of DOM:

from xml.dom import minidom

doc = minidom.parse('book.xml')
root = doc.documentElement
print(root.nodeName)
books = root.getElementsByTagName('book')
for book in books:
    titles = book.getElementsByTagName('title')
    prices = book.getElementsByTagName('price')
    # The first child of each element is its text node
    print(titles[0].childNodes[0].nodeValue + ":" + prices[0].childNodes[0].nodeValue)

Second, HTMLParser

The core of html.parser is the HTMLParser class. It works as follows: when feed() passes it an HTML-like string, it calls the goahead method to iterate over each tag, calls the corresponding parse_xxxx method to extract start tags, end tags, data, comments and so on, and then calls the corresponding handle_xxxx method to process the extracted content:

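handle_startendtag # handles self-closing tags such as <xx/>; by default it calls handle_starttag and then handle_endtag

handle_starttag # handles start tags, such as <xx>

handle_endtag # handles end tags, such as </xx>

handle_charref # handles numeric character references, that is, those beginning with &#, which are usually the code point of the character (called only when convert_charrefs is False)

handle_entityref # handles named character references beginning with &, for example &nbsp; (called only when convert_charrefs is False)

handle_data # handles the data in the middle of <xx>data</xx>

handle_comment # handles comments

handle_decl # handles declarations beginning with <!, such as <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

handle_pi # handles processing instructions of the form <?instruction>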
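To see which of the handlers listed above fires for which piece of markup, the following sketch records every callback in a list. It assumes Python 3's html.parser; the EventRecorder class, the events list and the SAMPLE string are hypothetical names and data for this illustration. convert_charrefs is set to False because otherwise character references are converted automatically and handle_charref and handle_entityref are not called.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical parser that records each handler call as a tuple."""

    def __init__(self):
        # convert_charrefs=False so handle_charref/handle_entityref are called
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_startendtag(self, tag, attrs):
        # Overriding this means start/end handlers are not called for <br/>
        self.events.append(("startend", tag, attrs))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_pi(self, data):
        self.events.append(("pi", data))


# Hypothetical sample markup exercising most of the handlers
SAMPLE = '<!DOCTYPE html><p class="x">a&amp;b&#65;<br/><!-- note --></p>'

rec = EventRecorder()
rec.feed(SAMPLE)
rec.close()
for event in rec.events:
    print(event)
```

Collecting events in a list rather than printing inside each handler keeps the parser easy to inspect, and it shows the order in which the callbacks arrive for a given input.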

Installing markupbase: this is needed only when the code imports the old HTMLParser module, which depends on markupbase; Python 3's built-in html.parser uses the standard library's own _markupbase and needs no extra installation. A direct 'pip install markupbase' does not succeed. The command 'pip search markupbase' gives the package name 'micropython-markupbase'; download this package directly from its web page. The download contains a file named '_markupbase.py'. Remove the leading underscore from the filename and copy the file to the '\Lib\site-packages' directory of the Python installation. Example: cp markupbase.py /usr/local/lib/python3.6/site-packages/

The following example processes a specified HTML file:

# coding=utf-8
# Python 3 module name; in Python 2 the module was called HTMLParser
from html.parser import HTMLParser

class MyParser(HTMLParser):
    """A simple example of HTMLParser"""

    def handle_decl(self, decl):
        """Handle the document declaration"""
        HTMLParser.handle_decl(self, decl)
        print(decl)

    def handle_starttag(self, tag, attrs):
        """Handle a start tag"""
        HTMLParser.handle_starttag(self, tag, attrs)
        # Self-closing tags are printed by handle_startendtag instead
        if not HTMLParser.get_starttag_text(self).endswith("/>"):
            print("<" + tag + ">")

    def handle_data(self, data):
        """Handle text data"""
        HTMLParser.handle_data(self, data)
        print(data)

    def handle_endtag(self, tag):
        """Handle an end tag"""
        HTMLParser.handle_endtag(self, tag)
        if not HTMLParser.get_starttag_text(self).endswith("/>"):
            print("</" + tag + ">")

    def handle_startendtag(self, tag, attrs):
        """Handle a self-closing tag"""
        HTMLParser.handle_startendtag(self, tag, attrs)
        print(HTMLParser.get_starttag_text(self))

    def handle_comment(self, data):
        """Handle a comment"""
        HTMLParser.handle_comment(self, data)
        print(data)

    def close(self):
        HTMLParser.close(self)
        print("parser over")

demo = MyParser()
with open("test.html") as f:
    demo.feed(f.read())
demo.close()

This article is from the "Worknote" blog, make sure to keep this source http://caiyuanji.blog.51cto.com/11462293/1981977
