Java Programmers from Stupid Birds to Rookies (26): DOM and SAX Parsing of XML

Last Update:2018-12-04 Source: Internet

Author: User



DOM Parsing

The DOM interface specification defines four basic interfaces: Document, Node, NodeList, and NamedNodeMap. The Document interface is the entry point for operating on a document and itself inherits from Node. Node is the parent interface of most other interfaces: Document, Element, Attr, Text, and Comment all inherit from it. The NodeList interface is an ordered collection of nodes, for example all the child nodes of a node. The NamedNodeMap interface is also a collection of nodes; it establishes a one-to-one mapping between node names and nodes, so that a specific node can be accessed directly by its name.

1. Document

The Document interface represents the entire XML or HTML document. It is therefore the root of the document tree and provides the entry point for accessing and operating on the data in the document.

2. NodeList

The NodeList interface is an abstract definition of an ordered collection of nodes; it does not specify how the collection is implemented. A NodeList represents a group of nodes in sequence, such as the child nodes of a node. It also appears as the return value of some methods, such as getElementsByTagName().

In DOM, a NodeList object is "live": changes to the document are reflected directly in any related NodeList object. For example, if you obtain a NodeList containing all the child nodes of an element and then add, delete, or modify child nodes of that element through DOM, the changes appear in the NodeList automatically; the application does not need to do anything else.

Each item in a NodeList can be accessed by index. Index values start from 0.
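The "live" behaviour can be observed with a small sketch. The class name, the helper methods, and the sample XML string below are made up for illustration; only the JAXP and DOM calls come from the standard library.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LiveNodeListDemo {

    // Parses an in-memory XML string; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Returns the NodeList length before and after a child is appended.
    static int[] lengthsBeforeAndAfterAppend() {
        Document doc = parse("<people><person/><person/></people>"); // hypothetical data
        Element root = doc.getDocumentElement();

        NodeList children = root.getChildNodes(); // obtained once, never re-queried
        int before = children.getLength();

        root.appendChild(doc.createElement("person"));
        int after = children.getLength(); // the same object already sees the new child

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] lengths = lengthsBeforeAndAfterAppend();
        System.out.println("before=" + lengths[0] + ", after=" + lengths[1]);
    }
}
```

Because the list is live, a loop that removes nodes while iterating by index can skip items; iterating backwards or copying the nodes into a separate collection first avoids that.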

3. NamedNodeMap

An object that implements the NamedNodeMap interface holds a set of nodes that can be accessed by name. Note, however, that NamedNodeMap does not inherit from NodeList, and the nodes it contains are unordered. The nodes can also be accessed by index, but this is only a convenient way to enumerate the contents of the NamedNodeMap; it does not mean that the DOM specification defines an order for them.

A NamedNodeMap represents a one-to-one correspondence between a group of nodes and their unique names. The interface is mainly used to represent attribute nodes. Like NodeList, NamedNodeMap objects in DOM are "live".
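As a rough illustration of access by name versus enumeration by index, the following sketch reads the attributes of a root element. The class name, helper methods, and sample XML are assumptions made for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NamedNodeMapDemo {

    // Parses an in-memory XML string; checked exceptions are wrapped for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    // Looks up an attribute of the root element by name; returns null if it is absent.
    static String rootAttribute(String xml, String name) {
        NamedNodeMap attrs = parse(xml).getDocumentElement().getAttributes();
        Node attr = attrs.getNamedItem(name);
        return attr == null ? null : attr.getNodeValue();
    }

    public static void main(String[] args) {
        String xml = "<person id=\"7\" gender=\"male\"/>"; // hypothetical data
        System.out.println("id = " + rootAttribute(xml, "id"));

        // Index access only enumerates the map; no particular order is promised.
        NamedNodeMap attrs = parse(xml).getDocumentElement().getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node attr = attrs.item(i);
            System.out.println(attr.getNodeName() + "=" + attr.getNodeValue());
        }
    }
}
```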

4. DOM objects

Everything is a node (object).

• Node object: the most basic object in the DOM structure

• Document object: represents the entire XML document

• NodeList object: contains a list of nodes

• Element object: represents a tag element in the XML document

5. Steps for parsing XML with DOM

import javax.xml.parsers.*;
import org.w3c.dom.*;

public class Dom {
    public static void main(String[] args) {
        try {
            // Create a parser factory
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Obtain the parser
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse("candidate.xml");
            NodeList nl = doc.getElementsByTagName("person");
            for (int i = 0; i < nl.getLength(); i++) {
                Element node = (Element) nl.item(i);
                System.out.print("name:");
                System.out.println(node.getElementsByTagName("name").item(0).getFirstChild().getNodeValue());
                // ......
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Detailed procedure:

1) DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

• Here we use DocumentBuilderFactory so that the program does not depend on a specific parser. When the static method newInstance() of the DocumentBuilderFactory class is called, it decides which parser implementation to use, consulting the javax.xml.parsers.DocumentBuilderFactory system property first. Because all parsers follow the interfaces defined by JAXP, the code is the same no matter which parser is used. To switch between parsers you only need to change the value of the system property, without changing any code. This is the benefit of the factory.

2) DocumentBuilder db = dbf.newDocumentBuilder();

• After obtaining a factory object, call its newDocumentBuilder() method (an instance method, not a static one) to obtain a DocumentBuilder object, which represents a specific DOM parser. Which vendor's parser it is, Microsoft's or IBM's for example, does not matter to the program.

3) Then we can use this parser to parse the XML document.

• Document doc = db.parse("C:/XML/message.xml");

• The parse() method of DocumentBuilder accepts the location of an XML document as its input parameter and returns a Document object, which represents the tree model of the XML document. All subsequent operations on the XML document are independent of the parser; you operate directly on this Document object, using the methods defined by DOM.

4) Starting from the Document object obtained above, we can begin working with the DOM. Using the getElementsByTagName() method of the Document object, we get a NodeList object. A Node object here represents a tag element in the XML document, and a NodeList object represents a list of Node objects.

• NodeList nl = doc.getElementsByTagName("message");

• This statement returns a list of the Node objects corresponding to the <message> tags in the XML document. We can then use the item() method of the NodeList object to obtain each Node object in the list.

• Node my_node = nl.item(0);

5) When a Node object is created, the data stored in the XML file is extracted and encapsulated in the node. In this example, to extract the content of the message tag, we use the getNodeValue() method of a Node object.

• String message = my_node.getFirstChild().getNodeValue();

Note: the getFirstChild() method is used to obtain the first child node of the message element. Although the message tag contains nothing but text (no child tags or attributes), getFirstChild() is still required here, because of how the W3C defines the DOM: the text inside a tag is itself a node. We therefore have to get the node representing the text first, and only then call getNodeValue() to obtain the text content.
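The five steps above can be put together in one small sketch. To keep it self-contained it parses an in-memory string instead of a file on disk; the class name, the method name, and the sample message text are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class MessageReader {

    // Returns the text of the first <message> element in the given XML.
    static String readMessage(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance(); // step 1
            DocumentBuilder db = dbf.newDocumentBuilder();                     // step 2
            Document doc = db.parse(new InputSource(new StringReader(xml)));   // step 3

            NodeList nl = doc.getElementsByTagName("message");                 // step 4
            Node myNode = nl.item(0);

            // Step 5: the text is a child Text node of the element.
            // Calling getNodeValue() on the element itself would return null.
            Node text = myNode.getFirstChild();
            return text.getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException("could not read message", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document
        System.out.println(readMessage("<message>Hello, DOM</message>"));
    }
}
```

For an element that may be empty, getFirstChild() returns null, so production code would check for that before calling getNodeValue().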

6. DOM basic objects explained

DOM has five basic objects: Document, Node, NodeList, Element, and Attr.

The Document object represents the entire XML document. All other nodes are contained in the Document object in a definite order and arranged into a tree structure. Programmers can traverse this tree to get all the content of the XML document, so it is also the starting point for operating on an XML document. We always obtain a Document object by parsing the XML source file and then perform subsequent operations on it. In addition, Document contains methods for creating other nodes; for example, createAttribute() creates an Attr object. Its main methods include:

1) createAttribute(String): creates an Attr object with the given attribute name, which can then be placed on an Element object using the setAttributeNode() method.

2) createElement(String): creates an Element object with the given tag name, representing a tag in the XML document. You can then add attributes to this Element object or perform other operations on it.

3) createTextNode(String): creates a Text object with the given string. A Text object represents the plain text contained in a tag or attribute. If a tag contains no other tags, the Text object representing the tag's text is the only child of the Element object.

4) getElementsByTagName(String): returns a NodeList object that contains all the elements with the given tag name.

5) getDocumentElement(): returns an Element object representing the root node of the DOM tree, that is, the root element of the XML document.
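The creation methods listed above can be combined to build a small document from scratch. This is only a sketch: the class name, the method name, and the element and attribute names are invented for the example, and the empty Document comes from DocumentBuilder.newDocument() rather than from parsing a file.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BuildDocumentDemo {

    // Builds <person id="1"><name>Tony Blair</name></person> in memory.
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();

            Element person = doc.createElement("person"); // createElement(String)
            doc.appendChild(person);                       // becomes the root element

            Attr id = doc.createAttribute("id");           // createAttribute(String)
            id.setValue("1");
            person.setAttributeNode(id);

            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode("Tony Blair")); // createTextNode(String)
            person.appendChild(name);

            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        Document doc = build();
        Element root = doc.getDocumentElement();           // getDocumentElement()
        System.out.println(root.getTagName() + " id=" + root.getAttribute("id"));
        System.out.println(doc.getElementsByTagName("name") // getElementsByTagName(String)
                .item(0).getFirstChild().getNodeValue());
    }
}
```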

7. The main methods of the Node object are:

• appendChild(org.w3c.dom.Node): adds a child node to the node, placing it after all existing child nodes. If the child node is already in the tree, it is first removed and then added.

• getFirstChild(): returns the first child node if the node has children; likewise, getLastChild() returns the last child node.

• getNextSibling(): returns the next sibling of the node in the DOM tree; correspondingly, getPreviousSibling() returns the previous sibling.

• getNodeName(): returns the node name, which depends on the node type.

• getNodeType(): returns the type of the node.

• getNodeValue(): returns the value of the node.

• hasChildNodes(): determines whether the node has any child nodes.

• hasAttributes(): determines whether the node has attributes.

• getOwnerDocument(): returns the Document object that the node belongs to.

• insertBefore(org.w3c.dom.Node new, org.w3c.dom.Node ref): inserts the new child node before the given reference child node.

• removeChild(org.w3c.dom.Node): removes the given child node.

• replaceChild(org.w3c.dom.Node new, org.w3c.dom.Node old): replaces the given old child node with a new Node object.

The NodeList object, as its name implies, represents a list of nodes. We can simply think of it as an array of nodes and obtain its elements through the following methods:

• getLength(): returns the length of the list.

• item(int): returns the Node object at the specified position.
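The editing methods of Node and the two NodeList methods can be tried out together. The following sketch uses made-up element names (a, b, c, d) and a hypothetical class name; the comments track the expected child order after each call.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class NodeEditingDemo {

    // Applies a series of edits and returns the remaining child names, comma-separated.
    static String editAndList() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("no DOM implementation available", e);
        }

        Element root = doc.createElement("list");
        doc.appendChild(root);
        Element a = doc.createElement("a");
        Element b = doc.createElement("b");
        Element c = doc.createElement("c");

        root.appendChild(a);                           // children: a
        root.appendChild(c);                           // children: a, c
        root.insertBefore(b, c);                       // children: a, b, c
        root.appendChild(a);                           // a is already in the tree, so it moves: b, c, a
        root.removeChild(c);                           // children: b, a
        root.replaceChild(doc.createElement("d"), b);  // children: d, a

        // Walk the result with getLength() and item(int).
        NodeList children = root.getChildNodes();
        StringBuilder names = new StringBuilder();
        for (int i = 0; i < children.getLength(); i++) {
            if (i > 0) {
                names.append(',');
            }
            names.append(children.item(i).getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        System.out.println(editAndList());
    }
}
```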

8. The Element object represents a tag element in the XML document. It inherits from Node and is the most important kind of Node. Tags can contain attributes, so Element objects have methods for accessing their attributes; any method defined in Node can also be used on Element objects.

• getElementsByTagName(String): returns a NodeList object containing the descendant elements with the given tag name.

• getTagName(): returns a string that represents the tag name.

• getAttribute(String): returns the value of the attribute with the given name. Note that attribute values in XML documents may contain entity references, and this method is not well suited to such attributes. In that case, use getAttributeNode() to obtain an Attr object for further processing.

• getAttributeNode(String): returns an Attr object that represents the attribute with the given name.
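A short sketch of these Element methods follows. The class name, the helper method, and the sample XML are assumptions; the behaviour noted in the comments for missing attributes (an empty string from getAttribute(), null from getAttributeNode()) follows the DOM API documentation.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class ElementAccessDemo {

    // Parses an in-memory XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document
        Element person = parseRoot("<person id=\"1\"><name>Tony Blair</name></person>");

        System.out.println(person.getTagName());            // tag name of the element
        System.out.println(person.getAttribute("id"));      // attribute value as a String

        // A missing attribute yields an empty string, not null.
        System.out.println(person.getAttribute("missing").isEmpty());

        // getAttributeNode() returns the Attr object itself, or null if absent.
        Attr id = person.getAttributeNode("id");
        System.out.println(id.getName() + "=" + id.getValue());
        System.out.println(person.getAttributeNode("missing") == null);

        // Searches all descendants, not only direct children.
        System.out.println(person.getElementsByTagName("name").getLength());
    }
}
```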

9. The Attr object represents an attribute of a tag. Attr inherits from Node, but because an Attr is actually contained in an Element, it is not considered a child of the Element and is therefore not part of the DOM tree. As a result, getParentNode(), getPreviousSibling(), and getNextSibling() all return null for an Attr. In other words, an Attr is treated as part of its Element object and does not appear as a separate node in the DOM tree. Keep this difference from the other Node subtypes in mind when using it.
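The null return values can be checked directly, as in the sketch below. The class and method names and the sample XML are hypothetical; getOwnerElement() is the standard Attr method for reaching the element an attribute belongs to.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class AttrTreeDemo {

    // Parses an in-memory XML string and returns its root element.
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        Element person = parseRoot("<person id=\"1\" gender=\"male\"/>"); // hypothetical data
        Attr id = person.getAttributeNode("id");

        // An Attr has no position in the tree, so the navigation methods return null.
        System.out.println(id.getParentNode() == null);
        System.out.println(id.getPreviousSibling() == null);
        System.out.println(id.getNextSibling() == null);

        // Attributes are not children of the element either.
        System.out.println(person.getChildNodes().getLength());

        // The element is reached through getOwnerElement() instead.
        System.out.println(id.getOwnerElement() == person);
    }
}
```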

SAX Parsing

SAX stands for Simple API for XML. SAX provides sequential access, which is a fast way to read XML data. When a SAX parser processes an XML document, it fires a series of events and invokes the corresponding event-handling functions. Applications access the XML document through these event-handling functions, which is why the SAX interface is also called an event-driven interface. Because a SAX parser is simple to implement and needs little memory, it is relatively efficient. For applications that only need to read the data in an XML document and do not change the document, a SAX parser is the more suitable choice.

When DOM is used to parse XML, the whole XML file is first loaded into memory, and the DOM tree in memory can then be accessed in any order. SAX, in contrast, is event-based and sequential: once an element has passed, there is no way to go back to it. SAX does not have to load the entire XML file into memory beforehand, so it uses less memory than DOM; for large XML documents we usually use SAX instead of DOM.

SAX also uses the observer pattern (similar to event handling in a GUI).

In the figure, the SAXParserFactory at the top is used to create a parser instance, and the XML document is read in from the arrow on the left. While the parser processes the document, it triggers the callback methods defined in the DocumentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces. (DocumentHandler is the SAX 1 interface; SAX 2 replaces it with ContentHandler.)

4. SAX is event-driven. Reading the document is the parsing process: as it reads, the parser calls a different handler method for each kind of item it encounters.

5. Methods of the org.xml.sax.helpers.DefaultHandler class

Item	Handler method
Document start	startDocument()
<People>	startElement()
"Tony Blair"	characters()
</People>	endElement()
Document end	endDocument()
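To see the order in which these callbacks fire, a handler can simply record each event. The sketch below parses the one-element document from the table; the class name, the eventsFor() helper, and the event label format are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEventLogger extends DefaultHandler {
    private final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() {
        events.add("startDocument");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        events.add("startElement:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        events.add("characters:" + new String(ch, start, length));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + qName);
    }

    @Override
    public void endDocument() {
        events.add("endDocument");
    }

    // Parses the given XML string and returns the recorded events in order.
    static List<String> eventsFor(String xml) {
        try {
            SaxEventLogger handler = new SaxEventLogger();
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        for (String event : eventsFor("<People>Tony Blair</People>")) {
            System.out.println(event);
        }
    }
}
```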

6. Example of extracting XML document content with SAX

package com.shengsiyuan.xml.sax;

import java.io.File;
import java.util.Stack;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class SaxTest2 {
    public static void main(String[] args) throws Exception {
        // Step 1: get the SAX parser factory
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Step 2: obtain the parser
        SAXParser parser = factory.newSAXParser();
        // Step 3: start parsing
        parser.parse(new File("student.xml"), new MyHandler2());
    }
}

class MyHandler2 extends DefaultHandler {
    // Names of the elements that are currently open, innermost on top
    private Stack<String> stack = new Stack<String>();
    private String name;
    private String gender;
    private String age;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        stack.push(qName);
        // Print every attribute of the element that has just started
        for (int i = 0; i < attributes.getLength(); i++) {
            String attrName = attributes.getQName(i);
            String attrValue = attributes.getValue(i);
            System.out.println(attrName + "=" + attrValue);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // The element on top of the stack is the one this text belongs to
        String tag = stack.peek();
        if ("name".equals(tag)) {
            name = new String(ch, start, length);
        } else if ("gender".equals(tag)) {
            gender = new String(ch, start, length);
        } else if ("age".equals(tag)) {
            age = new String(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        stack.pop(); // The element has been fully parsed, so pop it from the stack
        if ("student".equals(qName)) {
            System.out.println("name:" + name);
            System.out.println("Gender:" + gender);
            System.out.println("Age:" + age);
            System.out.println();
        }
    }
}
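One caveat about the example above: a SAX parser is allowed to deliver the text of a single element through several characters() calls, so assigning the field on each call can keep only the last chunk. A common variant, sketched here under the assumption that only the name elements are of interest, buffers the text and reads it in endElement(). The class name, the namesIn() helper, and the sample data are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BufferingNameHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> names = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called more than once per text run
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("name".equals(qName)) {
            names.add(text.toString().trim()); // the buffer now holds the complete text
        }
    }

    // Parses the given XML string and returns the text of every <name> element.
    static List<String> namesIn(String xml) {
        try {
            BufferingNameHandler handler = new BufferingNameHandler();
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<students>"
                + "<student><name>Zhang San</name><age>20</age></student>"
                + "<student><name>Li Si</name><age>21</age></student>"
                + "</students>";
        System.out.println(namesIn(xml));
    }
}
```

Resetting the buffer in startElement() works here because name elements contain only text; elements with mixed content would need one buffer per open element.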

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
