Introduction to XML parsing in Java

Source: Internet
Author: User

I haven't written a blog post for a long time: my computer broke down last month and I only got it back recently. For the past two days I have been busy with an XML file parsing project, so let me share it with you here.

XML parsing is essentially a matter of decomposing the file. First, the tag of each node is read, and then the parser checks whether the node has parameters (attributes). If it does, the parameters are traversed: each one is split into its two sides, with the parameter name on the left and the corresponding value on the right. Next, the parser determines whether the node contains "innerText". This check is required: if the current node ends with "/>" and yet also contains "innerText", it is treated as a syntax error. Finally, if the innerText contains nodes, it is parsed into nodes in turn. That is basically all there is to XML parsing. Part of my parsing code follows, with some explanation.
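To make the attribute step concrete before turning to the full method, here is a minimal stand-alone sketch. It is not the author's code: it uses java.util.regex instead of manual splitting, it assumes every value is wrapped in double quotes, and the names AttributeRegexSketch, ATTRIBUTE and readAttributes are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeRegexSketch {

    // One attribute: a name, '=', and a double-quoted value (assumed form).
    static final Pattern ATTRIBUTE =
            Pattern.compile("([A-Za-z_][\\w.\\-:]*)\\s*=\\s*\"([^\"]*)\"");

    // Collects name/value pairs from the text of a start tag, in document order.
    static Map<String, String> readAttributes(String startTag) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher matcher = ATTRIBUTE.matcher(startTag);
        while (matcher.find()) {
            // group(1) is the name left of '=', group(2) the value between the quotes.
            attributes.put(matcher.group(1), matcher.group(2));
        }
        return attributes;
    }

    public static void main(String[] args) {
        // A value may itself contain '=' because the whole quoted value is consumed at once.
        System.out.println(readAttributes("<item name=\"a\" value=\"1\" id=\"x=y\"/>"));
    }
}
```

Running main prints {name=a, value=1, id=x=y}. Unlike the validating pattern in the method below, this sketch silently skips anything that does not look like name="value" rather than reporting a syntax error.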
private synchronized Node parser(Node baseNode, String document) {
    // Nothing to parse if there is no content.
    if (document == null) {
        return null;
    }
    // This method is used to process child nodes.
    if (document.indexOf('<') == -1 && document.indexOf('>') == -1) {
        // The content is plain innerText and is not parsed.
        return null;
    }
    if (document.indexOf('<') == -1 && document.indexOf('>') != -1) {
        // Malformed content ('>' without '<'): throw an exception.
        throw new XMLContentException();
    }
    if (document.indexOf('<') != -1 && document.indexOf('>') == -1) {
        // Malformed content ('<' without '>'): throw an exception.
        throw new XMLContentException();
    }
    if (document.indexOf("<!--") != -1 && document.indexOf("-->") == -1) {
        // Malformed content (comment opened but never closed): throw an exception.
        throw new XMLContentException();
    }
    if (document.indexOf("<!--") == -1 && document.indexOf("-->") != -1) {
        // Malformed content (comment closed but never opened): throw an exception.
        throw new XMLContentException();
    }

    // Pattern for the tag name that follows "<" (requires java.util.regex.Pattern).
    // The name must start with an English letter, end with a letter or digit,
    // and be at least one character long.
    String regExTagStart = " *[A-Za-z]([\\w.\\-:]*[\\da-zA-Z])? *";
    Pattern regexTagStart = Pattern.compile(regExTagStart);

    document = document.substring(document.indexOf('<') + 1).trim();
    // If the start tag contains no space (for example, <root>), the current node is a "pure node".
    // Note: a self-closing tag without parameters, such as <br/>, yields the name "br/"
    // and fails the validation below.
    // Index used to extract the node name
    int endIndex = -1;
    // Whether the current node has parameters
    boolean hasParams = true;
    if (document.substring(0, document.indexOf('>')).indexOf(' ') == -1) {
        endIndex = document.indexOf('>');
        hasParams = false;
    } else {
        endIndex = document.indexOf(' ');
    }
    // Name of the current node
    String tag = document.substring(0, endIndex);
    // Validate the name
    if (!regexTagStart.matcher(tag).matches()) {
        // Validation failed: throw an exception.
        throw new XMLContentException();
    }
    // Create a Node object for the current node
    Node node = new Node(tag);
    node.addLisener(nodeHandler);

    // If validation succeeded and the node has parameters, extract their names and values.
    if (hasParams) {
        document = document.substring(document.indexOf(' ')).trim();
        // The rest of the start tag, up to and including '>'
        String tagInline = document.substring(0, document.indexOf('>') + 1).trim();
        if (tagInline.indexOf("/>") != -1
                && document.indexOf('>') == document.indexOf("/>") + 1) {
            document = document.substring(document.indexOf("/>"));
        } else if (tagInline.indexOf("/>") == -1) {
            document = document.substring(document.indexOf('>'));
        }
        // Pattern for the parameter list: name="value" pairs followed by ">" or "/>"
        Pattern regExInline = Pattern
                .compile("(\\w+ *= *\"[^\\n\\f\\r\"]*\"[ \\n\\r\\t]*)*/?>$");
        if (!regExInline.matcher(tagInline).matches()) {
            // Malformed parameter list: throw an exception.
            throw new XMLContentException();
        }
        // Walk through all parameters until no '=' is left in the tag text.
        while (tagInline.indexOf('=') != -1) {
            boolean paramIsKeyword = false;
            // The name is the text to the left of '='.
            String paramName = tagInline.substring(0, tagInline.indexOf('=')).trim();
            tagInline = tagInline.substring(tagInline.indexOf('=') + 1);
            // The value is the text between the next pair of double quotes.
            String paramValue = tagInline.substring(tagInline.indexOf('"') + 1);
            paramValue = paramValue.substring(0, paramValue.indexOf('"'));
            tagInline = tagInline.substring(tagInline.indexOf('"')
                    + paramValue.length() + 2);
            // Parameters named "name" or "value" are stored in dedicated properties.
            if (paramName.equalsIgnoreCase("name")) {
                paramIsKeyword = true;
                node.setName(paramValue);
            }
            if (paramName.equalsIgnoreCase("value")) {
                paramIsKeyword = true;
                node.setValue(paramValue);
            }
            // Any other parameter is added to the node's parameter collection.
            if (!paramIsKeyword) {
                node.addParam(paramName, paramValue);
            }
        }
    }
    // If the current node ends with "/>", it has no innerText to store.
    if (document.indexOf("/>") != -1
            && document.indexOf('>') == document.indexOf("/>") + 1) {
        // The current node is complete. If text remains in the document, continue with the next node.
        document = document.substring(document.indexOf('>') + 1);
        if (document.length() > 0) {
            baseNode.addNode(node);
            node = parser(baseNode, document);
        }
        return node;
    }
    // Normalize the end tag by removing the spaces inside it.
    // Pattern.quote keeps characters such as '.' in the tag name from being read as regex syntax.
    document = document.replaceFirst("</[ ]*" + Pattern.quote(tag) + "[ ]*>", "</" + tag + ">");
    // Get the innerText of the current node. Known issue: if the end tag cannot be found
    // (for example because it still contains spaces), the node cannot be closed,
    // which is reported as a syntax error.
    int closeIndex = document.indexOf("</" + tag + ">");
    if (closeIndex == -1) {
        throw new XMLContentException();
    }
    String innerText = document.substring(document.indexOf('>') + 1, closeIndex);
    node.setInnerText(innerText);

    // The current node is complete. If text remains in the document, continue with the next node, as above.
    document = document.substring(closeIndex + tag.length() + 3).trim();
    if (document.length() > 0) {
        baseNode.addNode(node);
        node = parser(baseNode, document);
    }
    return node;
}

The code mostly uses indexOf and substring to extract the tag and its parameters, which is somewhat ugly; I am preparing to rewrite it with regular expressions. Enough of that; let's analyze it.

The method above parses the XML text (or, more precisely, the innerText of the parent node passed in as the baseNode parameter). If it finds a syntax error, it immediately stops parsing the node and throws an exception.

The statement node.addLisener(nodeHandler); adds a listener to the newly created Node object, and it is critical here. Before implementing the parser, I learned that many XML parsers load the whole document from the file into memory at once and then hand it to a parsing method that traverses every node and creates the corresponding objects. What is special about this example is that it does not read the XML document into memory and then traverse all of its nodes; it only obtains the root node. A node is parsed only when necessary, that is, when the user asks for its child nodes: because the node set under the root node has to be obtained first, an event source can be set up there to trigger parsing of the current node's innerText. This improves efficiency when not all of the nodes in the document are actually used. Because of this, the parsing method is only responsible for parsing the innerText under the current parent node, which helps the implementation and makes the program easier to understand.

There is another piece of logic that is also critical; after I finished writing it, debugging it took an afternoon.
For example, if the current parent node has several child nodes while the method can return only one Node object, traversing all the nodes would inevitably overwrite the earlier ones. I therefore added a parameter, baseNode, to the method: the caller passes in the parent node object, and when there are several node objects they are added to baseNode. Because objects live on the heap, they are not overwritten; the original object is simply modified in place. The following is the event handler that runs when a node is called:
@Override
public void innerTextUsing(Node node, String innerText) {
    // Parse the node's innerText on demand and attach the result to the node.
    Node n = parser(node, innerText);
    if (n == null) {
        return;
    }
    node.addNode(n);
}

The code here is very simple. When the event is triggered, the event source passes in the Node object that triggered the event together with its innerText value, and hands them directly to the parsing method.

Conclusion: the whole parsing process is very simple. After the document object is created, a preprocessing method is called automatically. It mainly handles the XML file header and removes the comments from the document, and then runs the parsing method. The first time round, the parsing method parses only the root node of the XML document and stores the remaining content in the root node's innerText. When a node's accessor method is called, an event is triggered and the innerText of that node is parsed in the event handler; the cycle then repeats. There are bound to be mistakes in this analysis; corrections are welcome.
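The flow summarized in the conclusion (preprocess, parse the root only, then parse each node's innerText when its children are requested) can be sketched in a self-contained form. This is only an illustration under stated assumptions, not the author's implementation: innerTextUsing and addNode mirror names from the post, while LazyFlowSketch, LazyNode, InnerTextListener, HANDLER, ELEMENT, preprocess, parseRoot and parseCalls are hypothetical. The toy element pattern handles only tags without attributes and without same-named nesting, and the preprocessing ignores cases such as comment markers inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyFlowSketch {

    // Callback fired the first time a node's children are requested (hypothetical interface).
    interface InnerTextListener {
        void innerTextUsing(LazyNode node, String innerText);
    }

    static class LazyNode {
        final String tag;
        final String innerText; // raw content, parsed only on demand
        private final List<LazyNode> children = new ArrayList<>();
        private final InnerTextListener listener;
        private boolean parsed = false;

        LazyNode(String tag, String innerText, InnerTextListener listener) {
            this.tag = tag;
            this.innerText = innerText;
            this.listener = listener;
        }

        void addNode(LazyNode child) {
            children.add(child);
        }

        // The accessor acts as the event source: only the first call triggers parsing.
        List<LazyNode> getChildren() {
            if (!parsed) {
                parsed = true;
                listener.innerTextUsing(this, innerText);
            }
            return children;
        }
    }

    // Toy element pattern: <tag>content</tag>, no attributes, no same-named nesting.
    static final Pattern ELEMENT = Pattern.compile("<(\\w+)>(.*?)</\\1>", Pattern.DOTALL);

    // Counts handler invocations so the on-demand behaviour can be observed.
    static int parseCalls = 0;

    // Parses only the direct children found in innerText and adds them to the parent.
    static final InnerTextListener HANDLER = new InnerTextListener() {
        @Override
        public void innerTextUsing(LazyNode parent, String innerText) {
            parseCalls++;
            Matcher m = ELEMENT.matcher(innerText);
            while (m.find()) {
                parent.addNode(new LazyNode(m.group(1), m.group(2), this));
            }
        }
    };

    // Preprocessing: drop the XML declaration and all comments, then trim.
    static String preprocess(String xml) {
        return xml.replaceFirst("(?s)^\\s*<\\?xml.*?\\?>", "")
                  .replaceAll("(?s)<!--.*?-->", "")
                  .trim();
    }

    // First pass: create the root node only; its content stays unparsed.
    static LazyNode parseRoot(String xml) {
        Matcher m = ELEMENT.matcher(preprocess(xml));
        if (!m.find()) {
            throw new IllegalArgumentException("No root element found");
        }
        return new LazyNode(m.group(1), m.group(2), HANDLER);
    }

    public static void main(String[] args) {
        LazyNode root = parseRoot(
                "<?xml version=\"1.0\"?><!-- demo --><root><a><b>x</b></a><c>y</c></root>");
        System.out.println(root.tag + " created, handler calls so far: " + parseCalls);
        System.out.println("children of root: " + root.getChildren().size());
        System.out.println("handler calls after first access: " + parseCalls);
    }
}
```

In this sketch the content of a and c stays an unparsed string until getChildren() is called on them, which is the saving the post describes for documents whose nodes are not all used.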

This article is from the "focus on achieving the future!" blog; please be sure to keep this source link: http://xiaodpro.blog.51cto.com/744826/381758
