Use StAX to parse XML

Source: Internet
Author: User
Tags: IBM developerWorks
1. Preface

Most of this article is excerpted from IBM developerWorks and is mainly theoretical. The three articles listed below are the sources; I excerpted them as notes to deepen my own understanding and as a reference for future use. The excerpt is not comprehensive and the originals are much richer, so see the original documents for details.

References:

Parsing XML with StAX, Part 1: Introduction to the Streaming API for XML (StAX): http://www.ibm.com/developerworks/cn/xml/x-stax1.html

Parsing XML with StAX, Part 2: Pull parsing and events: http://www.ibm.com/developerworks/cn/xml/x-stax2.html

Parsing XML with StAX, Part 3: Using custom events and writing XML: http://www.ibm.com/developerworks/cn/xml/x-stax3.html

2. Overview

Originally, the Java API for XML Processing (JAXP) provided two ways to process XML: the Document Object Model (DOM) and the Simple API for XML (SAX). JSR-173 proposed a new, stream-oriented method: the Streaming API for XML (StAX). Its final version was released in March 2004 and became part of JAXP 1.4 (included in Java 6).

As the name implies, StAX focuses on streams: it lets an application process XML as a stream of events. SAX is also an event-stream-based way of processing XML; the difference is that SAX is based on the observer pattern. With SAX we provide an event handler and register it with the parser, and the parser calls back into our code whenever one of the specified events occurs. StAX instead lets our program "pull" events one at a time, which gives it more flexibility: the program decides when to advance and can simply skip events it is not interested in.
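The pull model can be sketched as follows. This is a minimal illustration, not code from the IBM articles: the class name PullUntilFound, the method firstTitle, and the element name "title" are assumptions chosen for the example. The application drives the loop and stops asking for events as soon as it has what it needs.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PullUntilFound {
    /** Returns the text of the first <title> element, or null if there is none. */
    static String firstTitle(String xml) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        try {
            // The application, not the parser, decides when to advance.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // Found it: return immediately and stop pulling further events.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

With SAX, by contrast, the parser would keep calling the handler until the end of the document unless the handler aborted by throwing an exception.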

StAX provides two sets of APIs for processing XML, at different levels of abstraction. The pointer-based API (called the cursor API in the StAX documentation) processes XML as a stream of tokens (or events). The application checks the state of the parser to obtain information about the most recently parsed token, then advances to the next token, and so on. This is a low-level API: it is highly efficient, but it provides no abstraction of the underlying XML structure. The iterator-based API processes XML as a series of event objects. The application only needs to determine the type of each parsed event, cast it to the corresponding concrete type, and then use its methods to obtain information about the event.

3. Basic Principles

Whichever API we use, the first step is to obtain a parser factory instance, configure any custom or predefined properties on it as needed (the names of the predefined properties are constants in the XMLInputFactory class), and then create a parser, as follows:

    XMLInputFactory inputFactory = XMLInputFactory.newFactory();
    XMLEventReader eventReader = inputFactory.createXMLEventReader(new FileInputStream("E:\\PDFPATH_6.xml"));

This example creates the event object-based parser, XMLEventReader. To use the pointer-based API instead, call one of the createXMLStreamReader methods to obtain an XMLStreamReader. The event iterator API is more object-oriented than the pointer-based API because the current parser state is captured in the event object: when processing an event you do not need to access the parser, since all the required information is encapsulated in the event object you obtained.
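The factory steps can be sketched as below. The class and method names (FactorySetup, newConfiguredFactory, pointerReader, eventReader) are hypothetical, and the two properties set here are only examples of predefined XMLInputFactory properties, not settings taken from the original articles.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class FactorySetup {
    /** Creates a factory and sets two predefined properties (example choices). */
    static XMLInputFactory newConfiguredFactory() {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Merge adjacent character data into a single CHARACTERS event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        // Disable DTD support (an assumption for this sketch, often used for untrusted input).
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        return factory;
    }

    /** Pointer-based parser. */
    static XMLStreamReader pointerReader(String xml) throws XMLStreamException {
        return newConfiguredFactory().createXMLStreamReader(new StringReader(xml));
    }

    /** Event object-based parser. */
    static XMLEventReader eventReader(String xml) throws XMLStreamException {
        return newConfiguredFactory().createXMLEventReader(new StringReader(xml));
    }
}
```

Both readers come from the same factory, so any property set on the factory applies to whichever API is chosen.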

4. Pointer-based API

The pointer-based API processes XML by moving a logical pointer along the stream of XML tokens. A pointer-based parser is essentially a state machine that moves from one state to another, driven by events. The triggering events are the XML tokens that the parser reads as the application advances it through the token stream with the appropriate methods. In each state, a group of methods is available for obtaining information about the most recent event; in general, not every method can be used in every state.

The pointer-based API is a low-level way of parsing XML. The application moves the pointer along the XML token stream and checks the parser state at each step to learn about the parsed content. This approach is highly efficient and especially suitable for resource-constrained environments.

To use the pointer-based API, first obtain an XMLStreamReader by calling the createXMLStreamReader method of the XMLInputFactory instance created above. This method has several overloaded versions that accept different kinds of input.

4.1 XMLStreamReader interface

The XMLStreamReader interface essentially defines the pointer-based API (the token constants are defined in its super-interface, XMLStreamConstants). It is called pointer-based because the reader acts like a pointer into the underlying token stream. The application can advance the pointer along the token stream and inspect the token at the current position.

XMLStreamReader provides several methods for navigating the token stream. To determine the type of the token (or event) the pointer currently points to, the application calls getEventType(), which returns one of the token constants defined in XMLStreamConstants. To move to the next token, the application calls next(), which also returns the type of the token just parsed; calling getEventType() afterwards returns the same value. next() (and the other methods that move the reader) may be called only when hasNext() returns true, that is, when more tokens remain to be parsed.

Sample Code:

    // create an XMLStreamReader
    XMLStreamReader r = ...;
    int event = r.getEventType();
    while (true) {
        switch (event) {
        case XMLStreamConstants.START_DOCUMENT:
            // do something
            break;
        case XMLStreamConstants.START_ELEMENT:
            // do something
            break;
        // add cases for each event of interest
        }
        if (!r.hasNext())
            break;
        event = r.next();
    }

Other methods can also move the reader. nextTag() skips all whitespace, comments, and processing instructions until it reaches a START_ELEMENT or END_ELEMENT. It is useful when parsing element-only content. If non-whitespace text is encountered before a tag is found, an exception is thrown. getElementText() returns all the text content between the start and end tags of an element (START_ELEMENT and END_ELEMENT). If a nested element is encountered, an exception is thrown.
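A short sketch of nextTag() and getElementText() working together. The document shape (a point element with x and y children) and the names NextTagDemo and readPoint are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class NextTagDemo {
    /** Reads <point><x>..</x><y>..</y></point> and returns {x, y}. */
    static int[] readPoint(String xml) throws XMLStreamException {
        XMLStreamReader r =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        try {
            r.nextTag(); // START_DOCUMENT -> <point>
            // Fails with an XMLStreamException if the reader is not on <point>.
            r.require(XMLStreamConstants.START_ELEMENT, null, "point");
            r.nextTag(); // skips the whitespace between tags -> <x>
            int x = Integer.parseInt(r.getElementText().trim()); // reader is now on </x>
            r.nextTag(); // -> <y>
            int y = Integer.parseInt(r.getElementText().trim()); // reader is now on </y>
            return new int[] {x, y};
        } finally {
            r.close();
        }
    }
}
```

Because nextTag() throws when it meets non-whitespace text, this style only suits content whose structure is known in advance.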

Note that "token" and "event" are used interchangeably here. Although the documentation of the pointer-based API speaks of events, it is convenient to think of the input as a token stream, and doing so also avoids confusion with the event-based API, where events are real objects. However, not all XMLStreamReader events are tokens. For example, the START_DOCUMENT and END_DOCUMENT events do not need corresponding tokens. The former occurs before parsing begins; the latter occurs when there is nothing more to parse (for example, after the closing tag of the last element has been parsed the reader is in the END_ELEMENT state, but when no more tokens remain, the reader switches to the END_DOCUMENT state).

4.2 Processing XML documents

After creation, an XMLStreamReader starts in the START_DOCUMENT state (that is, getEventType() returns START_DOCUMENT). Keep this in mind when processing tokens: unlike with an iterator, you do not first need to move the pointer (with next()) to reach a valid state. Likewise, once the reader has reached the final END_DOCUMENT state, the application should not move it any further; in that state hasNext() returns false.

In the START_DOCUMENT state, the reader provides methods for obtaining information about the document, such as getEncoding(), getVersion(), and isStandalone(). The application can also call getProperty(String) to obtain the value of a named property. Some properties, however, are defined only in a specific state (for example, when the current event is a DTD, the properties javax.xml.stream.notations and javax.xml.stream.entities return all notation declarations and all entity declarations respectively).
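A minimal sketch of reading document-level information right after the reader is created. DocumentInfo and describe are hypothetical names; getCharacterEncodingScheme() is added here because it reports the encoding written in the XML declaration, whereas getEncoding() reports the input encoding and may be null when it is unknown (for example, when the input is a Reader).

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class DocumentInfo {
    /** Summarizes the XML declaration of the given document. */
    static String describe(String xml) throws XMLStreamException {
        XMLStreamReader r =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        try {
            // A freshly created reader is already in the START_DOCUMENT state,
            // so no call to next() is needed before asking about the document.
            return "version=" + r.getVersion()
                    + " inputEncoding=" + r.getEncoding()
                    + " declaredEncoding=" + r.getCharacterEncodingScheme()
                    + " standalone=" + r.isStandalone();
        } finally {
            r.close();
        }
    }
}
```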

In START_ELEMENT and END_ELEMENT events you can use the methods related to element names and namespaces (such as getName(), getLocalName(), getPrefix(), and getNamespaceXXX()). In START_ELEMENT events you can also use the attribute-related methods (getAttributeXXX()).
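The name and attribute accessors might be used like this. The class ElementInfo, the method startTags, and the output format are assumptions made for the sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ElementInfo {
    /** Lists every start tag as "localName[attr=value, ...]". */
    static List<String> startTags(String xml) throws XMLStreamException {
        XMLStreamReader r =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        try {
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // attribute methods are only valid on START_ELEMENT
                }
                StringBuilder sb = new StringBuilder(r.getLocalName()).append('[');
                for (int i = 0; i < r.getAttributeCount(); i++) {
                    if (i > 0) {
                        sb.append(", ");
                    }
                    sb.append(r.getAttributeLocalName(i))
                      .append('=')
                      .append(r.getAttributeValue(i));
                }
                out.add(sb.append(']').toString());
            }
        } finally {
            r.close();
        }
        return out;
    }
}
```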

ATTRIBUTE and NAMESPACE are also defined as independent events, although they do not occur when parsing a typical XML document. They can occur, however, when attribute or namespace nodes are returned as the result of an XPath query.

For text-based events (such as CHARACTERS, CDATA, COMMENT, and SPACE), the text can be obtained with the various getTextXXX() methods. getPITarget() and getPIData() return the target and data of a PROCESSING_INSTRUCTION. ENTITY_REFERENCE and DTD also support getText(), and ENTITY_REFERENCE additionally supports getLocalName().
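A sketch of the text-related accessors, assuming a hypothetical class TextAndPi that records character data, comments, and processing instructions. IS_COALESCING is switched on here so that adjacent character data arrives as one event; that is a choice made for this example, not a requirement.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TextAndPi {
    /** Describes text, comment, and processing-instruction events in document order. */
    static List<String> describe(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLStreamReader r = factory.createXMLStreamReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        try {
            while (r.hasNext()) {
                switch (r.next()) {
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    if (!r.isWhiteSpace()) {
                        out.add("text:" + r.getText());
                    }
                    break;
                case XMLStreamConstants.COMMENT:
                    out.add("comment:" + r.getText());
                    break;
                case XMLStreamConstants.PROCESSING_INSTRUCTION:
                    out.add("pi:" + r.getPITarget() + "=" + r.getPIData());
                    break;
                default:
                    break; // other event types carry no text of interest here
                }
            }
        } finally {
            r.close();
        }
        return out;
    }
}
```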

After parsing is complete, the application should close the reader to release the resources acquired during parsing. Note that this does not close the underlying input source.
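Since closing the reader leaves the input source open, the application has to close both. One way to arrange that, sketched with the hypothetical names CloseBoth and countStartElements, is to let try-with-resources own the stream and close the reader in a finally block:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CloseBoth {
    /** Counts start elements, releasing both the parser and the stream afterwards. */
    static int countStartElements(byte[] xmlBytes) throws XMLStreamException, IOException {
        // The input stream is closed by try-with-resources, not by the reader.
        try (InputStream in = new ByteArrayInputStream(xmlBytes)) {
            XMLStreamReader r = XMLInputFactory.newFactory().createXMLStreamReader(in);
            try {
                int count = 0;
                while (r.hasNext()) {
                    if (r.next() == XMLStreamConstants.START_ELEMENT) {
                        count++;
                    }
                }
                return count;
            } finally {
                r.close(); // frees parser resources only
            }
        }
    }
}
```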

4.3 StreamFilter

To create a filtered XMLStreamReader, call the createFilteredReader method of XMLInputFactory with the base reader and an application-defined filter (an instance of a class that implements StreamFilter). As the application navigates the filtered reader, the reader queries the filter each time it moves to the next token. If the filter accepts the current event, the event is exposed through the filtered reader; otherwise the token is skipped and the next one is checked. This lets developers write a pointer-based XML handler that deals with only a subset of the parsed content, and combine it with different filters for different extended content models.
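A minimal StreamFilter sketch; ElementsOnly and elementNames are hypothetical names. The loop inspects the current event before advancing, in the same way as the sample loop in section 4.1, because a filtered reader may already be positioned on the first accepted event when it is returned and may still report END_DOCUMENT at the end. That is how the JDK's built-in implementation appears to behave; other implementations may differ, which is why the loop checks the event type explicitly rather than trusting the filter alone.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.StreamFilter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ElementsOnly {
    /** Returns the local names of all start elements, using a filtered reader. */
    static List<String> elementNames(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        XMLStreamReader base = factory.createXMLStreamReader(new StringReader(xml));
        // Accept only START_ELEMENT; every other token is skipped by the filtered reader.
        StreamFilter startElements = reader -> reader.isStartElement();
        XMLStreamReader filtered = factory.createFilteredReader(base, startElements);
        List<String> names = new ArrayList<>();
        try {
            while (true) {
                // Check the current event first, then advance.
                if (filtered.isStartElement()) {
                    names.add(filtered.getLocalName());
                }
                if (!filtered.hasNext()) {
                    break;
                }
                filtered.next();
            }
        } finally {
            filtered.close();
        }
        return names;
    }
}
```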

5. Event object-based API

This API is centered on event objects. Like the pointer-based API, it is a "pull"-style way of parsing XML: the application uses the provided methods to pull each event from the parser and processes it as needed, until the stream has been fully parsed (or the application decides to stop parsing).

5.1 Introduction to the XMLEventReader interface

The main interface of the event iterator API is XMLEventReader. It has far fewer methods than XMLStreamReader, because XMLEventReader simply iterates over a stream of event objects (in fact, XMLEventReader extends java.util.Iterator). All information about a parsed event is encapsulated in the event object rather than in the reader.

To use the event iterator API, the application must first obtain an XMLEventReader instance from XMLInputFactory. Like createXMLStreamReader, createXMLEventReader has several overloaded versions that accept different input sources. Note that createXMLEventReader can also take an XMLStreamReader as its argument. Used this way, the event iterator API is layered on top of the pointer-based API. In fact, implementations usually create an XMLStreamReader from the other kinds of input source first and then use it to create the XMLEventReader.
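Layering the event iterator API on top of a pointer-based reader could look like the following sketch (LayeredReaders and startElementNames are hypothetical names):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

class LayeredReaders {
    /** Wraps an XMLStreamReader in an XMLEventReader and lists the start element names. */
    static List<String> startElementNames(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Step 1: the pointer-based reader over the raw input.
        XMLStreamReader streamReader = factory.createXMLStreamReader(new StringReader(xml));
        // Step 2: the event reader built on top of it.
        XMLEventReader eventReader = factory.createXMLEventReader(streamReader);
        List<String> names = new ArrayList<>();
        try {
            while (eventReader.hasNext()) {
                XMLEvent event = eventReader.nextEvent();
                if (event.isStartElement()) {
                    names.add(event.asStartElement().getName().getLocalPart());
                }
            }
        } finally {
            eventReader.close();
        }
        return names;
    }
}
```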

5.2 Using XMLEventReader

Once an XMLEventReader has been created, the application can use it to iterate over the events that represent the infoset fragments of the underlying XML stream. Because XMLEventReader extends java.util.Iterator, the standard iterator methods such as hasNext() and next() are available. Note, however, that remove() is not supported; calling it throws an exception.

XMLEventReader also provides some convenience methods that simplify XML processing:

  • nextEvent() is essentially a strongly typed equivalent of Iterator's next() method. It returns an XMLEvent, the base interface of all event objects.
  • nextTag() skips all insignificant whitespace up to the next start or end tag, so its return value is always a StartElement or EndElement event. It is particularly useful when processing element-only content (that is, elements that contain only child elements and no text).
  • getElementText() gives access to the text content of a text-only element (the text between its start tag and end tag). Called after a StartElement has been read, it concatenates all character data up to the matching EndElement and returns the resulting string.
  • peek() returns the next event the iterator will return (if any) without advancing the iterator.
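Three of the convenience methods above (peek(), nextEvent(), and getElementText()) can be combined as in this sketch. The class EventReaderHelpers, the method names, and the element name "name" are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class EventReaderHelpers {
    /** Returns the text of every <name> element in the document. */
    static List<String> names(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newFactory().createXMLEventReader(new StringReader(xml));
        List<String> out = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                XMLEvent upcoming = reader.peek(); // look ahead without consuming
                if (upcoming.isStartElement()
                        && "name".equals(upcoming.asStartElement().getName().getLocalPart())) {
                    reader.nextEvent();               // consume the <name> start tag
                    out.add(reader.getElementText()); // read text up to </name>
                } else {
                    reader.nextEvent();               // not interesting: consume and move on
                }
            }
        } finally {
            reader.close();
        }
        return out;
    }
}
```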

For more information about these APIs, see the JDK API documentation.

5.3 XMLEvent hierarchy

After each step of the parsing process, XMLEventReader communicates its state to the application through an event object. The standard event object types used throughout the API are defined in the javax.xml.stream.events package. The XMLEvent interface is the root of the type hierarchy; every event type must extend it. The constants that identify the various event types (the same ones used by the pointer-based API) are defined in the XMLStreamConstants interface. You can also use custom event interfaces, as long as they extend XMLEvent.

After retrieving an event from the parser, the application usually needs to downcast it to a subtype of XMLEvent to access the information specific to that type. XMLEvent provides the getEventType() method, which returns one of the event constants defined in XMLStreamConstants; the downcast can be chosen based on that value. XMLEvent also provides Boolean query methods: for example, if isStartElement() returns true, the event is a StartElement. The methods asStartElement(), asEndElement(), and asCharacters() convert the event to StartElement, EndElement, and Characters respectively.
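The query and conversion methods might be used as in the sketch below. EventKinds and summarize are hypothetical names, and IS_COALESCING is enabled only so that each run of text arrives as a single Characters event. Event types without an asXXX() helper, such as Comment, need an explicit cast.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.Comment;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class EventKinds {
    /** Rebuilds a compact textual form of the document from its event objects. */
    static String summarize(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        StringBuilder sb = new StringBuilder();
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    sb.append('<').append(start.getName().getLocalPart()).append('>');
                } else if (event.isEndElement()) {
                    EndElement end = event.asEndElement();
                    sb.append("</").append(end.getName().getLocalPart()).append('>');
                } else if (event.isCharacters()) {
                    Characters text = event.asCharacters();
                    if (!text.isWhiteSpace()) {
                        sb.append(text.getData());
                    }
                } else if (event.getEventType() == XMLStreamConstants.COMMENT) {
                    Comment comment = (Comment) event; // no asComment(): cast explicitly
                    sb.append("<!--").append(comment.getText()).append("-->");
                }
            }
        } finally {
            reader.close();
        }
        return sb.toString();
    }
}
```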

5.4 Filtering events with EventFilter

StAX lets us create a special event reader: an XMLEventReader that delivers only events of specified types. This requires the EventFilter interface. Simply call the createFilteredReader(XMLEventReader, EventFilter) method on the XMLInputFactory instance, passing the base event reader and a simple filter that accepts or rejects each event obtained from the base reader.

For example:

    eventReader = inputFactory.createFilteredReader(eventReader, new EventFilter() {
        public boolean accept(XMLEvent event) {
            int type = event.getEventType();
            return type == XMLStreamConstants.START_ELEMENT
                    || type == XMLStreamConstants.END_ELEMENT
                    || type == XMLStreamConstants.CHARACTERS;
        }
    });

The XMLEventReader obtained in the code above delivers only three kinds of events: element start, element end, and character data.

6. Specific applications

I looked into StAX because a project needed to parse a large XML file and import its content into a database. The file is over 100 MB and contains more than 1 million records. Parsing it into Java objects in one pass with traditional DOM is not feasible: millions of Java objects would sit in memory and consume a large share of it. The only workable approach is to parse and store incrementally.

The main application code follows; it uses the event object-based API.

XML file structure:

    <?xml version="1.0" encoding="GBK"?>
    <TRS>
        <REC>
            <pdfpath>\1989yy05\R15\92257x\013\002\99130.133</pdfpath>
            <UI>1989017091</UI>
            <Zhonghua>0</Zhonghua>
        </REC>
        <REC>
            <pdfpath>\1989yy02\r4\94093x\004\001\184114.20.</pdfpath>
            <UI>1989019986</UI>
            <Zhonghua>0</Zhonghua>
        </REC>
        <!-- ... many entries -->
    </TRS>

Java parsing code (it is not in regular use, so it is rough and has not been optimized yet; the capitalization of identifiers was lost in the source, so project-specific names such as LiteralUrlAction, setUi, and saveEntity are restored by Java convention and may differ from the original project):

    package com.ninemax.admin.action;

    import java.io.FileInputStream;
    import java.io.FileNotFoundException;
    import java.util.Stack;

    import javax.xml.stream.EventFilter;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.events.XMLEvent;

    import org.apache.struts2.convention.annotation.Namespace;
    import org.springframework.beans.factory.annotation.Autowired;

    import com.ninemax.action.base.BaseActionSupport;
    import com.ninemax.entity.Domain;
    import com.ninemax.service.admin.IDomainService;

    @Namespace("/")
    public class LiteralUrlAction extends BaseActionSupport {

        public static final String ENTITY_TAG = "REC";    // entity tag
        public static final String PATH_TAG = "pdfpath";  // attribute tag
        public static final String UI_TAG = "UI";         // attribute tag
        public static final String FLAG_TAG = "Zhonghua"; // attribute tag

        @Autowired
        private IDomainService domainService;

        public String test() {
            XMLInputFactory inputFactory = XMLInputFactory.newFactory();
            Stack<Domain> stack = new Stack<Domain>();
            try {
                long s = System.currentTimeMillis();
                int i = 0;
                XMLEventReader eventReader = inputFactory.createXMLEventReader(
                        new FileInputStream("E:\\PDFPATH_6.xml"));
                // Deliver only element start, element end, and character events.
                eventReader = inputFactory.createFilteredReader(eventReader, new EventFilter() {
                    public boolean accept(XMLEvent event) {
                        int type = event.getEventType();
                        return type == XMLStreamConstants.START_ELEMENT
                                || type == XMLStreamConstants.END_ELEMENT
                                || type == XMLStreamConstants.CHARACTERS;
                    }
                });
                while (eventReader.hasNext()) {
                    XMLEvent event = eventReader.nextEvent();
                    if (event.isStartElement()) {
                        String tag = event.asStartElement().getName().getLocalPart();
                        if (ENTITY_TAG.equalsIgnoreCase(tag)) {
                            // Entity element: create a new object and push it onto the stack.
                            Domain domain = new Domain();
                            stack.push(domain);
                        } else if (UI_TAG.equalsIgnoreCase(tag)) {
                            // Attribute element: set the value on the current object.
                            String ui = eventReader.nextEvent().asCharacters().getData();
                            stack.lastElement().setUi(ui);
                        } else if (PATH_TAG.equalsIgnoreCase(tag)) {
                            // Attribute element: set the value on the current object.
                            String path = eventReader.nextEvent().asCharacters().getData();
                            stack.lastElement().setPath(path);
                        } else if (FLAG_TAG.equalsIgnoreCase(tag)) {
                            // Attribute element: set the value on the current object.
                            String flag = eventReader.nextEvent().asCharacters().getData();
                            stack.lastElement().setFlag(flag);
                        }
                    } else if (event.isEndElement()) {
                        // Element end event.
                        String tag = event.asEndElement().getName().getLocalPart();
                        if (ENTITY_TAG.equalsIgnoreCase(tag) && stack.size() == 10000) {
                            // An entity has ended and 10000 objects have accumulated:
                            // save them to the database in one batch.
                            domainService.saveEntity(stack);
                            i += stack.size();
                            stack.clear(); // empty the stack
                        }
                    }
                }
                if (stack.size() > 0) {
                    // Parsing is finished: save the remaining objects.
                    domainService.saveEntity(stack);
                    i += stack.size();
                }
                eventReader.close();
                System.out.println("Total time: " + (System.currentTimeMillis() - s)
                        + " ms, " + i + " records in total");
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (XMLStreamException e) {
                e.printStackTrace();
            }
            return null;
        }
    }

In testing, the performance was acceptable.
