Parsing XML documents with Xerces-C++

Last Update:2018-12-04 Source: Internet

Author: User

Tags xml parser


A while ago I learned how to use Xerces-C++ to parse XML documents in a specified format, and I would like to share that experience here. This article covers only beginner-level material; I hope it is helpful.
What is Xerces-C++?
The predecessor of Xerces-C++ is IBM's XML4C project. XML4C and XML4J were two parallel projects; XML4J is the predecessor of Xerces-J, the Java implementation. IBM donated the source code of both projects to the Apache Software Foundation, which renamed them Xerces-C++ and Xerces-J respectively. Both are core projects of the Apache XML group. (If you see "Xerces-C" instead of "Xerces-C++", it is the same thing: the project was written in C++ from the beginning.)
Xerces-C++: feature overview
Xerces-C++ is a very robust XML parser. It provides two ways of parsing XML documents, DOM and SAX (I use DOM).
SAX is an event-oriented programming API. The parsing engine consumes the XML data sequentially and calls back into the application whenever it recognizes an XML construct. These callbacks are called event handlers.
DOM, unlike SAX, lets you edit an XML document and save it to a file or stream, and it also lets you build an XML document programmatically. DOM provides an in-memory model: you can traverse the document tree, delete nodes, or graft on new ones. Unlike SAX parsing events, DOM events reflect the user's interaction with the document and the changes made to it.
In general, SAX walks through the XML document sequentially as it is read, while DOM first builds the whole document tree and then traverses that tree to process each node.
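To make the callback style concrete, here is a minimal sketch of an event-driven scanner. It is a toy, not the Xerces-C++ SAX API: EventHandler, RecordingHandler and scanSimpleXml are hypothetical names introduced only for this illustration, and the scanner understands nothing beyond opening tags, closing tags and plain text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical handler interface, loosely modelled on the SAX idea of
// "one callback per piece of markup". It is not the Xerces-C++ API.
struct EventHandler {
    virtual ~EventHandler() {}
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characters(const std::string& text) = 0;
};

// Toy scanner: understands only <name>, </name> and plain text.
// No attributes, comments, entities or error recovery.
void scanSimpleXml(const std::string& xml, EventHandler& handler) {
    std::size_t pos = 0;
    while (pos < xml.size()) {
        if (xml[pos] == '<') {
            std::size_t close = xml.find('>', pos);
            if (close == std::string::npos) return;  // malformed input: stop
            std::string tag = xml.substr(pos + 1, close - pos - 1);
            if (!tag.empty() && tag[0] == '/')
                handler.endElement(tag.substr(1));
            else
                handler.startElement(tag);
            pos = close + 1;
        } else {
            std::size_t open = xml.find('<', pos);
            if (open == std::string::npos) open = xml.size();
            handler.characters(xml.substr(pos, open - pos));
            pos = open;
        }
    }
}

// Example handler that simply records the events in the order received.
struct RecordingHandler : EventHandler {
    std::vector<std::string> events;
    void startElement(const std::string& name) { events.push_back("start:" + name); }
    void endElement(const std::string& name)   { events.push_back("end:" + name); }
    void characters(const std::string& text)   { events.push_back("text:" + text); }
};
```

Nothing is kept in memory except what the handler chooses to store, which is the essential difference from DOM.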
Xerces-C++: learning process
1. Platform selection:
Before learning Xerces-C++, you must choose a platform, such as Windows, Linux, Cygwin, or Solaris. I chose Red Hat Enterprise Linux AS 3. The Xerces-C++ package is xerces-c-src_2_7_0.tar.gz, which can be downloaded from the official website: http://www.apache.org/.
2. Compile the source code
Because I downloaded the source package, it has to be compiled first; otherwise there is no library file to link against.
First go to your working directory: cd /home/olcom/laubo (this is my working directory)
Then unpack the source package: tar zxvf xerces-c-src_2_7_0.tar.gz
Set the environment variable that points to the source tree:
export XERCESCROOT=/home/olcom/laubo/xerces-c-src_2_7_0
Change directory: cd xerces-c-src_2_7_0/src/xercesc
Run the script that generates the makefiles:
./runConfigure -plinux -cgcc -xg++ -C--prefix=/opt/apachexml
Options: -p is the operating system platform
-c is the C compiler
-x is the C++ compiler
-C passes an extra option through to configure (here, the installation prefix)
Compile the source code: make
make install
(Compilation may take a while, about 7 minutes on my machine, so be patient.)
3. Learning the library
Because the class library is large, I did not start by reading through it. Instead, I first compiled and debugged a complete example found on the Internet, and then studied the interfaces the library provides by working from that example. I have simplified my program here and hope it can serve as a learning example.
First we need an XML document to parse. Here is one (the original contains Chinese characters):
// sample.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<NationalSurvey>
  <Node1>
    <Subnode>
      <Subnode1>
        <Subnode11>China 111-> Jiangsu</Subnode11>
        <Subnode11>China 112-> Tianjin</Subnode11>
        <Subnode11>China 113-> Beijing</Subnode11>
        <Subnode11>China 114-> Shanghai</Subnode11>
        <Subnode11>China 115-> Guangzhou</Subnode11>
      </Subnode1>
    </Subnode>
    <Subnode1>Asia 12-> South Korea</Subnode1>
    <Subnode2>Asia 13-> Japan</Subnode2>
    <Subnode3>Asia 14-> Vietnam</Subnode3>
    <Subnode4>Asia 15-> Cambodia</Subnode4>
    <Subnode5>Asia 16-> Laos</Subnode5>
  </Node1>
  <Node2>
    <Subnode>America 21-> Brazil</Subnode>
    <Subnode>America 22-> Argentina</Subnode>
    <Subnode>America 23-> Chile</Subnode>
    <Subnode>America 24-> Mexico</Subnode>
    <Subnode>America 25-> Paraguay</Subnode>
    <Subnode>America 26-> America</Subnode>
    <Subnode>America 27-> Canada</Subnode>
  </Node2>
  <Node3>
    <Subnode>Europe 31-> UK</Subnode>
    <Subnode>Europe 32-> Italy</Subnode>
    <Subnode>Europe 33-> France</Subnode>
    <Subnode>Europe 34-> Germany</Subnode>
    <Subnode>Europe 35-> Spain</Subnode>
    <Subnode>Europe 36-> Hungary</Subnode>
  </Node3>
  <Node5>the end</Node5>
</NationalSurvey>
After defining the format, let's take a look at how the program parses it. The program is as follows:

// CXML.h
#ifndef XML_PARSER_HPP
#define XML_PARSER_HPP
#include <xercesc/util/PlatformUtils.hpp>
#include <xercesc/util/TransService.hpp>
#include <xercesc/dom/DOM.hpp>
#include <xercesc/dom/DOMDocument.hpp>
#include <xercesc/dom/DOMDocumentType.hpp>
#include <xercesc/dom/DOMElement.hpp>
#include <xercesc/dom/DOMImplementation.hpp>
#include <xercesc/dom/DOMImplementationLS.hpp>
#include <xercesc/dom/DOMNodeIterator.hpp>
#include <xercesc/dom/DOMNodeList.hpp>
#include <xercesc/dom/DOMText.hpp>
#include <xercesc/dom/DOMAttr.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/util/XMLUni.hpp>
#include <xercesc/framework/XMLFormatter.hpp>
#include <xercesc/util/XMLString.hpp>
#include <stdlib.h>
#include <string>
#include <vector>
#include <stdexcept>
using namespace std;
using namespace xercesc;
class XMLStringTranslate;
class CXML
{
public:
    CXML();
    ~CXML();
    XMLTransService::Codes tranServiceCode;
    void xmlParser(string&) throw(std::runtime_error);
private:
    XMLStringTranslate* XMLTan;
    xercesc::XercesDOMParser* m_DOMXMLParser;   // the parser object
};
class XMLStringTranslate : public XMLFormatTarget
{
public:

    XMLStringTranslate(const char* const encoding);
    bool TranslatorUTF8ToChinese(string& strTranslatorMsg);
    bool UTF8_2_GB2312(char* in, int inLen, char* out, int outLen);
    string translate(const XMLCh* const value);
    const XMLCh* const translate(const char* const value);
    virtual ~XMLStringTranslate();

protected:
    XMLFormatter* fFormatter;
    XMLCh* fEncodingUsed;
    XMLCh* toFill;
    char* m_value;
protected:
    enum Constants
    {
        kTmpBufSize = 16 * 1024,
        kCharBufSize = 16 * 1024
    };
    void clearBuffer();
    virtual void writeChars(const XMLByte* const toWrite
                          , const unsigned int count
                          , XMLFormatter* const formatter);
};
#endif
// CXML.cpp
#include <string>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <list>
#include <cassert>
#include <cstdio>
#include <cstring>
#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <unistd.h>
#include <iconv.h>
#include "CXML.h"
// Encoding conversion: UTF-8 to GBK, using iconv
bool XMLStringTranslate::UTF8_2_GB2312(char* in, int inLen, char* out, int outLen)
{
    iconv_t cd = iconv_open("GBK", "UTF-8");
    // iconv_open reports failure by returning (iconv_t)-1
    if (cd == (iconv_t)-1)
    {
        cout << "iconv is error" << endl;
        return false;
    }
    char* pin = in;
    char* pout = out;
    // iconv expects size_t counters; the +1 also converts the terminating '\0'
    size_t inLen_ = inLen + 1;
    size_t outLen_ = outLen;

    iconv(cd, &pin, &inLen_, &pout, &outLen_);
    iconv_close(cd);
    return true;
}
bool XMLStringTranslate::TranslatorUTF8ToChinese(string& strTranslatorMsg)
{
    char* pStrSource = const_cast<char*>(strTranslatorMsg.c_str());
    // A std::vector replaces the original variable-length array, which is
    // not standard C++ and may fail to compile; it also frees itself
    vector<char> destination(strTranslatorMsg.length() * 2 + 1, '\0');
    if (!UTF8_2_GB2312(pStrSource, strTranslatorMsg.length(), &destination[0], destination.size() - 1))
        return false;

    strTranslatorMsg = &destination[0];
    return true;
}
CXML::CXML()
{
    try
    {
        // Initialize the Xerces-C++ library
        XMLPlatformUtils::Initialize();
    }
    catch (xercesc::XMLException& excp)
    {
        char* msg = XMLString::transcode(excp.getMessage());
        printf("XML toolkit initialization error: %s\n", msg);
        XMLString::release(&msg);
    }

    XMLTan = new XMLStringTranslate("UTF-8");
    // Create the XercesDOMParser object used to parse documents
    m_DOMXMLParser = new XercesDOMParser;
}
CXML::~CXML()
{
    try
    {
        // Destroy every Xerces object before terminating the library
        delete m_DOMXMLParser;
        delete XMLTan;
        XMLPlatformUtils::Terminate();
    }
    catch (XMLException& excp)
    {
        char* msg = XMLString::transcode(excp.getMessage());
        printf("XML toolkit terminate error: %s\n", msg);
        XMLString::release(&msg);
    }
}
void CXML::xmlParser(string& xmlFile) throw(std::runtime_error)
{
    // Get the file status. stat() returns -1 on failure and reports
    // the reason in errno
    struct stat fileStatus;
    int iretStat = stat(xmlFile.c_str(), &fileStatus);
    if (iretStat == -1)
    {
        if (errno == ENOENT)
            throw (std::runtime_error("file_name does not exist, or path is an empty string."));
        else if (errno == ENOTDIR)
            throw (std::runtime_error("A component of the path is not a directory."));
        else if (errno == ELOOP)
            throw (std::runtime_error("Too many symbolic links encountered while traversing the path."));
        else if (errno == EACCES)
            throw (std::runtime_error("Permission denied."));
        else if (errno == ENAMETOOLONG)
            throw (std::runtime_error("File can not be read."));
    }

    // Configure the DOM parser
    m_DOMXMLParser->setValidationScheme(XercesDOMParser::Val_Auto);
    m_DOMXMLParser->setDoNamespaces(false);
    m_DOMXMLParser->setDoSchema(false);
    m_DOMXMLParser->setLoadExternalDTD(false);

    try
    {
        // Call the parsing interface provided by the Xerces-C++ class library
        m_DOMXMLParser->parse(xmlFile.c_str());

        // Get the DOM tree
        DOMDocument* xmlDoc = m_DOMXMLParser->getDocument();
        DOMElement* pRoot = xmlDoc->getDocumentElement();
        if (!pRoot)
        {
            throw (std::runtime_error("Empty XML document"));
        }

        // Create a walker to visit all text nodes.
        /**********************************************
        DOMTreeWalker* walker =
            xmlDoc->createTreeWalker(pRoot, DOMNodeFilter::SHOW_TEXT, NULL, true);
        // Use the tree walker to print out the text nodes.
        std::cout << "TreeWalker:\n";

        for (DOMNode* current = walker->nextNode(); current != 0; current = walker->nextNode())
        {

            char* strValue = XMLString::transcode(current->getNodeValue());
            std::cout << strValue;
            XMLString::release(&strValue);
        }
        std::cout << std::endl;

        *************************************************/

        // Create an iterator to visit all text nodes.
        DOMNodeIterator* iterator = xmlDoc->createNodeIterator(pRoot,
            DOMNodeFilter::SHOW_TEXT, NULL, true);

        // Use the iterator to print out the text nodes.
        std::cout << "iterator:\n";

        for (DOMNode* current = iterator->nextNode();
             current != 0; current = iterator->nextNode())
        {
            string strValue = XMLTan->translate(current->getNodeValue());
            XMLTan->TranslatorUTF8ToChinese(strValue);
            std::cout << strValue << endl;
        }

        std::cout << std::endl;

        // Give the iterator back to the document once we are done with it
        iterator->release();
    }
    catch (xercesc::XMLException& excp)
    {
        char* msg = xercesc::XMLString::transcode(excp.getMessage());
        ostringstream errBuf;
        errBuf << "Error parsing file: " << msg << flush;
        XMLString::release(&msg);
        // Report the message instead of silently discarding it
        cerr << errBuf.str() << endl;
    }
}
// Members are initialized in declaration order
XMLStringTranslate::XMLStringTranslate(const char* const encoding)
    : fFormatter(0), fEncodingUsed(0), toFill(0), m_value(0)
{
    XMLFormatTarget* myFormTarget = this;
    fEncodingUsed = XMLString::transcode(encoding);
    fFormatter = new XMLFormatter(fEncodingUsed
                                , myFormTarget
                                , XMLFormatter::NoEscapes
                                , XMLFormatter::UnRep_CharRef);
    toFill = new XMLCh[kTmpBufSize];
    clearBuffer();
}
XMLStringTranslate::~XMLStringTranslate()
{
    if (fFormatter)
        delete fFormatter;
    // Memory returned by XMLString::transcode is handed back with XMLString::release
    if (fEncodingUsed)
        XMLString::release(&fEncodingUsed);
    // Buffers allocated with new[] are freed with delete[], not free()
    if (m_value)
        delete[] m_value;
    if (toFill)
        delete[] toFill;

    fFormatter = 0;
    fEncodingUsed = 0;
    m_value = 0;
    toFill = 0;
}
// Called by the formatter with the transcoded bytes; keeps a copy in m_value
void XMLStringTranslate::writeChars(const XMLByte* const toWrite
                                  , const unsigned int count
                                  , XMLFormatter* const formatter)
{
    if (m_value)
        delete[] m_value;
    m_value = 0;
    m_value = new char[count + 1];
    memset(m_value, 0, count + 1);
    // Copy exactly count bytes; the memset above supplies the terminating '\0'
    memcpy(m_value, (char*)toWrite, count);
}
void XMLStringTranslate::clearBuffer()
{
    if (!toFill)
        return;
    for (int i = 0; i < kTmpBufSize; i++)
        toFill[i] = 0;
}
// Convert from XMLCh* to string
string XMLStringTranslate::translate(const XMLCh* const value)
{
    *fFormatter << value;
    // m_value stays null if the formatter never wrote anything
    string strValue = m_value ? string(m_value) : string();
    return strValue;
}
// Convert from char* to XMLCh*; the result points into the toFill buffer
const XMLCh* const XMLStringTranslate::translate(const char* const value)
{
    clearBuffer();
    const unsigned int srcCount = XMLString::stringLen(value);
    unsigned char fCharSizeBuf[kCharBufSize];
    XMLTranscoder* pTranscoder = (XMLTranscoder*)fFormatter->getTranscoder();
    unsigned int bytesEaten;
    // Leave room for the terminating zero written below
    unsigned int size = pTranscoder->transcodeFrom(
        (XMLByte*)value,
        srcCount,
        toFill,
        kTmpBufSize - 1,
        bytesEaten,
        fCharSizeBuf
    );
    toFill[size] = 0;
    string t1 = string(value);
    string t2 = translate(toFill);
    assert(t1 == t2);
    return toFill;
}
#ifdef MAIN_TEST
int main()
{
    string xmlFile = "sample.xml";
    CXML cxml;
    cxml.xmlParser(xmlFile);
    return 0;
}
#endif

// Makefile
# This is a makefile for a Xerces-C++ application
MAIN = XML
CC = g++
CFLAGS = -c -g -Wall
$(MAIN): CXML.o
[Tab]$(CC) CXML.o -o $(MAIN) -L/opt/apachexml/lib -lxerces-c
CXML.o: CXML.cpp CXML.h
[Tab]$(CC) $(CFLAGS) -pedantic -I/opt/apachexml/include CXML.cpp -DMAIN_TEST
.PHONY: clean
clean:
[Tab]rm CXML.o $(MAIN)

The following briefly analyzes the source program:
First, to parse XML documents with the Xerces-C++ class library you must initialize the library. Therefore the CXML constructor initializes it: XMLPlatformUtils::Initialize();
Next, we define the parser object and create it in the constructor. Then, in the xmlParser function, we call the library's parsing interface and pass in the XML file name (m_DOMXMLParser->parse(xmlFile.c_str());). Because we use the DOM method, we then obtain the DOM tree: DOMDocument* xmlDoc = m_DOMXMLParser->getDocument();, and get its root node: DOMElement* pRoot = xmlDoc->getDocumentElement();.
What next? As described above, we need to traverse this DOM tree, so we need a traversal method. The program provides two. One is to create a tree walker: DOMTreeWalker* walker = xmlDoc->createTreeWalker(pRoot, DOMNodeFilter::SHOW_TEXT, NULL, true); the other is to use an iterator over the whole DOM tree: DOMNodeIterator* iterator = xmlDoc->createNodeIterator(pRoot, DOMNodeFilter::SHOW_TEXT, NULL, true). Both achieve the same result. The code commented out in the program is the tree-walker version.
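The effect of a SHOW_TEXT traversal can be sketched without the library. The Node struct and the makeText, makeElement and collectText helpers below are hypothetical stand-ins for a DOM tree, not Xerces-C++ types; the sketch only illustrates a depth-first visit in document order that skips everything except text nodes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory node standing in for a DOM node.
// An element has a name and children; a text node has only a value.
struct Node {
    bool isText;
    std::string name;            // element name (empty for text nodes)
    std::string value;           // text content (empty for elements)
    std::vector<Node> children;  // child nodes in document order
};

Node makeText(const std::string& value) {
    Node n;
    n.isText = true;
    n.value = value;
    return n;
}

Node makeElement(const std::string& name) {
    Node n;
    n.isText = false;
    n.name = name;
    return n;
}

// Depth-first walk in document order that keeps only text nodes,
// which is what a SHOW_TEXT filter asks a DOM traversal to do.
void collectText(const Node& node, std::vector<std::string>& out) {
    if (node.isText) {
        out.push_back(node.value);
        return;
    }
    for (std::size_t i = 0; i < node.children.size(); ++i)
        collectText(node.children[i], out);
}
```

Element names never reach the output here, which is why a traversal of this kind prints node values only.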
After traversing the tree and printing the node values, we need to shut the library down, which the destructor does: XMLPlatformUtils::Terminate().
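One way to keep the initialize and terminate calls paired is a small scope guard. The sketch below is an assumption about how you might structure this, not something taken from the program above: LibraryGuard, fakeInitialize, fakeTerminate and useLibrary are hypothetical names, and the two fake functions merely count calls so that the example runs without Xerces-C++.

```cpp
// Generic scope guard: runs init() on construction and term() on destruction.
class LibraryGuard {
public:
    typedef void (*Hook)();
    LibraryGuard(Hook init, Hook term) : term_(term) { init(); }
    ~LibraryGuard() { term_(); }
private:
    LibraryGuard(const LibraryGuard&);             // non-copyable
    LibraryGuard& operator=(const LibraryGuard&);  // non-assignable
    Hook term_;
};

// Stand-ins for the real library calls, so the sketch runs on its own.
static int g_initCalls = 0;
static int g_termCalls = 0;
void fakeInitialize() { ++g_initCalls; }
void fakeTerminate()  { ++g_termCalls; }

// Objects that depend on the library live in an inner scope, so they are
// destroyed before the guard terminates the library.
int useLibrary() {
    LibraryGuard guard(fakeInitialize, fakeTerminate);
    {
        // a parser object would be created and used here
    }
    return g_initCalls - g_termCalls;  // evaluated while the guard is alive
}
```

In a real program the two hooks would be small functions that call XMLPlatformUtils::Initialize() and XMLPlatformUtils::Terminate(). The inner scope matters: parser objects should be destroyed before the library is terminated.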
Those are the basic steps for parsing a simple XML document. For complex XML documents the steps, especially the way the DOM tree is handled, differ a little; I will not cover them here. Next, let me talk about the problem of parsing Chinese text, which plagued me for many days. In my experience, with the default settings Xerces-C++ handled Chinese characters in node names, but Chinese node values and attribute values came out garbled, and you have to solve that yourself. Here we use an XML document encoded in UTF-8. Let's look at why the characters are garbled. The strings returned by the parser are of type XMLCh* (XMLCh is a 16-bit type that holds one UTF-16 code unit), whereas in a byte string an ASCII character occupies one byte and a Chinese character occupies several (two in GBK, usually three in UTF-8). Therefore, without a proper conversion, Chinese characters are printed as garbage. http://www.vckbase.com/document/viewdoc?id=738 provides a solution, but with it you only see correct Chinese output when the locale is UTF-8; in other locales, such as GB18030, the Chinese characters are garbled. Still, correct display in one environment shows that the document is parsed correctly and that only an encoding conversion is needed on machines with a different environment. Therefore, I added two methods to that author's class to perform the conversion:
bool TranslatorUTF8ToChinese(string& strTranslatorMsg); // converts from UTF-8 to GBK, GB2312, etc.
bool UTF8_2_GB2312(char* in, int inLen, char* out, int outLen);
With these, text parsed from a UTF-8 encoded document is printed correctly.
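For reference, here is a more defensive way to write the same kind of iconv conversion. It is a sketch under assumptions: convertEncoding is a hypothetical helper, not part of the program above, and which encoding names are accepted (GBK, for instance) depends on the iconv implementation installed on your system.

```cpp
#include <cerrno>
#include <cstddef>
#include <iconv.h>
#include <string>

// Convert `in` from encoding `from` to encoding `to` (iconv encoding names).
// Returns false if the conversion is unsupported or the input is invalid.
// Stateful target encodings would additionally need a final flush call.
bool convertEncoding(const char* from, const char* to,
                     const std::string& in, std::string& out) {
    iconv_t cd = iconv_open(to, from);  // note the order: target first
    if (cd == (iconv_t)-1)
        return false;

    out.clear();
    char* src = const_cast<char*>(in.data());
    std::size_t srcLeft = in.size();
    char buf[256];
    bool ok = true;

    while (srcLeft > 0) {
        char* dst = buf;
        std::size_t dstLeft = sizeof(buf);
        std::size_t rc = iconv(cd, &src, &srcLeft, &dst, &dstLeft);
        int err = errno;  // save before any other call can change it
        out.append(buf, sizeof(buf) - dstLeft);
        if (rc == (std::size_t)-1 && err != E2BIG) {
            ok = false;  // EILSEQ or EINVAL: invalid or truncated input
            break;
        }
        // E2BIG only means buf was full; loop again with a fresh buffer.
    }
    iconv_close(cd);
    return ok;
}
```

Calling convertEncoding("UTF-8", "GBK", text, result) would mirror what UTF8_2_GB2312 does, provided your iconv supports GBK. Unlike the original, it passes size_t counters directly, checks the return value, and grows the output as needed.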
XML documents come in many different structures, so a parsing program written for one of them cannot be general-purpose. Different XML documents call for different parsing requirements and methods (for example, if you want to print each node's name together with its value, the method above is not enough), so you will have to adapt the program. If you are interested in parsing XML, it is worth studying the Xerces-C++ class library in depth.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
