XmlLite: A Small and Fast XML Parser for Native C++

Last Update: 2018-12-05 Source: Internet

Author: User

Tags: xml parser, xml reader


This article discusses:

A comparison of XmlLite with other available XML parsers
Advantages and limitations of XmlLite
Reading and writing XML
XML security considerations

This article uses the following technologies:
XML, C++

Contents
Why a new XML parser?
COM "Lite"
Reading XML
Writing XML
Working with streams
Text encoding when reading
Text encoding when writing
Processing large data values
Security considerations
Summary

Although managed code and the .NET Framework continue to succeed, Microsoft still takes native C++ development seriously. One illustration is the introduction of XmlLite, a high-performance, low-overhead XML reader and writer designed for applications written in native C++.

Managed code has broad XML support through the System.Xml namespace, and traditional COM-based Visual Basic and C++ applications can access similar functionality in Microsoft XML Core Services (MSXML). However, neither is an attractive option for native C++ developers who need a fast, lightweight XML parser. Enter XmlLite.

This article describes what you can do with XmlLite. First, though, to set expectations, I want to quickly review what XmlLite does not provide, at least in this initial version. For starters, it provides neither a Document Object Model (DOM) implementation nor XML Schema or document type definition (DTD) validation. It also lacks higher-level facilities such as cursor-based navigation (XPath, for example), style sheets, and serialization. However, functionality built on top of XmlLite can fill any of these gaps as needed. After all, almost all of the XML functionality in the .NET Framework is built on the XmlReader and XmlWriter classes.

So what does XmlLite provide? Simply put, it provides a non-caching, forward-only parser (which offers a pull programming model) and a non-caching, forward-only XML generator. Both turn out to be very valuable.

Why a new XML parser?

Developers grow comfortable with the libraries they use every day. With XML so widely used, they will naturally ask why a new XML parser is needed. To understand the value of this one, first consider the current state of XML parsers.

Naturally, if an application already uses the .NET Framework, the decision is usually simple: just use System.Xml. In fact, XmlLite's design is modeled on the XmlReader and XmlWriter classes in the .NET Framework. There is usually no advantage to using XmlLite from a managed application written in C++, because XmlLite provides less functionality than XmlReader and XmlWriter. (The table in Figure 1 shows how the main types in XmlLite map to the main types in the .NET Framework.) On the other hand, if an application uses only native code, MSXML has traditionally been the solution of choice among Microsoft technologies.

MSXML provides two very different XML parsers. The first is a general-purpose DOM implementation. If you work with small XML documents and need random access to a document in memory for reading and writing, the DOM implementation is a reasonable choice. Later versions of MSXML introduced an implementation of the Simple API for XML (SAX2). Whether it is actually simple is debatable. To use SAX2, before you can even start, you need to implement at least two COM interfaces: one to receive notifications for each node in the XML document, and another to receive notifications of parsing errors.

The SAX2 implementation was added to MSXML for the following reason: unlike the DOM implementation, the SAX2 parser reads an XML document as a stream and notifies you as it reaches each node. This means that your application's memory usage does not grow with the size of the document being parsed.

The inherent complexity of the SAX2 model is the reason the .NET Framework does not provide an implementation of it. SAX2 requires you to implement interfaces or events, and it forces a more indirect programming model in which you must manage additional state, which inevitably complicates the application. In contrast, the XmlReader and XmlWriter classes in the .NET Framework, and the IXmlReader and IXmlWriter interfaces of XmlLite, provide an easy-to-understand parser that can be used directly inside a function without managing any external state or notifications.
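
The difference between the two programming models can be sketched without any XML library at all. The following is a minimal, portable illustration and not XmlLite or SAX2 code: the Node type, Handler interface, PushParse function, and PullReader class are hypothetical stand-ins for a pre-parsed node sequence. It extracts the text of a title element twice, once with a push-style handler that has to keep state between callbacks, and once with a pull-style Read loop that keeps everything in local variables.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy node sequence standing in for a parsed XML document (hypothetical).
enum class NodeKind { StartElement, Text, EndElement };
struct Node { NodeKind kind; std::string value; };

// Push model (SAX-style): the parser drives and calls back into a handler.
struct Handler {
    virtual ~Handler() = default;
    virtual void OnNode(const Node& node) = 0;
};

void PushParse(const std::vector<Node>& nodes, Handler& handler) {
    for (const Node& node : nodes) handler.OnNode(node);
}

// The handler must remember where it is between callbacks.
struct TitleHandler : Handler {
    bool insideTitle = false;
    std::string title;
    void OnNode(const Node& node) override {
        if (node.kind == NodeKind::StartElement) insideTitle = (node.value == "title");
        else if (node.kind == NodeKind::EndElement) insideTitle = false;
        else if (insideTitle) title += node.value;
    }
};

// Pull model (XmlReader/IXmlReader-style): the application drives a Read loop.
class PullReader {
public:
    explicit PullReader(const std::vector<Node>& nodes) : nodes_(nodes) {}
    bool Read(Node& node) {
        if (next_ >= nodes_.size()) return false;  // end of input
        node = nodes_[next_++];
        return true;
    }
private:
    const std::vector<Node>& nodes_;
    std::size_t next_ = 0;
};

// All state lives in local variables of one function.
std::string PullTitle(const std::vector<Node>& nodes) {
    PullReader reader(nodes);
    Node node{NodeKind::Text, std::string()};
    while (reader.Read(node)) {
        if (node.kind == NodeKind::StartElement && node.value == "title") {
            std::string title;
            while (reader.Read(node) && node.kind == NodeKind::Text) title += node.value;
            return title;
        }
    }
    return std::string();
}
```

In the push version the insideTitle flag is exactly the kind of extra state the paragraph above refers to; in the pull version the nested loop expresses the same logic directly.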

Thanks to its streamlined design, XmlLite delivers excellent performance, even compared with the MSXML SAX2 implementation. Although the SAX2 parser handles large documents better than the DOM does, it does not match XmlLite.

In short, XmlLite performs better than MSXML and is easier to use from native C++. MSXML remains the most practical solution for Visual Basic and COM-based scripting languages, but native Visual C++ finally has an XML parser designed specifically for it. XmlLite is included in Windows Vista and later, and an update is also available for the 32-bit and 64-bit versions of Windows XP and Windows Server 2003. Because no COM registration is involved, the update package should not cause the installation and versioning problems associated with MSXML.

COM "Lite"

XmlLite is not just a catchy name; it really is a lightweight XML parser. XmlLite takes the essence of COM, namely its programming discipline and conventions, and discards the complicated and potentially unnecessary parts, such as COM registration, runtime services, proxies, threading models, and marshaling.

XML readers and writers are created by functions exported from XmlLite.dll. To access them, link to XmlLite.lib and include the XmlLite.h header file from the Windows SDK. The resulting COM-style interfaces use the familiar IUnknown methods to manage lifetime. The COM IStream interface also plays a role: it represents storage. Beyond that, there is no COM dependency; you do not need to register any COM classes or even call the otherwise mandatory CoInitialize function. The Active Template Library (ATL) CComPtr class takes care of the little COM that remains. However, you do need to pay attention to thread safety, because XmlLite is not thread-safe; it trades thread safety for performance in single-threaded scenarios.

In the examples that follow, I use a COM_VERIFY macro to clearly mark where a method returns an HRESULT that needs to be checked. You can replace it with whatever error handling is appropriate, whether that means throwing an exception or returning the HRESULT yourself.
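
The article does not show how COM_VERIFY is defined, so the following is only one possible sketch, assuming you want failures to surface as C++ exceptions. The ComError type and ThrowIfFailed helper are hypothetical names introduced here. Outside Windows the block falls back to stand-in definitions of HRESULT, S_OK, S_FALSE, E_FAIL, and FAILED so that it compiles anywhere; on Windows the real definitions from windows.h are used.

```cpp
#include <stdexcept>
#include <string>

#ifdef _WIN32
#include <windows.h>  // real HRESULT, S_OK, S_FALSE, E_FAIL, FAILED
#else
// Stand-ins so the sketch compiles off Windows (assumptions, not SDK definitions).
typedef int HRESULT;
const HRESULT S_OK = 0;
const HRESULT S_FALSE = 1;
const HRESULT E_FAIL = -2147467259;  // 0x80004005 as a signed 32-bit value
#define FAILED(hr) ((hr) < 0)
#endif

// Hypothetical exception type that carries the failing HRESULT.
struct ComError : std::runtime_error {
    HRESULT code;
    ComError(HRESULT hr, const char* expression)
        : std::runtime_error(std::string("COM call failed: ") + expression),
          code(hr) {}
};

// Throws only for failure codes; success codes such as S_FALSE pass through.
inline void ThrowIfFailed(HRESULT hr, const char* expression) {
    if (FAILED(hr)) throw ComError(hr, expression);
}

// One possible definition of the macro used throughout the examples.
#define COM_VERIFY(expression) ThrowIfFailed((expression), #expression)
```

Note that this version deliberately lets S_FALSE through, since several XmlLite methods use it to report a normal condition such as the end of the stream rather than an error.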

Reading XML

XmlLite provides the CreateXmlReader function, which returns an implementation of the IXmlReader interface:

CComPtr<IXmlReader> reader;
COM_VERIFY(::CreateXmlReader(__uuidof(IXmlReader),
                             reinterpret_cast<void**>(&reader),
                             0));

Although optional, the CComPtr class template ensures that the interface pointer is released promptly.

CreateXmlReader accepts an interface identifier (IID) and a pointer to a void pointer. This is a common pattern in COM programming that lets the caller specify the type of interface pointer to return. My example uses the __uuidof operator, a Microsoft-specific keyword that extracts the GUID associated with a type; here it retrieves the IID of the interface. The last parameter of CreateXmlReader accepts an optional IMalloc implementation that lets the caller control memory allocation.

After creating a reader, you must tell it what storage to use as input. Storage is represented by the IStream interface, so XmlLite can work with any stream implementation you can devise:

CComPtr<IStream> stream;
// Create stream object here...
COM_VERIFY(reader->SetInput(stream));

(I discuss streams later in this article.)

With the XML reader's input set, you read from it by calling the Read method repeatedly. Read accepts an optional parameter that receives the node type on each successful call. Read returns S_OK to indicate that the next node was read successfully from the stream, and S_FALSE to indicate that the end of the stream has been reached. The following example shows how to enumerate the nodes in order:

HRESULT result = S_OK;
XmlNodeType nodeType = XmlNodeType_None;
while (S_OK == (result = reader->Read(&nodeType)))
{
    // Get node-specific info
}

To enumerate the attributes of the current node, use the MoveToFirstAttribute and MoveToNextAttribute methods. Both return S_OK if the reader was successfully repositioned and S_FALSE otherwise. The following example shows how to enumerate the attributes of a given node in order:

for (HRESULT result = reader->MoveToFirstAttribute();
     S_OK == result;
     result = reader->MoveToNextAttribute())
{
    // Get attribute-specific info
}

When IXmlReader's Read method is called, it automatically stores any node attributes in an internal collection. This lets you use the MoveToAttributeByName method to move the reader to a specific attribute by name. However, it is usually more efficient to enumerate the attributes and store them in an application-specific data structure. Note that you can also use the GetAttributeCount method to determine the number of attributes on the current node.
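
As a rough sketch of storing attributes in an application-specific data structure, the function template below copies the attributes of the current element into a std::map. CollectAttributes, AttributeMap, and FakeAttributeReader are hypothetical names introduced for this illustration. The template only assumes a reader type with MoveToFirstAttribute, MoveToNextAttribute, MoveToElement, GetLocalName, and GetValue methods shaped like those of IXmlReader; the fake reader and the non-Windows stand-in typedefs exist solely so the sketch can be compiled and tried without XmlLite.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

#ifdef _WIN32
#include <windows.h>
#include <xmllite.h>
#else
// Stand-ins so the sketch compiles off Windows (assumptions, not SDK definitions).
typedef int HRESULT;
typedef unsigned int UINT;
const HRESULT S_OK = 0;
const HRESULT S_FALSE = 1;
#endif

typedef std::map<std::wstring, std::wstring> AttributeMap;
typedef std::vector<std::pair<std::wstring, std::wstring> > AttributeList;

// Copies the attributes of the current element into a map keyed by local name.
template <typename Reader>
AttributeMap CollectAttributes(Reader& reader) {
    AttributeMap attributes;
    for (HRESULT hr = reader.MoveToFirstAttribute();
         S_OK == hr;
         hr = reader.MoveToNextAttribute()) {
        const wchar_t* name = nullptr;
        const wchar_t* value = nullptr;
        UINT nameLength = 0;
        UINT valueLength = 0;
        if (S_OK != reader.GetLocalName(&name, &nameLength)) break;
        if (S_OK != reader.GetValue(&value, &valueLength)) break;
        // Copy immediately: the pointers belong to the reader.
        attributes[std::wstring(name, nameLength)] = std::wstring(value, valueLength);
    }
    reader.MoveToElement();  // reposition on the element that owns the attributes
    return attributes;
}

// Hypothetical in-memory reader, used only to exercise the template.
class FakeAttributeReader {
public:
    explicit FakeAttributeReader(AttributeList attributes)
        : attributes_(std::move(attributes)), index_(0), onAttribute_(false) {}
    HRESULT MoveToFirstAttribute() {
        if (attributes_.empty()) return S_FALSE;
        index_ = 0;
        onAttribute_ = true;
        return S_OK;
    }
    HRESULT MoveToNextAttribute() {
        if (!onAttribute_ || index_ + 1 >= attributes_.size()) return S_FALSE;
        ++index_;
        return S_OK;
    }
    HRESULT MoveToElement() { onAttribute_ = false; return S_OK; }
    HRESULT GetLocalName(const wchar_t** text, UINT* length) {
        return Get(attributes_[index_].first, text, length);
    }
    HRESULT GetValue(const wchar_t** text, UINT* length) {
        return Get(attributes_[index_].second, text, length);
    }
private:
    static HRESULT Get(const std::wstring& source, const wchar_t** text, UINT* length) {
        *text = source.c_str();
        if (length) *length = static_cast<UINT>(source.size());
        return S_OK;
    }
    AttributeList attributes_;
    std::size_t index_;
    bool onAttribute_;
};
```

With a real reader held in a CComPtr, the call would presumably look like CollectAttributes(*reader) made while the reader is positioned on an element. Keying by local name alone ignores namespaces, which is a simplification of this sketch.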

Once the reader is positioned on a node or attribute, getting its information is easy. The following example shows how to get the namespace URI and local name of a given node:

PCWSTR namespaceUri = 0;
UINT namespaceUriLength = 0;
COM_VERIFY(reader->GetNamespaceUri(&namespaceUri,
                                   &namespaceUriLength));

PCWSTR localName = 0;
UINT localNameLength = 0;
COM_VERIFY(reader->GetLocalName(&localName,
                                &localNameLength));

All IXmlReader methods that return string values follow this pattern. The first parameter accepts a pointer to a constant wide-character pointer. The second parameter is optional; if it is nonzero, it receives the length of the string in characters (excluding the null terminator).

This is another example of the emphasis on performance. A string pointer returned from an IXmlReader method remains valid only until the reader moves to another node or the current node is invalidated in some other way (for example, by setting a new input stream or releasing the IXmlReader interface). In other words, IXmlReader does not return a copy of the string to the caller.

Unlike its counterpart in the .NET Framework, IXmlReader provides no methods for reading typed content. For example, if a particular element or attribute contains a number or a date, you must first get its string representation and then convert it as needed. Many other helper methods in the .NET Framework XmlReader class are also absent from IXmlReader, but they can be written as helper functions. XmlLite truly follows the C++ philosophy of minimal interface design.
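
A typed-content helper of the kind mentioned above might look like the following sketch. TryParseLong is a hypothetical name, not part of XmlLite; it takes the pointer and length pair that the IXmlReader string methods return and converts it to a long using the standard library. It copies the text first, which also sidesteps the pointer lifetime rule described earlier.

```cpp
#include <cerrno>
#include <cwchar>
#include <string>

// Hypothetical helper: converts a (pointer, length) string, as returned by
// methods such as GetValue, into a long. Returns false on empty input,
// trailing non-numeric characters, or overflow.
bool TryParseLong(const wchar_t* text, unsigned length, long& result) {
    if (text == nullptr || length == 0) return false;

    // Own copy: honors the length and guarantees null termination.
    const std::wstring copy(text, length);

    wchar_t* end = nullptr;
    errno = 0;
    const long value = std::wcstol(copy.c_str(), &end, 10);

    if (errno == ERANGE) return false;                      // out of range
    if (end != copy.c_str() + copy.size()) return false;    // not fully numeric

    result = value;
    return true;
}
```

A caller would typically pass the pointer and length obtained from the reader straight through, for example TryParseLong(value, valueLength, number), and treat a false return as malformed input. Dates and other types would need their own helpers along the same lines.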

Figure 2 shows the objects and abstractions involved in reading an XML document with IXmlReader. Keep in mind, however, that IStream can abstract any kind of storage; the file shown here is just a common example.

Figure 2 Reader

Writing XML

XmlLite provides the CreateXmlWriter function, which returns an implementation of the IXmlWriter interface:

CComPtr<IXmlWriter> writer;
COM_VERIFY(::CreateXmlWriter(__uuidof(IXmlWriter),
                             reinterpret_cast<void**>(&writer),
                             0));

After creating the writer, you must tell it what storage to use for output:

CComPtr<IStream> stream;
// Create stream object here
COM_VERIFY(writer->SetOutput(stream));

Before writing, you can modify the writer's properties. The XmlWriterProperty enumeration defines the available properties. For example, you may want to indent the XML output to make it easier for people to read, which you can do with the SetProperty method:

COM_VERIFY(writer->SetProperty(XmlWriterProperty_Indent, TRUE));

You can then use the IXmlWriter methods to write data to the underlying stream. XmlLite supports XML fragments. If you intend to write a complete XML document, start by calling the WriteStartDocument method, which writes the XML declaration. The declaration depends on the encoding in use; the default encoding is UTF-8, which should be appropriate in most cases. (Text encoding is covered later.) A variety of WriteXxx methods are provided for writing the various node types, attributes, and values.

Consider the following example:

COM_VERIFY(writer->WriteStartDocument(XmlStandalone_Omit));
COM_VERIFY(writer->WriteStartElement(0, L"html",
                                     L"http://www.w3.org/1999/xhtml"));
COM_VERIFY(writer->WriteStartElement(0, L"head", 0));
COM_VERIFY(writer->WriteElementString(0, L"title", 0, L"My Web Page"));
COM_VERIFY(writer->WriteEndElement()); // closes the head element

The WriteStartDocument method writes the XML declaration to the stream. It takes a single parameter, a value from the XmlStandalone enumeration, which indicates whether a standalone document declaration is written and, if so, what value it has. When writing XML fragments, the call to WriteStartDocument is usually omitted.

The WriteStartElement method accepts three parameters: the first specifies the element's optional namespace prefix, the second specifies the element's local name, and the third specifies the optional namespace URI. WriteElementString is one of the convenience methods provided by XmlLite. The following code, which writes the XHTML document title, is equivalent to the WriteElementString call in the previous example:

COM_VERIFY(writer->WriteStartElement(0, L"title", 0));
COM_VERIFY(writer->WriteString(L"My Web Page"));
COM_VERIFY(writer->WriteEndElement());

Clearly, the WriteElementString method is not strictly necessary, but it is certainly handy.

Finally, the WriteEndDocument method closes the document. You may have noticed that the body and html elements are not explicitly closed. WriteEndDocument automatically closes any elements that are still open. For that matter, releasing the writer also closes any remaining elements. However, relying on this can lead to bugs if you are not careful, because the lifetime of the stream usually differs from that of the writer. To make sure that everything written so far has reached the underlying stream, call IXmlWriter's Flush method.

Figure 3 shows the objects and abstractions involved in writing an XML document with IXmlWriter. Keep in mind that IStream can abstract any kind of storage; the file shown here is just a common example.

Figure 3 writer

Working with streams

So far I have not said much about streams. Unlike some more comprehensive XML libraries, XmlLite provides no functions for reading and writing data from common storage locations such as files or network protocols. Instead, you supply an IStream implementation for whatever storage you want to read from or write to. Implementing the IStream interface is not complicated, but in many cases you will not need to, because an implementation may already exist.

The CreateStreamOnHGlobal function provides an IStream implementation backed by virtual memory. The first parameter is an optional memory handle created with the GlobalAlloc function. However, you can simply pass zero and CreateStreamOnHGlobal will create the memory object for you. The following example creates an IStream implementation that is backed by system memory and grows dynamically as needed:

CComPtr<IStream> stream;
COM_VERIFY(::CreateStreamOnHGlobal(0, TRUE, &stream));

Releasing the stream frees the memory.

The SHCreateStreamOnFile function provides another useful IStream implementation. It creates an IStream backed by a file:

CComPtr<IStream> stream;
COM_VERIFY(::SHCreateStreamOnFile(L"D:\\Sample.xml",
                                  STGM_WRITE | STGM_SHARE_DENY_WRITE,
                                  &stream));

Text encoding when reading

XmlLite writes UTF-8 by default (which can be overridden), but when reading it tries to detect the text encoding. First, consider what you get automatically. For a given stream, IXmlReader detects encoding hints from the byte order mark, the byte sequence that can precede the XML. IXmlReader also honors any encoding specified in the XML declaration. These two features are expected of any XML parser. If an input stream may not carry any encoding information, and XmlLite cannot determine the encoding heuristically, you can direct IXmlReader to a specific encoding, given either a code page or an encoding name.

To do this, create an XML reader input object, represented by the IXmlReaderInput interface, instead of passing the stream directly to IXmlReader. Two functions create input objects that wrap an input stream. The CreateXmlReaderInputWithEncodingCodePage function accepts the encoding as a code page number. The CreateXmlReaderInputWithEncodingName function accepts the encoding by its canonical name. Apart from that, the two functions have identical signatures. Normally, you set the XML reader's input stream as follows:

CComPtr<IStream> stream;
// Create stream object here
COM_VERIFY(reader->SetInput(stream));

To override the encoding, change the code as follows:

CComPtr<IStream> stream;
// Create stream object here
CComPtr<IXmlReaderInput> input;
COM_VERIFY(::CreateXmlReaderInputWithEncodingName(stream,
                                                  0, // default allocator
                                                  L"ISO-8859-8",
                                                  TRUE, // hint
                                                  0, // base URI
                                                  &input));
COM_VERIFY(reader->SetInput(input));

The first parameter is the stream that the XML reader reads from. The second accepts an optional IMalloc implementation; if provided, it overrides the one used by the XML reader. The third specifies the encoding name. The documentation at msdn2.microsoft.com/ms752827.aspx lists the encodings supported natively. To support other encodings, you can supply an implementation of the IMultiLanguage2 interface. The next parameter indicates whether the specified encoding is mandatory or merely a hint. TRUE means the parser tries the suggested encoding but, if that fails, is free to determine the actual encoding itself. FALSE means the parser must use the suggested encoding and returns an error if it does not match the input stream. The next parameter accepts an optional base URI that can be used to resolve external entities. The last parameter receives the interface pointer of the input object, which you pass to the SetInput method.

Text encoding when writing

The XML writer determines which encoding to use from the object passed to the SetOutput method. If the object implements the IStream interface, or even just the more limited ISequentialStream interface, the XML writer uses UTF-8. You can override this behavior by creating an XML writer output object. Two functions create output objects that wrap an output stream. The CreateXmlWriterOutputWithEncodingCodePage function accepts the encoding as a code page number, while the CreateXmlWriterOutputWithEncodingName function accepts it by its canonical name. Apart from that, the two functions have identical signatures. Normally, you set the XML writer's output stream as follows:

CComPtr<IStream> stream;
// Create stream object here
COM_VERIFY(writer->SetOutput(stream));

To override the default encoding, write the following instead:

CComPtr<IStream> stream;
// Create stream object here
CComPtr<IXmlWriterOutput> output;
COM_VERIFY(::CreateXmlWriterOutputWithEncodingName(stream,
                                                   0,
                                                   L"ISO-8859-8",
                                                   &output));
COM_VERIFY(writer->SetOutput(output));

The first parameter is the stream that the XML writer writes to. The second accepts an optional IMalloc implementation; if provided, it overrides the one used by the XML writer. The third specifies the encoding name. The last parameter receives the interface pointer of the output object, which you pass to the SetOutput method.

Processing large data values

To limit memory usage when reading large data values, the XML reader provides a mechanism for reading a value in chunks. IXmlReader's ReadValueChunk method reads at most the specified maximum number of characters and advances through the value, so that each subsequent call continues where the previous one stopped. The following example shows how to call ReadValueChunk repeatedly to read a large data value:

CString value;
WCHAR chunk[256] = { 0 };
HRESULT result = S_OK;
UINT charsRead = 0;
while (S_OK == (result = reader->ReadValueChunk(chunk,
                                                _countof(chunk),
                                                &charsRead)))
{
    value.Append(chunk, charsRead);
}

ReadValueChunk returns S_FALSE when no more data is available. In this example, I append each chunk to a CString object only to illustrate how to handle the chunk lengths. Obviously, accumulating the whole value this way negates the benefit of reading in chunks.
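
To keep the benefit of chunked reading, each chunk can be handed to a consumer and then discarded instead of being accumulated. The sketch below shows that idea under some assumptions: ForEachValueChunk and FakeValueReader are hypothetical names, the template relies only on a ReadValueChunk method shaped like the one on IXmlReader, and the fake reader and non-Windows typedefs are there so the code can be tried without XmlLite.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

#ifdef _WIN32
#include <windows.h>
#else
// Stand-ins so the sketch compiles off Windows (assumptions, not SDK definitions).
typedef int HRESULT;
typedef unsigned int UINT;
const HRESULT S_OK = 0;
const HRESULT S_FALSE = 1;
#endif

// Passes the current value to `consume` one chunk at a time, so memory use is
// bounded by the chunk size. Returns the final HRESULT: S_FALSE when the value
// is exhausted, or whatever failure code the reader reported.
template <typename Reader, typename Consumer>
HRESULT ForEachValueChunk(Reader& reader, Consumer consume) {
    const UINT capacity = 256;
    wchar_t chunk[capacity];
    UINT charsRead = 0;
    HRESULT hr = S_OK;
    while (S_OK == (hr = reader.ReadValueChunk(chunk, capacity, &charsRead))) {
        consume(static_cast<const wchar_t*>(chunk), charsRead);
    }
    return hr;
}

// Hypothetical reader positioned on one large text value.
class FakeValueReader {
public:
    explicit FakeValueReader(std::wstring text) : text_(std::move(text)), offset_(0) {}
    HRESULT ReadValueChunk(wchar_t* buffer, UINT capacity, UINT* charsRead) {
        const std::size_t remaining = text_.size() - offset_;
        const std::size_t count = remaining < capacity ? remaining : capacity;
        std::copy(text_.data() + offset_, text_.data() + offset_ + count, buffer);
        offset_ += count;
        *charsRead = static_cast<UINT>(count);
        return count > 0 ? S_OK : S_FALSE;
    }
private:
    std::wstring text_;
    std::size_t offset_;
};
```

A consumer might hash the data, write it to a file, or count characters, as long as it does not keep every chunk. With a real reader held in a CComPtr, the call would presumably be ForEachValueChunk(*reader, consumer).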

Security considerations

XML-centric applications must routinely process XML from untrusted sources. XmlLite provides a number of facilities to protect applications from known and future vulnerabilities.

XML documents can contain references to external entities. Some XML parsers resolve these entities automatically. Although this can be useful, it can open security holes unless the XML resolver is carefully written to mitigate various threats. XmlLite neither resolves external entities automatically nor provides an XML resolver. To supply your own implementation, if you need one, implement the IXmlResolver interface and use IXmlReader's SetProperty method with the XmlReaderProperty_XmlResolver property to tell the reader to use your resolver.

XML documents can also contain DTD processing instructions. Although XmlLite does not support document validation (with either XML Schema or DTDs), it does support DTD entity expansion and default attributes. Because DTDs can contain references to external entities, they can expose your application to a variety of attacks. By default, XmlLite prohibits DTD processing. You can allow it by setting the XmlReaderProperty_DtdProcessing property to DtdProcessing_Parse. In addition, there is a built-in mitigation for DTD entity expansion attacks (also known as billion laughs attacks), controlled by the XmlReaderProperty_MaxEntityExpansion property. Its default value is 100,000.

Another way attackers can exploit an XML application is by creating documents with very long names. Left unchecked, this can consume large amounts of memory and enable denial-of-service (DoS) attacks. I have already hinted at ways to guard against this. One obvious mitigation is to read large data values in chunks, as described in the previous section. Another useful approach is to supply a custom IMalloc implementation that limits memory allocation. If the input stream supports random access, you can also use the XmlReaderProperty_RandomAccess property to tell the XML reader to avoid caching attributes. This reduces the memory needed to read a start element tag, but it may also slow parsing, because the parser must seek back and forth to retrieve attribute values when they are requested.

Finally, XML that is nested too deeply can quickly exhaust system resources. To prevent attackers from supplying documents with excessive nesting, use the XmlReaderProperty_MaxElementDepth property to limit the depth the parser allows. Its default value is 256.
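
The property settings discussed in this section can be gathered into one small helper that is applied before parsing untrusted input. The following is only a sketch: ApplyConservativeLimits and FakePropertyReader are hypothetical names, and the template assumes nothing more than a SetProperty method shaped like the one on IXmlReader. On Windows the property and value names come from XmlLite.h; elsewhere the block defines stand-in enumerations whose numeric values are placeholders for this sketch and are not taken from the SDK header.

```cpp
#include <cstdint>
#include <map>

#ifdef _WIN32
#include <windows.h>
#include <xmllite.h>
#else
// Stand-ins so the sketch compiles off Windows. The numeric values below are
// placeholders chosen for this sketch, not the values in XmlLite.h.
typedef int HRESULT;
typedef unsigned int UINT;
typedef std::intptr_t LONG_PTR;
const HRESULT S_OK = 0;
enum XmlReaderProperty {
    XmlReaderProperty_DtdProcessing = 1,
    XmlReaderProperty_MaxElementDepth = 2,
    XmlReaderProperty_MaxEntityExpansion = 3
};
enum DtdProcessing { DtdProcessing_Prohibit = 0, DtdProcessing_Parse = 1 };
#endif

// Prohibits DTD processing and tightens the depth and entity expansion limits.
// Stops at the first failing call and returns its HRESULT.
template <typename Reader>
HRESULT ApplyConservativeLimits(Reader& reader, UINT maxDepth, UINT maxEntityExpansion) {
    HRESULT hr = reader.SetProperty(XmlReaderProperty_DtdProcessing,
                                    DtdProcessing_Prohibit);
    if (hr < 0) return hr;

    hr = reader.SetProperty(XmlReaderProperty_MaxElementDepth,
                            static_cast<LONG_PTR>(maxDepth));
    if (hr < 0) return hr;

    return reader.SetProperty(XmlReaderProperty_MaxEntityExpansion,
                              static_cast<LONG_PTR>(maxEntityExpansion));
}

// Hypothetical reader that simply records what was set.
class FakePropertyReader {
public:
    HRESULT SetProperty(UINT property, LONG_PTR value) {
        properties[property] = value;
        return S_OK;
    }
    std::map<UINT, LONG_PTR> properties;
};
```

Suitable limits depend on the documents an application expects; a call such as ApplyConservativeLimits(*reader, 64, 1000) uses arbitrary example values that are lower than the defaults mentioned above.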

Summary

XmlLite provides a powerful XML parser for native C++ applications. It focuses on performance, is mindful of the system resources it uses, and offers a great deal of flexibility for controlling these characteristics. XmlLite supports all common text encodings and is a very useful utility that simplifies working with XML in native C++ applications. For more information, see the XmlLite documentation at msdn2.microsoft.com/ms752872.aspx.

Kenny Kerr is a software expert who specializes in Windows software development. He is passionate about writing and about teaching developers programming and software design. Reach Kenny at http://weblogs.asp.net/kennykerr.

