Parsing XML text with MSXML

Source: Internet
Author: User

I. Introduction

Most content on the Web today is written in HTML. HTML is a markup language, not a programming language, and its markup mainly describes how content is displayed rather than how the document content itself is structured. In other words, a machine cannot easily work out what the content means, and XML was created to address this.

XML (Extensible Markup Language) is a subset of SGML that retains SGML's main capabilities while greatly reducing its complexity. XML is designed to represent not only the content of a document but also its structure, so that the document can be understood by people and machines alike. XML requires adherence to strict rules. XML parsers are much less tolerant of syntax and structure errors than HTML browsers: an XML document must use syntax and structure correctly, whereas an HTML browser will guess at what a malformed document was supposed to contain in order to display it. This strictness makes XML parsers easier to implement and better in both performance and stability. An XML document is parsed the same way every time, unlike HTML, where different browsers may interpret and display the same page differently. And because the parser does not need to spend time repairing incomplete documents, it can do its job more efficiently than a comparable HTML parser. It can concentrate on building a tree from the tree structure already present in the document, rather than inferring structure from mixed markup in the input stream.

XML is a standard for data-processing applications in general, not just for web pages. Any type of application can be built on top of a parser, and the browser is only one small use of XML. Browsing remains very important, because it gives people working with XML a friendly tool for reading information, but for larger projects it is just a display window. Because XML has a strict syntax, it can even be used to define application-layer communication protocols; the Internet Open Trading Protocol (IOTP), for example, is defined in XML. In a sense, protocols and formats that used to be defined in BNF can in principle be defined in XML. In fact, with enough patience, one could use XML to write a specification of the C++ language.

XML lets authors freely define their own tags in an HTML-like style, but it is stricter about rules. XML has three main companion technologies: the DTD (Document Type Definition) or XML Schema, XSL (Extensible Stylesheet Language), and XLink (XML Linking Language). DTDs and XML Schemas define the logical structure of an XML document: which elements it may contain, the attributes of those elements, and the relationships among elements and attributes. Namespaces provide a uniform way to represent and integrate data across XML documents. XSL is the language used to specify how an XML document is presented, which keeps the data independent of its presentation. With XSL, for example, a web browser can change how a document is presented, such as the order in which data is displayed, without contacting the server. By changing the style sheet, the same document can be shown in larger type, collapsed to show only its outermost level, or turned into a print format. XLink extends the simple links that exist on the Web today.

II. How XML parsing is implemented

In theory, we could write an XML parser ourselves from the XML format definition, but Microsoft already provides one. If you have installed IE 5.0, you already have an XML parser installed. You can download the latest MSXML SDK and parser files from the Microsoft site (www.microsoft.com). The parser is a dynamic-link library called MSXML.DLL, and its latest version is MSXML3. It is a COM object library that encapsulates all the objects needed for XML parsing. Because COM objects are language-independent, reusable components in binary form, you can call them from any language (such as VB, VC, Delphi, C++ Builder, or even a scripting language) to parse XML documents in your application. The following introduction to the XML Document Object Model is based on Microsoft's latest version, MSXML3.

III. The XML Document Object Model (XML DOM)

The XML DOM provides a standard way to manipulate the information stored in an XML document; the DOM application programming interface (API) acts as a bridge between applications and XML documents. The DOM can be considered a standard architecture for connecting documents with applications (or scripting languages). The MSXML parser lets you load or create a document, collect error information about it, access and manipulate all the information and structure in it, and save it to an XML file. In short, the DOM gives users an interface for loading, navigating, manipulating, and serializing XML documents. It holds a complete representation of the XML document in memory, so any part of the document can be accessed at random. The DOM allows an application to manipulate the information in an XML document according to the logical structure produced by the MSXML parser.

XML is manipulated through the interfaces that MSXML provides. The MSXML parser reads an XML document and builds a DOM tree from it: a logical structure of nodes based on the document's content. The document itself is treated as a node that contains all the other nodes. The DOM lets users think of a document as a structured tree of information rather than a plain text stream, so an application or script can easily manipulate that structure without knowing the syntactic details of XML. The DOM rests on two key abstractions: a tree-like hierarchy, and a collection of nodes that represent the content and structure of the document. The tree hierarchy includes all of these nodes, and nodes can themselves contain other nodes. The benefit for developers is that they can find and modify the information in any node by navigating this hierarchy. The DOM treats a node as a generic object, so a script can load a document and then iterate over all the nodes, displaying information about those of interest. Note that nodes come in several specific types: elements, attributes, and text can all be nodes.
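The node-tree idea can be illustrated with a small, portable C++ sketch. This is a toy model and not the MSXML API: the names NodeKind, Node, collectNames, and buildSample are hypothetical and exist only for this illustration. It shows a document as a node that owns other nodes, and a depth-first walk that picks out the nodes of interest.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical node kinds; a real DOM defines more (comments, processing instructions, ...).
enum class NodeKind { Document, Element, Attribute, Text };

// A toy node: every part of the document is a node that may own child nodes.
struct Node {
    NodeKind kind;
    std::string name;   // element or attribute name, "#text" for text nodes
    std::string value;  // text content or attribute value
    std::vector<std::unique_ptr<Node>> children;

    Node(NodeKind k, std::string n, std::string v = "")
        : kind(k), name(std::move(n)), value(std::move(v)) {}

    // Append a child node and return a non-owning pointer to it.
    Node* add(NodeKind k, const std::string& n, const std::string& v = "") {
        children.push_back(std::make_unique<Node>(k, n, v));
        return children.back().get();
    }
};

// Visit every node depth-first and record the names of the nodes of one kind.
void collectNames(const Node& node, NodeKind wanted, std::vector<std::string>& out) {
    if (node.kind == wanted) out.push_back(node.name);
    for (const auto& child : node.children) collectNames(*child, wanted, out);
}

// Build the tree for <book><title>XML</title><author>User</author></book>.
std::unique_ptr<Node> buildSample() {
    auto doc = std::make_unique<Node>(NodeKind::Document, "#document");
    Node* book = doc->add(NodeKind::Element, "book");
    book->add(NodeKind::Element, "title")->add(NodeKind::Text, "#text", "XML");
    book->add(NodeKind::Element, "author")->add(NodeKind::Text, "#text", "User");
    return doc;
}
```

Calling collectNames(*buildSample(), NodeKind::Element, names) fills names with the element names in document order. A real parser would build this tree from text; here it is built by hand to keep the sketch short.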

Microsoft's MSXML parser reads an XML document and parses its contents into abstract information containers called nodes. These nodes represent the structure and content of the document and let an application read and manipulate the information in it without explicit knowledge of XML syntax. Once a document has been parsed, its nodes can be visited at any time and in any order. The most important programming object for developers is DOMDocument. The DOMDocument object exposes properties and methods for browsing, querying, and modifying the content and structure of an XML document. Every other object also exposes its own properties and methods, which can be used to collect information about an object instance, manipulate its value and structure, and navigate to other objects in the tree. The main COM interfaces in MSXML.DLL are:

(1) DOMDocument: The DOMDocument object is the foundation of the XML DOM. You use the properties and methods it exposes to browse, query, and modify the content and structure of an XML document. DOMDocument represents the top-level node of the tree. It implements all the basic methods of a DOM document and provides additional member functions to support XSL and XSLT. Once a document object has been created, all other objects can be obtained or created from it.

(2) IXMLDOMNode: IXMLDOMNode is the basic object of the Document Object Model (DOM). Elements, attributes, comments, processing instructions, and other document components can all be treated as IXMLDOMNode objects. In fact, the DOMDocument object itself is also an IXMLDOMNode object.

(3) IXMLDOMNodeList: IXMLDOMNodeList is a collection of node objects. Adding, deleting, or changing nodes is reflected in the collection immediately, and all of its nodes can be traversed with a "for...next" construct.

(4) IXMLDOMParseError: The IXMLDOMParseError interface returns detailed information about errors that occur during parsing, including the error number, line number, character position, and a textual description.

The following code shows how to create a DOMDocument object in VC.

// Headers assumed here: objbase.h for COM, msxml2.h from the MSXML SDK.
#include <objbase.h>
#include <msxml2.h>

HRESULT hr;
IXMLDOMDocument* pXMLDoc = NULL;
IXMLDOMNode* pXDN = NULL;

hr = CoInitialize(NULL);  // Initialize COM

// Get a pointer (pXMLDoc) to the IXMLDOMDocument interface.
hr = CoCreateInstance(CLSID_DOMDocument, NULL, CLSCTX_INPROC_SERVER,
                      IID_IXMLDOMDocument, (void**)&pXMLDoc);

// Get a pointer (pXDN) to the IXMLDOMNode interface, but only if the document was created.
if (SUCCEEDED(hr))
    hr = pXMLDoc->QueryInterface(IID_IXMLDOMNode, (void**)&pXDN);

When using the MSXML parser, we can call the document's createElement method to create a node, attach it to the document, and then save the XML file. The load and loadXML methods load an XML document: load reads it from a specified location such as a URL, while loadXML parses XML supplied as a string. Each method takes two parameters: the first (xmlSource for load) identifies the XML to be parsed, and the second, isSuccessful, reports whether the document was loaded successfully. The save method saves the document to a specified location. Its destination parameter identifies the kind of object to save to, which can be a file, an ASP Response object, an XML document object, or a custom object that supports persistence. A simple example of using the save method is available (see HTTP://WWW.SWM.COM.CN/SWM/200101/using MSXML to parse XML text).

During parsing we also need to get and set parsing flags. Different flags cause an XML document to be parsed in different ways. The XML standard allows a parser to be validating or non-validating, and allows a non-validating parser to skip fetching external resources. You might also set a flag to indicate whether extra white space should be stripped from the document. For this purpose, the DOMDocument object exposes the following properties, which let the user change the parser's behavior at run time:

(1) async (in C++ this is two methods, get_async and put_async)

(2) validateOnParse (in C++ this is two methods, get_validateOnParse and put_validateOnParse)

(3) resolveExternals (in C++ this is two methods, get_resolveExternals and put_resolveExternals)

(4) preserveWhiteSpace (in C++ this is two methods, get_preserveWhiteSpace and put_preserveWhiteSpace)

Each property accepts or returns a Boolean value. By default, async, validateOnParse, and resolveExternals are true. preserveWhiteSpace defaults to false, and white-space handling also depends on the document itself, through the xml:space attribute.

Information about the document can also be collected during parsing. The following properties are available:

(1) doctype (document type): the DTD that defines the document's format. If the XML document has no associated DTD, it returns NULL.

(2) implementation: the implementation object for the document, which indicates which version of XML the current document supports.

(3) parseError (parse error): the last error that occurred during parsing.

(4) readyState (state information): the loading state of the XML document. readyState matters when you use Microsoft's XML parser asynchronously to improve performance: while a document loads asynchronously, your program may need to check how far parsing has progressed. MSXML reports four states: loading, loaded, interactive (parsing in progress), and completed.

(5) url (Uniform Resource Locator): the URL of the XML document that was loaded and parsed. Note that this property returns a null value if the document was built in memory.
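The four readyState values can be captured in a small enumeration. The sketch below is portable C++ and does not call MSXML; the numeric codes 1 through 4 follow the values documented for the readyState property, while the names ReadyState, describeReadyState, and isLoadFinished are hypothetical helpers introduced only for illustration.

```cpp
#include <string>

// Numeric codes as documented for the MSXML readyState property.
enum class ReadyState : long {
    Loading = 1,      // load in progress, parsing has not started
    Loaded = 2,       // reading and parsing, object model not yet available
    Interactive = 3,  // partly parsed, object model available read-only
    Completed = 4     // loading has finished, successfully or not
};

// Hypothetical helper: turn a raw readyState code into a readable label.
std::string describeReadyState(long code) {
    switch (static_cast<ReadyState>(code)) {
        case ReadyState::Loading:     return "loading";
        case ReadyState::Loaded:      return "loaded";
        case ReadyState::Interactive: return "interactive";
        case ReadyState::Completed:   return "completed";
    }
    return "unknown";  // any code outside 1..4
}

// Hypothetical helper: an asynchronous caller should wait for this before
// relying on the whole tree.
bool isLoadFinished(long code) {
    return code == static_cast<long>(ReadyState::Completed);
}
```

Note that the completed state only means loading has ended. Whether it ended successfully still has to be checked through parseError.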

Once we have the document tree, we can operate on each node in it. There are two ways to obtain nodes from the tree: nodeFromID and getElementsByTagName. nodeFromID takes two parameters: the first, idString, is the ID value, and the second, node, returns an interface pointer to the node that matches that ID. Note that, according to the XML specification, each ID value must be unique within an XML document, and an element can have only one ID. getElementsByTagName also takes two parameters. The first, tagName, is the name of the elements to look for; if tagName is "*", all elements in the document are returned. The second, resultList, is a pointer to the IXMLDOMNodeList interface, which returns the collection of all nodes that match the tag name.
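The two lookup styles can be sketched on a toy element tree in portable C++. This is not the MSXML implementation: Element, elementsByTagName, elementFromId, and buildCatalog are hypothetical names, and the sketch searches only the descendants of the element it is given. It shows the "*" wildcard and the lookup of a single element by its unique ID.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A toy element with a tag name and an optional ID.
struct Element {
    std::string tag;
    std::string id;  // empty when the element has no ID
    std::vector<std::unique_ptr<Element>> children;

    Element(std::string t, std::string i = "")
        : tag(std::move(t)), id(std::move(i)) {}

    // Append a child element and return a non-owning pointer to it.
    Element* add(const std::string& t, const std::string& i = "") {
        children.push_back(std::make_unique<Element>(t, i));
        return children.back().get();
    }
};

// Collect, in document order, every descendant whose tag matches tagName.
// A tagName of "*" matches every element.
void elementsByTagName(const Element& root, const std::string& tagName,
                       std::vector<const Element*>& result) {
    for (const auto& child : root.children) {
        if (tagName == "*" || child->tag == tagName) result.push_back(child.get());
        elementsByTagName(*child, tagName, result);
    }
}

// Return the element carrying the given ID, or nullptr if none does.
const Element* elementFromId(const Element& root, const std::string& id) {
    if (!id.empty() && root.id == id) return &root;
    for (const auto& child : root.children) {
        if (const Element* hit = elementFromId(*child, id)) return hit;
    }
    return nullptr;
}

// Sample tree: a catalog with two books, identified as "b1" and "b2".
std::unique_ptr<Element> buildCatalog() {
    auto catalog = std::make_unique<Element>("catalog");
    Element* first = catalog->add("book", "b1");
    first->add("title");
    Element* second = catalog->add("book", "b2");
    second->add("title");
    second->add("price");
    return catalog;
}
```

On this sample, searching for "book" yields two elements, searching for "*" yields all five descendants, and elementFromId with "b2" returns the second book. The result list here is a plain snapshot; unlike IXMLDOMNodeList it does not update when the tree changes.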

A simple example is available (see HTTP://WWW.SWM.COM.CN/SWM/200101/using MSXML to parse XML text). Finally, let's look at how to create a new node, which is done with the createNode method. createNode takes four parameters: the first, type, is the type of node to create; the second, name, is the nodeName of the new node; the third, namespaceURI, is the namespace associated with the node; and the fourth, node, returns the newly created node. In other words, a node is created from the type, name, and namespace you supply. If a namespace is supplied, the node is created within that namespace; if not, it is created in the namespace of the document.

IV. A simple example of XML document analysis using MSXML

To illustrate how to use the XML DOM in VC, a simple sample program is available (see HTTP://WWW.SWM.COM.CN/SWM/200101/using MSXML to parse XML text). It is a console application whose main code locates a particular node in an XML document and inserts a new child node.
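The referenced program is not reproduced here. As a stand-in, the locate-and-insert step can be sketched in portable C++ on a toy tree. This is an illustration under stated assumptions, not the original program and not MSXML code: XmlElement, findFirst, appendChild, insertUnder, and serialize are hypothetical names, and serialize does not escape special characters.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// A toy element with a tag name, optional text, and child elements.
struct XmlElement {
    std::string tag;
    std::string text;
    std::vector<std::unique_ptr<XmlElement>> children;

    XmlElement(std::string t, std::string x = "")
        : tag(std::move(t)), text(std::move(x)) {}
};

// Depth-first search for the first element (including root) with the given tag.
XmlElement* findFirst(XmlElement& root, const std::string& tag) {
    if (root.tag == tag) return &root;
    for (auto& child : root.children) {
        if (XmlElement* hit = findFirst(*child, tag)) return hit;
    }
    return nullptr;
}

// Create a new element and attach it as the last child of parent.
XmlElement* appendChild(XmlElement& parent, const std::string& tag,
                        const std::string& text = "") {
    parent.children.push_back(std::make_unique<XmlElement>(tag, text));
    return parent.children.back().get();
}

// Locate parentTag and insert a new child under it; false if it is not found.
bool insertUnder(XmlElement& root, const std::string& parentTag,
                 const std::string& newTag, const std::string& text) {
    XmlElement* parent = findFirst(root, parentTag);
    if (parent == nullptr) return false;
    appendChild(*parent, newTag, text);
    return true;
}

// Write the tree back out as text (no escaping, no attributes).
std::string serialize(const XmlElement& element) {
    std::string out = "<" + element.tag + ">" + element.text;
    for (const auto& child : element.children) out += serialize(*child);
    out += "</" + element.tag + ">";
    return out;
}
```

For example, starting from a catalog element with one empty book child, insertUnder(root, "book", "title", "XML") returns true and serialize(root) then produces <catalog><book><title>XML</title></book></catalog>. With MSXML the same steps would go through the document's lookup, creation, and appendChild interfaces instead of these helpers.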

V. Summary

XML's syntax requirements are far stricter than HTML's, so writing and using an XML parser is much easier than writing an HTML parser. And because an XML document marks up not only how the document is displayed but also its structure and the nature of its information, an XML parser makes it easy to retrieve the information in a specific node and display or modify it, which simplifies working with and maintaining XML documents. XML is also an open architecture that does not depend on any single company, so XML-based applications can expect support from most software development platforms. In addition, mainstream vendors such as Microsoft are investing in the XML+COM combination: Microsoft's Office suite, web server and browser, and database product (SQL Server) have all begun to support XML-based applications. Using XML to customize an application's front end and COM to implement business objects and database objects gives a system greater flexibility, extensibility, and maintainability.
