First, I must admit that I like computer standards. If everyone complied with industry standards, the Internet would be a better medium. Standardized data exchange formats make open, platform-independent computing models feasible. This is why I am a fan of XML.
Fortunately, my favorite scripting language not only supports XML but keeps improving that support. PHP lets me quickly publish XML documents to the Internet, collect statistics about XML documents, and convert XML documents to other formats. For example, I often use PHP's XML processing capability to manage articles and books written in XML.
In this article, I will discuss how to use PHP's built-in Expat parser to process XML documents. An example shows how to work with Expat. The same example also shows you how to:
Create your own handler functions
Convert an XML file into your own PHP data structure
Introduction to Expat
An XML parser, also known as an XML processor, gives a program access to the structure and content of an XML document. Expat is the XML parser available to PHP scripts. It is also used in other projects, such as Mozilla, Apache, and Perl.
What is an event-based parser?
There are two basic types of XML parser:
Tree-based parser: converts an XML document into a tree structure. This type of parser analyzes the entire document and provides an API to access each element of the generated tree. The common standard for it is DOM (Document Object Model).
Event-based parser: treats an XML document as a series of events. When a specific event occurs, the parser calls the function that the developer has provided to handle it.
An event-based parser has a data-centric view of the XML document: it concentrates on the data in the document rather than on its structure. Such parsers process the document from start to end and report events, such as the start of an element, the end of an element, and the start of character data, to the application through callback functions. The following is a "Hello-World" XML document:
<greeting>
Hello World
</greeting>
The event-based parser reports three events:
Start Element: greeting
Start of character data, value: Hello World
End Element: greeting
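As a minimal sketch of how these events reach a script, the following code logs each event in an array. The $events variable and the anonymous handler functions are assumptions made for illustration, and the document is written on one line to keep whitespace out of the character data. Expat may deliver the character data of one element in more than one call, so the log can contain more than one character data entry.

```php
<?php
// Sketch: log the events Expat reports for the Hello-World document.
$events = [];

$parser = xml_parser_create();
// Keep element names exactly as written (case folding is on by default).
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$events) {
        $events[] = "Start Element: $name";
    },
    function ($parser, $name) use (&$events) {
        $events[] = "End Element: $name";
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$events) {
        $events[] = "Character data: $data";
    }
);

xml_parse($parser, '<greeting>Hello World</greeting>', true);

foreach ($events as $event) {
    echo $event, "\n";
}
```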
Unlike a tree-based parser, an event-based parser does not build a structure that describes the document. When it reports character data, the event-based parser gives you no information about the parent element greeting.
However, it provides lower-level access, which allows better resource utilization and faster processing. There is no need to load the entire document into memory. In fact, the document can be larger than the available memory.
Expat is such an event-based parser. Of course, with Expat you can still build a complete native tree structure in PHP if you need one.
The preceding Hello-World example is well-formed XML. However, it is not valid, because no DTD (Document Type Definition) is associated with it or embedded in it.
For Expat, this makes no difference: Expat is a non-validating parser, so it ignores any DTD associated with the document. Note, however, that the document must still be well-formed; otherwise Expat (like any other XML-compliant parser) stops with an error message.
As a non-validating parser, Expat is fast and lightweight, which makes it well suited to Internet applications.
Compiling Expat
Expat can be compiled into PHP 3.0.6 or later. Since Apache 1.3.9, Expat has been part of Apache. On Unix systems, configure PHP with the "--with-xml" option to compile Expat into PHP.
If you compile PHP as an Apache module, PHP uses the Expat that ships with Apache by default. On Windows, you must load the XML dynamic link library.
XML example: XMLstats
One way to understand the Expat functions is through an example. The example we will discuss uses Expat to collect statistics about an XML document.
For each element in the document, the following information is output:
The number of times the element is used in the document
The number of characters in the element
The parent elements of the element
The child elements of the element
Note: For demonstration, we use PHP to generate a structure to save the parent and child elements of an element.
Preparation
The function that creates an XML parser instance is xml_parser_create(). All subsequent function calls use this instance. The idea is very similar to the link identifier used by PHP's MySQL functions. Before parsing a document, an event-based parser usually requires you to register callback functions, which it calls when specific events occur. Expat is no exception. It defines the following seven possible events:
Object                    XML parsing function                        Description
Element                   xml_set_element_handler()                   Start and end of an element
Character data            xml_set_character_data_handler()            Start of character data
External entity           xml_set_external_entity_ref_handler()       An external entity reference appears
Unparsed external entity  xml_set_unparsed_entity_decl_handler()      An unparsed external entity appears
Processing instruction    xml_set_processing_instruction_handler()    A processing instruction appears
Notation declaration      xml_set_notation_decl_handler()             A notation declaration appears
Default                   xml_set_default_handler()                   Any event for which no handler is specified
Every callback function must take the parser instance as its first parameter (followed by other parameters).
See the sample script at the end of this article. Note that it uses both an element handler and a character data handler. The element callback handlers are registered through xml_set_element_handler().
This function requires three parameters:
Parser instance
Name of the callback function for processing the Start Element
Name of the callback function for processing the End Element
The callback functions must exist by the time the XML document is parsed, and they must be defined with the prototypes described in the PHP manual.
For example, Expat passes three parameters to the start element handler. In the sample script, it is defined as follows:
function start_element($parser, $name, $attrs)
The first parameter is the parser identifier, the second is the name of the element being started, and the third is an array containing all of the element's attributes and their values.
Once you start parsing the XML document, Expat calls your start_element() function with these parameters every time it encounters the start of an element.
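The registration step can be sketched as follows. The start_element() prototype comes from the text above; the end_element() body, the echo output, and the sample document are assumptions made for illustration. Because case folding is enabled by default, the element names are reported in uppercase here.

```php
<?php
// Called by Expat for every start tag; $attrs maps attribute names to values.
function start_element($parser, $name, $attrs)
{
    echo "start: $name\n";
    foreach ($attrs as $attr => $value) {
        echo "  attribute $attr = $value\n";
    }
}

// Called by Expat for every end tag.
function end_element($parser, $name)
{
    echo "end: $name\n";
}

$parser = xml_parser_create();
// Three parameters: the parser instance and the two callback names.
xml_set_element_handler($parser, 'start_element', 'end_element');

// The sample input is an assumption for illustration.
$ok = xml_parse($parser, '<book id="42"><title/></book>', true);
```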
The XML case folding option
Use the xml_parser_set_option() function to disable the case folding option. This option is enabled by default, so element names passed to the handler functions are automatically converted to uppercase. However, XML is case sensitive, which matters a great deal when collecting statistics about XML documents. For our example, the case folding option must be disabled.
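To illustrate the effect of the option, the following sketch parses the same document twice, once with case folding enabled and once with it disabled. The helper function element_names() and the sample document are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical helper: return the element names Expat reports for $xml.
function element_names($xml, $case_folding)
{
    $names = [];
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $case_folding ? 1 : 0);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attrs) use (&$names) {
            $names[] = $name;   // record each start tag name
        },
        function ($parser, $name) {
            // end tags are not needed for this illustration
        }
    );
    xml_parse($parser, $xml, true);
    return $names;
}

$xml = '<Book><title/></Book>';
print_r(element_names($xml, true));   // names folded to uppercase
print_r(element_names($xml, false));  // names exactly as written
```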
Parsing the document
With all the preparations complete, the script can finally parse the XML document:
xml_parse_from_file() is a custom function that opens the file named in its parameter and parses it in 4 KB chunks.
xml_parse(), like xml_parse_from_file(), returns a false value when an error occurs, that is, when the XML document is not well-formed.
You can use the xml_get_error_code() function to get the numeric code of the last error. Pass this code to the xml_error_string() function to get the error message as text.
The script also outputs the current line number in the XML document to make debugging easier.
The parser calls the callback functions during parsing.
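The steps above can be sketched as follows. This xml_parse_from_file() is a hypothetical reconstruction based only on the description in this section, not the function from the sample script, and the temporary file with its deliberately broken content is an assumption made for illustration.

```php
<?php
// Hypothetical reconstruction of a chunked file parser; returns true on
// success and false on any I/O or XML error.
function xml_parse_from_file($parser, $file)
{
    $fp = @fopen($file, 'r');
    if ($fp === false) {
        return false;
    }
    while (!feof($fp)) {
        $data = fread($fp, 4096);   // read the file in 4 KB chunks
        if ($data === false) {
            fclose($fp);
            return false;
        }
        // The third argument tells the parser whether this is the last chunk.
        if (!xml_parse($parser, $data, feof($fp))) {
            fclose($fp);
            return false;
        }
    }
    fclose($fp);
    return true;
}

// Usage sketch: parse a temporary file that is not well-formed.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($file, "<book>\n<title>\n</book>");

$parser = xml_parser_create();
if (!xml_parse_from_file($parser, $file)) {
    printf(
        "XML error: %s at line %d\n",
        xml_error_string(xml_get_error_code($parser)),
        xml_get_current_line_number($parser)
    );
}
unlink($file);
```

Passing feof($fp) as the third argument of xml_parse() marks the final chunk, so the parser can report a document that ends before all of its elements are closed.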
Describing the document structure
When you parse a document with Expat, one question needs particular attention: how do you maintain a basic description of the document structure?
As mentioned above, the event-based parser itself does not generate any structure information.
The tag structure is an important feature of XML. For example, the element sequence <book><title> means something different from <figure><title>. Any author will tell you that a book title and a figure title have nothing to do with each other, even though both use the term "title". Therefore, to process XML effectively with an event-based parser, you must use your own stacks or lists to maintain the structural information of the document.
To build an image of the document structure, the script must at least know the parent element of the current element. The Expat API cannot provide this. It only reports events for the current element, without any information about the surrounding context. Therefore, you need to build your own stack structure.
The sample script uses a first-in, last-out (FILO) stack. The stack is an array that holds all the elements that have been started. In the start element handler, the array_push() function pushes the current element onto the top of the stack. Correspondingly, the end element handler removes the top element with array_pop().
For the sequence <book><title></title></book>, the stack is filled as follows:
Start Element book: assign "book" to the first element of the stack ($stack[0]).
Start Element title: assign "title" to the top of the stack ($stack[1]).
End Element title: remove the top element from the stack ($stack[1]).
End Element book: remove the top element from the stack ($stack[0]).
The PHP 3.0 version of the example uses a $depth variable to track element nesting manually, which makes the script look more complicated. The PHP 4.0 version uses the array_pop() and array_push() functions, which makes the script more concise.
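A minimal sketch of this stack handling follows. The $snapshots array and the anonymous handler functions are assumptions added to make the stack contents visible after each event; they are not part of the sample script.

```php
<?php
$stack = [];       // names of the elements that are currently open
$snapshots = [];   // stack contents after each event, for illustration

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$stack, &$snapshots) {
        array_push($stack, $name);             // entering the element
        $snapshots[] = implode('/', $stack);
    },
    function ($parser, $name) use (&$stack, &$snapshots) {
        array_pop($stack);                     // leaving the element
        $snapshots[] = implode('/', $stack);
    }
);
xml_parse($parser, '<book><title></title></book>', true);

print_r($snapshots);
```

Inside a start element handler, the parent of the current element is whatever sits on top of the stack just before the push.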
Collecting data
To collect information about each element, the script needs to remember the events for each element. A global array variable, $elements, holds all the different elements in the document. Each array entry is an instance of the element class, which has four properties (class variables):
$count: the number of times the element is found in the document
$chars: the number of bytes of character data in the element
$parents: the parent elements
$childs: the child elements
As you can see, it is easy to store class instances in arrays.
Note: One feature of PHP is that you can traverse an entire class structure with a while (list() = each()) loop, just as you would traverse an array. All class variables (and, in PHP 3.0, method names) are output as strings.
When an element is found, we need to increment its counter to track how many times it appears in the document. That is, we add one to the $count property of the corresponding $elements entry.
We also need to let the parent element know that the current element is its child. Therefore, the name of the current element is added as an entry to the parent element's $childs array. Finally, the current element should remember its parent. Therefore, the parent element is added as an entry to the current element's $parents array.
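The bookkeeping described above might be sketched as follows. The Element class, the use of closures in place of global variables, the counting of how often each parent and child occurs, and the sample document are all assumptions made for illustration; the sample script may organize this differently.

```php
<?php
// Hypothetical class mirroring the four properties described above.
class Element
{
    public $count = 0;      // occurrences of the element
    public $chars = 0;      // bytes of character data inside the element
    public $parents = [];   // parent name => number of times seen
    public $childs = [];    // child name => number of times seen
}

$elements = [];   // element name => Element instance
$stack = [];      // names of the elements that are currently open

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, $name, $attrs) use (&$elements, &$stack) {
        if (!isset($elements[$name])) {
            $elements[$name] = new Element();
        }
        $elements[$name]->count++;
        if ($stack) {
            $parent = end($stack);
            // Tell the parent about its child, and the child about its parent.
            $elements[$parent]->childs[$name] =
                ($elements[$parent]->childs[$name] ?? 0) + 1;
            $elements[$name]->parents[$parent] =
                ($elements[$name]->parents[$parent] ?? 0) + 1;
        }
        array_push($stack, $name);
    },
    function ($parser, $name) use (&$stack) {
        array_pop($stack);
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$elements, &$stack) {
        if ($stack) {
            // Credit the character data to the innermost open element.
            $elements[end($stack)]->chars += strlen($data);
        }
    }
);

xml_parse($parser, '<book><title>PHP</title><title>XML</title></book>', true);
print_r($elements);
```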
Displaying statistics
The remaining code loops over the $elements array and its subarrays to display the statistics. This is the simplest kind of nested loop. Although it outputs the correct results, the code is neither concise nor clever. It is just the sort of loop you might write every day to get your work done.
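Such a nested loop might look like the following sketch. The sample data, the format_statistics() helper, and the output layout are assumptions made for illustration; in a real run the $elements array would be filled by the parser callbacks.

```php
<?php
// Sample data (an assumption for illustration); a real run would fill
// $elements from the parser callbacks.
$elements = [
    'book'  => (object) ['count' => 1, 'chars' => 0,
                         'parents' => [], 'childs' => ['title' => 2]],
    'title' => (object) ['count' => 2, 'chars' => 6,
                         'parents' => ['book' => 2], 'childs' => []],
];

// Hypothetical helper: build the report as a string with a nested loop.
function format_statistics($elements)
{
    $out = '';
    foreach ($elements as $name => $element) {
        $out .= "Element: $name\n";
        $out .= "  count: {$element->count}\n";
        $out .= "  chars: {$element->chars}\n";
        // Inner loops walk the two subarrays of each element.
        foreach (['parents', 'childs'] as $relation) {
            foreach ($element->$relation as $other => $times) {
                $out .= "  $relation: $other ($times)\n";
            }
        }
    }
    return $out;
}

echo format_statistics($elements);
```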
The script example is designed to be called through the command line in CGI Mode of PHP. Therefore