System Programmer Growth Plan: Text Processing (I) | Sunday, June 7th, 2009 | Author: admin
Please indicate the source and the author's contact information when reproduced.
Article Source: Http://www.limodev.cn/blog
Author contact information: Li Xianjing <xianjimli at hotmail Dot com>
State Machine (4)
XML Parser
XML (Extensible Markup Language) is a common data file format. It is considerably more complex than INI: an INI file can only hold linearly structured data, whereas XML can store tree-structured data. Let's look at the following example:
<?xml version="1.0" encoding="UTF-8"?>
<mime-type xmlns="http://www.freedesktop.org/standards/shared-mime-info" type="all/all">
<!--Created automatically by update-mime-database. DO NOT EDIT!-->
<comment>all files and folders</comment>
</mime-type>
The first line is a processing instruction (PI) addressed to the parser. It tells the parser that this XML file follows the XML 1.0 specification and that its content is encoded in UTF-8.
The second line is a start tag whose name is mime-type. It has two attributes: the first is named xmlns and has the value http://www.freedesktop.org/standards/shared-mime-info; the second is named type and has the value all/all.
The third line is a comment.
The fourth line contains a start tag, text, and an end tag.
The fifth line is an end tag.
The XML format itself is not the focus of this article, so we do not discuss it in detail. The focus here is how to parse complex data with a state machine.
We use the same method as before: read the data into a buffer, point a pointer at the head of the buffer, and advance it until it reaches the tail. Along the way the pointer may be pointing at a start tag, an end tag, a comment, a processing instruction, or text. These give us the main states of the state machine:
1. "Start tag" state
2. "End tag" state
3. "Comment" state
4. "Processing instruction" state
5. "Text" state
Because start tags, end tags, comments, and processing instructions all sit between '<' and '>', we cannot tell which of them we are in at the moment '<' is read. For ease of handling we introduce an intermediate state called the "less-than" state. After reading '<' and '!', two '-' characters must still be read before we know we are in a comment, so we introduce two more intermediate states, "pre-comment 1" and "pre-comment 2". Finally, a "none" state indicates that the parser is in none of the above states.
State transitions:
1. In the "none" state, reading '<' enters the "less-than" state.
2. In the "none" state, reading a character that is neither '<' nor whitespace enters the "text" state.
3. In the "less-than" state, reading '!' enters the "pre-comment 1" state.
4. In the "less-than" state, reading '?' enters the "processing instruction" state.
5. In the "less-than" state, reading '/' enters the "end tag" state.
6. In the "less-than" state, reading a valid identifier character enters the "start tag" state.
7. In the "pre-comment 1" state, reading '-' enters the "pre-comment 2" state.
8. In the "pre-comment 2" state, reading '-' enters the "comment" state.
9. When the "start tag", "end tag", "text", "comment", or "processing instruction" state finishes, the machine returns to the "none" state.
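The rules above can also be written as a small pure function, which makes them easy to check in isolation. The following is only a minimal sketch: the names TopState, next_state, and classify are hypothetical and do not come from the original parser, and rule 9 is reduced to "return to none" because the sub-parsers are not modelled here.

```c
#include <ctype.h>

/* Hypothetical state names for illustration only. */
typedef enum {
    ST_NONE, ST_AFTER_LT, ST_START_TAG, ST_END_TAG, ST_TEXT,
    ST_PRE_COMMENT1, ST_PRE_COMMENT2, ST_COMMENT, ST_PI
} TopState;

/* Return the state entered after reading c while in state s (rules 1-8). */
static TopState next_state(TopState s, char c)
{
    switch (s) {
    case ST_NONE:
        if (c == '<') return ST_AFTER_LT;
        if (!isspace((unsigned char)c)) return ST_TEXT;
        return ST_NONE;
    case ST_AFTER_LT:
        if (c == '!') return ST_PRE_COMMENT1;
        if (c == '?') return ST_PI;
        if (c == '/') return ST_END_TAG;
        if (isalpha((unsigned char)c) || c == '_') return ST_START_TAG;
        return ST_AFTER_LT;
    case ST_PRE_COMMENT1:
        return c == '-' ? ST_PRE_COMMENT2 : ST_PRE_COMMENT1;
    case ST_PRE_COMMENT2:
        return c == '-' ? ST_COMMENT : ST_PRE_COMMENT2;
    default:
        /* Rule 9: a real parser runs a sub-parser here, then returns to "none". */
        return ST_NONE;
    }
}

/* Feed a short prefix through the machine and return the state reached. */
static TopState classify(const char *s)
{
    TopState st = ST_NONE;
    for (; *s != '\0'; s++) {
        st = next_state(st, *s);
    }
    return st;
}
```

For example, classify("<!--") walks through "less-than", "pre-comment 1", and "pre-comment 2" and ends in ST_COMMENT, while classify("</") ends in ST_END_TAG.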
The graphical representation of this state machine is as follows:
Let's take a look at the code implementation:
void xml_parser_parse(XmlParser* thiz, const char* xml)
{
    /* Enumeration of the top-level states. */
    enum _State
    {
        STAT_NONE,
        STAT_AFTER_LT,
        STAT_START_TAG,
        STAT_END_TAG,
        STAT_TEXT,
        STAT_PRE_COMMENT1,
        STAT_PRE_COMMENT2,
        STAT_COMMENT,
        STAT_PROCESS_INSTRUCTION,
    }state = STAT_NONE;

    thiz->read_ptr = xml;

    /* Move the pointer from the head of the buffer to the tail. */
    for(; *thiz->read_ptr != '\0'; thiz->read_ptr++)
    {
        char c = thiz->read_ptr[0];

        switch(state)
        {
            case STAT_NONE:
            {
                if(c == '<')
                {
                    /* In the "none" state, reading '<' enters the "less-than" state. */
                    xml_parser_reset_buffer(thiz);
                    state = STAT_AFTER_LT;
                }
                else if(!isspace((unsigned char)c))
                {
                    /* In the "none" state, reading a character that is neither '<' nor whitespace enters the "text" state. */
                    state = STAT_TEXT;
                }
                break;
            }
            case STAT_AFTER_LT:
            {
                if(c == '?')
                {
                    /* In the "less-than" state, reading '?' enters the "processing instruction" state. */
                    state = STAT_PROCESS_INSTRUCTION;
                }
                else if(c == '/')
                {
                    /* In the "less-than" state, reading '/' enters the "end tag" state. */
                    state = STAT_END_TAG;
                }
                else if(c == '!')
                {
                    /* In the "less-than" state, reading '!' enters the "pre-comment 1" state. */
                    state = STAT_PRE_COMMENT1;
                }
                else if(isalpha((unsigned char)c) || c == '_')
                {
                    /* In the "less-than" state, reading a valid identifier character enters the "start tag" state. */
                    state = STAT_START_TAG;
                }
                else
                {
                    /* Any other character is ignored. */
                }
                break;
            }
            case STAT_START_TAG:
            {
                /* Enter the sub-state machine. */
                xml_parser_parse_start_tag(thiz);
                state = STAT_NONE;
                break;
            }
            case STAT_END_TAG:
            {
                /* Enter the sub-state machine. */
                xml_parser_parse_end_tag(thiz);
                state = STAT_NONE;
                break;
            }
            case STAT_PROCESS_INSTRUCTION:
            {
                /* Enter the sub-state machine. */
                xml_parser_parse_pi(thiz);
                state = STAT_NONE;
                break;
            }
            case STAT_TEXT:
            {
                /* Enter the sub-state machine. */
                xml_parser_parse_text(thiz);
                state = STAT_NONE;
                break;
            }
            case STAT_PRE_COMMENT1:
            {
                if(c == '-')
                {
                    /* In the "pre-comment 1" state, reading '-' enters the "pre-comment 2" state. */
                    state = STAT_PRE_COMMENT2;
                }
                else
                {
                    /* Any other character is ignored. */
                }
                break;
            }
            case STAT_PRE_COMMENT2:
            {
                if(c == '-')
                {
                    /* In the "pre-comment 2" state, reading '-' enters the "comment" state. */
                    state = STAT_COMMENT;
                }
                else
                {
                    /* Any other character is ignored. */
                }
                /* No break: control falls through to STAT_COMMENT.
                 * xml_parser_parse_comment() pre-increments read_ptr,
                 * which skips the character just read. */
            }
            case STAT_COMMENT:
            {
                /* Enter the sub-state machine. */
                xml_parser_parse_comment(thiz);
                state = STAT_NONE;
                break;
            }
            default: break;
        }

        /* A sub-parser may have reached the terminator; do not step past it. */
        if(*thiz->read_ptr == '\0')
        {
            break;
        }
    }

    return;
}
Parsing does not end here. States such as "start tag" and "processing instruction" are not atomic: they contain sub-states such as the tag name, attribute names, and attribute values, and must be decomposed further. When we consider a sub-state we can forget the context around it and think only about the sub-state itself, which simplifies the problem. Let's look at the state machine for the start tag.
Suppose we want to parse a start tag like this:
<mime-type xmlns="http://www.freedesktop.org/standards/shared-mime-info" type="all/all">
How should we do it? Again we use the earlier method: point a pointer at the head of the buffer and advance it until it reaches the tail. Along the way the pointer may be pointing at the tag name, an attribute name, or an attribute value. These give us the main states of the state machine:
1. "Tag name" state
2. "Attribute name" state
3. "Attribute value" state
For ease of handling we introduce two more intermediate states: "before attribute name" and "before attribute value".
State transitions:
The initial state is the "tag name" state.
1. In the "tag name" state, reading whitespace enters the "before attribute name" state.
2. In the "tag name" state, reading '/' or '>' enters the "end" state.
3. In the "before attribute name" state, reading any other non-whitespace character enters the "attribute name" state.
4. In the "attribute name" state, reading '=' enters the "before attribute value" state.
5. In the "before attribute value" state, reading '"' enters the "attribute value" state.
6. In the "attribute value" state, reading '"' completes one attribute name/value pair and returns to the "before attribute name" state.
7. In the "before attribute name" state, reading '/' or '>' enters the "end" state.
Because a processing instruction (PI) also contains attributes, we extract attribute parsing into its own sub-state so that it can be reused. The graphical representation of the "start tag" state is as follows:
Let's look at the code implementation:
static void xml_parser_parse_attrs(XmlParser* thiz, char end_char)
{
    int i = 0;
    enum _State
    {
        STAT_PRE_KEY,
        STAT_KEY,
        STAT_PRE_VALUE,
        STAT_VALUE,
        STAT_END,
    }state = STAT_PRE_KEY;

    char value_end = '\"';
    const char* start = thiz->read_ptr;

    thiz->attrs_nr = 0;
    for(; *thiz->read_ptr != '\0' && thiz->attrs_nr < MAX_ATTR_NR; thiz->read_ptr++)
    {
        char c = *thiz->read_ptr;

        switch(state)
        {
            case STAT_PRE_KEY:
            {
                if(c == end_char || c == '>')
                {
                    /* In the "before attribute name" state, reading end_char ('/' for a start tag, '?' for a PI) or '>' enters the "end" state. */
                    state = STAT_END;
                }
                else if(!isspace((unsigned char)c))
                {
                    /* In the "before attribute name" state, reading any other non-whitespace character enters the "attribute name" state. */
                    state = STAT_KEY;
                    start = thiz->read_ptr;
                }
                break;
            }
            case STAT_KEY:
            {
                if(c == '=')
                {
                    /* In the "attribute name" state, reading '=' enters the "before attribute value" state. */
                    thiz->attrs[thiz->attrs_nr++] = (char*)xml_parser_strdup(thiz, start, thiz->read_ptr - start);
                    state = STAT_PRE_VALUE;
                }
                break;
            }
            case STAT_PRE_VALUE:
            {
                /* In the "before attribute value" state, reading a double or single quote enters the "attribute value" state. */
                if(c == '\"' || c == '\'')
                {
                    state = STAT_VALUE;
                    value_end = c;
                    start = thiz->read_ptr + 1;
                }
                break;
            }
            case STAT_VALUE:
            {
                /* In the "attribute value" state, reading the matching quote completes one name/value pair and returns to the "before attribute name" state. */
                if(c == value_end)
                {
                    thiz->attrs[thiz->attrs_nr++] = (char*)xml_parser_strdup(thiz, start, thiz->read_ptr - start);
                    state = STAT_PRE_KEY;
                }
                break;
            }
            default: break;
        }

        if(state == STAT_END)
        {
            break;
        }
    }

    /* The entries stored above are treated as offsets into thiz->buffer; convert them to pointers. */
    for(i = 0; i < thiz->attrs_nr; i++)
    {
        thiz->attrs[i] = thiz->buffer + (size_t)(thiz->attrs[i]);
    }
    thiz->attrs[thiz->attrs_nr] = NULL;

    return;
}
Note that XML allows attribute values to be delimited by either single or double quotes, so the code records the opening quote in value_end and ends the value only when the same quote character is read again.
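To try the quote handling in isolation, here is a minimal self-contained sketch of the same five-state attribute machine. It is an illustration, not the original parser: AttrList, copy_span, parse_attrs, and the fixed-size arrays are hypothetical stand-ins for XmlParser, xml_parser_strdup, and thiz->buffer, and names are not trimmed.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAIRS 8    /* hypothetical limits chosen for the sketch */
#define MAX_LEN   64

typedef struct {
    char name[MAX_PAIRS][MAX_LEN];
    char value[MAX_PAIRS][MAX_LEN];
    int  count;
} AttrList;

/* Copy len bytes from start into dst, truncating to fit, and terminate. */
static void copy_span(char *dst, const char *start, size_t len)
{
    if (len >= MAX_LEN) len = MAX_LEN - 1;
    memcpy(dst, start, len);
    dst[len] = '\0';
}

/* Parse name="value" or name='value' pairs until '/', '>' or end of string. */
static void parse_attrs(const char *s, AttrList *out)
{
    enum { PRE_KEY, KEY, PRE_VALUE, VALUE, END } state = PRE_KEY;
    const char *start = s;
    char value_end = '"';

    out->count = 0;
    for (; *s != '\0' && state != END && out->count < MAX_PAIRS; s++) {
        char c = *s;
        switch (state) {
        case PRE_KEY:
            if (c == '/' || c == '>') {
                state = END;
            } else if (!isspace((unsigned char)c)) {
                state = KEY;
                start = s;
            }
            break;
        case KEY:
            if (c == '=') {
                copy_span(out->name[out->count], start, (size_t)(s - start));
                state = PRE_VALUE;
            }
            break;
        case PRE_VALUE:
            if (c == '"' || c == '\'') {
                value_end = c;          /* remember which quote opened the value */
                start = s + 1;
                state = VALUE;
            }
            break;
        case VALUE:
            if (c == value_end) {       /* only the matching quote closes it */
                copy_span(out->value[out->count], start, (size_t)(s - start));
                out->count++;
                state = PRE_KEY;
            }
            break;
        default:
            break;
        }
    }
}
```

Given the input type='all/all' note="it's" >, this sketch yields two pairs: type with the value all/all, and note with the value it's. The apostrophe inside the second value does not end it, because that value was opened with a double quote.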
static void xml_parser_parse_start_tag(XmlParser* thiz)
{
    enum _State
    {
        STAT_NAME,
        STAT_ATTR,
        STAT_END,
    }state = STAT_NAME;

    char* tag_name = NULL;
    /* The caller has already advanced past the first character of the tag name. */
    const char* start = thiz->read_ptr - 1;

    for(; *thiz->read_ptr != '\0'; thiz->read_ptr++)
    {
        char c = *thiz->read_ptr;

        switch(state)
        {
            case STAT_NAME:
            {
                /* In the "tag name" state, reading whitespace enters the "attributes" sub-state. */
                /* In the "tag name" state, reading '/' or '>' enters the "end" state. */
                if(isspace((unsigned char)c) || c == '>' || c == '/')
                {
                    state = (c != '>' && c != '/') ? STAT_ATTR : STAT_END;
                }
                break;
            }
            case STAT_ATTR:
            {
                /* Enter the "attributes" sub-state. */
                xml_parser_parse_attrs(thiz, '/');
                state = STAT_END;
                break;
            }
            default: break;
        }

        if(state == STAT_END)
        {
            break;
        }
    }

    /* Skip ahead to the closing '>' (or the end of the input). */
    for(; *thiz->read_ptr != '>' && *thiz->read_ptr != '\0'; thiz->read_ptr++);

    return;
}
Parsing a processing instruction is essentially the same as parsing a start tag, so here we just show the code:
static void xml_parser_parse_pi(XmlParser* thiz)
{
    enum _State
    {
        STAT_NAME,
        STAT_ATTR,
        STAT_END
    }state = STAT_NAME;

    char* tag_name = NULL;
    const char* start = thiz->read_ptr;

    for(; *thiz->read_ptr != '\0'; thiz->read_ptr++)
    {
        char c = *thiz->read_ptr;

        switch(state)
        {
            case STAT_NAME:
            {
                /* In the "tag name" state, reading whitespace enters the "attributes" sub-state. */
                /* In the "tag name" state, reading '>' enters the "end" state. */
                if(isspace((unsigned char)c) || c == '>')
                {
                    state = c != '>' ? STAT_ATTR : STAT_END;
                }
                break;
            }
            case STAT_ATTR:
            {
                /* Enter the "attributes" sub-state; a PI's attributes end at '?'. */
                xml_parser_parse_attrs(thiz, '?');
                state = STAT_END;
                break;
            }
            default: break;
        }

        if(state == STAT_END)
        {
            break;
        }
    }

    tag_name = thiz->buffer + (size_t)tag_name;

    /* Skip ahead to the closing '>' (or the end of the input). */
    for(; *thiz->read_ptr != '>' && *thiz->read_ptr != '\0'; thiz->read_ptr++);

    return;
}
Comments, end tags, and text are all simple to parse, so here we just show the code:
Processing of the "comment" sub-state:
static void xml_parser_parse_comment(XmlParser* thiz)
{
    enum _State
    {
        STAT_COMMENT,
        STAT_MINUS1,
        STAT_MINUS2,
    }state = STAT_COMMENT;

    /* Step past the second '-' of the opening "<!--". */
    const char* start = ++thiz->read_ptr;

    for(; *thiz->read_ptr != '\0'; thiz->read_ptr++)
    {
        char c = *thiz->read_ptr;

        switch(state)
        {
            case STAT_COMMENT:
            {
                /* In the "comment" state, reading '-' enters the "minus 1" state. */
                if(c == '-')
                {
                    state = STAT_MINUS1;
                }
                break;
            }
            case STAT_MINUS1:
            {
                if(c == '-')
                {
                    /* In the "minus 1" state, reading '-' enters the "minus 2" state. */
                    state = STAT_MINUS2;
                }
                else
                {
                    state = STAT_COMMENT;
                }
                break;
            }
            case STAT_MINUS2:
            {
                if(c == '>')
                {
                    /* In the "minus 2" state, reading '>' ends the comment. */
                    return;
                }
                else
                {
                    state = STAT_COMMENT;
                }
                break;
            }
            default: break;
        }
    }

    return;
}
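The three-state search for the closing "-->" can be exercised on its own. The sketch below is a standalone variant with a hypothetical name, find_comment_end, and one deliberate difference from the listing above: in the "minus 2" state a further '-' stays in "minus 2", so an input such as "--->" is still recognised as the end of the comment.

```c
#include <stddef.h>

/*
 * p points just past the opening "<!--". Return a pointer to the '>' that
 * closes the comment, or NULL if the input ends first.
 */
static const char *find_comment_end(const char *p)
{
    enum { IN_COMMENT, MINUS1, MINUS2 } state = IN_COMMENT;

    for (; *p != '\0'; p++) {
        switch (state) {
        case IN_COMMENT:
            if (*p == '-') state = MINUS1;
            break;
        case MINUS1:
            state = (*p == '-') ? MINUS2 : IN_COMMENT;
            break;
        case MINUS2:
            if (*p == '>') return p;
            /* Another '-' still leaves two dashes immediately before us. */
            state = (*p == '-') ? MINUS2 : IN_COMMENT;
            break;
        }
    }
    return NULL;
}
```

A lone "--" inside the comment body sends the machine to "minus 2" and then back to "comment" when the next character is not '>', which is why the intermediate states are needed at all.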
Processing of the "end tag" sub-state:
static void xml_parser_parse_end_tag(XmlParser* thiz)
{
    char* tag_name = NULL;
    const char* start = thiz->read_ptr;

    for(; *thiz->read_ptr != '\0'; thiz->read_ptr++)
    {
        /* Reading '>' ends the end tag. */
        if(*thiz->read_ptr == '>')
        {
            break;
        }
    }

    return;
}
Processing of the "text" sub-state:
static void xml_parser_parse_text(XmlParser* thiz)
{
    /* The caller has already advanced past the first character of the text. */
    const char* start = thiz->read_ptr - 1;

    for(; *thiz->read_ptr != '\0'; thiz->read_ptr++)
    {
        char c = *thiz->read_ptr;

        /* Reading '<' ends the text. */
        if(c == '<')
        {
            if(thiz->read_ptr > start)
            {
                /* A non-empty run of text [start, read_ptr) has been scanned. */
            }
            /* Step back so that the caller's loop increment lands on '<' again. */
            thiz->read_ptr--;
            return;
        }
        else if(c == '&')
        {
            /* Reading '&' enters the entity-parsing sub-state. */
            xml_parser_parse_entity(thiz);
        }
    }

    return;
}
The entity sub-state is relatively simple, so we do not analyze it further here and leave it to the reader as an exercise.
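As a starting point for that exercise, here is one possible sketch. It is not the author's xml_parser_parse_entity: the names decode_entity and decode_text are hypothetical, it uses a table lookup rather than a character-by-character state machine, and it only handles the five entities predefined by XML (numeric character references such as &#65; are left untouched).

```c
#include <stddef.h>
#include <string.h>

/*
 * s must point at '&'. If s starts with a predefined entity, store the
 * decoded character in *out and return the number of input characters
 * consumed; otherwise return 0.
 */
static size_t decode_entity(const char *s, char *out)
{
    static const struct { const char *text; char ch; } table[] = {
        { "&lt;",   '<'  },
        { "&gt;",   '>'  },
        { "&amp;",  '&'  },
        { "&quot;", '"'  },
        { "&apos;", '\'' },
    };
    size_t i;

    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        size_t len = strlen(table[i].text);
        if (strncmp(s, table[i].text, len) == 0) {
            *out = table[i].ch;
            return len;
        }
    }
    return 0;
}

/* Copy src to dst, decoding entities. dst must hold strlen(src) + 1 bytes. */
static void decode_text(const char *src, char *dst)
{
    while (*src != '\0') {
        char ch;
        size_t used = (*src == '&') ? decode_entity(src, &ch) : 0;

        if (used > 0) {
            *dst++ = ch;
            src += used;
        } else {
            *dst++ = *src++;    /* ordinary character or unrecognised '&' */
        }
    }
    *dst = '\0';
}
```

Because a decoded entity is never longer than its source text, the output always fits in a buffer the size of the input, so the decoding could also be done in place.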