How to parse XML with bash: sample code analysis

Source: Internet
Author: User
My initial hope was that bash would offer a complete, mature tool for parsing XML, but I could not find one. Later I found a simple XML processing method on Stack Overflow, namely:

rdom () { local IFS=\> ; read -d \< E C ;}

It is only one line! (Of course, the two statements really ought to be on two lines...)

Of course, this can only handle the simplest, most basic XML; it cannot handle attributes or comments.

Because I am lazy and do not want to introduce (and learn) a new scripting language, I plan to adapt the method above.

Before adapting it, let me explain what the statement above does.

In fact, this command reads the characters between one < and the next <.

(If a < or > appears in the XML outside the tags themselves, or if an attribute value contains spaces, the function breaks. So we assume the XML contains neither.)

Under these assumptions, there must be a > character between any two < characters, and that > divides the content read into two parts: E and C. For example:

 
 <tag>value</tag>
 

The first time rdom runs, read hits a < immediately and stops, so E and C are both empty strings.

The second time rdom runs, the content read is tag>value, after which read hits the < character and stops. So E=tag and C=value.

The third time rdom runs, the content read is /tag> followed by everything up to the next < or the end of the file. So E=/tag and C is whitespace.
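The three calls described above can be traced with a small sketch. The helper show_call and the sample input are assumptions added for illustration; rdom itself is unchanged. The %q format prints empty strings and newlines in quoted form so they are visible.

```shell
#!/usr/bin/env bash
# rdom as quoted above: fields are split on '>', and read stops at the next '<'.
rdom () { local IFS=\> ; read -d \< E C ;}

# show_call is a hypothetical helper: run rdom once, then print E and C
# with shell quoting (%q) so empty strings and newlines can be seen.
show_call () {
    rdom || true    # a read that ends at end-of-input returns non-zero
    printf 'call %s: E=%q C=%q\n' "$1" "$E" "$C"
}

# Assumed sample input; the here-string appends a trailing newline,
# which ends up in C on the third call.
{
    show_call 1
    show_call 2
    show_call 3
} <<< '<tag>value</tag>'
```

The first call should show both variables empty, the second E=tag and C=value, and the third E=/tag with only a newline in C.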

This method is not very practical. We want to support nodes with attributes, we do not want to have to delete the comments from the XML, and we even want to parse the XML declaration... Well, that is asking a lot. Let's see what we can do.

As we can see, everything between < and > is assigned to E as a whole, so attributes have to be parsed out of E.

(Again we assume that attribute values contain no < or > and no spaces.)

Let's get started. First, we introduce a function, echo_tabs, that prints leading spaces to show the nesting level.

echo_tabs () {
    local tabs=""
    for ((i = 0; i < $1; i++)); do
        tabs=$tabs'    '    # 4 spaces
    done
    echo -n "$tabs"    # the double quotes are required
}
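A short usage sketch for echo_tabs follows. It also illustrates one likely reason the double quotes matter: an unquoted variable holding only spaces is removed by word splitting, so the indentation disappears. The variable names indent, unquoted and quoted are made up for this example.

```shell
#!/usr/bin/env bash
# echo_tabs: print 4 spaces per nesting level, without a trailing newline.
echo_tabs () {
    local tabs=""
    local i
    for ((i = 0; i < $1; i++)); do
        tabs=$tabs'    '
    done
    echo -n "$tabs"
}

# Indent three lines at levels 0, 1 and 2.
for level in 0 1 2; do
    echo_tabs "$level"
    echo "level $level"
done

# Why the quotes matter: compare unquoted and quoted expansion.
indent=$(echo_tabs 2)                    # eight spaces
unquoted=$(echo -n $indent; echo '|')    # word splitting drops the spaces
quoted=$(echo -n "$indent"; echo '|')    # the spaces survive
echo "unquoted: [$unquoted]"
echo "quoted:   [$quoted]"
```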

Then we parse the declaration in the XML, which is the following part.

 <?xml version="1.0" encoding="UTF-8"?>

The declaration is closed differently from other tags: inside the angle brackets it begins and ends with ?, so we need to distinguish it from an ordinary node.

read_dom () {
    # back up IFS
    local oldIFS=$IFS
    local IFS=\>    # change the field separator to >
    read -d \< ENTITY CONTENT    # change read's delimiter to <
    local ret=$?
    local ELEMENT=''
    # On the first call the first character is <, so when read finishes
    # both ENTITY and CONTENT are blank
    if [[ $ENTITY =~ ^[[:space:]]*$ ]] && [[ $CONTENT =~ ^[[:space:]]*$ ]]; then
        return $ret
    fi
    # ENTITY=?xml version="1.0" encoding="UTF-8"?
    # Parse the XML declaration. It is not an ordinary node and is closed differently.
    if [[ "$ENTITY" =~ ^\?xml[[:space:]]*(.*)\?$ ]]; then    # the regex strips the question marks and "xml"
        ENTITY=''
        ELEMENT=''    # not an ordinary node
        ATTRIBUTES="${BASH_REMATCH[1]}"    # the attributes of the declaration
    else    # ordinary node
        ELEMENT=${ENTITY%% *}    # node name: if ENTITY contains a space, everything before the first space
        ATTRIBUTES=${ENTITY#* }    # all attributes: if ENTITY contains a space, everything after the first space (cases #2 and #4; in #4 there is an extra trailing /)
    fi
}
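To look at the declaration step on its own, here is a minimal sketch of the regular-expression match. The function name parse_declaration is hypothetical; the pattern is the one used in read_dom, stored in a variable so the quoting stays simple.

```shell
#!/usr/bin/env bash
# parse_declaration (hypothetical helper): if the argument looks like
# ?xml ...?, print what lies between "?xml" and the closing "?".
parse_declaration () {
    local entity=$1
    local re='^\?xml[[:space:]]*(.*)\?$'
    if [[ $entity =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"    # first capture group
        return 0
    fi
    return 1
}

decl_attrs=$(parse_declaration '?xml version="1.0" encoding="UTF-8"?')
echo "declaration attributes: $decl_attrs"

# An ordinary node name does not match the pattern.
parse_declaration 'Size' || echo "Size is an ordinary node"
```

The capture group is greedy, so it backtracks just far enough to leave the final ? for the end of the pattern.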

Next we handle comments. Note: a comment may contain angle brackets! Only comments without angle brackets are handled here!

if [[ "$ENTITY" = \!--*-- ]]; then    # comments are not processed
    return 0
fi
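The comment test and its limitation can be tried in isolation. In this sketch is_comment is a hypothetical wrapper around the same pattern, and the last lines show how a comment containing > is cut short by the field split before the pattern ever sees it.

```shell
#!/usr/bin/env bash
# is_comment (hypothetical helper): true if the text starts with !-- and ends with --
is_comment () { [[ "$1" = \!--*-- ]]; }

is_comment '!-- Q1 --' && echo "comment: skipped"
is_comment 'Size' || echo "Size: not a comment"

# A comment that contains '>' is split at that character, so ENTITY
# holds only the first part and the pattern no longer matches.
IFS=\> read -r -d \< ENTITY CONTENT <<< '!-- a > b --><'
is_comment "$ENTITY" || echo "not recognized as a comment: ENTITY=[$ENTITY]"
```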

Now let's look at the most important part of xml.

We know that CONTENT holds the node's content, so it can be printed directly.

if [[ ! "$CONTENT" =~ ^[[:space:]]*$ ]]; then
    echo -n CONTENT=$CONTENT
fi

The node attributes are in ENTITY, so we need to separate the node names and attributes, and then extract the attribute names and attribute values.
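As a rough sketch of the second half of that task, the function below splits an attribute string into names and values. The name print_attributes is hypothetical, and the quote removal uses bash pattern substitution, which is swapped in here for brevity and is not necessarily how the full script does it. It relies on the assumption stated earlier that attribute values contain no spaces.

```shell
#!/usr/bin/env bash
# print_attributes (hypothetical helper): print one line per name="value" pair.
print_attributes () {
    local a name value
    for a in $1; do    # unquoted on purpose: split on whitespace
        if [[ $a = *=* ]]; then    # ignore words that are not name=value
            name=${a%%=*}          # text before the first =
            value=${a#*=}          # text after the first =
            value=${value//\"/}    # drop the double quotes
            echo "ATTRIBUTE=$name VALUE=$value"
        fi
    done
}

print_attributes 'name="xyz" age="21"'
print_attributes 'Size'    # no = sign, so nothing is printed
```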

We process the following nodes:

 
 
 
  abc
 
 

We have separated the node names and attributes.

ELEMENT=${ENTITY%% *}    # node name: if ENTITY contains a space, everything before the first space
ATTRIBUTES=${ENTITY#* }    # all attributes: if ENTITY contains a space, everything after the first space (cases #2 and #4; in #4 there is an extra trailing /)

However, the above ATTRIBUTES variable has a small problem, which will be explained later.
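One plausible reading of that problem can be seen by running the two expansions on an entity with and without a space. The wrapper split_entity and the result variables are assumptions made for this sketch.

```shell
#!/usr/bin/env bash
# split_entity (hypothetical helper): apply the two expansions to its argument.
split_entity () {
    ELEMENT=${1%% *}      # remove the longest suffix that starts at a space
    ATTRIBUTES=${1#* }    # remove the shortest prefix that ends at a space
}

split_entity 'test name="xyz" age="21"/'
with_attrs="$ELEMENT|$ATTRIBUTES"
echo "$with_attrs"

# With no space in the entity neither pattern matches, so ATTRIBUTES
# ends up holding the node name instead of being empty.
split_entity 'Size'
without_attrs="$ELEMENT|$ATTRIBUTES"
echo "$without_attrs"
```

Any later attribute loop therefore has to skip words that contain no = sign.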

If ELEMENT starts with a slash (/), what was read is a node's closing tag.

If ELEMENT ends with a slash (/), this is an empty tag, such as <test/>.

In all other cases ELEMENT is the node name. However, when a tag such as <test name="xyz" age="21"/> is read, ELEMENT is fine but ATTRIBUTES ends with /. In other words, the tag is self-closing, and we need to delete the / from the end of ATTRIBUTES.
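The three cases can be condensed into a small classifier. This is only a sketch: the function name classify_entity and the labels close, empty and open are invented for the example.

```shell
#!/usr/bin/env bash
# classify_entity (hypothetical helper): report which kind of tag was read.
classify_entity () {
    local element=${1%% *}
    local attributes=${1#* }
    if [[ $element = /* ]]; then
        echo "close ${element#/}"     # closing tag: strip the leading /
    elif [[ $element = */ || $attributes = */ ]]; then
        echo "empty ${element%/}"     # self-closing tag: strip a trailing /
    else
        echo "open $element"          # ordinary opening tag
    fi
}

classify_entity '/Size'
classify_entity 'test/'
classify_entity 'test name="xyz" age="21"/'
classify_entity 'Size'
```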

#!/usr/bin/env bash
# Only suitable for parsing simple XML. Attribute values that contain spaces
# and comments that contain angle brackets cannot be parsed.
# The forms that can be parsed normally are listed as cases #0 to #5
# in the comments inside read_dom.
#
# ATTRIBUTE = attribute name
# VALUE     = attribute value
# ELEMENT   = element name
# CONTENT   = element content

# Takes an integer level parameter; levels start from 0
echo_tabs () {
    local tabs=""
    local i
    for ((i = 0; i < $1; i++)); do
        tabs=$tabs'    '    # 4 spaces
    done
    echo -n "$tabs"    # the double quotes are required
}

read_dom () {
    # back up IFS
    local oldIFS=$IFS
    local IFS=\>    # change the field separator to >
    read -d \< ENTITY CONTENT    # change read's delimiter to <
    local ret=$?
    local ELEMENT=''
    local empty=false    # true when the node has no child nodes and no value
    # On the first call the first character is <, so when read finishes
    # both ENTITY and CONTENT are blank
    if [[ $ENTITY =~ ^[[:space:]]*$ ]] && [[ $CONTENT =~ ^[[:space:]]*$ ]]; then
        return $ret
    fi
    # From the second call on, there are the following cases:
    #0. <?xml version="1.0" encoding="UTF-8"?>
    #   read gets ?xml version="1.0" encoding="UTF-8"?
    #   CONTENT = some whitespace
    #1. <Size>1785</Size>
    #   read gets Size>1785, so ENTITY=Size, CONTENT=1785
    #   the third read gets /Size>, so ENTITY=/Size, CONTENT = some whitespace
    #2. <ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
    #   read gets ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/",
    #   so ENTITY=ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/", CONTENT as in #1
    #3. <test/>
    #   read gets test/, so ENTITY=test/, CONTENT = some whitespace
    #4. <test name="xyz" age="21"/>
    #   read gets test name="xyz" age="21"/, so ENTITY=test name="xyz" age="21"/, CONTENT = some whitespace
    #5. <!-- Q1 -->
    #   read gets !-- Q1 --, so ENTITY=!-- Q1 --, CONTENT=''

    # ENTITY=?xml version="1.0" encoding="UTF-8"?
    # Parse the XML declaration. It is not an ordinary node and is closed differently.
    if [[ "$ENTITY" =~ ^\?xml[[:space:]]*(.*)\?$ ]]; then    # the regex strips the question marks and "xml"
        ENTITY=''
        ELEMENT=''    # not an ordinary node
        ATTRIBUTES="${BASH_REMATCH[1]}"    # the attributes of the declaration
    else    # ordinary node
        ELEMENT=${ENTITY%% *}    # node name: if ENTITY contains a space, everything before the first space
        ATTRIBUTES=${ENTITY#* }    # all attributes: if ENTITY contains a space, everything after the first space (cases #2 and #4; in #4 there is an extra trailing /)
    fi
    if [[ "$ENTITY" = \!--*-- ]]; then    # comments are not processed (#5)
        return 0
    fi
    if [[ "$ELEMENT" = /* ]]; then    # end of a node (step 3 of #1)
        tabCount=$((tabCount - 1))
        echo_tabs $tabCount
        echo END ${ELEMENT#*/}    # delete the /
        return 0
    elif [[ "$ELEMENT" = */ ]] || [[ $ATTRIBUTES = */ ]]; then    # #3 or #4
        empty=true    # the node has no child nodes and no value (it is a self-closing tag)
        if [[ $ATTRIBUTES = */ ]]; then    # case #4
            ATTRIBUTES=${ATTRIBUTES%*/}    # delete the trailing / to leave only the attributes
        fi
        echo_tabs $tabCount
        echo -n ELEMENT=${ELEMENT%*/}' '
    elif [ ! "$ELEMENT" = '' ]; then    # ELEMENT is non-empty: an ordinary opening tag
        echo_tabs $tabCount
        echo -n ELEMENT="$ELEMENT"' '    # print the node name
        tabCount=$((tabCount + 1))    # a new node starts, one level deeper
    else
        echo -n "XML declaration "    # ELEMENT is empty; the declaration has no nesting level
    fi
    IFS=$oldIFS    # attributes are separated by whitespace, so restore IFS (default: space, newline, tab)
    local hasAttribute=false    # whether the node has attributes
    for a in $ATTRIBUTES; do    # loop over all attributes
        # echo ATTRIBUTES=$ATTRIBUTES '-+-'
        if [[ "$a" = *=* ]]; then    # cases #2 and #4
            hasAttribute=true
            ATTRIBUTE_NAME=${a%%=*}    # extract the attribute name
            ATTRIBUTE_VALUE=$(tr -d '"' <<< "${a#*=}")    # extract the attribute value and remove the double quotes
            echo -n ATTRIBUTE=$ATTRIBUTE_NAME VALUE=$ATTRIBUTE_VALUE' '    # print attribute name and value
        fi
    done
    if [[ ! "$CONTENT" =~ ^[[:space:]]*$ ]]; then
        echo -n CONTENT=$CONTENT
    fi
    if [ "$empty" = true ]; then
        echo
        echo_tabs $tabCount
        echo -n END ${ELEMENT%/*}    # delete the /
        # echo -n ' (empty node)'
    fi
    echo
    return $ret
}

read_xml () {
    local tabCount=0    # used to format the output: the current nesting level
    while read_dom; do
        :
    done < test.xml
}

read_xml
 

Run this script on the following XML (saved as test.xml, the file name the script reads):

 
 
     
      
      
              
          
  
   Only For Test
          
          
  
   abc
          
                  
      
 

The output result is
