Python XML parsing example

Source: Internet
Author: User

# -*- coding: utf-8 -*-
"""
Created on Thu Apr 16 23:18:27 2015

@author: shifeng
"""
'''
Purpose: parse the CDR_sample.xml file and write the text in the format
that DNorm accepts; the "label" of the training set is written to the ...
'''
import codecs
from io import StringIO
import xml
from lxml import etree
from xml.sax import *
from xml.sax.handler import *
from xml.etree import ElementTree as ET
import xml.dom.minidom

dom = xml.dom.minidom.parse("CDR_sample.xml")
root = dom.documentElement
# print(root.nodeName)
# print(root.nodeValue)
# print(root.nodeType)
# print(root.ELEMENT_NODE)
# -----------
'''
Method 1 (not adopted):
# When the element names are known, getElementsByTagName returns the
# matching child elements. The root node is "collection"; it has four
# children with known names, and root.getElementsByTagName(i) fetches each.
collection_ele = ["source", "date", "key", "document"]
for i in collection_ele:
    print(root.getElementsByTagName(i)[0].nodeName)  # tag name
    # print(root.getElementsByTagName(i)[0].getAttribute)

# Each document has three kinds of tags:
document_ele = ["id", "passage", "annotation"]
documents = root.getElementsByTagName("document")
# print(len(documents))
for i in documents:            # for each document
    for j in document_ele:     # fetch each tag
        print(i.getElementsByTagName(j)[0].nodeName)         # tag name
        print(i.getElementsByTagName(j)[0].firstChild.data)  # text between the tags
        if j == "annotation":
            print(i.getElementsByTagName(j)[0].getAttribute("id"))  # tag attribute
'''
# -----------
write_text = open("train_text.txt", "w")
# -----------
root_2 = ET.parse("CDR_sample.xml")
documents = root_2.findall("./document")
for per in documents:      # every document
    for child in per:      # handle the id, passage and annotation tags of each document
        child_tag = child.tag
        if child_tag == "id":
            text_id = child.text
            print(child_tag, ":", text_id)
            write_text.write(text_id + "\t")   # write the id followed by a tab
        elif child_tag == "passage":           # handle each passage
            passages = child
            # A document holds several passage elements, and a passage
            # holds four kinds of child tags; handle each one in turn.
            for passage in passages:
                passage_tag = passage.tag
                if passage_tag == "offset":    # extract the offset
                    offset = int(passage.text)
                    print("offset:", offset)
                elif passage_tag == "text":    # title text or abstract text
                    text = passage.text
                    print(passage_tag, ":", text)
                    # Title and abstract are written one after the other.
                    write_text.write(text)
                elif passage_tag == "annotation":
                    annotations = passage
                    print(10 * "*")
                    # An annotation holds four kinds of child tags.
                    for annotation in annotations:
                        annotation_tag = annotation.tag
                        # print(annotation_tag, "++")
                        if annotation_tag == "location":
                            print(annotation.attrib["offset"], annotation.attrib["length"])
                        elif annotation_tag == "text":
                            disease_name = annotation.text
                            print(disease_name)
                        elif annotation_tag == "infon" and annotation.attrib["key"] != "type":
                            # Each annotation has two infon tags; take the
                            # second one (the one whose key is not "type").
                            infons = annotation
                            print(infons.attrib["key"], infons.text)
                            # for infon in infons:
                            #     print(infon.attrib["key"])
        elif child_tag == "annotation":        # document_ele[2]:
            # annotation = child
            pass
    write_text.write("\n")   # newline after each document
    print(30 * "*")
write_text.close()

# The "label" file: to be continued ....
'''
doc = etree.parse("CDR_sample.xml")
xml_string = etree.tostring(doc)
root = etree.fromstring(xml_string)
parser = make_parser()
# MarkDecodeHandler
# MarkDecodeHandler
handler = UserDecodeHandler()
parser.setContentHandler(handler)
parser.parse(root)
for item in handler.marks:
    for j in item.items():
        print(i, j)
print(type(doc))
print(type(root))
# print(doc.tag)
print(root.tag)
# with codecs.open("CDR_sample.xml") as xml:
#     text = xml.readlines()
#     s_xml = ""
#     for i in text:
#         i = i.strip("\n")
#         s_xml += i
#     print(s_xml)
#     soup = BeautifulSoup(s_xml)
#     print(soup.title)
#     for i in text:
#         print(i)
'''
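The script above depends on CDR_sample.xml, which is not included here. The following is a minimal, self-contained sketch of the same ElementTree approach, run against a small inline XML string instead of the file. The XML content, the placeholder MESH value, and the helper names documents_to_lines and annotations_of are all made up for illustration; only the tag layout (document, id, passage, offset, text, annotation, infon, location) is taken from the script. Unlike the script, the sketch joins title and abstract with a space, which is a choice of this example rather than something the original does.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the tag layout follows the script above,
# but every value here is invented for illustration.
SAMPLE_XML = """<collection>
  <source>hypothetical</source>
  <date>2015-04-16</date>
  <key>sample.key</key>
  <document>
    <id>1001</id>
    <passage>
      <infon key="type">title</infon>
      <offset>0</offset>
      <text>Aspirin and headache.</text>
      <annotation id="0">
        <infon key="type">Chemical</infon>
        <infon key="MESH">PLACEHOLDER_ID</infon>
        <location offset="0" length="7"/>
        <text>Aspirin</text>
      </annotation>
    </passage>
    <passage>
      <infon key="type">abstract</infon>
      <offset>22</offset>
      <text>A short abstract.</text>
    </passage>
  </document>
</collection>"""


def documents_to_lines(xml_text):
    """Return one 'id<TAB>text' line per document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    lines = []
    for document in root.findall("./document"):
        doc_id = document.findtext("id", default="")
        # Collect the text of every passage (title, abstract, ...).
        texts = [p.findtext("text", default="") for p in document.findall("passage")]
        lines.append(doc_id + "\t" + " ".join(texts))
    return lines


def annotations_of(xml_text):
    """Return (mention, offset, length, infon dict) per annotation (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    found = []
    for annotation in root.iter("annotation"):
        location = annotation.find("location")
        infons = {infon.get("key"): infon.text for infon in annotation.findall("infon")}
        found.append((
            annotation.findtext("text"),
            int(location.get("offset")),
            int(location.get("length")),
            infons,
        ))
    return found


if __name__ == "__main__":
    for line in documents_to_lines(SAMPLE_XML):
        print(line)
    for mention in annotations_of(SAMPLE_XML):
        print(mention)
```

Using findtext and findall with tag names avoids the nested if/elif chain over child.tag, at the cost of assuming every annotation has a location element.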
