Python Cookbook, Third Edition, study note 7: parsing CSV, JSON, and XML files in Python

Source: Internet
Author: User
Reading CSV files:
The CSV file has the following format: two rows and three columns.
The code to read it is as follows:
import csv

# Python 2: CSV files are opened in binary mode
f = open(r'E:\py_prj\test.csv', 'rb')
f_csv = csv.reader(f)
for row in f_csv:
    print row
f.close()
Here row is a list. To access a field you have to index into it: row[0] is the first column, row[1] the second, and row[2] the third. Column indexes are hard to remember and easy to get wrong, so consider using named tuples instead.
The namedtuple approach works as follows.
The example below uses namedtuple to create a class named User with three fields: name, age, and height. An instance is then created and assigned to u.
Each field can be read by attribute access, for example u.name.
from collections import namedtuple

User = namedtuple('User', ['name', 'age', 'height'])
u = User(name='zhf', age=20, height=180)
print u.name
 
The same approach can be used when reading CSV files. The code is modified as follows:
import csv
from collections import namedtuple

f = open(r'E:\py_prj\test.csv', 'rb')
f_csv = csv.reader(f)
heading = next(f_csv)               # the first row holds the column names
Row = namedtuple('Row', heading)
for r in f_csv:
    row = Row(*r)
    print row.first, row.second, row.third
f.close()
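For readers on Python 3, here is a minimal self-contained sketch of the same idea. It assumes a header row of first,second,third (matching the attribute names used in this note) and made-up data values; io.StringIO stands in for the real file so the example runs without test.csv.

```python
import csv
import io
from collections import namedtuple

# Hypothetical file contents; on disk you would use open(path, newline='').
sample = io.StringIO("first,second,third\n1,2,3\n4,5,6\n")

reader = csv.reader(sample)
heading = next(reader)              # the header row: column names
Row = namedtuple('Row', heading)    # column names become attribute names

rows = [Row(*fields) for fields in reader]
for row in rows:
    print(row.first, row.second, row.third)
```

Note that namedtuple only accepts column names that are valid Python identifiers, so a header containing spaces or punctuation would need cleaning up first.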
 
This makes access much more intuitive. If creating a class seems too heavy, can we use a dictionary instead? Yes, and the code is even more concise:
import csv

f = open(r'E:\py_prj\test.csv', 'rb')
f_csv = csv.DictReader(f)           # each row becomes a dict keyed by the header names
for row in f_csv:
    print row['first']
f.close()
 
Writing can be done in the same way with csv.DictWriter().
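A minimal sketch of writing with csv.DictWriter, written for Python 3. The field names reuse first/second/third from the reading examples, and the row values are invented for illustration; io.StringIO stands in for a real output file.

```python
import csv
import io

# Hypothetical rows keyed by column name.
fieldnames = ['first', 'second', 'third']
rows = [
    {'first': '1', 'second': '2', 'third': '3'},
    {'first': '4', 'second': '5', 'third': '6'},
]

# io.StringIO stands in for open('out.csv', 'w', newline='').
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()       # writes the header line built from fieldnames
writer.writerows(rows)     # writes one line per dictionary

output = buffer.getvalue()
print(output)
```

The fieldnames argument fixes the column order, since the dictionaries themselves do not say which column comes first.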
 
JSON data:
JSON and XML are the most widely used data exchange formats on the web. JSON has the following characteristics:
1 Objects are represented as key-value pairs, much like a dictionary
2 Data items are separated by commas
3 Curly braces hold objects
4 Square brackets hold arrays
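The characteristics listed above can be seen in a short Python 3 sketch. The record below is hypothetical; json.dumps and json.loads are the string-based counterparts of json.dump and json.load.

```python
import json

# Hypothetical record: a dict becomes a JSON object, a list becomes a JSON array.
record = {'name': 'zhf', 'tags': ['a', 'b'], 'active': True, 'score': None}

# sort_keys=True makes the key order in the output predictable.
text = json.dumps(record, sort_keys=True)
print(text)

# Parsing the text gives back an equal Python structure.
restored = json.loads(text)
print(restored == record)
```

Python's True and None are written as true and null in the JSON text and converted back when the text is loaded.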
 
Working with a JSON file looks like this:
import json

data = {'name': 'zhf', 'age': 30, 'location': 'china'}
# Write the data; the with block closes the file before it is read back
with open('test.json', 'w') as f:
    json.dump(data, f)
with open('test.json', 'r') as f:
    print json.load(f)
The format of the generated JSON file is as follows:
You can see that JSON's key-value structure is much simpler and easier to read at a glance than XML.
JSON data can also be nested, as in the following example. The value of the record key is an array that contains two dictionaries.
data = {'name': 'zhf', 'age': 30, 'location': 'china', 'record': [{'first': 'china', 'during': 10}, {'second': 'chengdu', 'during': 20}]}
This shows that JSON can store complex data structures. When we print the data directly, however, the structure is hard to see.
We can use pprint from the pprint module to print the result in a structured way: pprint(json.load(f))


It looks much clearer and more intuitive.
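A small Python 3 sketch of structured printing, using the same nested shape as the record example in this note. pformat is the pprint function that returns the formatted string instead of printing it, and the width=40 setting is an arbitrary choice to force line wrapping. json.dumps with indent is shown as an alternative that was not used in the original note.

```python
import json
from pprint import pformat

# Same shape as the nested record in this note; values are illustrative.
data = {
    'name': 'zhf',
    'age': 30,
    'location': 'china',
    'record': [
        {'first': 'china', 'during': 10},
        {'second': 'chengdu', 'during': 20},
    ],
}

# Round-trip through JSON text, as json.dump/json.load would via a file.
loaded = json.loads(json.dumps(data))

# pformat returns the string that pprint would print.
print(pformat(loaded, width=40))

# json.dumps can also indent its own output.
pretty = json.dumps(loaded, indent=2, sort_keys=True)
print(pretty)
```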
Parsing XML files:
XML (eXtensible Markup Language) is designed for transmitting and storing data. Together with JSON, it is one of the most commonly used data transmission formats on the Internet. Python offers three ways to parse XML: the xml.dom.* modules, the xml.sax.* modules, and the xml.etree.ElementTree module.
We start with the DOM module. A DOM parser reads the entire XML document at once and stores all of its elements in memory as a tree. This makes it better suited to small XML documents; large documents consume a lot of memory.
Take the following structure as an example. A <string-array> contains many <item> elements; this one lists the prefecture-level cities under Beijing.


The parsing code is as follows:
import xml.dom.minidom

def xml_try():
    domtree = xml.dom.minidom.parse(r'D:\test_source\arrays.xml')
    data = domtree.documentElement
    city = data.getElementsByTagName('string-array')
    for c in city:
        print c.getAttribute('name')
        cityname = c.getElementsByTagName('item')
        for name in cityname:
            print name.childNodes[0].data
First, domtree.documentElement returns the root element of the XML document. Then getElementsByTagName collects every element whose tag name is string-array, and each of them is processed in turn. Inside each <item>, childNodes gives access to the text node that holds the content, and getAttribute reads a named attribute.
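The steps described above can be tried without the original file. The sketch below is written for Python 3 and parses a hypothetical stand-in for arrays.xml from a string with parseString; the root tag name, the name attribute value, and the two item values are assumptions chosen for illustration.

```python
import xml.dom.minidom

# Hypothetical stand-in for arrays.xml; only string-array, item and the
# name attribute come from the description in this note.
SAMPLE = """<resources>
  <string-array name="beijing">
    <item>Dongcheng</item>
    <item>Haidian</item>
  </string-array>
</resources>"""

domtree = xml.dom.minidom.parseString(SAMPLE)
root = domtree.documentElement

cities = {}
for array in root.getElementsByTagName('string-array'):
    # Each <item> holds a single text node; .data is that node's string.
    names = [item.childNodes[0].data
             for item in array.getElementsByTagName('item')]
    cities[array.getAttribute('name')] = names

print(cities)
```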

The same file can be parsed with ElementTree as follows: first find all string-array nodes, then find all item nodes inside each one, then print the node content.
from xml.etree.ElementTree import parse

def xml_try():
    doc = parse(r'D:\test_source\arrays.xml')
    for city in doc.findall('string-array'):
        name = city.findall('item')
        for n in name:
            print n.text
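A comparable Python 3 sketch with ElementTree, again on a hypothetical stand-in for arrays.xml (the root tag, the name value, and the item values are assumptions). ET.fromstring returns the root element directly, so findall is called on it rather than on a parsed document.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for arrays.xml.
SAMPLE = """<resources>
  <string-array name="beijing">
    <item>Dongcheng</item>
    <item>Haidian</item>
  </string-array>
</resources>"""

root = ET.fromstring(SAMPLE)

# Map each string-array's name attribute to the text of its <item> children.
items_by_array = {
    array.get('name'): [item.text for item in array.findall('item')]
    for array in root.findall('string-array')
}
print(items_by_array)
```

Compared with minidom, the text is available directly as item.text instead of through childNodes[0].data.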
To locate a specific node precisely, give the path to it:
doc1 = parse(r'D:\test_source\rss20.xml')
for item in doc1.iterfind('channel/item/title'):
    print item.text
The XML structure is as follows:

The code above goes straight to the child node at the last level. To locate the parent node first and then read its child nodes:
doc1 = parse(r'D:\test_source\rss20.xml')
for item in doc1.iterfind('channel/item'):
    print item.findtext('title')
    print item.findtext('link')
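Both lookups can be compared side by side in a Python 3 sketch. The RSS document below is a hypothetical, heavily trimmed stand-in for rss20.xml; the feed titles and example.com links are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed RSS 2.0 document standing in for rss20.xml.
SAMPLE = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE)

# Path straight to the leaf nodes.
titles = [t.text for t in root.iterfind('channel/item/title')]

# Path to the parent nodes, then read several children from each.
entries = [(item.findtext('title'), item.findtext('link'))
           for item in root.iterfind('channel/item')]

print(titles)
print(entries)
```

The channel's own title is not matched by channel/item/title, because the path only descends through item elements.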
 
The first two methods read all of the data in the XML file at once and then search it. The advantage is that searching is fast; the drawback is memory use. Most of the time we are only looking for specific elements, so reading everything at once loads a lot of data we do not need. If elements can be examined while the file is being read, a great deal of memory can be saved. iterparse works this way. Using the same document as before:
from xml.etree.ElementTree import iterparse

doc2 = iterparse(r'D:\test_source\rss20.xml', ('start', 'end'))
for event, elem in doc2:
    print 'the event is %s' % event
    print elem.tag, elem.text
Here is a small part of the output.
Corresponding XML structure
Do you see the pattern? When the parser meets the opening <title> tag, event is start; when it meets the closing </title> tag, event is end. iterparse yields two values each time, event and elem, where elem is the element delimited by the start and end tags. In the output above, elem.tag and elem.text are printed twice: once when event is start and again when event is end. Note that elem.text is only guaranteed to be complete at the end event; at the start event the parser may not have read the element's text yet.
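The order of the events can be checked on a tiny document. This Python 3 sketch feeds iterparse from io.BytesIO rather than a file path; the XML content is hypothetical.

```python
import io
from xml.etree.ElementTree import iterparse

# Hypothetical fragment; io.BytesIO stands in for the file on disk.
SAMPLE = b"<channel><title>News</title><link>http://example.com</link></channel>"

# Record only the event name and tag, which are reliable at both events.
events = [(event, elem.tag)
          for event, elem in iterparse(io.BytesIO(SAMPLE), events=('start', 'end'))]

for pair in events:
    print(pair)
```

Each element produces a start event before any of its children and an end event after all of them, so the outermost element opens first and closes last.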
We can modify the code as follows so that the text is printed only when event is end:
doc2 = iterparse(r'D:\test_source\rss20.xml', ('start', 'end'))
for event, elem in doc2:
    print 'the event is %s' % event
    if event == 'end':
        print elem.tag, elem.text
Since iterparse visits each element and gives access to its text, we can turn this function into a generator. The code is modified as follows:

from xml.etree.ElementTree import iterparse

def xml_try(element):
    tag_indicate = []    # stack of the tags that are currently open
    doc2 = iterparse(r'D:\test_source\rss20.xml', ('start', 'end'))
    for event, elem in doc2:
        if event == 'start':
            tag_indicate.append(elem.tag)
        if event == 'end':
            if tag_indicate.pop() == element:
                yield elem.text

if __name__ == '__main__':
    for r in xml_try('title'):
        print r
 
In the code above, each start event records the current tag by pushing it onto tag_indicate. On each end event, the most recently recorded tag is popped and compared with the element value passed in; if they are equal, elem.text is yielded. Here we pass in title. The results are as follows.

As these examples show, iterparse is mainly useful for relatively large XML files, where reading all of the data at once to build a tree is very memory-intensive.
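One caveat: iterparse still builds the element tree as it goes, so the memory saving only appears if processed elements are discarded. The Python 3 sketch below is one way to do that, not the method used in this note. The helper name iter_text is hypothetical, and it checks elem.tag directly on end events instead of keeping a tag stack.

```python
import io
from xml.etree.ElementTree import iterparse

def iter_text(source, tag):
    """Yield the text of every <tag> element, discarding elements as we go."""
    for event, elem in iterparse(source, events=('end',)):
        if elem.tag == tag:
            yield elem.text
        # The element is complete; drop its children, text and attributes.
        # The emptied element stays attached to its parent until the
        # parent itself is cleared.
        elem.clear()

# Hypothetical fragment; io.BytesIO stands in for a large file on disk.
SAMPLE = b"<rss><channel><title>A</title><item><title>B</title></item></channel></rss>"

texts = list(iter_text(io.BytesIO(SAMPLE), 'title'))
print(texts)
```

Clearing is safe here because an element's own text is read at its end event, before clear() is called on it.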