Huge CSV and XML Files in Python
January 22, 2009. Filed under python
I, like most people, never realized I'd be dealing with large files. Oh, I knew there would be some files with megabytes of data, but I never suspected I'd be begging Perl to process hundreds of megabytes of XML, nor that this week I'd be asking Python to process 6.4 gigabytes of CSV into 6.5 gigabytes of XML [1].
As a few out-of-memory experiences will teach you, the trick for dealing with large files is pretty easy: use code that treats everything as a stream. For inputs, read from disk in chunks. For outputs, frequently write to disk and let system memory forge onward unburdened.
When reading and writing files yourself, this is easier to do correctly ...
```
from __future__ import with_statement  # for Python 2.5

with open('data.in', 'r') as fin:
    with open('data.out', 'w') as fout:
        for line in fin:
            fout.write(','.join(line.split(' ')))
```
... than it is to do incorrectly ...
```
with open('data.in', 'r') as fin:
    data = fin.read()
data2 = [','.join(x.split(' ')) for x in data.splitlines()]
with open('data.out', 'w') as fout:
    fout.write('\n'.join(data2))
```
... at least in the simple cases.
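The same streaming idea applies when lines aren't a natural unit. Here is a minimal sketch of reading in fixed-size chunks; the `CHUNK_SIZE` value and the `process()` placeholder are assumptions for illustration, not part of the original script:

```
from __future__ import with_statement  # for Python 2.5, as above

CHUNK_SIZE = 1024 * 1024  # read one megabyte at a time

def process(chunk):
    # placeholder: a real script would transform the chunk here
    return chunk

with open('data.in', 'rb') as fin:
    with open('data.out', 'wb') as fout:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(process(chunk))
```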
Loading Large CSV Files in Python
Python has an excellent CSV library, which can handle large files right out of the box. Sort of.
```
>>> import csv
>>> r = csv.reader(open('doc.csv', 'rb'))
>>> for row in r:
...     print row
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: field larger than field limit (131072)
```
Staring at the module documentation [2], I couldn't find anything of use. So I cracked open csv.py and confirmed what the _csv in the error message suggests: the bulk of the module's code (and the input parsing in particular) is implemented in C rather than Python.
After a while staring at that error, I began dreaming of how I would create a stream pre-processor using StringIO, but it didn't take too long to figure out that I would need to recreate my own version of the csv module in order to accomplish that.
So back to the blogs, one of which held the magic grain of information I was looking for: csv.field_size_limit.
```
>>> import csv
>>> csv.field_size_limit()
131072
>>> csv.field_size_limit(1000000000)
131072
>>> csv.field_size_limit()
1000000000
```
Yep. That's all there is to it. The sucker just works after that.
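Putting it together, a minimal sketch of the reading loop might look like this (the per-row work is left as a placeholder, and doc.csv is just a stand-in filename):

```
import csv

csv.field_size_limit(1000000000)  # raise the cap before parsing the big fields

r = csv.reader(open('doc.csv', 'rb'))
for row in r:
    pass  # rows stream through one at a time; do the real per-row work here
```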
Well, almost. I did run into an issue with a NULL byte 1.5 gigs into the data. Because the streaming code is implemented in C, the NULL byte shorts out the reading of data in an abrupt and non-recoverable manner. To get around this we need to pre-process the stream somehow, which you could do in Python by wrapping the file with a custom class that cleans each line before returning it, but I went with some command line utilities for simplicity.
```
# strip NULL bytes from the stream before handing it back to Python
tr -d '\000' < data.in > data.out
```
After this, the 6.4 gig CSV file processed without any issues.
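For reference, a rough sketch of the in-Python alternative mentioned above, wrapping the file in a class that strips NULL bytes from each line before the csv reader sees it (the NullStrippingFile name and its details are my own invention, not the code I actually ran):

```
import csv

class NullStrippingFile(object):
    """Wrap a file object and remove NULL bytes from each line."""
    def __init__(self, f):
        self.f = f

    def __iter__(self):
        return self

    def next(self):  # Python 2 iterator protocol
        return self.f.next().replace('\0', '')

# csv.reader only needs an iterable of strings, so the wrapper drops in directly
r = csv.reader(NullStrippingFile(open('data.in', 'rb')))
```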
Creating Large XML Files in Python
This part of the process, taking each row of CSV and converting it into an XML element, went fairly smoothly thanks to the xml.sax.saxutils.XMLGenerator class. The API for creating elements isn't an example of simplicity, but, unlike many of the more creative schemes, it is predictable, and it has one killer feature: it correctly writes output to a stream.
As I mentioned, the mechanism for creating elements is a bit verbose, so I made a couple of wrapper functions to simplify things (note that I am sending output to standard out, which lets me simply print strings to the file I am generating, for example when creating the XML file's version declaration).
```
import sys
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

g = XMLGenerator(sys.stdout, 'utf-8')

def start_tag(name, attr={}, body=None, namespace=None):
    attr_vals = {}
    attr_keys = {}
    for key, val in attr.iteritems():
        key_tuple = (namespace, key)
        attr_vals[key_tuple] = val
        attr_keys[key_tuple] = key
    attr2 = AttributesNSImpl(attr_vals, attr_keys)
    g.startElementNS((namespace, name), name, attr2)
    if body:
        g.characters(body)

def end_tag(name, namespace=None):
    g.endElementNS((namespace, name), name)

def tag(name, attr={}, body=None, namespace=None):
    start_tag(name, attr, body, namespace)
    end_tag(name, namespace)
```
From there, usage looks like this:
Print"" "<?xml version=" 1.0 "encoding=" Utf-8 '?> "" "Start_tag(U ' list ',{U ' ID ':10})ForItemInchSome_list:Start_tag(U ' item ',{u ' id ' : item[0 }) tag (u ' title ' body =item[1] tag< Span class= "P" > (u ' desc ' body= Item[2]) end_tag ( u ' item ' ) end_tag (u ' list ' ) g. Enddocument ()
The one issue I did run into (in my data) was some pagebreak characters floating around (^L, aka 12, aka \x0c) which were tweaking the XML encoder, but you can strip them out in a variety of places, for example by rewriting the main loop:
```
for item in some_list:
    item = [x.replace('\x0c', '') for x in item]
    # etc
```
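Another place you could do it (just a sketch of the idea, not what I actually shipped) is once, inside a small helper built on the tag() wrapper above, so every body string is scrubbed before it ever reaches the XMLGenerator:

```
def clean(text):
    # strip form feeds; add any other troublesome control characters here
    return text.replace(u'\x0c', u'') if text else text

def safe_tag(name, attr={}, body=None, namespace=None):
    # same as tag() above, but with the body sanitized first
    tag(name, attr, clean(body), namespace)
```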
Really, the XMLGenerator just worked, even when dealing with a quite large file.
Performance
Although my script created a different mix of XML elements than the above example, it wasn't any more complex, and it had fairly reasonable performance. Processing the 6.4 gig CSV file into a 6.5 gig XML file took between 19 and 24 minutes, which means it was able to read-process-write about five megabytes per second.
In terms of raw speed that isn't particularly epic, but the last time I performed a similar operation (actually XML to XML rather than CSV to XML) with Perl's XML::Twig, it took eight minutes to process a ~100 megabyte file, so I'm pretty pleased with the quality of the Python standard library and how it handles large files.
The breadth and depth of the standard library really makes Python a joy to work with for these simple one-shot scripts. If only it had Perl's easier-to-use regex syntax ...
[1] This is a peculiar property of data, which makes it different from media: data files can, within a large system, become almost infinitely large. Media files, on the other hand, can be extremely dense (a couple of gigs for a high quality movie), but they conform to predictable limits. If you're dealing with extremely large files, you're probably dealing with a company's logs from the last decade or the entire dump of their MySQL database.
[2] I really want to like the new Python documentation. I mean, it certainly looks much better, but I think it has made it harder to actually find what I'm looking for. I think they've hit the same stumbling block as the Django documentation: the more you customize your documentation, the greater the learning curve for using your documentation. I think the big thing is just the incompleteness of the documentation that gives me trouble. They are careful to cover all the important and frequently used components (along with helpful overviews and examples), but the new docs often don't even mention less important methods and objects. For the time being, I am throwing around a lot more dir() commands.