Huge CSV and XML Files in Python
January 22, 2009. Filed under python
I, like most people, never realized I'd be dealing with large files. Oh, I knew there would be some files with megabytes of data, but I never suspected I'd be begging Perl to process hundreds of megabytes of XML, nor that this week I'd be asking Python to process 6.4 gigabytes of CSV into 6.5 gigabytes of XML [1].
As a few out-of-memory experiences will teach you, the trick for dealing with large files is pretty easy: use code that treats everything as a stream. For inputs, read from disk in chunks. For outputs, frequently write to disk and let system memory forge onward unburdened.
When reading and writing files yourself, this is easier to do correctly ...
```
from __future__ import with_statement  # for Python 2.5

with open('data.in', 'r') as fin:
    with open('data.out', 'w') as fout:
        for line in fin:
            fout.write(','.join(line.split(' ')))
```
... than it is to do incorrectly ...
```
with open('data.in', 'r') as fin:
    data = fin.read()
data2 = [','.join(x.split(' ')) for x in data.splitlines()]
with open('data.out', 'w') as fout:
    fout.write('\n'.join(data2))
```
... at least in the simple cases.
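The same streaming idea applies when lines aren't a natural unit. Here is a minimal sketch of reading in fixed-size chunks; the `CHUNK_SIZE` value and the `process()` placeholder are assumptions for illustration, not part of the original script:

```
from __future__ import with_statement  # for Python 2.5, as above

CHUNK_SIZE = 1024 * 1024  # read one megabyte at a time

def process(chunk):
    # placeholder: a real script would transform the chunk here
    return chunk

with open('data.in', 'rb') as fin:
    with open('data.out', 'wb') as fout:
        while True:
            chunk = fin.read(CHUNK_SIZE)
            if not chunk:
                break
            fout.write(process(chunk))
```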
Loading Large CSV Files in Python
Python has an excellent CSV library, which can handle large files right out of the box. Sort of.
```
>>> import csv
>>> r = csv.reader(open('doc.csv', 'rb'))
>>> for row in r:
...     print row
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
_csv.Error: field larger than field limit (131072)
```
Staring at the module documentation [2], I couldn't find anything of use. So I cracked open csv.py and confirmed what the _csv in the error message suggests: the bulk of the module's code (and the input parsing in particular) is implemented in C rather than Python.
After a while staring at that error, I began dreaming of how I would create a stream pre-processor using StringIO, but it didn't take too long to figure out that I would need to recreate my own version of the csv module in order to accomplish that.
So back to the blogs, one of which held the magic grain of information I was looking for: csv.field_size_limit.
```
>>> import csv
>>> csv.field_size_limit()
131072
>>> csv.field_size_limit(1000000000)
131072
>>> csv.field_size_limit()
1000000000
```
Yep. That's all there is to it. The sucker just works after that.
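Putting it together, a minimal sketch of the reading loop might look like this (the per-row work is left as a placeholder, and doc.csv is just a stand-in filename):

```
import csv

csv.field_size_limit(1000000000)  # raise the cap before parsing the big fields

r = csv.reader(open('doc.csv', 'rb'))
for row in r:
    pass  # rows stream through one at a time; do the real per-row work here
```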
Well, almost. I did run into an issue with a NULL byte 1.5 gigs into the data. Because the streaming code is implemented in C, the NULL byte shorts out the reading of data in an abrupt and non-recoverable manner. To get around this we need to pre-process the stream somehow, which you could do in Python by wrapping the file with a custom class that cleans each line before returning it, but I went with some command line utilities for simplicity.
```
# strip NULL bytes from the stream before handing it back to Python
tr -d '\000' < data.in > data.out
```
After this, the 6.4 gig CSV file processed without any issues.
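For reference, a rough sketch of the in-Python alternative mentioned above, wrapping the file in a class that strips NULL bytes from each line before the csv reader sees it (the NullStrippingFile name and its details are my own invention, not the code I actually ran):

```
import csv

class NullStrippingFile(object):
    """Wrap a file object and remove NULL bytes from each line."""
    def __init__(self, f):
        self.f = f

    def __iter__(self):
        return self

    def next(self):  # Python 2 iterator protocol
        return self.f.next().replace('\0', '')

# csv.reader only needs an iterable of strings, so the wrapper drops in directly
r = csv.reader(NullStrippingFile(open('data.in', 'rb')))
```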
Creating Large XML Files in Python
This part of the process, taking each row of CSV and converting it into an XML element, went fairly smoothly thanks to the xml.sax.saxutils.XMLGenerator class. The API for creating elements isn't an example of simplicity, but, unlike many of the more creative schemes, it is predictable, and it has one killer feature: it correctly writes output to a stream.
As I mentioned, the mechanism for creating elements is a bit verbose, so I made a couple of wrapper functions to simplify things (note that I am sending output to standard out, which lets me simply print strings to the file I am generating, for example when creating the XML file's version declaration).
```
import sys
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

g = XMLGenerator(sys.stdout, 'utf-8')

def start_tag(name, attr={}, body=None, namespace=None):
    attr_vals = {}
    attr_keys = {}
    for key, val in attr.iteritems():
        key_tuple = (namespace, key)
        attr_vals[key_tuple] = val
        attr_keys[key_tuple] = key
    attr2 = AttributesNSImpl(attr_vals, attr_keys)
    g.startElementNS((namespace, name), name, attr2)
    if body:
        g.characters(body)

def end_tag(name, namespace=None):
    g.endElementNS((namespace, name), name)

def tag(name, attr={}, body=None, namespace=None):
    start_tag(name, attr, body, namespace)
    end_tag(name, namespace)
```
From there, usage looks like this:
Print"" "<?xml version=" 1.0 "encoding=" Utf-8 '?> "" "Start_tag(U ' list ',{U ' ID ':10})ForItemInchSome_list:Start_tag(U ' item ',{u ' id ' : item[0 }) tag (u ' title ' body =item[1] tag< Span class= "P" > (u ' desc ' body= Item[2]) end_tag ( u ' item ' ) end_tag (u ' list ' ) g. Enddocument ()
The one issue I did run into (in my data) was some pagebreak characters floating around (^L, aka 12, aka \x0c) which were tweaking the XML encoder, but you can strip them out in a variety of places, for example by rewriting the main loop:
```
for item in some_list:
    item = [x.replace('\x0c', '') for x in item]
    # etc
```
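Another place you could do it (just a sketch of the idea, not what I actually shipped) is once, inside a small helper built on the tag() wrapper above, so every body string is scrubbed before it ever reaches the XMLGenerator:

```
def clean(text):
    # strip form feeds; add any other troublesome control characters here
    return text.replace(u'\x0c', u'') if text else text

def safe_tag(name, attr={}, body=None, namespace=None):
    # same as tag() above, but with the body sanitized first
    tag(name, attr, clean(body), namespace)
```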
Really, the XMLGenerator just worked, even when dealing with a quite large file.
Performance
Although my script created a different mix of XML elements than the above example, it wasn't any more complex, and it had fairly reasonable performance. Processing the 6.4 gig CSV file into a 6.5 gig XML file took between 19 and 24 minutes, which means it was able to read-process-write about five megabytes per second.
In terms of raw speed that isn't particularly epic, but the last time I performed a similar operation (actually XML to XML rather than CSV to XML) with Perl's XML::Twig, it took eight minutes to process a ~100 megabyte file, so I'm pretty pleased with the quality of the Python standard library and how it handles large files.
The breadth and depth of the standard library really makes Python a joy to work with for these simple one-shot scripts. If only it had Perl's easier-to-use regex syntax ...
[1] This is a peculiar property of data, which makes it different from media: data files can, within a large system, become almost infinitely large. Media files, on the other hand, can be extremely dense (a couple of gigs for a high quality movie), but they conform to predictable limits. If you're dealing with extremely large files, you're probably dealing with a company's logs from the last decade or the entire dump of their MySQL database.
[2] I really want to like the new Python documentation. I mean, it certainly looks much better, but I think it has made it harder to actually find what I'm looking for. I think they've hit the same stumbling block as the Django documentation: the more you customize your documentation, the greater the learning curve for using your documentation. I think the big thing is just the incompleteness of the documentation that gives me trouble. They are careful to cover all the important and frequently used components (along with helpful overviews and examples), but the new docs often don't even mention less important methods and objects. For the time being, I am throwing around a lot more dir() commands.