Which is the fastest xlsx writer in Python?

Tags: benchmark

For work, I needed to use pandas to filter several CSV files and merge the filtered results into one xlsx file.

The resulting xlsx files are fairly large (10-30 MB).

At first I used the engines pandas supports out of the box, openpyxl and XlsxWriter (files with an .xlsx extension are written using xlsxwriter if it is available, otherwise openpyxl).

Performance was not ideal; writing the output took several minutes.
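For reference, this is roughly what that slow path looks like: read and filter the CSVs with pandas, concatenate, and write with to_excel. A minimal sketch only; the file names, filter condition, and explicit engine argument are made-up examples, not the original code:

import glob
import pandas as pd

frames = []
for path in glob.glob('raw_*.csv'):              # hypothetical input files
    df = pd.read_csv(path, encoding='GBK')
    frames.append(df[df['Status'] == 'OK'])      # hypothetical filter condition

merged = pd.concat(frames, ignore_index=True)
# pandas picks the xlsxwriter engine by default when it is installed, openpyxl otherwise
merged.to_excel('merged.xlsx', index=False, engine='xlsxwriter')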

Searching on Google turned up PyExcelerate, which claims to be an "accelerated Excel XLSX writing library for Python".

I looked at the benchmark the project officially provides, but because the absolute numbers were small I felt at the time that the improvement over the previous two libraries was limited, and did not pay much attention.

Instead I went looking for a solution based on a C/C++ dynamic library, and found LibXL.

LibXL is a dynamic library for manipulating Excel documents directly from C/C++; its homepage advertises the following benchmark:

High Performance

Writing speed is quoted in millions of cells per second for numbers, and lower for 8-character random strings, in the binary xls format (CPU 3.2 GHz).

Add the halo of C and it looked very impressive (looking back now, I notice the quoted figure is for xls, not xlsx).

But LibXL is a paid product: $199.

With the trial version, the number of rows you can read is limited, and a banner is inserted as the first row of every Excel file it writes:

Created by LibXL trial version. Please buy the LibXL full version for removing this message.

Since the xlsx files I need to generate are tens of MB, opening each one, deleting the first row, and saving again is rather inconvenient.
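In principle the banner row could be stripped programmatically rather than by hand. A small sketch using openpyxl, assuming a version of openpyxl that provides delete_rows; the file names are hypothetical:

from openpyxl import load_workbook

wb = load_workbook('report_libxl.xlsx')
ws = wb.active
ws.delete_rows(1)          # drop the "Created by LibXL trial version..." banner row
wb.save('report_libxl_clean.xlsx')

Of course, re-reading and re-saving a file of tens of MB through openpyxl is itself slow, so this does not really solve the problem.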

I then looked for several cracked versions; none of them seemed to work, and the purchase notice was still there...

In the meantime I also tried compiling things myself.

I found a LibXL Python wrapper, libxlpy, on the Internet. To get the library to compile on 64-bit Windows 8, I installed VS2008 Express, the Microsoft Visual C++ Compiler for Python 2.7, and the Windows SDK one after another.

The build never succeeded, so I gave up.

(If you reproduce this post, please credit the original on the blog park: http://www.cnblogs.com/lightwind/)

Coming back to PyExcelerate, I noticed that the library's authors are two Chinese engineers working at Google:

Authors: Kevin Wang and Kevin Zhang
Kevin Wang, Kevin Zhang. Portions Copyright Google, Inc.

I grabbed a file to try it on, and it felt like it should be comparable to LibXL, so I wrote a script to benchmark the three approaches.

Let's get to it!

System environment:
CPU i5 3337U, RAM 8GB, SSD
Win 8.1 64bit
Python 2.7.9


PyExcelerate 0.6.7
LibXL 3.6.2
XlsxWriter 0.7.3

import time, ctypes
import pandas as pd
import pyexcelerate, xlsxwriter


def test_pyexcelerate(fn_prefix, data):
    wb = pyexcelerate.Workbook()
    wb.new_sheet("Sheet name", data=data)
    wb.save(fn_prefix + "_pyexcelerate.xlsx")


def test_libxl(fn_prefix, data):
    dll = ctypes.windll.LoadLibrary('Bin64_libxl_362.dll')
    book = dll.xlCreateXMLBookCA()
    sheet = dll.xlBookAddSheetA(book, "Sheet10", 0)
    # Reserve the first row for 'Created by LibXL trial version...'
    for nrow, rec in enumerate(data, start=1):
        for ncol, val in enumerate(rec, start=0):
            # print 'row: %s, col: %s, %s' % (nrow, ncol, type(val))
            if isinstance(val, float):
                dll.xlSheetWriteNumA(sheet, nrow, ncol, ctypes.c_double(val), 0)
            else:
                dll.xlSheetWriteStrA(sheet, nrow, ncol, ctypes.c_char_p(val), 0)
    dll.xlBookSaveA(book, fn_prefix + "_libxl.xlsx")
    dll.xlBookReleaseA(book)


def test_xlsxwriter(fn_prefix, data):
    workbook = xlsxwriter.Workbook(fn_prefix + "_xlsxwriter.xlsx")
    worksheet = workbook.add_worksheet()
    for nrow, rec in enumerate(data):
        for ncol, val in enumerate(rec):
            worksheet.write(nrow, ncol, val)
    workbook.close()


def bench1():
    # Run all three writers on 1000, 10000 and all rows.
    for nline in (1000, 10000, len(csv_data)):
        fn_prefix = fn[:-4] + '_' + str(nline)
        data = csv_data[:nline]
        print '*** %s lines xlsx benchmark ***' % nline
        for func in (test_pyexcelerate, test_libxl, test_xlsxwriter):
            t1 = time.time()
            apply(func, (fn_prefix, data))
            tm = time.time() - t1
            print '%s: %0.3f' % (func.__name__, tm)
        print ''


def bench2(func, fn, nline, csv_data):
    # Run a single writer so its peak memory usage can be monitored separately.
    fn_prefix = fn[:-4] + '_' + str(nline)
    data = csv_data[:nline]
    print '*** %s lines xlsx benchmark ***' % nline
    t1 = time.time()
    apply(func, (fn_prefix, data))
    tm = time.time() - t1
    print '%s: %0.3f' % (func.__name__, tm)
    print ''


def test_xlsxwriter_to_excel(df, fn):
    print '*** to_excel xlsx benchmark ***'
    t1 = time.time()
    df.to_excel(fn + '.xlsx', sheet_name='Sheet1', index=False)
    tm = time.time() - t1
    print 'pandas to_excel (%s): %0.3f' % (pd.get_option('io.excel.xlsx.writer'), tm)


if __name__ == "__main__":
    fn = 'EUtranCellTDD.csv'
    df = pd.read_csv(fn, encoding='GBK', na_values=["No value"])  # , nrows=1000
    df = df.fillna(0)
    # test_xlsxwriter_to_excel(df, fn)
    csv_data = df.values.tolist()
    # bench1()
    # For monitoring process memory usage, run one writer at a time.
    func = (test_pyexcelerate, test_libxl, test_xlsxwriter)
    bench2(func[2], fn, len(csv_data), csv_data)

The test sample is a single CSV file with 50 columns and 97,613 rows; 5 columns contain English strings and the rest are numeric values.

The test results are as follows:

*** 1000 lines xlsx benchmark ***
test_pyexcelerate: 0.633
test_libxl:        0.531
test_xlsxwriter:   1.456

*** 10000 lines xlsx benchmark ***
test_pyexcelerate: 5.974
test_libxl:        5.475
test_xlsxwriter:   13.685

*** 97613 lines xlsx benchmark ***
test_pyexcelerate: 54.461    memory peak 644 MB
test_libxl:        51.090    memory peak 1070 MB
test_xlsxwriter:   117.300   memory peak 930 MB

*** to_excel xlsx benchmark ***
pandas to_excel (xlsxwriter): 239.821    memory peak 922 MB

The results show that the speed difference between PyExcelerate and LibXL is very small, and PyExcelerate uses less memory.

In the official PyExcelerate benchmark I had seen earlier, PyExcelerate is almost twice as fast as XlsxWriter; but because my first attempts generated the Excel files through pandas, which is very slow, I mistakenly assumed PyExcelerate would also be slow.

So I added the last test item. The result shows that even though the XlsxWriter engine is used underneath, pandas' to_excel adds considerable overhead of its own, so try to avoid using to_excel to generate large files.

Summary: PyExcelerate is the recommended choice.

The file PyExcelerate produces is the largest; unzipping it shows that it does not use sharedStrings at all, the strings are inlined directly into the XML. Cheating by default, haha.
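An xlsx file is just a zip archive, so this is easy to check yourself. A small sketch; the file name follows the naming pattern in the script above and is otherwise hypothetical:

import zipfile

with zipfile.ZipFile('EUtranCellTDD_97613_pyexcelerate.xlsx') as z:
    names = z.namelist()
    print '\n'.join(names)
    print 'uses sharedStrings: %s' % ('xl/sharedStrings.xml' in names)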

XlsxWriter can also be configured to skip sharedStrings; by default it does use sharedStrings.

Not using sharedStrings appears to save memory, but it makes the generated file larger.
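If I read the XlsxWriter documentation correctly, the relevant switch is the constant_memory option, which streams rows to disk and writes strings in-line instead of building a sharedStrings table. A minimal sketch; the file name and data are made up:

import xlsxwriter

workbook = xlsxwriter.Workbook('inline_strings.xlsx', {'constant_memory': True})
worksheet = workbook.add_worksheet()
for nrow, rec in enumerate([[1.0, 'a'], [2.0, 'b']]):    # tiny example data
    for ncol, val in enumerate(rec):
        worksheet.write(nrow, ncol, val)   # in this mode rows must be written in order
workbook.close()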
