For work, I need to use pandas to filter several CSV files and merge the filtered results into a single xlsx file.
The resulting xlsx files are fairly large (10-30 MB).
At first I used the engines pandas supports out of the box: openpyxl and XlsxWriter (files with a .xlsx extension are written with xlsxwriter if it is available, otherwise with openpyxl).
Performance was not ideal; writing the output took several minutes.
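For context, the workflow looks roughly like this. This is only a minimal sketch: the file pattern, the 'status' column and the filter value are placeholders I made up, not details from the post.

import glob
import pandas as pd

# Filter several CSVs with pandas and merge the results into one xlsx.
# 'input_*.csv', the 'status' column and the filter value are illustrative only.
frames = []
for path in glob.glob('input_*.csv'):
    df = pd.read_csv(path, encoding='GBK')
    frames.append(df[df['status'] == 'active'])

merged = pd.concat(frames, ignore_index=True)
# pandas writes .xlsx with xlsxwriter if it is installed, otherwise openpyxl;
# the engine can also be forced explicitly.
merged.to_excel('merged.xlsx', index=False, engine='xlsxwriter')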
A Google search turned up PyExcelerate, which bills itself as an "accelerated Excel XLSX writing library for Python".
I glanced at the benchmark published by the project, but the margins looked small, so at the time I figured the improvement over the previous two libraries would be limited and did not look further.
Instead, I went looking for a C/C++ dynamic-library solution and found LibXL.
The LibXL dynamic library is a tool for manipulating Excel documents directly from C/C++; its homepage benchmark says:
High Performance
Writing speed is about 2,100,000 cells per second for numbers and 240,000 cells per second for 8-character random strings, in binary xls format (CPU 3.2 GHz).
Add in the halo of C, and it looked very impressive (looking back now, I notice that figure is for binary xls, not xlsx).
But LibXL is commercial software, priced at $199.
The trial version limits the number of rows you can read, and inserts an extra first row into every Excel file it writes:
Created by LibXL trial version. Please buy the LibXL full version for removing this message.
Since the xlsx files I need to generate are tens of MB each, opening every one, deleting the first row, and saving it again would be rather inconvenient.
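For reference, the banner row could also be stripped programmatically rather than by hand. This is only a sketch using openpyxl, assuming a version that provides delete_rows and placeholder file names; re-reading and re-saving a file of tens of MB is itself slow, which is exactly the inconvenience.

from openpyxl import load_workbook

# Sketch: remove the LibXL trial banner (row 1) from a generated file.
# Assumes an openpyxl version that provides Worksheet.delete_rows.
wb = load_workbook('report_libxl.xlsx')   # slow for files of tens of MB
ws = wb.active
ws.delete_rows(1)                         # drop the trial-version banner row
wb.save('report_libxl_clean.xlsx')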
I then tried several cracked versions, but none of them seemed to work; the purchase message was still there...
In the meantime, I found a Python wrapper for LibXL, libxlpy, on the Internet. To get its C extension to compile on 64-bit Windows 8, I installed Visual Studio 2008 Express, the Microsoft Visual C++ Compiler for Python 2.7, and the Windows SDK in turn, but never managed to get it to build, so I gave up.
Coming back to PyExcelerate, I noticed that the library's authors are two Chinese engineers working at Google:
Authors: Kevin Wang and Kevin Zhang
Kevin Wang, Kevin Zhang. Portions Copyright Google, Inc.
I grabbed a file and gave it a quick try; it felt like it should be comparable to LibXL, so I wrote a script to benchmark the three approaches properly.
Let's get started!
System environment:
CPU: i5-3337U, RAM: 8 GB, SSD
Windows 8.1 64-bit
Python 2.7.9
PyExcelerate 0.6.7
LibXL 3.6.2
XlsxWriter 0.7.3
import time, ctypes
import pandas as pd
import pyexcelerate, xlsxwriter


def test_pyexcelerate(fn_prefix, data):
    wb = pyexcelerate.Workbook()
    wb.new_sheet("sheet name", data=data)
    wb.save(fn_prefix + "_pyexcelerate.xlsx")


def test_libxl(fn_prefix, data):
    dll = ctypes.windll.LoadLibrary('bin64_libxl_362.dll')
    book = dll.xlCreateXMLBookCA()
    sheet = dll.xlBookAddSheetA(book, "Sheet10", 0)
    for nrow, rec in enumerate(data, start=1):  # reserve the first row for 'Created by LibXL Trial Version ...'
        for ncol, val in enumerate(rec, start=0):
            # print 'row: %s, col: %s, type: %s' % (nrow, ncol, type(val))
            if isinstance(val, float):
                dll.xlSheetWriteNumA(sheet, nrow, ncol, ctypes.c_double(val), 0)
            else:
                dll.xlSheetWriteStrA(sheet, nrow, ncol, ctypes.c_char_p(val), 0)
    dll.xlBookSaveA(book, fn_prefix + "_libxl.xlsx")
    dll.xlBookReleaseA(book)


def test_xlsxwriter(fn_prefix, data):
    workbook = xlsxwriter.Workbook(fn_prefix + "_xlsxwriter.xlsx")
    worksheet = workbook.add_worksheet()
    for nrow, rec in enumerate(data):
        for ncol, val in enumerate(rec):
            worksheet.write(nrow, ncol, val)
    workbook.close()


def bench1():
    for nline in (1000, 10000, len(csv_data)):
        fn_prefix = fn[:-4] + '_' + str(nline)
        data = csv_data[:nline]
        print '*** %s lines xlsx benchmark ***' % nline
        for func in (test_pyexcelerate, test_libxl, test_xlsxwriter):
            t1 = time.time()
            apply(func, (fn_prefix, data))
            tm = time.time() - t1
            print '%s: %0.3f' % (func.__name__, tm)
        print ''


def bench2(func, fn, nline, csv_data):
    fn_prefix = fn[:-4] + '_' + str(nline)
    data = csv_data[:nline]
    print '*** %s lines xlsx benchmark ***' % nline
    t1 = time.time()
    apply(func, (fn_prefix, data))
    tm = time.time() - t1
    print '%s: %0.3f' % (func.__name__, tm)
    print ''


def test_xlsxwriter_to_excel(df, fn):
    print '*** to_excel xlsx benchmark ***'
    t1 = time.time()
    df.to_excel(fn + '.xlsx', sheet_name='Sheet1', index=False)
    tm = time.time() - t1
    print 'pandas to_excel (%s): %0.3f' % (pd.get_option('io.excel.xlsx.writer'), tm)


if __name__ == "__main__":
    fn = 'EUtranCellTDD.csv'
    df = pd.read_csv(fn, encoding='GBK', na_values=["No value"])  # , nrows=1000
    df = df.fillna(0)
    # test_xlsxwriter_to_excel(df, fn)
    csv_data = df.values.tolist()
    # bench1()
    # Run one writer at a time via bench2 for monitoring process memory usage.
    func = (test_pyexcelerate, test_libxl, test_xlsxwriter)
    bench2(func[2], fn, len(csv_data), csv_data)
The test sample is a single CSV file with 50 columns and 97,613 rows; 5 of the columns are English strings and the rest are numeric values.
The test results are as follows:
*** 1000 lines xlsx benchmark ***
test_pyexcelerate: 0.633
test_libxl: 0.531
test_xlsxwriter: 1.456

*** 10000 lines xlsx benchmark ***
test_pyexcelerate: 5.974
test_libxl: 5.475
test_xlsxwriter: 13.685

*** 97613 lines xlsx benchmark ***
test_pyexcelerate: 54.461   (memory peak 644 MB)
test_libxl: 51.090          (memory peak 1070 MB)
test_xlsxwriter: 117.300    (memory peak 930 MB)

*** to_excel xlsx benchmark ***
pandas to_excel (xlsxwriter): 239.821   (memory peak 922 MB)
The results show that the speed gap between PyExcelerate and LibXL is very small, and PyExcelerate wins on memory.
In the official PyExcelerate benchmark I had seen earlier, PyExcelerate is only about twice as fast as XlsxWriter; because my first attempt, generating the Excel file through pandas, was so slow, I had mistakenly assumed PyExcelerate would be slow as well.
That is why I added the last test item. It shows that even though pandas uses the XlsxWriter engine underneath, to_excel does a lot of extra work for you and pays for it everywhere, so avoid to_excel when generating large files.
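In other words, when the data is already in a DataFrame, it is cheaper to hand the raw values to PyExcelerate yourself instead of going through to_excel. A minimal sketch follows; the header-row handling and file names are my own additions, not from the benchmark script.

import pandas as pd
import pyexcelerate

# Sketch: skip to_excel and feed the DataFrame's raw values to PyExcelerate.
df = pd.read_csv('input.csv', encoding='GBK').fillna(0)
rows = [list(df.columns)] + df.values.tolist()   # prepend the column headers

wb = pyexcelerate.Workbook()
wb.new_sheet('Sheet1', data=rows)
wb.save('output_pyexcelerate.xlsx')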
Summary: PyExcelerate is the recommended choice.
The file PyExcelerate produces is the largest of the three; unzipping it reveals that it does not use sharedStrings at all and writes strings inline into the XML. So it cheats a little by default, haha.
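Since an xlsx file is just a zip archive, this is easy to verify without fully unpacking it; a quick sketch (the file names are placeholders):

import zipfile

# Inline-string files have no xl/sharedStrings.xml part inside the xlsx zip.
for path in ('out_pyexcelerate.xlsx', 'out_xlsxwriter.xlsx'):
    with zipfile.ZipFile(path) as z:
        has_sst = 'xl/sharedStrings.xml' in z.namelist()
        print('%s: sharedStrings %s' % (path, 'present' if has_sst else 'absent'))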
XlsxWriter can also be told, via a parameter, not to use sharedStrings; by default it does use them.
Skipping sharedStrings appears to save memory, but it makes the generated file larger.
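If memory rather than file size is the main concern with XlsxWriter, the parameter presumably meant here is the documented constant_memory workbook option, which streams row data and writes strings inline instead of building a shared strings table. A minimal sketch, with placeholder data:

import xlsxwriter

data = [[1, 'alpha'], [2, 'beta']]   # placeholder rows

# constant_memory mode streams row data to temporary files and writes
# strings inline rather than into a shared strings table.
workbook = xlsxwriter.Workbook('out_constant_memory.xlsx',
                               {'constant_memory': True})
worksheet = workbook.add_worksheet()
for nrow, rec in enumerate(data):
    for ncol, val in enumerate(rec):
        worksheet.write(nrow, ncol, val)   # rows must be written in order in this mode
workbook.close()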
Who is the fastest xlsx writer in Python?