Redis Data Import Tool Optimization Process Summary
Background
A Redis data import tool was developed in C++;
It imports all table data from Oracle into Redis;
Rather than a simple copy, each original Oracle record must be processed by business logic, and indexes (Redis sets) must be added;
After the tool was completed, performance turned out to be the bottleneck.
Optimization results
Two sample data sets were used for testing:
Table A of the sample data contains 8,763 records;
Table B contains 940,279 records;
Before optimization, importing Table A took 11.417 s;
After optimization, importing Table A takes 1.883 s.
Tools used
gprof, strace, time
Use the time tool to view the time consumed by each run, including user time and system (kernel) time;
Use strace to print system calls as the process runs, identify the main system calls it makes, and find where the time goes;
Use gprof to collect per-function time statistics, then concentrate optimization effort on the functions that consume the most time.
A brief introduction to gprof usage:
1. The -pg option must be added to both the compile and the link options of g++ (on the first day, no statistical report could be generated because -pg was missing from the link options);
2. After the program runs, a gmon.out file is generated in the current directory;
3. Run gprof redistool gmon.out > report to generate a readable report file, then start with the most time-consuming functions listed in the report.
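
For example, assuming the tool is built from a single source file (an assumption; the post does not show the actual build commands):

g++ -pg -c redistool.cpp -o redistool.o     # compile with -pg
g++ -pg -o redistool redistool.o            # link with -pg as well
./redistool im a a.csv                      # run once; produces gmon.out
gprof redistool gmon.out > report           # generate the readable report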
Optimization process
Before optimization, 11.417 s (note the long system-call time):

time ./redistool im a a.csv
real    0m11.417s
user    0m6.035s
sys     0m4.782s
File memory mapping
The system-call time was too long, mainly in file reads and writes; the initial suspicion was that API calls were too frequent while reading the file;
The original code read the sample file line by line with fgets(); after mapping the file into memory with mmap(), the entire file can be traversed quickly through a pointer;
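
A minimal sketch of the mmap-based approach (the per-record hook handleRecord() is a hypothetical stand-in; the tool's actual parsing code is not shown in the post):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <string>

// Map the whole file into memory and walk it with a pointer,
// instead of issuing one fgets()/read() per line.
void processFile(const char* path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return; }

    void* mem = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       // the mapping remains valid after close()
    if (mem == MAP_FAILED) return;

    const char* p   = static_cast<const char*>(mem);
    const char* end = p + st.st_size;
    while (p < end) {
        const char* nl = static_cast<const char*>(
            std::memchr(p, '\n', end - p));
        size_t len = nl ? static_cast<size_t>(nl - p)
                        : static_cast<size_t>(end - p);
        std::string record(p, len);  // one line of the input file
        // handleRecord(record);     // hypothetical business-logic hook
        p += len + 1;
    }
    munmap(mem, st.st_size);
}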
Moving the log switch forward
After improving the file reads and writes, the gain turned out to be limited (about 2 seconds); fgets() is a buffered C library function on top of the read() system call, so it should not be that slow (some benchmarks online show memory mapping beating fgets() by an order of magnitude, but those scenarios seem rather special);
strace later revealed that log.dat was being opened an enormous number of times; it turned out the debug log switch was checked too late, so every debug log call still executed open("log.dat") to open the log file;
After moving the log switch check forward, the time improved to 3.53 s:
time ./redistool im a a.csv
real    0m3.530s
user    0m2.890s
sys     0m0.212s
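
A sketch of the pattern (the names g_logLevel and DebugLog* are assumptions for illustration; the post does not show the tool's logging code):

#include <cstdio>

static int g_logLevel = 0;  // hypothetical global switch; > 0 enables debug logs

// Before: log.dat was opened on every call, even with debug logging
// disabled, so each debug log cost an open/close pair of system calls.
void DebugLogBefore(const char* msg)
{
    FILE* fp = fopen("log.dat", "a");
    if (fp == NULL) return;
    if (g_logLevel > 0)              // switch checked after the open
        fprintf(fp, "%s\n", msg);
    fclose(fp);
}

// After: check the switch first; when debug logging is off, the call
// returns immediately without making any system calls at all.
void DebugLogAfter(const char* msg)
{
    if (g_logLevel <= 0)             // switch checked up front
        return;
    FILE* fp = fopen("log.dat", "a");
    if (fp == NULL) return;
    fprintf(fp, "%s\n", msg);
    fclose(fp);
}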
Pre-allocating vector space
gprof analysis showed that the vector in one function was allocated many times, with many element copies;
The following line of code was improved:

vector<string> vSegment;

by using a static vector variable and pre-allocating its memory:

static vector<string> vSegment;   // static: constructed only once per process
vSegment.clear();                 // reuse the same storage on every call
static int nCount = 0;
if (0 == nCount)                  // reserve capacity on the first call only
{
    vSegment.reserve(64);
}
++nCount;
After this optimization, the time improved to 2.286 s:
real    0m2.286s
user    0m1.601s
sys     0m0.222s
Similarly, a vector member in another class also gets its space pre-allocated (in the constructor):

m_vtPipecmd.reserve(256);
After this optimization, the time improved to 2.166 s:
real    0m2.166s
user    0m1.396s
sys     0m0.204s
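
For illustration (the enclosing class name is a hypothetical stand-in; only the member m_vtPipecmd appears in the post):

#include <string>
#include <vector>

class PipeCmdBuffer              // hypothetical class name
{
public:
    PipeCmdBuffer()
    {
        // Reserve once at construction so that later push_back calls
        // never trigger reallocation and element copying.
        m_vtPipecmd.reserve(256);
    }
private:
    std::vector<std::string> m_vtPipecmd;
};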
Function rewriting & inlining
Running the program again showed that the SqToolStrSplitByCh() function consumed too much time. The function's entire logic was rewritten, and the rewritten function was inlined:
After this optimization, the time improved to 1.937 s:
real    0m1.937s
user    0m1.301s
sys     0m0.186s
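
A sketch of what an inlined split-by-character function might look like (the real SqToolStrSplitByCh() signature and logic are not shown in the post, so this shape is an assumption):

#include <string>
#include <vector>

// Split sLine on the character ch, appending segments to vSegment.
// Scanning with find() and reusing the caller's pre-allocated vector
// avoids temporary allocations; inline removes the call overhead.
inline void SqToolStrSplitByCh(const std::string& sLine, char ch,
                               std::vector<std::string>& vSegment)
{
    std::string::size_type begin = 0;
    std::string::size_type pos;
    while ((pos = sLine.find(ch, begin)) != std::string::npos) {
        vSegment.push_back(sLine.substr(begin, pos - begin));
        begin = pos + 1;
    }
    vSegment.push_back(sLine.substr(begin));   // last segment
}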
Removing debug and profiling symbols
Finally, after removing the -g debug symbols and the -pg profiling instrumentation, the final result is 1.883 s:
real    0m1.883s
user    0m1.239s
sys     0m0.191s
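
For reference, the production build simply drops those flags (again assuming a single source file):

g++ -o redistool redistool.cpp              # -g and -pg removed for production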
Meeting production requirements
The last few steps above may look like sub-second improvements, but scaled up to the full table data the effect is obvious;
After optimization, Table A in production has about 1.52 million records, and the import takes about 326 s (~6 minutes);
Table B has about 4.2 million records, and the import takes about 1103 s (~18 minutes).
Posted by: Large CC | 28 Jun 2015
Blog: blog.me115.com
Github: Large CC