Using Python to process a data set of about 1 GB is running very slowly. How can it be optimized?

Source: Internet
Author: User
My research direction is recommender systems. I recently implemented a simple tag-based recommendation algorithm in Python on the Delicious dataset and then computed recall and precision. The run time on a small dataset of a few MB is acceptable (10-odd seconds), but it runs very slowly on a large dataset (hundreds of MB, up to 1 GB): I waited 4 hours before getting results. Without optimizing the algorithm itself, what methods can be used to improve the speed of the program?
Lab environment: Ubuntu 13.10, 4 GB RAM, Intel i3-2310M, Python 2.7.5.

Reply content:

There are two likely reasons for this:

First, the algorithm itself. With a high-complexity algorithm, the speed gap grows larger and larger as the data size grows. You haven't described the exact algorithm, so we can't say how to improve it. In my experience, though, it is normal for machine learning algorithms to be slow, because the amount of computation is huge. If you are following existing published methods, most of them already strike a good balance between algorithmic complexity and code complexity; improving on them means investing a lot of time in academic research, or a lot of time writing complicated code.

The solution is to analyze your program yourself: determine the complexity of each part, find the algorithmic bottleneck, and then concentrate your optimization effort on that bottleneck.
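One low-tech way to do that analysis is to time a suspect step on growing slices of the data and watch how the runtime scales. The sketch below is illustrative: `empirical_scaling` and `quadratic_step` are made-up names, and the quadratic workload stands in for whatever step you suspect in your own program.

```python
import time

def empirical_scaling(work, data, fractions=(0.25, 0.5, 1.0)):
    """Time `work` on growing slices of `data` to see how runtime scales.

    If doubling the input roughly doubles the time, that step is about
    O(n); if the time roughly quadruples, it is about O(n^2) and is a
    likely bottleneck.
    """
    timings = []
    for frac in fractions:
        subset = data[: int(len(data) * frac)]
        start = time.time()
        work(subset)
        timings.append((len(subset), time.time() - start))
    return timings

# A deliberately quadratic stand-in: its time grows ~4x for every 2x data.
def quadratic_step(items):
    return sum(1 for a in items for b in items if a < b)

for n, elapsed in empirical_scaling(quadratic_step, list(range(2000))):
    print(n, round(elapsed, 3))
```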

The second problem is Python's well-known slowness. As a dynamic language built entirely on an interpreter, with OO support, FP support, and dynamic typing, Python leaves very little room for machine-instruction-level optimization; being 10-100x slower than a native program is considered normal.

Workaround: a quick fix is to use a JIT-compiled implementation such as PyPy, which can increase speed by several times, up to around 10x. Beyond that, use profiling to find the runtime bottleneck and rewrite that bottleneck in C; that part can then run at close to native speed.
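The profiling step can be sketched with the standard library's cProfile (shown here in Python 3 syntax; the workload is made up, contrasting membership tests against a list versus a set, a classic Python bottleneck):

```python
import cProfile
import io
import pstats

def count_hits(queries, corpus):
    # `q in corpus` is O(len(corpus)) for a list but O(1) for a set,
    # so the container choice dominates the runtime here.
    return sum(1 for q in queries if q in corpus)

queries = list(range(0, 20000, 2))   # 10000 queries
corpus = list(range(10000))

profiler = cProfile.Profile()
profiler.enable()
hits_list = count_hits(queries, corpus)       # list lookups: slow
hits_set = count_hits(queries, set(corpus))   # set lookups: fast
profiler.disable()

# Show the three most expensive functions by their own (tottime) cost;
# the list-based call will dwarf the set-based one.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(3)
print(stream.getvalue())
assert hits_list == hits_set == 5000
```

Once the report names the hot function, that is the candidate for rewriting in C (or Cython).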

Finally, in this multi-core and cloud era, you should consider multiple cores and even multiple machines. Because of the GIL, a single Python process cannot do compute-bound multithreading, so split your program cleanly into multiple processes. Then run them simultaneously on the cores of a single machine, or across several machines. OP, let me give you some practical advice!
    1. Consider rewriting in C or C++.
    2. Consider going parallel: find a Hadoop cluster, write a MapReduce program, and run it on Hadoop. Then no amount of data is scary.
    3. Consider upgrading the machine: get more memory, and then try to keep everything in memory.
    4. Consider program optimization.
    • To see where your program is slow, follow these steps:
      1. First of all, make sure you really need to scan all the data. If you can filter out useless data with some rough, fast pass, do that first. (For example, obviously useless records can be filtered out directly with grep; grep is generally much, much faster than the Python program you would write.)
      2. Run top: is the CPU maxed out?
      3. Is it a single-threaded implementation? Could much of it run in parallel? Check in top whether every core is maxed out.
      4. If the cores are not fully used, try to make the most of the CPU and get it running at full load! Look at the program: is it failing to saturate the CPU because of IO? Can the IO be made asynchronous? Are there too many IO operations, and can their number be reduced? Ideally do just one IO pass: at 1 GB, your data can be read into memory all at once and then processed entirely in memory (though this tends to be more convenient to write in C).
      5. If every core is maxed out, look at where the computation time goes, using a profiling tool such as hotshot. You can roughly compare hotshot results on 1/16, 1/8, 1/4, and 1/2 of the data to see how each function's time scales. Find the one or few things that take the most time (the so-called bottlenecks) and optimize those specifically; that gives the most result for the least effort.
      6. Once you have found the problem, pick a remedy. If Python's built-in data structures are not suitable, something like NumPy may solve it, or a database may solve it (for example, if multiple processes need to write into one big dictionary, consider moving the whole thing into Redis). Some spots can be handled by wrapping a C implementation with Cython.
      7. If the algorithm itself is not good enough, there is no choice but to optimize the algorithm. (That's a long story.)
  • Try some exotic things, like PyPy.

  • For the single-machine case, in summary: first reduce the input data, then don't waste machine resources. Get every CPU core running at full load (multi-process, and reduce or avoid waiting on IO), and use as much memory as you have! Then find the slowest spots and apply all sorts of optimizations to the program.
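The "get every core running" advice can be sketched with the standard library's multiprocessing module. This is a minimal illustration, not the poster's actual code: the tag-count workload, the 4-way chunking, and the function names are all made up for the example.

```python
from multiprocessing import Pool

def count_tags(chunk):
    # CPU-bound work on one chunk of records. Each worker is a separate
    # process, so the GIL does not serialize the computation.
    counts = {}
    for record in chunk:
        for tag in record.split(","):
            counts[tag] = counts.get(tag, 0) + 1
    return counts

def merge(dicts):
    # Combine the per-process partial counts into one total.
    total = {}
    for d in dicts:
        for key, value in d.items():
            total[key] = total.get(key, 0) + value
    return total

if __name__ == "__main__":
    records = ["music,rock", "music,jazz", "rock"] * 1000
    # Round-robin split into 4 chunks, one per worker process.
    chunks = [records[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        partials = pool.map(count_tags, chunks)
    totals = merge(partials)
    print(totals["music"])  # 2000
```

The `if __name__ == "__main__":` guard matters: on platforms that spawn rather than fork, worker processes re-import the module, and unguarded Pool creation would recurse.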

    If you have more than one machine, get onto Hadoop; then no amount of data is scary!
    With the Delicious dataset, even the most naive count of (u,t) * (t,i), plus inverse-frequency weighting, is slow... after all, there are far too many tags and items... slow is normal. First confirm your algorithm's complexity: for example, when the data doubles, by what factor does the running time grow? (See numfocus/python-benchmarks on GitHub.) Profile + Cython is generally the least labor-intensive route and the easiest improvement, short of redesigning the algorithm: use a profiler to decide what in the implementation to optimize.
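The MapReduce suggestion can be sketched in the Hadoop Streaming style, simulated locally. The tag-count job is illustrative only; on a real cluster the mapper and reducer would be separate scripts reading stdin, with Hadoop doing the shuffle/sort between them.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit one (tag, 1) pair per tag occurrence,
    # as Hadoop Streaming would via "key\tvalue" lines on stdout.
    for line in lines:
        for tag in line.strip().split(","):
            yield tag, 1

def reducer(pairs):
    # Reduce phase: Hadoop sorts mapper output by key before the
    # reducer sees it, so equal keys arrive contiguously and
    # groupby can sum each run.
    for tag, group in groupby(pairs, key=lambda kv: kv[0]):
        yield tag, sum(count for _, count in group)

# Local simulation of the map -> shuffle/sort -> reduce pipeline:
lines = ["music,rock", "rock", "music,jazz"]
shuffled = sorted(mapper(lines))
print(dict(reducer(shuffled)))  # {'jazz': 1, 'music': 2, 'rock': 2}
```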
    Second, use PyPy or Cython.
    Third, use NumPy.
    Finally, switch to a different language. Python array traversal is particularly slow and can be accelerated with Cython. i3-2310M? So the experimental environment is an entry-level notebook; just how strapped is your lab (or company)? If NumPy is still relatively slow and the matrix computations are large, you can try MATLAB. You can also profile your program to see which step takes the longest.
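To illustrate the NumPy point: the same kernel written as a plain-Python traversal versus NumPy's compiled array operations. Cosine similarity is just a representative recommendation-style computation, not taken from the question; both versions compute the same value, but the NumPy one runs its loops in compiled code.

```python
import numpy as np

def cosine_py(a, b):
    # Plain Python: one interpreted loop iteration per element.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def cosine_np(a, b):
    # NumPy: identical arithmetic, but the loops run in compiled C.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

u = [1.0, 2.0, 3.0]
v = [3.0, 2.0, 1.0]
assert abs(cosine_py(u, v) - cosine_np(u, v)) < 1e-12
```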