Processing a 1 GB dataset in Python is very slow. How can I optimize it?

Source: Internet
Author: User
My research area is recommender systems. I recently used Python to implement a simple tag-based recommendation algorithm on the Delicious dataset and then computed recall and precision. The running time on a small dataset of a few MB is acceptable (about 10 seconds), but on a large dataset (several hundred MB to 1 GB) it is very slow; I have waited four hours and still have no result. Beyond optimizing the algorithm itself, what methods can be used to speed up the program?
Experimental environment: Ubuntu 13.10, 4 GB RAM, Intel i3-2310M, Python.

Reply: There are two likely causes:

First, the algorithm itself. Algorithms of different complexity diverge sharply in running time once the data gets large. You didn't describe your specific algorithm, so it is hard to suggest concrete improvements. In my experience, though, machine learning algorithms are slow mostly because of the sheer amount of computation involved. For many standard steps, if you follow existing methods, the trade-off between algorithmic complexity and code complexity is already well balanced and the asymptotic complexity is already good; improving on it requires either serious academic research or much more complicated code.

The solution is to profile your program, determine the complexity of each part, identify the algorithmic bottleneck, and then focus your optimization effort on that bottleneck.
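A minimal sketch of that kind of profiling with the standard-library cProfile; `parse` and `score` are hypothetical stand-ins for the real recommendation steps, not the asker's actual code:

```python
import cProfile
import pstats

def parse(lines):
    # Cheap step: split tab-separated "user<TAB>tag" records.
    return [line.split("\t") for line in lines]

def score(records):
    # Expensive step: a deliberately quadratic pair count.
    hits = 0
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i][1] == records[j][1]:
                hits += 1
    return hits

def run():
    lines = [f"user{i}\ttag{i % 10}" for i in range(1500)]
    return score(parse(lines))

profiler = cProfile.Profile()
profiler.enable()
result = run()
profiler.disable()

# The report shows score() dominating total time, so that is where
# optimization effort should go.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

The sorted report makes it obvious which function dominates; optimizing anything else first would be wasted effort.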

The second cause is Python's well-known slowness. As a language built entirely on an interpreter, supporting OO, FP, and dynamic typing, Python leaves very little room for optimization down to machine instructions. Being roughly 10x slower than a native program is generally considered normal.

Solution: a quick workaround is a JIT compiler such as PyPy, which can speed things up by several times to roughly 10x. Beyond that, if you use profiling to locate the runtime bottleneck, you can rewrite that bottleneck in C and get close to native speed.
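Even before writing any C yourself, you can see how much of the cost is interpreter overhead by moving a hot loop onto a C-implemented builtin. A small, self-contained comparison (absolute numbers will vary by machine):

```python
import timeit

def py_sum(values):
    # Pure-Python loop: every iteration pays interpreter overhead.
    total = 0.0
    for v in values:
        total += v
    return total

values = list(range(100_000))

t_loop = timeit.timeit(lambda: py_sum(values), number=20)
t_builtin = timeit.timeit(lambda: sum(values), number=20)  # sum() runs in C

print(f"pure-Python loop: {t_loop:.4f}s, builtin sum: {t_builtin:.4f}s")
```

Rewriting the real bottleneck as a C extension (or with Cython) buys the same kind of gap for code that has no ready-made builtin.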

Finally, in this multi-core and cloud era, you should consider multiple cores or even multiple machines. Python has the GIL, so a single process cannot use multiple threads for CPU-bound work. Split the parallelizable parts of your program into multiple processes, then run them on the multiple CPUs of one machine at the same time, or across several machines.

Reply: Let me give you some practical suggestions!
  1. Consider rewriting it in C or C++.
  2. Consider parallel execution: find a Hadoop cluster and write MapReduce programs to run on it. Then no amount of data is a problem.
  3. Consider upgrading the machine: more memory, and keep as much as possible in memory.
  4. Consider optimizing the program itself.
  • Take the following steps to find out where your program is slow:
    1. First, make sure you really need all the data; filtering out useless data up front is best. (For example, obviously useless records can be filtered out directly with grep, and grep is generally much faster than any Python program you write.)
    2. Run top and check whether the CPU is saturated.
    3. Is the implementation single-threaded and single-process? Can you split it into multiple processes? Then check whether every core is saturated.
    4. If the cores are not full, make full use of your CPU: make the CPU saturated! Examine the program: is it blocked on I/O? Is the I/O asynchronous? Are there too many I/O operations, and can they be reduced, ideally to a single pass? For example, can you load everything into memory at once and then do all the processing in memory? (That part may be easier to write in C.)
    5. If every core is full, profile with hotshot or a similar tool. Roughly compare the profiling results on 1/16, 1/8, 1/4, and 1/2 of the data to see how each function's cost grows. Find the one or few functions that take the most time (the so-called bottlenecks) and optimize them specifically; that gives twice the result for half the effort.
    6. Once the problem is found, look for a fix. If a Python data structure is unsuitable, can numpy or something similar solve it? Would a database help? (For example, if multiple processes need to write into one big dictionary together, consider writing everything to a Redis instance.) Could you wrap a C implementation with Cython?
    7. If the algorithm itself is not good enough, optimize the algorithm. (That is a long story.)
  • Try something more exotic, such as PyPy.
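Step 5's scaling check can be sketched with the standard library alone; `count_shared_tags` below is a hypothetical quadratic stand-in for the real recommendation step, not the asker's actual code:

```python
import time

def count_shared_tags(records):
    # Hypothetical O(n^2) stand-in for the real algorithm.
    shared = 0
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i] == records[j]:
                shared += 1
    return shared

data = [x % 20 for x in range(4000)]

timings = []
for fraction in (4, 2, 1):  # 1/4, 1/2, and all of the data
    subset = data[: len(data) // fraction]
    start = time.perf_counter()
    count_shared_tags(subset)
    timings.append((len(subset), time.perf_counter() - start))

for size, seconds in timings:
    print(f"{size} records: {seconds:.3f}s")
# Runtime roughly quadrupling each time the data doubles points at O(n^2).
```

If the measured times grow much faster than the data size, the algorithm, not the language, is the first thing to fix.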

  • To sum up for a single machine: first reduce the input data, then don't waste machine resources. Keep every CPU core fully busy (use multiple processes, and reduce or avoid waiting on I/O), and use as much memory as you have. Then find the slowest part of the program and optimize it in every way you can.
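The multi-process split recommended above can be sketched with the standard-library multiprocessing module; `score_chunk` is a hypothetical stand-in for whatever per-chunk work the real program does:

```python
from multiprocessing import Pool

def score_chunk(chunk):
    # Hypothetical stand-in for scoring one slice of users/items.
    return sum(x * x for x in chunk)

def parallel_score(data, workers=4):
    # Split the data into one chunk per worker process; the GIL does not
    # apply across processes, so each chunk runs on its own CPU core.
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(processes=workers) as pool:
        partial = pool.map(score_chunk, chunks)
    return sum(partial)

if __name__ == "__main__":
    print(parallel_score(list(range(1000))))
```

This pattern only pays off when each chunk does enough work to outweigh the cost of starting processes and shipping data between them.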

If you have multiple machines, put them into Hadoop, and then no amount of data will scare you!

Reply: On the Delicious dataset, even the naive count(u, t) * (t, i) plus inverse-frequency weighting is slow... After all, there are a lot of tags and items, so being slow is normal. First check the complexity of your algorithm: for example, how much does the running time grow when the data doubles? The numfocus/python-benchmarks repository on GitHub shows how the options compare. Profile + Cython is generally the most effort-saving way to get a big improvement, but in order:
First, optimize the algorithm, using profiling to guide the implementation.
Second, use PyPy or Cython.
Then use numpy.
Finally, use another language.

Reply: Python array traversal is particularly slow; it can be accelerated with Cython. As for the i3-2310M: your lab environment is actually an entry-level notebook. How badly off is your lab (company)?

Reply: numpy is relatively slow; for heavy matrix operations you can try Matlab. Also, profile your program to see which part takes the longest.
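To illustrate the "then use numpy" step: scoring that is slow as a Python-level traversal often collapses into one vectorized call. The dot-product scoring below is a made-up example, not the asker's algorithm:

```python
import numpy as np

def dot_scores_loop(matrix, query):
    # Pure-Python traversal: slow for large matrices.
    scores = []
    for row in matrix:
        scores.append(sum(a * b for a, b in zip(row, query)))
    return scores

def dot_scores_numpy(matrix, query):
    # The same dot products as a single vectorized matrix-vector product,
    # computed in C inside numpy instead of in the interpreter.
    return matrix @ query

items = np.arange(12.0).reshape(4, 3)   # 4 items x 3 tag weights
user = np.array([1.0, 0.0, 2.0])

print(dot_scores_numpy(items, user))
```

The vectorized version avoids per-element interpreter overhead entirely, which is exactly where Python array traversal loses its time.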
