Data Mining: Multi-process Crawl

Source: Internet
Author: User

It is often said that Python's multithreading can only use a single core: because of the global interpreter lock (GIL) in CPython, threads run concurrently by taking turns, not in parallel.

In this article, let's talk about the Python multi-process approach.

Multi-processing depends on the number of processor cores on the machine. On a multi-core machine, processes running on different cores execute in parallel. You can take advantage of a process pool to run one process on each core; when the number of processes in the pool is greater than the total number of cores, the extra processes wait until another process has finished and freed a core.
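As a rough illustration of the waiting behaviour described above, the sketch below submits more tasks than there are worker processes and records which process handled each one. The names report_worker and run_tasks, and the pool size of 2, are hypothetical choices for this example and do not come from the original code.

```python
import multiprocessing
import os
import time


def report_worker(task_id):
    """Pretend to do some work, then report which process ran the task."""
    time.sleep(0.1)  # stand-in for real work
    return task_id, os.getpid()


def run_tasks(num_workers, num_tasks):
    """Run num_tasks tasks on a pool of num_workers processes."""
    with multiprocessing.Pool(processes=num_workers) as pool:
        # map() blocks until every task has finished; tasks beyond
        # num_workers wait in a queue until a worker becomes free.
        return pool.map(report_worker, range(num_tasks))


if __name__ == "__main__":
    results = run_tasks(num_workers=2, num_tasks=6)
    worker_pids = {pid for _, pid in results}
    print("tasks completed:", len(results))
    print("distinct worker processes:", len(worker_pids))  # at most 2
```

All six tasks complete, but no more than two process IDs appear, because the remaining tasks are queued until one of the two workers is free.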

Multi-processing is comparable to the following ticket-sales scenario

Note that when the system has only one single-core CPU, true multi-processing does not occur; the processes then take turns occupying the CPU until each one completes.

We can use a Python statement to get the number of CPU cores available.
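A minimal sketch of such a statement, assuming the standard-library call multiprocessing.cpu_count() is what is meant (the variable name core_count is only illustrative):

```python
import multiprocessing
import os

# Number of logical CPU cores reported by the operating system.
core_count = multiprocessing.cpu_count()
print("CPU cores:", core_count)

# os.cpu_count() reports the same figure, but may return None
# if the count cannot be determined.
print("os.cpu_count():", os.cpu_count())
```

The value is a reasonable upper bound for the processes argument of a process pool on that machine.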

For comparison, we reuse the example from the previous article: crawling book product information by search keyword.

First, write the main multi-process method:

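# coding=utf-8
__author__ = "Susmote"

from multi_threading import mining_func
import multiprocessing
import time


def multiple_process_test():
    start_time = time.time()
    # (first_page, last_page) for each worker process
    page_range_list = [
        (1, 10),
        (11, 20),  # end of this range is assumed; it is garbled in the source
        (21, 32),
    ]
    # A pool of 3 worker processes, one per page range
    pool = multiprocessing.Pool(processes=3)
    for page_range in page_range_list:
        # Submit the crawl task without blocking the loop
        pool.apply_async(
            mining_func.get_urls_in_pages,
            (page_range[0], page_range[1]),
        )
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for all worker processes to finish
    end_time = time.time()
    print("Crawl time:", end_time - start_time)
    return end_time - start_time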

Let me briefly explain how the multi-process code works.

Pool defines a process pool that can run 3 processes in parallel. The loop then uses the apply_async method to submit each task to the pool, so the tasks run in parallel and asynchronously.
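The pattern can be tried without any network access. The sketch below follows the same Pool, apply_async, close, join sequence, but swaps the crawler for a hypothetical stand-in function, count_pages, and uses example page ranges. It also keeps the AsyncResult objects returned by apply_async so the return values can be collected; run_in_pool is likewise an illustrative name.

```python
import multiprocessing


def count_pages(first_page, last_page):
    """Hypothetical stand-in for a crawl task: count pages in a range."""
    return last_page - first_page + 1


def run_in_pool(page_ranges, processes=3):
    """Submit one task per page range and collect the results in order."""
    pool = multiprocessing.Pool(processes=processes)
    # apply_async returns immediately with an AsyncResult handle.
    pending = [pool.apply_async(count_pages, page_range)
               for page_range in page_ranges]
    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for all worker processes to finish
    # get() returns the task's value, or re-raises its exception.
    return [result.get() for result in pending]


if __name__ == "__main__":
    print(run_in_pool([(1, 10), (11, 20), (21, 32)]))  # [10, 10, 12]
```

Keeping the handles matters in practice: an exception raised inside a task submitted with apply_async is stored in its AsyncResult and only surfaces when get() is called, so a failed crawl can otherwise go unnoticed.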

Here is the main function:

# coding=utf-8
__author__ = "Susmote"

from process_func import multiple_process_test

# The guard keeps worker processes from re-running this block on import
if __name__ == "__main__":
    pt = multiple_process_test()
    print("pt:", pt)

  

Run the code to get the following results.

5.908

Run it one more time

3.954

And a final time

4.163

Take the average time

4.675 seconds
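As a quick check of the arithmetic, the three timings listed above can be averaged in Python; they come to roughly 4.675 seconds. The variable names are illustrative only.

```python
from statistics import mean

# Crawl times in seconds from the three runs listed above
run_times = [5.908, 3.954, 4.163]
average = mean(run_times)
print(f"Average crawl time: {average:.3f} seconds")
```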

Now let's review the multithreading results from the previous article (same network conditions):

Multithreading

Single Thread

The gap is very clear: multi-processing has the biggest advantage.

That's all for multi-processing. You can also find a larger data set to test these methods on.
