Python crawler thread pool and process pool

Source: Internet
Author: User

First, the demand

Recently prepared to crawl the data of an e-commerce website, first of all, not consider the agent, distributed, First said efficiency (of course, if you request too fast will be blocked, pro-test, 400 requests in the past, the server directly refused to connect, broken heart), into the topic. Under normal circumstances small white our first thought is for the loop, this is a single-threaded AH. Then we consider the for loop directly open his 5 threads, the problem comes, if there is a URL request has not come back, the back of the wait, so use multithreading equals useless, everywhere posted create stickers.

Second, performance considerations

Determine to use multi-threaded or multi-process, that we are using multi-threaded or multi-process, some people on the multi-process and multi-threaded have a certain prejudice, because of Python's Gil Lock, let us say the difference between the two things.

Three, Multithreading:

In general, we start a. py file, it is equal to start a process, a process inside the default has a thread to work, we use the multi-threading means to enable multiple threads within a process. But the question comes, why use multithreading? I know that when you start a process, you need to create some memory space, which is the equivalent of a house, we have to work in this House, you can think of a person is equal to a thread, you have 10 people in the house room with 20 people space, normally is not the same, Because we know that the threads and threads can communicate by default (the process is not communication by default, but it can be implemented by technology, such as pipelines). Can multithreading in order to ensure the correctness of the calculation data, so there is a Gil lock, to ensure that only one thread at a time in the calculation. Gil Lock you can basically understand that, for example, in this room to calculate an account, at the same time only one person in the account, think of a problem, if this account 5 people can calculate clearly, I need 10 square meters of room on line, then why should please 10 people, spend 20 square meters? So the more threads are not open, the better. But, however, note that you do not have to use the brain (CPU calculation) to calculate the account when you can do other things, for example, after the record in the ledger in order to face the account, this word everyone has their own ledger, so multithreading suitable for IO operation, remember that even if it is suitable for IO operation, it does not mean that the more the better , so this amount still depends on the actual situation.

Iv. Multi-process:

Summarize:

Python crawler thread pool and process pool

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.