Introduction to multi-threaded crawlers

Source: Internet
Author: User

A crawler that runs as a single thread in its process is what we call a single-threaded crawler. A single-threaded crawler accesses only one page at a time and does not take full advantage of the computer's network bandwidth. A page is at most a few hundred KB, so while the crawler fetches one page, the rest of the available bandwidth is wasted.

If we can get the crawler to access 10 pages at the same time, our crawl speed increases roughly tenfold. This is where multithreading comes in.
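As a rough illustration of the idea, the sketch below compares fetching 10 pages one by one with fetching them through a pool of 10 threads. It is not taken from the original article: the `fetch` function is a hypothetical stand-in that sleeps instead of making a real HTTP request, and the `example.com` URLs and the 0.1 second delay are assumptions chosen only to keep the example self-contained.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url, delay=0.1):
    """Hypothetical stand-in for an HTTP request: sleep to mimic network latency."""
    time.sleep(delay)  # the thread is idle here, as it would be while waiting on a socket
    return f"<html>content of {url}</html>"


def crawl_sequential(urls):
    """Fetch pages one at a time, as a single-threaded crawler would."""
    return [fetch(url) for url in urls]


def crawl_threaded(urls, workers=10):
    """Fetch up to `workers` pages at once using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input URLs
        return list(pool.map(fetch, urls))


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Placeholder URLs; nothing is actually downloaded in this sketch.
urls = [f"https://example.com/page/{i}" for i in range(10)]
pages_seq, t_seq = timed(crawl_sequential, urls)
pages_thr, t_thr = timed(crawl_threaded, urls)
print(f"sequential: {t_seq:.2f}s, threaded: {t_thr:.2f}s")
```

In a real crawler, `fetch` would be replaced by an actual HTTP call, and the speedup would depend on the network and the target server rather than on a fixed delay.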

One thing to emphasize here is that Python (more precisely, the standard CPython interpreter) is designed with a global interpreter lock, the GIL. This lock makes Python multithreading pseudo-multithreading. In essence only one thread executes at any moment: it works on one task for a few milliseconds, saves that task's state, works on another task for a few milliseconds, and after a full round returns to the first task, restores its state, runs it for a few more milliseconds, and so on.
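The time slicing described above can be observed with a small sketch, assuming the standard CPython interpreter. `sys.getswitchinterval()` reports how long a thread may run before the interpreter asks it to give up the GIL, and two busy-looping threads (the `spin` helper is a hypothetical name used only for this illustration) both make progress over the same 0.2 second window even though they take turns.

```python
import sys
import threading
import time


def spin(counts, name, duration=0.2):
    """Busy-loop for `duration` seconds and record how many iterations ran."""
    deadline = time.perf_counter() + duration
    n = 0
    while time.perf_counter() < deadline:
        n += 1
    counts[name] = n


# Interval, in seconds, after which CPython requests a switch to another thread.
interval = sys.getswitchinterval()
print(f"switch interval: {interval * 1000:.1f} ms")

counts = {}
threads = [threading.Thread(target=spin, args=(counts, name)) for name in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both threads advanced during the same window, because they kept swapping turns.
print(counts)
```

The exact iteration counts vary from run to run and from machine to machine, so none are shown here.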

In X-Men: Apocalypse, Magneto's son rescues many people from the professor's school. He saves them one at a time, but because he is so fast, to everyone else it looks as if they were all moved outside the school at the same moment.

What is a single thread at the microscopic level looks, at the macroscopic level, like several things being done at the same time. This mechanism has little impact on IO-intensive operations, but for CPU-intensive operations, where the CPU is busy the whole time, it can hurt performance significantly. So computationally intensive programs need multiple processes, and Python's multiprocessing is not affected by the GIL.
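To make the contrast concrete, here is a minimal sketch that runs the same CPU-bound job through a thread pool and through a process pool. The `count_primes` workload, the job size of 50,000, and the 4 workers are assumptions picked for illustration and do not come from the original article. On a standard GIL-enabled CPython build with several cores, the process pool would typically finish sooner, but the actual timings depend on the machine.

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_primes(limit):
    """CPU-bound work: count the primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def run(executor_cls, jobs, workers=4):
    """Run count_primes over `jobs` with the given executor; return (results, seconds)."""
    start = time.perf_counter()
    with executor_cls(max_workers=workers) as pool:
        results = list(pool.map(count_primes, jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # The guard is needed so that worker processes can import this module safely.
    jobs = [50_000] * 4
    thread_results, thread_time = run(ThreadPoolExecutor, jobs)
    process_results, process_time = run(ProcessPoolExecutor, jobs)
    print(f"threads:   {thread_time:.2f}s")
    print(f"processes: {process_time:.2f}s")
```

Each worker process has its own interpreter and its own GIL, which is why the processes can compute in parallel while the threads have to take turns.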

A crawler is an IO-intensive program, so the GIL has no significant impact on the performance of a multithreaded crawler.

Reprinted, with minor changes, from the words of a teacher at the Geek College.
