How to optimize the speed of a Python crawler?

Source: Internet
Author: User
I'm currently writing a Python crawler. Single-threaded urllib feels too slow to meet my data volume requirement (on the order of 100,000 pages). What are some ways to improve crawl efficiency?

Replies:

Consider multiple processes plus a distributed cluster spread across several data centers.

The reasoning is as follows:
With a single process, the bottleneck is mostly the CPU.

Multiple processes make efficient use of the CPU, but in practice most of the time is spent on the network, so the better solution is to run a multi-process crawler on several machines in several data centers at the same time, which reduces network congestion.

For the implementation, use Scrapy plus an RQ-style queue, with Redis as the queue backend (a sketch of the idea follows below).

I crawled tens of millions of Douban pages this way.
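Not the original answerer's exact setup, but a minimal sketch of the idea, assuming the redis-py package, Python 3, and a reachable Redis server (the hostname and key names are placeholders): every worker process, on any machine, pops URLs from one shared Redis list and records what it has already fetched.

    import redis
    from urllib.request import urlopen

    r = redis.Redis(host="redis.example.com", port=6379)  # placeholder host

    def worker():
        while True:
            url = r.lpop("crawl:queue")           # next URL, shared by all workers
            if url is None:
                break                             # queue drained
            url = url.decode()
            if r.sadd("crawl:seen", url) == 0:    # another worker already took it
                continue
            try:
                body = urlopen(url, timeout=10).read()
                r.hset("crawl:pages", url, body)  # store the raw page
            except Exception:
                r.rpush("crawl:queue", url)       # push back for a later retry

    if __name__ == "__main__":
        worker()                                  # run one copy per process per machine

Seed the queue once with r.rpush("crawl:queue", *start_urls), then start as many worker processes on as many machines as the network allows.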

Please see my answer to another question, "How do Python crawlers get started?":
1. Turn on gzip (a sketch follows this list)
2. Use multiple threads
3. For targeted crawling, you can replace XPath with regular expressions
4. Replace urllib with PycURL
5. Move to a higher-bandwidth environment

Thanks for the invite.
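For tip 1, a minimal sketch using only the standard library (Python 3 assumed; the URL is a placeholder): ask the server for a gzip-compressed response and decompress it locally, which cuts the bytes transferred considerably.

    import gzip
    from urllib.request import Request, urlopen

    req = Request("http://example.com/page/1",
                  headers={"Accept-Encoding": "gzip"})   # ask for compression
    with urlopen(req, timeout=10) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)                 # decode locally
    print(len(body))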
The main reason a crawler downloads slowly is blocking: it waits while sending the request to the site and waits again for the site's response.
The workaround is a non-blocking model such as epoll:
register the created socket handles and their callback functions with the operating system, so that a large number of page requests can be issued concurrently from a single process and a single thread.
If that feels troublesome to write by hand, use a ready-made library: Tornado's asynchronous HTTP client.
http://www.tornadoweb.org/documentation/httpclient.html
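A minimal sketch of that approach with Tornado's AsyncHTTPClient (the modern coroutine API is assumed; the URLs are placeholders): all requests are started at once and the event loop multiplexes the sockets.

    from tornado import gen
    from tornado.httpclient import AsyncHTTPClient
    from tornado.ioloop import IOLoop

    AsyncHTTPClient.configure(None, max_clients=50)   # up to 50 requests in flight

    async def fetch_all():
        client = AsyncHTTPClient()
        urls = ["http://example.com/page/%d" % i for i in range(50)]
        responses = await gen.multi(
            [client.fetch(u, raise_error=False) for u in urls]
        )
        for url, resp in zip(urls, responses):
            print(url, resp.code, len(resp.body or b""))

    IOLoop.current().run_sync(fetch_all)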
If you cannot open the Tornado site, add a hosts entry or go through a proxy.
Hosts entry:
74.125.129.121 www.tornadoweb.org

For Python, the best approach is to split the task and use multiple processes.

You can try the open-source crawler framework Scrapy directly. It handles concurrent downloading natively and lets you set the crawl rate, the number of concurrent requests, and other parameters; Scrapy also has good support for extracting content from the crawled HTML. A Chinese introductory tutorial has been published as well; you can find it with a quick search.

Gevent, eventlet, pycurl.
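To make the Scrapy suggestion above concrete, here is a sketch of the settings that control crawl rate and concurrency; the values are illustrative, not recommendations.

    # settings.py of a Scrapy project
    CONCURRENT_REQUESTS = 32             # total requests in flight
    CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-site politeness limit
    DOWNLOAD_DELAY = 0.25                # seconds between requests to one site
    AUTOTHROTTLE_ENABLED = True          # adapt the rate to server responsiveness
    COMPRESSION_ENABLED = True           # accept gzip/deflate responses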

from multiprocessing.dummy import Pool
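Expanded into a runnable sketch (multiprocessing.dummy gives a thread pool behind the Pool API; Python 3 and placeholder URLs assumed):

    from multiprocessing.dummy import Pool      # threads, not processes
    from urllib.request import urlopen

    def fetch(url):
        try:
            with urlopen(url, timeout=10) as resp:
                return url, len(resp.read())
        except Exception as exc:
            return url, exc

    urls = ["http://example.com/page/%d" % i for i in range(100)]
    with Pool(20) as pool:                       # 20 worker threads
        for url, result in pool.imap_unordered(fetch, urls):
            print(url, result)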

Running gevent on OpenShift, crawling 1024 takes only minutes...
So why did I only open 20? (serious face)
Oh, right: 1024 will ban your IP for a short time, but crawling with the same cookie is fine.

1. DNS cache (a sketch follows this reply)
2. Multiple threads
3. Asynchronous I/O, hand-written with something like asyncore, or check whether Twisted offers a non-blocking asynchronous HTTP client framework.
Using the multiprocessing package plus urllib as the HTTP client gives rather unsatisfactory speed; threads should do better, but my intuition is that the improvement is limited.
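For point 1, a minimal sketch of a process-local DNS cache, assuming that monkey-patching socket.getaddrinfo is acceptable for a crawler:

    import socket

    _orig_getaddrinfo = socket.getaddrinfo
    _dns_cache = {}

    def _cached_getaddrinfo(host, port, *args, **kwargs):
        # Resolve each (host, port) combination once and reuse the answer.
        key = (host, port) + args
        if key not in _dns_cache:
            _dns_cache[key] = _orig_getaddrinfo(host, port, *args, **kwargs)
        return _dns_cache[key]

    socket.getaddrinfo = _cached_getaddrinfo    # later lookups hit the cache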
----
Recommended: gevent + grequests.
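A minimal sketch with grequests (requests running on gevent greenlets; the package must be installed and the URLs are placeholders):

    import grequests

    urls = ["http://example.com/page/%d" % i for i in range(100)]
    reqs = (grequests.get(u, timeout=10) for u in urls)
    # map() sends the requests concurrently, at most 20 greenlets at a time.
    for resp in grequests.map(reqs, size=20):
        if resp is not None:                 # None means the request failed
            print(resp.status_code, resp.url)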
