Use Python to crawl useful data from GitHub and keep a copy for yourself

Source: Internet
Author: User

The code I wrote following this process is available at: liuroy/github_spider

Recursive implementation

Run results

Because every request has high latency, the crawler runs slowly. After a few thousand requests it had collected some data. Here are the Python projects in descending order of views:

And here are the users in descending order of follower count:

Operational defects

A programmer with ambitions cannot be satisfied with such a modest result, so I summed up the defects of the recursive implementation:

    1. Because the traversal is depth-first, when the whole user graph is large, recursion on a single machine can exhaust memory and crash the program, so it can only run on one machine for a short time.

    2. Each individual request takes too long, so the data downloads too slowly.

    3. There is no retry mechanism for links that fail to load for a period of time, so data may be lost.
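To make the first defect concrete, here is a minimal sketch of a depth-first recursive crawl. It is not the code from the repository. The `get_neighbors` callable and the `chain_graph` helper are hypothetical stand-ins for the GitHub API, where each user links only to the next one. In CPython the failure shows up as a `RecursionError` once the chain of users is deeper than the interpreter's recursion limit.

```python
import sys


def crawl_recursive(user, get_neighbors, seen):
    """Depth-first crawl: visit `user`, then recurse into each unseen neighbor."""
    seen.add(user)
    for neighbor in get_neighbors(user):
        if neighbor not in seen:
            crawl_recursive(neighbor, get_neighbors, seen)
    return seen


def chain_graph(length):
    """Hypothetical stand-in for the API: user i links only to user i + 1."""
    def get_neighbors(user):
        return [user + 1] if user + 1 < length else []
    return get_neighbors


# A short chain is crawled without trouble.
small = crawl_recursive(0, chain_graph(50), set())

# A chain deeper than the recursion limit aborts the whole crawl.
try:
    crawl_recursive(0, chain_graph(sys.getrecursionlimit() * 2), set())
    overflowed = False
except RecursionError:
    overflowed = True
```

Raising the recursion limit only postpones the problem, which is one reason to move the pending work out of the call stack and into an explicit queue.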

Asynchronous optimizations

Queue implementation

Implementation principle

With breadth-first traversal, the URLs waiting to be visited can be kept in a queue. Applying the producer-consumer pattern on top of that queue makes it easy to run many requests concurrently, which solves problem 2 above. If a request keeps failing for a period of time, simply leaving it in the queue fully resolves problem 3. This approach also supports resuming after an interruption. The program flowchart is as follows:
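The idea can be sketched in a single process with the standard library. This is an illustration under stated assumptions, not the author's implementation. `crawl_bfs`, the `fetch` callable, `workers`, and `max_retries` are all hypothetical names. `fetch(url)` is assumed to return the URLs discovered at `url` and to raise an exception when the request fails. With several workers the visiting order is only approximately breadth-first.

```python
import queue
import threading


def crawl_bfs(start, fetch, workers=4, max_retries=3):
    """Crawl from `start` using a shared FIFO queue and several consumer threads.

    Returns (seen, failed): every URL discovered, and the URLs given up on.
    """
    todo = queue.Queue()
    seen = {start}
    failed = []
    lock = threading.Lock()
    todo.put((start, 0))

    def worker():
        while True:
            item = todo.get()
            if item is None:  # sentinel: no more work, shut down
                todo.task_done()
                return
            url, attempts = item
            try:
                links = fetch(url)
            except Exception:
                with lock:
                    if attempts + 1 < max_retries:
                        # Failed request: put it back so it is tried again.
                        todo.put((url, attempts + 1))
                    else:
                        failed.append(url)
            else:
                with lock:
                    for link in links:
                        if link not in seen:
                            seen.add(link)
                            todo.put((link, 0))
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    todo.join()          # wait until every queued URL has been processed
    for _ in threads:
        todo.put(None)   # one sentinel per worker
    for t in threads:
        t.join()
    return seen, failed
```

Each worker is both a consumer, because it takes a URL from the queue, and a producer, because it adds the links it finds. An in-memory queue like this one is lost when the process stops, so resuming after an interruption needs a queue that lives outside the process.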

Run the program

To support deployment across multiple machines (although I only have one), the message queue is RabbitMQ. You need to create an exchange named GitHub of type direct, and then create four queues named user, repo, follower, and following. The detailed binding relationships are shown in:
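The same setup can also be declared from code. The sketch below is one hedged way to do it. `declare_topology` is a hypothetical helper. The exchange name is taken from the text as written, so its casing is a guess. Because the binding figure is not reproduced here, binding each queue with a routing key equal to its own name is an assumption. The `channel` argument is expected to behave like a channel from the `pika` client, whose `exchange_declare`, `queue_declare`, and `queue_bind` methods are the only calls used.

```python
EXCHANGE = "GitHub"  # name as given in the text; exact casing is an assumption
QUEUES = ("user", "repo", "follower", "following")


def declare_topology(channel, exchange=EXCHANGE, queues=QUEUES):
    """Declare a direct exchange and bind one queue per routing key.

    Assumption: each queue is bound with a routing key equal to its name.
    Returns the (exchange, queue, routing_key) bindings that were requested.
    """
    channel.exchange_declare(exchange=exchange, exchange_type="direct")
    bindings = []
    for name in queues:
        channel.queue_declare(queue=name)
        channel.queue_bind(queue=name, exchange=exchange, routing_key=name)
        bindings.append((exchange, name, name))
    return bindings


# Usage with a running RabbitMQ server and the pika package installed:
#
#   import pika
#   connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
#   declare_topology(connection.channel())
#   connection.close()
```

Declaring exchanges and queues this way is idempotent as long as the parameters match, so the helper can run at the start of every crawler process.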
