Use Python to crawl useful data from GitHub (worth saving for later)

Last Update: 2018-06-30 Source: Internet

Author: User


The code I wrote following this process is available at: liuroy/github_spider

Recursive implementation

Run results

Because each request has high latency, the crawler runs slowly. After a few thousand requests it had collected some data. Here are the Python projects in descending order of views:

Here is the list of users in descending order of follower count:

Operational defects

A programmer with ambition cannot be satisfied with a small success, so here is a summary of the defects in the recursive implementation:

1. Because the traversal is depth-first, recursion on a single machine can exhaust memory and crash the program when the user graph is large, so the crawler can only run for a short time on one machine.

2. Each request takes too long, so data downloads too slowly.

3. There is no retry mechanism for links that are temporarily inaccessible, so data can be lost.
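To make the first defect concrete, here is a minimal sketch of a depth-first recursive crawl. It is not the original implementation: the function name `crawl_recursive` and the in-memory `chain` graph are hypothetical stand-ins for real GitHub requests. It shows that the call stack grows with the length of the path being followed.

```python
def crawl_recursive(user, graph, seen, depth=1):
    """Visit `user`, then recurse into each unseen neighbour.

    Returns the deepest call depth reached, to show how the stack grows.
    `graph` is a hypothetical dict mapping a user to the users linked from it.
    """
    seen.add(user)
    deepest = depth
    for neighbour in graph.get(user, []):
        if neighbour not in seen:
            deepest = max(deepest,
                          crawl_recursive(neighbour, graph, seen, depth + 1))
    return deepest


# Hypothetical worst case: every user links to exactly one new user.
chain = {f"user{i}": [f"user{i + 1}"] for i in range(300)}
seen = set()
max_depth = crawl_recursive("user0", chain, seen)
print(len(seen), max_depth)
```

In this sketch every newly discovered user adds one stack frame, so a long enough chain would hit Python's recursion limit (1000 frames by default) well before the graph is exhausted.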

Asynchronous optimizations

Queue implementation

Implementation principle

Switching to breadth-first traversal lets us put the URLs to be visited in a queue. Applying the producer-consumer pattern to that queue makes concurrent fetching easy, which solves problem 2 above. If a request keeps failing for a while, leaving its data in the queue solves problem 3 completely. This approach also lets the crawl resume after an interruption. The program flowchart is as follows:
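The queue-based idea can be sketched in a single process with the standard library. This is an illustration under assumptions, not the project's code: the `crawl` function, the `fetch` callback, and the `max_retries` limit are hypothetical, and an in-memory `queue.Queue` stands in for the real message queue.

```python
import queue
import threading


def crawl(seed, fetch, num_workers=4, max_retries=3):
    """Breadth-first crawl with worker threads.

    `fetch(url)` is a caller-supplied function (hypothetical here) that
    returns the URLs linked from `url` or raises on failure.
    A failed URL is put back on the queue, up to `max_retries` attempts.
    """
    pending = queue.Queue()
    seen = {seed}
    results = {}
    lock = threading.Lock()
    pending.put((seed, 0))

    def worker():
        while True:
            item = pending.get()
            if item is None:          # sentinel: no more work
                pending.task_done()
                break
            url, attempts = item
            try:
                links = fetch(url)
            except Exception:
                # Retry by re-queuing instead of dropping the URL.
                if attempts + 1 < max_retries:
                    pending.put((url, attempts + 1))
                pending.task_done()
                continue
            new_links = []
            with lock:
                results[url] = links
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        new_links.append(link)
            for link in new_links:
                pending.put((link, 0))
            pending.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    pending.join()                    # wait until every queued URL is handled
    for _ in threads:
        pending.put(None)
    for t in threads:
        t.join()
    return results
```

Because a failed URL is re-queued before its task is marked done, the crawl does not finish while retries are still outstanding. A persistent queue would additionally survive a restart, which the in-memory queue in this sketch does not.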

Run the program

To support deployment across multiple machines (although I have only one), the message queue is RabbitMQ. You need to create an exchange named github of type direct, and then create four queues named user, repo, follower, and following. The detailed binding relationships are shown in:
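The exchange and queue setup described above could be declared with the pika client roughly as follows. This is a sketch under assumptions: the binding diagram is not reproduced in the text, so binding each queue with a routing key equal to its own name is a guess, as are the lowercase spelling of the names, the `durable=True` setting, and the helper names `declare_topology` and `connect_and_declare`.

```python
EXCHANGE = "github"          # exchange name from the article (lowercase assumed)
QUEUES = ["user", "repo", "follower", "following"]


def declare_topology(channel):
    """Declare the direct exchange and the four queues on a pika channel.

    Assumption: each queue is bound with a routing key equal to its name;
    the article's actual binding diagram is not available.
    """
    channel.exchange_declare(exchange=EXCHANGE,
                             exchange_type="direct",
                             durable=True)
    for name in QUEUES:
        channel.queue_declare(queue=name, durable=True)
        channel.queue_bind(queue=name, exchange=EXCHANGE, routing_key=name)


def connect_and_declare(host="localhost"):
    """Open a connection to a RabbitMQ broker (host is an assumption)."""
    import pika  # third-party client, imported lazily so the sketch loads without it

    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    declare_topology(channel)
    return connection, channel
```

Calling `connect_and_declare()` against a running broker would create the exchange and queues if they do not exist yet. The same setup can also be done by hand in the RabbitMQ management UI.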


This article is an English version of an article originally published in Chinese on aliyun.com.
