Crawler Flow
Last week I finished a Scrapy crawler for Zhihu user information, and my GitHub star count ranked at the bottom of the team. I boasted to my boss that if I wrote another crawler I would be too embarrassed to even mention stars, for fear of making him sad. The boss replied disdainfully: then write a crawler that crawls GitHub and find some Python experts, since the company happens to be hiring anyway. I accepted the task, particularly excited, and spent the same day studying the GitHub site, how to parse its pages, and what crawling strategy to use. To my pleasant surprise, GitHub offers very nice APIs along with excellent documentation, which made my love for GitHub go deep into my bones.
Enough rambling; on to the real problem. I need to download GitHub users and their repository data. The way to expand is simple: follow each user's following and follower relationships and traverse the whole user network, which is enough to download everyone's data. I heard that GitHub only has a few million registered users, so I got a little excited about pulling all the data down. Here is the flowchart:
This is the code I implemented following this flow, URL: https://github.com/LiuRoy/github_spider
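To make the expansion step concrete, here is a rough sketch of how a user's network can be grown through the GitHub API. This is not the repository's actual code: the token placeholder is an assumption and pagination is ignored.

```python
# Minimal sketch of expanding the user graph via followers/following and
# fetching repositories; <token> is a placeholder, pagination is ignored.
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": "token <token>"}  # assumed personal access token

def neighbours(login):
    """Return the logins of a user's followers and following (first page each)."""
    users = []
    for rel in ("followers", "following"):
        resp = requests.get("{}/users/{}/{}".format(API, login, rel),
                            headers=HEADERS, timeout=10)
        resp.raise_for_status()
        users.extend(item["login"] for item in resp.json())
    return users

def repos(login):
    """Return the user's repositories (first page only)."""
    resp = requests.get("{}/users/{}/repos".format(API, login),
                        headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()
```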
Recursive implementation
Seeing such a simple flow, my first thought was to write a straightforward recursive implementation and optimize later if the performance turned out poor, so the first version of the code (under the recursion directory) was finished quickly. Data is stored in MongoDB, Redis is used to deduplicate requests, and writes to MongoDB go through asynchronous Celery calls, so the RabbitMQ service must be running. After configuring settings.py, start it with the following steps; a minimal sketch of these storage pieces appears after the steps.
Run command
- Enter the github_spider directory
- Execute the following command to start the asynchronous task:
celery -A github_spider.worker worker --loglevel=info
- Execute the following command to start the crawler:
python github_spider/recursion/main.py
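For reference, here is a hedged sketch of how the pieces described above (a Celery task writing to MongoDB, a Redis set for deduplication) might fit together. The app name, database name, and Redis key are assumptions, not the project's actual code.

```python
# Sketch of the storage side: a Celery task that writes to MongoDB and a Redis
# set used to deduplicate requests. Names and connection strings are assumed.
import pymongo
import redis
from celery import Celery

app = Celery("github_spider.worker", broker="amqp://guest:guest@localhost:5672//")
mongo = pymongo.MongoClient("localhost", 27017)
rds = redis.StrictRedis(host="localhost", port=6379, db=0)

SEEN_KEY = "github_spider:seen_urls"  # hypothetical dedup key

def already_seen(url):
    """Return True if the URL was requested before; otherwise mark it as seen."""
    return rds.sadd(SEEN_KEY, url) == 0

@app.task
def save_user(user_doc):
    """Asynchronously upsert a user document into MongoDB."""
    mongo["github"]["users"].update_one(
        {"login": user_doc["login"]}, {"$set": user_doc}, upsert=True)
```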
Run results
Because each request has high latency, the crawler runs slowly; after a few thousand requests it had only fetched part of the data. Here are the Python projects in descending order of views:
And here is the list of users in descending order of follower count:
Operational defects
As a programmer with some ambition, I naturally could not be satisfied with such a small result, so I summed up several defects of the recursive implementation:
- Because the traversal is depth-first, when the whole user graph is large, single-machine recursion can overflow memory and crash the program, so it can only run on one machine for a short time.
- Each request takes too long, so the data downloads too slowly.
- There is no retry mechanism for requests that fail, so there is a possibility of data loss.
Asynchronous optimizations
For this kind of I/O-bound problem there are several workarounds: more concurrency, asynchronous requests, or both. For defect 2 above, my first solution was to request the API asynchronously. Since the calling layer of the original code was already well separated, the change was quick to make, using grequests. This library is by the same author as requests and its code is very simple: it wraps requests with gevent so that data can be requested in a non-blocking way.
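A minimal sketch of non-blocking requests with grequests, not the project's actual code; the user list and token are hypothetical.

```python
# Build request objects first, then let grequests/gevent send them concurrently.
import grequests

urls = [
    "https://api.github.com/users/torvalds",
    "https://api.github.com/users/gvanrossum",
]
headers = {"Authorization": "token <your-token>"}  # assumed personal access token

reqs = (grequests.get(u, headers=headers, timeout=10) for u in urls)
responses = grequests.map(reqs, size=10)  # size limits the concurrency

for resp in responses:
    if resp is not None and resp.status_code == 200:
        data = resp.json()
        print(data.get("login"), data.get("followers"))
```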
But when I ran it, I found the program finished almost immediately; a quick check showed that my public IP had been banned by GitHub. Ten thousand grass-mud horses galloped through my heart, and I had no choice but to bring out the crawler's ultimate weapon: proxies. I wrote a separate auxiliary script (proxy/extract.py) to scrape free HTTPS proxies from the web and store them in Redis. Every request goes through a proxy; on error it retries with a replacement proxy and removes the bad one. Unfortunately, free HTTPS proxies are rare online and many of them do not work, so with all the error retries the crawl was not faster but actually slower than before. Blocked on this path, I could only move to a more concurrent implementation.
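Below is a rough sketch, not the repository's proxy/extract.py logic, of how per-request proxy rotation backed by Redis might look; the Redis key name and helper function are assumptions.

```python
# Pick a random proxy from a Redis set, retry on failure, and drop bad proxies.
import random
import redis
import requests

r = redis.StrictRedis(host="localhost", port=6379, db=0)
PROXY_KEY = "github_spider:proxies"  # hypothetical set of "ip:port" strings

def fetch_with_proxy(url, retries=3):
    """Try a request through a random proxy, discarding proxies that fail."""
    for _ in range(retries):
        candidates = list(r.smembers(PROXY_KEY))
        if not candidates:
            return None  # no proxies left in the pool
        proxy = random.choice(candidates).decode()
        try:
            resp = requests.get(url, proxies={"https": "https://" + proxy}, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.RequestException:
            r.srem(PROXY_KEY, proxy)  # drop the bad proxy and retry with another
    return None
```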
Queue implementation
Implementation principle
Switching to breadth-first traversal, the URLs to be visited can be put into a queue, and the producer-consumer pattern then makes it easy to run many workers concurrently, which solves defect 2 above. If a request fails, the URL simply stays in the queue, which completely resolves defect 3. Beyond that, this approach also supports resuming the crawl after an interruption. The program flowchart is as follows:
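To illustrate the idea, here is a minimal producer-consumer sketch using only the standard library (the real project uses RabbitMQ); fetch_user() and the seed user are hypothetical, and neighbour expansion is left out.

```python
# Breadth-first crawl via a shared queue: workers consume URLs, and failed
# requests are put back into the queue so they are retried later.
import queue
import threading
import requests

url_queue = queue.Queue()

def fetch_user(login):
    resp = requests.get("https://api.github.com/users/" + login, timeout=10)
    resp.raise_for_status()
    return resp.json()

def worker():
    while True:
        login = url_queue.get()
        try:
            data = fetch_user(login)
            # In a full crawler, followers/following would be enqueued here.
            print(login, data.get("followers"))
        except requests.RequestException:
            url_queue.put(login)  # on failure, requeue for a later retry
        finally:
            url_queue.task_done()

url_queue.put("torvalds")  # hypothetical seed user
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()
url_queue.join()
```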
Run the program
To support deployment across multiple machines (although I only have one), RabbitMQ is used as the message queue. You need to create an exchange named github of type direct, and then create four queues named user, repo, follower, and following. The detailed binding relationships are shown in the figure:
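A hedged sketch of setting up this exchange and these queues with pika; the article only states the names and types, so the routing keys (assumed to match the queue names) and durability settings are my assumptions.

```python
# Declare the direct exchange and the four queues, binding each queue to the
# exchange with a routing key equal to its name (assumption).
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="github", exchange_type="direct", durable=True)

for name in ("user", "repo", "follower", "following"):
    channel.queue_declare(queue=name, durable=True)
    channel.queue_bind(exchange="github", queue=name, routing_key=name)

connection.close()
```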
The detailed startup steps are as follows:
- Enter the github_spider directory
- Execute the following command to start the asynchronous task:
celery -A github_spider.worker worker --loglevel=info
- Execute the following command to update the proxies:
python github_spider/proxy/extract.py
- Execute the following command to start the crawler:
python github_spider/queue/main.py
Queue Status Graph: