Short video has boomed over the last two years, and every video site has its own distinctive short-video content. Wouldn't it be great to have a program that downloads the latest videos released by popular users on the major video sites? That would make the videos convenient to watch, and those without copyright restrictions could also be reposted to your personal social media accounts to build a following.
Parker is such a project (project address: https://github.com/LiuRoy/parker). It uses the celery framework to crawl users' video lists periodically and downloads newly released videos asynchronously through you-get, which also makes distributed deployment easy. Because the sites change their page layouts and interfaces frequently, StatsD monitoring was added deliberately so that errors are noticed promptly and the program stays available.
Code architecture
Parker currently supports downloads only from Bilibili and Miaopai. As the architecture diagram shows, each site type needs two asynchronous interfaces: one that parses the playback addresses of released videos from a user's video home page, and one that downloads a video given its playback address. Adding a new site type therefore requires no changes to existing code, only a new parsing interface and a new download interface. The follow-up step that runs after a download completes is not implemented; you are free to implement it according to your own needs.
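The two-interfaces-per-site idea can be sketched as a small registry. This is only an illustration, not Parker's actual code: the names SiteHandlers, SITE_HANDLERS, register_site, demo_parse and demo_download are all hypothetical, and in the real project both operations would be celery tasks rather than plain functions.

```python
from collections import namedtuple

# parse:    user home page URL -> list of playback URLs
# download: (playback URL, target directory) -> path of the saved file
SiteHandlers = namedtuple("SiteHandlers", ["parse", "download"])

# Hypothetical registry keyed by site type.
SITE_HANDLERS = {}


def register_site(site_type, parse, download):
    """Add a new site type without touching handlers registered earlier."""
    if site_type in SITE_HANDLERS:
        raise ValueError("site type already registered: {}".format(site_type))
    SITE_HANDLERS[site_type] = SiteHandlers(parse, download)


def demo_parse(home_url):
    # Placeholder: a real parser would fetch home_url and extract video links.
    return ["{}/video/{}".format(home_url, n) for n in (1, 2)]


def demo_download(play_url, target_dir):
    # Placeholder: a real implementation would hand play_url to you-get.
    return "{}/{}.mp4".format(target_dir, play_url.rsplit("/", 1)[-1])


register_site("demo", demo_parse, demo_download)
```

Supporting another site then amounts to one more register_site call with that site's own pair of functions.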
At run time, celery beat periodically sends the configured list of high-quality users to the parsing interface of the corresponding site, which runs asynchronously. The parsing task filters out the playback addresses of newly published videos and passes them to the corresponding download interface, also asynchronously; when a download finishes, the follow-up operation is invoked asynchronously as well. You therefore need to start one celery beat process to send the timed tasks and several celery worker processes to perform the parsing and downloading. Large videos take a long time to download, so size the number of workers according to the length of the task list.
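The parse, filter, download, follow-up flow can be sketched as below. The stages run synchronously here to keep the example short, whereas Parker dispatches each one as an asynchronous celery task. The function names and the seen_urls set (standing in for whatever the database records about earlier downloads) are assumptions made for illustration.

```python
def select_new_videos(play_urls, seen_urls):
    """Return URLs that were not downloaded before, preserving page order."""
    fresh = []
    for url in play_urls:
        if url not in seen_urls and url not in fresh:
            fresh.append(url)
    return fresh


def run_pipeline(home_url, parse, download, after_download, seen_urls):
    """Parse -> filter -> download -> follow-up for one user's home page."""
    results = []
    for url in select_new_videos(parse(home_url), seen_urls):
        path = download(url)
        seen_urls.add(url)  # remember it so the next check skips this video
        results.append(after_download(path))
    return results
```

Filtering before dispatching keeps repeated checks cheap: a user whose list has not changed produces no download work at all.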
Running the program
The program has been verified to work under Ubuntu and Mac. It has not been verified on Windows, because celery could not be started properly on my local Windows machine.
Installing dependencies
The project uses Python 3.5. After entering the project directory, run:
pip install -r requirements.txt
Creating the database tables
Create the two tables in the database in advance (SQL: https://github.com/LiuRoy/parker/blob/master/spider/models/tables.sql).
Parameter configuration
The config directory contains logging.yaml, params.yaml and sites.yaml, which hold the logging configuration, the run parameters and the popular-user configuration respectively.
Log configuration
In debug mode, logs are written directly to standard output; in release mode, they are written to a file, so the output log file must be configured.
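As a rough illustration of the two modes, the sketch below builds a dictionary for the standard library's logging.config.dictConfig. It is not the content of Parker's logging.yaml; the function name, the default file name parker.log, the format string and the INFO level are all assumptions.

```python
import logging
import logging.config


def build_logging_config(mode, log_file="parker.log"):
    """Return a dictConfig dictionary: stdout in debug, a file in release."""
    if mode not in ("debug", "release"):
        raise ValueError("mode must be 'debug' or 'release'")
    if mode == "debug":
        handler = {
            "class": "logging.StreamHandler",
            "stream": "ext://sys.stdout",
            "formatter": "plain",
        }
    else:
        handler = {
            "class": "logging.FileHandler",
            "filename": log_file,  # assumed default, configure as needed
            "formatter": "plain",
        }
    return {
        "version": 1,
        "disable_existing_loggers": False,
        "formatters": {
            "plain": {"format": "%(asctime)s %(levelname)s %(name)s %(message)s"}
        },
        "handlers": {"main": handler},
        "root": {"level": "INFO", "handlers": ["main"]},
    }
```

Calling logging.config.dictConfig(build_logging_config("debug")) would then route all log records to standard output.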
Run configuration
mode: debug or release. In debug mode, logs go to standard output and no monitoring data is sent; in release mode, logs go to the specified file and monitoring data is sent.
broker_url: passed to celery as its broker_url; it can point to Redis or RabbitMQ.
mysql_url: database address; the two tables must be created in advance.
download_path: directory in which downloaded videos are saved.
statsd_address: address of the StatsD monitoring server.
video_number_per_page: number of playback addresses taken from the user's video home page on each check. Most users publish only a few videos at a time, so a small value is enough, and it also keeps the first run from downloading a large number of old videos.
download_timeout: timeout for a video download.
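To make the parameters concrete, here is a hedged sketch of the settings as a Python dictionary with a small validation helper. Only the key names come from the list above. Every value in EXAMPLE_PARAMS is a made-up placeholder, and the formats (host:port for statsd_address, seconds for download_timeout) as well as the rule that all keys are required are assumptions, not something the project documents here.

```python
REQUIRED_KEYS = (
    "mode", "broker_url", "mysql_url", "download_path",
    "statsd_address", "video_number_per_page", "download_timeout",
)


def validate_params(params):
    """Check the run parameters and return them unchanged if they look sane."""
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise ValueError("missing parameters: {}".format(", ".join(missing)))
    if params["mode"] not in ("debug", "release"):
        raise ValueError("mode must be 'debug' or 'release'")
    if int(params["video_number_per_page"]) < 1:
        raise ValueError("video_number_per_page must be at least 1")
    if float(params["download_timeout"]) <= 0:
        raise ValueError("download_timeout must be positive")
    return params


# Placeholder values for illustration only.
EXAMPLE_PARAMS = {
    "mode": "debug",
    "broker_url": "redis://localhost:6379/0",
    "mysql_url": "mysql://user:password@localhost:3306/parker",
    "download_path": "/tmp/parker",
    "statsd_address": "localhost:8125",   # assumed host:port format
    "video_number_per_page": 5,
    "download_timeout": 1800,             # assumed to be in seconds
}
```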
Popular user configuration
Parker generates the celery beat schedule from this configuration.
name: follows the rule <site type>-<task id>; Parker uses it as the scheduler task name.
url: the user's video home page.
task: the celery asynchronous parsing task to run for this entry.
minute: how often, in minutes, to check the user's video list.
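One way such a schedule could be generated is sketched below. The output follows the shape of celery's beat_schedule setting (a task name, a schedule and positional args per entry), but the function name, the flat list-of-dicts input and the choice to pass the url as the task's only argument are assumptions; Parker's real code and the exact layout of sites.yaml may differ.

```python
from datetime import timedelta


def build_beat_schedule(entries):
    """Turn site entries (name, url, task, minute) into a beat schedule dict."""
    schedule = {}
    for entry in entries:
        name = entry["name"]  # expected to look like <site type>-<task id>
        if name in schedule:
            raise ValueError("duplicate task name: {}".format(name))
        schedule[name] = {
            "task": entry["task"],
            "schedule": timedelta(minutes=entry["minute"]),
            "args": (entry["url"],),  # assumption: the task takes the home page URL
        }
    return schedule
```

The resulting dictionary could be assigned to the celery app's beat_schedule configuration so that celery beat sends one parsing task per user at the configured interval.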
Starting the tasks
Enter the project directory and run the following command to start the celery worker:
celery -A spider worker
Run the following command to start the celery beat scheduler:
celery -A spider beat
Monitoring
I strongly recommend using a ready-made Docker image, which gets a monitoring environment up and running in about a minute. After that, just add metrics at the points where a task succeeds or raises an exception, and you can easily monitor whether the program is working properly.
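Counting successes and exceptions can be sketched with a small decorator. The example below sends counters in the plain StatsD text format (name:1|c) over UDP using only the standard library; it does not use Parker's own monitoring code, and the names send_counter and monitored, the metric suffixes and the default address 127.0.0.1:8125 are assumptions.

```python
import functools
import socket


def send_counter(name, address=("127.0.0.1", 8125)):
    """Increment a StatsD counter by one with a single UDP datagram."""
    payload = "{}:1|c".format(name).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, address)
    finally:
        sock.close()


def monitored(metric, send=send_counter):
    """Report <metric>.success or <metric>.exception for each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                send(metric + ".exception")
                raise  # the caller still sees the original error
            send(metric + ".success")
            return result
        return wrapper
    return decorator
```

Wrapping the parsing and download functions this way gives two counters per operation, so a site layout change shows up as a rise in the exception counter.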