Consider implementing the following scenario in PHP: a queue holds a list of URLs to crawl. A daemon reads the queue and hands each URL to a child process, which fetches the HTML and saves it to a file. To improve throughput, several tasks may run in parallel, but to avoid overloading the machine the number of concurrent tasks is capped (we set the cap to 3 for ease of testing). The program ends when it reads the end marker from the queue.
This scenario is very simple to implement with QPM's Supervisor::taskFactoryMode().
QPM stands for Quick Process Management Module for PHP. PHP is best known as a web development language, so many people forget that it can also be used to write robust command-line (CLI) programs and even daemons. Writing a daemon inevitably means dealing with process management, and QPM is a class library developed specifically to simplify that work. QPM's project address is: https://github.com/comos/qpm
To keep the test environment simple, we use a text file to simulate the queue's data. The complete example file is spider_task_factory_data.txt:
http://news.sina.com.cn/
http://news.ifeng.com/
http://news.163.com/
http://news.sohu.com/
http://ent.sina.com.cn/
http://ent.com/
...
END
Before using QPM's taskFactoryMode, we need to prepare a TaskFactory class, which we name SpiderTaskFactory. Its factory method, fetchTask, normally returns an instance of a class that implements Runnable. When it encounters END or reaches the end of the file, it throws StopSignal and the program terminates.
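The contract described above (return a task, or throw to signal the end) can be tried out without QPM or child processes. The following is only a minimal sketch: DemoStopSignal, DemoRunnable, EchoTask, ArrayTaskFactory and drain() are hypothetical stand-ins invented for illustration, not QPM classes, and the loop runs tasks sequentially rather than in parallel.

```php
<?php
// Stand-ins for illustration only; these are NOT QPM classes.
class DemoStopSignal extends Exception {}

interface DemoRunnable
{
    public function run();
}

class EchoTask implements DemoRunnable
{
    private $target;

    public function __construct($target)
    {
        $this->target = $target;
    }

    // A real task would crawl the URL; here we only describe the work.
    public function run()
    {
        return 'crawl ' . $this->target;
    }
}

class ArrayTaskFactory
{
    private $items;

    public function __construct(array $items)
    {
        $this->items = $items;
    }

    // Return the next task, or throw once the input runs out or END is seen.
    public function fetchTask()
    {
        while (true) {
            if (count($this->items) === 0) {
                throw new DemoStopSignal();
            }
            $line = trim(array_shift($this->items));
            if ($line === 'END') {
                throw new DemoStopSignal();
            }
            if ($line === '') {
                continue; // skip blank lines
            }
            return new EchoTask($line);
        }
    }
}

// Sequential driver: keeps asking for tasks until the factory signals stop.
function drain(callable $factoryMethod)
{
    $results = [];
    try {
        while (true) {
            $results[] = $factoryMethod()->run();
        }
    } catch (DemoStopSignal $e) {
        // Normal termination, nothing to report.
    }
    return $results;
}

$factory = new ArrayTaskFactory(['http://a.example/', '', 'http://b.example/', 'END', 'http://c.example/']);
print_r(drain([$factory, 'fetchTask']));
```

The entry after END is never reached, which mirrors how the end marker terminates the program in the scenario.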
Here is the code snippet that assembles the Supervisor and executes it. For a complete example, see: spider_task_factory.php
// If no input is specified as an argument, spider_task_factory_data.txt is used as the data source
$input = isset($argv[1]) ? $argv[1] : __DIR__ . '/spider_task_factory_data.txt';
$spiderTaskFactory = new SpiderTaskFactory($input);
$config = [
    // Specify the TaskFactory object and its factory method
    'factoryMethod' => [$spiderTaskFactory, 'fetchTask'],
    // Specify a maximum concurrency of 3
    'quantity' => 3,
];
// Start the Supervisor
qpm\supervisor\Supervisor::taskFactoryMode($config)->start();
The implementation of Spidertaskfactory is as follows:
/**
 * Task factory; it must implement the fetchTask method.
 * fetchTask normally returns an instance of a class that implements qpm\process\Runnable.
 */
class SpiderTaskFactory
{
    private $_input;
    private $_fh;

    public function __construct($input)
    {
        $this->_input = $input;
        $this->_fh = fopen($input, 'r');
        if ($this->_fh === false) {
            throw new Exception('fopen failed: ' . $input);
        }
    }

    public function fetchTask()
    {
        while (true) {
            // End of file: tell the Supervisor to stop
            if (feof($this->_fh)) {
                throw new qpm\supervisor\StopSignal();
            }
            $line = trim(fgets($this->_fh));
            // END marker: tell the Supervisor to stop
            if ($line == 'END') {
                throw new qpm\supervisor\StopSignal();
            }
            // Skip blank lines
            if (empty($line)) {
                continue;
            }
            break;
        }
        return new SpiderTask($line);
    }
}
The implementation of Spidertask is as follows:
/**
 * Class that performs the task in a child process.
 * It must implement the qpm\process\Runnable interface.
 */
class SpiderTask implements qpm\process\Runnable
{
    private $_target;

    public function __construct($target)
    {
        $this->_target = $target;
    }

    // This part is executed in the child process
    public function run()
    {
        $r = @file_get_contents($this->_target);
        if ($r === false) {
            throw new Exception('fail to crawl URL: ' . $this->_target);
        }
        file_put_contents($this->getLocalFilename(), $r);
    }

    private function getLocalFilename()
    {
        $filename = str_replace('/', '~', $this->_target);
        $filename = str_replace(':', '_', $filename);
        $filename = $filename . '-' . date('YmdHis');
        return __DIR__ . '/_spider/' . $filename . '.html';
    }
}
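To see what file names this scheme produces, the two replacements can be tried in isolation. This is a small sketch, not part of the original example: urlToFilename() is a hypothetical helper, and the timestamp is passed in as a parameter (rather than taken from date()) so that the result is predictable.

```php
<?php
/**
 * Map a URL to a flat file name using the same replacements as the task class:
 * '/' becomes '~' and ':' becomes '_'.
 * $timestamp is a parameter here only so the output is reproducible.
 */
function urlToFilename($url, $timestamp)
{
    $name = str_replace('/', '~', $url);
    $name = str_replace(':', '_', $name);
    return $name . '-' . $timestamp . '.html';
}

echo urlToFilename('http://news.sina.com.cn/', '20240101120000'), PHP_EOL;
// prints: http_~~news.sina.com.cn~-20240101120000.html
```

Because the timestamp has one-second resolution, two crawls of the same URL within the same second would write to the same file. Note also that the target directory (_spider in the example) has to exist before file_put_contents can write into it.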
In a real production environment, replacing the file input with a queue turns this into a long-running process that implements the producer/consumer model.
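One way to picture that replacement is a source that polls a queue instead of reading a file. The sketch below rests on several assumptions: DemoQueueStop and QueueBackedSource are hypothetical names, the queue client is reduced to a callable that returns the next message or null when nothing is waiting, and an in-memory SplQueue stands in for a real message queue. Unlike the file version, an empty queue does not end the program; only the END marker does.

```php
<?php
// Hypothetical stand-in for the library's stop signal; not a QPM class.
class DemoQueueStop extends Exception {}

class QueueBackedSource
{
    private $pop;
    private $idleMicros;

    /**
     * @param callable $pop        returns the next message, or null if none is waiting
     * @param int      $idleMicros pause between polls of an empty queue, in microseconds
     */
    public function __construct(callable $pop, $idleMicros = 100000)
    {
        $this->pop = $pop;
        $this->idleMicros = $idleMicros;
    }

    // Return the next URL; wait while the queue is empty; throw on END.
    public function nextUrl()
    {
        while (true) {
            $message = call_user_func($this->pop);
            if ($message === null) {
                usleep($this->idleMicros); // nothing to do yet, poll again later
                continue;
            }
            $message = trim($message);
            if ($message === 'END') {
                throw new DemoQueueStop();
            }
            if ($message !== '') {
                return $message;
            }
        }
    }
}

// In-memory stand-in for a real queue.
$queue = new SplQueue();
foreach (['http://a.example/', 'http://b.example/', 'END'] as $m) {
    $queue->enqueue($m);
}
$pop = function () use ($queue) {
    return $queue->isEmpty() ? null : $queue->dequeue();
};

$source = new QueueBackedSource($pop, 0);
$seen = [];
try {
    while (true) {
        $seen[] = $source->nextUrl();
    }
} catch (DemoQueueStop $e) {
    // END marker received.
}
print_r($seen);
```

In the crawler, a factory method would wrap each returned URL in a task object and throw the library's own stop signal instead of the stand-in used here.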
PHP uses QPM to implement multi-process parallel task handlers