Efficient multi-process concurrent crawling with PHP and curl

Source: Internet
Author: User

Demo Code



Running result (Figure 1)

Running result (Figure 2)

Main wrapper functions

multi_process();
Creates the number of child processes given by its parameter.
Highlight 1: If a child process exits abnormally (segmentation fault, allowed memory size exhausted, and so on), the parent process forks a new one to take its place, keeping the number of child processes constant. When a child process determines that the task is complete (for example, the tid has reached 10000), it can call exit(9); the parent process receives exit status 9, waits for all child processes to exit, and then exits itself.
Highlight 2: Together with the curl wrapper functions it collects statistics; when the program ends, it prints the main statistics (bottom of Figure 2).
mp_counter();
Handles communication between the parent process and all child processes. It coordinates the assignment of tasks to each child process and uses a lock mechanism. Pass 'init' to reset the count; any other value is the amount added to the count on each update.
curl_get();
Wraps the curl functions and adds extensive error handling; supports POST, GET, cookies, proxies, and downloads.
mp_msg();
One convention of this implementation is that each task outputs exactly one line of information when it finishes.
Highlight: This function detects the height and width of the terminal and prints a line of statistics once per screenful of output (the purple line in Figure 1), which makes it easy to follow the program's progress. It also limits the length of each output line so that a message never exceeds one line.
rand_exit();
PHP is known to leak memory in long-running processes, so each child process exits after performing a certain number of tasks, and multi_process() automatically creates a new child process to replace it (the green line in Figure 1).
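The lock-protected counter described for mp_counter() can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's implementation: it swaps the article's semaphore lock for flock(), and the function name demo_counter and the temporary file path are hypothetical.

```php
<?php
// Hypothetical helper, not part of the article's library: a file-backed
// counter that several processes can update safely. An exclusive flock()
// lock stands in for the semaphore lock used by mp_counter().
function demo_counter(string $path, $update = 1): int {
    $fp = fopen($path, 'c+');          // create if missing, do not truncate
    if ($fp === false) {
        throw new RuntimeException("cannot open {$path}");
    }
    flock($fp, LOCK_EX);               // block until this process owns the lock
    $current = (int) stream_get_contents($fp);
    // 'init' resets the count; anything else is added to the current value
    $new = ($update === 'init') ? 0 : $current + $update;
    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, (string) $new);
    fflush($fp);
    flock($fp, LOCK_UN);
    fclose($fp);
    return $new;
}

$path = sys_get_temp_dir() . '/demo_counter_' . getmypid();
demo_counter($path, 'init');           // reset to 0
$first  = demo_counter($path);         // add the default step of 1
$second = demo_counter($path);
$third  = demo_counter($path, 5);      // add an explicit step
unlink($path);
```

A worker could treat the returned value as the next task id (for example the tid in a forum URL), so that no two processes fetch the same page.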

Program efficiency

The test ran on the smallest Vultr instance: 1 CPU (3.6 GHz), 768 MB RAM, in a Los Angeles (USA) data center, which affects the crawl speed to some extent.
After more than 10 minutes of execution, the statistics are as follows:

Run-time memory usage (while true; do psmem | grep php; sleep) is as follows:

The output of vmstat 1 is as follows:

iftop bandwidth monitoring shows the following:

A brief explanation:
With 50 child processes running for 11 minutes 55 seconds, the crawler made 50,951 fetches; at this rate it can make about 6.15 million fetches per day.
All processes (1 parent process + 50 child processes) use about 60 MB of memory and about 20% of the CPU (1 core); bandwidth usage is about 7-8 Mbps.
Judging by these figures, this machine could handle five times as many child processes, but the target machine could not withstand that much load.
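The daily figure above is a straight extrapolation of the measured rate. A small sketch of the arithmetic, where the function name crawls_per_day is hypothetical and the inputs are the numbers quoted in the text:

```php
<?php
// Extrapolate a measured crawl count to a full day (86,400 seconds).
// crawls_per_day() is an illustrative helper, not part of the library.
function crawls_per_day(int $crawls, int $elapsedSeconds): int {
    return intdiv($crawls * 86400, $elapsedSeconds);
}

$elapsed = 11 * 60 + 55;                       // 11 min 55 s = 715 seconds
$perDay  = crawls_per_day(50951, $elapsed);    // 6156876
$perSec  = round(50951 / $elapsed, 1);         // 71.3
echo "{$perSec} fetches/s, {$perDay} fetches/day\n";
```

That is roughly 71 fetches per second, or a little over 6.15 million per day, which matches the estimate in the text.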

Crawl speed with different numbers of processes:
1 process

10 processes

100 processes

The multi-process wrapper is close to complete, but curl is so rich and powerful that the curl wrapper will probably never be perfect.

The code is as follows

curl.lib.php


<?php

// Command-line color output
$colors['red'] = "\33[31m";
$colors['green'] = "\33[32m";
$colors['yellow'] = "\33[33m";
$colors['end'] = "\33[0m";
$colors['reverse'] = "\33[7m";
$colors['purple'] = "\33[35m";

/*
Default parameter settings
*/
$curl_default_config['ua'] = 'Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)';
$curl_default_config['referer'] = '';
$curl_default_config['retry'] = 5;
$curl_default_config['conntimeout'] = 30;
$curl_default_config['fetchtimeout'] = 30;
$curl_default_config['downtimeout'] = 60;

/*
Set the referer for specific domain names (usually for downloading images); takes precedence over $curl_default_config.
An empty referer is used by default, which generally causes no problems.
E.g.: $referer_config = array(
    'img_domain' => 'web_domain',
    'e.hiphotos.baidu.com' => 'http://hi.baidu.com/');
*/
$referer_config = array('img1.51cto.com' => 'blog.51cto.com',
    '360doc.com' => 'www.360doc.com');

/*
Set the User-Agent for specific domain names; takes precedence over $curl_default_config.
The Baidu spider UA is used by default; very few sites reject it.
E.g.: $useragent_config = array(
    'web_domain' => 'user agent',
    'www.xxx.com' => 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)');
*/
$useragent_config = array('hiphotos.baidu.com' => 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)');

/*
 * If the machine has more than one IP address, the default outgoing IP can be changed; each call randomly selects one from the array. The IPs are not detected automatically, because some of them may need to be excluded.
 * E.g.: $local_ip_config = array('11.11.11.11', '22.22.22.22');
 */
$local_ip_config = array();

// Directory for cookies and temporary files
if (@file_exists('/dev/shm/') && @is_writable('/dev/shm/')) {
    $cookie_dir = $tmpfile_dir = '/dev/shm/';
} else {
    $cookie_dir = $tmpfile_dir = '/tmp/';
}

// Clear expired cookie files and temporary download files
if (php_sapi_name() == 'cli') {
    clear_curl_file();
}

/**
 * Fetch a web page with GET
 *
 * @param string $url Web page URL
 * @param string $encode Encoding of the returned page; defaults to GBK; set to an empty value to skip conversion
 * @return string HTML content of the page
 */
function curl_get($url, $encode = 'GBK') {
    return curl_func($url, 'GET', NULL, NULL, NULL, $encode);
}

/**
 * Request a web page with POST
 *
 * @param string $url The URL to request
 * @param array $data POST data to send
 * @param string $encode Encoding of the returned page; defaults to GBK; set to an empty value to skip conversion
 * @return mixed Page content, or NULL/FALSE on failure (see curl_func())
 */
function curl_post($url, $data, $encode = 'GBK') {
    return curl_func($url, 'POST', $data, NULL, NULL, $encode);
}

/**
 * Get the header information of a page
 *
 * The HTTP status line is not in "name: value" form, so http_code is used as its key; all other
 * entries keep their header names, converted to lowercase with '-' replaced by '_'
 *
 * @param string $url URL address
 * @param bool $follow Use the headers of the last response (after redirects) instead of the first
 * @return array The header array, or FALSE on failure
 */
function curl_header($url, $follow = true) {
    $header_text = curl_func($url, 'HEADER');

    if (!$header_text) {
        // Failed to get the HTTP header
        return FALSE;
    }
    $header_array = explode("\r\n\r\n", trim($header_text));
    if ($follow) {
        $last_header = array_pop($header_array);
    } else {
        $last_header = array_shift($header_array);
    }

    $lines = explode("\n", trim($last_header));

    // Process the status line
    $status_line = trim(array_shift($lines));
    preg_match("/(\d\d\d)/", $status_line, $preg);
    if (!empty($preg[1])) {
        $header['http_code'] = $preg[1];
    } else {
        $header['http_code'] = 0;
    }
    foreach ($lines as $line) {
        list($key, $val) = explode(':', $line, 2);
        $key = str_replace('-', '_', strtolower(trim($key)));
        $header[$key] = trim($val);
    }
    return $header;
}

/**
 * Download a file
 *
 * @param $url File address
 * @param $path Local path to save to
 * @return bool Whether the download succeeded
 */
function curl_down($url, $path, $data = null, $proxy = null) {
    if (empty($data)) {
        $method = 'GET';
    } else {
        $method = 'POST';
    }

    return curl_func($url, $method, $data, $path, $proxy);
}

/**
 * Send a GET request through a proxy
 *
 * @param string $url The URL to request
 * @param string $proxy Proxy address
 * @param string $encode Return encoding
 *
 * @return string Web page content
 */
function curl_get_by_proxy($url, $proxy, $encode = 'GBK') {
    return curl_func($url, 'GET', null, null, $proxy, $encode);
}


/**
 * Send a POST request through a proxy
 *
 * @param string $url The URL to request
 * @param string $proxy Proxy address
 * @param string $encode Return encoding
 *
 * @return string Web page content
 */
function curl_post_by_proxy($url, $data, $proxy, $encode = 'GBK') {
    return curl_func($url, 'POST', $data, NULL, $proxy, $encode);
}

/**
 * Download an image and name it by its detected type
 *
 * @param string $url Image URL
 * @param string $path_pre Save path without the file extension
 *
 * @return mixed The saved path; NULL if the image type is not recognized; the result of curl_down() if the download failed
 */

function img_down($url, $path_pre) {
    $img_tmp = '/tmp/curl_imgtmp_pid_' . getmypid();
    $res = curl_down($url, $img_tmp);
    if (empty($res)) {
        return $res;
    }
    $ext = get_img_ext($img_tmp);
    if (empty($ext)) {
        return NULL;
    }
    $path = "{$path_pre}.{$ext}";
    @mkdir(dirname($path), 0777, TRUE);
    // Move the temporary file to its final path
    rename($img_tmp, $path);
    return $path;
}

function get_img_ext($path) {
    $types = array(
        1 => 'gif',
        2 => 'jpg',
        3 => 'png',
        6 => 'bmp'
    );
    $info = @getimagesize($path);
    // getimagesize() returns FALSE for files it cannot read as images
    if ($info !== FALSE && isset($types[$info[2]])) {
        $ext = $info['type'] = $types[$info[2]];
        $ext == 'jpeg' && $ext = 'jpg';
    } else {
        $ext = FALSE;
    }
    return $ext;
}

/**
 * Get the file type
 *
 * @param string $filepath File path
 * @return array An array of the form array($type, $ext)
 */
function get_file_type($filepath) {

}

/**
 * Return the size of a remote file, used to check after downloading whether the local file has the same size
 * The size_download value from curl_getinfo() is not necessarily the true size of the file
 *
 * @param string $url URL address
 * @return string Size of the remote file, or FALSE if unknown
 */
function get_file_size($url) {
    $header = curl_header($url);
    if (!empty($header['content_length'])) {
        return $header['content_length'];
    } else {
        return FALSE;
    }
}

/**
 * Get the status code
 *
 * @param string $url URL address
 * @return string Status code
 */
function get_http_code($url, $follow = true) {
    $header = curl_header($url, $follow);
    if (!empty($header['http_code'])) {
        return $header['http_code'];
    } else {
        return FALSE;
    }
}

/**
 * Get the file suffix of a URL
 *
 * @param string $url URL address
 * @return array File type and suffix, as array($type, $ext)
 */
function curl_get_ext($url) {
    $header = curl_header($url);
    if (!empty($header['content_type'])) {
        @list($type, $ext) = @explode('/', $header['content_type']);
        if (!empty($type) && !empty($ext)) {
            return array($type, $ext);
        } else {
            return array('', '');
        }
    } else {
        return array('', '');
    }
}

/**
 * Wrapper around the curl operations
 *
 * @param string $url The URL to request
 * @param string $method Request method (POST, GET, HEADER, DOWN)
 * @param mix $arg POST data for POST; save path for DOWN
 * @param string $return_encode Encoding in which the page is returned
 * @param string $proxy Proxy
 * @return mix 4xx errors and blank pages return NULL; curl fetch errors return FALSE; the page content is returned if the result is normal.
 */

// To be improved: download to a temporary file, move it after the download succeeds (overwriting an existing file), and delete it if the download fails.
// To be improved: change the parameter form to curl_func($url, $method, $data=null, $savepath=null, $proxy=null, $return_encode='GBK')
function curl_func($url, $method, $data = null, $savepath = null, $proxy = null, $return_encode = null) {


    global $colors, $cookie_dir, $tmpfile_dir, $referer_config, $useragent_config, $local_ip_config, $curl_config;

    // Console output colors
    extract($colors);

    // Remove /../ segments from the URL
    $url = get_absolute_path($url);

    // Decode HTML entities
    $url = htmlspecialchars_decode($url);

    // Statistics
    if (function_exists('mp_counter')) {
        if (!empty($savepath)) {
            mp_counter('down_total');   // number of downloads
        } elseif ($method == 'HEADER') {
            mp_counter('header_total'); // number of HTTP header fetches
        } else {
            mp_counter('fetch_total');  // number of page fetches
        }
    }

    for ($i = 0; $i < curl_config_get('retry'); $i++) {

        // Initialization
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);

        // Set timeouts
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, curl_config_get('conntimeout')); // connection timeout
        if (empty($savepath)) {
            curl_setopt($ch, CURLOPT_TIMEOUT, curl_config_get('fetchtimeout')); // page fetch (including header) timeout
        } else {
            curl_setopt($ch, CURLOPT_TIMEOUT, curl_config_get('downtimeout')); // file download timeout
        }

        // Return the page content as a variable
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

        // Skip SSL verification
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);

        // Set the referer; the in-file configuration has the highest priority
        foreach ($referer_config as $domain => $ref) {
            if (stripos($url, $domain) !== FALSE) {
                $referer = $ref;
                break;
            }
        }

        // Otherwise check whether a referer was set through curl_set_referer()
        if (empty($referer) && !empty($curl_config[getmypid()]['referer'])) {
            $referer = $curl_config[getmypid()]['referer'];
        }

        if (!empty($referer)) {
            curl_setopt($ch, CURLOPT_REFERER, $referer);
        }

        // Set the User-Agent; the in-file configuration has the highest priority
        foreach ($useragent_config as $domain => $ua) {
            if (stripos($url, $domain) !== FALSE) {
                $useragent = $ua;
                break;
            }
        }
        // Otherwise use the value set through curl_set_ua(), or the default
        if (empty($useragent)) {
            $useragent = curl_config_get('ua');
        }

        curl_setopt($ch, CURLOPT_USERAGENT, $useragent);

        // Outgoing IP
        if (!empty($local_ip_config)) {
            curl_setopt($ch, CURLOPT_INTERFACE, $local_ip_config[array_rand($local_ip_config)]);
        }

        // Set the proxy
        if (!empty($proxy)) {
            curl_setopt($ch, CURLOPT_PROXY, $proxy);
            curl_setopt($ch, CURLOPT_PROXYTYPE, CURLPROXY_SOCKS5);
        }

        // Accept gzip-compressed data and decompress it automatically. Not used when fetching headers, so that the correct file size is obtained (it is used to judge whether a download succeeded)
        if ($method != 'HEADER') {
            curl_setopt($ch, CURLOPT_HTTPHEADER, array('Accept-Encoding: gzip, deflate'));
            curl_setopt($ch, CURLOPT_ENCODING, "");
        }

        // Follow 301 and 302 redirects automatically; this option has no effect in a web program with open_basedir set
        @curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        // Maximum number of redirects, to avoid an endless loop
        curl_setopt($ch, CURLOPT_MAXREDIRS, 5);

        // Enable cookies
        $cookie_path = $cookie_dir . 'curl_cookie_pid_' . get_ppid();
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_path);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_path);

        // Set the POST data
        if ($method == 'POST') {
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
        }

        // Set the parameters for a download
        if (!empty($savepath)) {
            $tmpfile = $tmpfile_dir . '/curl_tmpfile_pid_' . getmypid();
            file_exists($tmpfile) && unlink($tmpfile);
            $fp = fopen($tmpfile, 'w');
            curl_setopt($ch, CURLOPT_FILE, $fp);
        }

        // Fetch the header only
        if ($method == 'HEADER') {
            curl_setopt($ch, CURLOPT_NOBODY, TRUE);
            curl_setopt($ch, CURLOPT_HEADER, TRUE);
        }

        // Execute the request
        $curl_res = curl_exec($ch);
        // curl info
        $info = curl_getinfo($ch);

        // Debug curl timing: record the connection time, wait time, transfer time, and total time.
        // Test method: put a sleep before any output, and a sleep in the middle of the output
        /*
        foreach ($info as $key => $val) {
            echo "$key: $val\n";
        }
        exit(9);
        */
        // Error information
        $error_msg = curl_error($ch);
        $error_no = curl_errno($ch);

        // Close the curl handle
        curl_close($ch);

        // Close the download file handle, if one was opened, so its size can be read reliably
        if (isset($fp) && is_resource($fp)) {
            fclose($fp);
        }

        // If curl reports an error, treat the fetch as failed and try again
        if (!empty($error_no) || !empty($error_msg)) {
            $error_msg = "{$error_msg} ($error_no)";
            curl_msg($error_msg, $method, $url, 'yellow');
            continue;
        }

        // Traffic statistics
        if (function_exists('mp_counter')) {
            if (!empty($info['size_download']) && $info['size_download'] > 0) {
                mp_counter('download_total', $info['size_download']);
            }
        }

        // Process the result
        if ($method == 'HEADER') {
            // Return the header information
            return $curl_res;
        } else {
            // The final status code
            $status_code = $info['http_code'];

            if (in_array($status_code, array_merge(range(400, 417), array(500, 444)))) {
                // Not a temporary server fault: stop immediately and return NULL
                $error_msg = $status_code;
                if (!empty($savepath)) {
                    $method = "{$method}|DOWN";
                }
                curl_msg($error_msg, $method, $url, 'red');
                return NULL;
            }
            if ($status_code != 200) {
                // Guards against temporary errors such as 502: apart from the cases above, any status other than 200 is retried. This rule should be refined as circumstances require.
                // curl follows redirects automatically, so 301 and 302 do not appear here unless the number of redirects exceeds CURLOPT_MAXREDIRS
                $error_msg = $status_code;
                curl_msg($error_msg, $method, $url, 'yellow');
                continue;
            }

            if (empty($savepath)) {
                // Fetching a page
                if (empty($curl_res)) {
                    // Blank page
                    $error_msg = "Blank page";
                    // Return NULL; the caller decides how to handle it
                    return NULL;
                } else {
                    // By default the page is returned in GBK encoding

                    // Parse the page encoding
                    preg_match_all("/<meta.*?charset=(\"|'|)(.*?)(;|\"|'|\s)/is", $curl_res, $matches);

                    // Conversion conditions: 1) an encoding was matched, 2) the return encoding is not empty, 3) the matched encoding differs from the return encoding
                    if (!empty($matches[2][0]) && !empty($return_encode)
                        && str_replace('-', '', strtolower($matches[2][0]))
                        != str_replace('-', '', strtolower($return_encode))) {
                        $curl_res = @iconv($matches[2][0], "{$return_encode}//IGNORE", $curl_res);
                        // Replace the encoding declared in the page
                        $curl_res = str_ireplace($matches[2][0], $return_encode, $curl_res);
                    }

                    // iconv yields an empty result if it fails
                    if (empty($curl_res)) {
                        return NULL;
                    } else {
                        // Convert relative paths to absolute paths
                        $curl_res = relative_to_absolute($curl_res, $url);
                        return $curl_res;
                    }
                }
            } else {
                // Downloading a file
                if (@filesize($tmpfile) == 0) {
                    $error_msg = 'Empty content';
                    continue;
                }

                // Download volume statistics
                if (function_exists('mp_counter')) {
                    mp_counter('download_size', filesize($tmpfile));
                }
                // Create the directory
                @mkdir(dirname($savepath), 0777, TRUE);
                // Move the temporary file to its final path
                rename($tmpfile, $savepath);

                return TRUE;
            }
        }
    }

    // When downloading or fetching headers, do not print the error if the error code is 6 (the domain name cannot be resolved): there are too many references to invalid images.
    // As a result, errors for invalid domain names are not reported; to be improved by validating the URL beforehand
    if (!(($method == 'HEADER' || !empty($savepath)) && !empty($error_no) && $error_no == 6)) {
        if (!empty($savepath)) {
            $method = "{$method}|DOWN";
        }
        curl_msg($error_msg, $method, $url, 'red');
    }

    // Statistics
    if (function_exists('mp_counter')) {
        if (!empty($savepath)) {
            mp_counter('down_failed');
        } elseif ($method == 'HEADER') {
            mp_counter('header_failed');
        } else {
            mp_counter('fetch_failed');
        }
    }

    return FALSE;
}

/**
 * Output an error message
 *
 * @param string $msg Error message
 * @param string $method Request method
 * @param string $url URL address
 * @param string $color Color
 */
function curl_msg($msg, $method, $url, $color) {
    global $colors;
    extract($colors);

    // Under high concurrency it is recommended to turn off the yellow error output
    $available_msg[] = 'yellow';
    $available_msg[] = 'red';

    if (php_sapi_name() != 'cli') {
        return;
    }

    if (!in_array($color, $available_msg)) {
        return;
    }

    echo "{$reverse}" . $colors[$color] . "({$method}) [CURL ERROR: {$msg}] {$url} {$end}\n";
}

/**
 * Convert a URL to an absolute path
 * A URL may contain '/../' segments that form a relative path; curl does not resolve them automatically
 * echo get_absolute_path("http://www.a.com/a/../b/../c/../././index.php");
 * The result is: http://www.a.com/index.php
 *
 * @param string $path The URL to process
 * @return string The absolute form of the URL
 */
function get_absolute_path($path) {
    $parts = array_filter(explode('/', $path), 'strlen');
    $absolutes = array();
    foreach ($parts as $part) {
        if ('.' == $part) continue;
        if ('..' == $part) {
            array_pop($absolutes);
        } else {
            $absolutes[] = $part;
        }
    }
    return str_replace(':/', '://', implode('/', $absolutes));
}

/**
 * Use the MD5 value of the image URL as the path, split into hierarchical directories.
 * With a directory depth of 3, the rewrite rule is: rewrite ^/(.)(.)(.)(.*)$ /$1/$2/$3/$4 break;
 * Assuming an average of 1 image per article, 30 million articles give 30 million images; a 3-level directory tree ends in 4096 subdirectories, averaging 7,324 images per directory
 *
 * @param string $str Original image address
 * @param int $deep Directory depth
 * @return string The hierarchical directory path
 */
function md5_path($str, $deep = 3) {
    // Assumption: the length argument of substr() is missing in the source, so the full 32-character hash is used here
    $md5 = md5($str);
    preg_match_all('/./', $md5, $preg);
    $res = '';
    for ($i = 0; $i < count($preg[0]); $i++) {
        $res .= $preg[0][$i];
        if ($i < $deep) {
            $res .= '/';
        }
    }
    return $res;
}

function relative_to_absolute($content, $url) {
    $content = preg_replace("/src\s*=\s*\"\s*/", "src=\"", $content);
    $content = preg_replace("/href\s*=\s*\"\s*/", "href=\"", $content);

    preg_match("/(http|https|ftp):\/\/[^\/]*/", $url, $preg_base);
    if (!empty($preg_base[0])) {
        // $preg_base[0] looks like http://www.yundaiwei.com
        // Handle links that begin with /, that is, paths relative to the site's root directory
        $content = preg_replace('/href=\s*"\//i', 'href="' . $preg_base[0] . '/', $content);
        $content = preg_replace('/src=\s*"\//ims', 'src="' . $preg_base[0] . '/', $content);
    }

    preg_match("/(http|https|ftp):\/\/.*\//", $url, $preg_full);
    if (!empty($preg_full[0])) {
        // Handle paths relative to the current directory, such as src="../../images/jobs/lippman.gif"
        // Exclude local file links beginning with file:// and base64 images in data:image form
        $content = preg_replace('/href=\s*"\s*(?!http|file:\/\/|data:image|javascript)/i', 'href="' . $preg_full[0], $content);
        $content = preg_replace('/src=\s*"\s*(?!http|file:\/\/|data:image|javascript)/i', 'src="' . $preg_full[0], $content);
    }

    return $content;
}

/**
 * Clear expired cookie files and temporary download files
 */
function clear_curl_file() {
    global $cookie_dir;

    $cookie_files = glob("{$cookie_dir}curl_*_pid_*");
    $tmp_files = glob("/tmp/curl_*_pid_*");
    // The two lists are identical when $cookie_dir is /tmp/, so remove duplicates
    $files = array_unique(array_merge($cookie_files, $tmp_files));

    foreach ($files as $file) {
        preg_match("/pid_(\d*)/", $file, $preg);
        $pid = $preg[1];
        $exe_path = "/proc/{$pid}/exe";

        // If the file does not exist, the process does not exist; otherwise check whether it is a PHP process, excluding PHP-FPM processes
        if (!file_exists($exe_path)
            || stripos(readlink($exe_path), 'php') === FALSE
            || stripos(readlink($exe_path), 'php-fpm') !== FALSE) {
            $sem = @sem_get(@ftok($file, 'a'));
            if ($sem) {
                @sem_remove($sem);
            }
            unlink($file);
        }
    }
}


/**
 * When called in a child process, get the PID of the parent process; otherwise get the process's own PID
 * @return int
 */
if (!function_exists('get_ppid')) {
    function get_ppid() {

        if (php_sapi_name() != 'cli') {
            // Called in web mode: return the PID of the process executing PHP, such as Apache or PHP-FPM
            return getmypid();
        } else {
            // Command-line execution enters here
            // The call may come from a child process or from the parent process; both must resolve to the same PID so that the file holding the saved variable content stays the same
            $ppid = posix_getppid();
            // In theory this test can be fooled. In practice, apart from forked child processes, the parent of a PHP process is unlikely to have "php" in its program name.
            if (strpos(readlink("/proc/{$ppid}/exe"), 'php') === FALSE) {
                $pid = getmypid();
            } else {
                $pid = $ppid;
            }
            return $pid;
        }

    }
}


// UTF-8 to GBK
if (!function_exists('u2g')) {
    function u2g($string) {
        return @iconv("UTF-8", "GBK//IGNORE", $string);
    }
}


// GBK to UTF-8
if (!function_exists('g2u')) {
    function g2u($string)
    {
        return @iconv("GBK", "UTF-8//IGNORE", $string);
    }
}

function curl_rand_ua_pc() {
    $ua = 'Mozilla/5.0 (compatible; MSIE ' . rand(7, 9)
        . '.0; Windows NT 6.1; WOW64; Trident/' . rand(4, 5) . '.0)';
    return $ua;
}

function curl_rand_ua_mobile() {
    $op = 'Mozilla/5.0 (Linux; U; Android ' . rand(4, 5) . '.' . rand(1, 5) . rand(1, 5) . '; zh-cn; MI ' . rand(3, 5) . '); ';
    $browser = 'AppleWebKit/' . rand(500, 700) . rand(1, 100) . '.' . rand(1, 100)
        . ' (KHTML, like Gecko) Version/' . rand(5, 10)
        . '.0 Mobile Safari/537.36 XiaoMi/MiuiBrowser/' . rand(1, 5) . rand(1, 5) . rand(1, 5);
    return $op . $browser;
}

function curl_config_get($key) {
    global $curl_config, $curl_default_config;

    // A value set for the current process takes precedence over the default
    if (!empty($curl_config[getmypid()][$key])) {
        return $curl_config[getmypid()][$key];
    } elseif (!empty($curl_default_config[$key])) {
        return $curl_default_config[$key];
    } else {
        echo '$curl_default_config' . "[$key] not found!\n";
        exit(9);
    }
}

function curl_config_set($key, $val) {
    global $curl_config;
    $curl_config[getmypid()][$key] = $val;
}

function curl_set_ua($ua) {
    curl_config_set('ua', $ua);
}

function curl_set_referer($referer) {
    curl_config_set('referer', $referer);
}

function curl_set_retry($retry) {
    curl_config_set('retry', $retry);
}

function curl_set_conntimeout($conntimeout) {
    curl_config_set('conntimeout', $conntimeout);
}

function curl_set_fetchtimeout($fetchtimeout) {
    curl_config_set('fetchtimeout', $fetchtimeout);
}

function curl_set_downtimeout($downtimeout) {
    curl_config_set('downtimeout', $downtimeout);
}
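
The retry policy inside curl_func() is easier to see in isolation. The sketch below is a simplified illustration, not the library code: fetch_with_retry and the injected $fetch callable are hypothetical, and the list of status codes treated as permanent (400 to 417, plus 500 and 444) is an assumption about what the listing intends.

```php
<?php
// Hypothetical sketch of the retry policy. $fetch stands in for one curl
// attempt and must return array($status_code, $body).
function fetch_with_retry(callable $fetch, int $retry = 5) {
    // Assumed set of codes for which retrying is pointless
    $permanent = array_merge(range(400, 417), array(500, 444));
    for ($i = 0; $i < $retry; $i++) {
        list($status, $body) = $fetch();
        if (in_array($status, $permanent)) {
            return NULL;               // permanent error: give up at once
        }
        if ($status != 200) {
            continue;                  // temporary error such as 502: try again
        }
        if ($body === '') {
            return NULL;               // blank page: let the caller decide
        }
        return $body;                  // success
    }
    return FALSE;                      // every attempt failed
}

// Two temporary failures followed by a success
$responses = array(array(502, ''), array(502, ''), array(200, 'ok'));
$attempts = 0;
$result = fetch_with_retry(function () use (&$responses, &$attempts) {
    $attempts++;
    return array_shift($responses);
});
```

Separating NULL (do not retry) from FALSE (retries exhausted) lets a caller skip a dead URL immediately but requeue one that failed only temporarily.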

process.lib.php


<?php
if (php_sapi_name() != 'cli') {
    return;
}

declare(ticks = 1);

// Interrupt signals
$signals = array(
    SIGINT => "SIGINT",
    SIGHUP => "SIGHUP",
    SIGQUIT => "SIGQUIT"
);

// Command-line color output
$colors['red'] = "\33[31m";
$colors['green'] = "\33[32m";
$colors['yellow'] = "\33[33m";
$colors['end'] = "\33[0m";
$colors['reverse'] = "\33[7m";
$colors['purple'] = "\33[35m";
$colors['cyan'] = "\33[36m";

// Program start time
$start_time = time();

// Parent process PID
$fpid = getmypid();

// File save directory. /dev/shm/ is memory exposed as a file system, so I/O is fast.
// In some environments, such as OpenVZ VPSes, this directory may be missing or the path may actually be on the hard disk
if (file_exists('/dev/shm/') && is_dir('/dev/shm/')) {
    $process_file_dir = '/dev/shm/';
} else {
    $process_file_dir = '/tmp/';
}

// Clean up expired resources (files and semaphore locks). This must be called on every run to remove the files left over from the previous run.
clear_process_resource();

// Determine whether the current process is a child process
function is_subprocess() {
    global $fpid;
    if (getmypid() != $fpid) {
        return true;
    } else {
        return false;
    }
}

/**
 * Multi-process counter
 *
 * 1. Used for task assignment and counting across multiple processes. For example, to collect the posts of a DZ forum,
 *    the counter can serve as the tid in /thread-tid-1-1.html, so that the processes coordinate their work
 * 2. Because the shm_* functions are inflexible to work with, data is read and written mainly under /proc/ and /dev/shm/
 *    (memory operations, unaffected by hard disk I/O performance), and semaphores implement locking and mutual exclusion
 * 3. PHP must be compiled with --enable-sysvsem so that the required semaphore module is installed
 *
 * @param string $countername Counter name
 * @param mix $update Value added to the counter; if it is 'init', the counter is reset to 0
 * @return int The count
 */
function mp_counter($countername, $update = 1) {
    global $process_file_dir;
    $time = date('Y-m-d H:i:s');

    // Parent process PID, or own PID
    $top_pid = get_ppid();

    // System boot time
    $sysuptime = get_sysuptime();

    // Process start time
    $ppuptime = get_ppuptime($top_pid);

    // The path prefix of the variable file is determined by the parent process ID
    $path_pre = "{$process_file_dir}mp_counter_{$countername}_pid_{$top_pid}_";

    // The file used for counting is determined by the system boot time and the start time of the current parent process (in jiffies)
    $cur_path = "{$path_pre}btime_{$sysuptime}_ptime_{$ppuptime}";

    // Lock before updating the count
    $lock = sem_lock();

    if (!file_exists($cur_path)) {
        // Debugging code. On some systems the boot time changes, so the file path changes with it and the count ends up at 0.
        $log = "[{$time}]-{$countername} ($cur_path)-init\n";
        file_put_contents('/tmp/process.log', $log, FILE_APPEND);

        $counter = 0;
    } else {
        // In theory the file must exist here
        $counter = file_get_contents($cur_path);
    }

    // Update the count. 'init' must be tested with ===, not ==: before PHP 8, 0 == 'init' is true
    if ($update === 'init') {
        // If the update value is 'init', the count is reset to 0
        $new_counter = 0;
    } else {
        $new_counter = $counter + $update;
    }

    // Write the count, then unlock
    file_put_contents($cur_path, $new_counter);
    sem_unlock($lock);

    return $new_counter;
}

/**
 * Create multiple processes
 *
 * 1. Task coordination between processes is achieved through the mp_counter() function
 * 2. Because a PHP process may exit abnormally (mainly segmentation faults), and because child processes must exit
 *    voluntarily to deal with memory leaks, this function automatically creates new processes to keep the number of
 *    child processes at $num
 * 3. PHP must be compiled with --enable-pcntl so that the required module is installed
 * 4. If exit(9) is called in a child process, the main process and all child processes will exit
 *
 * @param int $num Number of processes
 * @param bool $stat Whether to output statistics at the end
 */
function multi_process($num, $stat = false) {
    global $colors, $signals;
    extract($colors);

    if (empty($num)) {
        $num = 1;
    }

    // Record the number of processes, for statistics
    mp_counter('process_num', 'init');
    mp_counter('process_num', $num);

    // Number of child processes
    $child = 0;

    // Task-finished flag
    $task_finish = FALSE;

    while (TRUE) {

        // Clear the child process exit status
        unset($status);

        // If the task is not finished and the number of child processes is below the limit, create one
        if ($task_finish == FALSE && $child < $num) {
            $pid = pcntl_fork();
            if ($pid) {
                // A PID was returned: this is the parent process
                $child++;

                // Register the signal handlers for the parent process
                if ($stat) {
                    foreach ($signals as $signal => $name) {
                        if (!pcntl_signal($signal, "signal_handler")) {
                            die("Install signal handler for {$name} failed");
                        }
                    }
                }

                $stat && pcntl_signal(SIGINT, "signal_handler");

                echo "{$reverse}{$green}[+]New Process Forked: {$pid}{$end}\n";
                mp_counter('t_lines', -1);
            } else {
                // After the fork, the child process enters here

                // 1. Register a signal whose handler simply calls exit(), so that the child process does no processing of its own and only the main process handles the signal
                // 2. Apparently, if no signal is registered for the child process separately, it uses the parent process's handler
                $stat && pcntl_signal(SIGINT, "sub_process_exit");

                // After registering the signal, return and continue with the rest of the main program
                return;
            }
        }

Sub-process Management Section
if ($task _finish) {
If the task is completed
if ($child > 0) {
If there are child processes that are not exiting, wait, or exit
Pcntl_wait ($status);
$child--;
} else {
All child processes exit, parent process exits

Statistical information


$stat &amp;&amp; Final_stat ();





Here, the parent process does not quit, returns instead, continues processing the successor tasks, such as deleting files


Exit ();


Return


}


}else{


If the task does not complete


if ($child &gt;= $num) {


The child process has reached the number and waits for the subprocess to exit


Pcntl_wait ($status);


$child--;


}else{


The child process does not reach the quantity, the next loop continues to create


}


}

When the child process exits the status code of 9 o'clock, it is judged to be complete for all tasks, and then wait for all child processes to exit
if (!empty ($status) && pcntl_wexitstatus ($status) = = 9) {
$task _finish = TRUE;
}
}
}




/**
 * Check whether the same script is already running, so that only one instance runs.
 *
 * @return bool
 */
function single_process() {
    if (get_ppid() !== getmypid()) {
        echo "Fatal Error: Can't call single_process() in a child process!\n";
        exit(9);
    }
    $self = get_path();
    $files = glob("/proc/*/exe");
    foreach ($files as $exe_path) {
        if (stripos(@readlink($exe_path), 'php') !== FALSE
        && stripos(readlink($exe_path), 'php-fpm') === FALSE) {
            // This is a PHP process (php-fpm excluded)
            preg_match("/\/proc\/(\d+)\/exe/", $exe_path, $preg);
            if (!empty($preg[1]) && get_path($preg[1]) == $self && $preg[1] != getmypid()) {
                exit("Fatal Error: This script is already running!\n");
            }
        }
    }
    return TRUE;
}

/**
 * Get the absolute path of the script itself; the script must be started as "php foo.php".
 *
 * @param int $pid
 * @return string
 */
function get_path($pid = 0) {
    if ($pid == 0) {
        $pid = get_ppid();
    }
    $cwd = @readlink("/proc/{$pid}/cwd");
    // Arguments in /proc/<pid>/cmdline are separated by NUL bytes
    $cmdline = @file_get_contents("/proc/{$pid}/cmdline");
    preg_match("/php(.*?\.php)/", $cmdline, $preg);
    if (empty($preg[1])) {
        return FALSE;
    } else {
        $script = $preg[1];
    }

    if (strpos($script, '/') === FALSE || strpos($script, '..') !== FALSE) {
        $path = "{$cwd}/{$script}";
    } else {
        $path = $script;
    }
    // Strip the NUL separators before resolving the path
    $path = realpath(str_replace("\0", "", strval($path)));
    if (!file_exists($path)) {
        exit("Fatal Error: Can't locate the PHP script path!\n");
    }

    return $path;
}

function final_stat() {
    global $colors;
    extract($colors);

    // Time statistics
    global $start_time;
    $usetime = time() - $start_time;
    $usetime < 1 && $usetime = 1;
    $H = floor($usetime / 3600);
    $i = floor($usetime / 60) % 60;
    $s = $usetime % 60;
    $str_usetime = sprintf("%02d hours, %02d minutes, %02d seconds", $H, $i, $s);
    echo "\n{$green}========================================================================\n";
    echo "All Task done! Used time: {$str_usetime} ({$usetime}s).\n";

    // cURL fetch statistics
    $fetch_total = mp_counter('fetch_total', 0);
    $fetch_success = $fetch_total - mp_counter('fetch_failed', 0);

    $down_total = mp_counter('down_total', 0);
    $down_success = $down_total - mp_counter('down_failed', 0);

    $header_total = mp_counter('header_total', 0);
    $header_success = $header_total - mp_counter('header_failed', 0);

    $download_size = hs(mp_counter('download_size', 0));

    echo "Request Stat: Fetch ({$fetch_success}/{$fetch_total}), Header ({$header_success}/{$header_total}), ";
    echo "Download ({$down_success}/{$down_total}, {$download_size}).\n";

    // cURL traffic statistics
    $bw_in = hs(mp_counter('download_total', 0));
    $rate_down = hbw(mp_counter('download_total', 0) / $usetime);
    echo "Bandwidth Stat (Rough): Total ({$bw_in}), Rate ({$rate_down}).\n";

    // Efficiency statistics
    $process_num = mp_counter('process_num', 0);
    $fetch_rps = hnum($fetch_success / $usetime);
    $fetch_rph = hnum($fetch_success * 3600 / $usetime);
    $fetch_rpd = hnum($fetch_success * 3600 * 24 / $usetime);
    echo "Efficiency: Process ({$reverse}{$process_num}{$end}), Second ({$fetch_rps}), ";
    echo "Hour ({$fetch_rph}), Day ({$reverse}{$fetch_rpd}{$end}{$green}).\n";

    echo "========================================================================{$end}\n";
}

/**
 * Signal handler for the parent process.
 *
 * @param $signal
 */
function signal_handler($signal) {
    global $colors, $signals;
    extract($colors);
    if (array_key_exists($signal, $signals)) {
        kill_all_child();
        echo "\n{$cyan}Ctrl + C caught, quit!{$end}\n";
        final_stat();
        exit();
    }
}

function sub_process_exit() {
    exit(9);
}

function hnum($num) {
    if ($num < 10) {
        $res = round($num, 1);
    } elseif ($num < 10000) {
        $res = floor($num);
    } elseif ($num < 100000) {
        // From here on the value is shown in units of 10,000 ('w')
        $res = round($num / 10000, 1) . 'w';
    } else {
        $res = floor($num / 10000) . 'w';
    }
    return $res;
}

/**
 * Human-readable bandwidth rate
 *
 * @param $size number of bytes
 * @return string
 */
function hbw($size) {
    // Bytes to bits
    $size *= 8;
    if ($size > 1024 * 1024 * 1024) {
        $rate = round($size / 1073741824 * 100) / 100 . ' Gbps';
    } elseif ($size > 1024 * 1024) {
        $rate = round($size / 1048576 * 100) / 100 . ' Mbps';
    } elseif ($size > 1024) {
        $rate = round($size / 1024 * 100) / 100 . ' Kbps';
    } else {
        $rate = round($size) . ' bps';
    }
    return $rate;
}

/**
 * Human-readable data size
 *
 * @param $size number of bytes
 * @return string
 */
function hs($size) {
    if ($size > 1024 * 1024 * 1024) {
        $size = round($size / 1073741824 * 100) / 100 . ' GB';
    } elseif ($size > 1024 * 1024) {
        $size = round($size / 1048576 * 100) / 100 . ' MB';
    } elseif ($size > 1024) {
        $size = round($size / 1024 * 100) / 100 . ' KB';
    } else {
        $size = round($size) . ' Bytes';
    }
    return $size;
}
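
hbw() and hs() walk the same 1024-based threshold ladder and differ only in their unit labels. As a rough sketch of that idea, the ladder can be driven by a list of units. The name human_size() and its $units parameter are hypothetical and are not part of the original script.

```php
<?php
// Hypothetical helper, not part of the original script.
function human_size($size, array $units = array('Bytes', 'KB', 'MB', 'GB')) {
    $i = 0;
    // Step up one unit while the value exceeds 1024 and a larger unit exists
    while ($size > 1024 && $i < count($units) - 1) {
        $size /= 1024;
        $i++;
    }
    // Whole numbers for the base unit, two decimals otherwise (as in hs())
    $value = ($i === 0) ? round($size) : round($size * 100) / 100;
    return $value . ' ' . $units[$i];
}

echo human_size(512), "\n";              // prints "512 Bytes"
echo human_size(1536), "\n";             // prints "1.5 KB"
echo human_size(5 * 1024 * 1024), "\n";  // prints "5 MB"
```

Passing array('bps', 'Kbps', 'Mbps', 'Gbps') together with a byte count multiplied by 8 would give the bandwidth form.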

/**
 * Kill all child processes
 */
function kill_all_child() {
    $ppid = getmypid();
    $files = glob("/proc/*/stat");
    foreach ($files as $file) {
        if (is_file($file)) {
            // Field 4 of /proc/<pid>/stat (index 3) is the parent PID
            $sections = explode(' ', file_get_contents($file));
            if ($sections[3] == $ppid) {
                posix_kill($sections[0], SIGTERM);
            }
        }
    }
}

if (!function_exists('get_ppid')) {
    function get_ppid() {
        // Determine whether we are called from a child process or from the parent; in either case the file that stores the shared variable must be at the same location
        $ppid = posix_getppid();
        // In theory this check can be fooled. In practice, apart from forked child processes, it is unlikely that the parent of a PHP process has "php" in its program name.
        if (strpos(readlink("/proc/{$ppid}/exe"), 'php') === FALSE) {
            $pid = getmypid();
        } else {
            $pid = $ppid;
        }
        return $pid;
    }
}

// Processes in the same group (when several processes run, the parent's PID identifies the group) share one lock per lock name.
function sem_lock($lock_name = NULL) {
    global $process_file_dir;
    $pid = get_ppid();
    if (empty($lock_name)) {
        $lockfile = "{$process_file_dir}sem_keyfile_main_pid_{$pid}";
    } else {
        $lockfile = "{$process_file_dir}sem_keyfile_{$lock_name}_pid_{$pid}";
    }
    if (!file_exists($lockfile)) {
        touch($lockfile);
    }
    $shm_id = sem_get(ftok($lockfile, 'a'), 1, 0600, true);
    if (sem_acquire($shm_id)) {
        return $shm_id;
    } else {
        return FALSE;
    }
}

// Unlock
function sem_unlock($shm_id) {
    sem_release($shm_id);
}

// Clean up resources (key files and semaphore locks)
function clear_process_resource() {
    global $process_file_dir;

    // Remove the semaphore key files and the semaphores
    $files = glob("{$process_file_dir}sem_keyfile*pid_*");
    foreach ($files as $file) {
        preg_match("/pid_(\d*)/", $file, $preg);
        $pid = $preg[1];
        $exe_path = "/proc/{$pid}/exe";
        // If the file does not exist, the process no longer exists; otherwise check whether it is a PHP process, excluding php-fpm
        if (!file_exists($exe_path)
        || stripos(readlink($exe_path), 'php') === FALSE
        || stripos(readlink($exe_path), 'php-fpm') !== FALSE) {
            $sem = @sem_get(@ftok($file, 'a'));
            if ($sem) {
                @sem_remove($sem);
            }
            @unlink($file);
        }
    }

    // Clean up mp_counter files (this type of file must not be reused, so it is handled strictly: both the system boot time and the process start time must match)
    $files = glob("{$process_file_dir}mp_counter*");
    foreach ($files as $file) {
        preg_match("/pid_(\d*)_btime_(\d*)_ptime_(\d*)/", $file, $preg);
        $pid = $preg[1];
        $btime = $preg[2];
        $ptime = $preg[3];
        $exe_path = "/proc/{$pid}/exe";

        // Purge the file
        if (!file_exists($exe_path)
        || stripos(readlink($exe_path), 'php') === FALSE
        || stripos(readlink($exe_path), 'php-fpm') !== FALSE
        || $btime != get_sysuptime()
        || $ptime != get_ppuptime($pid)) {
            @unlink($file);
        }
    }
}

// System boot time
function get_sysuptime() {
    preg_match("/btime (\d+)/", file_get_contents("/proc/stat"), $preg);
    return $preg[1];
}

// When called in a child process, pass the parent's PID to get the parent's start time; otherwise this returns the process's own start time. The time is in jiffies.
function get_ppuptime($pid) {
    // Field 22 of /proc/<pid>/stat (index 21) is the process start time
    $stat_sections = explode(" ", file_get_contents("/proc/{$pid}/stat"));
    return $stat_sections[21];
}
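
kill_all_child() and get_ppuptime() both split /proc/<pid>/stat on single spaces and read fixed indexes (3 for the parent PID, 21 for the start time). That holds as long as the command name in field 2 contains no spaces. A minimal sketch of a more tolerant parser follows, assuming the Linux proc(5) field layout; parse_stat_line() and the sample line are hypothetical and are not part of the original script.

```php
<?php
// Hypothetical parser for one /proc/<pid>/stat line (Linux proc(5) layout).
function parse_stat_line($line) {
    // The command name is wrapped in parentheses and may itself contain spaces,
    // so split around the last ')' rather than on every space.
    $open  = strpos($line, '(');
    $close = strrpos($line, ')');
    if ($open === false || $close === false) {
        return false;
    }
    $rest = explode(' ', trim(substr($line, $close + 1)));
    return array(
        'pid'       => (int) substr($line, 0, $open),
        'comm'      => substr($line, $open + 1, $close - $open - 1),
        'state'     => $rest[0],        // field 3
        'ppid'      => (int) $rest[1],  // field 4 (index 3 in the original code)
        'starttime' => $rest[19],       // field 22 (index 21 in the original code)
    );
}

// Made-up sample line whose command name contains a space
$sample = '1234 (php worker) S 1000 1234 1234 0 -1 4194304 100 0 0 0 5 3 0 0 20 0 1 0 987654 1000000 200';
$info = parse_stat_line($sample);
echo $info['ppid'], ' ', $info['starttime'], "\n";  // prints "1000 987654"

// The plain split shifts every index by one for this line
$naive = explode(' ', $sample);
echo $naive[3], "\n";                                // prints "S"
```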

// Guard against PHP memory leaks: each child process exits after performing a certain number of tasks.
function rand_exit($num = 100) {
    if (rand(floor($num * 0.5), floor($num * 1.5)) === $num) {
        exit();
    }
}
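
rand_exit() is meant to be called once per task, so a child exits at a random point and multi_process() replaces it. The sketch below estimates how many tasks a child handles before that happens, assuming the comparison draws a random integer between floor($num * 0.5) and floor($num * 1.5). The names should_recycle() and tasks_until_recycle() are hypothetical; they return values instead of calling exit() so the behavior can be observed.

```php
<?php
// Hypothetical variant of rand_exit(): returns the decision instead of calling exit().
function should_recycle($num = 100) {
    $low  = (int) floor($num * 0.5);
    $high = (int) floor($num * 1.5);
    // Exactly one value out of ($high - $low + 1) triggers a recycle
    return mt_rand($low, $high) === (int) $num;
}

// Count how many tasks a simulated worker handles before it would exit.
function tasks_until_recycle($num = 100) {
    $tasks = 0;
    do {
        $tasks++;
    } while (!should_recycle($num));
    return $tasks;
}

mt_srand(42);  // fixed seed so repeated runs behave the same
$runs = 2000;
$total = 0;
for ($i = 0; $i < $runs; $i++) {
    $total += tasks_until_recycle(100);
}
printf("average tasks per worker: %.1f\n", $total / $runs);
```

With $num = 100 the range 50..150 holds 101 values, so each call triggers an exit with probability 1/101 and the average lifetime is close to $num tasks. Individual lifetimes vary widely, which keeps the children from all restarting at the same moment.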

// Output function: prints one line per task result
function mp_msg() {
    global $start_time, $colors;
    extract($colors);

    // Assemble the message
    $msg = date('[h:i:s]');
    $max = 0;
    $msg_array = func_get_args();
    foreach ($msg_array as $key => $val) {
        // Collapse runs of whitespace
        $val = preg_replace("/\s{2,}/", " ", $val);
        $msg_array[$key] = $val;
        if (is_int($key)) {
            $msg .= " {$val}";
        } else {
            $msg .= " {$key}: {$val}";
        }
        // Remember the longest argument; it is the one that gets truncated
        if (strlen($val) > strlen($msg_array[$max])) {
            $max = $key;
        }
    }

    // Running from cron (no SSH terminal): strip color codes and print the line as-is
    if (empty($_SERVER['SSH_TTY'])) {
        $msg = preg_replace("/\\\33\[\d\dm/", "", $msg);
        echo "{$msg}\n";
        return;
    }

    // Count down the remaining terminal lines; when a screen is full, re-read the terminal size
    $lock = sem_lock('mp_msg');
    $t_lines = mp_counter('t_lines', -1);
    if ($t_lines <= 1) {
        mp_counter('t_lines', 'init');
        mp_counter('t_lines', shell_exec('tput lines'));
        mp_counter('t_cols', 'init');
        mp_counter('t_cols', shell_exec('tput cols'));
    }
    sem_unlock($lock);

    // Keep the message within one terminal line
    $t_cols = mp_counter('t_cols', 0);
    $msg_len = strlen($msg);
    if ($msg_len > $t_cols) {
        $cut_len = strlen($msg_array[$max]) - ($msg_len - $t_cols);
        $msg = str_replace($msg_array[$max], substr($msg_array[$max], 0, $cut_len), $msg);
    }
    echo "{$msg}\n";

    // Once per screen, print a statistics line
    if ($t_lines <= 1) {
        $usetime = time() - $start_time;
        $usetime < 1 && $usetime = 1;
        $H = floor($usetime / 3600);
        $i = floor($usetime / 60) % 60;
        $s = $usetime % 60;
        $str_usetime = sprintf("%02d:%02d:%02d", $H, $i, $s);

        $process_num = mp_counter('process_num', 0);

        $fetch_total = mp_counter('fetch_total', 0);
        $fetch_success = $fetch_total - mp_counter('fetch_failed', 0);
        $fetch = hnum($fetch_success);
        $fetch_all = hnum($fetch_total);
        $fetch_rpd = hnum($fetch_success * 3600 * 24 / $usetime);

        echo "{$reverse}{$purple}";
        echo "Stat: Time ({$str_usetime}) Process ({$process_num}) Fetch ({$fetch}/{$fetch_all}) Day ({$fetch_rpd})";
        echo "{$end}\n";
        flush();
    }

}
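
mp_msg() keeps each message on one terminal line by shortening its longest argument by the number of overflowing characters. A minimal sketch of that step in isolation follows. The helper fit_line() and the sample values are hypothetical; unlike the original, it shortens the part before joining instead of calling str_replace() on the finished message.

```php
<?php
// Hypothetical helper mirroring the truncation step of mp_msg().
function fit_line(array $parts, $cols) {
    $msg = implode(' ', $parts);
    $overflow = strlen($msg) - $cols;
    if ($overflow <= 0) {
        return $msg;  // already fits
    }
    // Find the longest part; it is the one that gets shortened
    $max = 0;
    foreach ($parts as $key => $val) {
        if (strlen($val) > strlen($parts[$max])) {
            $max = $key;
        }
    }
    // Never cut below zero characters
    $cut_len = max(0, strlen($parts[$max]) - $overflow);
    $parts[$max] = substr($parts[$max], 0, $cut_len);
    return implode(' ', $parts);
}

$line = fit_line(
    array('[12:00:00]', 'tid: 42', 'http://example.com/a/very/long/path/to/page.html', 'ok'),
    40
);
echo $line, "\n";          // prints "[12:00:00] tid: 42 http://example.com ok"
echo strlen($line), "\n";  // prints 40
```

If the longest part is shorter than the overflow, the line still exceeds the width; the sketch does not try to shorten a second part.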
