PHP multi-process programming (3): multi-process web page crawling demonstration

To understand the code in this part, please read first:

PHP multi-process programming (1)

PHP multi-process programming (2) pipeline communication

We know that passing data from the parent process to a child process is easy, but passing data from a child process back to the parent is harder.

There are many ways to implement inter-process communication. In PHP, pipes (FIFOs) are convenient; you can also use a socket pair (socket_create_pair) for communication.
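As a side note, here is a minimal sketch of the socket_create_pair() alternative (this example is mine, not from the original code; it requires the pcntl and sockets extensions and the CLI SAPI):

```php
<?php
// Child-to-parent communication over a Unix socket pair.
$pair = [];
if (!socket_create_pair(AF_UNIX, SOCK_STREAM, 0, $pair)) {
    die("socket_create_pair() failed: " . socket_strerror(socket_last_error()) . "\n");
}

$pid = pcntl_fork();
if ($pid == -1) {
    die("fork failed\n");
} elseif ($pid == 0) {
    // Child: close the unused end, write a result to the parent, exit.
    socket_close($pair[0]);
    socket_write($pair[1], "result from child\n");
    socket_close($pair[1]);
    exit(0);
}

// Parent: close the unused end and read what the child sent.
socket_close($pair[1]);
$msg = socket_read($pair[0], 2048, PHP_NORMAL_READ);
socket_close($pair[0]);
pcntl_waitpid($pid, $status);
echo trim($msg), "\n";
```

Unlike a FIFO on disk, the socket pair needs no filesystem path and is cleaned up automatically when both ends close.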

First, here is what the server does for each request. The client sends a sequence of URLs separated by \t; the end of the request is marked by \n.

function clientHandle($msgsock, $obj)
{
    $nbuf = '';
    socket_set_block($msgsock);
    do {
        if (false === ($buf = @socket_read($msgsock, 2048, PHP_NORMAL_READ))) {
            $obj->error("socket_read() failed: reason: " . socket_strerror(socket_last_error($msgsock)));
            break;
        }
        $nbuf .= $buf;
        if (substr($nbuf, -1) != "\n") {
            continue;
        }
        $nbuf = trim($nbuf);
        if ($nbuf == 'quit') {
            break;
        }
        if ($nbuf == 'shutdown') {
            break;
        }
        $url = explode("\t", $nbuf);
        $nbuf = '';
        $talkback = serialize(read_ntitle($url));
        socket_write($msgsock, $talkback, strlen($talkback));
        debug("write to the client\n");
        break;
    } while (true);
}
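A client of this server would frame its request as described above: URLs joined by "\t", terminated by "\n", with the response being a serialize()d array. The following sketch is mine; the host and port are assumptions, not from the article:

```php
<?php
// Build the request in the server's wire format.
$urls = ["http://example.com/", "http://example.org/"];
$request = implode("\t", $urls) . "\n";

// Send it and decode the serialized reply (address/port are assumed).
$fp = @fsockopen("127.0.0.1", 8000, $errno, $errstr, 5);
if ($fp) {
    fwrite($fp, $request);
    $titles = unserialize(stream_get_contents($fp));
    fclose($fp);
    print_r($titles);
}
```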

The key part of the above code is read_ntitle. This function reads the titles of several pages in parallel, one child process per URL.

The code is as follows. For each URL we fork a child that opens the pipe for writing and writes the title it reads into it; the parent keeps reading from the pipe until all the data has arrived, then deletes the FIFO.

function read_ntitle($arr)
{
    $pipe = new Pipe("multi-read");
    foreach ($arr as $k => $item) {
        $pids[$k] = pcntl_fork();
        if (!$pids[$k]) {
            $pipe->open_write();
            $pid = posix_getpid();
            $content = base64_encode(read_title($item));
            $pipe->write("$k,$content\n");
            $pipe->close_write();
            debug("$k: write success!\n");
            exit;
        }
    }
    debug("read begin!\n");
    $data = $pipe->read_all();
    debug("read end!\n");
    $pipe->rm_pipe();
    return parse_data($data);
}

The code for parse_data is as follows; it is very simple:

function parse_data($data)
{
    $data = explode("\n", $data);
    $new = array();
    foreach ($data as $value) {
        $value = explode(",", $value);
        if (count($value) == 2) {
            $value[1] = base64_decode($value[1]);
            $new[intval($value[0])] = $value[1];
        }
    }
    ksort($new, SORT_NUMERIC);
    return $new;
}
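The Pipe class itself comes from part (2) of this series. For readers without that code at hand, here is a minimal FIFO-based sketch that matches how read_ntitle() uses it; the method names are taken from the call sites above, but the internals are my assumption:

```php
<?php
// Minimal named-pipe wrapper (sketch; the real class is in part 2).
class Pipe
{
    private $path;
    private $w;

    public function __construct($name = "pipe")
    {
        // One FIFO per parent process under /tmp.
        $this->path = "/tmp/$name." . posix_getpid();
        if (!file_exists($this->path)) {
            posix_mkfifo($this->path, 0666);
        }
    }

    public function open_write()
    {
        // Blocks until a reader opens the other end.
        $this->w = fopen($this->path, "w");
    }

    public function write($data)
    {
        fwrite($this->w, $data);
    }

    public function close_write()
    {
        fclose($this->w);
    }

    // Read until every writer has closed its end (EOF on the FIFO).
    public function read_all()
    {
        $r = fopen($this->path, "r");
        $data = "";
        while (!feof($r)) {
            $data .= fread($r, 1024);
        }
        fclose($r);
        return $data;
    }

    public function rm_pipe()
    {
        unlink($this->path);
    }
}
```

Because children inherit the Pipe object after fork, they all write into the same FIFO, and read_all() in the parent sees EOF only once every child has closed its write end.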

The above code relies on one more function, read_title, which takes a little care. For compatibility I did not use curl; it talks to the web server directly over a socket.

As soon as the title tag has been downloaded, it stops reading the rest of the page to save time. The code is as follows:

function read_title($url)
{
    $url_info = parse_url($url);
    if (!isset($url_info['host']) || !isset($url_info['scheme'])) {
        return false;
    }
    $host = $url_info['host'];
    $port = isset($url_info['port']) ? $url_info['port'] : null;
    $path = isset($url_info['path']) ? $url_info['path'] : "/";
    if (isset($url_info['query'])) {
        $path .= "?" . $url_info['query'];
    }
    if (empty($port)) {
        $port = 80;
    }
    if ($url_info['scheme'] == 'https') {
        $port = 443;
    }
    if ($url_info['scheme'] == 'http') {
        $port = 80;
    }
    $out = "GET $path HTTP/1.1\r\n";
    $out .= "Host: $host\r\n";
    $out .= "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9.1.7)\r\n";
    $out .= "Connection: Close\r\n\r\n";
    // https needs the ssl:// transport with fsockopen()
    $transport = ($url_info['scheme'] == 'https') ? "ssl://" : "";
    $fp = fsockopen($transport . $host, $port, $errno, $errstr, 5);
    if (!$fp) {
        error("get title from $url, error. $errno: $errstr\n");
        return false;
    }
    fwrite($fp, $out);
    $content = '';
    while (!feof($fp)) {
        $content .= fgets($fp, 1024);
        if (preg_match("/<title>(.*?)<\/title>/is", $content, $matches)) {
            fclose($fp);
            return encode_to_utf8($matches[1]);
        }
    }
    fclose($fp);
    return false;
}

function encode_to_utf8($string)
{
    return mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, GB2312, ISO-8859-1", true));
}

Here I only detect the three most common encodings; the rest of the code is straightforward. These snippets are meant for testing. If you want to run such a server for real, you must optimize it, and in particular you need to prevent too many processes from being forked at once.
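On limiting the number of processes, one common approach (a sketch of mine, not from the article) is to block on pcntl_wait() before each new fork once a cap is reached:

```php
<?php
// Cap concurrent children: wait for one to exit before forking the next.
$max_children = 3;   // assumed limit
$total = 8;          // number of tasks to run
$running = 0;
$reaped = 0;

for ($task = 0; $task < $total; $task++) {
    if ($running >= $max_children) {
        pcntl_wait($status); // block until one child exits
        $running--;
        $reaped++;
    }
    $pid = pcntl_fork();
    if ($pid == 0) {
        // Child: do the work (here just a short sleep) and exit.
        usleep(10000);
        exit(0);
    }
    $running++;
}

// Reap the remaining children.
while ($running > 0) {
    pcntl_wait($status);
    $running--;
    $reaped++;
}
echo "reaped $reaped children\n";
```

Applied to read_ntitle() above, the same check would go just before the pcntl_fork() call in its foreach loop.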

We often complain that PHP does not support multiple processes. In fact it does; there are just not many options for inter-process communication, and the core of multi-process programming lies in communication and synchronization between processes. In web development this kind of multi-processing is rarely used because of its performance cost; to implement simple multi-process handling under high load, you must use the pcntl extension.
