PHP crawler: crawling and analyzing a million Zhihu users' data
This run collected data on 1.1 million users; the analysis of that data is shown below.
Pre-development preparation
- Install a Linux system (Ubuntu 14.04); here Ubuntu runs inside a VMware virtual machine;
- Install PHP 5.6 or above;
- Install MySQL 5.5 or above;
- Install the curl and pcntl extensions.
Crawling page data with PHP's curl extension
PHP's curl extension is a PHP-supported library that lets you connect to and communicate with many kinds of servers over many kinds of protocols.
This program crawls user data, and reaching a user's personal page requires being logged in. When we click a user's avatar in the browser and land on that user's profile page, we can see their information because the browser sends our local cookies along with the request to the new page. Therefore, before the crawler can access a personal page it needs the user's cookie information, and every curl request must carry that cookie. To get the cookie information, I used my own account and inspected my cookies on the page:
Copy each item into a cookie string of the form "__utma=?; __utmb=?; ...". That cookie string can then be used to send requests.
The initial implementation looks like this:
$url = 'http://www.zhihu.com/people/mora-hu/about'; // here mora-hu is the user id
$ch = curl_init($url); // initialize the curl session
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_COOKIE, $this->config_arr['user_cookie']); // set the request cookie
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the response of curl_exec() as a string instead of printing it directly
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
return $result; // the crawled result
Running the code above fetches the mora-hu user's profile page. Regular expressions are then applied to the returned page to extract the information we need, such as name and gender.
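For reference, below is a rough sketch of what such a getUserInfo() parsing helper might look like. The tag and class names in the patterns ("name", "item gender", "icon-profile-male") are assumptions made for this illustration, not the actual Zhihu markup, so the expressions would need to be adapted to the real page:

// A hedged sketch of parsing profile fields with regular expressions.
// The markup used in the patterns is assumed for illustration only.
function getUserInfo($html)
{
    $info = array('name' => '', 'gender' => '');
    if (preg_match('/<span class="name">([^<]*)<\/span>/', $html, $m)) {
        $info['name'] = trim($m[1]);
    }
    if (preg_match('/<span class="item gender">(.*?)<\/span>/s', $html, $m)) {
        $info['gender'] = strpos($m[1], 'icon-profile-male') !== false ? 'male' : 'female';
    }
    return $info;
}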
1. Image hotlink protection
When outputting the personal information parsed from the returned result, I found that the user's avatar could not be displayed on my page. After some digging, it turned out that the images are protected against hotlinking. The solution is to forge a Referer header when requesting the image.
After extracting the image link with a regular expression, send another request for it, this time carrying a Referer for the image request so it looks as if the request was forwarded from the site itself. For example:
function getImg($url, $u_id)
{
    if (file_exists('./images/' . $u_id . ".jpg")) {
        return "images/$u_id" . '.jpg';
    }
    if (empty($url)) {
        return '';
    }
    $context_options = array(
        'http' => array(
            'header' => "Referer: http://www.zhihu.com" // carry the Referer header
        )
    );
    $context = stream_context_create($context_options);
    $img = file_get_contents('http:' . $url, false, $context);
    file_put_contents('./images/' . $u_id . ".jpg", $img);
    return "images/$u_id" . '.jpg';
}
2. Crawling more users
Once one user's personal information has been crawled, you need to visit that user's followers and followees lists to collect more users, and then keep going layer by layer. As you can see, the profile page contains two links like the following:
There are two links here: one for the users this person follows (followees) and one for their followers; take the "followees" link as the example. Use a regular expression to match the corresponding link, get the URL, and then send another curl request carrying the cookie. After fetching the followees list page, you get a page like this:
Analyze the HTML structure of the page: since we only need the users' information, we only have to extract this one div block, which is where the user names live. As you can see, the URL of a followee's page is:
The URLs of different users are almost identical; the only difference is the user name. Use a regular expression to extract the list of user names, concatenate each into a URL, and then send the requests one by one (doing this serially is of course slow; there is a solution for that, discussed later). After reaching a new user's page, repeat the steps above and loop until you have collected the amount of data you want. A sketch of this extraction step follows.
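As a rough illustration, the sketch below collects user ids from a followees page and builds the next batch of profile URLs. The "/people/<id>" href pattern is an assumption based on the profile URLs shown above, and $followees_html stands for the followees page fetched earlier:

// A hedged sketch of collecting user ids from a followees page.
function extractUserIds($html)
{
    preg_match_all('#/people/([\w\-]+)#', $html, $matches);
    return array_unique($matches[1]);
}

$user_list = extractUserIds($followees_html);
$next_urls = array();
foreach ($user_list as $id) {
    // build the profile URL for each newly discovered user
    $next_urls[] = 'http://www.zhihu.com/people/' . $id . '/about';
}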
3. Counting files under Linux
After the script had been running for a while, I wanted to check how many images had been downloaded, but with a large amount of data, opening the folder to count the pictures is slow. Since the script runs in a Linux environment, the number of files can be counted with a Linux command:
ls -l | grep "^-" | wc -l
Here, ls -l prints the long-format listing of the entries in the directory (which can include directories, links, device files, and so on); grep "^-" filters that output so that only regular files remain (use "^d" if you want directories only); and wc -l counts the number of lines in the resulting output.
4. Handling duplicate data when inserting into MySQL
After the program had been running for some time, I found that many users' data was duplicated, so duplicate user data has to be handled at insert time. The options are as follows:
1) Check whether the record already exists before inserting into the database;
2) Add a unique index and insert with INSERT INTO ... ON DUPLICATE KEY UPDATE ...;
3) Add a unique index and insert with INSERT IGNORE INTO ...;
4) Add a unique index and insert with REPLACE INTO ....
The first option is the simplest but also the least efficient, so it was ruled out. Options 2 and 4 produce the same end result; the difference is that when a duplicate is encountered, INSERT INTO ... ON DUPLICATE KEY UPDATE updates the row directly, whereas REPLACE INTO first deletes the old row and then inserts the new one, which also requires re-maintaining the index and is therefore slower, so between those two option 2 wins. As for option 3, INSERT IGNORE ignores the errors raised while executing the INSERT statement: it does not ignore syntax problems, but it does ignore duplicate-key conflicts, which makes INSERT IGNORE look attractive as well. In the end, because I wanted the program to keep track of how many duplicate rows were encountered, option 2 was adopted; a sketch of that statement follows.
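For illustration, here is a minimal sketch of option 2 using PDO. The table name zhihu_user, its columns, and the unique index on url_token are assumptions made for this example; the program's actual database layer is not shown in this article:

// A hedged sketch of INSERT ... ON DUPLICATE KEY UPDATE, assuming a
// zhihu_user table with a unique index on url_token.
$sql = "INSERT INTO zhihu_user (url_token, name, gender)
        VALUES (:url_token, :name, :gender)
        ON DUPLICATE KEY UPDATE name = VALUES(name), gender = VALUES(gender)";
$stmt = $pdo->prepare($sql); // $pdo: an existing PDO connection to the MySQL database
$stmt->execute(array(
    ':url_token' => $user['url_token'],
    ':name'      => $user['name'],
    ':gender'    => $user['gender'],
));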
5. Using curl_multi to crawl pages with multiple concurrent requests
At first I used a single process and a single curl handle to crawl the data, which was very slow: leaving the machine running overnight only yielded about 20,000 records. I then wondered whether I could request several users at once when sending the curl requests from a new user's page, and later discovered curl_multi. The curl_multi functions can request multiple URLs at the same time rather than one by one, similar to the way a single process in a Linux system can run multiple threads. Below is an example of using curl_multi to implement this kind of multi-threaded crawler:
$mh = curl_multi_init(); // returns a new curl multi (batch) handle
for ($i = 0; $i < $max_size; $i++) {
    $ch = curl_init(); // initialize a single curl session
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_URL, 'http://www.zhihu.com/people/' . $user_list[$i] . '/about');
    curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $requestMap[$i] = $ch;
    curl_multi_add_handle($mh, $ch); // add the individual curl handle to the multi session
}

$user_arr = array();
do {
    // run the sub-connections of the current multi handle
    while (($cme = curl_multi_exec($mh, $active)) == CURLM_CALL_MULTI_PERFORM);
    if ($cme != CURLM_OK) {
        break;
    }
    // read the transfer information of the handles that have finished
    while ($done = curl_multi_info_read($mh)) {
        $info = curl_getinfo($done['handle']);
        $tmp_result = curl_multi_getcontent($done['handle']);
        $error = curl_error($done['handle']);
        $user_arr[] = array_values(getUserInfo($tmp_result));

        // keep $max_size requests in flight while there are still users left
        if ($i < sizeof($user_list) && isset($user_list[$i])) {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_URL, 'http://www.zhihu.com/people/' . $user_list[$i] . '/about');
            curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
            curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36');
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
            $requestMap[$i] = $ch;
            curl_multi_add_handle($mh, $ch);
            $i++;
        }
        curl_multi_remove_handle($mh, $done['handle']);
    }
    if ($active) {
        curl_multi_select($mh, 10);
    }
} while ($active);
curl_multi_close($mh);
return $user_arr;
6. HTTP 429 Too Many Requests
The curl_multi functions can send many requests at the same time, but while issuing 200 simultaneous requests I found that many of them never came back, i.e. requests were effectively being dropped. Analyzing further with curl_getinfo, which returns an associative array of HTTP response information for each request handle (the http_code field holds the HTTP status code of the response), I saw that many requests had an http_code of 429, which means too many requests were sent. My guess was that this is anti-crawler protection, so I tested against other sites and found that 200 simultaneous requests were no problem there, which confirmed the guess: Zhihu limits the number of simultaneous requests. I kept reducing the request count and found that at 5 there was no more loss. So this program can send at most 5 requests at a time, which is not many, but still a small improvement. A sketch of the status-code check follows.
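For reference, a fragment like the following could sit inside the curl_multi loop shown earlier to inspect the status code of each completed handle. The retry list is my own addition for illustration and is not part of the original program:

// $done comes from curl_multi_info_read() in the loop above
$info = curl_getinfo($done['handle']);
if ($info['http_code'] == 429) {
    // rate-limited by the site: remember the URL so it can be requested again later
    $retry_urls[] = $info['url'];
} elseif ($info['http_code'] == 200) {
    $user_arr[] = array_values(getUserInfo(curl_multi_getcontent($done['handle'])));
}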
7. Using Redis to record users that have already been visited
While crawling, I found that some users had already been visited and their followers and followees had already been collected. Although duplicates are handled at the database layer, the program still sends curl requests for them, so every repeated request is wasted network overhead. There is also the question of where to keep the users that are still waiting to be crawled for the next run. At first I used an array, but after adding multiple processes to the program this no longer worked: in multi-process programming the child processes share the program code and libraries, but each process's variables are separate from those of the other processes and cannot be read across processes, so an array will not do. I therefore turned to a Redis cache to store both the already-processed users and the users still to be crawled. Each time a user is processed it is pushed into an already_request_queue, and the users to be crawled (i.e. each user's followers and followees) are pushed into a request_queue; before every request the program pops a user from request_queue and checks whether it is already in already_request_queue: if it is, move on to the next one, otherwise continue processing it. A sketch of this queue flow appears after the Redis example below.
An example of using Redis in PHP:
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->set('tmp', 'value');
if ($redis->exists('tmp')) {
    echo $redis->get('tmp') . "\n";
}
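Building on that, here is a minimal sketch of the queue flow described in this section. The key names request_queue and already_request_queue follow the text; storing already_request_queue as a Redis set rather than a list is my own choice for this sketch, so that the membership check stays cheap:

<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// enqueue newly discovered users (each user's followers and followees)
foreach ($new_user_ids as $id) {
    $redis->rPush('request_queue', $id);
}

// before each request, take the next user and skip anyone already processed
while (($id = $redis->lPop('request_queue')) !== false) {
    if ($redis->sIsMember('already_request_queue', $id)) {
        continue; // already crawled, move on to the next one
    }
    $redis->sAdd('already_request_queue', $id);
    // ... send the curl request for this user's profile page here ...
    break;
}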
8. Using PHP's pcntl extension for multi-process crawling
Even with curl_multi crawling user information in multiple threads, the program only gathered about 100,000 records after running for a night, still short of my goal, so I kept optimizing and then discovered that PHP has a pcntl extension for multi-process programming. Here is an example of multi-process programming:
// PHP multi-process demo
// fork 10 child processes
for ($i = 0; $i < 10; $i++) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        echo "Could not fork!\n";
        exit(1);
    }
    if (!$pid) {
        echo "child process $i running\n";
        // exit once the child has finished its work, so it does not fork new children of its own
        exit($i);
    }
}

// wait for the child processes to finish, to avoid zombie processes
while (pcntl_waitpid(0, $status) != -1) {
    $status = pcntl_wexitstatus($status);
    echo "Child $status completed\n";
}
9. Viewing the system's CPU information under Linux
With multi-process programming in place, I figured I would simply start a few more processes to keep crawling user data, so I started 8 processes and let them run for a night, only to find about 200,000 records the next morning: not much of an improvement. Looking into it, I learned that for CPU performance tuning the maximum number of program processes should not be chosen arbitrarily but according to the number of CPU cores; ideally the maximum number of processes is twice the number of CPU cores. So I needed to check the CPU information to see how many cores it had. The command to view CPU information under Linux:
cat /proc/cpuinfo
The results are as follows:
Here the model name field shows the CPU model and cpu cores shows the number of cores. The core count was 1 in my case: everything ran inside a virtual machine that had been allocated only a small number of CPU cores, so only 2 processes could be started. In the end, 1.1 million users' data was crawled over one weekend. A small sketch of reading the core count from PHP follows.
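As a side note, the core count can also be read from inside PHP. The sketch below assumes a Linux host where /proc/cpuinfo is readable and simply applies the twice-the-cores rule of thumb described above:

// read the number of cores from /proc/cpuinfo and derive a worker limit
$cpuinfo = file_get_contents('/proc/cpuinfo');
$cores = 1; // fallback value if the field cannot be parsed
if (preg_match('/^cpu cores\s*:\s*(\d+)/m', $cpuinfo, $m)) {
    $cores = (int) $m[1];
}
$max_workers = $cores * 2; // rule of thumb: at most twice the number of CPU cores
echo "cpu cores: {$cores}, max worker processes: {$max_workers}\n";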
10. Redis and MySQL connectivity issues in multi-process programming
Under multi-process conditions, after the program had been running for a while, data could no longer be inserted into the database and MySQL reported a "too many connections" error; the same thing happened with Redis.
The following code will fail to execute:
<?php
for ($i = 0; $i < 10; $i++) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        echo "Could not fork!\n";
        exit(1);
    }
    if (!$pid) {
        $redis = PRedis::getInstance();
        // do something
        exit;
    }
}
The root cause is that when each child process is created, it inherits an identical copy of the parent process. Objects can be copied, but an established connection cannot be copied into several independent connections, so every process ends up using the same Redis connection, each doing its own thing on it, which eventually produces baffling conflicts.
Workaround: the program cannot fully guarantee that the parent process creates no Redis connection instance before forking, so the fix has to come from the child processes themselves. If the instance obtained inside a child process were tied only to the current process, the problem would disappear. The solution, then, is to tweak the static instantiation of the Redis class so that it is bound to the current process ID.
The modified code is as follows:
<?php
public static function getInstance()
{
    static $instances = array();
    $key = getmypid(); // get the current process id
    if (empty($instances[$key])) {
        $instances[$key] = new self();
    }
    return $instances[$key];
}
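A short usage sketch of the fix: each forked child calls getInstance() and, because the lookup key is its own process ID, gets an instance that no other process shares. RedisHelper is only a placeholder name for whatever class holds the method above:

for ($i = 0; $i < 10; $i++) {
    $pid = pcntl_fork();
    if (!$pid) {
        // child: this call creates a fresh instance keyed by getmypid()
        $redis = RedisHelper::getInstance();
        // ... do this child's crawling work with $redis here ...
        exit(0);
    }
}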
11. Measuring script execution time in PHP
Since I wanted to know how much time each process spends, I wrote a function to measure the script's execution time:
function microtime_float()
{
    list($usec, $sec) = explode(' ', microtime());
    return floatval($usec) + floatval($sec);
}

$start_time = microtime_float();

// do something
usleep(100);

$end_time = microtime_float();
$total_time = $end_time - $start_time;
$time_cost = sprintf("%.10f", $total_time);
echo $time_cost . "s\n";
That is the whole content of this article. I hope it serves as a useful reference and helps with your own learning.