This article describes how to crawl and analyze user data at the million-record scale with PHP; by following these steps, about 1.1 million user records were captured. The results of the data analysis are as follows:
Preparations before development
- Install Ubuntu in a VMware virtual machine;
- Install PHP5.6 or later;
- Install MySQL 5.5 or later;
- Install curl and pcntl extensions.
Use PHP curl extension to capture page data
PHP's curl extension is a PHP-supported library that allows you to connect to and communicate with various servers using various protocols.
This program captures user data. To access a user's personal page, you must be logged in first. When we click a user's avatar link in the browser and land on that user's personal center page, we can see the information because the browser carries our local cookies along to the new page. Therefore, before the script accesses a personal page, it must first obtain the user's cookie information and then attach that cookie to every curl request. For the cookie information I used my own account; it can be seen on the page:
Copy the values one by one into a cookie string of the form "__utma=?; __utmb=?;". This cookie string can then be used to send requests.
Initial example:
$url = 'http://www.zhihu.com/people/mora-hu/about'; // mora-hu is the user ID
$ch = curl_init($url); // initialize the session
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_COOKIE, $this->config_arr['user_cookie']); // set the request cookie
curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the result of curl_exec() as a string instead of printing it directly
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
$result = curl_exec($ch);
return $result; // the captured result
Running the code above fetches the personal center page of the user mora-hu. You can then process the page with regular expressions to extract the information you need, such as name and gender.
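For illustration only, a minimal sketch of such an extraction might look like the following; the tag structure and the class name "name" are assumptions rather than Zhihu's actual markup, so the pattern has to be adapted to the real page:

// Sketch: pull the user name out of the fetched HTML.
// The <span class="name"> structure is an assumption; adjust the
// pattern to whatever the real page actually contains.
function parseUserName($html)
{
    if (preg_match('/<span[^>]*class="name"[^>]*>(.*?)<\/span>/s', $html, $matches)) {
        return trim(strip_tags($matches[1]));
    }
    return '';
}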
1. Image anti-leech protection
When the result processed by the regular expressions is output on a page, the user's profile picture cannot be displayed. After some reading I learned that Zhihu applies anti-leech protection to its images. The solution is to forge a Referer in the request header when requesting an image.
After extracting the image link with a regular expression, send another request, this time carrying a Referer that makes the request appear to come from the Zhihu website itself. An example is as follows:
function getImg($url, $u_id)
{
    if (file_exists('./images/' . $u_id . ".jpg")) {
        return "images/$u_id" . '.jpg';
    }
    if (empty($url)) {
        return '';
    }
    $context_options = array(
        'http' => array(
            'header' => "Referer: http://www.zhihu.com" // forge the Referer header
        )
    );
    $context = stream_context_create($context_options);
    $img = file_get_contents('http:' . $url, FALSE, $context);
    file_put_contents('./images/' . $u_id . ".jpg", $img);
    return "images/$u_id" . '.jpg';
}
2. Crawl more users
After capturing your own personal information, you need to visit the user's followers and followees lists to obtain more user information, and then keep going level by level. As you can see, there are two links on the personal center page:
One is "followers" and the other is "followees". Taking the "followees" link as an example, use regular expression matching to find the corresponding link; after obtaining the URL, use curl to send another request carrying the cookie. Once the followee list page has been captured, you get the following page:
Analyzing the HTML structure of the page, to obtain the user information we only need the enclosed <p> content and the user name. As you can see, the URL of each followee's page looks like this:
The URLs of different users are almost identical; only the user name differs. Get the list of user names with regular expression matching, splice the URLs together one by one, and then send the requests one by one (doing it one at a time is slow, of course; a solution is given later, and a small sketch of this step follows this paragraph). After entering a new user's page, repeat the steps above and keep looping until the desired amount of data is reached.
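As a rough sketch of this step only, the following follows the /people/<username>/about URL form used earlier; the href pattern is an assumption and must be adapted to the actual markup of the followee list page:

// Sketch: collect user names from the followee list HTML and build profile URLs.
// The href="/people/..." pattern is an assumption about the page markup.
function buildUserUrls($followeesHtml)
{
    $urls = array();
    if (preg_match_all('/href="\/people\/([^"\/]+)"/', $followeesHtml, $matches)) {
        foreach (array_unique($matches[1]) as $username) {
            $urls[] = 'http://www.zhihu.com/people/' . $username . '/about';
        }
    }
    return $urls;
}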
3. Counting files in Linux
After the script has been running for a while, you will want to check how many images have been downloaded. When the data volume is large, opening the folder and checking by eye is slow. Since the script runs in a Linux environment, you can count the files with a Linux command:
ls -l | grep "^-" | wc -l
Here, ls -l prints the long listing of the files in the directory (which can include directories, links, device files, and so on); grep "^-" filters the long listing and keeps only regular files (to keep only directories, use "^d" instead); and wc -l counts the number of lines in the output. The following is an example of the result:
4. Handle duplicate data when inserting into MySQL
After the program had been running for a while, I found that many users' data was duplicated, so duplicates have to be handled at insert time. The options are as follows:
1) Check whether the data already exists in the database before inserting;
2) Add a unique index and use INSERT INTO ... ON DUPLICATE KEY UPDATE ...
3) Add a unique index and use INSERT IGNORE INTO ...
4) Add a unique index and use REPLACE INTO ...
The first option is the simplest but also the least efficient, so it was not adopted. The second and fourth options produce the same end result; the difference is that when the same data is encountered, INSERT ... ON DUPLICATE KEY UPDATE updates the existing row directly, while REPLACE INTO deletes the old row and inserts the new one, during which the index has to be maintained again, so it is slower. Between those two, the second option is therefore the better choice. As for the third option, INSERT IGNORE ignores the error raised when executing the INSERT statement; it does not ignore syntax errors, only the error caused by an already-existing key, which at first glance makes it look even better. In the end, though, since the number of duplicate records in the database was something to keep track of, the program adopted the second option. A minimal sketch of it is shown below.
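The sketch below assumes a users table with a unique index on user_name; the table name, column names, values, and connection credentials are placeholders rather than details taken from the original program:

// Sketch of option 2: rely on a unique index and let MySQL update on conflict.
// Table, columns, values, and credentials are illustrative placeholders.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=zhihu_crawler;charset=utf8', 'db_user', 'db_pass');

$userName = 'mora-hu';               // example values only
$gender   = 'male';
$imgUrl   = 'images/mora-hu.jpg';

$sql = "INSERT INTO users (user_name, gender, img_url)
        VALUES (:name, :gender, :img)
        ON DUPLICATE KEY UPDATE gender = VALUES(gender), img_url = VALUES(img_url)";
$stmt = $pdo->prepare($sql);
$stmt->execute(array(':name' => $userName, ':gender' => $gender, ':img' => $imgUrl));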
5. Use curl_multi to capture multiple pages concurrently
Capturing data with a single process and a single curl handle is very slow: after crawling all night, the machine could only capture about 20,000 records. So I wondered whether I could request several users' pages at once when sending curl requests, and eventually I discovered curl_multi, which is a great tool. The curl_multi functions can request multiple URLs at the same time instead of one at a time, similar to the way a single process in Linux can run multiple threads. The following is an example of using curl_multi to implement a concurrent crawler:
$mh = curl_multi_init(); // returns a new cURL multi (batch) handle
for ($i = 0; $i < $max_size; $i++) {
    $ch = curl_init(); // initialize a single cURL session
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_URL, 'http://www.zhihu.com/people/' . $user_list[$i] . '/about');
    curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $requestMap[$i] = $ch;
    curl_multi_add_handle($mh, $ch); // add the single cURL handle to the multi handle
}

$user_arr = array();
do {
    // run the sub-connections of the current cURL multi handle
    while (($cme = curl_multi_exec($mh, $active)) == CURLM_CALL_MULTI_PERFORM);
    if ($cme != CURLM_OK) {
        break;
    }
    // read the transfer information of the handles that have completed
    while ($done = curl_multi_info_read($mh)) {
        $info = curl_getinfo($done['handle']);
        $tmp_result = curl_multi_getcontent($done['handle']);
        $error = curl_error($done['handle']);
        $user_arr[] = array_values(getUserInfo($tmp_result));

        // keep $max_size requests running at the same time
        if ($i < sizeof($user_list) && isset($user_list[$i]) && $i < count($user_list)) {
            $ch = curl_init();
            curl_setopt($ch, CURLOPT_HEADER, 0);
            curl_setopt($ch, CURLOPT_URL, 'http://www.zhihu.com/people/' . $user_list[$i] . '/about');
            curl_setopt($ch, CURLOPT_COOKIE, self::$user_cookie);
            curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.130 Safari/537.36');
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
            $requestMap[$i] = $ch;
            curl_multi_add_handle($mh, $ch);
            $i++;
        }
        curl_multi_remove_handle($mh, $done['handle']);
    }
    if ($active) {
        curl_multi_select($mh, 10);
    }
} while ($active);

curl_multi_close($mh);
return $user_arr;
6. HTTP 429 Too Many Requests
With the curl_multi functions you can send many requests at the same time, but during execution I found that when 200 requests were sent simultaneously, many of them never returned, i.e. they were effectively lost. For further analysis, I used the curl_getinfo function to print the information of each request handle; this function returns an associative array of HTTP response information, one field of which is http_code, the HTTP status code of the response. Many requests had an http_code of 429, which means too many requests were being sent. I guessed that Zhihu had anti-crawler protection, so I tested against other websites and found that sending 200 requests at once worked fine there, which confirmed the guess: Zhihu limits the number of simultaneous requests. I kept reducing the number of requests and found that at 5 there was no more loss, so this program can only send 5 requests at a time. That is not many, but it is still a small improvement. A small sketch of the status-code check is shown below.
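For illustration, a small helper like the following could be used inside the curl_multi_info_read loop from the earlier example to detect throttled responses; the function name wasThrottled is hypothetical, not part of the original program:

// Sketch: detect whether a finished cURL handle was throttled (HTTP 429),
// so the number of simultaneous requests ($max_size) can be lowered.
function wasThrottled($ch)
{
    $info = curl_getinfo($ch); // associative array of response information
    return isset($info['http_code']) && $info['http_code'] == 429;
}

Inside the loop, calling wasThrottled($done['handle']) and lowering $max_size when it returns true achieves the same effect as the manual tuning described above.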
7. Use Redis to save users that have already been visited
While capturing users I found that some of them had already been visited, and their followers and followees had already been fetched. Although duplicate data is handled at the database layer, the program still sends curl requests for these users, so the repeated requests cause a lot of wasted network overhead. Another issue is that the users waiting to be crawled need to be stored somewhere for the next run. At first they were kept in an array, but after multiple processes were added to the program this no longer worked: in multi-process programming, child processes share the program code and function libraries, but each process's variables are completely separate from those of other processes and cannot be read across processes, so an array cannot be used. I therefore decided to use Redis to store both the users that have already been processed and the users still to be captured. Each time a user is finished, it is pushed into an already_request_queue; the users to be crawled (that is, each user's followers and followees) are pushed into request_queue. Before each run, a user is popped from request_queue and checked against already_request_queue: if it is already there, move on to the next one, otherwise continue processing it.
Example of using redis in PHP:
<?php
$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->set('tmp', 'value');
if ($redis->exists('tmp')) {
    echo $redis->get('tmp') . "\n";
}
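Building on that, a minimal sketch of the queue/deduplication logic described above could look like the following; the two queue names come from the text, while the use of a Redis set for the membership test and the helper name nextUserToCrawl are assumptions of this sketch:

// Sketch: pop candidates from request_queue and skip the ones that are
// already recorded in already_request_queue (kept here as a Redis set).
function nextUserToCrawl(Redis $redis)
{
    while (($username = $redis->rPop('request_queue')) !== false) {
        if (!$redis->sIsMember('already_request_queue', $username)) {
            $redis->sAdd('already_request_queue', $username);
            return $username; // not seen before: crawl this user
        }
    }
    return null; // request_queue is empty
}

New users discovered on a followee page would then be added with $redis->lPush('request_queue', $username).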
8. Use PHP's pcntl extension to implement multiple processes
After switching to the curl_multi functions to capture user information, the program ran for another night, and the amount of data obtained still fell short of my goal, so I continued optimizing. Later I found that PHP has a pcntl extension that supports multi-process programming. The following is an example of multi-process programming:
// PHP multi-process demo
// fork 10 child processes
for ($i = 0; $i < 10; $i++) {
    $pid = pcntl_fork();
    if ($pid == -1) {
        echo "Could not fork!\n";
        exit(1);
    }
    if (!$pid) {
        echo "child process $i running\n";
        // the child exits when its work is done, so that it does not fork new children itself
        exit($i);
    }
}

// wait for the child processes to finish to avoid zombie processes
while (pcntl_waitpid(0, $status) != -1) {
    $status = pcntl_wexitstatus($status);
    echo "Child $status completed\n";
}
9. View the system's CPU information in Linux
Once multi-process programming was working, I thought about starting a few more processes to capture user data continuously, so I started eight processes and ran them for a night, only to find that the amount of data captured had not improved much. I then learned that, according to advice on optimizing for CPU performance, the maximum number of processes a program starts should not be chosen arbitrarily but should be based on the number of CPU cores: the ideal maximum is twice the number of CPU cores. So we need to look at the CPU information to find out how many cores there are. The command to view CPU information in Linux is:
cat /proc/cpuinfo
The result is as follows:
Model name shows the CPU model, and cpu cores shows the number of CPU cores. Here the number of cores is 1; because the script runs inside a virtual machine with relatively few CPU cores allocated, only two processes could be started. The final result: about 1.1 million user records were captured over one weekend. A small sketch of deriving the process count from this file is shown below.
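The following sketch derives the worker count in PHP by parsing /proc/cpuinfo, following the twice-the-cores rule from the text; the parsing itself is an assumption about the file format and reads the cpu cores field the same way the article does:

// Sketch: read the core count from /proc/cpuinfo and use twice that
// number as the maximum number of worker processes.
function maxWorkerProcesses()
{
    $cpuinfo = @file_get_contents('/proc/cpuinfo');
    if ($cpuinfo !== false && preg_match('/^cpu cores\s*:\s*(\d+)/m', $cpuinfo, $m)) {
        return 2 * (int)$m[1];
    }
    return 2; // fallback when the core count cannot be determined
}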
10. Redis and MySQL connections in multi-process programming
After the program had been running under multiple processes for a while, I found that data could no longer be inserted into the database: MySQL reported a "too many connections" error, and Redis showed the same kind of problem.
The following code does not run successfully:
<?php for ($i = 0; $i < 10; $i++) { $pid = pcntl_fork(); if ($pid == -1) { echo "Could not fork!\n"; exit(1); } if (!$pid) { $redis = PRedis::getInstance(); // do something exit; } }
The root cause is that when each child process is created, it inherits an identical copy of the parent process. Objects can be copied, but an already-established connection cannot be copied into multiple independent connections. The result is that every process uses the same Redis connection for its own operations, and eventually inexplicable conflicts occur.
Solution: the program cannot completely guarantee that the parent process will not create a Redis connection instance before forking, so the only way to solve this is to let each child process create its own instance. If the instance obtained inside a child process is related only to the current process, the problem no longer exists. The fix is to slightly modify the static instantiation method of the Redis class and bind each instance to the current process ID.
The modified code is as follows:
<?php
public static function getInstance()
{
    static $instances = array();
    $key = getmypid(); // get the current process ID
    if (empty($instances[$key])) {
        $instances[$key] = new self();
    }
    return $instances[$key];
}
11. Measuring script execution time in PHP
Because I wanted to know how much time each process takes, I wrote a function to measure the script execution time:
function microtime_float()
{
    list($u_sec, $sec) = explode(' ', microtime());
    return (floatval($u_sec) + floatval($sec));
}

$start_time = microtime_float();

// do something
usleep(100);

$end_time = microtime_float();
$total_time = $end_time - $start_time;
$time_cost = sprintf("%.10f", $total_time);
echo "program cost total " . $time_cost . "s\n";
That is all the content of this article; I hope it is helpful for your study.