PHP crawler: crawling and analyzing Zhihu user data
Background: a crawler written with PHP's curl extension collects the basic profile information of Zhihu users, and a simple analysis of the crawled data is presented.
The PHP spider code and the user dashboard display code will be uploaded to GitHub once they have been tidied up, and the repository will be announced on my personal blog and public account. The program is only for entertainment and learning; if it infringes on any rights or interests, please contact me and I will delete it as soon as possible.
No picture, no truth
Analysis data on the web end
Analysis data on the mobile end
The entire crawling, analysis, and presentation process is divided into the following steps.
Crawl Zhihu web page data with curl
Parse the page data with regular expressions
Store the data in the database and deploy the program
Analyze and present the data
Crawling web page data with curl
PHP's curl extension is a library supported by PHP that lets you connect to and communicate with many kinds of servers over many protocols. It is a very convenient tool for fetching web pages, and through the curl_multi functions it can run several transfers concurrently.
This program fetches the personal information page that Zhihu exposes to external users, https://www.zhihu.com/people/xxx. The request has to carry the user's cookie to obtain the page. Here is the code.
// Log in to Zhihu, open the personal center, open the browser console, and read the cookie
document.cookie
"_za=medium; _ga=ga1.2.21428188.1433767929; q_c1=medium|1452172601000|1452172601000; _xsrf=medium; cap_id="medium=|1453444256|medium"; _utmt=1; unlock_ticket="success=|1453444421|fail"; _utma=success; _utmb=51854390.14.8.14534441_11; _utmc=51854390; _utmz=51854390.1452846679.1.dd1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); _utmv=51854390.100-1|2=registration_date=20150823=1^dd3=entry_date=20150823=1"
/**
 * Capture the personal center page by user name and store the page.
 *
 * @param string $username User name flag
 * @return boolean Success or failure
 */
public function spiderUser($username)
{
    $cookie = "xxxx";
    // Here cui-xiao-zhuai represents the user id; you can read your own id directly from the URL
    $url_info = 'http://www.zhihu.com/people/' . $username;
    $ch = curl_init($url_info); // initialize the session
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_COOKIE, $cookie); // set the request cookie
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the result of curl_exec() as a string instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    if ($result === false) {
        return false; // the request failed, so there is nothing to store
    }
    file_put_contents('/home/work/zxdata_ch/php/zhihu_spider/file/comment', $result);
    return true;
}
Parsing page data with regular expressions and finding new links for further crawling
To keep crawling beyond the first page, the captured page must contain links to other users. Analyzing the Zhihu page shows that the personal center page lists the people the user follows, some of the people who liked the user's content, and the user's followers.
As shown below
// A new user is found on the captured html page, which can be used for crawling.
OK. In this way you can crawl continuously along the chain: yourself -> the people you follow -> the people they follow -> and so on. The next step is to extract this information with regular expression matching.
// Match all users on the captured page
preg_match_all('/\/people\/([\w-]+)\"/i', $str, $match_arr);
// De-duplicate and merge into the new user array; these users are crawled next
self::$newUserArr = array_unique(array_merge($match_arr[1], self::$newUserArr));
By now, the entire crawler process can proceed smoothly.
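The loop described above can be sketched as a breadth-first traversal. This is only a minimal illustration, not the original spider: the function names `extractUsernames` and `crawl`, the `$maxUsers` limit, and the `$fetch` callable (which stands in for the curl request) are all assumptions made for the example.

```php
<?php
/**
 * Extract the unique user names found in /people/<name>" links.
 */
function extractUsernames(string $html): array
{
    preg_match_all('/\/people\/([\w-]+)"/i', $html, $matches);
    return array_values(array_unique($matches[1]));
}

/**
 * Breadth-first crawl starting from $seed.
 * $fetch is a hypothetical callable: user name => HTML string, or false on failure.
 * Returns the user names whose pages were fetched, in visiting order.
 */
function crawl(string $seed, callable $fetch, int $maxUsers = 1000): array
{
    $queue = [$seed];        // users waiting to be fetched
    $seen = [$seed => true]; // users already queued, to avoid loops
    $visited = [];

    while ($queue !== [] && count($visited) < $maxUsers) {
        $user = array_shift($queue);
        $html = $fetch($user);
        if ($html === false) {
            continue; // skip pages that could not be downloaded
        }
        $visited[] = $user;
        foreach (extractUsernames($html) as $next) {
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }
    return $visited;
}
```

Passing the fetcher in as a callable keeps the traversal logic separate from the network code, so it can be tried against a few saved pages before it is pointed at the live site.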
If you need to capture a large amount of data, look into curl_multi and pcntl to crawl faster with concurrent requests and multiple processes.
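As a rough idea of what the curl_multi route looks like, the sketch below fetches several URLs concurrently. It is not part of the original spider: the function name `fetchAll`, its parameters, and the rule that only an HTTP 200 response counts as success are assumptions for illustration.

```php
<?php
/**
 * Fetch several URLs concurrently with curl_multi.
 * Returns an array mapping each URL to its body, or to false when the
 * final HTTP status is not 200. Duplicate URLs share one array key.
 */
function fetchAll(array $urls, string $cookie = '', int $timeout = 10): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        if ($cookie !== '') {
            curl_setopt($ch, CURLOPT_COOKIE, $cookie);
        }
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none of them is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $results[$url] = $code === 200 ? curl_multi_getcontent($ch) : false;
        curl_multi_remove_handle($multi, $ch);
    }
    return $results;
}
```

Note that curl_multi runs the transfers concurrently inside one process; running several such processes in parallel is where pcntl would come in.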
Parsing user data for analysis
Regular expressions can also be used to match more of each user's data. Here is the code.
// Get the user avatar
preg_match('//i', $str, $match_img);
$img_url = $match_img[1];

// Match the user name:
// Cui xiaotuo
preg_match('/([\x{4e00}-\x{9fa5}]+).+span>/u', $str, $match_name);
$user_name = $match_name[1];

// Match the user bio
// class bio span Chinese
preg_match('/([\x{4e00}-\x{9fa5}]+).+span>/u', $str, $match_title);
$user_title = $match_title[1];

// Match gender
// Male
// gender value1; end Chinese
preg_match('/

// 41 topics
preg_match('/class=\"?zg-link-litblue\"?>(\d+)\s.+strong>/i', $str, $match_topic);
$user_topic = $match_topic[1];

// Number of followees and followers
preg_match_all('/(\d+)<.+/i', $str, $match_care);
$user_care = $match_care[1][0];
$user_be_careed = $match_care[1][1];

// Historical page views
// The personal homepage has been viewed 17 times
preg_match('/class=\"?zg-gray-normal\"?.+>(\d+)<.+span>/i', $str, $match_browse);
$user_browse = $match_browse[1];
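To make the Chinese-character matching above concrete, here is a small self-contained sketch. The markup (`<span class="name">`), the sample name, and the helper `matchChineseName` are hypothetical and chosen only for illustration; the real Zhihu page structure may differ. The point it shows is that the `\x{4e00}-\x{9fa5}` range needs the `u` modifier, so that PCRE treats the pattern and the subject as UTF-8.

```php
<?php
/**
 * Return the first run of Chinese characters inside a
 * <span class="name">...</span> element, or null if there is none.
 * The markup is an assumption, not Zhihu's actual HTML.
 */
function matchChineseName(string $html): ?string
{
    // The u modifier enables UTF-8 mode, which \x{4e00}-\x{9fa5} requires.
    $pattern = '/<span class="name">([\x{4e00}-\x{9fa5}]+)<\/span>/u';
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}
```

Checking the return value of preg_match before reading `$m[1]` avoids undefined-index notices on pages where the element is missing.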
Storing the data and optimizing the program
During crawling, if you have the resources, it is best to buffer the data in redis before importing it into the database, which really does improve crawling and storage efficiency. If you do not, SQL optimization is the only option. Here are a few tips.
Be careful when designing indexes for the database tables. While the spider is crawling, it is recommended that, apart from the user name, no field be indexed, not even a primary key. The goal is to make inserts as fast as possible; imagine how much work it takes to update an index every time a row is added. Create the indexes in one batch when the data is needed for analysis.
Insert and update operations must be performed in batches. See MySQL's official recommendations on the speed of inserts, deletes, and updates: http://dev.mysql.com/doc/refman/5.7/en/insert-speed.html
INSERT INTO yourtable VALUES (1, 2), (5, 5), ...;
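One way to produce such a multi-row statement from PHP is to build it with placeholders and pass the values separately. The helper below is a sketch under assumptions: the name `buildBatchInsert` is invented for this example, and the table and column names are interpolated as-is, so they must come from trusted code rather than from crawled data.

```php
<?php
/**
 * Build one multi-row INSERT statement with positional placeholders.
 * Returns [sql, flat parameter list] for use with a prepared statement.
 * $table and $columns are NOT escaped and must be trusted values.
 */
function buildBatchInsert(string $table, array $columns, array $rows): array
{
    if ($rows === []) {
        throw new InvalidArgumentException('No rows to insert');
    }
    $width = count($columns);
    $rowPlaceholder = '(' . implode(', ', array_fill(0, $width, '?')) . ')';

    $params = [];
    foreach ($rows as $row) {
        if (count($row) !== $width) {
            throw new InvalidArgumentException('Row width does not match column count');
        }
        foreach ($row as $value) {
            $params[] = $value;
        }
    }

    $sql = sprintf(
        'INSERT INTO %s (%s) VALUES %s',
        $table,
        implode(', ', $columns),
        implode(', ', array_fill(0, count($rows), $rowPlaceholder))
    );
    return [$sql, $params];
}
```

With a PDO connection, the result could be used as `$pdo->prepare($sql)->execute($params)`. Collecting a few hundred rows per call is a reasonable starting point, but the best batch size depends on the row size and the server settings.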
Deployment. The program may hit exceptions while crawling. To keep it efficient and stable, write a scheduled script that kills and restarts the crawler at intervals, so that even if it stops on an exception, not much valuable time is wasted. After all, time is money.
#!/bin/bash
# kill
ps aux | grep spider | awk '{print $2}' | xargs kill -9
sleep 5s
# re-run
nohup /home/cuixiaohuan/lamp/php5/bin/php /home/cuixiaohuan/php/zhihu_spider/spider_new.php &
Data analysis and presentation
ECharts 3.0 is mainly used for data presentation, and it seems to work well on mobile devices.
The responsive layout for mobile pages is handled in CSS; the code is as follows.
/* Compatibility and responsive design */
@media screen and (max-width: 480px) {
    body { padding: 0; }
    .adapt-p { width: 100%; float: none; margin: 20px 0; }
    .half-p { height: 350px; margin-bottom: 10px; }
    .whole-p { height: 350px; }
}
.half-p { width: 48%; height: Pixel px; margin: 1%; float: left }
.whole-p { width: 98%; height: Pixel px; margin: 1%; float: left }
Shortcomings and things still to learn
The whole process draws on basic knowledge of PHP, shell, JS, CSS, HTML, regular expressions, and deployment. There is still a lot to improve, which I record here:
Use curl_multi in PHP for concurrent requests.
Further optimize the regular expression matching.
Use redis to speed up storage during deployment and crawling.
Improve the compatibility of the mobile layout.
Modularize the JS and write the CSS with Sass.
[When reprinting, please credit: PHP crawler: Zhihu user data crawling and analysis | reliable Cui Xiaoyan]