PHP crawler: crawling and analysis of Zhihu user data

Background: the crawled Zhihu user data is analyzed and presented in a simple way.

The PHP spider code and the user dashboard code will be uploaded to GitHub once they are tidied up, and the repository will be linked from my personal blog and public account. The program is for learning and entertainment only; if it infringes on your rights or interests, please contact me as soon as possible and I will delete the data.

No picture, no truth:

Analysis data on mobile

Analysis data on PC

The entire crawling, analysis, and presentation process is divided into the following steps.

  • Crawl Zhihu page data with cURL

  • Parse the pages with regular expressions

  • Store the data and deploy the program

  • Analyze and present the data

Crawling web page data with cURL

PHP's cURL extension lets you connect to and communicate with servers over a variety of protocols. It is a very convenient tool for fetching web pages, and curl_multi makes concurrent requests possible.

The program crawls the public profile page that Zhihu exposes for each user, https://www.zhihu.com/people/xxx; the request must carry a logged-in user's cookie to retrieve the page. The code follows.

  • Retrieve page cookies

    // Log in to Zhihu, open the personal center, open the browser console, and read the cookie:
    document.cookie
    // "_za=medium; _ga=GA1.2.21428188.1433767929; q_c1=medium|1452172601000|1452172601000; _xsrf=medium; cap_id="medium=|1453444256|medium"; __utmt=1; unlock_ticket="success=|1453444421|fail"; __utma=success; __utmb=51854390.14.8.14534441_11; __utmc=51854390; __utmz=51854390.1452846679.1.dd1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmv=51854390.100-1|2=registration_date=20150823=1^dd3=entry_date=20150823=1"
  • Fetch the personal center page: use cURL to send the cookie and capture the page.

    /**
     * Capture the personal center page for a given user name and store it.
     *
     * @param  string $username  the user's URL slug
     * @return boolean           success flag
     */
    public function spiderUser($username)
    {
        $cookie   = "xxxx";
        $url_info = 'http://www.zhihu.com/people/' . $username; // e.g. "cui-xiao-zhuai"; your own slug is visible in your profile URL
        $ch = curl_init($url_info);                   // initialize the session
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_COOKIE, $cookie);    // send the request cookie
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the response as a string instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        $result = curl_exec($ch);
        file_put_contents('/home/work/zxdata_ch/php/zhihu_spider/file/comment', $result);
        return true;
    }
Regular expression analysis: parse the page, find new links, crawl further

To keep crawling, the captured pages must contain links that lead to more users. Looking at the Zhihu profile page, the personal center lists the people the user follows, the people who follow the user, and some liked content.

As shown below:

    // New users found in the captured HTML, e.g. links of the form
    // <a href="/people/cui-xiao-zhuai">, can be used for further crawling

In this way you can keep crawling along the chains of "people this user follows" and "people who follow this user". The next step is to extract those links with a regular expression.

    // Match all user slugs on the captured page
    preg_match_all('/\/people\/([\w-]+)\"/i', $str, $match_arr);
    // Merge and de-duplicate into the queue of new users to crawl next
    self::$newUserArr = array_unique(array_merge($match_arr[1], self::$newUserArr));

By now, the entire crawler process can proceed smoothly.
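To make the loop concrete, here is a minimal sketch of how fetching and link extraction fit together; fetchUserPage() is a hypothetical variant of spiderUser() that returns the raw HTML instead of writing it to a file, and the crawl queue is kept in a plain array.

    <?php
    // Minimal crawl-loop sketch: fetch a profile page, pull out new /people/ slugs,
    // and queue them for further crawling. fetchUserPage() is assumed to return the
    // raw HTML of the profile page (unlike spiderUser() above, which writes a file).
    $queue   = array('cui-xiao-zhuai'); // seed user
    $visited = array();

    while (!empty($queue)) {
        $username = array_shift($queue);
        if (isset($visited[$username])) {
            continue;
        }
        $visited[$username] = true;

        $html = fetchUserPage($username); // fetch the profile page with cURL
        preg_match_all('/\/people\/([\w-]+)"/i', $html, $match_arr);

        // De-duplicate and enqueue users that have not been crawled yet
        foreach (array_unique($match_arr[1]) as $newUser) {
            if (!isset($visited[$newUser])) {
                $queue[] = $newUser;
            }
        }
        // ... parse and store $html here ...
    }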

If you need to capture a large amount of data, look into curl_multi and pcntl for concurrent, multi-process crawling; a small sketch follows.
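A minimal curl_multi sketch (not the author's code): it fetches a batch of profile pages concurrently, reusing the same cookie-based request setup as spiderUser(); the function name and batch handling are illustrative.

    <?php
    // Sketch: fetch several profile pages concurrently with curl_multi.
    function fetchUsersConcurrently(array $usernames, $cookie)
    {
        $mh      = curl_multi_init();
        $handles = array();

        foreach ($usernames as $name) {
            $ch = curl_init('https://www.zhihu.com/people/' . $name);
            curl_setopt($ch, CURLOPT_COOKIE, $cookie);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
            curl_multi_add_handle($mh, $ch);
            $handles[$name] = $ch;
        }

        // Drive all transfers until every handle has finished
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running) {
                curl_multi_select($mh); // wait for activity instead of busy-looping
            }
        } while ($running && $status == CURLM_OK);

        $pages = array();
        foreach ($handles as $name => $ch) {
            $pages[$name] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);

        return $pages; // username => raw HTML
    }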

Analyzing user data with regular expressions

Further regular expressions can pull more fields out of each user page; the code follows.

    // User avatar: grab the src of the profile image tag
    preg_match('/<img[^>]+src=\"?([^"\s>]+)\"?/i', $str, $match_img);
    $img_url = $match_img[1];

    // User name, e.g. "Cui xiaotuo" (a run of Chinese characters before a closing </span>)
    preg_match('/([\x{4e00}-\x{9fa5}]+).+<\/span>/u', $str, $match_name);
    $user_name = $match_name[1];

    // One-line bio (the Chinese text inside the <span class="bio"> element)
    preg_match('/([\x{4e00}-\x{9fa5}]+).+<\/span>/u', $str, $match_title);
    $user_title = $match_title[1];

    // Gender ("Male"): the gender input carries value="1" for male
    preg_match('/gender[^>]+value=\"?(\d)\"?/i', $str, $match_sex);
    $user_sex = $match_sex[1];

    // Number of followed topics, e.g. "41 topics" (inside the zg-link-litblue <strong>)
    preg_match('/class=\"?zg-link-litblue\"?>(\d+)\s.+<\/strong>/i', $str, $match_topic);
    $user_topic = $match_topic[1];

    // Followees and followers: the first two counts matched on the page
    preg_match_all('/(\d+)<.+/i', $str, $match_care);
    $user_care      = $match_care[1][0];
    $user_be_careed = $match_care[1][1];

    // Profile page views, e.g. "viewed 17 times" (inside the zg-gray-normal <span>)
    preg_match('/class=\"?zg-gray-normal\"?.+>(\d+)<.+<\/span>/i', $str, $match_browse);
    $user_browse = $match_browse[1];
Data storage and program optimization

During crawling, if circumstances allow, buffer the data through Redis before writing it to the database; this genuinely improves both crawling and insertion efficiency. If that is not an option, you can only optimize on the SQL side. A sketch of the Redis approach comes first, followed by a few SQL-side ideas.
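A minimal sketch, assuming the phpredis extension and PDO (the 'zhihu:user_rows' key and the zhihu_user table are made up for illustration): the spider pushes one row per user onto a Redis list, and a separate worker pops the rows off and writes them to MySQL with multi-row inserts.

    <?php
    // Sketch: buffer crawled rows in a Redis list, then flush them to MySQL in batches.
    // Assumes the phpredis extension and PDO; key and table names are illustrative.
    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // Producer side (inside the spider): push one JSON-encoded row per user
    $redis->rPush('zhihu:user_rows', json_encode(array(
        'name'   => $user_name,
        'topics' => $user_topic,
        'fans'   => $user_be_careed,
    )));

    // Consumer side (a separate worker): pop up to 500 rows and insert them in one statement
    $pdo  = new PDO('mysql:host=127.0.0.1;dbname=zhihu', 'user', 'pass');
    $rows = array();
    while (count($rows) < 500 && ($item = $redis->lPop('zhihu:user_rows')) !== false) {
        $rows[] = json_decode($item, true);
    }
    if ($rows) {
        $placeholders = rtrim(str_repeat('(?,?,?),', count($rows)), ',');
        $stmt   = $pdo->prepare("INSERT INTO zhihu_user (name, topics, fans) VALUES $placeholders");
        $params = array();
        foreach ($rows as $r) {
            array_push($params, $r['name'], $r['topics'], $r['fans']);
        }
        $stmt->execute($params);
    }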

  • Be careful when designing indexes for the tables. While the spider is inserting rows it is best not to index any field, not even the user name or the primary key; every insert would otherwise pay the cost of maintaining the index, which adds up quickly. Create the indexes in one batch when it is time to analyze the data.

  • Insert and update in batches. See MySQL's official advice on insert speed: http://dev.mysql.com/doc/refman/5.7/en/insert-speed.html

    INSERT INTO yourtable VALUES (1,2),(5,5),...;
  • Deployment. The program may hit exceptions while crawling, so to keep it running efficiently and stably, write a scheduled script that kills the process and restarts it at intervals; even if a run dies with an exception, not much valuable time is lost. After all, time is money.

    #!/bin/bash
    # kill the running spider
    ps aux | grep spider | awk '{print $2}' | xargs kill -9
    sleep 5s
    # re-run it
    nohup /home/cuixiaohuan/lamp/php5/bin/php /home/cuixiaohuan/php/zhihu_spider/spider_new.php &
Data analysis and presentation

ECharts 3.0 handles the data presentation and works well on mobile. The page layout responds to mobile screens with the following CSS:

    /* Responsive chart containers */
    @media screen and (max-width: 480px) {
        body { padding: 0; }
        .adapt-p { width: 100%; float: none; margin: 20px 0; }
        .half-p  { height: 350px; margin-bottom: 10px; }
        .whole-p { height: 350px; }
    }
    .half-p  { width: 48%; height: xxxpx; margin: 1%; float: left; }
    .whole-p { width: 98%; height: xxxpx; margin: 1%; float: left; }
Shortcomings and things left to learn

The whole project touches on basic PHP, shell, JS, CSS, HTML, regular expressions, and deployment. There is still a lot that can be improved; I note it here:

  • Use curl_multi in PHP for concurrent crawling.

  • Further optimization of regular expression matching

  • Redis is used to enhance storage during deployment and crawling

  • Improved compatibility of mobile terminal layout

  • Modularize the JS and write the CSS with Sass.

[Please credit when reprinting: PHP crawler: Zhihu user data crawling and analysis | Reliable Cui Xiaoyan]
