PHP for Crawling and Analyzing User Data

Source: Internet
Author: User



Background: The author wrote a crawler with PHP's cURL and experimentally collected the basic profile information of 50,000 Zhihu users, then ran a simple analysis and visualization of the crawled data.

The PHP spider code and the user dashboard code have been cleaned up and uploaded to GitHub; the code repositories will also be kept up to date on my personal blog and public WeChat account. The program is for entertainment and learning exchange only; if it infringes on any related interests, please contact me as soon as possible so it can be deleted.

No picture, no truth:

[Figure: analytics dashboard, mobile]

[Figure: analytics dashboard, PC]

The whole crawl/analysis/presentation process breaks down roughly into the following steps, each introduced in turn below:

    1. Crawl the web page data with cURL
    2. Parse the page data with regular expressions
    3. Warehouse the data and deploy the program
    4. Analyze and present the data

Crawling web page data with cURL

PHP's cURL extension is a PHP-supported library that lets you connect to and communicate with many kinds of servers over a variety of protocols. It is a very handy tool for crawling web pages, and its curl_multi interface also supports concurrent fetching.

This program captures the externally accessible profile page https://www.zhihu.com/people/xxx. Fetching the page requires a logged-in user's cookie. Straight to the code.

Get the page cookie

Log in to Zhihu, open your profile page, open the browser console, and read the cookie:

```js
document.cookie
"_ZA=67254197-3WWB8D-43F6-94F0-FB0E2D521C31; _ga=ga1.2.2142818188.1433767929; q_c1=78ee1604225d47d08cddd8142a08288b23|1452172601000|1452172601000; _xsrf=15f0639cbe6fb607560c075269064393; Cap_id= "n2qwmtexngq0yty2ngvddlmgiynmq4njdjotu0ytm5mmq=|1453444256|49fdc6b43dc51f702b7d6575451e228f56cdaf5d"; __utmt=1; Unlock_ticket= "qujdtwpmm0lszdd2dyqufbqvlrslzuvtnvb1zandvoqxjlblvmwgj0wgwyahlddvdscxdzu1vrpt0=|1453444421| c47a2afde1ff334d416bafb1cc267b41014c9d5f "; __utma=51854390.21428dd18188.1433767929.1453187421.1453444257.3; __utmb=51854390.14.8.1453444425011; __utmc=51854390; __utmz=51854390.1452846679.1.dd1.utmcsr=google|utmccn= (organic) |utmcmd=organic|utmctr= (not%20provided); __utmv=51854390.100-1|2=registration_date=20150823=1^dd3=entry_date=20150823=1 "
```

Crawl a profile page

Using cURL and carrying the cookie, first crawl my own profile page:

```php
/**
 * Crawl a user's profile page by username and store it.
 *
 * @param  string $username  username slug from the profile URL
 * @return bool              success flag
 */
public function spiderUser($username)
{
    $cookie = "xxxx"; // the cookie string obtained above
    // e.g. cui-xiao-zhuai stands for my user ID; you can read the ID directly from the URL
    $url_info = 'http://www.zhihu.com/people/' . $username;

    $ch = curl_init($url_info);                   // initialize the session
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);    // set the request cookie
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the curl_exec() result as a string instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    $result = curl_exec($ch);

    file_put_contents('/home/work/zxdata_ch/php/zhihu_spider/file/' . $username . '.html', $result);
    return true;
}
```
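For example, assuming the method above lives in a spider class instantiated as $spider (the class and variable names here are illustrative, not from the original code), seeding the crawl with one known username looks like:

```php
// Seed the crawl with one known username (the slug from the profile URL)
$spider = new ZhihuSpider();           // class name assumed for illustration
$spider->spiderUser('cui-xiao-zhuai'); // fetches and stores my profile page
```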

Parse the page data with regular expressions, find new links, crawl further

The crawled pages are stored, but to crawl any further, each page must contain links that lead to more users. Analyzing the Zhihu profile page shows that it does: the personal center page lists the user's followees, as well as some of the people who liked and followed them, as shown below.

New users appear in the crawled HTML page as profile links of the form /people/username, which the crawler can follow.

OK, this way the spider can keep crawling indefinitely: myself → my followees → my followees' followees → ... The next step is to extract these links with a regular expression.

```php
// Match all user links in the crawled page
preg_match_all('/\/people\/([\w-]+)/i', $str, $match_arr);
// Deduplicate, merge into the new-user array, and crawl those users further
self::$newUserArr = array_unique(array_merge($match_arr[1], self::$newUserArr));
```
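Putting the two pieces together, the crawl is essentially a breadth-first traversal over the user graph. A minimal sketch of the driving loop, assuming the spiderUser() method above plus a hypothetical crawledUserArr for bookkeeping (property names invented for illustration):

```php
// Minimal breadth-first crawl loop (sketch; bookkeeping names are illustrative)
while (!empty(self::$newUserArr)) {
    $username = array_shift(self::$newUserArr);      // take the next unvisited user
    if (isset(self::$crawledUserArr[$username])) {
        continue;                                    // skip users we have already fetched
    }
    $this->spiderUser($username);                    // fetch and store the profile page
    self::$crawledUserArr[$username] = true;

    // Re-read the stored page and harvest fresh user links from it
    $str = file_get_contents('/home/work/zxdata_ch/php/zhihu_spider/file/' . $username . '.html');
    preg_match_all('/\/people\/([\w-]+)/i', $str, $match_arr);
    self::$newUserArr = array_unique(array_merge($match_arr[1], self::$newUserArr));
}
```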

With this in place, the whole crawl proceeds smoothly.
If you need to fetch data in bulk, look into curl_multi and pcntl for fast concurrent crawling; it is not covered in depth here, but a minimal sketch follows.
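As a starting point, here is a minimal sketch of fetching several profile pages concurrently with curl_multi (the seed list, cookie, and storage path are assumptions carried over from above; error handling omitted):

```php
// Fetch several profile pages concurrently with curl_multi (sketch)
$usernames = ['cui-xiao-zhuai', 'another-user']; // assumed seed list
$cookie = "xxxx";                                // the cookie string obtained above
$mh = curl_multi_init();
$handles = [];

foreach ($usernames as $username) {
    $ch = curl_init('http://www.zhihu.com/people/' . $username);
    curl_setopt($ch, CURLOPT_COOKIE, $cookie);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_multi_add_handle($mh, $ch);
    $handles[$username] = $ch;
}

// Drive all transfers until every handle has finished
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh); // wait for activity instead of busy-looping
    }
} while ($running && $status == CURLM_OK);

foreach ($handles as $username => $ch) {
    $html = curl_multi_getcontent($ch); // grab each response body
    file_put_contents('/home/work/zxdata_ch/php/zhihu_spider/file/' . $username . '.html', $html);
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
```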

Extract user data for analysis

More user fields can be pulled out of the stored pages with further regular expressions. Straight to the code.

```php
// NOTE: these patterns target Zhihu's page markup at the time of writing;
// they will need adjusting whenever the markup changes.

// Match the user avatar
preg_match('/<img[^>]+src=["\']?([^"\'\s>]+)/i', $str, $match_img);
$img_url = $match_img[1];

// Match the username, e.g. <span class="name">崔小拽</span>
preg_match('/<span class="name">([\x{4e00}-\x{9fa5}]+)<\/span>/u', $str, $match_name);
$user_name = $match_name[1];

// Match the user bio (Chinese text inside the class="bio" span)
preg_match('/<span class="bio"[^>]*>([\x{4e00}-\x{9fa5}]+)/u', $str, $match_title);
$user_title = $match_title[1];

// Match gender: the radio input with value="1", labelled in Chinese (男 = male)
preg_match('/name="gender"[^>]+value="1"[^>]*>([\x{4e00}-\x{9fa5}]+)/u', $str, $match_sex);
$user_sex = $match_sex[1];

// Match the city
preg_match('/class="location[^"]*"[^>]*title="([\x{4e00}-\x{9fa5}]+)"/u', $str, $match_city);
$user_city = $match_city[1];

// Match the employer, e.g. "the company people see and scold"
preg_match('/class="employment[^"]*"[^>]*title="([^"]+)"/u', $str, $match_employment);
$user_employ = $match_employment[1];

// Match the position, e.g. 程序猿 ("program ape")
preg_match('/class="position[^"]*"[^>]*title="([^"]+)"/u', $str, $match_position);
$user_position = $match_position[1];

// Match the education, e.g. 研究僧 ("research monk")
preg_match('/class="education[^"]*"[^>]*title="([^"]+)"/u', $str, $match_education);
$user_education = $match_education[1];

// Match the number of followed topics, e.g. "41 topics"
preg_match('/class="?zg-link-litblue"?><strong>(\d+)\s.+<\/strong>/i', $str, $match_topic);
$user_topic = $match_topic[1];

// Match the followee/follower counts (the first two <strong> numbers)
preg_match_all('/<strong>(\d+)<\/strong>/i', $str, $match_care);
$user_care = $match_care[1][0];      // how many people this user follows
$user_be_careed = $match_care[1][1]; // how many people follow this user

// Match the profile view count, e.g. "personal homepage viewed by 17 people"
preg_match('/class="?zg-gray-normal"?.+?>(\d+)<.+?span>/i', $str, $match_browse);
$user_browse = $match_browse[1];
```
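The extracted fields can then be bundled into a single record ready for warehousing; a minimal sketch (the array shape is assumed for illustration, not prescribed by the original code):

```php
// Bundle the extracted fields into one record ready for insertion (sketch)
$user_row = [
    'username'   => $username,
    'img_url'    => $img_url,
    'name'       => $user_name,
    'bio'        => $user_title,
    'sex'        => $user_sex,
    'city'       => $user_city,
    'employment' => $user_employ,
    'position'   => $user_position,
    'education'  => $user_education,
    'topics'     => $user_topic,
    'care'       => $user_care,
    'be_cared'   => $user_be_careed,
    'views'      => $user_browse,
];
```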

Data warehousing and program deployment

During the crawl, if conditions allow, stage the data through Redis: it improves capture and storage efficiency. Without Redis, you can only optimize on the SQL side. Here are a few lessons learned.

Be cautious when designing table indexes. During the spider's crawl, it is recommended not to index fields such as the username, not even the primary key, so as to maximize insert throughput: imagine 50 million rows and how much index maintenance every single insert would cost. After the crawl completes, build the indexes in a batch pass when you need to analyze the data.
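For example, deferred index creation might look like this (database, table, and column names are assumed for illustration):

```php
// Build indexes only after the crawl has finished (sketch; schema names assumed)
$db = new PDO('mysql:host=127.0.0.1;dbname=zhihu_spider;charset=utf8', 'user', 'pass');
$db->exec('ALTER TABLE zhihu_user ADD INDEX idx_username (username)');
```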

Warehouse inserts and updates must be batched. MySQL's official documentation gives advice on insert speed: http://dev.mysql.com/doc/refman/5.7/en/insert-speed.html

```sql
# Officially recommended bulk insert
INSERT INTO yourtable VALUES (5,5), ...;
```
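In PHP, such a batched multi-row insert might be sketched as follows (the table and column layout are assumed to match the record built earlier; this is not the original author's code):

```php
// Insert crawled user records in one multi-row statement per batch (sketch)
function batchInsert(PDO $db, array $rows)
{
    if (empty($rows)) {
        return;
    }
    $cols = array_keys($rows[0]);
    // One "(?,?,...,?)" placeholder group per row
    $placeholder = '(' . rtrim(str_repeat('?,', count($cols)), ',') . ')';
    $sql = sprintf(
        'INSERT INTO zhihu_user (%s) VALUES %s',
        implode(',', $cols),
        rtrim(str_repeat($placeholder . ',', count($rows)), ',')
    );
    $params = [];
    foreach ($rows as $row) {
        $params = array_merge($params, array_values($row));
    }
    $db->prepare($sql)->execute($params); // one round trip for the whole batch
}
```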

Deployment and operations. During the crawl the program may die unexpectedly; to keep it running efficiently and stably, write a timed restart script where possible. Every so often, kill the process and rerun it, so that even an abnormal exit does not waste too much valuable time; after all, time is money.

```bash
#!/bin/bash

# kill the running spider
ps aux | grep spider | awk '{print $2}' | xargs kill -9
sleep 5s

# rerun it
nohup /home/cuixiaohuan/lamp/php5/bin/php /home/cuixiaohuan/php/zhihu_spider/spider_new.php &
```
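To run this on a schedule, the script can be registered in crontab; for example (the path and interval here are illustrative, not from the original article), an entry like `*/10 * * * * /bin/bash /home/cuixiaohuan/php/zhihu_spider/restart_spider.sh` would restart the spider every ten minutes.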

Data analysis and presentation

The data is rendered mainly with ECharts 3.0, whose mobile compatibility feels good. The responsive layout of the mobile-compatible pages is controlled by a few simple CSS rules.
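As for feeding the charts, one approach (a sketch only; the table and field names follow the assumed schema above and are not from the original article) is to aggregate in PHP and hand ECharts the result as JSON:

```php
// Aggregate the user distribution by city and emit it as JSON for ECharts (sketch)
$db = new PDO('mysql:host=127.0.0.1;dbname=zhihu_spider;charset=utf8', 'user', 'pass');
$stmt = $db->query('SELECT city, COUNT(*) AS cnt FROM zhihu_user GROUP BY city ORDER BY cnt DESC LIMIT 20');
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

header('Content-Type: application/json; charset=utf-8');
echo json_encode([
    'categories' => array_column($rows, 'city'), // x-axis labels
    'values'     => array_column($rows, 'cnt'),  // bar heights
]);
```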


Shortcomings and things still to learn

The whole process touches PHP, shell, JS, CSS, HTML, regular expressions, and basic deployment knowledge. There is still plenty to improve; noting it here, with examples to follow up later:

    1. Use curl_multi in PHP for concurrent fetching
    2. Further optimize the regular-expression matching
    3. Use Redis in deployment and crawling to speed up storage
    4. Improve mobile layout compatibility
    5. Modularize the JS and write the CSS with Sass

