PHP Crawler: Zhihu user data crawling and analysis

Source: Internet
Author: User
Background: the author (Xiao Zhuai) wrote a crawler with PHP curl and, as an experiment, crawled the basic profile information of 50,000 Zhihu users, then did a simple analysis and visualization of the crawled data. Demo Address

The PHP spider code and the user dashboard code will be uploaded to GitHub once they are cleaned up, and the repository will be announced on my personal blog and public account. The program is for fun and learning exchange only; if it infringes on any related interests, please contact me as soon as possible and I will delete it.

No pictures, no truth, so screenshots first.

Analysis dashboard on mobile

Analysis dashboard on PC

The whole crawl/analysis/presentation process breaks down roughly into the following steps, each covered in its own section below:

    • Crawl web page data with curl

    • Parse the page data with regexes

    • Data warehousing and program deployment

    • Data analysis and presentation

Crawling web page data with curl

PHP's curl extension lets you connect to and communicate with many kinds of servers over a variety of protocols. It is a very handy tool for crawling web pages, and with the curl_multi functions it also supports concurrent requests.

This program crawls the externally accessible profile page https://www.zhihu.com/people/xxx, and fetching the page requires a logged-in user's cookie. Straight to the code:

  • Get page cookie

    // Log in, open the profile page, open the browser console, and read the cookie:
    document.cookie
    "_za=67254197-3WWB8D-43F6-94F0-FB0E2D521C31; _ga=ga1.2.2142818188.1433767929; q_c1=78ee1604225d47d08cddd8142a08288b23|1452172601000|1452172601000; _xsrf=15f0639cbe6fb607560c075269064393; cap_id="n2qwmtexngq0yty2ngvddlmgiynmq4njdjotu0ytm5mmq=|1453444256|49fdc6b43dc51f702b7d6575451e228f56cdaf5d"; __utmt=1; unlock_ticket="qujdtwpmm0lszdd2dyqufbqvlrslzuvtnvb1zandvoqxjlblvmwgj0wgwyahlddvdscxdzu1vrpt0=|1453444421|c47a2afde1ff334d416bafb1cc267b41014c9d5f"; __utma=51854390.21428dd18188.1433767929.1453187421.1453444257.3; __utmb=51854390.14.8.1453444425011; __utmc=51854390; __utmz=51854390.1452846679.1.dd1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmv=51854390.100-1|2=registration_date=20150823=1^dd3=entry_date=20150823=1"
  • Crawl the profile page through curl, carrying the cookie; start with my own profile page

    /**
     * Crawl a user's profile page by username and store it
     *
     * @param  string  $username  username token
     * @return boolean            success flag
     */
    public function spiderUser($username)
    {
        $cookie   = "xxxx";
        $url_info = 'http://www.zhihu.com/people/' . $username;  // cui-xiao-zhuai is my user id here; you can read your own id straight from the URL
        $ch = curl_init($url_info);                              // initialize the session
        curl_setopt($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLOPT_COOKIE, $cookie);               // set the request cookie
        curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);             // make curl_exec() return the response as a string instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        $result = curl_exec($ch);
        file_put_contents('/home/work/zxdata_ch/php/zhihu_spider/file/' . $username . '.html', $result);
        return true;
    }
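For a quick smoke test before wiring up the full loop, the method can be called directly with a single user token. The ZhihuSpider class name below is an assumption, not from the original code:

    // quick check that the cookie works: crawl my own profile once
    $spider = new ZhihuSpider();              // assumed wrapper class holding spiderUser()
    $spider->spiderUser('cui-xiao-zhuai');    // the author's user token, taken from the profile URL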

Regex analysis of web page data

Parse new links out of the page for further crawling

The crawled pages need to be stored, and for the crawl to go further, each page must contain links that lead to more users. Analyzing the Zhihu profile page shows that it does: it lists the people this user follows, plus some of their likers and followers.

As shown below

New users found in the crawled HTML page, available for further crawling

OK, so the crawl can keep expanding: me -> the people I follow -> the people they follow -> ... The next step is to extract those usernames with a regex match.

    // match all users on the crawled page
    preg_match_all('/\/people\/([\w-]+)\"/i', $str, $match_arr);
    // de-duplicate and merge into the array of new users waiting to be crawled
    self::$newUserArr = array_unique(array_merge($match_arr[1], self::$newUserArr));

In this way, the whole crawl process can run continuously.
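Here is a minimal sketch of that loop; extractNewUsers() is a hypothetical helper that runs the preg_match_all above against the stored HTML, and the pause length is just a politeness guess:

    // minimal crawl-loop sketch: keep pulling users from self::$newUserArr until it is empty
    public function run()
    {
        self::$newUserArr = array('cui-xiao-zhuai');   // seed with my own user token
        $crawled = array();
        while (!empty(self::$newUserArr)) {
            $username = array_pop(self::$newUserArr);
            if (isset($crawled[$username])) {
                continue;                              // already fetched, skip
            }
            $this->spiderUser($username);              // fetch and store the raw HTML
            $this->extractNewUsers($username);         // hypothetical: merge freshly found names into self::$newUserArr
            $crawled[$username] = true;
            usleep(200000);                            // roughly 0.2s between requests
        }
    }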

If you need to fetch a large amount of data, look into curl_multi and pcntl for concurrent, multi-process crawling; that is not covered in detail here.
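For reference, a minimal curl_multi sketch is below; it assumes the $cookie and storage path from spiderUser() above and fetches a small batch of $usernames in parallel:

    // fetch a batch of profile pages concurrently with curl_multi
    $mh      = curl_multi_init();
    $handles = array();
    foreach ($usernames as $name) {
        $ch = curl_init('http://www.zhihu.com/people/' . $name);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_COOKIE, $cookie);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
        curl_multi_add_handle($mh, $ch);
        $handles[$name] = $ch;
    }
    $running = null;
    do {
        curl_multi_exec($mh, $running);   // drive all transfers
        curl_multi_select($mh);           // wait for activity instead of busy-looping
    } while ($running > 0);
    foreach ($handles as $name => $ch) {
        $html = curl_multi_getcontent($ch);
        file_put_contents('/home/work/zxdata_ch/php/zhihu_spider/file/' . $name . '.html', $html);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);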

Parsing out user data for analysis

More regexes can match out the rest of the user data; straight to the code.

    // (the HTML fragments inside several patterns below were lost when this post was scraped; … marks the missing part)
    // match the user avatar
    preg_match('/…/i', $str, $match_img);
    $img_url = $match_img[1];
    // match the username (e.g. 崔小拽): class name span, Chinese characters
    preg_match('/…([\x{4e00}-\x{9fa5}]+).+span>/u', $str, $match_name);
    $user_name = $match_name[1];
    // match the user bio: class bio span, Chinese characters
    preg_match('/…([\x{4e00}-\x{9fa5}]+).+span>/u', $str, $match_title);
    $user_title = $match_title[1];
    // match the gender (e.g. male, gender value 1); pattern lost in the source
    // match the city
    preg_match('/…/u', $str, $match_city);
    $user_city = $match_city[1];
    // match the employment (e.g. "the company people watch and scold")
    preg_match('/…/u', $str, $match_employment);
    $user_employ = $match_employment[1];
    // match the position (e.g. "code monkey")
    preg_match('/…/u', $str, $match_position);
    $user_position = $match_position[1];
    // match the education (e.g. "grad student")
    preg_match('/…/u', $str, $match_education);
    $user_education = $match_education[1];
    // match the topics followed (e.g. "41 topics")
    preg_match('/class=\"?zg-link-litblue\"?>(\d+)\s.+strong>/i', $str, $match_topic);
    $user_topic = $match_topic[1];
    // match the following / followers counts
    preg_match_all('/(\d+)<.+/i', $str, $match_care);
    $user_care = $match_care[1][0];
    $user_be_careed = $match_care[1][1];
    // match the historical profile views ("profile viewed by N people")
    preg_match('/class=\"?zg-gray-normal\"?.+>(\d+)<.+span>/i', $str, $match_browse);
    $user_browse = $match_browse[1];
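One reasonable way to connect this to the warehousing step below (not from the original code) is to bundle the matched fields into a row and buffer it for a batch insert; the variable names follow the preg_match calls above, and the column names and the self::$userRows buffer are assumptions:

    // bundle the matched fields into one row and buffer it for batch insertion
    $user_row = array(
        'username'   => $username,
        'nickname'   => isset($user_name) ? $user_name : '',
        'bio'        => isset($user_title) ? $user_title : '',
        'avatar'     => isset($img_url) ? $img_url : '',
        'city'       => isset($user_city) ? $user_city : '',
        'employment' => isset($user_employ) ? $user_employ : '',
        'position'   => isset($user_position) ? $user_position : '',
        'education'  => isset($user_education) ? $user_education : '',
        'topics'     => isset($user_topic) ? (int) $user_topic : 0,
        'following'  => isset($user_care) ? (int) $user_care : 0,
        'followers'  => isset($user_be_careed) ? (int) $user_be_careed : 0,
        'views'      => isset($user_browse) ? (int) $user_browse : 0,
    );
    self::$userRows[] = $user_row;   // hypothetical buffer, flushed in bulk in the warehousing step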

Data Warehousing and Program optimization

During the crawl, if you have the means, go through Redis: it clearly improves crawl and storage throughput. If not, you can only optimize at the SQL layer. A few hard-earned tips follow the Redis sketch below.
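A rough illustration of the Redis approach, assuming the phpredis extension and a local Redis instance; the key names are made up for the example, a set de-duplicates users and a list serves as the crawl queue:

    $redis = new Redis();
    $redis->connect('127.0.0.1', 6379);

    // producer side: enqueue a newly discovered user only once
    if ($redis->sAdd('zhihu:seen', $username)) {
        $redis->rPush('zhihu:queue', $username);
    }

    // consumer side: block up to 5 seconds waiting for the next user to crawl
    $next = $redis->blPop(array('zhihu:queue'), 5);
    if ($next) {
        $this->spiderUser($next[1]);   // $next[0] is the key name, $next[1] the username
    }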

    • Be careful with indexes when designing the tables. While the spider is crawling, it is best to leave the username and the other fields unindexed, including the primary key if you can, to squeeze out the highest insert throughput; imagine 50 million rows and the index maintenance every single insert would pay for. Once the crawl is finished and you need to analyze the data, build the indexes in one batch.

    • Batch your warehousing and update operations. The official MySQL documentation gives advice on insert speed: http://dev.mysql.com/doc/refman/5.7/en/insert-speed.html (a PHP batching sketch follows this list).

      # Officially recommended: bulk insert with multiple VALUES lists
      INSERT INTO yourtable VALUES (5,5), ...;
    • Deployment and operations. The program may hang or die unexpectedly during a long crawl, so to keep things efficient and stable, write a scheduled script if you can: kill the process and rerun it at a fixed interval. That way even if it does hang, not too much valuable time is wasted; time, after all, is money.

      #!/bin/bash
      # kill the running spider
      ps aux | grep spider | awk '{print $2}' | xargs kill -9
      sleep 5s
      # rerun it
      nohup /home/cuixiaohuan/lamp/php5/bin/php /home/cuixiaohuan/php/zhihu_spider/spider_new.php &
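Coming back to the batch-insert tip above, here is a sketch of what batching can look like from PHP. It is not the original code; the table name, column names and self::$userRows buffer are assumptions matching the earlier $user_row sketch:

    // flush up to 1000 buffered rows in a single multi-row INSERT via PDO
    $pdo   = new PDO('mysql:host=127.0.0.1;dbname=zhihu_spider;charset=utf8', 'user', 'pass');
    $batch = array_splice(self::$userRows, 0, 1000);
    if ($batch) {
        $placeholders = array();
        $values       = array();
        foreach ($batch as $row) {
            $placeholders[] = '(' . rtrim(str_repeat('?,', count($row)), ',') . ')';
            foreach ($row as $v) {
                $values[] = $v;
            }
        }
        $sql = 'INSERT INTO zhihu_user (username, nickname, bio, avatar, city, employment, '
             . 'position, education, topics, following, followers, views) VALUES '
             . implode(',', $placeholders);
        $pdo->prepare($sql)->execute($values);
    }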

Data analysis and rendering

The data is rendered mainly with ECharts 3.0, which feels quite good on mobile. The responsive layout for mobile is controlled by a few simple CSS media-query rules; the code is below.

    /* compatibility and responsive div design */
    @media screen and (max-width: 480px) {
        body {
            padding: 0;
        }
        .adapt-div {
            width: 100%;
            float: none;
            margin: 20px 0;
        }
        .half-div {
            height: 350px;
            margin-bottom: 10px;
        }
        .whole-div {
            height: 350px;
        }
    }

    .half-div {
        width: 48%;
        height: 430px;
        margin: 1%;
        float: left;
    }
    .whole-div {
        width: 98%;
        height: 430px;
        margin: 1%;
        float: left;
    }

Shortcomings and things still to learn

The whole process touches PHP, shell, JS, CSS, HTML, regexes and basic deployment knowledge, but there is still a lot to improve. I note the gaps here and will follow up with examples:

    • Use curl_multi in PHP for concurrent fetching.

    • Further optimize the regex matching.

    • Use Redis during crawling and deployment to speed up storage.

    • Improve mobile layout compatibility.

    • Modularize the JS and write the CSS with Sass.

"Reprint please specify: PHP Crawler: User data crawling and analysis | "A very small and reliable"
