The Web Crawler Tour of XX House


[Guide] My company is in the peer-to-peer (P2P) lending industry. Analyzing industry data plays a large role in the platform's operational decision-making, so I needed to crawl XX House's data.

1. Analysis

Right-clicking to view the page source showed that the page is laid out as a table, so I figured data collection could be divided into four steps: (1) use a crawler to fetch the page; (2) parse the page data; (3) store it; (4) write a scheduled service that crawls daily. Since the company's website is written in PHP, which I had recently learned a little of, and I had heard that curl is well suited to fetching web pages, I decided to write the crawler in PHP.

2. Crawl Page

There was a small episode. When I first started crawling, the page returned was 404.html; analysis showed that the site blocks non-browser requests and redirects them straight to a 404 page. After adding the User-Agent line in the code below, the data was fetched successfully.

function crawl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    // Pretend to be a browser; the site returns 404 for non-browser requests
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}

3. Parsing data

Looking at the page source, the first row is the title row and is skipped, and the last two columns (the platform link and the "follow" column) are filtered out. The ID for the first column has to be cut out of the link in the neighboring cell, and the columns in between carry units such as 万 (ten thousand) and some special characters, which are stripped with preg_replace. Finally the XX platform's data is assembled and spliced into an INSERT statement for loading into the database.

function analyze($dom, $statTime) {
    // simple_html_dom comes from the PHP Simple HTML DOM Parser library
    $html = new simple_html_dom();
    $sql = "INSERT INTO xxx_data (xxPlatId, platName, averageMonth, dayTotal, averageRate, investNumber, averageInvestMoney, averageBorrowTime, borrowNumber, borrowBidNumber, averageBorrowMoney, firstTenInvestRate, firstTenBorrowRate, bidEndTime, registerTime, registerMoney, leverageFund, invest30Total, repay60Total, repayTotal, statisticsTime, executeTime) VALUES ";
    $html->load($dom);
    $isTitle = 0;
    foreach ($html->find('tr') as $tr) {
        $isTitle = $isTitle + 1;
        if ($isTitle == 1) {  // skip the title row
            continue;
        }
        $sql .= "(";
        $count = 0;
        foreach ($tr->find('td') as $element) {
            $count = $count + 1;
            if ($count == 1) {
                // extract the platform ID from the link in the neighboring cell
                $href = $element->next_sibling()->find('a', 0)->href;
                $href = strstr($href, '.', true);
                $href = strstr($href, '-');
                $sql .= "'" . substr($href, 1) . "',";
            } elseif ($count == 2) {
                $val = $element->find('a', 0)->innertext;
                $sql .= "'" . $val . "',";
            } elseif ($count < 21) {
                // strip multi-byte unit characters and symbols such as % and *
                $patterns = array();
                $patterns[0] = '/([\x80-\xff]*)/i';
                $patterns[1] = '/[%*]/';
                $val = preg_replace($patterns, '', $element->innertext);
                $sql .= "'" . $val . "',";
            }
        }
        $sql .= "'" . $statTime . "','" . date('Y-m-d H:i:s') . "'),";
    }
    $sql = substr($sql, 0, strlen($sql) - 1);  // drop the trailing comma
    $sql = strip_tags($sql);
    return $sql;
}
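The character cleanup buried inside analyze() can be pulled out and checked on its own. A minimal sketch, where clean() is a hypothetical helper name and the patterns mirror the byte-range and %/* filters used above:

```php
<?php
// Hypothetical helper isolating the cleanup step from analyze():
// strips multi-byte characters (e.g. the 万 unit) and the % / * symbols.
function clean($raw) {
    $patterns = array();
    $patterns[0] = '/[\x80-\xff]+/';  // bytes of multi-byte (Chinese) characters
    $patterns[1] = '/[%*]/';          // percent signs and asterisks
    return preg_replace($patterns, '', $raw);
}
```

The byte-range pattern works byte-wise, so it removes multi-byte unit characters regardless of whether the page is GBK or UTF-8; with guaranteed UTF-8 input and the /u modifier, a \p{Han} character class would be the more precise choice.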

4. Storage

After a little searching and study online, I found that operating MySQL from PHP is much simpler than from Java; a few lines of code take care of it.

function save($sql) {
    $con = mysql_connect("192.168.0.1", "root", "");
    if (!$con) {
        die(mysql_error());
    }
    mysql_select_db("xx_data", $con);
    mysql_query("set names utf8");
    mysql_query($sql);
    mysql_close($con);
}
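One caveat with splicing SQL directly from page content: the values go into the statement unescaped. A minimal sketch of quoting a value first, where quoteValue() is a hypothetical helper and addslashes() stands in for mysql_real_escape_string(), which needs an open connection:

```php
<?php
// Hypothetical helper: quote one scraped value before splicing it into SQL.
// addslashes() is a rough stand-in for mysql_real_escape_string(),
// which requires a live MySQL connection to call.
function quoteValue($raw) {
    return "'" . addslashes($raw) . "'";
}
```

Applied inside analyze(), this would guard against a platform name containing a quote character breaking the spliced INSERT.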

5. Batch Crawl

Analyzing the query conditions showed that each query selects one day's transaction data via a date suffix on the URL, e.g. http://XXX/indexs.html?startTime=2015-04-01&endTime=2015-04-01. So all historical transactions can be crawled simply by iterating over the historical dates and splicing each date into the URL.
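The URL construction can be checked in isolation before wiring it into the crawl loop. A minimal sketch, where buildUrls() is a hypothetical helper and shuju.XX.com is the placeholder host from this article:

```php
<?php
// Hypothetical helper: build one query URL per day in [startDate, endDate].
date_default_timezone_set('UTC'); // avoid DST surprises with the 86400-second step
function buildUrls($startDate, $endDate) {
    $urls = array();
    for ($t = strtotime($startDate); $t <= strtotime($endDate); $t += 86400) {
        $d = date('Y-m-d', $t);
        $urls[] = 'http://shuju.XX.com/indexs.html?startTime=' . $d . '&endTime=' . $d;
    }
    return $urls;
}
```

Note that stepping by 86400 seconds only maps one-to-one onto calendar days in a zone without DST transitions, hence the UTC setting.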

function execute() {
    $startTime = "2014-04-15";
    $endTime = "2015-04-15";
    for ($start = strtotime($startTime); $start <= strtotime($endTime); $start += 86400) {
        $date = date('Y-m-d', $start);
        $url = "http://shuju.XX.com/indexs.html?startTime=" . $date . "&endTime=" . $date;
        // step 1: crawl
        $dom = crawl($url);
        // step 2: parse
        $sql = analyze($dom, $date);
        // step 3: store
        save($sql);
    }
    echo "execute end";
}
execute();

6. Set up the timer service

Set up a scheduled task to crawl the latest data at a fixed time each day, so the script does not have to be run manually. PHP has its own ways of scheduling tasks, but the implementations I saw online were too complex, so I used Linux crontab instead. Run crontab -e to enter edit mode and add an entry that invokes the script via curl; with that, the crawler is complete.

    0 0 * * * curl http://192.168.0.1/crawl.php

This program is for learning and exchange only. If anyone needs the full source, feel free to contact me.
