[Guide] My company is in the peer-to-peer (P2P) lending industry, where analysis of industry data plays a large role in the platform's operational decision-making, so we needed to crawl data from XX House.
1. Analysis
By right-clicking to view the page source, I found that the page is laid out as a table, so I figured the collection could be divided into four steps: 1. use a crawler to fetch the page; 2. parse the page data; 3. store it; 4. write a scheduled service to crawl it daily. Since the company's website also uses PHP, which I had recently picked up, and I had heard that cURL is well suited to fetching web pages, I decided to write the crawler in PHP.
2. Crawl Page
There was a small episode here: when I first started crawling, the page returned was 404.html. Analysis showed that the site blocks non-browser requests and redirects them straight to a 404 page. Adding the user-agent line in the code below made the fetch succeed.
function crawl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    // Spoof a browser user agent, since the site 404s non-browser requests
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
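As a quick sanity check, the function can be pointed at a single day's page (using the URL pattern that appears in section 5 below):

    // fetch one day's data page; URL pattern taken from the batch-crawl section below
    $dom = crawl('http://shuju.XX.com/indexs.html?startTime=2015-04-01&endTime=2015-04-01');
    echo strlen($dom) . " bytes fetched\n"; // non-404 HTML means the user-agent spoof worked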
3. Parse Data
Looking at the page source, the first row is the table header, so it is skipped; the last two columns, the platform link and the follow count, are filtered out. The ID in the first column has to be cut out of the link in the adjacent cell. The data in the middle columns carries the 万 (ten-thousand) unit and some special characters, which are stripped with preg_replace. Finally the XX platform's data is assembled and spliced into an SQL INSERT statement, which is returned for storage.
function analyze($dom, $satTime) {
    $html = new simple_html_dom();
    $sql = "INSERT INTO xxx_data (xxPlatId, platName, averageMonth, dayTotal, averageRate, "
         . "investNumber, averageInvestMoney, averageBorrowTime, borrowNumer, borrowBidNumber, "
         . "averageBorrowMoney, firstTenInvestRate, firstTenBorrowRate, bidEndTime, registerTime, "
         . "registerMoney, leverageFund, invest30Total, repay60Total, repayTotal, statisticsTime, "
         . "excuteTime) VALUES ";
    $html->load($dom);
    $istitle = 0;
    foreach ($html->find('tr') as $tr) {
        $istitle = $istitle + 1;
        if ($istitle == 1) {
            continue; // skip the header row
        }
        $sql .= "(";
        $count = 0;
        foreach ($tr->find('td') as $element) {
            $count = $count + 1;
            if ($count == 1) {
                // cut the platform ID out of the link in the next cell
                $href = $element->next_sibling()->find('a', 0)->href;
                $href = strstr($href, '.', true);
                $href = strstr($href, '-');
                $sql .= "'" . substr($href, 1) . "',";
            } elseif ($count == 2) {
                // the platform name is the anchor text
                $val = $element->find('a', 0)->innertext;
                $sql .= "'" . $val . "',";
            } elseif ($count < 21) {
                // strip multi-byte characters (the 万 unit) and % / * signs
                $patterns = array();
                $patterns[0] = '/([\x80-\xff]*)/i';
                $patterns[1] = '/[%*]/';
                $val = preg_replace($patterns, '', $element->innertext);
                $sql .= "'" . $val . "',";
            }
        }
        $sql .= "'" . $satTime . "','" . date('Y-m-d H:i:s') . "'),";
    }
    $sql = substr($sql, 0, strlen($sql) - 1); // drop the trailing comma
    $sql = strip_tags($sql);
    return $sql;
}
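To make the first-column logic concrete, here is a hedged walkthrough of the ID extraction, assuming an href of the form "pingtai-123.html" (the real link pattern is not shown in this post):

    $href = 'pingtai-123.html';       // hypothetical href
    $href = strstr($href, '.', true); // "pingtai-123"  (keep everything before the dot)
    $href = strstr($href, '-');       // "-123"         (keep from the dash onward)
    echo substr($href, 1);            // "123"          (the platform ID)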
4. Storage
After some searching and reading online, I found that operating MySQL from PHP is much simpler than from Java; a few lines of code do the job.
function save($sql) {
    $con = mysql_connect("192.168.0.1", "root", ""); // password omitted in the original
    if (!$con) {
        die('Could not connect: ' . mysql_error());
    }
    mysql_select_db("xx_data", $con);
    mysql_query("SET NAMES utf8");
    mysql_query($sql);
    mysql_close($con);
}
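One caveat: the mysql_* extension used above was deprecated in PHP 5.5 and removed in PHP 7. A minimal sketch of the same function using mysqli, assuming the same host, user, and database names as above, would be:

    function saveMysqli($sql) {
        // same connection details as above; the fourth argument selects the database
        $con = mysqli_connect("192.168.0.1", "root", "", "xx_data");
        if (!$con) {
            die('Could not connect: ' . mysqli_connect_error());
        }
        mysqli_set_charset($con, "utf8"); // replaces the "SET NAMES utf8" query
        mysqli_query($con, $sql);
        mysqli_close($con);
    }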
5. Batch Crawl
Analyzing the query criteria shows that each query fetches one day's transaction data via date parameters in the URL suffix, e.g. http://XXX/indexs.html?startTime=2015-04-01&endTime=2015-04-01, so all historical transactions can be crawled simply by traversing the historical dates and splicing each one into the URL.
function execute() {
    $starttime = "2014-04-15";
    $endtime = "2015-04-15";
    for ($start = strtotime($starttime); $start <= strtotime($endtime); $start += 86400) {
        $date = date('Y-m-d', $start);
        $url = "http://shuju.XX.com/indexs.html?startTime=" . $date . "&endTime=" . $date;
        // step 1: crawl the page
        $dom = crawl($url);
        // step 2: parse it into an INSERT statement
        $sql = analyze($dom, $date);
        // step 3: store it
        save($sql);
    }
    echo "execute end";
}
execute();
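A small design note: stepping the loop by 86400 seconds works here, but in timezones with daylight-saving transitions it can land on the same date twice or skip one. Stepping by calendar day sidesteps that:

    $start = strtotime('+1 day', $start); // advance by one calendar day instead of 86400 seconds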
6. Set Up the Scheduled Task
Set a scheduled task to crawl the latest data at a fixed time each day, so the script doesn't have to be run manually every time. PHP has its own ways to schedule tasks, but the implementations I found online were too complex, so I used Linux crontab instead: run crontab -e to enter edit mode and add a timed curl call, for example to run the crawler daily at 01:00. With that, the crawler is complete.
0 1 * * * curl http://192.168.0.1/crawl.php
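For reference, the five leading fields of a crontab entry are minute, hour, day of month, month, and day of week, so 30 2 * * * would fire at 02:30 every day instead. The URL assumes that crawl.php, presumably the script containing the functions above, is served by a web server on 192.168.0.1.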
This program is for learning and exchange only; friends who need the full source code can contact me directly.
The Crawler Journey of XX House