[Guide] My company is in the peer-to-peer (P2P) lending industry, where analysis of industry data plays a large role in the platform's operational decision-making, so we needed to crawl data from XX House.
1. Analysis
By right-clicking to view the page source, I found that the page is laid out as a table, so I figured the collection could be divided into four steps: 1. use a crawler to fetch the page; 2. parse the page data; 3. store it; 4. write a scheduled service to crawl it daily. Since the company's website also uses PHP, which I had recently picked up, and I had heard that cURL is well suited to fetching web pages, I decided to write the crawler in PHP.
2. Crawl Page
There was a small episode here: when I first started crawling, the page returned was 404.html. Analysis showed that the site blocks non-browser requests and redirects them straight to a 404 page. Adding the user-agent line in the code below made the fetch succeed.
function crawl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, false);
    // Spoof a browser user agent, since the site 404s non-browser requests
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)');
    curl_setopt($ch, CURLOPT_POST, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $result = curl_exec($ch);
    curl_close($ch);
    return $result;
}
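As a quick sanity check, the function can be pointed at a single day's page (using the URL pattern that appears in section 5 below):

    // fetch one day's data page; URL pattern taken from the batch-crawl section below
    $dom = crawl('http://shuju.XX.com/indexs.html?startTime=2015-04-01&endTime=2015-04-01');
    echo strlen($dom) . " bytes fetched\n"; // non-404 HTML means the user-agent spoof worked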
3. Parse Data
Looking at the page source, the first row is the table header, so it is skipped; the last two columns, the platform link and the follow count, are filtered out. The ID in the first column has to be cut out of the link in the adjacent cell. The data in the middle columns carries the 万 (ten-thousand) unit and some special characters, which are stripped with preg_replace. Finally the XX platform's data is assembled and spliced into an SQL INSERT statement, which is returned for storage.
function analyze($dom, $satTime) {
    $html = new simple_html_dom();
    $sql = "INSERT INTO xxx_data (xxPlatId, platName, averageMonth, dayTotal, averageRate, "
         . "investNumber, averageInvestMoney, averageBorrowTime, borrowNumer, borrowBidNumber, "
         . "averageBorrowMoney, firstTenInvestRate, firstTenBorrowRate, bidEndTime, registerTime, "
         . "registerMoney, leverageFund, invest30Total, repay60Total, repayTotal, statisticsTime, "
         . "excuteTime) VALUES ";
    $html->load($dom);
    $istitle = 0;
    foreach ($html->find('tr') as $tr) {
        $istitle = $istitle + 1;
        if ($istitle == 1) {
            continue; // skip the header row
        }
        $sql .= "(";
        $count = 0;
        foreach ($tr->find('td') as $element) {
            $count = $count + 1;
            if ($count == 1) {
                // cut the platform ID out of the link in the next cell
                $href = $element->next_sibling()->find('a', 0)->href;
                $href = strstr($href, '.', true);
                $href = strstr($href, '-');
                $sql .= "'" . substr($href, 1) . "',";
            } elseif ($count == 2) {
                // the platform name is the anchor text
                $val = $element->find('a', 0)->innertext;
                $sql .= "'" . $val . "',";
            } elseif ($count < 21) {
                // strip multi-byte characters (the 万 unit) and % / * signs
                $patterns = array();
                $patterns[0] = '/([\x80-\xff]*)/i';
                $patterns[1] = '/[%*]/';
                $val = preg_replace($patterns, '', $element->innertext);
                $sql .= "'" . $val . "',";
            }
        }
        $sql .= "'" . $satTime . "','" . date('Y-m-d H:i:s') . "'),";
    }
    $sql = substr($sql, 0, strlen($sql) - 1); // drop the trailing comma
    $sql = strip_tags($sql);
    return $sql;
}
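To make the first-column logic concrete, here is a hedged walkthrough of the ID extraction, assuming an href of the form "pingtai-123.html" (the real link pattern is not shown in this post):

    $href = 'pingtai-123.html';       // hypothetical href
    $href = strstr($href, '.', true); // "pingtai-123"  (keep everything before the dot)
    $href = strstr($href, '-');       // "-123"         (keep from the dash onward)
    echo substr($href, 1);            // "123"          (the platform ID)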
4. Storage
After some searching and reading online, I found that operating MySQL from PHP is much simpler than from Java; a few lines of code do the job.
function save($sql) {
    $con = mysql_connect("192.168.0.1", "root", ""); // password omitted in the original
    if (!$con) {
        die('Could not connect: ' . mysql_error());
    }
    mysql_select_db("xx_data", $con);
    mysql_query("SET NAMES utf8");
    mysql_query($sql);
    mysql_close($con);
}
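One caveat: the mysql_* extension used above was deprecated in PHP 5.5 and removed in PHP 7. A minimal sketch of the same function using mysqli, assuming the same host, user, and database names as above, would be:

    function saveMysqli($sql) {
        // same connection details as above; the fourth argument selects the database
        $con = mysqli_connect("192.168.0.1", "root", "", "xx_data");
        if (!$con) {
            die('Could not connect: ' . mysqli_connect_error());
        }
        mysqli_set_charset($con, "utf8"); // replaces the "SET NAMES utf8" query
        mysqli_query($con, $sql);
        mysqli_close($con);
    }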
5. Batch Crawl
Analyzing the query criteria shows that each query fetches one day's transaction data via date parameters in the URL suffix, e.g. http://XXX/indexs.html?startTime=2015-04-01&endTime=2015-04-01, so all historical transactions can be crawled simply by traversing the historical dates and splicing each one into the URL.
function execute() {
    $starttime = "2014-04-15";
    $endtime = "2015-04-15";
    for ($start = strtotime($starttime); $start <= strtotime($endtime); $start += 86400) {
        $date = date('Y-m-d', $start);
        $url = "http://shuju.XX.com/indexs.html?startTime=" . $date . "&endTime=" . $date;
        // step 1: crawl the page
        $dom = crawl($url);
        // step 2: parse it into an INSERT statement
        $sql = analyze($dom, $date);
        // step 3: store it
        save($sql);
    }
    echo "execute end";
}
execute();
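A small design note: stepping the loop by 86400 seconds works here, but in timezones with daylight-saving transitions it can land on the same date twice or skip one. Stepping by calendar day sidesteps that:

    $start = strtotime('+1 day', $start); // advance by one calendar day instead of 86400 seconds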
6. Set Up the Scheduled Task
Set a scheduled task to crawl the latest data at a fixed time each day, so the script doesn't have to be run manually every time. PHP has its own ways to schedule tasks, but the implementations I found online were too complex, so I used Linux crontab instead: run crontab -e to enter edit mode and add a timed curl call, for example to run the crawler daily at 01:00. With that, the crawler is complete.
0 1 * * * curl http://192.168.0.1/crawl.php
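For reference, the five leading fields of a crontab entry are minute, hour, day of month, month, and day of week, so 30 2 * * * would fire at 02:30 every day instead. The URL assumes that crawl.php, presumably the script containing the functions above, is served by a web server on 192.168.0.1.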
This program is for learning and exchange only; friends who need the full source code can contact me directly.
The Crawler Journey of XX House