This crawler uses simple_html_dom.php (download | documentation).
Because this crawl only covers a single web page, it is relatively simple. Crawling an entire site is the next thing to study, and Python may be a better fit for that kind of crawler.
<meta http-equiv="Content-type" content="text/html;charset=utf-8"/>
<?php
include_once 'simplehtmldom/simple_html_dom.php';
// Load the HTML of the page into an object
$html = file_get_html('http://paopaotv.com/tv-type-id-5-pg-1.html');
// Each entry of the A-Z alphabetical list is inside a dl tag in the div with id=letter-focus and class=letter-focus-item; use the find method to locate it

$arr = [];
foreach ($html->find('.txt-list li a') as $element)
    $arr[] = $element->innertext . '<br>';

$fileName = 'data.txt'; // no need to create the file beforehand
$arrLen = count($arr);
for ($i = 0; $i < $arrLen; $i++) {
    file_put_contents($fileName, $arr[$i], FILE_APPEND|LOCK_EX);
    /* FILE_APPEND|LOCK_EX appends the data; without this flag each write overwrites the file, so only one item is kept.
       However, if the crawl is rerun, the previously crawled data stays in the file and the new data is appended after it. */
}
// The code above captures the data and stores it in data.txt.
$content = file_get_contents($fileName);
$cont = explode("<br>", $content);
$contLen = count($cont);
// Step by 2 so that only the odd indexes (1, 3, 5, ...) are removed
for ($i = 0; $i < $contLen; $i += 2) {
    unset($cont[$i + 1]);
}
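The comment in the script points out that rerunning the crawl appends to whatever is already in the file. One possible way around this, sketched here with made-up titles instead of crawled data, is to write the whole list in a single call without FILE_APPEND, so that every run replaces the file. The sample array and the temporary file are assumptions for illustration only.

```php
<?php
// Hypothetical sample data standing in for the crawled titles.
$arr = ['Title A<br>', 'Title B<br>', 'Title C<br>'];

// Temporary file used only for this illustration.
$fileName = tempnam(sys_get_temp_dir(), 'crawl');

// Simulate running the crawler twice.
for ($run = 0; $run < 2; $run++) {
    // Without FILE_APPEND the file is overwritten on each run;
    // LOCK_EX still takes an exclusive lock while writing.
    file_put_contents($fileName, implode('', $arr), LOCK_EX);
}

$content = file_get_contents($fileName);
unlink($fileName); // clean up the temporary file
echo $content, PHP_EOL;
```

After two runs the file still holds each title only once, whereas the append-based loop would have stored every title twice.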
Locate the node in http://www.paopaotv.com/tv-type-id-5-pg-1.html first.
foreach ($html->find('.txt-list li a') as $element)
    $arr[] = $element->innertext . '<br>';
Obtaining data within a node
The data obtained:
As you can see, each item is followed by <br>***<br>. This is because each li in .txt-list contains two a elements, so two pieces of data are captured for every entry.
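The duplicate could also be avoided at selection time by taking only the first a of each li. The sketch below shows the idea with PHP's built-in DOMDocument and DOMXPath rather than simple_html_dom, on a made-up HTML fragment; the markup, link texts, and variable names are assumptions, not the real page.

```php
<?php
// Made-up fragment: every li holds two links, mimicking the situation described above.
$sample = '<ul class="txt-list">'
        . '<li><a href="/a">Show A</a><a href="/a/more">more</a></li>'
        . '<li><a href="/b">Show B</a><a href="/b/more">more</a></li>'
        . '</ul>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// a[1] selects only the first a child of each li.
$titles = [];
foreach ($xpath->query('//*[@class="txt-list"]//li/a[1]') as $a) {
    $titles[] = $a->textContent;
}
print_r($titles);
```

With simple_html_dom itself, the same idea would be to loop over the li elements and call find('a', 0) on each one, which returns only the first match.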
$content = file_get_contents($fileName);
$cont = explode("<br>", $content);
$contLen = count($cont);
// Step by 2 so that only the odd indexes (1, 3, 5, ...) are removed
for ($i = 0; $i < $contLen; $i += 2) {
    unset($cont[$i + 1]);
}
The data in data.txt is read back and split at every <br> with explode("<br>", $content). Printing $cont with the print_r() function gives:
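To make the shape of that array concrete, here is a small sketch using a made-up string in place of the real file content. The titles and the "more" entries are assumptions for illustration.

```php
<?php
// Hypothetical file content: each wanted title is followed by an unwanted entry.
$content = 'Show A<br>more<br>Show B<br>more<br>';

// Split at every <br>; the pieces land at consecutive indexes starting from 0.
$cont = explode('<br>', $content);
print_r($cont);
```

The wanted titles end up at the even indexes and the unwanted entries at the odd ones. Because the string ends with <br>, explode also produces an empty string as the last element.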
As you can see, all the unwanted entries sit at odd indexes, so unset($cont[$i+1]) is used to delete them, and the result is:
However, I do not know how to renumber the keys of the resulting array. I tried array_splice, but it cannot be set up to delete only the odd-indexed elements.
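One possible answer to the key question is the built-in array_values function, which returns the values of an array renumbered from 0. The sketch below combines it with array_filter and the ARRAY_FILTER_USE_KEY flag to keep only the even indexes; the sample array is made up for illustration.

```php
<?php
// Hypothetical array shaped like the result of explode() in the article.
$cont = ['Show A', 'more', 'Show B', 'more', 'Show C', 'more'];

// Keep only the elements whose key is even (0, 2, 4, ...).
$kept = array_filter($cont, function ($key) {
    return $key % 2 === 0;
}, ARRAY_FILTER_USE_KEY);

// array_filter preserves the original keys; array_values renumbers them from 0.
$titles = array_values($kept);
print_r($titles);
```

The same array_values call also works after the unset loop, since unset leaves gaps in the keys in the same way.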
PHP crawler: crawling web page content (simple_html_dom.php)