PHP Crawler: Crawling Web Content (simple_html_dom.php)

Source: Internet
Author: User
Tags: explode

This example uses the simple_html_dom.php library (download | documentation).

Because this crawls only a single web page, it is relatively simple. Crawling an entire site is the next thing to study, and Python may be a better choice for that kind of crawler.

<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<?php
include_once 'simplehtmldom/simple_html_dom.php';
// Load the HTML page into an object
$html = file_get_html('http://paopaotv.com/tv-type-id-5-pg-1.html');
// In the A-Z alphabetical list, each item sits in a dl tag with class="letter-focus-item"
// inside the div with id="letter-focus"; nodes are located with the find method

$arr = [];
foreach ($html->find('.txt-list li a') as $element) {
    $arr[] = $element->innertext . '<br>';
}

$fileName = 'data.txt'; // No need to create the file beforehand
$arrLen = count($arr);
for ($i = 0; $i < $arrLen; $i++) {
    file_put_contents($fileName, $arr[$i], FILE_APPEND | LOCK_EX);
    /* FILE_APPEND | LOCK_EX appends to the file; without the flag each call
       overwrites it, so only one item would remain.
       However, if the crawl is restarted, the previously crawled data is still in the file */
}
// The code above captures the data and stores it in data.txt.
$content = file_get_contents($fileName);
$cont = explode("<br>", $content);
$contLen = count($cont);
// Step by two so that only the odd indexes ($i + 1) are removed
for ($i = 0; $i < $contLen; $i += 2) {
    unset($cont[$i + 1]);
}
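The loop above calls file_put_contents once per item and appends, which is why rerunning the crawl leaves the earlier data in the file. One possible alternative, shown here only as a minimal sketch, is to join the items and write them in a single call without FILE_APPEND, so each run replaces the file. The sample items and the temporary file path are hypothetical stand-ins, not part of the original script.

```php
<?php
// Hypothetical sample of what the crawl loop might have collected.
$arr = ['Title A<br>', '***<br>', 'Title B<br>', '***<br>'];

// Assumed output location for this sketch; the article writes to data.txt.
$fileName = sys_get_temp_dir() . '/crawl_sketch_data.txt';

// One write, no FILE_APPEND: the file is replaced on every run,
// so restarting the crawl does not pile up old data.
file_put_contents($fileName, implode('', $arr), LOCK_EX);

// Writing again leaves the same content rather than a doubled copy.
file_put_contents($fileName, implode('', $arr), LOCK_EX);

$content = file_get_contents($fileName);
echo $content, PHP_EOL;
```

Whether overwriting is appropriate depends on whether earlier crawl results should be kept; appending remains the right choice when they should.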

First, locate the target node in http://www.paopaotv.com/tv-type-id-5-pg-1.html.

foreach ($html->find('.txt-list li a') as $element)
    $arr[] = $element->innertext . '<br>';

This gets the data inside each matched node.

The data obtained:

As you can see, each captured item is followed by <br>***<br>. This is because each li in .txt-list contains two a elements, so two values are captured per item.
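If only the first link in each li is wanted, the duplicate could also be avoided at selection time. The sketch below does not use simple_html_dom; it swaps in PHP's built-in DOMDocument and DOMXPath so it can run without the library or network access. The markup is a hypothetical sample modelled on the description above (two a elements per li), not the real page.

```php
<?php
// Hypothetical markup: each li under .txt-list holds two a elements.
$sample = '<ul class="txt-list">'
        . '<li><a href="/a">Title A</a><a href="/a2">***</a></li>'
        . '<li><a href="/b">Title B</a><a href="/b2">***</a></li>'
        . '</ul>';

$doc = new DOMDocument();
$doc->loadHTML($sample);
$xpath = new DOMXPath($doc);

// a[1] keeps only the first a child of each li inside the txt-list element.
$query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' txt-list ')]//li/a[1]";

$titles = [];
foreach ($xpath->query($query) as $node) {
    $titles[] = trim($node->textContent);
}

print_r($titles);
```

With this approach the second a element is never collected, so no cleanup pass over the stored data is needed for this sample.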

$content = file_get_contents($fileName);
$cont = explode("<br>", $content);
$contLen = count($cont);
// Step by two so that only the odd indexes ($i + 1) are removed
for ($i = 0; $i < $contLen; $i += 2) {
    unset($cont[$i + 1]);
}

Read the data back from data.txt and use explode("<br>", $content) to split it at every <br>. Printing $cont with the print_r() function gives:
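To make the split concrete, here is a small sketch of explode on a string shaped like the stored data. The titles are hypothetical placeholders. Because the string ends with the delimiter, explode also returns a trailing empty string as the last element.

```php
<?php
// Hypothetical file content in the shape described above.
$content = 'Title A<br>***<br>Title B<br>***<br>';

// Split at every <br>; wanted items land on even indexes,
// the unwanted *** entries on odd indexes.
$cont = explode('<br>', $content);

print_r($cont);
```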

As you can see, all the unwanted entries sit at odd indexes, so unset($cont[$i+1]) is called for each even $i to delete them. The result is:

However, I do not know how to re-index the keys of the resulting array. I tried array_splice, but it cannot be told to delete only the odd-numbered entries.
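One way to handle the re-indexing question, offered as a sketch rather than the only answer, is PHP's array_values, which returns the values with fresh consecutive keys starting at 0. The sample array is hypothetical and mirrors the shape produced by explode, including a trailing empty string that is filtered out here as an extra assumption.

```php
<?php
// Hypothetical result of explode("<br>", $content).
$cont = ['Title A', '***', 'Title B', '***', ''];

// Remove the odd indexes, stepping by two.
$contLen = count($cont);
for ($i = 0; $i < $contLen; $i += 2) {
    unset($cont[$i + 1]); // unset on a missing key is harmless
}
// Keys are now 0, 2, 4.

// Drop the empty trailing entry, then rebuild the keys as 0, 1, ...
$nonEmpty = array_filter($cont, function ($value) {
    return $value !== '';
});
$reindexed = array_values($nonEmpty);

print_r($reindexed);
```

array_filter preserves the original keys, which is why array_values is applied last.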
