I used to write crawlers with Node.js and Cheerio, but the asynchronous callbacks in Node.js became too unpleasant to put up with.
Later I found the HtmlPageDom library for PHP, which offers jQuery-like selector syntax and supports Chinese text.
Installation: composer require wa72/htmlpagedom
1. Read a simple web page, for example:
require 'vendor/autoload.php';
use \Wa72\HtmlPageDom\HtmlPageCrawler;
$url = "http://news.cnblogs.com/";
$dom = HtmlPageCrawler::create(file_get_contents($url));
print $dom; // output the content
2. To parse the page, use jQuery selector syntax; any jQuery selector reference applies.
For example, to extract all the links on the first page of Blog Park (cnblogs) news, the code is structured as follows:
$news_list = $dom->filter("#news_list");
$news_entry = $news_list->filter(".news_entry");
$urls = [];
$i = 0;
$url_cnt = $news_entry->count();
// print $url_cnt; // 30; searching the page in the browser for "posted in" also finds 30 matches, which confirms the count
while ($i < $url_cnt) {
    $urls[] = $news_entry->eq($i)->filter('a')->eq(0)->attr("href");
    ++$i;
}
You may wonder why this does not use foreach.
The reason is that iterating over $news_entry->children() yields DOMElement objects rather than HtmlPageCrawler objects, so you cannot call filter() on them; you would have to wrap each one again with HtmlPageCrawler::create().
3. Extract the news text
$content = HtmlPageCrawler::create(file_get_contents($url . $urls[0]));
print $content->filter("#news_body")->text();
4. Notes
Some websites are not encoded in UTF-8, so you have to transcode them with iconv.
You can wrap this in a function. $base is the root URL, which is needed because in many cases the links are relative.
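function httpGet($url, $base = null) {
    // Prepend the root URL when one is given (relative links).
    if ($base) {
        $url = $base . $url;
    }
    $html = file_get_contents($url);
    // Check UTF-8 first, in strict mode, so that UTF-8 pages are not reported as GBK.
    $encode = mb_detect_encoding($html, "UTF-8, GBK", true);
    if ($encode !== false && stripos($encode, "UTF") !== false) {
        return HtmlPageCrawler::create($html);
    } else {
        // Anything else is treated as GBK and converted to UTF-8.
        $utf_html = iconv("GBK", "UTF-8", $html);
        return HtmlPageCrawler::create($utf_html);
    }
}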
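Joining a relative link to the root URL by plain concatenation can produce a doubled or missing slash. Below is a minimal sketch of one way to handle that, using only built-in PHP functions. The joinUrl() helper and the "/n/1/" path are hypothetical and only for illustration; the sketch handles absolute links and simple relative paths, not "../" segments or query-only links.

```php
<?php
// Hypothetical helper: join a root URL and a link taken from an href attribute.
function joinUrl(string $base, string $link): string
{
    // Links that already carry a scheme (http://, https://, ...) are returned unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $link)) {
        return $link;
    }
    // Otherwise make sure exactly one slash separates the base from the path.
    return rtrim($base, '/') . '/' . ltrim($link, '/');
}

echo joinUrl("http://news.cnblogs.com/", "/n/1/"), "\n"; // http://news.cnblogs.com/n/1/
```

The result can then be passed to file_get_contents() in place of a hand-built string.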
If you get the markup with the html() function, the output is HTML-entity encoded; you can decode it with html_entity_decode().
You can also use strip_tags() to strip tags from the HTML, optionally keeping the ones you allow.
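As a small illustration of these two built-in functions, here is a sketch on hard-coded sample strings (the strings are made up for the example, not taken from a real page):

```php
<?php
// Entity-encoded markup, such as what an html() call may return.
$encoded = '&lt;p&gt;Hello &amp; &quot;world&quot;&lt;/p&gt;';

// Decode the entities back to literal characters.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $decoded, "\n"; // <p>Hello & "world"</p>

// Strip every tag except <b>, which is passed as an allowed tag.
$plain = strip_tags('<p>Hello <b>world</b></p>', '<b>');
echo $plain, "\n"; // Hello <b>world</b>
```

Note that strip_tags() removes all tags by default; the second argument lists the tags to keep rather than the tags to remove.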
An id is unique, but classes and tags are not, so when you select by class or tag you have to use eq(0) to get the element, even if there is only one match.
jQuery has a has() function for checking whether a tag is present. HtmlPageCrawler lacks it, so I added one manually.
Below the hasClass function in HtmlPageCrawler.php, add the following code:
public function has($name) {
    foreach ($this->children() as $node) {
        if ($node instanceof \DOMElement) {
            $tagName = $node->tagName;
            // Case-insensitive substring match on the tag name.
            if (stripos($tagName, $name) !== false) {
                return true;
            }
        }
    }
    return false;
}
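Because stripos() is a substring test, has("a") would also return true for a span child. If that matters, an exact comparison may be preferable. The following sketch shows the idea on PHP's built-in DOM classes so that it runs without HtmlPageDom; hasChildTag() is a hypothetical standalone helper, not a method of the library.

```php
<?php
// Hypothetical helper: true if $element has a direct child element whose
// tag name equals $name, ignoring case.
function hasChildTag(DOMElement $element, string $name): bool
{
    foreach ($element->childNodes as $node) {
        if ($node instanceof DOMElement && strcasecmp($node->tagName, $name) === 0) {
            return true;
        }
    }
    return false;
}

$doc = new DOMDocument();
$doc->loadHTML('<div><a href="#">link</a><span>text</span></div>');
$div = $doc->getElementsByTagName('div')->item(0);

var_dump(hasChildTag($div, 'a'));         // bool(true)
var_dump(hasChildTag($div, 'table'));     // bool(false)
var_dump(stripos('span', 'a') !== false); // bool(true): the substring test matches "span"
```

The same strcasecmp() comparison could replace the stripos() call inside has() if exact tag matching is wanted.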
PHP Crawler Practice