Using PHP string functions to extract information from a web page is less efficient than using XPath.

Source: Internet
Author: User
Tags: save file, xpath

Crawling web pages with PHP to collect information is, in essence, scraping data. The concrete steps are as follows. First, include two files, curl_html_get.php and save_file.php. The content of curl_html_get.php is:

<?php
function curl_get_file_contents($url) {
    $c = curl_init();
    curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($c, CURLOPT_URL, $url);
    $contents = curl_exec($c);
    curl_close($c);
    if ($contents) {
        return $contents;
    }
    return false;
}
?>

The content of save_file.php is:

<?php
/**
 * Create a directory recursively.
 *
 * @param string $dir  directory path
 * @param int    $mode permission bits
 * @return boolean
 */
function make_dir($dir, $mode = 0777) {
    if (!$dir) return false;
    if (!file_exists($dir)) {
        return mkdir($dir, $mode, true);
    }
    return true;
}

/**
 * Save a file, creating its directory first if necessary.
 *
 * @param string $filename file name (with relative path)
 * @param string $text     file contents
 * @return boolean
 */
function save_file($filename, $text) {
    if (!$filename || !$text) return false;
    $dirname = dirname($filename);
    if (make_dir($dirname)) {
        // file_put_contents($filename, $text, FILE_APPEND); // append variant
        return file_put_contents($filename, $text) !== false;
    }
    return false;
}

?>

The first file fetches a web page's content; the second creates files and directories. The main PHP code is essentially the following:

echo "==================start=======================<br/>";
// 1. Fetch the page (and cache it locally).
// THIS_PATH and DS are constants defined elsewhere in the project.
$path = THIS_PATH . "download";
$url = "http://10.maigoo.com/list_1187.html";
$pathinfo = pathinfo($url);
$html_pathname = $path . DS;
$html_filename = $html_pathname . "list_1187.htm";
if (!file_exists($html_filename)) {
    $text = curl_get_file_contents($url);
    save_file($html_filename, $text);
} else {
    $text = file_get_contents($html_filename);
}
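The cache-or-fetch pattern in step 1 can be sketched in isolation. The snippet below is a minimal, self-contained illustration: a temp file stands in for the download directory, and a hard-coded string stands in for the curl and save helpers, so no network access is needed.

```php
<?php
// Minimal sketch of the cache-or-fetch pattern from step 1.
// A temp file stands in for the download directory, and a literal
// string stands in for curl_get_file_contents($url).
$cache = sys_get_temp_dir() . "/list_1187_demo.htm";

if (!file_exists($cache)) {
    $text = "<html>fetched</html>";    // pretend this came over HTTP
    file_put_contents($cache, $text);  // stands in for save_file()
} else {
    $text = file_get_contents($cache); // later runs read the local copy
}
```

On the first run the "fetch" branch executes; on every later run the page is served from the local file, which is why the original script only hits the site once.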
// 2. Extract the target region
$start = '<div class="b-brand-nlist hoverdetail">';  // start marker
$end = '<div id="copyright">';                       // end marker
$pos_start = strpos($text, $start);
$pos_end = strpos($text, $end, $pos_start);
$pos_end += strlen($end);
$content = substr($text, $pos_start, $pos_end - $pos_start);
save_file($html_pathname . "list_1187.html", $content);

// 3. Match every first-level category block
$pattern = '@<div class="aclist">.*<div class="clear"></div>@Usi';
if (!preg_match_all($pattern, $content, $matches)) {
    die("===============not match anything===================");
}
echo "=========================================<br/>";
$index = 0;
foreach ($matches[0] as $pinpai_cate) {
    save_file($html_pathname . $index . ".html", $pinpai_cate);
    // Get each first-level category's URL and name
    get_level1_url_and_name($pinpai_cate, $cate1_url, $cate1_name);

    echo "==================brand=======================<br/>";
    $pattern = '@<li addbg="#400143".*</li>@Usi';
    if (preg_match_all($pattern, $content, $matches)) {
        foreach ($matches[0] as $one_brand) {
            // process each brand here
        }
    }
    $index++;
}
echo "==================end=======================<br/>";

The basic idea is to download the page to a local file, cut out the region of interest, and finally match it with regular expressions. I did not take the time to tune this code, so it is too long and repeats itself in many places. If the region to be cut out cannot be pinned down with a regular expression, or if it contains many repeated fragments, you have to cut first and then strip out the noise, which is cumbersome; it also requires writing quite a few extra functions. All of this code could be improved further.
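As the title notes, a DOM/XPath query can replace most of the strpos() and regex work above. The sketch below is a minimal, self-contained illustration using PHP's built-in DOMDocument and DOMXPath classes; the toy HTML and the "aclist" class name are taken from the patterns above, while everything else is assumed for illustration.

```php
<?php
// Minimal sketch: extracting the same kind of category links with
// DOMXPath instead of strpos()/preg_match_all(). The toy HTML below
// imitates the <div class="aclist"> blocks matched by the regex above.
$html = '<div class="aclist"><a href="/brand/1">Brand A</a></div>'
      . '<div class="aclist"><a href="/brand/2">Brand B</a></div>';

$doc = new DOMDocument();
@$doc->loadHTML($html);  // '@' silences the warnings sloppy HTML triggers
$xpath = new DOMXPath($doc);

$names = [];
foreach ($xpath->query('//div[@class="aclist"]/a') as $a) {
    $names[] = $a->textContent;  // the link text, e.g. a brand name
}
```

One declarative query replaces the marker search, the substring arithmetic, and the regular expression, and it keeps working when whitespace or attribute order in the page changes.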
