Principle Analysis: Building a Simple Content Collector with PHP

Source: Internet
Author: User
A few days ago I was building a novel serialization site, and to avoid the hassle of updating it by hand I also wrote a collector that pulls content from the 86zw.com novel site. Its features are fairly simple and it does not support custom rules, but the general approach is all here, and you can extend it with custom rules yourself.

A PHP collector relies mainly on two functions: file_get_contents() and preg_match_all(). The former reads the content of a remote web page (it is available from PHP 4.3.0 onward and needs allow_url_fopen enabled to open URLs); the latter is a regular-expression function used to extract the required content.
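As a minimal sketch of how preg_match_all() reports its results, the snippet below runs a pattern with one capture group against a hard-coded HTML string. The markup and the class name "item" are made up for illustration; in the collector the string would come from file_get_contents().

```php
<?php
// Hypothetical markup standing in for a downloaded page.
$html = '<span class="item">First</span><span class="item">Second</span>';

// One capture group: (.*?) lazily matches the text between the tags.
$found = preg_match_all('/<span class="item">(.*?)<\/span>/is', $html, $matches);

// $matches[0] holds the full matches, tags included.
// $matches[1] holds what the first capture group matched.
echo $found, "\n";
echo $matches[0][0], "\n";
echo $matches[1][0], "\n";
echo $matches[1][1], "\n";
```

The distinction between $matches[0] and $matches[1] matters later on: the text inside the tags is always in the capture group, not in the full match.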

The implementation is described step by step below.

Because the target is a novel site, the first things to extract are the title, author, and category; other information can be extracted as needed.

The target here is the novel "Back to the Ming Dynasty as a Prince". First open its book page: http://www.86zw.com/Book/3727/Index.aspx

After opening a few more books you will find that book page URLs follow the pattern http://www.86zw.com/Book/BOOK_NUMBER/Index.aspx. So we can build a start page with a field <input type=text name=number> for entering the number of the book to collect, and receive that number through $_POST['number']. With the book number in hand, the next step is to construct the book page URL: $url = "http://www.86zw.com/Book/" . $_POST['number'] . "/Index.aspx"; This is only an example to keep the explanation simple; in real code it is best to validate $_POST['number'] first.
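One way to do that validation, sketched under the assumption that book numbers are always plain digits (as 3727 is), is to reject anything else before building the URL. The helper name buildBookUrl() is hypothetical and not part of the original collector.

```php
<?php
// Hypothetical helper: returns the book page URL, or null when the
// submitted book number is not a non-empty run of ASCII digits.
function buildBookUrl($number)
{
    if (!is_string($number) || preg_match('/\A[0-9]+\z/', $number) !== 1) {
        return null;
    }
    return 'http://www.86zw.com/Book/' . $number . '/Index.aspx';
}

// Read the form field; fall back to an empty string when it is missing.
$number = isset($_POST['number']) ? $_POST['number'] : '';
$url = buildBookUrl($number);
if ($url === null) {
    echo "Invalid book number\n";
}
```

Restricting the input to digits keeps path fragments such as "../" or a different host out of the URL that is later passed to file_get_contents().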

After constructing the URL, you can start collecting the book information. Use file_get_contents() to open the book page: $contents = file_get_contents($url); this reads the content of the book page. The next step is to match the title, author, and category. The title is used as the example here; the others work the same way. Open the book page, view its source, and find <span class="booktitle">Back to the Ming Dynasty as a Prince</span>; this is the title to extract. The regular expression that extracts it is /<span class=\"booktitle\">(.*?)<\/span>/is, applied with preg_match_all(): preg_match_all("/<span class=\"booktitle\">(.*?)<\/span>/is", $contents, $title); After this call $title[0][0] is the whole match including the tags, and $title[1][0], the captured group, is the title we want (the usage of preg_match_all() is easy to look up and is not explained in detail here).

With the book information extracted, the next step is the chapter content. First find the address of each chapter, then open each chapter remotely, extract its content with a regular expression, and either store it in a database or generate a static HTML file directly. The chapter list address is http://www.86zw.com/Html/Book/18/3727/List.shtm. Like the book page, it follows a pattern: http://www.86zw.com/Html/Book/CATEGORY_NUMBER/BOOK_NUMBER/List.shtm. The book number is already known, so the key is the category number, which can be found in the book page fetched earlier. Extract it like this:

preg_match_all("/Html\/Book\/[0-9]{1,}\/[0-9]{1,}\/List\.shtm/is", $contents, $typeid);

This only yields the matched path, so a cutting function is also needed. The PHP code is as follows:

function cut($string, $start, $end) {
    // Keep what follows the first $start, then what precedes the next $end.
    $message = explode($start, $string);
    $message = explode($end, $message[1]);
    return $message[0];
}

Here $string is the content to cut, $start is where the cut begins, and $end is where it ends. Extract the category number:

// explode() is case-sensitive, so these must match the casing used in the page.
$start = "Html/Book/";
$end = "List.shtm";
$typeid = cut($typeid[0][0], $start, $end);
$typeid = explode("/", $typeid);
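As an alternative to cutting the matched string apart, the same numbers can be read straight from capture groups. This is a sketch of a different technique from the cut() approach, not the author's method; the sample markup and the variable names $categoryId and $bookId are assumptions.

```php
<?php
// Sample fragment standing in for the downloaded book page (assumed markup).
$contents = '<a href="/Html/Book/18/3727/List.shtm">Chapter list</a>';

// Group 1 is the category number, group 2 is the book number.
$pattern = '/Html\/Book\/([0-9]+)\/([0-9]+)\/List\.shtm/i';

$categoryId = null;
$bookId = null;
if (preg_match($pattern, $contents, $m) === 1) {
    $categoryId = $m[1];
    $bookId = $m[2];
    echo $categoryId, ' ', $bookId, "\n";
} else {
    echo "Chapter list link not found\n";
}
```

Because the pattern carries the /i flag, this version does not depend on the casing of the path, which the explode()-based cut does.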

After the cut() and explode() calls, $typeid[0] holds the category number we are looking for. The next step is to construct the chapter list address: $chapterurl = "http://www.86zw.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/List.shtm"; With this we can find the address of each chapter. The method is as follows:

// Read the chapter list page
$file = file_get_contents($chapterurl);
$ustart = "\"";
$uend = "\"";
// t stands for title
$tstart = ">";
$tend = "<";
// Get the paths, for example: 123.shtm, 2342.shtm, 233.shtm
preg_match_all("/\"([0-9]{1,}\.shtm)\"/is", $file, $url);
// Get the titles, for example: Chapter Nine The Righteous
preg_match_all("/<a href=\"[0-9]{1,}\.shtm\">(.*?)<\/a>/is", $file, $title);
$array = array();
$count = count($url[0]);
for ($i = 0; $i < $count; $i++)
{
    $u = cut($url[0][$i], $ustart, $uend);
    $t = cut($title[0][$i], $tstart, $tend);
    $array[$u] = $t;
}
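The same address-to-title map can also be built with a single pattern that captures both parts of each link, which keeps paths and titles paired even if a quoted .shtm name appears elsewhere on the page. The snippet is a sketch on assumed sample markup; the name $chapters is hypothetical.

```php
<?php
// Sample fragment standing in for the chapter list page (assumed markup).
$file = '<a href="123.shtm">Chapter One</a><a href="2342.shtm">Chapter Two</a>';

// Group 1 is the chapter file name, group 2 is the link text.
preg_match_all('/<a href="([0-9]+\.shtm)">(.*?)<\/a>/is', $file, $m);

// Pair each address with its title; no matches yields an empty array.
$chapters = array();
foreach ($m[1] as $i => $path) {
    $chapters[$path] = trim(strip_tags($m[2][$i]));
}

print_r($chapters);
```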

$array now holds all the chapter addresses, keyed by address with the chapter title as the value. At this point the collector is half done; what remains is to loop over the chapter addresses, read each page, and match its content. That part is relatively simple and is not described in detail here. That is all for today. This is the first time I have written such a long article, so the wording is bound to have problems; please bear with me!
