How the Dede collector works: building a simple content collector in PHP

Source: Internet
Author: User
Tags: ming
A few days ago I wrote a novel serialization program and, to avoid the chore of updating it by hand, also wrote a collector that scrapes the 86zw.com novel site. It is fairly simple and does not support custom rules, but the core idea is all there, and custom rules can be added as an extension.
A PHP collector relies mainly on two functions: file_get_contents() and preg_match_all(). The first reads the content of a web page (it is available from PHP 4.3.0 onward, and reading URLs with it requires allow_url_fopen to be enabled); the second is a regular expression function used to extract the required content.
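As a minimal sketch of how preg_match_all() returns its results, the snippet below runs it on a hard-coded string instead of a fetched page. The sample markup and variable names are assumptions made for illustration; in the collector the string would come from file_get_contents($url).

```php
<?php
// Hypothetical page fragment; a real one would come from file_get_contents($url).
$html = '<ul><li>Alpha</li><li>Beta</li></ul>';

// With the default PREG_PATTERN_ORDER flag, $m[0] holds the full matches
// and $m[1] holds the text captured by the first parenthesised group.
$found = preg_match_all('/<li>(.*?)<\/li>/is', $html, $m);

echo $found, "\n"; // number of matches
print_r($m[0]);    // full matches, tags included
print_r($m[1]);    // captured text only
```

The distinction between $m[0] and $m[1] matters later, because the whole match usually still contains markup that has to be stripped.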
The features are implemented step by step below.
Because the target is novels, the first things to extract are the title, the author and the category; other information can be extracted as needed.
Taking "return to the Ming Dynasty as the Rajah" as the target, first open its book information page (the bibliography page): http://www.86zw.com/Book/3727/Index.aspx
Opening a few more books shows that the book information page follows a fixed format: http://www.86zw.com/Book/<book number>/Index.aspx. So we can build a start page with an input field named number for entering the book number to collect, and later receive it as $_POST['number']. Once we have the book number, the next step is to construct the address of the book information page: $url = "http://www.86zw.com/Book/" . $_POST['number'] . "/Index.aspx"; This is only an example to keep the explanation simple; in real code you should validate $_POST['number'] before using it.
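One possible way to do that validation is sketched below. The function name buildBookUrl() and the ten-digit length limit are assumptions for illustration, not part of the original program; the idea is simply to accept nothing but a short string of digits before it goes into the URL.

```php
<?php
// Build the book page URL from user input, or return null if the input is invalid.
// The function name and the length limit are illustrative assumptions.
function buildBookUrl($number)
{
    // Accept only a short, non-empty string of decimal digits.
    if (!is_string($number) || !ctype_digit($number) || strlen($number) > 10) {
        return null;
    }
    return 'http://www.86zw.com/Book/' . $number . '/Index.aspx';
}

$number = isset($_POST['number']) ? $_POST['number'] : '';
$url = buildBookUrl($number);
if ($url === null) {
    echo "Invalid book number\n";
} else {
    echo $url, "\n";
}
```

Rejecting anything that is not purely numeric keeps path fragments such as ../ or a different host name out of the request.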
After constructing the URL, you can begin collecting the book information. Use file_get_contents() to fetch the book information page: $contents = file_get_contents($url); Then open the page in a browser, view the source, and locate the span element that contains "return to the Ming Dynasty as the Rajah". That is the title to extract.
The title sits inside a span element, so the regular expression has the form /<opening span tag>(.*?)<\/span>/is, and preg_match_all() applies it: preg_match_all("/<opening span tag>(.*?)<\/span>/is", $contents, $title); Afterwards $title[1][0] holds the title text captured by (.*?), while $title[0][0] holds the whole match including the tags (you can look up preg_match_all() on Baidu; it is not explained in detail here).
With the book information extracted, the next step is the chapter content. To get it, first find the address of each chapter, then fetch each chapter remotely, extract the content with a regular expression, and either store it in a database or generate static HTML files directly.
The chapter list address is http://www.86zw.com/Html/Book/18/3727/List.shtm. Like the book information page, it follows a pattern: http://www.86zw.com/Html/Book/<category number>/<book number>/List.shtm. We already have the book number, so the key is the category number, which can be found in the book information page fetched earlier. Extract the category number:
preg_match_all("/html\/book\/[0-9]{1,}\/[0-9]{1,}\/list\.shtm/is", $contents, $typeid);
This alone is not enough, because the match still contains the whole path. A cutting function is also needed.
The PHP code is as follows:
function cut($string, $start, $end) {
    // Keep everything after the first occurrence of $start...
    $message = explode($start, $string);
    // ...and then everything before the next occurrence of $end.
    $message = explode($end, $message[1]);
    return $message[0];
}
Here $string is the content to cut, $start is where the cut begins, and $end is where it ends. Extract the category number:
// explode() is case-sensitive, so the delimiters must match the case used in the page.
$start = "Html/Book/";
$end = "List.shtm";
$typeid = cut($typeid[0][0], $start, $end); // e.g. "18/3727/"
$typeid = explode("/", $typeid);
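As an alternative sketch, a capture group can pull the category number straight out of the match, which avoids the separate cutting step. The function name extractCategory() and the sample page fragment are assumptions for illustration, not part of the original program.

```php
<?php
// Return the category number from the chapter-list link in a book page,
// or null when no such link is found.
function extractCategory($contents)
{
    // The first group captures the category number; the /i flag makes
    // the match independent of the case used in the path.
    $pattern = '/html\/book\/([0-9]+)\/[0-9]+\/list\.shtm/i';
    if (!preg_match($pattern, $contents, $m)) {
        return null;
    }
    return $m[1];
}

// Hypothetical fragment of the book information page.
$contents = '<a href="/Html/Book/18/3727/List.shtm">Chapter list</a>';
echo extractCategory($contents), "\n";
```

Returning null on a failed match also gives the caller a clear signal to stop instead of building a chapter list address from an empty value.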
So $typeid[0] is the category number we are looking for. The next step is to construct the chapter list address: $chapterurl = "http://www.86zw.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/List.shtm"; With this you can find the address of each chapter. Here is how:
$ustart = "\"";
$uend = "\"";
// t is short for title
$tstart = ">";
$tend = "<";
// Fetch the chapter list page
$file = file_get_contents($chapterurl);
// Get the paths, for example: 123.shtm, 2342.shtm, 233.shtm
preg_match_all("/\"[0-9]{1,}\.(shtm)\"/is", $file, $url);
// Get the titles, for example: Chapter Nine The Righteous
// (the pattern must match each title together with the surrounding > and <)
preg_match_all("//is", $file, $title);
$array = array();
$count = count($url[0]);
for ($i = 0; $i < $count; $i++)
{
    $u = cut($url[0][$i], $ustart, $uend);
    $t = cut($title[0][$i], $tstart, $tend);
    $array[$u] = $t;
}
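The chapter-list step can also be sketched in a single pass with capture groups. The snippet assumes each chapter appears as a plain link such as <a href="123.shtm">Title</a>; that markup, the function name extractChapters() and the sample data are hypothetical, and the real page may be structured differently.

```php
<?php
// Map chapter file names to titles from the chapter-list HTML.
// Assumes links of the form <a href="123.shtm">Title</a> (hypothetical markup).
function extractChapters($html)
{
    $chapters = array();
    $pattern = '/<a\s+href="([0-9]+\.shtm)"[^>]*>(.*?)<\/a>/is';
    // PREG_SET_ORDER groups the results per match: $m[1] is the path, $m[2] the title.
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $chapters[$m[1]] = trim(strip_tags($m[2]));
        }
    }
    return $chapters;
}

$sample = '<a href="123.shtm">Chapter One</a><a href="2342.shtm">Chapter Two</a>';
print_r(extractChapters($sample));
```

Capturing the path and the title in the same match keeps them paired, so a link without a title cannot shift the two lists out of step.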
$array now holds every chapter address mapped to its title. At this point the collector is half done; what remains is to loop over the chapter addresses, fetch each one, and match the content. That part is relatively simple and is not described in detail here. That is all for today. This is the first time I have written such a long article, so the wording inevitably has problems; please bear with me.
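For readers who want a starting point for that remaining step, here is a rough sketch. Everything specific in it is an assumption: the function name collectChapters(), the idea that chapter files live in the same directory as the chapter list, and the div with id "content" used as the body pattern. A stand-in fetcher is used so the sketch runs without network access; in real use 'file_get_contents' could be passed instead.

```php
<?php
// Fetch every chapter and return an array of title => body text.
// $fetch is any callable that takes a URL and returns HTML or false.
// $bodyPattern must contain one capture group around the chapter text.
function collectChapters(array $chapters, $baseUrl, $bodyPattern, $fetch)
{
    $result = array();
    foreach ($chapters as $file => $title) {
        $html = call_user_func($fetch, $baseUrl . $file);
        if ($html === false) {
            continue; // skip chapters that fail to download
        }
        if (preg_match($bodyPattern, $html, $m)) {
            $result[$title] = trim(strip_tags($m[1]));
        }
    }
    return $result;
}

// Stand-in fetcher returning hypothetical chapter markup.
$fakeFetch = function ($url) {
    return '<div id="content">Text of ' . basename($url) . '</div>';
};

$chapters = array('123.shtm' => 'Chapter One');
$baseUrl = 'http://www.86zw.com/Html/Book/18/3727/';
$out = collectChapters($chapters, $baseUrl, '/<div id="content">(.*?)<\/div>/is', $fakeFetch);
print_r($out);
```

From here the returned array could be written to a database or to static HTML files, as mentioned earlier.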

