PHP Content Collector (PHP "Thief" Program)

Source: Internet
Author: User
A collector, often called a "thief" program, is mainly used to scrape content from other people's web pages. Writing one is not difficult: open the target page remotely, then use regular expressions to match the content you need. With a basic grasp of regular expressions, you can build your own collector.

A few days ago I was building a serialized-novel site and, because updating it by hand would be tedious, I also wrote a collector that scrapes the novel site 86zw.com. Its functionality is fairly simple and it does not support custom rules, but the general idea is all there, and you can extend it with custom rules yourself.

A collector written in PHP relies mainly on two functions: file_get_contents() and preg_match_all(). The former reads the content of a remote web page; it is available from PHP 4.3.0 onward, and reading remote URLs requires the allow_url_fopen setting to be enabled. The latter is a regular-expression function used to extract the required content.
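As a minimal sketch of the fetching step, the helper below wraps file_get_contents() so that a failed read yields null instead of a warning. The name fetch_page() and the 10-second timeout are assumptions made for illustration, not part of the original collector, and the type declarations assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper: fetch a page and return its body, or null on failure.
function fetch_page(string $url, int $timeoutSeconds = 10): ?string
{
    // A stream context lets us set a timeout for HTTP requests.
    $context = stream_context_create([
        'http' => ['timeout' => $timeoutSeconds],
    ]);

    // file_get_contents() returns false on failure; "@" suppresses the
    // warning so the caller can decide how to handle the error.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

Returning null makes it easy for the caller to skip a book whose page could not be read rather than running the regular expressions against an empty value.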

The implementation is described step by step below.

Because the collector targets novels, the first things to extract are the title, author, and category; other information can be extracted as needed.

The target here is the novel "Back to the Ming Dynasty as a Prince". First open its book index page: http://www.86zw.com/Book/3727/Index.aspx

Opening a few more books shows that index-page URLs follow the pattern http://www.86zw.com/Book/{book number}/Index.aspx. So we can build a start page containing <input type=text name=number> for entering the number of the book to collect, and receive that number through $_POST['number']. Once the book number is received, construct the index-page URL: $url = "http://www.86zw.com/Book/" . $_POST['number'] . "/Index.aspx"; This is only an example, kept simple for the sake of explanation; in real code you should validate $_POST['number'] first.
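One way to do that validation, sketched under the assumption that book numbers are always plain digits (as in the 3727 example), is to accept only digit strings before building the URL. The function name build_index_url() is hypothetical.

```php
<?php
// Hypothetical helper: build the index-page URL for a numeric book number.
// Returns null when the input is not a non-empty string of digits.
function build_index_url($number): ?string
{
    // ctype_digit() is false for the empty string and for anything
    // containing characters other than 0-9.
    if (!is_string($number) || !ctype_digit($number)) {
        return null;
    }

    return 'http://www.86zw.com/Book/' . $number . '/Index.aspx';
}

// Typical use in the form handler (assumes PHP 7 for the ?? operator):
// $url = build_index_url($_POST['number'] ?? '');
```

Rejecting anything that is not purely numeric keeps user input from altering the path of the URL that the server goes on to request.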

After constructing the URL, you can start collecting the book information. Use file_get_contents() to open the index page: $contents = file_get_contents($url); this reads the content of the index page into $contents. The next step is to match the title, author, and category. The title is used as the example here; the others work the same way.

Open the index page, view its source, and find <span class="BookTitle">Back to the Ming Dynasty as a Prince</span>; this is the title to extract. The regular expression that extracts it is /<span class="BookTitle">(.*?)<\/span>/is. Use preg_match_all() to pull the title out: preg_match_all("/<span class=\"BookTitle\">(.*?)<\/span>/is", $contents, $title); After this call, $title[1][0] holds the title text, while $title[0][0] holds the whole match including the <span> tags. You can look up preg_match_all() for the details; it is not explained further here.

With the book information extracted, the next step is the chapter content. To get it, first find the address of each chapter, then open each chapter remotely, extract its content, and either store it in a database or generate a static HTML file directly.

This is the address of the chapter list: http://www.86zw.com/Html/Book/18/3727/List.shtm. Like the index page, it follows a pattern: http://www.86zw.com/Html/Book/{category number}/{book number}/List.shtm. The book number is already known, so the key is the category number, which can be found in the index page fetched earlier.
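The title-extraction step described above can be sketched as a small function. The name extract_title() is hypothetical, and the class name BookTitle is taken from the markup quoted above, so it is an assumption about the real page and must be adjusted to whatever the page actually contains.

```php
<?php
// Hypothetical helper: return the first book title found in $html, or null.
function extract_title(string $html): ?string
{
    // "#" is used as the delimiter so the "/" in "</span>" needs no escaping.
    // i = case-insensitive, s = "." also matches newlines.
    $pattern = '#<span class="BookTitle">(.*?)</span>#is';

    // preg_match_all() returns the number of matches (0 if none, false on error).
    if (!preg_match_all($pattern, $html, $matches)) {
        return null;
    }

    // $matches[0] holds the full matches, $matches[1] the first capture group.
    return trim($matches[1][0]);
}
```

The same shape works for the author and category: only the pattern changes, and the captured text is always read from index 1 of the match array.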

To extract the category number, first match the chapter-list link in the index page: preg_match_all("/Html\/Book\/[0-9]{1,}\/[0-9]{1,}\/List\.shtm/is", $contents, $typeid); This alone is not enough; a cutting function is also needed:

The PHP code is as follows:

function cut($string, $start, $end) {
    // Keep everything after the first occurrence of $start...
    $message = explode($start, $string);
    // ...then keep everything before the first occurrence of $end.
    $message = explode($end, $message[1]);
    return $message[0];
}

Here $string is the content to cut, $start is where the cut begins, and $end is where it ends. Note that explode() is case-sensitive, so $start and $end must match the case used in the link. Extract the category number:

$start = "Html/Book/";
$end = "List.shtm";
$typeid = cut($typeid[0][0], $start, $end);
$typeid = explode("/", $typeid);

So $typeid[0] is the category number we are looking for. The next step is to construct the address of the chapter list: $chapterurl = "http://www.86zw.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/List.shtm"; With this, we can find the address of each chapter.
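As an alternative sketch, the category number can be captured directly by the regular expression, which avoids the separate cutting step. This is a variant, not the approach used above; the function name build_chapter_list_url() is hypothetical, and it assumes the index page contains a link of the form Html/Book/{category number}/{book number}/List.shtm.

```php
<?php
// Hypothetical helper: derive the chapter-list URL from the index-page HTML.
// Returns null when no chapter-list link for this book number is found.
function build_chapter_list_url(string $html, string $number): ?string
{
    // The first capture group is the category number. preg_quote() escapes
    // any regex metacharacters in the book number, including the delimiter.
    $pattern = '#Html/Book/([0-9]+)/' . preg_quote($number, '#') . '/List\.shtm#i';

    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }

    return 'http://www.86zw.com/Html/Book/' . $m[1] . '/' . $number . '/List.shtm';
}
```

Because the pattern includes the book number, a link to some other book's chapter list elsewhere on the page is not picked up by mistake.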
