PHP Content Collector (PHP "Thief" Program)

Last Update:2017-01-13 Source: Internet

Author: User

A collector, often called a "thief program", is mainly used to crawl content from other people's web pages. Writing one is not difficult: open the target page remotely, then use regular expressions to match the content you need. With a basic grasp of regular expressions, you can build your own collector.

A few days ago I was building a site for serialized novels and, to avoid the trouble of updating it by hand, I also wrote a collector for the novel site 86zw.com. Its functionality is fairly simple and it does not support custom rules, but the general idea is all here, and you can extend it with custom rules yourself.

A PHP collector mainly relies on two functions: file_get_contents() and preg_match_all(). The former reads the content of a remote web page; it has been available since PHP 4.3.0, and reading remote URLs requires the allow_url_fopen setting to be enabled. The latter is a regular expression function used to extract the required content.
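As a minimal sketch of how these two functions fit together, the block below wraps each one in a small helper. The names fetchPage() and matchAll(), the 10-second timeout, and the use of exceptions are assumptions made for illustration and are not part of the original collector. The type declarations assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper: read a remote page (or a local file) into a string.
function fetchPage(string $url): string
{
    // A timeout keeps a slow remote server from hanging the script.
    $context = stream_context_create(['http' => ['timeout' => 10]]);
    $contents = @file_get_contents($url, false, $context);
    if ($contents === false) {
        throw new RuntimeException("Could not read $url");
    }
    return $contents;
}

// Hypothetical helper: return every value captured by group 1 of $pattern.
function matchAll(string $pattern, string $contents): array
{
    if (preg_match_all($pattern, $contents, $matches) === false) {
        throw new RuntimeException("Invalid pattern: $pattern");
    }
    // $matches[0] holds full matches; $matches[1] holds the first group.
    return $matches[1] ?? [];
}
```

For example, matchAll('/<a href="(.*?)"/is', fetchPage($url)) would return the link targets found on the page at $url.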

The following sections implement the collector step by step.

Because this collector targets novels, start by extracting three things: the title, the author, and the genre. Other information can be extracted as needed.

We will use "Back to the Ming Dynasty as a Prince" as the target. First open its bibliography page: http://www.86zw.com/Book/3727/Index.aspx

Open a few more books and you will find that bibliography page URLs follow the pattern http://www.86zw.com/Book/<book number>/Index.aspx. So we can build a start page with an <input type="text" name="number"> field for entering the book number to collect, and then receive that number through $_POST['number']. With the book number in hand, the next step is to construct the bibliography page URL: $url = "http://www.86zw.com/Book/" . $_POST['number'] . "/Index.aspx"; This is only an example, kept simple for ease of explanation. In real code it is best to validate $_POST['number'] before using it.
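One way the validation mentioned above might look is sketched here. The helper name buildBookUrl() and the limit of 10 digits are assumptions for illustration; only the URL pattern comes from the article.

```php
<?php
// Hypothetical helper: validate the submitted book number and build the
// bibliography page URL following the pattern described in the article.
function buildBookUrl($number): string
{
    // Accept only 1 to 10 ASCII digits (the length limit is an assumption),
    // so that nothing else can end up inside the URL.
    if (!is_string($number) || !preg_match('/^[0-9]{1,10}$/D', $number)) {
        throw new InvalidArgumentException('Book number must be numeric.');
    }
    return 'http://www.86zw.com/Book/' . $number . '/Index.aspx';
}

// In the form handler this could be called as:
//   $url = buildBookUrl($_POST['number'] ?? '');
```

Rejecting anything that is not purely digits keeps slashes, dots, and other characters from changing which page is requested.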

After constructing the URL, you can start collecting the book information. Use file_get_contents() to open the bibliography page: $contents = file_get_contents($url); This reads the contents of the bibliography page into $contents. The next step is to match the title, author, and genre. Only the title is shown here; the others work the same way. Open the bibliography page, view its source, and find the markup that contains "Back to the Ming Dynasty as a Prince"; that is the title to extract. The regular expression places a lazy capture group, (.*?), between the markup that surrounds the title (written as ... here) and uses the i and s modifiers: preg_match_all("/...(.*?).../is", $contents, $title); Afterwards $title[0][0] holds the first full match and $title[1][0] holds the text captured by (.*?), which is the title we want (you can look up preg_match_all() on Baidu; it is not explained in detail here).

With the book information extracted, the next step is the chapter content. To get it, first find the address of each chapter, then open each chapter remotely, extract its content with a regular expression, and either store it in a database or generate a static HTML file directly. The chapter list address is http://www.86zw.com/Html/Book/18/3727/List.shtm. Like the bibliography page, it follows a pattern: http://www.86zw.com/Html/Book/<category number>/<book number>/List.shtm. The book number is already known, so the key is the category number, which can be found in the bibliography page fetched earlier.
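Returning to the title extraction step described above, the following sketch shows the capture-group technique on a made-up page fragment. The sample markup (the h1 and span tags and their class names), the sample values, and the helper name extractFirst() are all hypothetical; the real page's markup has to be read from its source and the patterns adjusted to match.

```php
<?php
// Hypothetical sample: the real page's markup around the title will differ.
$contents = '<html><body><h1 class="title">Sample Book</h1>'
          . '<span class="author">Sample Author</span></body></html>';

// Hypothetical helper: return the first capture of $pattern, or null.
function extractFirst(string $pattern, string $contents): ?string
{
    if (preg_match_all($pattern, $contents, $m) > 0 && isset($m[1][0])) {
        // $m[1][0] is the first capture; drop nested tags and whitespace.
        return trim(strip_tags($m[1][0]));
    }
    return null;
}

// The i modifier ignores case; the s modifier lets "." match newlines.
$title  = extractFirst('/<h1 class="title">(.*?)<\/h1>/is', $contents);
$author = extractFirst('/<span class="author">(.*?)<\/span>/is', $contents);
```

Returning null when nothing matches lets the caller tell "no title found" apart from an empty title, which is useful when the site changes its layout.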

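To extract the category number, first match the chapter list link in the bibliography page:

preg_match_all("/Html\/Book\/[0-9]{1,}\/[0-9]{1,}\/List\.shtm/is", $contents, $typeid);

This alone is not enough; a helper function that cuts a substring out of the match is also needed.

The PHP code is as follows:

// Return the part of $string that lies between $start and $end.
function cut($string, $start, $end) {
    $message = explode($start, $string);
    if (!isset($message[1])) {
        return ""; // $start was not found
    }
    $message = explode($end, $message[1]);
    return $message[0];
}

Here $string is the content to cut, $start is where the cut begins, and $end is where it ends. Note that explode() is case-sensitive, so $start and $end must match the case used in the page. Extract the category number:

$start = "Html/Book/";
$end = "List.shtm";
$typeid = cut($typeid[0][0], $start, $end);
$typeid = explode("/", $typeid);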

So $typeid[0] is the category number we are looking for. The next step is to construct the address of the chapter list: $chapterurl = "http://www.86zw.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/List.shtm"; With this, we can find the address of each chapter.
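As an alternative sketch of this step, the category number and book number can be captured directly with two groups in a single preg_match() call, instead of matching first and cutting afterwards. This swaps in a different technique from the one in the article. The helper names findCategoryAndBook() and buildChapterListUrl() and the sample anchor tag are assumptions; only the path Html/Book/18/3727/List.shtm comes from the article.

```php
<?php
// Hypothetical helper: find the chapter list link in the bibliography page
// and return [category number, book number], or null when it is absent.
function findCategoryAndBook(string $contents): ?array
{
    $pattern = '/Html\/Book\/([0-9]+)\/([0-9]+)\/List\.shtm/i';
    if (preg_match($pattern, $contents, $m) === 1) {
        return [$m[1], $m[2]];
    }
    return null;
}

// Hypothetical helper: build the chapter list URL from the two numbers.
function buildChapterListUrl(string $typeId, string $bookNumber): string
{
    return "http://www.86zw.com/Html/Book/$typeId/$bookNumber/List.shtm";
}

// Sample fragment standing in for the fetched bibliography page.
$contents = '<a href="/Html/Book/18/3727/List.shtm">Chapter list</a>';
$ids = findCategoryAndBook($contents);
$chapterUrl = $ids === null ? null : buildChapterListUrl($ids[0], $ids[1]);
```

Because the numbers come from capture groups, the case-sensitivity concern of explode() does not arise, and a missing link is reported as null rather than as an undefined array index.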

This article is an English version of an article originally published in Chinese on aliyun.com.
