Dynamic Web Technology: Using PHP to Make a Simple Content Collector

A collector, usually called a "thief program", is mainly used to grab other people's web content. Making a collector is actually not difficult: remotely open the page to be collected, then use regular expressions to match out the content you need. With just a little grounding in regular expressions, you can build your own collector.

A few days ago I made a serialized-novel program and, because updating it by hand would be a hassle, I also wrote a collector for it that scrapes the 86zw Chinese novel network. Its features are fairly simple and the rules cannot be customized, but the whole idea is in there, and custom rules can be added as an extension.

A PHP collector mainly uses two functions: file_get_contents() and preg_match_all(). The former reads the content of a web page, though only PHP 5 and later can use it this way; the latter is the regular-expression function used to extract the content we need.
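In the simplest form the two are combined like this; a minimal sketch with a placeholder URL and pattern:

<?php
// Read a remote page (allow_url_fopen must be enabled in php.ini).
$html = file_get_contents("http://www.example.com/");
// Collect every match of a pattern: full matches land in $m[0],
// the first capture group in $m[1].
preg_match_all("/<title>(.*?)<\/title>/is", $html, $m);
echo $m[1][0]; // the page title
?>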

Here's a step-by-step feature implementation.

Because we are collecting novels, the first things to extract are the title, author, and type; other information can be extracted as needed.

Here to "return to the Ming Dynasty as the Rajah" as the goal, first open the Bibliography page, Link: http://www.webjx.com/Book/3727/Index.aspx

Open a few books and you will notice that the bibliography pages all follow the same basic format: http://www.webjx.com/Book/book-number/index.aspx. So we can make a start page with an input field for the book number to be collected, and later receive that number as $_POST['number']. Once the book number has been received, the next thing to do is construct the bibliography page URL: $url = "http://www.86zw.com/book/{$_POST['number']}/index.aspx". Of course this is just an example for ease of explanation; in real code it is best to first check that $_POST['number'] is legal.
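As a minimal sketch of that start page, with made-up file names (start.html and collect.php are just for illustration): a form posts the book number to a script that validates it before building the URL.

start.html:

<form method="post" action="collect.php">
Book number: <input type="text" name="number" />
<input type="submit" value="Collect" />
</form>

collect.php:

<?php
// Accept only a plain, positive integer as the book number.
if (!isset($_POST['number']) || !ctype_digit($_POST['number'])) {
    die('Illegal book number');
}
$url = "http://www.86zw.com/book/{$_POST['number']}/index.aspx";
?>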

After constructing the URL, you can begin to collect the book information. Use the file_get_contents() function to open the bibliography page: $contents = file_get_contents($url). This reads in the content of the bibliography page. The next step is to match out the title, author, and type. Take the title as an example; the rest works the same way. Open the bibliography page, look at the source, and find "Back to the Ming Dynasty as a Prince": this is the title to extract. The regular expression for the title is "/(.*?)/is" with the tags that surround the title on the page placed around the capture group (the tag markup was lost from this article, so check the page source for it). Use preg_match_all() to take the title out: preg_match_all($pattern, $contents, $title); the content of $title[0][0] is the title we want. (You can look up preg_match_all on Baidu; it is not explained in detail here.)

With the book information extracted, the next task is the chapter content. To get it, you first have to find the address of each chapter, then open each chapter remotely, cut the content out with regular expressions, and either write it to a database or generate static HTML files directly. This is the address of the chapter list: http://www.webjx.com/Html/Book/18/3727/List.shtm. Like the bibliography page, it follows a pattern: http://www.webjx.com/Html/Book/classification-number/book-number/List.shtm. We already have the book number, so the key is to find the classification number, which can also be found on the bibliography page we just fetched.
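Before moving on, here is a minimal sketch of the book-info step just described. The <h1> pattern is an assumption, since the real tags around the title were stripped from this article; check the page source and adjust it:

<?php
$contents = file_get_contents($url); // $url is the bibliography page built above
// Assumed pattern; substitute the tags that actually surround the title.
preg_match_all("/<h1>(.*?)<\/h1>/is", $contents, $title);
// $title[0][0] is the full match including the tags;
// $title[1][0] is just the captured title text.
echo $title[1][0];
?>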

Preg_match_all ("/html/book/[0-9]{1,}/[0-9]{1,}/list.shtm/is", $contents, $typeid); This is not enough, and a tangent function is required:

The PHP code is as follows:

function cut($string, $start, $end) {
    $message = explode($start, $string);
    $message = explode($end, $message[1]);
    return $message[0];
}

Here $string is the content to be cut, $start is the marker where the cut begins, and $end is the marker where it ends. Take out the classification number:

$start = "html/book/";
$end
= "List.shtm";
$typeid = Cut ($typeid [0][0], $start, $end);
$typeid = Explode ("/", $typeid); [/php]
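For instance, feeding the first match from the example book through cut() and explode() behaves like this (a hypothetical value based on the example chapter-list URL):

<?php
// Uses the cut() function defined above.
$matched = "Html/Book/18/3727/List.shtm";
$inner = cut($matched, "Html/Book/", "List.shtm"); // "18/3727/"
$parts = explode("/", $inner);                     // ["18", "3727", ""]
echo $parts[0];                                    // "18", the classification number
?>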

So $typeid[0] is the classification number we are looking for. The next step is to construct the address of the chapter list: $chapterurl = "http://www.webjx.com/Html/Book/{$typeid[0]}/{$_POST['number']}/List.shtm". With it you can find the address of every chapter. Here's how:

$file = file_get_contents($chapterurl); // read the chapter-list page (this step was implied)
// Markers for the cut function: each path sits between double quotes,
$ustart = "\"";
$uend = "\"";
// and each title (t is short for title) sits between > and <:
$tstart = ">";
$tend = "<";
// Take the paths, for example: 123.shtm, 2342.shtm, 233.shtm
preg_match_all("/\"[0-9]{1,}\.shtm\"/is", $file, $url);
// Take the titles, for example: Chapter Nine The Righteous. The tag pattern was
// lost from this article; matching whole <a> elements works with the cuts below.
preg_match_all("/<a[^>]+>.*?<\/a>/is", $file, $title);
$count = count($url[0]);
for ($i = 0; $i < $count; $i++) {
    $u = cut($url[0][$i], $ustart, $uend);
    $t = cut($title[0][$i], $tstart, $tend);
    $array[$u] = $t;
}

The $array array now holds all the chapter addresses and titles. At this point the collector is half done; the rest is to loop over each chapter address, open it, read it in, and match out the content. That is relatively simple and is not described in detail here, though a rough sketch is attached at the end. Well, that's it for today. This is the first time I have written such a long article, so there are bound to be problems with the wording; please bear with me!
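Attached: a rough sketch of that final loop, for completeness. The chapter URL layout and the content <div> are assumptions; match them against the real pages before use:

<?php
// $array maps chapter files (e.g. "123.shtm") to chapter titles.
foreach ($array as $u => $t) {
    // Assumed layout: chapter pages sit next to List.shtm.
    $page = file_get_contents("http://www.webjx.com/Html/Book/{$typeid[0]}/{$_POST['number']}/{$u}");
    // Assumed container; replace with the markup that really wraps the text.
    preg_match_all("/<div id=\"content\">(.*?)<\/div>/is", $page, $m);
    // Generate one static HTML file per chapter.
    file_put_contents(str_replace(".shtm", ".html", $u), "<h1>{$t}</h1>" . $m[1][0]);
}
?>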

