PHP content collector (PHP thief program)-PHP source code

Last Update:2018-07-17 Source: Internet

Author: User

A collector, also known as a thief program, is mainly used to capture content from other people's web pages. Creating one is not difficult: open the target web page remotely, then use a regular expression to match the required content. With a basic grasp of regular expressions, you can write your own collector.

I developed a novel serialization program a few days ago. To avoid having to update it by hand, I wrote a collector that collects from the 8-way Chinese network. Its functionality is simple and its rules cannot be customized, but the general idea is all there, and you can extend it with custom rules yourself.

A collector written in PHP relies mainly on two functions: file_get_contents() and preg_match_all(). The former reads the web page content remotely; it is available since PHP 4.3.0 and requires allow_url_fopen to be enabled for remote URLs. The latter is a regular expression function used to extract the required content.
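As a rough illustration of how the two functions fit together, here is a minimal sketch. The helper names fetch_page() and match_all() and the sample markup are assumptions made for this example and are not part of the original program.

```php
<?php
// Hypothetical helper: download a page, returning null on failure.
// file_get_contents() on an http:// URL needs allow_url_fopen enabled.
function fetch_page(string $url): ?string
{
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}

// Hypothetical helper: return every substring captured by group 1 of $pattern.
function match_all(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $matches);
    // $matches[0] holds the full matches, $matches[1] the first capture group.
    return $matches[1] ?? [];
}

// Made-up markup standing in for a downloaded page.
$sample = '<li>Chapter 1</li><li>Chapter 2</li>';
$items = match_all('/<li>(.*?)<\/li>/is', $sample);
print_r($items); // $items is ['Chapter 1', 'Chapter 2']
```

In a real run, the string passed to match_all() would come from fetch_page() rather than from a literal.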

The following describes the function implementation step by step.

Because we are collecting novels, we must first extract the title, author, and type. Other information can be extracted as needed.

Here we take the novel "Back to the Ming dynasty when Wang Ye" as the target. First open its bibliography page: http://www.86zw.com/Book/3727/Index.aspx

If you open a few more books, you will find that the bibliography page address always has the form http://www.86zw.com/Book/book number/Index.aspx. We can therefore define a form field named number for entering the number of the book to be collected, and receive it with $_POST['number']. Once the book number has been received, construct the bibliography page address: $url = "http://www.86zw.com/Book/" . $_POST['number'] . "/Index.aspx"; This is only an example, written this way for ease of explanation. In production code it is best to validate $_POST['number'] first.
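One possible way to validate the book number before building the address is sketched below. The function name build_index_url(), the 10-digit limit, and the fallback to the example book number are assumptions made for illustration.

```php
<?php
// Hypothetical helper: accept only a short string of digits as the book
// number, so nothing else can be injected into the URL.
function build_index_url($number): ?string
{
    if (!is_string($number) || preg_match('/^[0-9]{1,10}$/D', $number) !== 1) {
        return null;
    }
    return 'http://www.86zw.com/Book/' . $number . '/Index.aspx';
}

// Falls back to the example book number when no form was submitted.
$number = $_POST['number'] ?? '3727';
$url = build_index_url($number);

if ($url === null) {
    echo "Invalid book number\n";
} else {
    echo $url, "\n";
}
```

Rejecting anything other than digits keeps path fragments such as ../ out of the constructed address.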

After the URL is constructed, you can start to collect the book information. Use the file_get_contents() function to open the bibliography page: $contents = file_get_contents($url); This reads the contents of the bibliography page into $contents.

Next, match the title, author, type, and other information. Here we take the title as an example; the rest work the same way. Open the bibliography page, view its source, and find "back to the Ming dynasty when Wang Ye". This is the title to be extracted. The regular expression for the title has the form /(.*?)/is, where the markup that delimits the title in the page source goes on either side of the capture group. Use preg_match_all() to retrieve the title: preg_match_all("/(.*?)/is", $contents, $title); Afterwards $title[0][0] holds the first full match and $title[1][0] holds the text captured by the group, which is the title we want (the usage of preg_match_all() can be looked up on Baidu).

With the book information extracted, the next step is the chapter content. To get the chapter content, first find the address of each chapter, then open each chapter remotely, extract the content with a regular expression, and either store it in the database or generate static HTML files directly.

The chapter list address has the form http://www.86zw.com/Html/Book/classification number/book number/List.shtm. The book number has already been obtained. The key here is to find the classification number, which can be found on the bibliography page opened earlier:

preg_match_all("/Html\/Book\/[0-9]{1,}\/[0-9]{1,}\/List\.shtm/is", $contents, $typeid);

This alone is not enough. You also need a cut function:

The PHP code is as follows:

function cut($string, $start, $end) {
    $message = explode($start, $string);
    $message = explode($end, $message[1]);
    return $message[0];
}

Here $string is the content to be cut, $start is the start marker, and $end is the end marker. Retrieve the classification number:

$start = "Html/Book/";
$end = "List.shtm";
$typeid = cut($typeid[0][0], $start, $end);
$typeid = explode("/", $typeid);

In this way, $typeid[0] is the classification number we are looking for. The next step is to construct the chapter list address: $chapterurl = "http://www.86zw.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/List.shtm"; With this address, you can find the address of each chapter.
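The steps so far can be pulled together roughly as follows. This is a sketch under stated assumptions: the helper parse_index_page(), the <title> tags around the book title, and the sample page fragment with classification number 12 are all invented for illustration, and the real page markup may differ.

```php
<?php
// Same idea as the article's cut(), plus a guard for a missing start marker.
function cut(string $string, string $start, string $end): string
{
    $parts = explode($start, $string, 2);
    if (count($parts) < 2) {
        return '';
    }
    return explode($end, $parts[1], 2)[0];
}

// Hypothetical helper: pull the title and the chapter list URL out of a
// bibliography page that has already been downloaded into $contents.
function parse_index_page(string $contents, string $number): array
{
    // ASSUMPTION: the title sits in <title> tags; the real page may differ.
    $title = preg_match('/<title>(.*?)<\/title>/is', $contents, $t) ? trim($t[1]) : null;

    $chapterurl = null;
    // No "i" flag here, because cut() compares its markers case-sensitively.
    if (preg_match('/Html\/Book\/[0-9]+\/[0-9]+\/List\.shtm/', $contents, $m)) {
        // cut() yields e.g. "12/3727/", so element 0 is the classification number.
        $typeid = explode('/', cut($m[0], 'Html/Book/', 'List.shtm'));
        $chapterurl = 'http://www.86zw.com/Html/Book/' . $typeid[0] . '/' . $number . '/List.shtm';
    }
    return ['title' => $title, 'chapterurl' => $chapterurl];
}

// Invented page fragment; 12 is a made-up classification number.
$sample = '<title>Example Book</title><a href="/Html/Book/12/3727/List.shtm">Chapters</a>';
$info = parse_index_page($sample, '3727');
print_r($info);
```

Returning null for a missing title or chapter link lets the caller skip a book whose page does not match the expected layout instead of building a broken address.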

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
