PHP content collector (PHP thief program)-PHP source code

Source: Internet
Author: User

A collector, also known as a thief program, is mainly used to capture content from other people's web pages. Writing one is not difficult: open the target page remotely, then use a regular expression to match the content you need. With a basic grasp of regular expressions, you can build your own collector.

A few days ago I developed a novel serialization program. Worried about keeping it updated, I wrote a collector to pull content from the 8-way Chinese network. Its functionality is simple and it does not support custom rules, but the general idea is all there, and you can extend it with custom rules yourself.

A PHP collector relies mainly on two functions: file_get_contents() and preg_match_all(). The former reads the web page content remotely; it is available in PHP 4.3.0 and later, and reading remote URLs requires allow_url_fopen to be enabled. The latter is a regular expression function used to extract the required content.
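As a minimal sketch of how the two functions fit together: the helper name fetch_page, the 10-second timeout, and the sample markup below are illustrative assumptions, not part of the original program.

```php
<?php
// Hypothetical helper: fetch a page and return its body, or null on failure.
// Remote URLs only work when allow_url_fopen is enabled in php.ini.
function fetch_page(string $url, int $timeout = 10): ?string
{
    // The timeout value is an assumption; tune it for the target site.
    $context = stream_context_create(['http' => ['timeout' => $timeout]]);
    // file_get_contents() returns false on failure; @ hides the warning.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// preg_match_all() fills $matches: index 0 holds the full matches,
// index 1 holds the text captured by the first parenthesised group.
$sample = '<li>one</li><li>two</li>';   // made-up markup for illustration
$count = preg_match_all('/<li>(.*?)<\/li>/is', $sample, $matches);
```

Returning null instead of false makes the failure case explicit for callers, which matters because a collector will regularly hit pages that time out or no longer exist.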

The following describes the function implementation step by step.

Because we are collecting novels, we must first extract the title, author, and type. Other information can be extracted as needed.

Here we take "Back to the Ming dynasty when Wang Ye" as the target. First, open its bibliography page: http://www.86zw.com/Book/3727/Index.aspx

If you open a few more books, you will find that the bibliography page address always has the form http://www.86zw.com/Book/book number/Index.aspx. We can therefore define a form field for entering the number of the book to collect, and receive it with $_POST['number']. With the book number in hand, construct the bibliography page address: $url = "http://www.86zw.com/Book/" . $_POST['number'] . "/Index.aspx". This is only an example, kept simple for ease of explanation; in a real program you should check the validity of $_POST['number'] first.
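The validity check mentioned above could look something like this. The helper name build_index_url is hypothetical, and the rule "digits only" is an assumption based on the book number 3727 seen in the example address.

```php
<?php
// Hypothetical helper: validate the submitted book number and build the
// bibliography-page URL. Returns null when the input is not purely digits.
function build_index_url($number): ?string
{
    // ctype_digit() accepts only non-empty strings made of 0-9, which
    // rejects input such as "3727/../x" or "12abc".
    if (!is_string($number) || !ctype_digit($number)) {
        return null;
    }
    return 'http://www.86zw.com/Book/' . $number . '/Index.aspx';
}

// In the real script the value would come from $_POST['number'].
$url = build_index_url('3727');
```

Rejecting anything that is not a plain digit string keeps user input from changing the path of the URL that the script goes on to fetch.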

After the URL is constructed, you can start collecting the book information. Use file_get_contents() to open the bibliography page: $contents = file_get_contents($url). This reads the contents of the bibliography page.

Next, match the title, author, type, and other information. We take the title as an example; the rest work the same way. Open the bibliography page, view its source, and find "back to the Ming dynasty when Wang Ye". This is the title to be extracted. The regular expression for it has the form /…(.*?)…/is, where … stands for the markup that surrounds the title in the page source. Use preg_match_all() to retrieve it: preg_match_all("/…(.*?)…/is", $contents, $title). The captured title is then in $title[1][0], while $title[0][0] holds the full match, including the surrounding markup (look up preg_match_all for the details of its usage).

With the book information extracted, the next step is the chapter content. To get it, first find the address of each chapter, then open each chapter remotely, extract the content with a regular expression, and either store it in a database or generate static HTML files directly.

The chapter list address contains two numbers: the book number and a classification number. The book number was obtained earlier, so the key here is the classification number, which can be found on the bibliography page.
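The title-extraction step described above can be sketched as follows. The markup in $contents is entirely made up for illustration, since the real page's markup is not shown here, and extract_first is a hypothetical helper.

```php
<?php
// Made-up stand-in for the bibliography page; the real page's markup differs.
$contents = '<html><body><h1 class="name">Example Book</h1>'
          . '<span class="author">Someone</span></body></html>';

// Hypothetical helper: return the first captured group of $pattern, or null.
function extract_first(string $pattern, string $html): ?string
{
    if (preg_match_all($pattern, $html, $m) > 0) {
        return trim($m[1][0]);   // [1] = first capture group, [0] = first hit
    }
    return null;
}

// Each pattern anchors on the literal markup around the wanted text and
// captures what lies between with the non-greedy group (.*?).
$title  = extract_first('/<h1 class="name">(.*?)<\/h1>/is', $contents);
$author = extract_first('/<span class="author">(.*?)<\/span>/is', $contents);
```

The same helper covers the author, type, and any other field; only the pattern changes.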

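The classification number can be matched on the bibliography page with preg_match_all("/Html\/Book\/[0-9]{1,}\/[0-9]{1,}\/List\.shtm/is", $contents, $typeid); (the slashes inside the pattern must be escaped because / is also the delimiter). This is not enough on its own. You also need a cut function: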

The PHP code is as follows:

    function cut($string, $start, $end) {
        // Keep everything after the first occurrence of $start.
        $message = explode($start, $string);
        // Then keep everything before the first occurrence of $end.
        $message = explode($end, $message[1]);
        return $message[0];
    }

Here $string is the content to be cut, $start is the start marker, and $end is the end marker. Retrieve the classification number:

    $start = "Html/Book/";
    $end = "List.shtm";
    $typeid = cut($typeid[0][0], $start, $end);
    $typeid = explode("/", $typeid);

In this way, $typeid[0] is the classification number we are looking for. The next step is to construct the chapter list address: $chapterurl = "http://www.86zw.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/List.shtm". With this, you can find the address of each chapter.
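Putting the classification-number steps together, a more defensive sketch might look like this. The names cut_between and build_chapter_list_url are hypothetical, the classification number 12 in the sample is invented, and the marker search is made case-insensitive (an assumption) so that it agrees with the i modifier on the pattern.

```php
<?php
// Defensive variant of cut(): returns null when either marker is missing,
// instead of reading an undefined array index.
function cut_between(string $string, string $start, string $end): ?string
{
    $from = stripos($string, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    $to = stripos($string, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($string, $from, $to - $from);
}

// Hypothetical helper: find the classification number on the bibliography
// page and build the chapter-list URL. Returns null if anything is missing.
function build_chapter_list_url(string $contents, string $number): ?string
{
    if (!ctype_digit($number)) {
        return null;
    }
    $pattern = '/Html\/Book\/[0-9]+\/[0-9]+\/List\.shtm/i';
    if (preg_match($pattern, $contents, $m) !== 1) {
        return null;
    }
    $inner = cut_between($m[0], 'Html/Book/', 'List.shtm');  // e.g. "12/3727/"
    if ($inner === null) {
        return null;
    }
    $typeid = explode('/', $inner);
    return 'http://www.86zw.com/Html/Book/' . $typeid[0] . '/' . $number . '/List.shtm';
}

// Made-up sample link; the classification number 12 is illustrative only.
$page = '<a href="/Html/Book/12/3727/List.shtm">Chapters</a>';
$chapterurl = build_chapter_list_url($page, '3727');
```

A capture group in the pattern could replace the cut step entirely; the sketch keeps it only to mirror the approach used in the article.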
