A collector, also known as a thief program, is mainly used to capture the content of other people's web pages. Writing one is not difficult: remotely open the page you want to collect, then use regular expressions to match the content you need. With even a little grounding in regular expressions, you can build your own collector.
I developed a novel serialization program a few days ago. To save myself the trouble of chasing updates, I wrote a collector for the 8-way Chinese network. Its features are simple and cannot be customized, but the general idea is all there, and you can extend it with your own rules.
A PHP collector mainly uses two functions: file_get_contents() and preg_match_all(). The former reads a web page remotely (it is available in PHP 5 and later); the latter is the regular-expression function used to extract the required content.
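As a minimal sketch of how the two functions combine (the URL and the <title> pattern here are only placeholders, not the actual rules used later):
$html = file_get_contents("http://www.example.com/page.html"); // read a remote page into a string (placeholder URL)
preg_match_all("/<title>(.*?)<\/title>/is", $html, $matches);  // collect every match of the pattern (placeholder pattern)
echo $matches[1][0]; // the first captured group of the first match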
The following describes the function implementation step by step.
Because we are collecting novels, we first need to extract the title, author, and type; other information can be extracted as needed.
Here we take the novel "Back to the Ming Dynasty as a Prince" as the target. First open its bibliography page at this link: http://www.webjx.com/Book/3727/Index.aspx
If you open a few more books, you will find that bibliography pages all follow the basic format http://www.webjx.com/book/book number/index.aspx. We can therefore let the user enter the number of the book to be collected and receive it as $_POST['number']. Once the book number has been received, the next step is to construct the bibliography page URL from it.
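A minimal sketch of that construction, assuming the URL format observed above:
$number = $_POST['number'];                                    // book number entered by the user
$url = "http://www.webjx.com/Book/" . $number . "/Index.aspx"; // bibliography page for this book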
After the URL is constructed, you can start collecting the book information. Use file_get_contents() to open the bibliography page: $contents = file_get_contents($url); — this reads the whole bibliography page into a string. Next we match the title, author, type, and other information. Here we take the book title as an example; the rest work the same way. Open the bibliography page, view the source, and find "Back to the Ming Dynasty as a Prince": this is the title to be extracted. The pattern for the title has the form /(.*?)/is, with the HTML tag that actually surrounds the title on the page wrapped around the capturing group (the page source tells you which tag that is). Retrieve it with preg_match_all($pattern, $contents, $title); with the default flags, $title[0] holds the full matches and $title[1] the captured groups, so $title[1][0] is the title we want (the usage of the preg_match_all function can be looked up on Baidu).
The book information is now extracted, and the next step is the chapter content. To get the chapter content you must first find the address of each chapter, then open each chapter remotely, extract its body with a regular expression, and store it in the database or directly generate static HTML files. The chapter list has an address of the form Html/Book/classification number/book number/List.shtm. The book number we already have; the key is to find the classification number, which can be found on the bibliography page we just fetched:
preg_match_all("/Html\/Book\/[0-9]{1,}\/[0-9]{1,}\/List\.shtm/is", $contents, $typeid); This alone is not enough; you also need a small cut function:
The PHP code is as follows:
function cut($string, $start, $end) {
    // return the part of $string between the first occurrence of $start and the following $end
    $message = explode($start, $string);
    $message = explode($end, $message[1]);
    return $message[0];
}
Here $string is the content to be cut, $start marks the starting point, and $end marks the end point. Retrieve the classification number:
$start = "Html/Book/";
$end = "List.shtm";
$typeid = cut($typeid[0][0], $start, $end);   // leaves "classification number/book number/"
$typeid = explode("/", $typeid);
In this way, $typeid[0] is the classification number we are looking for. The next step is to construct the chapter list address, along the lines of $chapterurl = "http://www.webjx.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/List.shtm"; With this address you can find the address of each chapter. The method is as follows:
// u stands for URL: each chapter link is wrapped in double quotes in the page source
$ustart = "\"";
$uend = "\"";
// t stands for title: the title text sits between ">" and "<" inside the link tag
$tstart = ">";
$tend = "<";
// read the chapter list page; this fetch is assumed, since the matches below need the page HTML
$file = file_get_contents($chapterurl);
// obtain the chapter paths, for example 123.shtm, 2342.shtm, 233.shtm
preg_match_all('/"[0-9]{1,}\.(shtm)"/is', $file, $url);
// obtain the chapter titles (the original pattern was garbled; this reconstruction matches
// the whole <a ...>title</a> tag so that cut() can pull the title out)
preg_match_all('/<a href="[0-9]{1,}\.shtm">.*?<\/a>/is', $file, $title);
$array = array();
$count = count($url[0]);
for ($i = 0; $i < $count; $i++) {
    $u = cut($url[0][$i], $ustart, $uend);   // strip the surrounding quotes, e.g. 123.shtm
    $t = cut($title[0][$i], $tstart, $tend); // text between ">" and "<", i.e. the chapter title
    $array[$u] = $t;                         // map chapter file name => chapter title
}
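To check the result you can dump the map; the file names and chapter titles shown here are just the illustrative ones from the comments above:
print_r($array);
// e.g. Array ( [123.shtm] => Chapter 1 [2342.shtm] => Chapter 2 ... )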
$array now maps the address of every chapter to its title; at this point the collector is half done. What remains is to open each chapter address in a loop, read the page, and match out the chapter content. This is relatively simple and is not described in detail here. That is all for today. This is the first time I have written such a long article, so there are bound to be problems with how it is organized; please forgive me!
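As a closing sketch of that remaining loop: the chapter URL layout and the <div id="content"> wrapper in the pattern below are assumptions about the target site, so adjust them to the real page source.
foreach ($array as $chapterfile => $chaptertitle) {
    // assume chapter pages sit in the same directory as List.shtm
    $chapterpage = "http://www.webjx.com/Html/Book/" . $typeid[0] . "/" . $_POST['number'] . "/" . $chapterfile;
    $html = file_get_contents($chapterpage);
    // placeholder pattern: replace the div with whatever tag really wraps the chapter text
    preg_match_all('/<div id="content">(.*?)<\/div>/is', $html, $body);
    $text = $body[1][0];
    // store $chaptertitle and $text in the database, or write a static file, for example:
    file_put_contents($chapterfile . ".html", "<h1>" . $chaptertitle . "</h1>" . $text);
}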