Clever Use of PHP Functions to Implement Collectors

Source: Internet
Author: User

PHP has been around for a long time, and many developers are familiar with it. This article shows how to use PHP functions to implement a collector. A collector, often called a "thief program", is a program that captures content from other people's web pages. Writing one is not difficult: open the target page remotely, then use a regular expression to match the content you need. With a basic grasp of regular expressions, you can build your own collector.

A few days ago I developed a novel serialization program. Worried about keeping its content up to date, I wrote a collector that gathers novels from the 8-way Chinese network. Its functionality is simple and its rules cannot be customized, but the general idea is in place, and you can extend it with custom rules yourself. A collector written in PHP relies mainly on two functions: file_get_contents() and preg_match_all(). The first reads the content of a remote web page; it is available from PHP 4.3.0 onward, and reading remote URLs requires the allow_url_fopen setting to be enabled. The second is a regular expression function used to extract the required content. Let's go through it step by step. Because this collector targets novels, the first things to extract are the title, author, and type; other information can be extracted as needed.
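As a minimal sketch of this first step, the snippet below reads a page and pulls out the title, author, and type with preg_match_all(). The helper names fetch_page() and extract_book_info(), the sample markup, and the patterns are all assumptions made for illustration; a real site needs patterns written against its own HTML.

```php
<?php
// Hypothetical helper: read a remote page. Needs allow_url_fopen to be enabled.
function fetch_page($url) {
    $html = @file_get_contents($url);
    return $html === false ? '' : $html;
}

// Hypothetical helper: extract title, author, and type from a book page.
// The markup matched here is invented for this sketch.
function extract_book_info($html) {
    $patterns = array(
        'title'  => '/<h1>(.*?)<\/h1>/is',
        'author' => '/<span class="author">(.*?)<\/span>/is',
        'type'   => '/<span class="type">(.*?)<\/span>/is',
    );
    $info = array();
    foreach ($patterns as $field => $pattern) {
        // $matches[1] holds every capture of the first group.
        preg_match_all($pattern, $html, $matches);
        $info[$field] = isset($matches[1][0]) ? trim($matches[1][0]) : null;
    }
    return $info;
}

// Sample markup standing in for the result of fetch_page($url).
$sample = '<h1>Sample Novel</h1>'
        . '<span class="author">Some Author</span>'
        . '<span class="type">Fantasy</span>';
$info = extract_book_info($sample);

// prints: Sample Novel / Some Author / Fantasy
echo $info['title'], ' / ', $info['author'], ' / ', $info['type'], "\n";
```

A field whose pattern does not match comes back as null, which makes it easy to notice when the target page changes its layout.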

These two functions are not enough on their own. You also need a small helper function that cuts a substring out of a string:

 
 
    // Return the text between the first $start and the following $end.
    function cut($string, $start, $end) {
        $message = explode($start, $string);
        $message = explode($end, $message[1]);
        return $message[0];
    }

Here $string is the content to be cut, $start is the marker where the wanted text begins, and $end is the marker where it ends. To retrieve the classification number:

    $start = "Html/Book/";
    $end = "List.shtm";
    $typeid = cut($typeid[0][0], $start, $end);
    $typeid = explode("/", $typeid);
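To see how cut() behaves, here is a small worked example. The function is repeated so the snippet runs on its own, with one added guard that is not part of the original: it returns an empty string when $start is missing instead of reading an undefined index. The sample link is hypothetical and only mimics the Html/Book/.../List.shtm shape used in this article.

```php
<?php
// cut() as described in the article, plus a guard for a missing $start marker
// (the guard is an addition for this sketch).
function cut($string, $start, $end) {
    $message = explode($start, $string);
    if (!isset($message[1])) {
        return '';
    }
    $message = explode($end, $message[1]);
    return $message[0];
}

// Hypothetical link; only its Html/Book/.../List.shtm shape comes from the text.
$link = '<a href="/Html/Book/12/3456/List.shtm">Contents</a>';

$typeid = cut($link, 'Html/Book/', 'List.shtm');  // "12/3456/"
$typeid = explode('/', $typeid);                  // ["12", "3456", ""]

echo $typeid[0], "\n";                            // prints 12
```

Because explode() is case-sensitive, the markers must match the page's markup exactly.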

In this way, $typeid[0] is the classification number we are looking for. The chapter addresses and titles are collected as follows:

    // Delimiters for the address (u) and the title (t)
    $ustart = '"';
    $uend = '"';
    // t stands for title
    $tstart = '>';
    $tend = '<';
    // Obtain the paths, for example 123.shtm, 2342.shtm, 233.shtm
    preg_match_all('/"[0-9]{1,}\.(shtm)"/is', $chapterurl, $url);
    // Obtain the titles, for example Chapter 1 "9"
    preg_match_all('/<a href="[0-9]{1,}\.shtm"(.*?)<\/a>/is', $file, $title);
    $count = count($url[0]);
    $array = array();
    // Use < rather than <= so the loop stops at the last match
    for ($i = 0; $i < $count; $i++) {
        $u = cut($url[0][$i], $ustart, $uend);
        $t = cut($title[0][$i], $tstart, $tend);
        $array[$u] = $t;
    }

$array now maps the address of every chapter to its title. At this point the collector is half finished. What remains is to open each chapter address in a loop, read the page, and match its content. That part is relatively simple and is not described in detail here. That's all for today. This is the first time I have written such a long article, so there are bound to be problems with how it is organized. Please forgive me!
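The remaining step described above, opening each chapter address in a loop and matching its content, might look like the sketch below. Everything site-specific here is an assumption: the helper names collect_chapters() and extract_content(), the <div id="content"> wrapper, and the sample markup are invented for illustration. The commented-out lines show where file_get_contents() would fetch each chapter from a hypothetical $baseUrl.

```php
<?php
// Hypothetical helper: build a [chapter file => title] map from an index page,
// capturing the address and the title with a single pattern.
function collect_chapters($html) {
    $chapters = array();
    preg_match_all('/<a href="([0-9]+\.shtm)"[^>]*>(.*?)<\/a>/is', $html, $m, PREG_SET_ORDER);
    foreach ($m as $match) {
        $chapters[$match[1]] = trim(strip_tags($match[2]));
    }
    return $chapters;
}

// Hypothetical helper: pull the chapter text out of a chapter page.
// The <div id="content"> wrapper is an assumed layout.
function extract_content($html) {
    if (preg_match('/<div id="content">(.*?)<\/div>/is', $html, $m)) {
        return trim(strip_tags($m[1]));
    }
    return '';
}

// Sample index markup standing in for a downloaded chapter list.
$index = '<a href="123.shtm">Chapter 1</a><a href="2342.shtm">Chapter 2</a>';
$chapters = collect_chapters($index);

foreach ($chapters as $file => $title) {
    echo $file, ' => ', $title, "\n";
    // In a real run, with $baseUrl pointing at the book's directory:
    // $page = file_get_contents($baseUrl . $file);
    // $text = $page === false ? '' : extract_content($page);
}

echo extract_content('<div id="content"><p>Once upon a time.</p></div>'), "\n";
```

Capturing the address and the title in one match keeps each pair together, so a link that fails to match cannot shift titles onto the wrong chapter.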

