PHP example: using curl and regular expressions to capture webpage data


Last Update:2017-05-13 Source: Internet

Author: User


This example uses curl and regular expressions to capture novels from the non-VIP chapters of a Chinese novel site (motie.com). You enter a novel ID and the novel is downloaded.
Dependency: curl
The example touches on curl, regular expressions, ajax and related techniques, and is suitable for beginners. For local testing, make sure the network is connected and the curl extension is enabled in PHP.
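Before running the example, it can help to confirm that curl is actually available. The following is a minimal sketch, not part of the original code; the helper name curlIsAvailable is hypothetical.

```php
<?php
// Hypothetical helper: reports whether the curl extension can be used.
function curlIsAvailable(): bool
{
    // extension_loaded() checks the list of loaded extensions;
    // function_exists() confirms that curl_init() is defined.
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curlIsAvailable()) {
    echo "curl is enabled\n";
} else {
    echo "curl is not enabled; enable the curl extension in php.ini\n";
}
```

If the second message appears, the spider class will fail at its first curl_init() call.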

SpiderTools.class.php

The code is as follows:

<?php
session_start();
// Encapsulate the logic in a class to enable automatic article capturing
# header("Refresh: 30; http://www.test.com:8080");
class SpiderTools {
    ////////////////////////////////////////////////////////////
    /* Input the article ID to parse the article title */
    ////////////////////////////////////////////////////////////
    public function getBookNameById($aid) {
        // Initialize curl
        $ch = curl_init();
        // URL
        $url = 'http://www.motie.com/book/' . $aid;
        if (is_numeric($aid)) {
            // Regular expression matching
            // (the opening <h1>/<a> tags of this pattern are not preserved in the source)
            $ru = "/\s*(.*)\s*<\/a>\s*<\/h1>/";
        } else {
            // Example title: The Family Survival path of the zombie outbreak _ Chapter 1 The zombie outbreak is updated for my friendly music ~ _ Iron grinding
            $ru = "/<title>(.*)<\/title>/";
        }
        // Set options, including the URL
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the content instead of printing it
        curl_setopt($ch, CURLOPT_HEADER, 0); // do not return header information
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 0);
        // Execute curl once and reuse the result
        $output = curl_exec($ch);
        // Stop with the error message if the request failed
        if ($output === false) {
            die(curl_error($ch));
        }
        // Release the curl handle
        curl_close($ch);
        $arr = array();
        preg_match_all($ru, $output, $arr);
        return $arr[1][0];
    }

    ////////////////////////////////////////////////////////////
    /* Input the ID to parse the article content */
    ////////////////////////////////////////////////////////////
    public function getBookContextById($aid) {
        // Start parsing the article; the ID has the form "bookId_chapterId"
        $ids = explode("_", $aid);
        $titleId = trim($ids[0]);
        $articleId = trim($ids[1]);
        $ch = curl_init();
        // Regular expression matching
        $ru = "/[\s\S]*<pre ondragstart=\"return false\" oncopy=\"return false;\" oncut=\"return false;\" oncontextmenu=\"return false\" class=\"note\" id=\"html_content_\d*\">[\s\S]*(.*)<\/pre>/ui";
        $url = 'http://www.motie.com/book/' . $aid;
        // Set options, including the URL
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the content instead of printing it
        curl_setopt($ch, CURLOPT_HEADER, 0); // do not return header information
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 0);
        // Execute curl once and reuse the result
        $output = curl_exec($ch);
        // Stop with the error message if the request failed
        if ($output === false) {
            die(curl_error($ch));
        }
        $arr = array();
        $arr2 = array();
        preg_match_all($ru, $output, $arr);
        curl_close($ch);
        # var_dump($arr);
        $s = $arr[0][0];
        $s = substr($s, 180);
        // The delimiter is missing in the source; "</pre>" is an assumption
        $arr2 = explode("</pre>", $s);
        return trim($arr2[0]);
    }

    ////////////////////////////////////////////////////////////
    /* Static method: generates the novel file; can be called directly */
    ////////////////////////////////////////////////////////////
    public static function createBookById($id) {
        if (!is_numeric($id)) {
            // A non-numeric ID ("bookId_chapterId") identifies a single chapter
            echo "init begin start write!";
            $st = new self();
            $cons = $st->getBookContextById($id);
            $title = $st->getBookNameById($id);
            $cons = trim($cons);
            // Split on the first space, which leaves "bookName_chapter" in $t[0]
            $t = explode(" ", $title);
            // Construct a directory
            $dir = explode("_", $t[0]);
            $wzdir = $dir[0]; // the name of the book as the directory name
            $wzchapter = $dir[1]; // chapter
            // Create a directory.
            // Directory encoding: keep $wzdir itself unconverted, because it is
            // reused below to build the file name and must not be encoded twice
            $wzdir2 = iconv("UTF-8", "GBK", $wzdir);
            if (!file_exists($wzdir2)) {
                mkdir($wzdir2); // create a directory
            }
            // Construct a file name
            $wztitle = "./" . $wzdir . "/" . "$t[0]" . ".txt";
            // Ensure that the name of the saved file is not garbled
            $wztitle = iconv("UTF-8", "GBK", $wztitle);
            $f = fopen($wztitle, "w+");
            fwrite($f, $cons);
            echo "$wzdir" . $wzchapter . "Write successful";
            fclose($f);
        } else {
            // A numeric ID identifies a whole book: download every chapter
            $ids = self::getBookIdsById($id);
            // The server may go offline, so record the loop position in the session
            $key = $id . "_fid";
            if (!isset($_SESSION[$key])) {
                $_SESSION[$key] = 0; // start from the first chapter
            }
            # for ($i = $_SESSION[$key]; $i < count($ids); $_SESSION[$key]++, $i++) {
            #     self::createBookById($id . "_" . $ids[$_SESSION[$key]++]); // construct the id
            # }
            for ($i = $_SESSION[$key]; $i < count($ids); $_SESSION[$key]++, $i++) {
                self::createBookById($id . "_" . $ids[$i]); // construct the id
            }
            # echo "<br/> the write operation is complete";
            # echo $id . "_" . $ids[0] . " ";
            # var_dump($ids);
        }
    }

    /*
     * Obtain all chapter IDs of the novel
     * @param $aid article ID
     * @return array
     */
    public static function getBookIdsById($aid) {
        $ch = curl_init();
        $url = 'http://www.motie.com/book/' . $aid . "/chapter";
        // Note the ? after the quantifiers: it makes them match as little as possible.
        // (the <a ...> part of this pattern, including the capture group for the
        // chapter ID, is not preserved in the source)
        $ru = '/[\s\S]*?<li class=\"\" createdate=\"\d{4}\-\d{2}\-\d{2} \d{2}:\d{2}:\d{2}\">[\s\S]*?.*?<\/a>.*?/u';
        // Set options, including the URL
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the content instead of printing it
        curl_setopt($ch, CURLOPT_HEADER, 0); // do not return header information
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 0);
        // Execute curl
        $output = curl_exec($ch);
        // Check for errors
        if (curl_errno($ch)) {
            echo 'curl error: ' . curl_error($ch);
        }
        // Release the curl handle
        curl_close($ch);
        $arr = array();
        preg_match_all($ru, $output, $arr, PREG_PATTERN_ORDER);
        return $arr[1];
    }
}
?>

getinfo.php

The code is as follows:

<?php
session_start();
require_once("SpiderTools.class.php");
if (!empty($_REQUEST["bid"])) {
    if (is_numeric($_REQUEST["bid"])) {
        SpiderTools::createBookById(trim($_REQUEST["bid"]));
    } else {
        echo "enter the correct article ID";
    }
}
?>

index.html

The code is as follows:

<html>
<head>
<meta charset="UTF-8"/>
<title>download novels</title>
</head>

Enter the ID number of the novel you want to see to download the novel.
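The chapter-list pattern in getBookIdsById relies on lazy quantifiers (the ? after *) to obtain the shortest match. The difference can be sketched as follows. The sample markup and the capture group are assumptions made for illustration, not the site's real page structure.

```php
<?php
// Hypothetical chapter-list markup; the real page may look different.
$html = '<li createdate="2014-01-01 10:00:00"><a href="/book/100_201">Chapter 1</a></li>'
      . '<li createdate="2014-01-02 10:00:00"><a href="/book/100_202">Chapter 2</a></li>';

// Lazy: .*? stops at the first </a>, so every link produces its own match.
preg_match_all('/<a href="\/book\/\d+_(\d+)">.*?<\/a>/u', $html, $lazy);

// Greedy: .* runs to the last </a>, so both links collapse into one match.
preg_match_all('/<a href="\/book\/\d+_(\d+)">.*<\/a>/u', $html, $greedy);

print_r($lazy[1]);   // chapter IDs captured by the lazy pattern
print_r($greedy[1]); // chapter IDs captured by the greedy pattern
```

With this sample the lazy pattern captures both chapter IDs, while the greedy pattern captures only the first, which is why the chapter-list pattern needs the ? to return one entry per chapter.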

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
