I have a CMS site bookmarked and occasionally download some resources from it; veterans will know what I mean :-D. Then I got busy for a few days, a backlog piled up, and I thought: damn, this is a hassle, can't I write a script to do it for me automatically? Then it hit me: isn't this exactly what people call a crawler? I got rather excited.
Since I'm still a beginner and only know PHP, PHP is what I used. The site's home page is a list of thumbnails and entry links for the various resources. I send a GET request with my own wrapped cURL function and collect all the detail-page links, then check whether the link recorded as the anchor of the previous run is among them. If it isn't, I go on to request the next page; if it is, I stop. The script then iterates through the array of collected links and requests each one in turn. A detail page contains both large and small images; I only want the large ones, so the small ones are filtered out, and PHP's powerful file_get_contents and file_put_contents functions take care of the rest. Well, talk is cheap, so here is my code.
<?php

// Load the wrapped cURL request function
require "../curl_request.php";

class grab {

    // Website home page
    public $url = "/portal.php";
    // Matching rule for picture detail pages
    private $content_preg = '/\/content-\d{4}-1-1\.html/i';
    // Next page URL
    private $page = "https://www.xibixibi.com/portal.php?page=";
    // Matching rule for large images
    private $bigPic_preg = '/\/data\/attachment\/forum\/20\d{2}[01]\d{1}\/[0123]\d{1}\/[a-zA-Z0-9]{22}\.(jpg|png)/';
    // Detail URL saved by the previous run
    public $lastSave = "";
    // Root directory for saved pictures
    public $root = "e:/root/";
    // Number of calls to the grabDetailSites method
    private $count = 0;
    // Collected picture detail URLs
    public $gallery = array();

    /**
     * Constructor
     */
    public function __construct() {
        // No execution time limit; a crawl can take a while
        set_time_limit(0);
    }

    /**
     * Crawl all the detail page links
     * @param string $url website URL
     */
    public function grabDetailSites($url = "") {
        // Send the request
        $result = getRequest($url);
        // Match detail page URLs
        preg_match_all($this->content_preg, $result, $matches, PREG_PATTERN_ORDER);
        // Deduplicate; reindex so that keys equal positions for array_slice
        $matches = array_values(array_unique($matches[0]));
        // Stop if the page has no links, otherwise the recursion never ends
        if (empty($matches)) {
            return TRUE;
        }
        // Drop the trailing links at the bottom of the page (keep the first 12)
        if (count($matches) > 12) {
            $matches = array_slice($matches, 0, 12);
        }
        // Check whether the detail URL saved by the previous run is on this page
        $offset = array_search($this->lastSave, $matches);
        // On the first call, save the newest detail URL as the next anchor
        if ($this->count == 0) {
            file_put_contents("./lastsave.txt", $matches[0]);
        }
        ++$this->count;
        if ($offset !== FALSE) {
            // Anchor found: keep only the links before it, then stop
            $matches = array_slice($matches, 0, $offset);
            $this->gallery = array_merge($this->gallery, $matches);
            return TRUE;
        } else {
            // Otherwise keep everything and recurse into the next page
            $this->gallery = array_merge($this->gallery, $matches);
            $this->grabDetailSites($this->page . ($this->count + 1));
            return TRUE;
        }
    }

    /**
     * Download the large images of each gallery from its detail URL
     */
    public function grabBigPic() {
        // Loop over the gallery detail URLs
        foreach ($this->gallery as $key => $value) {
            // Get the URLs of the large images
            $result = getRequest($value);
            preg_match_all($this->bigPic_preg, $result, $matches);
            $matches = array_unique($matches[0]);
            // Loop to fetch the data of each large image
            foreach ($matches as $key1 => $value1) {
                $pic = getRequest($value1);
                // Subdirectory named after the current year and month
                $month = date("y/m/");
                if (!is_dir($this->root . $month)) {
                    // The mode must be octal: 0777, not 777
                    mkdir($this->root . $month, 0777, TRUE);
                }
                // Save the picture data
                file_put_contents($this->root . $month . basename($value1), $pic);
            }
        }
    }

    /**
     * Sort old picture files into year/month directories
     */
    public function sortPic() {
        $allPics = scandir($this->root);
        foreach ($allPics as $key => $value) {
            // Skip ".", ".." and the date directories themselves
            if (!is_file($this->root . $value)) {
                continue;
            }
            $time = date("y/m/", filemtime($this->root . $value));
            if (!is_dir($this->root . $time)) {
                mkdir($this->root . $time, 0777, TRUE);
            }
            // Move the file
            rename($this->root . $value, $this->root . $time . $value);
        }
    }

    public function __set($key, $value) {
        $this->$key = $value;
    }
}
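The stopping rule at the heart of grabDetailSites can be tried out without touching the network. What follows is a minimal sketch, not the code from this post: it swaps the recursion for a loop capped at a hypothetical $maxPages, takes the fetcher as a callable standing in for getRequest(), and assumes that page 1 is the home page. The name collectNewLinks and the fake page data are made up for illustration.

```php
<?php
/**
 * Collect detail-page links page by page until the link saved by the
 * previous run ($lastSave) shows up, or until $maxPages have been read.
 *
 * $fetch is any callable that takes a URL and returns the page body as a
 * string (a stand-in for getRequest()). The function name, $pageUrl and
 * $maxPages are illustrative assumptions.
 */
function collectNewLinks(callable $fetch, $pageUrl, $pattern, $lastSave, $maxPages = 50)
{
    $links = array();
    for ($page = 1; $page <= $maxPages; $page++) {
        $html = $fetch($pageUrl . $page);
        preg_match_all($pattern, (string) $html, $m);
        // Deduplicate and reindex so that keys equal positions.
        $found = array_values(array_unique($m[0]));
        if (empty($found)) {
            break; // no links on this page: we ran past the last one
        }
        $offset = array_search($lastSave, $found, true);
        if ($offset !== false) {
            // Keep only the links that come before the saved anchor.
            return array_merge($links, array_slice($found, 0, $offset));
        }
        $links = array_merge($links, $found);
    }
    return $links;
}

// Usage with a fake two-page site instead of real HTTP requests.
$pages = array(
    'p=1' => '<a href="/content-1005-1-1.html"></a><a href="/content-1004-1-1.html"></a>',
    'p=2' => '<a href="/content-1003-1-1.html"></a><a href="/content-1002-1-1.html"></a>',
);
$fakeFetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : '';
};
$pattern = '/\/content-\d{4}-1-1\.html/i';

// The anchor sits at the top of page 2, so only page 1's links are new.
$new = collectNewLinks($fakeFetch, 'p=', $pattern, '/content-1003-1-1.html');
print_r($new);
```

Passing the fetcher in as a callable keeps the rule testable; in real use the argument would be the name of the actual request helper. The page cap is a safety net in case the anchor has disappeared from the site.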
Because the site is not very complex, this approach is fairly simple. I originally wanted to run it as a scheduled task, but that will have to wait until I switch my old machine over to Ubuntu.
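For a scheduled run to be incremental, each invocation has to pick up the anchor that the previous one wrote to lastsave.txt. The class writes that file, but how it is read back is not shown here. A minimal sketch of that step, with hypothetical helper names loadLastSave and storeLastSave and a temporary file standing in for lastsave.txt:

```php
<?php
// Hypothetical helpers: only the writing of lastsave.txt appears in the post.

/** Read the anchor URL left by the previous run ('' if there is none yet). */
function loadLastSave($file)
{
    return is_file($file) ? trim((string) file_get_contents($file)) : '';
}

/** Store the newest detail URL; LOCK_EX takes an exclusive lock while writing. */
function storeLastSave($file, $url)
{
    return file_put_contents($file, $url, LOCK_EX) !== false;
}

// Demo on a temporary file.
$file = tempnam(sys_get_temp_dir(), 'lastsave');
var_dump(loadLastSave($file));                    // empty file: no anchor yet
storeLastSave($file, "/content-1005-1-1.html\n");
var_dump(loadLastSave($file));                    // anchor, newline trimmed
unlink($file);
```

An entry script could then assign the loaded value to the public lastSave property before crawling. On Ubuntu, a crontab line of the form `0 3 * * * php /path/to/crawl.php` (the path is a placeholder) would run such a script every night at 03:00.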
A preliminary study of web crawlers: PHP