Crawling and analyzing pages in PHP

Source: Internet
Author: User

Before crawling, remember to set max_execution_time in php.ini to a large value, or the script will time out with an error.
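As a minimal sketch of that setting (assuming you run from the CLI or can edit php.ini; the value 0 disables the limit entirely):

```php
<?php
// Raise the execution time limit before a long crawl.
// In php.ini:  max_execution_time = 0
// Or at runtime (0 means no limit at all):
set_time_limit(0);
ini_set('max_execution_time', '0');

echo ini_get('max_execution_time'); // prints 0
```

Either approach works for a long-running crawl; the runtime calls are handy when you cannot touch php.ini.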

First, crawl the page with Snoopy.class.php

A very cute class name. It is also quite powerful: it simulates browser behavior and can fetch page content, submit forms, and so on.

1) I am going to crawl a site's list page: the national hospital information listing.

2) I copied the list URL and used the Snoopy class to crawl the first 10 pages, saving each page's content to a local HTML file for later analysis.

$snoopy = new Snoopy(); // hospital list pages
for ($i = 1; $i <= 10; $i++) {
    $url = 'http://www.guahao.com/hospital/areahospitals?p=national&pageno=' . $i;
    $snoopy->fetch($url);
    file_put_contents("web/page/$i.html", $snoopy->results);
}
echo 'Success';


3) Strangely, what came back was not the national listing but Shanghai-specific content.

4) I suspected the site was setting a cookie, so I inspected the requests with Firebug, and sure enough, there it was.

5) I added the same cookie value to the request with $snoopy->cookies["_area_"], and the result was completely different: the national information came back correctly.

$snoopy = new Snoopy(); // hospital list pages
$snoopy->cookies["_area_"] = '{"Provinceid":"All","Provincename":"National","Cityid":"All","CityName":"Unlimited"}';
for ($i = 1; $i <= 10; $i++) {
    // ... crawl the page content ...
}

Second, analyze the pages with phpQuery.php

This class lets you manipulate the DOM the way jQuery does, so there is no need for headache-inducing regular expressions to extract content from a page. Its usage is almost identical to jQuery, except that $() is replaced by pq(). The project homepage is hosted on Google Code, which is blocked here, so reaching it requires a wall-climbing (FQ) tool; I attach a recent version of one below. Usage is simple: just run the Fg742p.exe file.
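If you cannot obtain phpQuery, the same extraction can be sketched with PHP's built-in DOM extension. This is only an illustrative alternative, not the article's method; the sample HTML below is a hypothetical fragment shaped like the selector '.search-hos-info dl dt a' used later in the article:

```php
<?php
// Sketch: extract hospital detail links the way pq('.search-hos-info dl dt a')
// would, but with DOMDocument/DOMXPath (no phpQuery dependency).
$html = <<<HTML
<div class="search-hos-info"><dl><dt>
  <a href="/hospital/301">301 Hospital</a>
</dt></dl></div>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);          // suppress warnings from loose HTML
$xpath = new DOMXPath($doc);

// CSS '.search-hos-info dl dt a' translated to XPath by hand:
$nodes = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' search-hos-info ')]//dl//dt//a"
);

$urls = array();
foreach ($nodes as $a) {
    $urls[] = $a->getAttribute('href');
}
print_r($urls); // Array ( [0] => /hospital/301 )
```

phpQuery is more convenient because the CSS selector needs no hand translation, but the built-in DOM classes ship with every standard PHP install.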

1) I want to extract the information for each specific hospital on every page, for example the entry for the 301 Hospital.

Each page's content was already fetched to a local file in the previous step, so the pages can be read directly from localhost.

$snoopy = new Snoopy();
for ($i = 1; $i <= 10; $i++) {
    $url = "http://localhost/spider/web/page/$i.html";
    $snoopy->fetch($url);
    $html = $snoopy->results;
}


2) Use phpQuery to read node information out of the DOM structure.

Combined with the DOM structure, a few phpQuery calls read out the URL of each hospital's detail page.

for ($i = 1; $i <= 10; $i++) {
    // ... crawl the local page ...
    phpQuery::newDocument($html); // initialize the document object
    $urls = array();
    foreach (pq('.search-hos-info dl dt a') as $item) {
        array_push($urls, pq($item)->attr('href')); // hospital detail URL
    }
}


3) Fetch each detail page according to the list of URLs just read.

$detailIndex = 1;
for ($i = 1; $i <= 10; $i++) {
    // ... crawl the local page ...
    // ... read the href of each target node ...
    $len = count($urls);
    for ($row = 1; $row <= $len; $row++) {
        $snoopy->fetch($urls[$row - 1]);
        file_put_contents("web/hospital/$detailIndex.html", $snoopy->results);
        $detailIndex++;
    }
}

FQ Tools Download:

Climb obstacles.rar

Demo Download:

http://download.csdn.net/detail/loneleaf1/8089507

Some notes about the Snoopy class:

Class methods

fetch($URI) fetches the content of a web page. The $URI parameter is the URL of the page to crawl.
The result of the fetch is stored in $this->results.
If the page contains frames, Snoopy fetches each frame and stores the results as an array in $this->results.
fetchtext($URI) is like fetch(), except that it strips HTML tags and other irrelevant data, returning only the text content of the page.
fetchform($URI) is like fetch(), except that it strips HTML tags and other irrelevant data, returning only the form content (form) of the page.
fetchlinks($URI) is like fetch(), except that it strips HTML tags and other irrelevant data, returning only the links in the page.
By default, relative links are automatically expanded into fully qualified URLs.
submit($URI, $formvars) submits a form to the specified URL. $formvars is an array of form parameters.
submittext($URI, $formvars) is like submit(), except that it strips HTML tags and other irrelevant data, returning only the text content of the page after submission.
submitlinks($URI) is like submit(), except that it strips HTML tags and other irrelevant data, returning only the links in the resulting page.
By default, relative links are automatically expanded into fully qualified URLs.
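The "expand relative links into full URLs" behavior mentioned above can be sketched roughly as follows. expand_link() is a hypothetical helper written for illustration, not part of Snoopy, and it ignores details such as ../ segments and query strings:

```php
<?php
// Hypothetical sketch of $expandlinks-style completion: join a page's base
// URL with a relative href (simplified; no ../ or query-string handling).
function expand_link($base, $href) {
    if (preg_match('#^https?://#i', $href)) {
        return $href;                          // already absolute
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    if ($href !== '' && $href[0] === '/') {
        return $root . $href;                  // site-root relative
    }
    // relative to the base URL's directory
    $dir = preg_replace('#/[^/]*$#', '/', $path);
    return $root . $dir . $href;
}

echo expand_link('http://www.guahao.com/hospital/areahospitals', '/hospital/301');
// http://www.guahao.com/hospital/301
```

This is why the fetchlinks() results in the crawl above can be passed straight back to fetch() without further processing.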

Class properties

$host: the host to connect to
$port: the port to connect to
$proxy_host: the proxy host to use, if any
$proxy_port: the proxy port to use, if any
$agent: the user agent to masquerade as (Snoopy v0.1)
$referer: referer information, if any
$cookies: cookies, if any
$rawheaders: other headers to send, if any
$maxredirs: maximum number of redirects; 0 = none allowed (5)
$offsiteok: whether off-site redirects are allowed (true)
$expandlinks: whether to expand links into fully qualified URLs (true)
$user: authentication username, if any
$pass: authentication password, if any
$accept: HTTP Accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error: where errors are reported, if any
$response_code: response code returned by the server
$headers: headers returned by the server
$maxlength: maximum length of data to return
$read_timeout: read-operation timeout (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out: true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes: maximum number of frames to follow
$status: HTTP status of the fetched page
$temp_dir: temporary-file directory the web server can write to (/tmp)
$curl_path: path to the cURL binary; set to false if there is no cURL binary
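Pulling the properties above together, a typical setup before a crawl might look like the following configuration fragment. This is a sketch only: the values are illustrative, and Snoopy.class.php must be available on the include path:

```php
<?php
require 'Snoopy.class.php';

$snoopy = new Snoopy();
$snoopy->agent        = 'Mozilla/5.0 (compatible; MyCrawler/1.0)'; // spoofed UA
$snoopy->referer      = 'http://www.guahao.com/';
$snoopy->maxredirs    = 5;     // follow up to 5 redirects
$snoopy->expandlinks  = true;  // expand relative links into full URLs
$snoopy->read_timeout = 30;    // seconds; 0 = no timeout
$snoopy->cookies['_area_'] = '{"Provinceid":"All","Provincename":"National"}';
```

Setting a realistic user agent and referer up front avoids the kind of region-filtered responses seen earlier in the article.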

