Crawling and analyzing pages in PHP

Source: Internet
Author: User

Before crawling, remember to set max_execution_time in php.ini to a large value, or the script will time out with an error.
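As a minimal sketch of that setting (assuming you run from the CLI or can edit php.ini; the value 0 disables the limit entirely):

```php
<?php
// Raise the execution time limit before a long crawl.
// In php.ini:  max_execution_time = 0
// Or at runtime (0 means no limit at all):
set_time_limit(0);
ini_set('max_execution_time', '0');

echo ini_get('max_execution_time'); // prints 0
```

Either approach works for a long-running crawl; the runtime calls are handy when you cannot touch php.ini.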

First, crawl the page with Snoopy.class.php

A very cute class name. It is also quite powerful: it simulates browser behavior and can fetch page content, submit forms, and so on.

1) I am going to crawl a site's list page: the national hospital information listing.

2) I copied the list URL and used the Snoopy class to crawl the first 10 pages, saving each page's content to a local HTML file for later analysis.

$snoopy = new Snoopy(); // hospital list pages
for ($i = 1; $i <= 10; $i++) {
    $url = 'http://www.guahao.com/hospital/areahospitals?p=national&pageno=' . $i;
    $snoopy->fetch($url);
    file_put_contents("web/page/$i.html", $snoopy->results);
}
echo 'Success';


3) Strangely, what came back was not the national listing but Shanghai-specific content.

4) I suspected the site was setting a cookie, so I inspected the requests with Firebug, and sure enough, there it was.

5) I added the same cookie value to the request with $snoopy->cookies["_area_"], and the result was completely different: the national information came back correctly.

$snoopy = new Snoopy(); // hospital list pages
$snoopy->cookies["_area_"] = '{"Provinceid":"All","Provincename":"National","Cityid":"All","CityName":"Unlimited"}';
for ($i = 1; $i <= 10; $i++) {
    // ... crawl the page content ...
}

Second, analyze the pages with phpQuery.php

This class lets you manipulate the DOM the way jQuery does, so there is no need for headache-inducing regular expressions to extract content from a page. Its usage is almost identical to jQuery, except that $() is replaced by pq(). The project homepage is hosted on Google Code, which is blocked here, so reaching it requires a wall-climbing (FQ) tool; I attach a recent version of one below. Usage is simple: just run the Fg742p.exe file.
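If you cannot obtain phpQuery, the same extraction can be sketched with PHP's built-in DOM extension. This is only an illustrative alternative, not the article's method; the sample HTML below is a hypothetical fragment shaped like the selector '.search-hos-info dl dt a' used later in the article:

```php
<?php
// Sketch: extract hospital detail links the way pq('.search-hos-info dl dt a')
// would, but with DOMDocument/DOMXPath (no phpQuery dependency).
$html = <<<HTML
<div class="search-hos-info"><dl><dt>
  <a href="/hospital/301">301 Hospital</a>
</dt></dl></div>
HTML;

$doc = new DOMDocument();
@$doc->loadHTML($html);          // suppress warnings from loose HTML
$xpath = new DOMXPath($doc);

// CSS '.search-hos-info dl dt a' translated to XPath by hand:
$nodes = $xpath->query(
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' search-hos-info ')]//dl//dt//a"
);

$urls = array();
foreach ($nodes as $a) {
    $urls[] = $a->getAttribute('href');
}
print_r($urls); // Array ( [0] => /hospital/301 )
```

phpQuery is more convenient because the CSS selector needs no hand translation, but the built-in DOM classes ship with every standard PHP install.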

1) I want to extract the information for each specific hospital on every page, for example the entry for the 301 Hospital.

Each page's content was already fetched to a local file in the previous step, so the pages can be read directly from localhost.

$snoopy = new Snoopy();
for ($i = 1; $i <= 10; $i++) {
    $url = "http://localhost/spider/web/page/$i.html";
    $snoopy->fetch($url);
    $html = $snoopy->results;
}


2) Use phpQuery to read node information out of the DOM structure.

Combined with the DOM structure, a few phpQuery calls read out the URL of each hospital's detail page.

for ($i = 1; $i <= 10; $i++) {
    // ... crawl the local page ...
    phpQuery::newDocument($html); // initialize the document object
    $urls = array();
    foreach (pq('.search-hos-info dl dt a') as $item) {
        array_push($urls, pq($item)->attr('href')); // hospital detail URL
    }
}


3) Fetch each detail page according to the list of URLs just read.

$detailIndex = 1;
for ($i = 1; $i <= 10; $i++) {
    // ... crawl the local page ...
    // ... read the href of each target node ...
    $len = count($urls);
    for ($row = 1; $row <= $len; $row++) {
        $snoopy->fetch($urls[$row - 1]);
        file_put_contents("web/hospital/$detailIndex.html", $snoopy->results);
        $detailIndex++;
    }
}

FQ Tools Download:

Climb obstacles.rar

Demo Download:

http://download.csdn.net/detail/loneleaf1/8089507

Some notes about the Snoopy class:

Class methods

fetch($URI) fetches the content of a web page. The $URI parameter is the URL of the page to crawl.
The result of the fetch is stored in $this->results.
If the page contains frames, Snoopy fetches each frame and stores the results as an array in $this->results.
fetchtext($URI) is like fetch(), except that it strips HTML tags and other irrelevant data, returning only the text content of the page.
fetchform($URI) is like fetch(), except that it strips HTML tags and other irrelevant data, returning only the form content (form) of the page.
fetchlinks($URI) is like fetch(), except that it strips HTML tags and other irrelevant data, returning only the links in the page.
By default, relative links are automatically expanded into fully qualified URLs.
submit($URI, $formvars) submits a form to the specified URL. $formvars is an array of form parameters.
submittext($URI, $formvars) is like submit(), except that it strips HTML tags and other irrelevant data, returning only the text content of the page after submission.
submitlinks($URI) is like submit(), except that it strips HTML tags and other irrelevant data, returning only the links in the resulting page.
By default, relative links are automatically expanded into fully qualified URLs.
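The "expand relative links into full URLs" behavior mentioned above can be sketched roughly as follows. expand_link() is a hypothetical helper written for illustration, not part of Snoopy, and it ignores details such as ../ segments and query strings:

```php
<?php
// Hypothetical sketch of $expandlinks-style completion: join a page's base
// URL with a relative href (simplified; no ../ or query-string handling).
function expand_link($base, $href) {
    if (preg_match('#^https?://#i', $href)) {
        return $href;                          // already absolute
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    $path  = isset($parts['path']) ? $parts['path'] : '/';
    if ($href !== '' && $href[0] === '/') {
        return $root . $href;                  // site-root relative
    }
    // relative to the base URL's directory
    $dir = preg_replace('#/[^/]*$#', '/', $path);
    return $root . $dir . $href;
}

echo expand_link('http://www.guahao.com/hospital/areahospitals', '/hospital/301');
// http://www.guahao.com/hospital/301
```

This is why the fetchlinks() results in the crawl above can be passed straight back to fetch() without further processing.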

Class properties

$host: the host to connect to
$port: the port to connect to
$proxy_host: the proxy host to use, if any
$proxy_port: the proxy port to use, if any
$agent: the user agent to masquerade as (Snoopy v0.1)
$referer: referer information, if any
$cookies: cookies, if any
$rawheaders: other headers to send, if any
$maxredirs: maximum number of redirects; 0 = none allowed (5)
$offsiteok: whether off-site redirects are allowed (true)
$expandlinks: whether to expand links into fully qualified URLs (true)
$user: authentication username, if any
$pass: authentication password, if any
$accept: HTTP Accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error: where errors are reported, if any
$response_code: response code returned by the server
$headers: headers returned by the server
$maxlength: maximum length of data to return
$read_timeout: read-operation timeout (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out: true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes: maximum number of frames to follow
$status: HTTP status of the fetched page
$temp_dir: temporary-file directory the web server can write to (/tmp)
$curl_path: path to the cURL binary; set to false if there is no cURL binary
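Pulling the properties above together, a typical setup before a crawl might look like the following configuration fragment. This is a sketch only: the values are illustrative, and Snoopy.class.php must be available on the include path:

```php
<?php
require 'Snoopy.class.php';

$snoopy = new Snoopy();
$snoopy->agent        = 'Mozilla/5.0 (compatible; MyCrawler/1.0)'; // spoofed UA
$snoopy->referer      = 'http://www.guahao.com/';
$snoopy->maxredirs    = 5;     // follow up to 5 redirects
$snoopy->expandlinks  = true;  // expand relative links into full URLs
$snoopy->read_timeout = 30;    // seconds; 0 = no timeout
$snoopy->cookies['_area_'] = '{"Provinceid":"All","Provincename":"National"}';
```

Setting a realistic user agent and referer up front avoids the kind of region-filtered responses seen earlier in the article.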

