In-depth Explanation of PHP Data Collection

Last Update:2013-10-16 Source: Internet

Author: User


This article introduces two useful PHP data collection tools: Snoopy and simple_html_dom. There are many ways to collect data (in essence only two or three; the rest are derived from them), and PHP can also collect data directly with several built-in methods. Still, in the spirit of taking laziness all the way, we can use these two tools to make collection easier.

Snoopy is already well covered online. Below is the Snoopy documentation, translated by someone else.
//////////////////////////////////////// //////////////////////
Snoopy is a PHP class that simulates a web browser. It can fetch web page content and submit forms.
Some features of Snoopy:
1. Fetches the content of a web page (fetch)
2. Fetches the text of a web page with the HTML tags removed (fetchtext)
3. Fetches the links and forms of a web page (fetchlinks, fetchform)
4. Supports proxy hosts
5. Supports basic username/password authentication
6. Supports setting user_agent, referer, cookies, and header content
7. Supports browser redirects and lets you control the redirect depth
8. Expands links in the page into fully qualified URLs (default)
9. Submits data and retrieves the returned values
10. Supports following HTML frames
11. Supports passing cookies on redirects
PHP 4 or later is required. Because Snoopy is a plain PHP class that needs no extensions, it is the best choice when the server does not support curl.
Class methods:
fetch($URI)
----
This method fetches the content of a web page.
$URI is the URL of the page to fetch.
The fetched result is stored in $this->results.
If you are fetching a frameset, Snoopy follows each frame and stores the results in an array, which is then saved to $this->results.
fetchtext($URI)
-----
This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the text content of the page.
fetchform($URI)
-----
This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the form elements of the page.
fetchlinks($URI)
------
This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the links in the page.
By default, relative links are automatically completed and converted to full URLs.
submit($URI, $formvars)
--------
This method submits a form to the URL specified by $URI. $formvars is an array that holds the form parameters.
submittext($URI, $formvars)
---------
This method is similar to submit(). The only difference is that it removes HTML tags and other irrelevant data and returns only the text content of the page that comes back after the submission.
submitlinks($URI, $formvars)
------
This method is similar to submit(). The only difference is that it removes HTML tags and other irrelevant data and returns only the links in the page.
By default, relative links are automatically completed and converted to full URLs.
Class properties: (default values are in parentheses)
$host: the host to connect to
$port: the port to connect to
$proxy_host: the proxy host to use, if any
$proxy_port: the proxy port to use, if any
$agent: the user agent to masquerade as (Snoopy v0.1)
$referer: referer information, if any
$cookies: cookies, if any
$rawheaders: other header information, if any
$maxredirs: maximum number of redirects, 0 = not allowed (5)
$offsiteok: whether to allow redirects off-site (true)
$expandlinks: whether to expand all links into fully qualified URLs (true)
$user: authentication username, if any
$pass: authentication password, if any
$accept: HTTP accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error: where errors are reported, if any
$response_code: response code returned from the server
$headers: headers returned from the server
$maxlength: maximum length of the returned data
$read_timeout: timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out: true if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes: maximum number of frames that will be followed
$status: HTTP status of the fetch
$temp_dir: temporary directory that the web server can write to (/tmp)
$curl_path: system path to the cURL binary; set to false if no cURL binary is available
The following is a demo:
include "Snoopy.class.php";
$snoopy = new Snoopy;
// Route the request through a proxy
$snoopy->proxy_host = "www.7767.cn";
$snoopy->proxy_port = "8080";
// Masquerade as a browser and set request headers and cookies
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.7767.cn/";
$snoopy->cookies["SessionID"] = "238472821323489";
$snoopy->cookies["favoriteColor"] = "RED";
$snoopy->rawheaders["Pragma"] = "no-cache";
// Redirect handling and link expansion
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;
// Basic authentication credentials
$snoopy->user = "joe";
$snoopy->pass = "bloe";
if ($snoopy->fetchtext("http://www.7767.cn"))
{
    echo "<PRE>" . htmlspecialchars($snoopy->results) . "</PRE>\n";
}
else
{
    echo "error fetching document: " . $snoopy->error . "\n";
}

//////////////////////////////////////// //////////////////////
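The link completion described for fetchlinks() and $expandlinks can be illustrated in plain PHP. The sketch below is not Snoopy's own code: expand_link() is a hypothetical helper, the example.com URLs are made up, and PHP 7 or later is assumed. It only handles the common cases (absolute, protocol-relative, root-relative, and path-relative links) and does not treat query strings specially.

```php
<?php
// Hypothetical helper (not part of Snoopy): resolve $link against $base.
function expand_link(string $base, string $link): string
{
    // Already absolute: the link starts with a scheme such as "http:".
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $port   = isset($parts['port']) ? ':' . $parts['port'] : '';
    $root   = $scheme . '://' . ($parts['host'] ?? '') . $port;

    // Protocol-relative link: reuse the scheme of the base URL.
    if (strpos($link, '//') === 0) {
        return $scheme . ':' . $link;
    }

    if (strpos($link, '/') === 0) {
        // Root-relative link: keep only scheme, host, and port from the base.
        $path = $link;
    } else {
        // Path-relative link: replace the last segment of the base path.
        $basePath = $parts['path'] ?? '/';
        $path = substr($basePath, 0, strrpos($basePath, '/') + 1) . $link;
    }

    // Collapse "." and ".." segments without climbing above the root.
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    return $root . implode('/', $out);
}

$base = 'http://www.example.com/news/index.html';
echo expand_link($base, 'item.php?id=3'), "\n"; // http://www.example.com/news/item.php?id=3
echo expand_link($base, '../img/a.png'), "\n";  // http://www.example.com/img/a.png
echo expand_link($base, '/about'), "\n";        // http://www.example.com/about
```

With $expandlinks set to false, as in the demo, links come back exactly as they appear in the page, so a helper of this kind would be needed before following them.
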
Snoopy's strength is that it is big and comprehensive. A single fetch retrieves the page, which is the first step of collection. Next, we use simple_html_dom to pick out exactly the part we want. Of course, if you are particularly good at regular expressions and enjoy writing them, you can also use regular expressions to match and capture the content.
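
As a rough illustration of the regular-expression route, the sketch below pulls links out of an HTML string with preg_match_all. The $page markup is made up and stands in for whatever a fetch returned. The pattern assumes double-quoted href attributes, so it is only suitable for simple, predictable markup.

```php
<?php
// Stand-in for fetched page content; the markup is invented for illustration.
$page = '<ul><li><a href="/news/1.html">First</a></li>'
      . '<li><a class="hot" href="/news/2.html"><b>Second</b></a></li></ul>';

// Capture the href value and the inner HTML of every <a> tag.
preg_match_all(
    '#<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>#is',
    $page,
    $matches,
    PREG_SET_ORDER
);

$links = [];
foreach ($matches as $m) {
    // $m[1] is the URL, $m[2] is the link body; strip any nested tags.
    $links[] = ['href' => $m[1], 'text' => strip_tags($m[2])];
}

print_r($links);
```

Patterns like this tend to break as soon as the markup changes (single quotes, unusual attribute order, nested links), which is one reason a DOM parser is usually the more robust choice.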

simple_html_dom is essentially a DOM parser. PHP also provides some built-in parsing methods, but simple_html_dom is more specialized: it is a single class that covers most of the features you are likely to want.
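
For comparison, here is a minimal sketch of PHP's built-in parsing using the DOM extension (DOMDocument and DOMXPath). The $page markup and the targetclass name are made up for illustration, and the DOM extension is assumed to be enabled, which it is in most PHP builds.

```php
<?php
// Stand-in for fetched page content; the markup is invented for illustration.
$page = '<html><body><div class="targetclass">'
      . '<img src="a.png"><a href="/x">X</a>'
      . '</div></body></html>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($page);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect the src attribute of every image.
$srcs = [];
foreach ($xpath->query('//img') as $img) {
    $srcs[] = $img->getAttribute('src');
}
print_r($srcs);

// Select the first div whose class attribute is exactly "targetclass".
$div = $xpath->query('//div[@class="targetclass"]')->item(0);
echo $doc->saveHTML($div), "\n";
```

The built-in route works, but it needs XPath expressions where simple_html_dom accepts CSS-style selectors such as 'div.targetclass'.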
//////////////////////////////////////// ////////////////////////
// Create a document object from a URL or file name, i.e. the target web page
$html = file_get_html('http://www.7767.cn/');
// $html = file_get_html('test.htm');
// Alternatively, use a string as the target page. You can fetch the page with Snoopy and pass the result in here.
// The markup below is only a placeholder.
$myhtml = str_get_html('<html><body>Hello!</body></html>');
// Find all images; find() returns an array
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';

The find method is very useful. Called with a selector alone, it returns an array of element objects. When looking for a target element, you can select it by its class, id, or other attributes.

// Find the div by its class attribute. The second argument of find() is the index of the element to return, counting from 0, so 0 returns the first match.
$target_div = $html->find('div.targetclass', 0);
// Check whether the result is what you want by simply echoing it.
echo $target_div;

// Important: the object must be destroyed once you are done with it. Otherwise the PHP page may hang for about 30 seconds, depending on the time limit configured on your server. Destroy it as follows:
$html->clear();
unset($html);
What I like about simple_html_dom is that it lets you work with the collected page as easily as you would with JavaScript. The download package contains an English manual:
Simplehtmldom_000011/simplehtmldom/manual/manual.htm

Array $ e-> getAllAttributes ()	Array $ e->Attr
String $ e-> getAttribute ($ Name)	String $ e->Attribute
Void $ e-> setAttribute ($ Name, $ value)	Void $ value = $ e->Attribute
Bool $ e-> hasAttribute ($ Name)	Boolisset ($ e->Attribute)
Void $ e-> removeAttribute ($ Name)	Void $ e->Attribute= Null
Element $ e-> getElementById ($ Id)	Mixed $ e-> find ("# $ Id", 0)
Mixed $ e-> getElementsById ($ Id [, $ index])	Mixed $ e-> find ("# $ Id" [, int $ index])
Element $ e-> getElementByTagName ($ Name)	Mixed $ e-> find ($ Name, 0)
Mixed $ e-> getElementsByTagName ($ Name [, $ index])	Mixed $ e-> find ($ Name [, int $ index])
Element $ e-> parentNode ()	Element $ e-> parent ()
Mixed $ e-> childNodes ([$ Index])	Mixed $ e-> children ([Int $ index])
Element $ e-> firstChild ()	Element $ e-> first_child ()
Element $ e-> lastChild ()	Element $ e-> last_child ()
Element $ e-> nextSibling ()	Element $ e-> next_sibling ()
Element $ e-> previussibling ()	Element $ e-> prev_sibling ()

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
