Data collection (scraping) means using a PHP program to crawl information from other sites and store it in our own database or site. This article shares several ways to implement data collection in PHP, in the hope that it helps.
First, the techniques for building a collector in PHP:
From low-level sockets up to high-level file functions, there are three ways to implement collection.
1. Collecting with sockets:
A socket is the lowest level: it only establishes a connection, and we have to construct the HTTP request string ourselves and send it.
For example, to get the content of the page http://tv.youku.com/?spm=a2hww.20023042.topNav.5~1~3!2~A, the socket code is as follows:
<?php
// Open the connection. $errno is the error number, $errstr is the error message,
// and 30 is the connection timeout in seconds.
$fp = fsockopen("tv.youku.com", 80, $errno, $errstr, 30);
if (!$fp) {
    die("Connection failed: " . $errstr);
}

// Build the HTTP request string by hand: a socket is the lowest level and knows nothing about HTTP.
$http  = "GET /?spm=a2hww.20023042.topNav.5~1~3!2~A HTTP/1.1\r\n"; // \r\n ends the request line
$http .= "Host: tv.youku.com\r\n";    // the requested host
$http .= "Connection: close\r\n\r\n"; // close after the response; the last header is followed by two \r\n

// Send the request string to the server
fwrite($fp, $http, strlen($http));

// Receive the data returned by the server
$data = '';
while (!feof($fp)) {
    $data .= fread($fp, 4096); // fread reads up to 4096 bytes at a time
}

// Close the connection
fclose($fp);
var_dump($data);
?>
The printed result includes both the response headers and the source code of the page.
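Because the socket returns the raw response, the headers and the page source arrive in one string. Below is a minimal sketch of separating them, assuming the usual blank line (\r\n\r\n) between headers and body. The helper name splitResponse and the sample response are made up for illustration.

```php
<?php
/**
 * Split a raw HTTP response into its header block and its body.
 * splitResponse is a hypothetical helper, not a PHP built-in.
 */
function splitResponse(string $raw): array
{
    // The headers end at the first blank line; a limit of 2 keeps the body intact
    $parts = explode("\r\n\r\n", $raw, 2);
    return [
        'headers' => $parts[0],
        'body'    => $parts[1] ?? '',
    ];
}

// Made-up response standing in for the $data read from the socket
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>";
$response = splitResponse($raw);
echo $response['headers'], "\n";
echo $response['body'], "\n";
```

If the server answers with Transfer-Encoding: chunked, the body would still need its chunks decoded. That is one reason the higher-level methods are usually more convenient.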
2. Using the curl_* family of functions
cURL wraps the HTTP protocol in a set of functions. We only pass the appropriate parameters, which removes the need to write the HTTP request string by hand.
Prerequisite: the curl extension must be enabled in php.ini.
<?php
// Create a cURL handle
$curl = curl_init();
// Set the URL and the related options
curl_setopt($curl, CURLOPT_URL, "http://www.youku.com");
// Return the result of curl_exec() as a string instead of printing it directly
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
// Execute the request
$data = curl_exec($curl);
var_dump($data);
The printed result contains only the source code of the page, without the response headers.
3. Using file_get_contents directly (the highest level)
Prerequisite: allow_url_fopen must be enabled in php.ini so that URLs can be opened.
<?php
// Fetch the page with file_get_contents()
$data = file_get_contents("http://www.youku.com");
var_dump($data);
Choosing among the three methods
Network communication in PHP mainly uses these three methods, and the latter two are used most. To collect large amounts of data in bulk, use the second method (cURL): it performs well and is stable.
If you only send a few requests occasionally and not intensively, the third method (file_get_contents) is enough.
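For bulk collection, one common pattern is to create a single cURL handle and reuse it for every URL, so that cURL can reuse the connection to the same host. The sketch below is only an illustration: fetchAll is a hypothetical helper name, and the 10-second timeout is an assumed value.

```php
<?php
/**
 * Fetch several URLs one after another with a single cURL handle.
 * fetchAll is a hypothetical helper; the 10-second timeout is an assumption.
 * Returns an array mapping each URL to its body, or to false on failure.
 */
function fetchAll(array $urls): array
{
    $results = [];
    if (!$urls) {
        return $results; // nothing to fetch
    }
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_TIMEOUT, 10);
    foreach ($urls as $url) {
        // Only the URL changes between requests; the other options stay set
        curl_setopt($curl, CURLOPT_URL, $url);
        $results[$url] = curl_exec($curl); // false if the request failed
    }
    return $results;
}

// Usage:
// $pages = fetchAll(['http://www.youku.com']);
```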
Extension: how do you get around an image's hotlink protection?
For example, the site 7060 protects its images against hotlinking: the pictures display on its own pages, but they cannot be accessed from outside the site.
Principle: the HTTP protocol has a Referer header that gives the address the request came from. The server checks it, and if the request was not sent from its own site, it rejects the request.
Workaround: when you send the HTTP request, you can forge the Referer header.
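With cURL, the Referer header can be set through the CURLOPT_REFERER option. The following is a sketch rather than a definitive implementation: refererOptions and fetchWithReferer are hypothetical helper names, and the example.com URLs are placeholders.

```php
<?php
/**
 * Build cURL options for a request that claims to come from $referer.
 * refererOptions and fetchWithReferer are hypothetical helper names.
 */
function refererOptions(string $url, string $referer): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_REFERER        => $referer, // sent as the "Referer:" request header
        CURLOPT_RETURNTRANSFER => true,
    ];
}

function fetchWithReferer(string $url, string $referer)
{
    $curl = curl_init();
    curl_setopt_array($curl, refererOptions($url, $referer));
    return curl_exec($curl); // response body, or false on failure
}

// Usage (placeholder URLs):
// $image = fetchWithReferer('http://img.example.com/pic.jpg', 'http://www.example.com/');
```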
Extension: some data can only be collected after logging in. In that case you can simulate the logged-in state when collecting:
A. First log in with a browser. After login, the browser's cookie contains the session ID.
B. When PHP sends the HTTP request, put the session ID from the browser into the request's Cookie header, so that the request is made in the logged-in state.
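Steps A and B can be sketched with cURL's CURLOPT_COOKIE option. The code assumes the site stores the session in a cookie named PHPSESSID (PHP's default name; the real site may use another). sessionCookie and fetchLoggedIn are hypothetical helper names.

```php
<?php
/**
 * Format one cookie for the Cookie request header.
 * PHPSESSID is PHP's default session cookie name; the target site may use a different one.
 */
function sessionCookie(string $sessionId, string $name = 'PHPSESSID'): string
{
    return $name . '=' . $sessionId;
}

/**
 * Request $url while presenting the session ID copied from the browser.
 */
function fetchLoggedIn(string $url, string $sessionId)
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_COOKIE, sessionCookie($sessionId)); // sent as "Cookie: PHPSESSID=..."
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    return curl_exec($curl); // response body, or false on failure
}

// Usage (placeholder URL and session ID):
// $page = fetchLoggedIn('http://www.example.com/member', 'abc123');
```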
Summary: everything the client sends can be forged, so the program on the server must filter client data wherever necessary.
When are these techniques used? When developing against interfaces and when collecting data.
Second, a data collection example
For example, suppose I want to collect information about all the American movies listed at this URL:
http://list.youku.com/category/show/c_96_a_%E7%BE%8E%E5%9B%BD_s_1_d_1_p_3.html
First you need to know the structure of the HTML node that holds each movie. We use Firebug to inspect it.
Then start writing the code. The complete code is as follows:
<?php
/**
 * Send a GET request and return the response body.
 */
function get($url)
{
    global $curl;
    // Configure the request; see curl_setopt() in the PHP manual for the available options
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_HEADER, false);
    // Execute the request
    return curl_exec($curl);
}

// Create a cURL handle
$curl = curl_init();
$url = 'http://list.youku.com/category/show/c_96_a_%E7%BE%8E%E5%9B%BD_s_1_d_1_p_3.html';
$data = get($url);

// Match each movie's list item (U makes .+ ungreedy, so every <li> is matched separately)
$list_preg = '/<li class="yk-col4 mr1">.+<\/li>/Us';
// Match the src and alt of the img tag.
// NOTE: the original pattern was lost from the source text. This is an assumed
// reconstruction that keeps src in $img[1] and alt in $img[3], as the download code expects.
$img_preg = '/<img[^>]*?src="([^"]*)"([^>]*?)alt="([^"]*)"[^>]*>/u';
// Match the movie's URL and title
$video_preg = '/<a href="(.*)" title="(.*)" target="(.*)"><\/a>/U';

// Put all the <li> elements into $list, which is a two-dimensional array
preg_match_all($list_preg, $data, $list);
// var_dump($list);

foreach ($list[0] as $k => $v) {
    // $v is one <li> element
    // Get the picture and the movie name; the match is stored in $img
    preg_match($img_preg, $v, $img);
    // var_dump($img);
    // Get the movie address; the match is stored in $video
    preg_match($video_preg, $v, $video);
    // var_dump($video);
    echo $img[0] . '<a href="' . $video[1] . '">' . $video[2] . '</a>';
}
Test: print $list, $img, and $video in turn with var_dump() to check each match, then check the final output.
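To try the matching step without requesting the site, the patterns can be run against a small hand-written snippet. This is a sketch under assumptions: the HTML, URLs, and titles are invented to imitate the list items matched above, and the patterns use the U (ungreedy) modifier.

```php
<?php
// Hypothetical markup imitating the movie list items (not copied from the real page)
$html = '<ul>'
      . '<li class="yk-col4 mr1"><a href="http://example.com/v/1" title="Movie One" target="_blank"></a></li>'
      . '<li class="yk-col4 mr1"><a href="http://example.com/v/2" title="Movie Two" target="_blank"></a></li>'
      . '</ul>';

$list_preg  = '/<li class="yk-col4 mr1">.+<\/li>/Us';
$video_preg = '/<a href="(.*)" title="(.*)" target="(.*)"><\/a>/U';

// $list[0] holds the full text of every matched <li>
preg_match_all($list_preg, $html, $list);

$movies = [];
foreach ($list[0] as $li) {
    // Skip list items where the link pattern does not match
    if (preg_match($video_preg, $li, $video)) {
        $movies[] = ['url' => $video[1], 'title' => $video[2]];
    }
}
print_r($movies);
```

Checking the return value of preg_match() before reading $video avoids warnings when a list item does not have the expected shape.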
To save the pictures to your hard disk, add the following code inside the foreach loop:
$imgData = get($img[1]);
// Write the picture file to disk (the "download").
// The file name is converted from UTF-8 to GBK because the operating system uses GBK.
if (!is_dir('./youkuimg/')) {
    mkdir('./youkuimg/');
}
file_put_contents('./youkuimg/' . mb_convert_encoding($img[3], 'GBK', 'UTF-8') . '.jpg', $imgData);
Result: the downloaded pictures appear in the youkuimg directory under the current directory.