Three ways to implement data collection in PHP

Source: Internet
Author: User
Tags fread

What is data collection?

Data collection means using a PHP program to fetch information from other sites and store it in our own database and website.

Techniques for data collection in PHP:

From low-level sockets up to high-level file functions, there are three ways to implement collection.

1. Using sockets:

A socket is the lowest-level approach: it only establishes a TCP connection, and we have to build the HTTP request string ourselves before sending it.

For example, to fetch the page tv.youku.com/?spm=a2hww.20023042.topnav.5~1~3!2~a, the socket version looks like this:

<?php
// Connect: $errno receives the error number, $errstr the error message, 30 is the connection timeout in seconds
$fp = fsockopen("www.youku.com", 80, $errno, $errstr, 30);
if (!$fp) die("Connection failed: " . $errstr);

// Build the HTTP request string; socket programming is the lowest level, so nothing here speaks HTTP for us
$http = "GET /?spm=a2hww.20023042.topnav.5~1~3!2~a HTTP/1.1\r\n"; // \r\n ends the request line
$http .= "Host: www.youku.com\r\n";                               // the requested host
$http .= "Connection: close\r\n\r\n";                             // close the connection; the last line ends with two \r\n

// Send the string to the server
fwrite($fp, $http, strlen($http));

// Receive the data returned by the server
$data = '';
while (!feof($fp)) {
    $data .= fread($fp, 4096); // fread reads the response, 4096 bytes at a time
}

// Close the connection
fclose($fp);
var_dump($data);
?>

The printed result contains the response headers followed by the source code of the page.
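Because the socket approach returns headers and page source as one string, a small helper can separate them. The following is a minimal sketch, not part of the original article: split_http_response() is a hypothetical name, and it assumes the body is not sent with chunked transfer encoding.

```php
<?php
/**
 * Split a raw HTTP response into status line, headers and body.
 * Hypothetical helper; assumes the body is not chunk-encoded.
 */
function split_http_response(string $raw): array
{
    // The header block ends at the first empty line (\r\n\r\n)
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $statusLine = array_shift($headerLines);

    $headers = [];
    foreach ($headerLines as $line) {
        // Each header has the form "Name: value"
        $pair = explode(':', $line, 2);
        if (count($pair) === 2) {
            // Names are lower-cased; a repeated header overwrites the earlier one
            $headers[strtolower(trim($pair[0]))] = trim($pair[1]);
        }
    }

    return [
        'status'  => $statusLine,
        'headers' => $headers,
        'body'    => $parts[1] ?? '',
    ];
}

// Usage with a small hand-written response
$sample = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>";
$result = split_http_response($sample);
echo $result['status'], "\n";                  // HTTP/1.1 200 OK
echo $result['headers']['content-type'], "\n"; // text/html
echo $result['body'], "\n";                    // <html>hi</html>
```

In the socket example, the same call could be applied to $data after the read loop finishes.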

2. Using the curl_* functions

cURL wraps the HTTP protocol in a set of functions. You pass the appropriate options directly, which removes the need to write the HTTP request string by hand.

Prerequisite: the curl extension must be enabled in php.ini.

// Create a cURL handle
$curl = curl_init();
// Set the URL and the corresponding options
curl_setopt($curl, CURLOPT_URL, "http://www.youku.com");
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); // return the result of curl_exec() as a string instead of printing it directly
// Execute the cURL request
$data = curl_exec($curl);
var_dump($data);

The printed result contains only the source code of the page, without the response headers.
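In practice a collector also needs to notice when a request fails. The sketch below shows one way to wrap the cURL calls with basic error handling. The function name fetch_page(), the timeout default and the choice to throw exceptions are assumptions made for illustration, not something the article prescribes.

```php
<?php
/**
 * Fetch a URL with cURL and fail loudly on errors.
 * fetch_page() and its defaults are illustrative assumptions.
 */
function fetch_page(string $url, int $timeoutSeconds = 10): string
{
    $curl = curl_init();
    curl_setopt_array($curl, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,            // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,            // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds, // limit for establishing the connection
        CURLOPT_TIMEOUT        => $timeoutSeconds, // limit for the whole transfer
    ]);

    $body = curl_exec($curl);
    if ($body === false) {
        // curl_error() describes transport-level failures (DNS, timeout, ...)
        throw new RuntimeException('cURL error: ' . curl_error($curl));
    }

    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("HTTP error $status for $url");
    }

    return $body;
}

// Usage (requires network access and the curl extension):
// $html = fetch_page('http://www.youku.com');
// echo strlen($html), " bytes\n";
```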

3. Using file_get_contents directly (highest level)

Prerequisite: allow_url_fopen must be enabled in php.ini so that PHP is allowed to open network URLs.

// Use file_get_contents()
$data = file_get_contents("http://www.youku.com");
var_dump($data);
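file_get_contents() accepts an optional stream context, which is one way to set a timeout or a User-Agent without switching to cURL. The following is a minimal sketch: make_http_context() and the default values are hypothetical names chosen for this example.

```php
<?php
/**
 * Build a stream context for file_get_contents() with a timeout and a User-Agent.
 * The function name and default values are illustrative assumptions.
 */
function make_http_context(float $timeoutSeconds = 10.0, string $userAgent = 'php-collector/0.1')
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds, // read timeout in seconds
            'user_agent' => $userAgent,      // sent as the User-Agent header
        ],
    ]);
}

$context = make_http_context(5.0);
$options = stream_context_get_options($context);
echo $options['http']['timeout'], "\n"; // 5

// Usage (requires network access and allow_url_fopen = On):
// $data = file_get_contents('http://www.youku.com', false, $context);
// if ($data === false) { /* the request failed */ }
```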


Choosing among the three methods

Network communication from PHP mainly uses the three methods above, and the latter two are used most often. To collect large amounts of data in bulk, use the second (cURL): it performs well and is stable.

If you only send a few requests occasionally rather than intensively, use the third (file_get_contents).

Extension: how can image hotlink protection be bypassed?

For example, the site 7060 protects its images against hotlinking: the pictures display on its own site, but they cannot be accessed when requested from outside the site.

Principle: the HTTP protocol has a Referer header that gives the address of the page the request came from. The server checks it, and if the request was not sent from its own site, it rejects the request.

Workaround: when you send the HTTP request, you can set the Referer header yourself:
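As a sketch of that idea, the request string from the socket example can be built with a Referer line added. build_request() and the example.com host names are hypothetical placeholders, not taken from the article.

```php
<?php
/**
 * Build a raw HTTP/1.1 GET request with a caller-chosen Referer header.
 * build_request() and the example host names are hypothetical.
 */
function build_request(string $host, string $path, string $referer): string
{
    $http  = "GET $path HTTP/1.1\r\n";
    $http .= "Host: $host\r\n";
    $http .= "Referer: $referer\r\n";     // claim the request came from the site's own page
    $http .= "Connection: close\r\n\r\n"; // the blank line ends the header block
    return $http;
}

$request = build_request('img.example.com', '/photo/1.jpg', 'http://www.example.com/');
echo $request;

// With cURL, the same header can be set on an existing handle:
// curl_setopt($curl, CURLOPT_REFERER, 'http://www.example.com/');
```

The resulting string can be sent with fwrite() on a socket opened by fsockopen(), in the same way as the request in the first method.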

Extension: some data can only be collected when logged in. You can simulate the logged-in state as follows:

A. First log in with a browser. After login, the browser's cookies contain the session ID.

B. When PHP sends the HTTP request, put the session ID from the browser into the Cookie header of that request, so that the request is made in the logged-in state.
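Steps A and B can be sketched with cURL's CURLOPT_COOKIE option. The helper names cookie_header() and get_logged_in(), the cookie name PHPSESSID and the URL are assumptions for illustration. The real cookie name and value must be copied from the browser for the target site.

```php
<?php
/**
 * Format cookies as a Cookie header value, e.g. "PHPSESSID=abc123; lang=en".
 * Hypothetical helper; the cookie names depend on the target site.
 */
function cookie_header(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return implode('; ', $pairs);
}

/**
 * Fetch a page while presenting the session cookie copied from the browser.
 * Returns the page as a string, or false on failure.
 */
function get_logged_in(string $url, array $cookies)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_COOKIE, cookie_header($cookies)); // sent as the Cookie header
    return curl_exec($curl);
}

echo cookie_header(['PHPSESSID' => 'abc123', 'lang' => 'en']), "\n"; // PHPSESSID=abc123; lang=en

// Usage (requires network access; URL and session ID are placeholders):
// $html = get_logged_in('http://www.example.com/member', ['PHPSESSID' => 'abc123']);
```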

Summary: everything a client sends can be forged, so the program on the server must validate and filter client data where necessary.

When are these techniques used? When developing against interfaces (APIs) and when collecting data.

Part two: data collection

For example, suppose we want to collect information about all the American movies listed at this URL:

list.youku.com/category/show/c_96_a_%e7%be%8e%e5%9b%bd_s_1_d_1_p_3.html

First we need to know the structure of the HTML node that holds each movie. We use Firebug to inspect it.

/**
 * Send a GET request and return the data
 */
function get($url) {
    global $curl;
    // Configure the HTTP request; see curl_setopt() in the PHP manual for the available options
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_HEADER, FALSE);
    // Execute the request
    return curl_exec($curl);
}

// Create a cURL handle
$curl = curl_init();
$url = 'http://list.youku.com/category/show/c_96_a_%E7%BE%8E%E5%9B%BD_s_1_d_1_p_3.html';
$data = get($url);

// Match the node of each movie
$list_preg = '/<li class="yk-col4 mr1">.+<\/li>/Us';
// Match src and alt on the img tag
// NOTE: the original pattern was lost from the source text; this is an assumed
// reconstruction in which $img[1] is src and $img[3] is alt, as the later code expects
$img_preg = '/<img[^>]*src="(.*)"(.*)alt="(.*)"[^>]*>/U';
// Match the movie's URL and title
$video_preg = '/<a href="(.*)" title="(.*)" target="(.*)"><\/a>/U';

// Save all li nodes to $list; $list is a two-dimensional array
preg_match_all($list_preg, $data, $list);
var_dump($list);

foreach ($list[0] as $k => $v) {
    // Here $v is one li tag
    // Get the picture and the movie name; the match is stored in $img
    preg_match($img_preg, $v, $img);
    // var_dump($img);
    // Get the movie address; the match is stored in $video
    preg_match($video_preg, $v, $video);
    // var_dump($video);
    echo $img[0] . '<a href="' . $video[1] . '">' . $video[2] . '</a>';
}
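To see what patterns of this shape capture without making a network request, the sketch below runs them against a small hand-written HTML fragment. The fragment, the example.com URLs, the function name extract_movies() and the exact img pattern are assumptions for illustration. The real page markup may differ, so the patterns would need to be adjusted after inspecting it.

```php
<?php
/**
 * Extract image URL, title and link from each movie <li>.
 * Hypothetical helper; the patterns assume the markup shown in $html below.
 */
function extract_movies(string $html): array
{
    // U = ungreedy quantifiers, s = dot also matches newlines
    $listPattern = '/<li class="yk-col4 mr1">.+<\/li>/Us';
    $imgPattern  = '/<img[^>]*src="(.*)"(.*)alt="(.*)"[^>]*>/U';
    $linkPattern = '/<a href="(.*)" title="(.*)" target="(.*)"><\/a>/U';

    $movies = [];
    preg_match_all($listPattern, $html, $list);
    foreach ($list[0] as $li) {
        // Skip list items where either the image or the link is missing
        if (preg_match($imgPattern, $li, $img) && preg_match($linkPattern, $li, $link)) {
            $movies[] = [
                'image' => $img[1],  // src attribute
                'title' => $link[2], // title attribute of the link
                'url'   => $link[1], // href attribute of the link
            ];
        }
    }
    return $movies;
}

// Hand-written sample fragment (an assumption, not the real page source)
$html = <<<'HTML'
<ul>
<li class="yk-col4 mr1"><img src="http://img.example.com/a.jpg" alt="Movie A"><a href="http://v.example.com/a" title="Movie A" target="_blank"></a></li>
<li class="yk-col4 mr1"><img src="http://img.example.com/b.jpg" alt="Movie B"><a href="http://v.example.com/b" title="Movie B" target="_blank"></a></li>
</ul>
HTML;

foreach (extract_movies($html) as $movie) {
    echo $movie['title'], ' => ', $movie['url'], "\n";
}
// Movie A => http://v.example.com/a
// Movie B => http://v.example.com/b
```

Regular expressions are fragile against markup changes. They are used here only because the article uses them.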

Test: print $list, then $img, then $video to check each match. The final output shows each movie's picture followed by a link with the movie's title.

If you need to copy the picture to your hard disk, add the following code to the foreach loop:

$imgData = get($img[1]);
// Write the picture file to the hard disk ("download")
// The operating system uses GBK, so convert the file name from UTF-8 to GBK
if (!is_dir('./youkuimg/')) mkdir('./youkuimg/');
file_put_contents('./youkuimg/' . mb_convert_encoding($img[3], 'GBK', 'utf-8') . '.jpg', $imgData);


Result: the downloaded pictures appear in the youkuimg directory under the current directory.

