Today I was assigned to crawl part of the news on the Sohu home page. It should have been a simple job, but the Sohu page came back garbled and nothing I tried would convert it. I had to study the problem thoroughly, learned a lot along the way, and am writing it down to share.
First, what is a PHP collection program?
Second, why collect?
Third, what to collect?
Fourth, how to collect?
Fifth, collection ideas
Sixth, collection sample programs
Seventh, collection experience
What is a PHP collection program?
A PHP collection program, also known as a PHP "thief" program, is a web program written in PHP that automatically collects specific content from web pages on the network. It runs on any platform that supports PHP. The phrase "automatic collection" may remind you of what search engines such as Baidu and Google do; a PHP collection program does similar work.
Why collect?
The internet is growing rapidly, and the amount of data on the web grows geometrically every day. Faced with this much data, how do you, as a webmaster, gather the information you need? In particular, when you need a lot of material from one or a few similar sites to enrich your own site, is copying and pasting by hand your only option? And can you really spend large amounts of time producing original content while falling out of step with the pace of the rest of the internet? The practical answer to these problems is collection. A program that automatically or semi-automatically collects the specific content you need and updates your site with it is what every webmaster wants, and that is why collection programs exist.
What to collect?
It depends on what kind of website you run. A picture site collects pictures, a music site collects MP3s, a news site collects news, and so on. Everything depends on the content your site needs. Decide what you want to collect, then write the collection program to match.
How to collect?
A collection program is usually targeted. You pick a target site, choose the pages that contain the content you need, analyze their HTML for regular patterns, and write PHP code that matches the specific content you want. Once you have collected the content, you can store it however you like: generate HTML pages directly, put it in a database for further processing, or save it in some other form for later use.
Collection Ideas
The idea behind a collection program is simple and can be divided into the following steps:
1. Get the source of the remote file (file_get_contents or fopen).
2. Analyze the source to extract what you want (usually with regular-expression matching; this is typically where you pick up the paging links).
3. Based on what you extracted, download the content, store it in the database, and so on.
Step 2 is often repeated several times: for example, first analyze the paging addresses, then analyze the content pages they lead to in order to extract what you want.
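Step 2 can be sketched as follows. This is a minimal illustration, not the author's code: the function name extract_links, the regular expression, and the sample markup are all assumptions, and a real page will usually need a pattern tuned to its own HTML.

```php
<?php
// Hypothetical helper: pull (url, title) pairs out of a list page with a regular expression.
function extract_links($html)
{
    $links = array();
    // Match <a ... href="...">text</a>; the href value may use single or double quotes.
    $pattern = '/<a\s[^>]*href=(["\'])(.*?)\1[^>]*>(.*?)<\/a>/is';
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $links[] = array(
                'url'   => $m[2],
                'title' => trim(strip_tags($m[3])), // drop inline tags inside the link text
            );
        }
    }
    return $links;
}

// Sample markup standing in for a page fetched in step 1.
$html = '<ul><li><a href="/news/1.html">First <b>story</b></a></li>'
      . "<li><a class='x' href='/news/2.html'>Second story</a></li></ul>";

foreach (extract_links($html) as $link) {
    echo $link['url'], ' => ', $link['title'], "\n";
}
```

Each extracted URL can then be fetched in turn and run through a second, page-specific pattern, which is the repetition of step 2 described above.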
<?php
/******* Four common ways to get the source of a remote file *******/
/******* Method one: fopen() with stream_context_create() *******/
$opts = array(
    'http' => array(
        'method' => 'GET',
        'header' => "Accept-language: en\r\n" .
                    "Cookie: foo=bar\r\n"
    )
);
$context = stream_context_create($opts);
$fp = fopen('http://www.example.com', 'r', false, $context);
fpassthru($fp);
fclose($fp);
/******* Method two: socket *******/
function get_content_by_socket($url, $host) {
    $fp = fsockopen($host, 80) or die("Open " . $url . " failed");
    $header = "GET /" . $url . " HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    // Accept-Encoding is left out: this function does not decompress gzip or deflate bodies.
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: " . $host . "\r\n";
    $header .= "Cookie: cnzz02=2; rtime=1; ltime=1148456424859; cnzz_eid=56601755-\r\n";
    // Connection: close makes the server end the response, so the feof() loop terminates.
    $header .= "Connection: close\r\n\r\n";
    fwrite($fp, $header);
    $contents = '';
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents; // raw response, HTTP headers included
}
/******* Method three: file_get_contents() with stream_context_create() *******/
$opts = array(
    'http' => array(
        'method' => 'GET',
        'header' => "Content-Type: text/html; charset=utf-8"
    )
);
$context = stream_context_create($opts);
$file = file_get_contents('http://www.sohu.com/', false, $context);
/******* Method four: PHP's cURL (http://www.chinaz.com/program/2010/0119/104346.shtml) *******/
// 1. Initialize
$ch = curl_init();
// 2. Set options, including the URL
curl_setopt($ch, CURLOPT_URL, "http://www.sohu.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-type: text/xml; charset=utf-8", "Expect: 100-continue"));
// 3. Execute and get the HTML document content
$output = curl_exec($ch);
var_dump($output);
// 4. Release the cURL handle
curl_close($ch);
/* Notes
1. file_get_contents and fopen require allow_url_fopen to be enabled. To enable it, edit php.ini and set allow_url_fopen = On. When allow_url_fopen is off, fopen and file_get_contents cannot open remote files.
2. cURL requires that the hosting environment has the curl extension enabled. On Windows, edit php.ini, remove the semicolon in front of extension=php_curl.dll, and copy ssleay32.dll and libeay32.dll to C:/WINDOWS/system32. On Linux, install the curl extension.
*/
?>
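Before choosing one of the methods above, it can help to check what the current PHP installation actually supports. The following is a small sketch under that assumption; the function name available_fetch_methods is hypothetical and not part of the original program.

```php
<?php
// Hypothetical helper: report which of the fetch methods this PHP install can use.
function available_fetch_methods()
{
    $methods = array();
    // cURL is usable only when the curl extension is loaded.
    if (function_exists('curl_init')) {
        $methods[] = 'curl';
    }
    // fopen() and file_get_contents() need allow_url_fopen to read remote URLs.
    if (filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        $methods[] = 'file_get_contents';
        $methods[] = 'fopen';
    }
    // Raw sockets depend on neither of the two settings above.
    if (function_exists('fsockopen')) {
        $methods[] = 'socket';
    }
    return $methods;
}

print_r(available_fetch_methods());
```

On shared hosting where php.ini cannot be edited, a check like this shows at a glance which method is left to fall back on.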
Collection Sample Programs
/* A picture download function */
function getimg($url, $filename) {
    /* If the picture URL is empty, stop */
    if ($url == "") {
        return false;
    }
    /* Get the picture's extension into $ext */
    $ext = strrchr($url, ".");
    /* Check that it is a valid picture file */
    if ($ext != ".gif" && $ext != ".jpg") {
        return false;
    }
    /* Read the picture; give up if the download fails */
    $img = file_get_contents($url);
    if ($img === false) {
        return false;
    }
    /* Open the specified file; give up if it cannot be opened */
    $fp = @fopen($filename . $ext, "a");
    if (!$fp) {
        return false;
    }
    /* Write the picture to the file */
    fwrite($fp, $img);
    /* Close the file */
    fclose($fp);
    /* Return the new file name of the picture */
    return $filename . $ext;
}
Collect Pictures PHP Program
<?php
/**
 * PHP program for collecting pictures
 *
 * Copyright (c) by Mini (CCXXCC) All rights reserved
 *
 * To contact the author write to {@link mailto:ucitmc@163.com}
 *
 * @author CCXXCC
 * @version $Id: {filename},v 1.0 {time} $
 * @package System
 */
set_time_limit(0);
/**
 * Write a file
 * @param string $file file path
 * @param string $str content to write
 * @param string $mode write mode
 */
function wfile($file, $str, $mode = 'w')
{
    $oldmask = @umask(0);
    $fp = @fopen($file, $mode);
    if (!$fp)
    {
        @umask($oldmask);
        return false;
    }
    else
    {
        // Take an exclusive lock only after the handle is known to be valid.
        @flock($fp, LOCK_EX);
        @fwrite($fp, $str);
        @fclose($fp);
        @umask($oldmask);
        return true;
    }
}
/**
 * Copy a remote (or local) file to a local path
 *
 * @param string $path_get source path or URL
 * @param string $path_save destination path
 */
function saveToFile($path_get, $path_save)
{
    $hdl_read = @fopen($path_get, 'rb');
    if ($hdl_read == false)
    {
        echo("Cannot get $path_get");
        return;
    }
    if ($hdl_read)
    {
        $hdl_write = @fopen($path_save, 'wb');
        if ($hdl_write)
        {
            // Copy in 8 KB chunks until the source is exhausted.
            while (!feof($hdl_read))
            {
                fwrite($hdl_write, fread($hdl_read, 8192));
            }
            fclose($hdl_write);
            fclose($hdl_read);
            return 1;
        }
        else
        {
            fclose($hdl_read);
            return 0;
        }
    }
    else
        return -1;
}
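/* Return the lower-case file extension of a path, or an empty string if it has none */
function getExt($path)
{
    $path = pathinfo($path);
    return isset($path['extension']) ? strtolower($path['extension']) : '';
}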
/**
 * Create the directories along the specified path
 *
 * @param string $path path
 */
function mkdirs($path)
{
    $adir = explode('/', $path);
    $dirlist = '';
    $rootdir = array_shift($adir);
    if (($rootdir != '.' && $rootdir != '..') && !file_exists($rootdir))
    {
        @mkdir($rootdir);
    }
    foreach ($adir as $key => $val)
    {
        // Skip empty segments (such as a trailing slash) and the special entries.
        if ($val != '' && $val != '.' && $val != '..')
        {
            $dirlist .= "/" . $val;
            $dirpath = $rootdir . $dirlist;
            if (!file_exists($dirpath))
            {
                @mkdir($dirpath);
                @chmod($dirpath, 0777);
            }
        }
    }
}
/**
 * Read a text file into a one-dimensional array, one non-empty line per item
 *
 * @param string $file_path path of the text file
 */
function getFileListData($file_path)
{
    $arr = @file($file_path);
    $data = array();
    if (is_array($arr) && !empty($arr))
    {
        foreach ($arr as $val)
        {
            $item = trim($val);
            if (!empty($item))
            {
                $data[] = $item;
            }
        }
    }
    return $data;
}
// Start collecting
// Pass in the name of your own text file listing the picture URLs, one URL per line
// basename() keeps the parameter from pointing outside the txt/ directory.
$url_file = isset($_GET['file']) && !empty($_GET['file']) ? basename($_GET['file']) : null;
$txt_url = "txt/" . $url_file;
$urls = array_unique(getFileListData($txt_url));
if (empty($urls))
{
    echo('No link address');
    die();
}
$save_url = "images/" . date("Y_m_d", time()) . "/";
mkdirs($save_url); // Create a folder named after the date
$i = 1;
if (is_array($urls) && count($urls))
{
    foreach ($urls as $val)
    {
        saveToFile($val, $save_url . date("His", time()) . "_" . $i . "." . getExt($val));
        echo($i . "." . getExt($val) . " got\n");
        $i++;
    }
}
echo('Finish');
?>
Besides the methods above, you can also use Snoopy, which works well.
What is Snoopy?
Snoopy is a PHP class that mimics the functionality of a web browser. It can fetch web content and submit forms.
Some features of Snoopy:
* Fetches the content of a web page easily
* Fetches the text of a web page easily (with HTML tags removed)
* Fetches the links in a web page easily
* Supports proxy hosts
* Supports basic username/password authentication
* Supports setting the user agent, referer, cookies, and header content
* Supports browser redirects and lets you control the redirect depth
* Expands links in a page into fully qualified URLs (the default)
* Submits data and gets the return value easily
* Supports following HTML frames (added in v0.92)
* Supports passing cookies along on redirects (added in v0.92)
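A minimal sketch of fetching links with Snoopy might look like the following. It assumes Snoopy.class.php has been downloaded next to the script; the wrapper function fetch_links_with_snoopy and the user agent string are made up for illustration, and the sketch simply returns null when the class is not available.

```php
<?php
// Hypothetical wrapper around Snoopy; returns null when the class is not loaded.
function fetch_links_with_snoopy($url)
{
    if (!class_exists('Snoopy')) {
        return null; // Snoopy.class.php has not been included
    }
    $snoopy = new Snoopy();
    $snoopy->agent = 'Mozilla/5.0 (compatible; ExampleCollector/1.0)'; // made-up user agent
    $snoopy->referer = 'http://www.example.com/';
    $snoopy->maxredirs = 2; // follow at most two redirects
    if (!$snoopy->fetchlinks($url)) {
        return null;
    }
    return $snoopy->results; // the links found on the page
}

// Assumption: Snoopy.class.php sits in the same directory as this script.
if (is_file(__DIR__ . '/Snoopy.class.php')) {
    require_once __DIR__ . '/Snoopy.class.php';
}
var_dump(fetch_links_with_snoopy('http://www.example.com/'));
```

Snoopy's fetch() and fetchtext() methods are used the same way and also leave their output in the results property.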
Collection Experience
Some personal lessons from collecting:
1. Avoid sites that use hotlink protection. It can be faked, but collecting from such sites costs too much effort.
2. Collect as fast as possible, preferably running locally.
3. It is often useful to put the raw data into the database first and process it in a later step.
4. Always handle errors when collecting. I generally skip an item if collecting it fails three times. I used to get stuck forever because one piece of content could not be fetched.
5. Validate before storing: check that the content is legitimate and filter out unnecessary strings.
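Points 4 and 5 can be sketched roughly as follows. The names fetch_with_retry and clean_for_storage are hypothetical, the demo uses a stand-in fetcher instead of a real network call, and the cleaning rule (strip tags, reject empty text) is only one possible check.

```php
<?php
// Hypothetical helper: call $fetch up to $maxTries times, then give up.
function fetch_with_retry($fetch, $url, $maxTries = 3)
{
    for ($try = 1; $try <= $maxTries; $try++) {
        $content = call_user_func($fetch, $url);
        if ($content !== false && $content !== '') {
            return $content;
        }
    }
    return false; // the caller should skip this item
}

// Hypothetical check before storing: strip tags and whitespace, reject empty results.
function clean_for_storage($text)
{
    $text = trim(strip_tags($text));
    return $text === '' ? false : $text;
}

// Demo with a stand-in fetcher that fails twice, then succeeds.
$calls = 0;
$flaky = function ($url) use (&$calls) {
    $calls++;
    return $calls < 3 ? false : '<p> Hello from ' . $url . ' </p>';
};

$urls = array('http://www.example.com/a');
foreach ($urls as $url) {
    $raw = fetch_with_retry($flaky, $url);
    if ($raw === false) {
        continue; // skip after three failed attempts instead of getting stuck
    }
    $clean = clean_for_storage($raw);
    if ($clean !== false) {
        echo $clean, "\n";
    }
}
```

In a real collector the fetcher passed in could be 'file_get_contents' or a cURL wrapper, and the cleaned text would go to the database instead of being echoed.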