When developing network programs, we often need to fetch non-local files. In PHP this is usually done by simulating a browser: send an HTTP request to the URL, then receive the HTML source or XML data that comes back. The raw data usually cannot be output directly; we first extract the content we need, then format it and display it in a friendlier way.
Here are some simple methods and principles for fetching pages with PHP:
First, the main methods for fetching a page in PHP:
1. file() function
2. file_get_contents() function
3. fopen() -> fread() -> fclose() mode
4. cURL mode
5. fsockopen() function (socket mode)
6. Using a plug-in (e.g. http://sourceforge.net/projects/snoopy/)
Second, the main ways to fetch HTML or XML code in PHP, with an example of each:
1. file() function
// Define the URL
$url = 'http://t.qq.com';
// file() reads the content into an array of lines
$lines_array = file($url);
// Join the array into a single string
$lines_string = implode('', $lines_array);
// Output the content; you could also save it on your own server
echo $lines_string;
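Because file() returns an array, the page can also be processed line by line instead of being joined straight away. The following is a minimal sketch of that idea, not part of the original example: the helper name readLines() is hypothetical, and a local temporary file stands in for the remote URL so the snippet runs without network access. It also handles the case where file() returns false on failure.

```php
<?php
// Hypothetical helper: read a file or URL into an array of lines,
// dropping trailing newlines and skipping empty lines.
function readLines(string $source): array
{
    // file() returns false on failure; @ suppresses the warning so we can handle it here.
    $lines = @file($source, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return $lines === false ? [] : $lines;
}

// Usage: a local temp file stands in for a remote page (assumption for the demo).
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, "<html>\n\n<body>hello</body>\n</html>\n");
$lines = readLines($tmp);
echo count($lines), " lines read\n";
unlink($tmp);
```

With a real URL, the same call would be readLines('http://t.qq.com'), provided remote opening is allowed on the server.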
2. file_get_contents() function
file_get_contents() and fopen() can only open remote files when allow_url_fopen is enabled. To enable it, edit php.ini and set allow_url_fopen = On. When allow_url_fopen is off, neither fopen() nor file_get_contents() can open remote files.
// Define the URL
$url = 'http://t.qq.com';
// file_get_contents() reads the remote data
$lines_string = file_get_contents($url);
// Output the content, escaped so the source is shown as text; you could also save it on your own server
echo htmlspecialchars($lines_string);
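The allow_url_fopen requirement can be checked at run time, and file_get_contents() accepts a stream context that carries a timeout and a browser-like User-Agent. The sketch below is one possible way to combine the two. The helper names (remoteOpenAllowed, buildHttpContext, fetchPage) and the User-Agent string are assumptions made for illustration; they do not come from the original article.

```php
<?php
// Hypothetical helper: report whether PHP may open remote URLs with
// file(), file_get_contents() and fopen().
function remoteOpenAllowed(): bool
{
    return filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN);
}

// Hypothetical helper: build an HTTP stream context with a timeout (seconds)
// and a browser-like User-Agent. The User-Agent value is an arbitrary example.
function buildHttpContext(int $timeout = 5)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeout,
            'user_agent' => 'Mozilla/5.0 (compatible; ExampleFetcher/1.0)',
        ],
    ]);
}

// Hypothetical helper: fetch a page, returning null when remote opening
// is disabled or the request fails.
function fetchPage(string $url, int $timeout = 5): ?string
{
    if (!remoteOpenAllowed()) {
        return null;
    }
    $html = @file_get_contents($url, false, buildHttpContext($timeout));
    return $html === false ? null : $html;
}

// Usage (performs a real request, so it is left commented out here):
// $lines_string = fetchPage('http://t.qq.com');
```

Returning null instead of false keeps the "no data" case distinct from an empty page, which is a design choice of this sketch rather than a PHP convention.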
3. fopen() -> fread() -> fclose() mode
// Define the URL
$url = 'http://t.qq.com';
// Open the URL with fopen() in binary mode
$handle = fopen($url, "rb");
// Initialize the variable
$lines_string = "";
// Read the data in a loop, 1024 bytes at a time
do {
    $data = fread($handle, 1024);
    if (strlen($data) == 0) {
        break;
    }
    $lines_string .= $data;
} while (true);
// Close the fopen handle and release the resource
fclose($handle);
// Output the content; you could also save it on your own server
echo $lines_string;
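The read loop above can be packaged as a small reusable function. This is a sketch under stated assumptions: the name readAll() is hypothetical, and an in-memory stream (php://memory) stands in for the handle returned by fopen($url, "rb") so the example runs offline. The loop uses feof() as its exit condition and stops on a read error.

```php
<?php
// Hypothetical helper: read an open stream to the end in fixed-size chunks.
function readAll($handle, int $chunkSize = 1024): string
{
    $contents = '';
    while (!feof($handle)) {
        $data = fread($handle, $chunkSize);
        if ($data === false) {
            break; // read error: stop instead of looping forever
        }
        $contents .= $data;
    }
    return $contents;
}

// Usage: an in-memory stream stands in for fopen($url, 'rb') (assumption for the demo).
$handle = fopen('php://memory', 'w+b');
fwrite($handle, str_repeat('x', 3000));
rewind($handle);
$contents = readAll($handle);
fclose($handle);
echo strlen($contents), " bytes read\n";
```

PHP's built-in stream_get_contents($handle) does the same job in one call; the explicit loop is mainly useful when each chunk needs to be processed or written out as it arrives.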
4. cURL mode
Using cURL requires the cURL extension to be enabled. On Windows, edit php.ini and remove the semicolon in front of extension=php_curl.dll; you also need to copy ssleay32.dll and libeay32.dll to C:\WINDOWS\system32. On Linux, install the cURL extension.
// Create a new cURL resource
$url = 'http://t.qq.com';
$ch = curl_init();
$timeout = 5;
// Set the URL and the corresponding options
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
// Fetch the URL
$lines_string = curl_exec($ch);
// Close the cURL resource and free system resources
curl_close($ch);
// Output the content; you could also save it on your own server
echo $lines_string;
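curl_exec() returns false when the request fails, so in practice the result is usually checked before it is echoed. The following sketch wraps the same options in a function with basic error handling. The function name fetchWithCurl() and the choice to follow redirects and to cap the total transfer time are assumptions for illustration, not requirements of the original example.

```php
<?php
// Hypothetical helper: fetch a URL with cURL, returning null on failure.
function fetchWithCurl(string $url, int $timeout = 5): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,          // return the body instead of printing it
        CURLOPT_CONNECTTIMEOUT => $timeout,      // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout * 2,  // seconds allowed for the whole transfer
        CURLOPT_FOLLOWLOCATION => true,          // follow HTTP redirects
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_errno() and curl_error() describe what went wrong.
        error_log('cURL error ' . curl_errno($ch) . ': ' . curl_error($ch));
        return null;
    }
    // The handle is released when $ch goes out of scope at the end of the function.
    return $body;
}

// Usage (performs a real request, so it is left commented out here):
// $lines_string = fetchWithCurl('http://t.qq.com');
```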
5. fsockopen() function (socket mode)
Whether socket mode works depends on the server settings; use phpinfo() to see which communication protocols the server has enabled.
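// Open a socket connection to port 80 with a 30-second timeout
$fp = fsockopen("t.qq.com", 80, $errno, $errstr, 30);
if (!$fp) {
    // Connection failed: print the error message and code
    echo "$errstr ($errno)\n";
} else {
    // Build the HTTP request by hand
    $out = "GET / HTTP/1.1\r\n";
    $out .= "Host: t.qq.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    // Read and output the response (headers and body) until the server closes the connection
    while (!feof($fp)) {
        echo fgets($fp, 128);
    }
    fclose($fp);
}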
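Unlike the other methods, the socket approach returns the raw HTTP response, so the output includes the status line and headers in front of the page source. A minimal sketch of separating them is given below. The helper name splitHttpResponse() and the sample response string are hypothetical, and the sketch assumes the body is not sent with chunked transfer encoding (which an HTTP/1.1 server is allowed to use); it does not decode chunked bodies.

```php
<?php
// Hypothetical helper: split a raw HTTP response into status code, headers and body.
function splitHttpResponse(string $raw): array
{
    // Headers and body are separated by the first blank line.
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $body = $parts[1] ?? '';

    // The first line is the status line, e.g. "HTTP/1.1 200 OK".
    $statusLine = (string) array_shift($headerLines);
    $statusParts = explode(' ', $statusLine, 3);
    $status = isset($statusParts[1]) ? (int) $statusParts[1] : 0;

    // Remaining lines are "Name: value" pairs; names are lowercased for easy lookup.
    $headers = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue;
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}

// Usage with a hand-written sample response (assumption for the demo).
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nConnection: close\r\n\r\n<html>hi</html>";
$response = splitHttpResponse($raw);
echo $response['status'], ' ', $response['headers']['content-type'], "\n";
```

In the fsockopen() example, the raw string would be built by appending each fgets() result to a variable instead of echoing it.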
6. Snoopy plug-in. The latest version is Snoopy-1.2.4.zip (last update: 2013-05-30), and it is recommended.
Snoopy is a popular and powerful collection plug-in that is very convenient to use. You can also set a user agent in it to simulate browser information.
// Include the Snoopy class file
require('Snoopy.class.php');
// Instantiate the Snoopy class
$snoopy = new Snoopy;
$url = "http://t.qq.com";
// Start fetching the content
$snoopy->fetch($url);
// Save the fetched content to $lines_string
$lines_string = $snoopy->results;
// Output the content; you could also save it on your own server
echo $lines_string;
Note: the agent is set on line 45 of the Snoopy.class.php file; search for "var $agent" (the text inside the quotation marks) to find it. You can get the browser string with PHP:
echo $_SERVER['HTTP_USER_AGENT']; prints the browser information, which you can then copy into the agent setting.
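Whichever method fetched the page, the result ends up in $lines_string, and the extraction step mentioned at the start still remains. One possible way to do it is with PHP's DOM extension, sketched below. The helper name extractTitleAndLinks() and the sample HTML are hypothetical, and the sketch assumes the dom/libxml extension is available; the original article does not prescribe a parsing method.

```php
<?php
// Hypothetical helper: extract the page title and all link targets from an HTML string.
function extractTitleAndLinks(string $html): array
{
    if (trim($html) === '') {
        return ['title' => '', 'links' => []];
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titleNode = $doc->getElementsByTagName('title')->item(0);
    $title = $titleNode ? trim($titleNode->textContent) : '';

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }

    return ['title' => $title, 'links' => $links];
}

// Usage with a small sample page standing in for fetched content (assumption for the demo).
$lines_string = '<html><head><title> Demo page </title></head>'
    . '<body><a href="/a">A</a><a>no href</a><a href="http://example.com/b">B</a></body></html>';
$result = extractTitleAndLinks($lines_string);
echo $result['title'], "\n";
echo implode(', ', $result['links']), "\n";
```

A DOM parser tolerates malformed markup better than pattern matching on the raw string, which is why it is used in this sketch.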