Fetched data usually cannot be output as-is. We often need to extract and format the content before displaying it in a friendlier way.
The main content of this article is as follows:
I. Main methods for fetching pages in PHP:
1. file() function
2. file_get_contents() function
3. fopen() -> fread() -> fclose() mode
4. cURL method
5. fsockopen() function (socket mode)
6. Plug-ins (for example, Snoopy: http://sourceforge.net/projects/snoopy)
II. Main methods for parsing HTML or XML code in PHP:
1. Regular expressions
2. PHP DOMDocument object
3. Plug-ins (for example, PHP Simple HTML DOM Parser)
If you would like to see how each of these works, read on.
Fetching pages with PHP
1. file() function
<?php
$url = 'http://t.qq.com';
// file() reads the whole resource into an array, one element per line
$lines_array = file($url);
$lines_string = implode('', $lines_array);
echo htmlspecialchars($lines_string);
?>
2. file_get_contents() function
To open remote URLs with file_get_contents or fopen, allow_url_fopen must be enabled: edit php.ini and set allow_url_fopen = On. When allow_url_fopen is disabled, neither fopen nor file_get_contents can open remote files.
<?php
$url = 'http://t.qq.com';
// file_get_contents() returns the whole resource as a single string
$lines_string = file_get_contents($url);
echo htmlspecialchars($lines_string);
?>
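Because remote reads depend on allow_url_fopen, it can help to check the setting at run time before fetching. The following is only a minimal sketch and not part of the original example: the helper name fetch_page and the 5-second timeout are hypothetical choices, while ini_get, stream_context_create and file_get_contents are standard PHP functions.

```php
<?php
// Hypothetical helper: returns the body, or null when the fetch is not possible.
function fetch_page(string $url, int $timeout = 5): ?string
{
    $is_remote = (bool) preg_match('#^(https?|ftp)://#i', $url);

    // Remote wrappers are only usable when allow_url_fopen is enabled in php.ini.
    if ($is_remote && !filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        return null;
    }

    // The 'http' context options apply to both http:// and https:// URLs.
    $context = stream_context_create(['http' => ['timeout' => $timeout]]);

    // @ suppresses the warning on failure; the false return value is checked instead.
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

// Offline demonstration: a local temporary file stands in for a remote URL.
$tmp = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($tmp, '<title>demo</title>');
echo htmlspecialchars(fetch_page($tmp)), "\n";
unlink($tmp);
```

Returning null instead of letting the warning surface keeps the calling code to a single check, which is one possible design among several.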
3. fopen() -> fread() -> fclose() mode
<?php
$url = 'http://t.qq.com';
$handle = fopen($url, 'rb');
$lines_string = '';
do {
    // Read up to 1024 bytes at a time; an empty or false result ends the loop
    $data = fread($handle, 1024);
    if ($data === false || strlen($data) === 0) { break; }
    $lines_string .= $data;
} while (true);
fclose($handle);
echo htmlspecialchars($lines_string);
?>
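The read loop can also be written around feof() and wrapped in a function. This is a sketch under stated assumptions: read_all is a hypothetical helper name, and an in-memory stream (php://memory) stands in for the remote handle so the example runs without network access.

```php
<?php
// Hypothetical helper: reads an already opened stream to the end in fixed-size chunks.
function read_all($handle, int $chunk_size = 1024): string
{
    $contents = '';
    while (!feof($handle)) {
        $data = fread($handle, $chunk_size);
        if ($data === false) {
            break; // read error: stop and return what was collected so far
        }
        $contents .= $data;
    }
    return $contents;
}

// Offline demonstration: an in-memory stream stands in for fopen($url, 'rb').
$handle = fopen('php://memory', 'w+b');
fwrite($handle, str_repeat('x', 3000));
rewind($handle);
$result = read_all($handle);
fclose($handle);
echo strlen($result), "\n";
```

When no per-chunk processing is needed, the built-in stream_get_contents($handle) does the same job in one call.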
4. cURL method
To use cURL, the cURL extension must be enabled. On Windows, edit php.ini and remove the semicolon before extension=php_curl.dll, then copy ssleay32.dll and libeay32.dll to C:\Windows\System32. On Linux, install the cURL extension.
<?php
$url = 'http://t.qq.com';
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$lines_string = curl_exec($ch);
curl_close($ch);
echo htmlspecialchars($lines_string);
?>
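curl_exec() returns false when the transfer fails, so in practice the result is usually checked before it is printed. The sketch below is an illustration rather than the original method: curl_fetch and its return shape are hypothetical, and the demonstration uses an unsupported URL scheme so that it fails immediately without touching the network.

```php
<?php
// Hypothetical helper: returns ['body' => string|null, 'error' => string].
function curl_fetch(string $url, int $timeout = 5): array
{
    if (!extension_loaded('curl')) {
        return ['body' => null, 'error' => 'cURL extension is not loaded'];
    }

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 2);    // seconds allowed for the whole transfer

    $body = curl_exec($ch);
    // curl_exec() returns false on failure; curl_error() describes what went wrong.
    $error = ($body === false) ? curl_error($ch) : '';
    curl_close($ch);

    return ['body' => ($body === false ? null : $body), 'error' => $error];
}

// Offline demonstration: an unsupported scheme makes cURL fail right away.
$result = curl_fetch('unsupported-scheme://example.invalid/');
echo $result['body'] === null ? "error: {$result['error']}\n" : "fetched\n";
```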
5. fsockopen() function (socket mode)
Whether socket mode works correctly depends on the server settings. You can use phpinfo() to check which communication protocols are enabled on the server. For example, HTTP is not enabled for sockets in my local PHP, so I can only test with UDP.
<?php
// Port 13 is the daytime service
$fp = fsockopen("udp://127.0.0.1", 13, $errno, $errstr);
if (!$fp) {
    echo "error: $errno - $errstr<br/>\n";
} else {
    fwrite($fp, "\n");
    echo fread($fp, 26);
    fclose($fp);
}
?>
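Where TCP is available, the same function can fetch a page by writing a raw HTTP request to the socket. This is a minimal sketch under assumptions: build_get_request and socket_get are hypothetical helpers, port 80 and HTTP/1.0 are illustrative choices (HTTP/1.0 avoids chunked responses), and the demonstration only prints the request so that it runs offline.

```php
<?php
// Hypothetical helper: builds a minimal HTTP/1.0 GET request.
function build_get_request(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.0\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n" // ask the server to close so feof() becomes true
         . "\r\n";
}

// Hypothetical helper: sends the request over TCP and returns the raw response
// (status line, headers and body), or null if the connection could not be opened.
function socket_get(string $host, string $path = '/', int $port = 80, int $timeout = 5): ?string
{
    $fp = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if (!$fp) {
        return null;
    }
    fwrite($fp, build_get_request($host, $path));
    $response = '';
    while (!feof($fp)) {
        $chunk = fread($fp, 1024);
        if ($chunk === false) {
            break;
        }
        $response .= $chunk;
    }
    fclose($fp);
    return $response;
}

// Offline demonstration: show the request that socket_get() would send.
echo build_get_request('t.qq.com');
```

Unlike the other methods, the response here still contains the HTTP headers; the body starts after the first blank line.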
6. Plug-ins
There are many plug-ins available online; Snoopy is one that I found. If you are interested, you can look into it.
Parsing XML (HTML) with PHP
1. Regular expressions:
<?php
$url = 'http://t.qq.com';
$lines_string = file_get_contents($url);
// eregi() was removed in PHP 7; preg_match() with the i flag is the case-insensitive replacement
preg_match('/<title>(.*)<\/title>/is', $lines_string, $title);
// $title[0] is the full match, including the <title> tags
echo htmlspecialchars($title[0]);
?>
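To get only the text inside the tag, the first capture group can be used instead of the full match. The following is a sketch on a hard-coded sample string; extract_title is a hypothetical helper, and the non-greedy pattern is one reasonable choice rather than the only one.

```php
<?php
// Hypothetical helper: returns the text inside the first <title> element, or null.
function extract_title(string $html): ?string
{
    // i: case-insensitive tags; s: let . match newlines; .*? stops at the first </title>
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $matches)) {
        return trim($matches[1]); // group 1 is the text without the surrounding tags
    }
    return null;
}

$sample = "<html><head><TITLE>\n Example page </TITLE></head><body></body></html>";
echo htmlspecialchars(extract_title($sample)), "\n";
```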
2. PHP DOMDocument object
If the remote HTML or XML contains syntax errors, PHP reports errors while parsing the DOM.
<?php
$url = 'http://www.136web.cn';
$html = new DOMDocument();
$html->loadHTMLFile($url);
$title = $html->getElementsByTagName('title');
// Property names are case-sensitive: it must be nodeValue, not nodevalue
echo $title->item(0)->nodeValue;
?>
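The parse errors mentioned above can be collected by libxml instead of being emitted as PHP warnings. This is a minimal sketch, assuming the markup is already available as a string: dom_title is a hypothetical helper and the sample HTML is made up for illustration.

```php
<?php
// Hypothetical helper: parses an HTML string and returns the <title> text, or null.
function dom_title(string $html): ?string
{
    // Collect libxml parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->nodeValue) : null;
}

// Deliberately malformed markup: the <b> tag is never closed.
$sample = '<html><head><title>Broken page</title></head><body><p><b>text</p></body></html>';
echo dom_title($sample), "\n";
```

If the individual errors matter, libxml_get_errors() can be called before libxml_clear_errors() to inspect them.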
3. Plug-ins
This article uses PHP Simple HTML DOM Parser as a brief example. The simple_html_dom syntax is similar to jQuery, which makes working with the DOM in PHP as simple as working with it in jQuery.
<?php
$url = 'http://t.qq.com';
include_once('../simplehtmldom/simple_html_dom.php');
$html = file_get_html($url);
// find() returns an array of matching elements
$title = $html->find('title');
echo $title[0]->plaintext;
?>
Of course, Chinese people are creative. Foreigners tend to lead in technology, but Chinese people tend to excel at applying it and often build things that foreigners would not dare to imagine. Remote fetching and parsing in PHP, for example, was originally meant to make data integration easier. Chinese developers have taken to it eagerly, however, and as a result a large number of scraper sites create no valuable content of their own; they simply capture other websites' content and pass it off as theirs. Type "php small" into Baidu and the first entry in the suggestion list is "PHP thief program". Type the same keyword into Google, and all I can do is smile and say nothing.