Recommended methods for fetching pages and parsing markup in PHP

Source: Internet
Author: User
When building features such as weather forecasts or RSS subscriptions, you often need to fetch remote (non-local) files. Typically, PHP simulates a browser: it requests a URL over HTTP and receives the HTML source or XML data in response. The fetched data usually cannot be output directly; it needs to be extracted and formatted so that it displays in a friendlier way.
The main content of this article is as follows:

I. Main methods for fetching pages in PHP:

1. file() function
2. file_get_contents() function
3. fopen() -> fread() -> fclose() mode
4. curl method
5. fsockopen() function (socket mode)
6. Plug-ins (such as http://sourceforge.net/projects/snoopy)

II. Main methods for parsing HTML or XML in PHP:

1. Regular expressions
2. PHP DOMDocument object
3. Plug-ins (for example, PHP Simple HTML DOM Parser)

The sections below walk through each of these methods in turn.

Fetching pages in PHP

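1. file() function
The code is as follows:
<?php
$url = 'http://t.qq.com';
$lines_array = file($url);                   // read the page into an array, one element per line
$lines_string = implode('', $lines_array);   // join the lines back into a single string
echo htmlspecialchars($lines_string);        // escape the markup so the source is shown as text
?>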


2. file_get_contents() function
To open remote URLs, both file_get_contents() and fopen() require allow_url_fopen to be enabled. To enable it, edit php.ini and set allow_url_fopen = On. When allow_url_fopen is off, neither fopen() nor file_get_contents() can open remote files.
The code is as follows:
<?php
$url = 'http://t.qq.com';
$lines_string = file_get_contents($url);   // read the whole page into one string
echo htmlspecialchars($lines_string);
?>
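
file_get_contents() returns false on failure, so a slightly more defensive wrapper can be useful. The following is a minimal sketch rather than part of the original article: the function name fetch_page(), the 5-second default timeout, and the User-Agent string are all illustrative assumptions.

```php
<?php
/**
 * Fetch a URL (or local path) and return its contents, or null on failure.
 * fetch_page() is a hypothetical helper, not a built-in PHP function.
 */
function fetch_page(string $url, int $timeoutSeconds = 5): ?string
{
    // The 'http' options apply to http:// and https:// URLs and are
    // ignored for other stream types such as local files.
    $context = stream_context_create([
        'http' => [
            'method'     => 'GET',
            'timeout'    => $timeoutSeconds,       // read timeout in seconds
            'user_agent' => 'ExampleFetcher/1.0',  // assumed, illustrative value
        ],
    ]);

    // @ suppresses the warning raised on failure; the false return
    // value is checked instead.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

A call such as fetch_page('http://t.qq.com') would return the page source as a string, or null if the request failed, so the caller can test the result before passing it to htmlspecialchars().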


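3. fopen() -> fread() -> fclose() mode

The code is as follows:
<?php
$url = 'http://t.qq.com';
$handle = fopen($url, "rb");   // open the remote URL in binary read mode
if ($handle === false) { die("Could not open $url"); }
$lines_string = "";
do {
    $data = fread($handle, 1024);   // read up to 1024 bytes at a time
    if ($data === false || strlen($data) == 0) { break; }   // stop on error or end of stream
    $lines_string .= $data;
} while (true);
fclose($handle);
echo htmlspecialchars($lines_string);
?>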


4. curl method
To use curl, the curl extension must be enabled. On Windows, edit php.ini, remove the semicolon in front of extension=php_curl.dll, and copy ssleay32.dll and libeay32.dll to C:\WINDOWS\system32. On Linux, install the curl extension.
The code is as follows:
<?php
$url = 'http://t.qq.com';
$ch = curl_init();
$timeout = 5;   // connection timeout in seconds
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);   // return the response instead of printing it
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$lines_string = curl_exec($ch);
curl_close($ch);
echo htmlspecialchars($lines_string);
?>
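
curl_exec() returns false when the transfer fails, and an HTTP error page still counts as a successful transfer. The sketch below shows one way to check for both cases. It is an illustration under stated assumptions, not the article's own code: fetch_with_curl() is a hypothetical name, and the choices to follow redirects and to treat status codes of 400 and above as failures are assumptions.

```php
<?php
/**
 * Fetch a URL with curl and return the body, or null on any failure.
 * fetch_with_curl() is a hypothetical helper name.
 */
function fetch_with_curl(string $url, int $timeoutSeconds = 5): ?string
{
    if (!function_exists('curl_init')) {
        return null; // the curl extension is not loaded
    }

    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,             // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,             // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,  // limit on connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds,  // limit on the whole transfer
    ]);

    $body   = curl_exec($ch);
    $failed = ($body === false) || (curl_errno($ch) !== 0);
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP statuses of 400 and above as failures.
    // The handle is released automatically when $ch goes out of scope.
    if ($failed || $status >= 400) {
        return null;
    }
    return $body;
}
```

Returning null for every failure keeps the caller simple; code that needs the reason could instead read curl_error($ch) before returning.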


5. fsockopen() function (socket mode)
Whether socket mode works depends on the server configuration. You can use phpinfo() to check which transports are registered on the server. For example, on my local PHP installation http is not available as a socket transport, so I could only test with udp.
The code is as follows:
<?php
$fp = fsockopen("udp://127.0.0.1", 13, $errno, $errstr);   // port 13 is the daytime service
if (!$fp) {
    echo "ERROR: $errno - $errstr<br />\n";
} else {
    fwrite($fp, "\n");      // send an empty datagram to trigger a reply
    echo fread($fp, 26);    // read up to 26 bytes of the response
    fclose($fp);
}
?>
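
The udp example only tests the socket itself. Where the tcp transport is available, a page can be fetched over a raw socket by writing the HTTP request by hand. The following is a rough sketch under that assumption; build_get_request() and fetch_via_socket() are hypothetical helper names, and HTTP/1.0 is chosen here so that the server is unlikely to reply with chunked encoding.

```php
<?php
/** Build a minimal HTTP/1.0 GET request (hypothetical helper). */
function build_get_request(string $host, string $path = '/'): string
{
    return "GET {$path} HTTP/1.0\r\n"
         . "Host: {$host}\r\n"
         . "Connection: close\r\n"  // ask the server to close when done
         . "\r\n";                  // a blank line ends the headers
}

/** Fetch a page over a raw TCP socket; returns headers plus body, or null. */
function fetch_via_socket(string $host, string $path = '/', int $port = 80, float $timeout = 5.0): ?string
{
    // @ suppresses the connection warning; $errno and $errstr hold the reason.
    $fp = @fsockopen("tcp://{$host}", $port, $errno, $errstr, $timeout);
    if (!$fp) {
        return null;
    }
    stream_set_timeout($fp, (int) $timeout);  // read timeout in seconds
    fwrite($fp, build_get_request($host, $path));

    $response = '';
    while (!feof($fp)) {
        $chunk = fread($fp, 1024);
        if ($chunk === false || $chunk === '') {
            break;  // read error or timeout
        }
        $response .= $chunk;
    }
    fclose($fp);
    return $response;
}
```

Unlike the earlier methods, the returned string contains the raw response, status line and headers included, so the caller has to split off the body (at the first blank line) before parsing it.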


6. Plug-ins
Many third-party plug-ins for fetching pages are available online; Snoopy is one of them. If you are interested, you can study it yourself.

Parsing XML (HTML) in PHP

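1. Regular expressions:

The code is as follows:
<?php
$url = 'http://t.qq.com';
$lines_string = file_get_contents($url);
// eregi() was removed in PHP 7, so preg_match() with the i (case-insensitive) flag is used instead.
if (preg_match('/<title>(.*)<\/title>/i', $lines_string, $title)) {
    echo htmlspecialchars($title[0]);   // $title[0] is the whole match, tags included
}
?>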
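
Title extraction with a regular expression can be made a little more robust by allowing attributes on the tag, line breaks inside the element, and HTML entities in the text. The helper below is a minimal sketch; extract_title_regex() is a hypothetical name, and the pattern is only one reasonable choice.

```php
<?php
/** Return the trimmed text of the first <title> element, or null if absent. */
function extract_title_regex(string $html): ?string
{
    // i: case-insensitive tag names; s: let . match newlines;
    // .*? is non-greedy, so matching stops at the first </title>.
    if (preg_match('/<title\b[^>]*>(.*?)<\/title>/is', $html, $matches) === 1) {
        // $matches[1] is the captured text between the tags.
        return trim(html_entity_decode($matches[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}
```

Because the function takes the markup as a string, it can be combined with any of the fetching methods above. Regular expressions remain fragile for nested or irregular markup, which is where a DOM parser is usually the better fit.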


2. PHP DOMDocument object
If the remote HTML or XML contains syntax errors, PHP reports them as warnings while parsing the document into a DOM.

The code is as follows:
<?php
$url = 'http://www.136web.cn';
$html = new DOMDocument();
$html->loadHTMLFile($url);                      // fetch and parse the remote page
$title = $html->getElementsByTagName('title');  // list of all <title> elements
echo $title->item(0)->nodeValue;                // text of the first one
?>
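
Real-world pages are often malformed, so the parser warnings can be collected quietly with libxml_use_internal_errors() instead of being printed. The following is a minimal sketch of that approach; extract_title_dom() is a hypothetical helper name, and it parses a string that has already been fetched rather than loading a URL directly.

```php
<?php
/** Return the text of the first <title> element, or null (hypothetical helper). */
function extract_title_dom(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    libxml_clear_errors();                  // discard the collected errors
    libxml_use_internal_errors($previous);  // restore the earlier setting

    if (!$loaded) {
        return null;
    }

    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->nodeValue) : null;
}
```

Checking $titles->length before calling item(0) avoids an error on pages that have no title element at all.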


3. Plug-ins
This article uses PHP Simple HTML DOM Parser as a brief example. The simple_html_dom syntax is similar to jQuery's, which makes working with the DOM in PHP as simple as working with it in jQuery.
The code is as follows:
<?php
$url = 'http://t.qq.com';
include_once('../simplehtmldom/simple_html_dom.php');   // adjust the path to where the library is installed
$html = file_get_html($url);     // fetch and parse the page
$title = $html->find('title');   // jQuery-style selector; returns an array of matches
echo $title[0]->plaintext;
?>


Of course, people are creative. Foreign developers tend to lead in technology, while Chinese developers often excel at applying it, sometimes building things that foreign developers would not dare to imagine. Remote fetching and parsing in PHP was originally meant to make data integration easier. It has proved so popular in China, however, that a large number of "collection" sites create no valuable content of their own; they simply scrape other websites' content and present it as theirs. Type the keyword "php small" into Baidu, and the first entry in the suggestion list is "php thief program". Type the same keyword into Google, and you can only smile and say nothing.
