Using the regular-expression helper function get_matches for curl-based data collection

Source: Internet
Author: User
Tags: preg

This post builds on the two previous posts:

Using the single-page collection function get_html for curl-based data collection

Using the parallel multi-page collection function get_htmls for curl-based data collection

With those functions we can fetch the HTML we need. The next step is to process that HTML and extract the data we want to collect.

HTML cannot be parsed with a strict parser the way XML can, because HTML documents contain many unpaired tags and are not strictly formed. Some other helper is needed. simplehtmldom is a parsing class that works on HTML documents in a jQuery-like way and makes it easy to get the desired data, but unfortunately it is slow. It is not the focus here. I mainly use regular expressions to match the data I need to collect, which gets the information quickly.
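As a rough illustration of the regular-expression approach, the sketch below pulls a page title and a list of links out of an HTML string using capture groups. The sample markup, the patterns, and the variable names are all assumptions made for this example; real pages usually need patterns tuned to their own markup.

```php
<?php
// Hypothetical sample markup standing in for a page fetched by get_html().
$html = '<html><head><title>Example page</title></head>'
      . '<body><a href="/a.html">First</a> <a href="/b.html">Second</a></body></html>';

// Capture the text between <title> and </title>.
// "i" ignores case, "s" lets "." match newlines as well.
if (preg_match('!<title>(.*?)</title>!is', $html, $m)) {
    $title = trim($m[1]);
} else {
    $title = null;
}

// Capture each href value and link text.
// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all('!<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>!is', $html, $links, PREG_SET_ORDER);

$result = [];
foreach ($links as $link) {
    $result[] = ['href' => $link[1], 'text' => $link[2]];
}
```

Here $title holds the captured title text and $result holds one href/text pair per link found in the sample string.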

get_html can check the data it returns, but get_htmls cannot, so the following two functions were written to make debugging and calling easier:
function get_matches($pattern, $html, $err_msg, $multi = false, $flags = 0, $offset = 0) {
    if (!$multi) {
        // Single match: preg_match returns 0 for no match and false on error.
        if (!preg_match($pattern, $html, $matches, $flags, $offset)) {
            echo $err_msg . "! Error message: " . get_preg_err_msg() . "\n";
            return false;
        }
    } else {
        // All matches: preg_match_all returns the match count, or false on error.
        if (!preg_match_all($pattern, $html, $matches, $flags, $offset)) {
            echo $err_msg . "! Error message: " . get_preg_err_msg() . "\n";
            return false;
        }
    }
    return $matches;
}
// Translate the code from preg_last_error() into a readable name.
function get_preg_err_msg() {
    $error_code = preg_last_error();
    switch ($error_code) {
        case PREG_NO_ERROR:
            $err_msg = 'PREG_NO_ERROR';
            break;
        case PREG_INTERNAL_ERROR:
            $err_msg = 'PREG_INTERNAL_ERROR';
            break;
        case PREG_BACKTRACK_LIMIT_ERROR:
            $err_msg = 'PREG_BACKTRACK_LIMIT_ERROR';
            break;
        case PREG_RECURSION_LIMIT_ERROR:
            $err_msg = 'PREG_RECURSION_LIMIT_ERROR';
            break;
        case PREG_BAD_UTF8_ERROR:
            $err_msg = 'PREG_BAD_UTF8_ERROR';
            break;
        case PREG_BAD_UTF8_OFFSET_ERROR:
            $err_msg = 'PREG_BAD_UTF8_OFFSET_ERROR';
            break;
        default:
            return 'Unknown error!';
    }
    return $err_msg . ':' . $error_code;
}

It can be called as follows:
$url = 'http://www.baidu.com';
$html = get_html($url);
$matches = get_matches('!<a[^<]+</a>!', $html, 'No link found', true);
if ($matches) {
    var_dump($matches);
}
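
To show what the link pattern returns without touching the network, here is a small sketch that runs the same pattern against an inline string. The sample markup is made up for illustration, and preg_match_all is called directly so that the block runs on its own.

```php
<?php
// Hypothetical sample markup; a real run would use the string returned by get_html().
$html = '<p><a href="/one">One</a> and <a href="/two"><b>Two</b></a> '
      . 'and <a href="/three">Three</a></p>';

// Same pattern as in the call above. With the default PREG_PATTERN_ORDER
// behaviour, $matches[0] holds every full match; the pattern has no capture groups.
$count = preg_match_all('!<a[^<]+</a>!', $html, $matches);

// The second link is skipped because [^<]+ cannot cross the nested <b> tag.
foreach ($matches[0] as $link) {
    echo $link, "\n";
}
```

The pattern is fast but deliberately simple: any link whose content contains another tag will not match, so a page with such links may need a looser pattern.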

You can also call this method as follows:
$urls = array('http://www.baidu.com', 'http://www.hao123.com');
$htmls = get_htmls($urls);
foreach ($htmls as $html) {
    $matches = get_matches('!<a[^<]+</a>!', $html, 'No link found', true);
    if ($matches) {
        var_dump($matches);
    }
}

Either way, you get the required information. Whether the pages are collected one at a time or in parallel, PHP still processes them one page at a time. Because get_matches returns false on failure, checking its return value tells you whether usable data was obtained. A regular expression can also fail when it exceeds the backtracking limit, so get_preg_err_msg is included to report the regular-expression error.
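
The backtracking case can be reproduced with a small sketch. It assumes that PCRE JIT is switched off and that the backtrack limit is lowered with ini_set, purely so the failure is easy to trigger; the pattern and subject are illustrative and do not come from the collection code.

```php
<?php
// Assumption: disable JIT and lower the limit so the error is easy to reproduce.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

// Nested quantifiers plus a subject that can never match force heavy backtracking.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');

// preg_match() returns false on error, so the error code is checked as well.
$hitLimit = ($result === false && preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR);

// A plain "no match" returns 0 and leaves the error code at PREG_NO_ERROR.
$noMatch = preg_match('/\d+/', 'no digits here');
$noMatchError = preg_last_error();
```

Note that get_matches treats both cases as a failure, so a plain non-match is reported together with PREG_NO_ERROR, while a genuine regex failure shows a code such as PREG_BACKTRACK_LIMIT_ERROR.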

In practice, data collection often starts from a list page. When the content-page links obtained from the list page are then used to collect the content pages, or even deeper levels, the code ends up with many nested loops and becomes hard to control. Can the code that collects the list page be separated from the code that collects the content pages, or from the code for deeper levels, and can the loops even be simplified?
