Building on the first two blog posts:
Use of the single-page collection function get_html, based on curl data collection
Use of the single-page parallel collection function get_htmls, based on curl data collection
we can now fetch the HTML we need. The next step is to process that HTML and extract the data we want to collect.
HTML cannot be parsed as reliably as XML, because HTML documents often contain unpaired tags and are not strictly well-formed. For that we need other helper classes. simplehtmldom is a parsing class that works on HTML documents in a jQuery-like way. It makes the desired data easy to get at, but unfortunately it is slow. It is not the focus here. I mainly use regular expressions to match the data I need to collect, which gets me the information quickly.
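As a rough illustration of the regular expression approach, the sketch below pulls the href and link text out of a small HTML fragment that contains unpaired tags. The sample markup, the pattern, and the variable names are assumptions made up for this example, not part of the original functions.

```php
<?php
// Sample markup with unpaired <br> and <li> tags (made up for illustration).
$html = '<ul><li><a href="/news">News</a><br><li><a href="/map">Map</a></ul>';

// Capture the href value and the link text of every anchor.
$pattern = '!<a\s[^>]*href="([^"]*)"[^>]*>([^<]*)</a>!i';

$links = array();
// PREG_SET_ORDER groups the captures per match instead of per subpattern.
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $links[] = array('href' => $m[1], 'text' => $m[2]);
    }
}

print_r($links);
```

The pattern never needs the tags to be balanced, which is why this style tolerates loose HTML. It is still only as robust as the pattern itself: attributes quoted with single quotes, for instance, would not be matched here.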
get_html can validate the data it returns, but get_htmls cannot, so the following two functions were written to make debugging and calling easier:
function get_matches($pattern, $html, $err_msg, $multi = false, $flags = 0, $offset = 0) {
    if (!$multi) {
        // Single match: preg_match returns 0 for no match and false on error
        if (!preg_match($pattern, $html, $matches, $flags, $offset)) {
            echo $err_msg . '! Error message: ' . get_preg_err_msg() . "\n";
            return false;
        }
    } else {
        // All matches: preg_match_all returns 0 for no match and false on error
        if (!preg_match_all($pattern, $html, $matches, $flags, $offset)) {
            echo $err_msg . '! Error message: ' . get_preg_err_msg() . "\n";
            return false;
        }
    }
    return $matches;
}

// Turn the last PCRE error code into a readable message
function get_preg_err_msg() {
    $error_code = preg_last_error();
    switch ($error_code) {
        case PREG_NO_ERROR:
            $err_msg = 'PREG_NO_ERROR';
            break;
        case PREG_INTERNAL_ERROR:
            $err_msg = 'PREG_INTERNAL_ERROR';
            break;
        case PREG_BACKTRACK_LIMIT_ERROR:
            $err_msg = 'PREG_BACKTRACK_LIMIT_ERROR';
            break;
        case PREG_RECURSION_LIMIT_ERROR:
            $err_msg = 'PREG_RECURSION_LIMIT_ERROR';
            break;
        case PREG_BAD_UTF8_ERROR:
            $err_msg = 'PREG_BAD_UTF8_ERROR';
            break;
        case PREG_BAD_UTF8_OFFSET_ERROR:
            $err_msg = 'PREG_BAD_UTF8_OFFSET_ERROR';
            break;
        default:
            return 'unknown error!';
    }
    return $err_msg . ':' . $error_code;
}
It can be called as follows:
$url = 'http://www.baidu.com';
$html = get_html($url);
$matches = get_matches('!<a[^<]+</a>!', $html, 'No link found', true);
if ($matches) {
    var_dump($matches);
}
You can also call this method as follows:
$urls = array('http://www.baidu.com', 'http://www.hao123.com');
$htmls = get_htmls($urls);
foreach ($htmls as $html) {
    $matches = get_matches('!<a[^<]+</a>!', $html, 'No link found', true);
    if ($matches) {
        var_dump($matches);
    }
}
Either way you get the required information. Whether the pages are collected one at a time or in parallel, PHP still processes only one page at a time. Because get_matches is used, you can check whether the return value is false to tell whether the correct data was obtained. A regular expression can also fail when it exceeds the backtracking limit, so get_preg_err_msg was added to report the regular expression error.
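To see why the error message matters, here is a minimal sketch that provokes a backtracking failure on purpose. The pattern and subject string are illustrative assumptions, and the limit is lowered artificially so the failure is easy to reproduce; with these settings the call is expected to fail with the backtrack-limit error rather than simply report no match.

```php
<?php
// Assumption for reproducibility: turn JIT off (the setting exists in PHP 7+)
// and lower the limit so the pattern below exhausts it quickly.
ini_set('pcre.jit', '0');
ini_set('pcre.backtrack_limit', '1000');

// Nested quantifiers with no possible match force heavy backtracking.
$result = preg_match('/(?:\D+|<\d+>)*[!?]/', 'foobar foobar foobar');
$error  = preg_last_error();

if ($result === false) {
    // The regex engine gave up: this is an error, not "no match".
    echo "Regex failed, error code: $error\n";
} elseif ($result === 0) {
    // The regex ran to completion and found nothing.
    echo "Regex ran cleanly but found nothing\n";
}
```

The distinction is easy to lose because both outcomes are falsy. get_matches treats them the same way, so the text produced by get_preg_err_msg is the only place where a genuine "no match" (PREG_NO_ERROR) can be told apart from a pattern that needs to be rewritten.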
During data collection, a list page is often collected first, and the content-page links extracted from it are then used to collect the content pages, or even deeper levels. This produces many nested loops, and the code becomes hard to control. Can we separate the code that collects the list page from the code that collects the content pages (or deeper levels), or even simplify the loops?
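One possible direction, offered only as a sketch and not necessarily the approach taken later in this series, is to give each level its own parsing function and pass the fetcher in as a callable. All names below (parse_list_page, parse_content_page, collect, the stub pages) are hypothetical, and the fetcher is stubbed with made-up pages so the example runs without network access; get_html could be passed in its place.

```php
<?php
// Hypothetical helper: extract content-page links from a list page.
function parse_list_page($html) {
    if (!preg_match_all('!<a\s[^>]*href="([^"]+)"!i', $html, $m)) {
        return array();
    }
    return $m[1];
}

// Hypothetical helper: extract the title from a content page.
function parse_content_page($html) {
    if (!preg_match('!<h1>([^<]*)</h1>!i', $html, $m)) {
        return false;
    }
    return $m[1];
}

// One flat loop; $fetch is any callable that returns the HTML for a URL.
function collect($list_url, $fetch) {
    $titles = array();
    foreach (parse_list_page(call_user_func($fetch, $list_url)) as $url) {
        $title = parse_content_page(call_user_func($fetch, $url));
        if ($title !== false) {
            $titles[$url] = $title;
        }
    }
    return $titles;
}

// Stub fetcher with made-up pages standing in for real HTTP requests.
$pages = array(
    '/list' => '<a href="/a">A</a><a href="/b">B</a>',
    '/a'    => '<h1>First</h1>',
    '/b'    => '<p>no title</p>',
);
$fetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $pages[$url] : '';
};

$titles = collect('/list', $fetch);
print_r($titles);
```

Each parsing function can be tested on its own against saved HTML, and adding another level means adding another small function rather than another layer of nesting.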