I used the PHP Snoopy class for two days and found it useful. fetchlinks retrieves all the links on the requested page, and fetchtext retrieves all of its text (regular expressions are used internally for the processing). There are many other features as well, such as simulating a form submission.
Usage:
First, download the Snoopy class: http://sourceforge.net/projects/snoopy/
Then instantiate an object and call the appropriate method to get the fetched page data.
The code is as follows:
include 'snoopy/snoopy.class.php';
$snoopy = new Snoopy();
$sourceURL = "http://www.jb51.net";
$snoopy->fetchlinks($sourceURL);
// After the call, the collected links are available in $snoopy->results
$a = $snoopy->results;
Snoopy does not provide a method for getting all the image addresses on a page, and I needed the image addresses from every article listed on a page. So I wrote one myself; the key part is the regular expression that matches the images.
The code is as follows:
// Regular expression that matches an image: group 1 is the address without
// the suffix, group 2 is the suffix (the pattern body is missing here)
$reTag = "//i";
Because of my specific requirements, the pattern only needs to capture images whose addresses start with a hard-coded "http://" (images on external sites may be protected against hotlinking, so I want to download them locally first).
The approach:
1. Fetch the specified page and filter its links down to the article addresses I want;
2. Loop over the article addresses from step 1, fetch each one, and use the image regular expression to collect the addresses of all images on the page that match the rule;
3. Save each image according to its suffix and ID (only gif and jpg occur here). If the image file already exists, delete it first and then save it.
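Step 2 depends on an image pattern with two capture groups (the address without the suffix, and the suffix). As a rough sketch of what such a pattern could look like, the block below uses a hypothetical expression and made-up sample markup; it is an assumption for illustration, not the pattern used in this article.

```php
<?php
// Hypothetical pattern (an assumption, not the article's original):
// group 1 = absolute http:// address without the suffix, group 2 = suffix.
$reTag = '/<img[^>]+src="(http:\/\/[^"]+)\.(jpg|gif)"[^>]*>/i';

// Made-up sample markup for illustration.
$html = '<p><img alt="a" src="http://example.com/images/101.jpg" />'
      . '<img src="/local/5.gif"><img src="http://example.com/e/7.GIF"></p>';

preg_match_all($reTag, $html, $matchResult);

// $matchResult[1] holds the addresses without suffix, $matchResult[2] the suffixes.
foreach ($matchResult[1] as $i => $name) {
    echo $name . "." . $matchResult[2][$i] . "\n";
}
```

The relative address /local/5.gif is skipped because the pattern insists on "http://". Note that the i flag also lets upper-case suffixes through, so a later comparison such as $suffix == "gif" would need to normalize case.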
The full code is as follows:
<meta http-equiv='content-type' content='text/html; charset=utf-8'>
<?php
include 'snoopy/snoopy.class.php';
$snoopy = new Snoopy();
$sourceURL = "http://xxxxx";
$snoopy->fetchlinks($sourceURL);
$a = $snoopy->results;
$re = "/\d+\.html$/";
// Keep only the links that point to the wanted article pages
foreach ($a as $tmp) {
    if (preg_match($re, $tmp)) {
        getImgURL($tmp);
    }
}
// Fetch one article page and save every image that matches the pattern
function getImgURL($siteName) {
    $snoopy = new Snoopy();
    $snoopy->fetch($siteName);
    $fileContent = $snoopy->results;
    // Regular expression that matches an image: group 1 is the address without
    // the suffix, group 2 is the suffix (the pattern body is missing here)
    $reTag = "//i";
    if (preg_match($reTag, $fileContent)) {
        $ret = preg_match_all($reTag, $fileContent, $matchResult);
        for ($i = 0, $len = count($matchResult[1]); $i < $len; ++$i) {
            saveImgURL($matchResult[1][$i], $matchResult[2][$i]);
        }
    }
}
// Download one image and store it under a directory chosen by its suffix
function saveImgURL($name, $suffix) {
    $url = $name . "." . $suffix;
    echo "requested image address: " . $url . "<br/>";
    $imgSavePath = "E:/xxx/style/images/";
    // The image ID is the run of digits after the last slash
    $imgId = preg_replace("/^.+\/(\d+)$/", "\\1", $name);
    if ($suffix == "gif") {
        $imgSavePath .= "emotion";
    } else {
        $imgSavePath .= "topic";
    }
    $imgSavePath .= ("/" . $imgId . "." . $suffix);
    // Delete an existing file before saving the new copy
    if (is_file($imgSavePath)) {
        unlink($imgSavePath);
        echo "<p style='color:#f00;'>File " . $imgSavePath . " already exists and will be deleted</p>";
    }
    $imgFile = file_get_contents($url);
    $flag = file_put_contents($imgSavePath, $imgFile);
    if ($flag) {
        echo "<p>File " . $imgSavePath . " saved successfully</p>";
    }
}
?>
When using PHP to scrape web pages (content, images, and links), I think the most important thing is the regular expressions, which extract the desired data from the fetched content according to specified rules. The ideas are actually fairly simple and only a handful of functions are involved (and you can call the methods of classes written by others to fetch the content).
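As a small illustration of that point, the sketch below filters a link list with preg_grep and then pulls out the numeric ID with preg_match. The link list is made up and merely stands in for what fetchlinks would return; the variable names are assumptions for the example.

```php
<?php
// Made-up link list standing in for $snoopy->results after fetchlinks().
$links = [
    "http://example.com/article/123.html",
    "http://example.com/about.html",
    "http://example.com/article/456.html?page=2",
    "http://example.com/article/789.html",
];

// Keep only addresses that end in <digits>.html.
$articles = array_values(preg_grep('/\d+\.html$/', $links));

// Pull the numeric ID out of each remaining address.
$ids = [];
foreach ($articles as $url) {
    if (preg_match('/(\d+)\.html$/', $url, $m)) {
        $ids[] = $m[1];
    }
}
print_r($ids);
```

Anchoring the pattern with $ is what rejects the address carrying a query string, so the rule has to match how the target site actually forms its links.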
However, I had noticed earlier that PHP does not seem to offer a direct way to do the following: given a file with N lines (N is large), replace the contents of the lines that match some rule, for example change line 3 from aaa to bbbbb. The common ways to modify a file are:
1. Read the entire file at once (or read it line by line), write the converted result to a temporary file, and then replace the original file with it.
2. Read the file line by line, use fseek to control the position of the file pointer, and then write the data with fwrite.
Solution 1: when the file is large, reading it in one go is not feasible, and reading line by line, writing to a temporary file, and then replacing the original file is inefficient. Solution 2: there is no problem when the replacement string is no longer than the text it replaces, but if it is longer the write goes "out of bounds" and corrupts the data of the next line (unlike the "selection" concept in JavaScript, a span cannot simply be swapped for new content of a different length).
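For reference, the line-by-line variant of solution 1 can be sketched as below. rewriteLines is a hypothetical helper written for this example, the ".tmp" suffix is an arbitrary choice, and the sketch normalizes line endings to "\n"; it only keeps one line in memory at a time, at the cost of rewriting the whole file.

```php
<?php
// Hypothetical helper: rewrite a file through a temporary copy.
// $transform receives (line without newline, 1-based line number) and
// returns the replacement text, or null to drop the line.
function rewriteLines(string $path, callable $transform): void
{
    $tmpPath = $path . '.tmp';
    $in = fopen($path, 'r');
    $out = fopen($tmpPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException("cannot open $path or $tmpPath");
    }
    $lineNo = 0;
    while (($line = fgets($in)) !== false) {
        $lineNo++;
        $new = $transform(rtrim($line, "\r\n"), $lineNo);
        if ($new !== null) {
            fwrite($out, $new . "\n");
        }
    }
    fclose($in);
    fclose($out);
    rename($tmpPath, $path); // replace the original with the rewritten copy
}

// Demo on a throwaway file: turn line 3 from "aaa" into "bbbbb".
$file = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($file, "one\ntwo\naaa\nfour\n");
rewriteLines($file, function ($line, $n) {
    return ($n === 3 && $line === 'aaa') ? 'bbbbb' : $line;
});
echo file_get_contents($file);
```

Because the replacement is written to a fresh file, it may be longer or shorter than the original line, and returning null from the callback deletes the line.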
The following code tests solution 2:
<?php
$mode = "r+";
$filename = "d:/file.txt";
$fp = fopen($filename, $mode);
if ($fp) {
    $i = 1;
    while (!feof($fp)) {
        $str = fgets($fp);
        echo $str;
        // Only the first line is overwritten in this test
        if ($i == 1) {
            $len = strlen($str);
            fseek($fp, -$len, SEEK_CUR); // move the pointer back to the start of the line
            fwrite($fp, "123");
        }
        $i++;
    }
    fclose($fp);
}
?>
The code reads one line first, at which point the file pointer is at the beginning of the next line. fseek then moves the pointer back to the start of the line just read, and fwrite overwrites from there. Because fwrite overwrites bytes in place and takes no account of the length of the text being replaced, a longer string spills into the data of the next line. What I want is to operate on this one line only, for example to delete it or to replace the whole line with a single character. The example above does not meet that requirement; maybe I just have not found the proper method yet...
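The overrun is easy to reproduce on a throwaway file. The sketch below is only a demonstration of the behavior described above, using a temporary file and made-up contents rather than the d:/file.txt from the test.

```php
<?php
// Throwaway file with two 3-character lines.
$file = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($file, "aaa\nbbb\n");

$fp = fopen($file, 'r+');
$first = fgets($fp);                    // reads "aaa\n"; pointer is now at line 2
fseek($fp, -strlen($first), SEEK_CUR);  // step back to the start of line 1
fwrite($fp, "12345");                   // 5 bytes written over a 4-byte line
fclose($fp);

$result = file_get_contents($file);
var_dump($result);
```

The file ends up containing "12345bb" followed by a newline: the line break after the first line and the first b of the second line have been overwritten, so the two lines are merged.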