After two days of using the PHP Snoopy class, I have found it very useful. To get all the links in a requested page, call fetchlinks directly; to get all the text, use fetchtext (internally it also relies on regular expressions). There are many other features as well, such as simulating a form submission.
How to use:
First download the Snoopy class from: http://sourceforge.net/projects/snoopy/
Then instantiate an object and call the appropriate method to get the information from the crawled web page.
The code is as follows:
include 'snoopy/snoopy.class.php';
$snoopy = new Snoopy();
$sourceURL = "http://www.jb51.net";
$snoopy->fetchlinks($sourceURL);
$a = $snoopy->results;
Snoopy does not provide a method for getting all the image addresses in a web page, and I needed the image addresses from every article listed on a page. So I wrote one myself; the important part is getting the matching regular expression right.
The code is as follows:
// Regular expression that matches an image
$reTag = "//i";
Because my requirement is rather specific, I only need to grab images whose addresses are hard-coded to begin with http:// (images on external sites may be protected against hotlinking, so I want to save them locally first).
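The body of the pattern assigned to $reTag is empty in the text above. As a rough sketch only, a pattern consistent with how the script uses the match (capture group 1 is the image address without its suffix, group 2 is the suffix, and only addresses beginning with http:// are accepted) might look like the following. The pattern and the sample HTML are assumptions made for illustration, not the original expression.

```php
<?php
// Hypothetical pattern (an assumption, not the original expression):
// group 1 = image address without the suffix, group 2 = suffix (gif or jpg).
$reTag = '/<img[^>]+src=["\'](http:\/\/[^"\']+)\.(gif|jpg)["\']/i';

// Sample HTML made up for this illustration.
$html = '<p><img src="http://example.com/pic/101.jpg" alt="a">'
      . '<img class="x" src="http://example.com/pic/7.GIF">'
      . '<img src="/local/8.jpg"></p>';

preg_match_all($reTag, $html, $matchResult);

// Print each matched address and suffix, in the same shape the script consumes them.
for ($i = 0, $len = count($matchResult[1]); $i < $len; ++$i) {
    echo $matchResult[1][$i] . ' ' . $matchResult[2][$i] . "\n";
}
```

The third image is skipped because its address does not begin with http://, which mirrors the restriction described above.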
1. Fetch the specified page and filter out all the expected article addresses;
2. Loop over the article addresses from step 1, fetch each one, and apply the image regular expression to get every image address on those pages that conforms to the rule;
3. Save each image according to its suffix and ID (only gif and jpg here); if the image file already exists, delete it first and then save it.
The code is as follows:
<meta http-equiv='content-type' content='text/html;charset=utf-8'>
<?php
include 'snoopy/snoopy.class.php';
$snoopy = new Snoopy();
$sourceURL = "http://xxxxx";
$snoopy->fetchlinks($sourceURL);
$a = $snoopy->results;
$re = "/\d+\.html$/";
// Filter the links to get the expected article addresses
foreach ($a as $tmp) {
    if (preg_match($re, $tmp)) {
        getImgURL($tmp);
    }
}
// Fetch one article page and pass every matched image to saveImgURL()
function getImgURL($siteName) {
    $snoopy = new Snoopy();
    $snoopy->fetch($siteName);
    $fileContent = $snoopy->results;
    // Regular expression that matches an image
    $reTag = "//i";
    if (preg_match($reTag, $fileContent)) {
        $ret = preg_match_all($reTag, $fileContent, $matchResult);
        for ($i = 0, $len = count($matchResult[1]); $i < $len; ++$i) {
            saveImgURL($matchResult[1][$i], $matchResult[2][$i]);
        }
    }
}
// Download one image; $name is the address without the suffix
function saveImgURL($name, $suffix) {
    $url = $name . "." . $suffix;
    echo "Requested image address: " . $url . "<br/>";
    $imgSavePath = "e:/xxx/style/images/";
    // The image ID is the run of digits after the last "/"
    $imgId = preg_replace("/^.+\/(\d+)$/", "\\1", $name);
    if ($suffix == "gif") {
        $imgSavePath .= "Emotion";
    } else {
        $imgSavePath .= "topic";
    }
    $imgSavePath .= ("/" . $imgId . "." . $suffix);
    if (is_file($imgSavePath)) {
        unlink($imgSavePath);
        echo "<p style='color:#f00;'>File " . $imgSavePath . " already exists and will be deleted</p>";
    }
    $imgFile = file_get_contents($url);
    $flag = file_put_contents($imgSavePath, $imgFile);
    if ($flag) {
        echo "<p>File " . $imgSavePath . " saved successfully</p>";
    }
}
?>
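The two regular expressions in the script that do not depend on Snoopy can be tried on their own. The following is a minimal sketch with made-up sample addresses (assumptions for illustration); it uses preg_grep to filter the list in one call, where the script above loops with preg_match.

```php
<?php
// Sample link list made up for this illustration;
// in the script the list comes from $snoopy->results.
$links = [
    'http://example.com/article/123.html',
    'http://example.com/about.html',
    'http://example.com/article/45.html?page=2',
];

// Keep only addresses that end in digits followed by ".html".
$re = '/\d+\.html$/';
$articles = array_values(preg_grep($re, $links));

// Extract the numeric ID after the last "/" of an image address
// (the suffix has already been split off, as in saveImgURL).
$name  = 'http://example.com/pic/101';
$imgId = preg_replace('/^.+\/(\d+)$/', '\\1', $name);

print_r($articles);
echo $imgId . "\n";
```

Only the first address passes the filter: the second has no digits before ".html", and the third does not end in ".html".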
When using PHP to crawl web pages (content, images, links), I think the most important part is the regular expression, which extracts the data from the fetched content according to specified rules. The idea is actually fairly simple and only a few methods are involved (and for fetching the content you can simply call the methods of a class someone else has written).
Something else I had been thinking about earlier is an operation that PHP does not seem to support directly: given a file with N lines (N very large), replace the content of the lines that match a rule, for example change the 3rd line from AAA to bbbbb. Common approaches when you need to modify a file:
1. Read the entire file at once (or read it line by line), save the converted result to a temporary file, and then replace the original file with it.
2. Read line by line, use fseek to control the position of the file pointer, and then write with fwrite.
With approach 1, a large file cannot be read in one go, and reading line by line, writing a temporary file, and then replacing the original is not very efficient. Approach 2 is fine when the replacement string is no longer than the text it overwrites, but when it is longer the write "crosses over" into the next line and corrupts its data (there is nothing like the "selection" concept in JavaScript, where a selected range is replaced with new content).
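As a rough illustration of the first approach in its line-by-line form, the sketch below copies a file into a temporary file, swaps in the new text for one line, and then renames the temporary file over the original. The function name replaceLine, its parameters, and the ".tmp" suffix are assumptions made for this sketch.

```php
<?php
// Hypothetical helper (name and signature are assumptions for this sketch).
// Copies $path line by line into a temporary file, replacing line number
// $lineNo (1-based) with $newText, then renames the temporary file over $path.
function replaceLine($path, $lineNo, $newText) {
    $in = fopen($path, 'r');
    if (!$in) {
        return false;
    }
    $tmpPath = $path . '.tmp';
    $out = fopen($tmpPath, 'w');
    if (!$out) {
        fclose($in);
        return false;
    }
    $i = 1;
    while (($line = fgets($in)) !== false) {
        if ($i == $lineNo) {
            // Keep the original line ending, if the line had one.
            $eol = preg_match('/\r?\n\z/', $line, $m) ? $m[0] : '';
            $line = $newText . $eol;
        }
        fwrite($out, $line);
        $i++;
    }
    fclose($in);
    fclose($out);
    return rename($tmpPath, $path);
}

// Usage with a throwaway file: turn the 3rd line from AAA into bbbbb.
$file = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($file, "aaa\nbbb\nAAA\n");
replaceLine($file, 3, 'bbbbb');
$result = file_get_contents($file);
echo $result;
unlink($file);
```

Only one line is held in memory at a time, so the file size is not a limit, at the cost of rewriting the whole file for a single change.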
Here is the code for experimenting with approach 2:
The code is as follows:
<?php
$mode = "r+";
$filename = "D:/file.txt";
$fp = fopen($filename, $mode);
if ($fp) {
    $i = 1;
    while (!feof($fp)) {
        $str = fgets($fp);
        echo $str;
        if ($i == 1) {
            $len = strlen($str);
            fseek($fp, -$len, SEEK_CUR); // move the pointer back to the start of the line just read
            fwrite($fp, "123");
        }
        $i++;
    }
    fclose($fp);
}
?>
The code first reads a line, at which point the file pointer actually points to the beginning of the next line. fseek then moves the pointer back to the beginning of the line just read, and fwrite overwrites the data in place. Because this is an overwrite and there is no way to specify the length of the text being replaced, a longer replacement spills into the next line's data. What I want is to operate on this one line only, for example delete it or replace the whole line with a single 1. The example above does not meet that requirement; maybe I just have not found the right way yet ...
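The "crossing over" effect can be reproduced on a throwaway file. The sketch below is a reduced version of the experiment, not the original script; the file contents and the replacement string are made up for illustration.

```php
<?php
// Temporary file with made-up contents for this illustration.
$file = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($file, "aaa\nbbb\nccc\n");

$fp = fopen($file, 'r+');
$str = fgets($fp);                    // reads "aaa\n"; the pointer is now at the start of line 2
fseek($fp, -strlen($str), SEEK_CUR);  // move back to the start of line 1
fwrite($fp, "12345");                 // writes 5 bytes over a 4-byte line
fclose($fp);

$result = file_get_contents($file);
echo $result;
unlink($file);
```

With these contents the file ends up as 12345bb followed by ccc: the newline of line 1 and the first b of line 2 have been overwritten, so the first two lines are merged. Writing a shorter string such as "1" would instead leave the rest of the old line in place (1aa), which is why fwrite alone cannot replace or delete a whole line.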