Blog Crawl System
Introduction
I had nothing to do over the weekend, so out of boredom I used PHP to build a blog crawl system. The site I visit most often is cnblogs, so of course I started with the blog park (you can see I still like cnblogs). My crawler is fairly simple: fetch the page content, pull out the parts I want with regular expressions, and save them to the database. Naturally, some problems come up in practice. I thought this through beforehand and wanted the design to be extensible, so that if one day I want to add CSDN, 51CTO, or Sina Blog, they can be plugged in easily.
What can be crawled?
First of all, this is a simple crawler: not everything visible on a page can be fetched, and some content cannot be crawled at all.
For example, take a crawl that starts from link a. If the depth is 1, the crawler fetches the content of the current link and is done. If the depth is 2, it matches links out of a's content according to the specified rules, and each matched link is then processed with depth 1, and so on. The depth is simply how many levels of links to follow, the hierarchy, and it is what lets the crawler "crawl outward".
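Sketched in PHP, the depth rule above looks roughly like this (a minimal illustration against an in-memory link map; the function and variable names are mine, not the actual class):

```php
<?php
// Depth-limited crawl sketch: depth 1 fetches only the current link,
// depth 2 also follows the links matched in its content, and so on.
// $web stands in for real pages; keys are URLs, values are the links
// that a regex would match out of each page (hypothetical data).
$web = [
    'http://a' => ['http://b', 'http://c'],
    'http://b' => ['http://c'],
    'http://c' => [],
];

function crawl(string $url, int $depth, array $web, array &$visited): void
{
    $key = md5($url);                        // URLs are keyed by their MD5
    if ($depth < 1 || isset($visited[$key])) {
        return;                              // depth exhausted, or a duplicate
    }
    $visited[$key] = $url;                   // "fetch" the page
    if ($depth > 1) {
        foreach ($web[$url] ?? [] as $link) {
            crawl($link, $depth - 1, $web, $visited);   // go one level deeper
        }
    }
}

$visited = [];
crawl('http://a', 2, $web, $visited);
echo count($visited);   // 3 - a plus its two links, with c visited only once
```

With depth 1 the same call would stop at `http://a` itself, which is exactly the rule described above.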
Of course, crawling specific content from a single link is very limited, and the crawl may die out early (the next level matches no content), so you can set multiple start links. A crawl is also likely to hit many duplicate links, so crawled links have to be marked, to prevent fetching the same content twice and producing redundant data. Several variables cache this information, in the following formats.
The first is a hash array: the key is the MD5 value of the URL, and the value is the state (0 meaning not yet fetched). It maintains a set of non-duplicate URLs and takes the following form:
Array
(
    [md5-of-url-1] => 0
    [md5-of-url-2] => 0
    [md5-of-url-3] => 0
    ......
)
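A small sketch of how such an array keeps duplicates out (the helper name is mine, not from the original code):

```php
<?php
// De-duplication via an md5-keyed hash array: the key is md5($url),
// the value is the fetch state (0 = queued, not yet fetched).
function markUrl(string $url, array &$visited): bool
{
    $key = md5($url);
    if (isset($visited[$key])) {
        return false;      // already queued or fetched - skip it
    }
    $visited[$key] = 0;    // state 0, as in the dump above
    return true;
}

$visited = [];
markUrl('http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html', $visited);
var_dump(markUrl('http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html', $visited)); // bool(false)
```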
The second is the array of URLs to fetch. This spot can be optimized: I first collect all the links into an array and only then loop over that array to fetch the content, which means every page above the maximum depth effectively gets fetched twice (once to extract links, once for content). The content could instead be grabbed directly while fetching the next level's pages, and the state in the hash array above flipped to 1 (fetched), which would improve efficiency. First, look at the contents of the array where the links are saved:
Array
(
    [0] => Array
        (
            [0] => http://zzk.cnblogs.com/s?t=b&w=php&p=1
        )

    [1] => Array
        (
            [0] => http://www.cnblogs.com/baochuan/archive/2012/03/12/2391135.html
            [1] => http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html
            [2] => http://www.cnblogs.com/zuoxiaolong/p/java1.html
            ......
        )

    [2] => Array
        (
            [0] => http://www.cnblogs.com/ohmygirl/category/623392.html
            [1] => http://www.cnblogs.com/ohmygirl/category/619019.html
            [2] => http://www.cnblogs.com/ohmygirl/category/619020.html
            ......
        )
)
Finally, all the links are assembled into one array and returned, and the program loops over it to fetch each link's content. As in the dump above, where the depth is 2: the level-0 content was already fetched just to collect the level-1 links, and all the level-1 content was fetched just to collect the level-2 links, so when the actual content is retrieved those pages get fetched again, and the state flag in the hash array above is never actually used... (this needs optimizing).
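The level-by-level collection can be sketched like this (collectLevels and its link extractor are hypothetical stand-ins for the spider's own methods):

```php
<?php
// Build an array shaped like the dump above: $levels[$d] holds every
// new (non-duplicate) URL discovered at depth $d.
function collectLevels(array $startUrls, int $maxDepth, callable $extractLinks): array
{
    $levels = [0 => $startUrls];
    $seen   = array_fill_keys(array_map('md5', $startUrls), 0);

    for ($d = 1; $d <= $maxDepth; $d++) {
        $levels[$d] = [];
        foreach ($levels[$d - 1] as $url) {
            foreach ($extractLinks($url) as $link) {
                $key = md5($link);
                if (!isset($seen[$key])) {     // duplicates are dropped here
                    $seen[$key]   = 0;
                    $levels[$d][] = $link;
                }
            }
        }
    }
    return $levels;
}

// Toy link map standing in for pages fetched and regex-matched for links.
$links = [
    'http://start' => ['http://p1', 'http://p2'],
    'http://p1'    => ['http://c1'],
    'http://p2'    => ['http://c1', 'http://c2'],   // c1 repeats, is deduped
];
$levels = collectLevels(['http://start'], 2, fn ($u) => $links[$u] ?? []);
print_r($levels[2]);   // http://c1 and http://c2, each listed once
```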
Then there are the regexes for extracting the article itself. After analyzing the article pages on cnblogs, I found that the title and the body can both be extracted very reliably.
Title: the HTML of the title follows a regular format and is easily matched with the following regex:
#<a[^>]*?>(.*?)<\/a>#is
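Applied with preg_match, the title pattern works like this (the sample markup is my guess at the title anchor's shape; the attributes on the live page may differ):

```php
<?php
// Extract the post title from a sample of the title anchor's HTML.
$html = '<a id="cb_post_title_url" href="http://www.cnblogs.com/x/p/1.html">My Post Title</a>';

$title = '';
if (preg_match('#<a[^>]*?>(.*?)<\/a>#is', $html, $m)) {
    $title = trim($m[1]);   // group 1 holds the text inside the anchor
}
echo $title;   // My Post Title
```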
Body: the body could be extracted elegantly with balancing groups, an advanced regex feature, but after fiddling for half a day I found that PHP's support for balancing groups is poor (PCRE does not have .NET-style balancing groups), so I gave that up. Looking at the HTML source, I found the article body can also be matched easily with the following regex; each article's content basically sits inside a single container div:
#(<div\s+id="cnblogs_post_body"[^>]*?>.*)#is
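And a matching sketch for the body, assuming the post body sits in a div with id cnblogs_post_body (the usual container on cnblogs article pages; verify against the live HTML):

```php
<?php
// Extract the article body from sample page HTML. The pattern grabs from
// the opening body div to the end of the input; trailing markup would
// need to be trimmed off afterwards.
$html = '<h1>title</h1><div id="cnblogs_post_body"><p>Hello crawler.</p></div>';

$body = '';
if (preg_match('#(<div\s+id="cnblogs_post_body"[^>]*?>.*)#is', $html, $m)) {
    $body = $m[1];
}
echo (strpos($body, 'Hello crawler.') !== false) ? 'matched' : 'missed';   // matched
```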
The entry point:

for ($i = 1; $i <= 100; $i++) {
    echo "page{$i}*************************[begin]******************\r";
    $spidercnblogs = new C\Spidercnblogs("http://zzk.cnblogs.com/s?t=b&w=php&p={$i}");
    $urls = $spidercnblogs->spiderUrls();
    die();   // note: this exit stops the run after the first page's URLs
    foreach ($urls as $key => $value) {
        $cnblogs->grap($value);
        $cnblogs->save();
    }
}
At this point you can go crawl whatever you like. The speed is not great: running 10 processes of the above on an ordinary PC for several hours, I only collected a little over 400,000 records. Take a look at the crawled content after a little polish on the display, with the basic cnblogs CSS added; you can see the effect below.
The crawled content, slightly modified:
Original content
GitHub: myblogs
Copyright belongs to the author Iforever (luluyrt@163.com). Reprinting in any form without the author's consent is prohibited; any reprinted article must credit the author and link to the original text in a prominent place on the page, otherwise the author reserves the right to pursue legal liability.