Blog Crawl System in PHP



Introduction

I had nothing to do over the weekend and was bored, so I used PHP to build a blog crawl system. The site I visit most often is cnblogs, so of course I started with the blog park (see, I still like cnblogs). My crawler is fairly simple: fetch the page content, extract what I want with regular expressions, and save it to the database. Naturally, some problems came up in practice. I thought about extensibility before starting: if someday I want to add CSDN, 51CTO, or Sina Blog, the system should make that easy.

What can be crawled?

First of all, this is a simple crawler. Not everything visible on a page can be fetched, and some content cannot be crawled at all.

The crawl is driven by a depth setting. For example, starting from link A: if the depth is 1, fetching the content of the current link is all there is to do. If the depth is 2, links are matched out of A's content according to the configured rules, and each matched link is then handled as a depth-1 crawl, and so on. Depth is the number of link levels to follow, the hierarchy, and it is what lets the crawler "crawl outward". A sketch of the idea appears below.
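To make that concrete, here is a minimal sketch of depth-limited link collection. Everything in it is hypothetical illustration, not the project's actual spider class: a plain function using file_get_contents and a caller-supplied link pattern (which must contain one capture group for the URL).

    <?php
    // Sketch only: collect links level by level up to the configured depth
    // (hypothetical helper, not the real spider class from this project).
    function collectLinks(string $startUrl, int $depth, string $linkPattern): array
    {
        $levels = [[$startUrl]];                    // $levels[0] holds the start link(s)
        for ($d = 1; $d <= $depth; $d++) {
            $next = [];
            foreach ($levels[$d - 1] as $url) {
                $html = @file_get_contents($url);   // fetch the page content
                if ($html === false) {
                    continue;                       // skip unreachable pages
                }
                // Match outgoing links against the rule configured for this site
                if (preg_match_all($linkPattern, $html, $m)) {
                    $next = array_merge($next, $m[1]);
                }
            }
            $levels[$d] = $next;                    // links found at this depth
        }
        return $levels;
    }

With depth 1 the loop body runs once on the start link; with depth 2 it runs again over every matched link, which is exactly the "and so on" described above.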

Of course, crawling specific content from a single starting link is very limited, and the crawl may die out early (when the next level matches nothing), so multiple start links can be configured. A crawl is also likely to hit many duplicate links, so crawled links have to be marked to avoid fetching the same content repeatedly and producing redundant data. A few variables cache this information, in the following formats.

The first is a hash array: each key is the MD5 of a URL and each value is the state, initially 0. It maintains the set of unique URLs, and looks like this:

    Array
    (
        [<md5 of url>] => 0
        [<md5 of url>] => 0
        [<md5 of url>] => 0
        ...
    )
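A sketch of how that array could be maintained (assuming a plain in-memory array; the helper name is hypothetical):

    <?php
    // Sketch: skip duplicate URLs via an md5-keyed hash array.
    // State 0 = queued but not fetched, 1 = content acquired.
    $seen = [];

    function enqueueUrl(array &$seen, string $url): bool
    {
        $key = md5($url);
        if (isset($seen[$key])) {
            return false;          // already known: don't crawl it again
        }
        $seen[$key] = 0;           // state 0: not yet fetched
        return true;
    }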

The second is the array of URLs to fetch. There is room for optimization here: I first collect every link into this array and only then loop over it fetching content, which means everything above the maximum depth is effectively fetched twice. Instead, the content could be fetched right when the next level's links are being collected, and the state in the hash array above flipped to 1 (acquired), which would improve efficiency. First, here is what the saved link array looks like:

    Array
    (
        [0] => Array
            (
                [0] => http://zzk.cnblogs.com/s?t=b&w=php&p=1
            )

        [1] => Array
            (
                [0] => http://www.cnblogs.com/baochuan/archive/2012/03/12/2391135.html
                [1] => http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html
                [2] => http://www.cnblogs.com/zuoxiaolong/p/java1.html
                ...
            )

        [2] => Array
            (
                [0] => http://www.cnblogs.com/ohmygirl/category/623392.html
                [1] => http://www.cnblogs.com/ohmygirl/category/619019.html
                [2] => http://www.cnblogs.com/ohmygirl/category/619020.html
                ...
            )
    )

Finally, all the links are assembled into an array and returned so the program can loop over them and fetch the content. With a depth of 2 as above, the level-0 content has already been fetched (just to collect the level-1 links), and all the level-1 content has been fetched too (just to collect the level-2 links), yet only the level-2 links are saved, and the real fetching pass then retrieves everything again. The state field in the hash array above is not used at all yet... (this needs optimizing).
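That optimization could be sketched like this, continuing the collectLinks sketch above: fetch each page once while its links are being collected, store the article immediately (saveArticle is a hypothetical stand-in for the extract-and-save step), and flip the state flag so the final pass only has to handle the deepest level.

    // Sketch of the optimization, inside the loop of the earlier sketch:
    // grab content at collection time and mark the URL as fetched (state 1).
    foreach ($levels[$d - 1] as $url) {
        $html = @file_get_contents($url);
        if ($html === false) {
            continue;
        }
        saveArticle($url, $html);      // hypothetical: extract and store right now
        $seen[md5($url)] = 1;          // state 1: content already acquired
        // ...then match $html for the next level's links as before
    }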

Then there are the regular expressions for extracting an article. From analysing cnblogs article pages, the title and the body can both be extracted quite reliably.

Title: the title's HTML follows a consistent format (a link with the id cb_post_title_url), so it is easy to match with a regex like the following:

    #<a\s*id="cb_post_title_url"[^>]*?>(.*?)<\/a>#is

Body: the body could in principle be extracted with balancing groups, an advanced regex feature, but after half a day I found PHP's support for balancing groups is poor, so I gave that up. Looking at the HTML source, the following regex also matches the article body easily, since each article's content basically sits inside a <div id="cnblogs_post_body"> element:

    #(<div id="cnblogs_post_body"[^>]*?>.*)#is
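Putting the two patterns to work might look like the sketch below. The patterns are the reconstructions shown above, so treat the exact attribute matching as approximate; the URL is just one of the example links from earlier.

    <?php
    // Sketch: extract title and body from a fetched cnblogs article page.
    $html = file_get_contents('http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html');

    $title = null;
    if (preg_match('#<a\s*id="cb_post_title_url"[^>]*?>(.*?)<\/a>#is', $html, $m)) {
        $title = trim(strip_tags($m[1]));   // drop any markup inside the link text
    }

    $body = null;
    if (preg_match('#(<div id="cnblogs_post_body"[^>]*?>.*)#is', $html, $m)) {
        $body = $m[1];                      // everything from the body div onward
    }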

Start crawling:

    for ($i = 1; $i <= 100; $i++) {
        echo "page{$i}*************************[begin]******************\r";
        $spiderCnblogs = new C\SpiderCnblogs("http://zzk.cnblogs.com/s?t=b&w=php&p={$i}");
        $urls = $spiderCnblogs->spiderUrls();
        foreach ($urls as $key => $value) {
            $cnblogs->grap($value);   // $cnblogs: the grabber/storage object built earlier (not shown)
            $cnblogs->save();
        }
    }

At this point you can go off and grab whatever you like. The crawl speed is not great: running 10 processes of the above on an ordinary PC for several hours, I only collected 400,000-odd records. Here is the crawled content after a little display polish, with the blog park's base CSS added, so you can see the effect.

The crawled content, slightly restyled:

Original content

GitHub: myblogs

Copyright belongs to the author, Iforever (luluyrt@163.com). Reprinting in any form without the author's consent is prohibited; any reprint must credit the author and link to the original in a prominent place on the article page, otherwise the author reserves the right to pursue legal responsibility.

