Blog Crawl System: A PHP Tutorial



Introduction

With nothing to do over the weekend and feeling bored, I used PHP to build a blog crawling system. The site I visit most often is cnblogs, so naturally I started with the blog park (you can see I still like it). My crawler is fairly simple: fetch the page content, extract the parts I want with regular expressions, and save them to the database. Of course, some problems come up in practice. Before starting I had already decided I wanted it to be extensible, so that if one day I want to add CSDN, 51CTO, or Sina Blog, those sources can be added easily.
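One way to get that kind of extensibility is one spider subclass per site. Here is a minimal sketch of the idea, not the original code: the C namespace and the Spidercnblogs class name appear in the driver code further down, but every other name and the regex here are my assumptions.

    namespace C;

    // Hypothetical base class: each site only supplies its own
    // link-matching rule; the shared fetching logic lives here.
    abstract class Spider
    {
        protected $startUrl;

        public function __construct($startUrl)
        {
            $this->startUrl = $startUrl;
        }

        // each site defines the regex that recognizes its article links
        abstract protected function linkPattern();

        public function spiderUrls()
        {
            $html = file_get_contents($this->startUrl);
            preg_match_all($this->linkPattern(), $html, $m);
            return $m[1];   // the captured hrefs
        }
    }

    // Adding CSDN, 51CTO, or Sina Blog would mean one more subclass.
    class Spidercnblogs extends Spider
    {
        protected function linkPattern()
        {
            // assumed pattern: capture hrefs that point at cnblogs posts
            return '#<a[^>]+href="(http://www\.cnblogs\.com/[^"]+)"#i';
        }
    }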

What can be crawled?

First, a caveat: this is a simple crawler, so not everything you can see on a page can be captured, and some content cannot be crawled at all. The crawl is driven by start links and a depth setting.

For example, starting from link A: if the depth is 1, we just fetch the content of the current link and we are done. If the depth is 2, we match the links in A's content against the specified rules, and each matched link in turn gets the depth-1 treatment, and so on. Depth is the number of link levels to follow, the hierarchy, and it is what lets the crawler "crawl outward".
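As a rough sketch of that level-by-level expansion (collectUrls() and getLinks() are my own placeholder names, not the original code):

    function collectUrls($startUrls, $depth)
    {
        // level 0 holds the start links themselves
        $levels = array($startUrls);
        for ($d = 1; $d <= $depth; $d++) {
            $next = array();
            foreach ($levels[$d - 1] as $url) {
                // getLinks() is an assumed helper: fetch the page at $url
                // and return the links matching the site-specific rule
                $next = array_merge($next, getLinks($url));
            }
            $levels[$d] = $next;
        }
        // one sub-array of URLs per level, like the dump shown further down
        return $levels;
    }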

Of course, crawling specific content from a single start link is quite limited, and the crawl may even die out early (when the next level matches nothing), so multiple start links can be set for one crawl. Crawling will also hit many duplicate links, so crawled links have to be marked to prevent fetching the same content repeatedly and creating redundancy. Several variables cache this information, in the following formats.

The first is a hash array whose keys are the MD5 values of the URLs and whose values are the state, initially 0. It maintains the set of unique URLs, in a form like this:

Array
(
    [md5 of url 1] => 0
    [md5 of url 2] => 0
    [md5 of url 3] => 0
    ...
)
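In code, the duplicate check might look something like this (a minimal sketch; the class and property names are my placeholders):

    class UrlSeen
    {
        // key = md5 of the URL, value = state (0 = queued, 1 = fetched)
        private $hashUrl = array();

        public function isDuplicate($url)
        {
            $key = md5($url);
            if (isset($this->hashUrl[$key])) {
                return true;            // seen before: skip, avoid redundancy
            }
            $this->hashUrl[$key] = 0;   // remember it with state 0
            return false;
        }
    }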

The second is the array that collects the URLs. This part could also be optimized: I gather the links from every level into one big array and only then loop over it to fetch the content, which means the content at every level except the deepest ends up being fetched twice. The content could instead be fetched directly while descending to the next level, with the state in the hash array above then set to 1 (acquired), which would improve efficiency. First, look at the contents of the array where the links are saved:

Array
(
    [0] => Array
        (
            [0] => http://zzk.cnblogs.com/s?t=b&w=php&p=1
        )

    [1] => Array
        (
            [0] => http://www.cnblogs.com/baochuan/archive/2012/03/12/2391135.html
            [1] => http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html
            [2] => http://www.cnblogs.com/zuoxiaolong/p/java1.html
            ...
        )

    [2] => Array
        (
            [0] => http://www.cnblogs.com/ohmygirl/category/623392.html
            [1] => http://www.cnblogs.com/ohmygirl/category/619019.html
            [2] => http://www.cnblogs.com/ohmygirl/category/619020.html
            ...
        )
)

Finally, all the links are assembled into one array and returned, and the program loops over it to fetch the content. In the example above the depth is 2: the level-0 content was already fetched just to collect the level-1 links, the level-1 content was fetched just to collect the level-2 links, and the level-2 links are merely saved. When the content is actually retrieved, the levels above get fetched all over again, and the state field in the hash array is never actually used... (this needs optimizing).
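The optimization hinted at here would be to fetch each page exactly once, at the moment its links are extracted, and flip its state in the hash array to 1. A sketch of that idea (getContent(), extractLinks(), and saveArticle() are assumed helper names):

    function crawlOnce($url, $depth, array &$hashUrl)
    {
        $key = md5($url);
        if (isset($hashUrl[$key]) && $hashUrl[$key] == 1) {
            return;                         // state 1: content already acquired
        }
        $html = getContent($url);           // assumed helper: fetch the page once
        $hashUrl[$key] = 1;                 // flip state to 1 (acquired)
        saveArticle($html);                 // assumed helper: extract and store now
        if ($depth > 0) {
            foreach (extractLinks($html) as $link) {   // assumed helper
                crawlOnce($link, $depth - 1, $hashUrl);
            }
        }
    }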

Then come the regular expressions that extract the article itself. Analyzing the article pages on the blog park shows that the title and the body both follow a regular enough structure to be matched reliably.

Title: the HTML around the title has a fixed format and is easily matched with the following regex:

#<a[^>]*?>(.*?)<\/a>#is

Body: the body could in principle be extracted elegantly with the balancing-group feature of advanced regex engines, but after half a day I found that PHP's support for balancing groups is not very good, so I gave up on that approach. Looking at the HTML source, the following regex also matches the article body easily, since each article's body sits in the same kind of container:

#(<div[^>]*?>.*)#is
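Putting the two patterns together, the grap() call used in the driver code below might reduce to something like this sketch (the Cnblogs class, its properties, and the fetch via file_get_contents() are my assumptions; only grap() itself is named in the original):

    class Cnblogs
    {
        public $title;
        public $content;
        public $url;

        // Sketch: fetch one post page and pull out the title and body
        // with the two regexes above. In the real page both tags are
        // narrowed by their attributes.
        public function grap($url)
        {
            $this->url = $url;
            $html = file_get_contents($url);
            if (preg_match('#<a[^>]*?>(.*?)<\/a>#is', $html, $m)) {
                $this->title = trim(strip_tags($m[1]));   // group 1 = title text
            }
            if (preg_match('#(<div[^>]*?>.*)#is', $html, $m)) {
                $this->content = $m[1];                   // body markup, kept as HTML
            }
        }
    }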

Begin:

for ($i = 1; $i <= 100; $i++) {
    echo "page{$i}*************************[begin]******************\r\n";
    $spidercnblogs = new C\Spidercnblogs("http://zzk.cnblogs.com/s?t=b&w=php&p={$i}");
    $urls = $spidercnblogs->spiderUrls();
    foreach ($urls as $key => $value) {
        // $cnblogs is assumed to have been instantiated earlier
        $cnblogs->grap($value);
        $cnblogs->save();
    }
}
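The save() call is where each record lands in the database. As a minimal sketch of what it might do inside the hypothetical Cnblogs class above (the PDO DSN, credentials, and the posts table layout are all my assumptions, not the original schema):

    // belongs inside the hypothetical Cnblogs class sketched earlier
    public function save()
    {
        // connection details and table layout are assumptions
        $db = new PDO('mysql:host=localhost;dbname=blog;charset=utf8', 'user', 'pass');
        $stmt = $db->prepare(
            'INSERT INTO posts (title, content, url) VALUES (?, ?, ?)'
        );
        $stmt->execute(array($this->title, $this->content, $this->url));
    }

In practice you would open the PDO connection once and reuse it across saves rather than reconnecting for every record.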

At this point you can go off and crawl whatever you like. The speed is not great: running about ten of these processes on an ordinary PC and crawling for several hours, I only obtained a bit over 400,000 records. The crawled content looks reasonably good after a slight display optimization, which included adding the blog park's base CSS; you can see the effect below.

The crawled content, slightly modified:

Original content

GitHub: myblogs

Copyright belongs to the author, Iforever (luluyrt@163.com). Reprinting in any form without the author's consent is prohibited; any reprint must credit the author and link to the original in a prominent place on the page, otherwise the author reserves the right to pursue legal liability.

