Introduction
I had nothing to do over the weekend and was bored, so I wrote a blog crawling system in PHP. The site I visit most often is cnblogs, so naturally I started with Blog Park (you can tell I still like Blog Park). My crawler is fairly simple: fetch the page content, extract what I want with regular expressions, and save it to the database. Of course, some problems came up in practice. I had thought this through beforehand and wanted it to be extensible, so that if one day I want to add CSDN, 51CTO, or Sina Blog, they can be added easily.
What can be crawled?
First of all, this is a simple crawler; not everything you can see on a page can be crawled. Some things cannot be fetched at all, like the following.
The read count, comment count, recommendation count, objection count, and the comments themselves are all fetched dynamically by JS through Ajax calls, so this crawler cannot get them. In fact, if you open a page and right-click to view the source, you will not see them in the source at all; that is where this kind of simple crawling falls short. To crawl Ajax-filled content you need another approach. I once read an article where someone loaded the page in a browser first and then filtered the fully rendered DOM (that article also admitted this is inefficient). Reconstructing the Ajax requests yourself is also possible, but probably more troublesome.
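You can see this for yourself by fetching the raw source: the Ajax-filled numbers simply are not in it. Something along these lines (the fetch_html() wrapper is only my illustration, not the project's actual code):

    // Fetch the raw HTML of a page, which is all a simple crawler ever sees.
    // Ajax-filled values (read counts, comments, ...) will NOT appear in $html.
    function fetch_html($url)
    {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);            // do not hang on a slow page
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (crawler demo)');
        $html = curl_exec($ch);
        curl_close($ch);
        return $html === false ? '' : $html;
    }

    $html = fetch_html('http://www.cnblogs.com/');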
The crawling approach
First, the crawl depth (depth).
For example, when crawling starts from link a: if depth is 1, we just fetch the content of that link and we are done; if depth is 2, we match links in a's content against the specified rules, and each matched link is then processed as a depth-1 crawl, and so on. Depth is how many levels of links we follow; this is what lets the crawler "spread out".
Of course, crawling specific content starting from a single link is very limited, and the crawl may die out early if the next level matches nothing, so multiple start links can be set. The crawl is also very likely to hit many duplicate links, so crawled links have to be marked to prevent fetching the same content repeatedly and producing redundant data. Several variables cache this information, in the following formats.
The first is a hash array whose keys are the MD5 values of the URLs and whose values are a state flag (initially 0); it maintains the set of deduplicated URLs and looks like this:
    Array
    (
        [md5-of-url-1] => 0
        [md5-of-url-2] => 0
        [md5-of-url-3] => 0
        ...
    )
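Maintaining that hash comes down to something like this (the variable and function names are mine; only the md5-key / 0-1 state idea comes from the text above):

    $visited = array();            // md5(url) => 0 (queued) or 1 (already fetched)

    function mark_if_new($url, &$visited)
    {
        $key = md5($url);
        if (isset($visited[$key])) {
            return false;          // duplicate link, skip it
        }
        $visited[$key] = 0;        // new link, queue it with state 0
        return true;
    }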
The second is the array of URLs to fetch. This part could also be optimized: I first collect all the links of every level into this array and only then loop over it to fetch the content, which means every page above the deepest level is effectively downloaded twice. Instead, the content could be stored directly while the next level's links are being collected, and the state in the hash array above flipped to 1 (fetched), which would improve efficiency. First look at the array where the links are saved:
    Array
    (
        [0] => Array
            (
                [0] => http://zzk.cnblogs.com/s?t=b&w=php&p=1
            )
        [1] => Array
            (
                [0] => http://www.cnblogs.com/baochuan/archive/2012/03/12/2391135.html
                [1] => http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html
                [2] => http://www.cnblogs.com/zuoxiaolong/p/java1.html
                ......
            )
        [2] => Array
            (
                [0] => http://www.cnblogs.com/ohmygirl/category/623392.html
                [1] => http://www.cnblogs.com/ohmygirl/category/619019.html
                [2] => http://www.cnblogs.com/ohmygirl/category/619020.html
                ......
            )
    )
Finally, all the links are merged into one array and returned, and the program loops over it to fetch each link's content. Just as above with depth 2: the level-0 content has already been fetched, but only to extract the level-1 links; the level-1 content has also been fetched, but only to collect the level-2 links. The actual content pass then downloads everything again, and the state in the hash array above is never really used... (this needs optimizing).
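To make that flow concrete, here is a rough sketch of a depth-limited collect-and-fetch loop that also applies the optimization mentioned above (store the content while collecting links and flip the state to 1). fetch_html(), extract_links() and grab_article() are placeholder helpers for illustration, not the project's real methods:

    function crawl(array $startUrls, $depth)
    {
        $visited = array();                        // md5(url) => 0/1, as in the hash above
        $levels  = array(0 => $startUrls);         // links grouped by level, as in the dump above

        for ($level = 0; $level < $depth; $level++) {
            $levels[$level + 1] = array();
            foreach ($levels[$level] as $url) {
                $key = md5($url);
                if (isset($visited[$key]) && $visited[$key] == 1) {
                    continue;                      // already fetched, skip duplicates
                }
                $html = fetch_html($url);          // fetch the page once ...
                grab_article($html);               // ... and store the article right away
                $visited[$key] = 1;                // mark as fetched (state 1)

                if ($level < $depth - 1) {         // only collect links if another level follows
                    foreach (extract_links($html) as $link) {
                        if (!isset($visited[md5($link)])) {
                            $visited[md5($link)] = 0;        // new link, state 0
                            $levels[$level + 1][] = $link;   // queue it for the next level
                        }
                    }
                }
            }
        }
        return $levels;
    }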
Then there are the regular expressions for extracting the article. After analyzing the article pages on Blog Park, I found that the title and the body can basically be extracted very reliably.
Title: the title's HTML has a fixed form, so it can easily be matched with the following regex:
    #<a\s*id="cb_post_title_url"[^>]*?>(.*?)<\/a>#is
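Applied with preg_match it looks roughly like this (the cb_post_title_url id is my reading of the cnblogs markup, so treat the pattern as an approximation):

    // $html is the raw page source fetched earlier
    if (preg_match('#<a\s*id="cb_post_title_url"[^>]*?>(.*?)<\/a>#is', $html, $m)) {
        $title = trim(strip_tags($m[1]));   // the captured group is the post title
    }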
Body: the body could in principle be extracted easily using the balancing-group feature of more advanced regex engines, but after half a day I found PHP's support for balancing groups is not good, so I gave that up. Looking at the HTML source, I found the following regex also matches the article body easily, since each article keeps its content inside the same container:
    #(<div\s*id="cnblogs_post_body"[^>]*?>.*)#is
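Usage is the same as for the title; a rough sketch (again, the div id is my reading of the cnblogs markup, and the greedy match keeps everything after the container, so the tail still needs trimming):

    if (preg_match('#(<div\s*id="cnblogs_post_body"[^>]*?>.*)#is', $html, $m)) {
        $body = $m[1];   // from the body container onward; trim the trailing HTML as needed
    }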
The publication time of a post can also be extracted, although for some articles it cannot be found; I will not list that here. With these pieces, the content can be crawled.
Start crawling
Now to actually crawl. At first I set the crawl depth to 2 levels and used the Blog Park home page as the start page, but found it did not pick up much content. Then I noticed the home page has pagination navigation.
So I tried splicing pages in the form http://www.cnblogs.com/#p2, looping 200 times and using each page as a start page for a depth-2 crawl. But I celebrated too early: after several processes had run for a long time and grabbed hundreds of thousands of records, I found they were all duplicates, all crawled from the first page, because clicking the home-page navigation (anything past page 1) fetches the content through an Ajax request.... It seems Blog Park has thought this through: most visitors only open the home page and never page through (I only occasionally click the next page), so as a tradeoff between fending off novice crawlers and performance, the first page is served as a static page whose cache is valid for a few minutes (or refreshed based on posting frequency, or some combination of the two). That is probably why a post sometimes takes a while to show up after publishing (just my guess ^_^).
So is it impossible to crawl a lot of content in one go? Then I discovered that another place, the site search at zzk.cnblogs.com, is served entirely as static pages.
Look at it: the content there is static, including every page linked from the pagination at the bottom, and the filters on the right of the search let you further raise the quality of what you crawl. OK, with this entrance you can get plenty of high-quality articles. Below is the code that loops over 100 pages of search results:
    for ($i = 1; $i <= 100; $i++) {
        echo "page{$i}*************************[begin]******\r";
        $spidercnblogs = new c\spidercnblogs("http://zzk.cnblogs.com/s?t=b&w=php&p={$i}");
        $urls = $spidercnblogs->spiderUrls();
        foreach ($urls as $key => $value) {
            // $cnblogs is the blog model created earlier (not shown in this snippet)
            $cnblogs->grap($value);
            $cnblogs->save();
        }
    }
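For completeness, here is a guess at what grap() and save() do internally: fetch one post, run the title and body regexes from above, and insert a row through the lib/mysql.php wrapper. This is my own outline of the idea, not the actual class from the repository:

    // Hypothetical outline of the blog model used above (not the project's real code).
    class Blog
    {
        private $db;                       // assumed wrapper around lib/mysql.php
        private $post = array();

        public function __construct($db)
        {
            $this->db = $db;
        }

        public function grap($url)
        {
            $html = fetch_html($url);      // placeholder fetch helper from earlier
            if (preg_match('#<a\s*id="cb_post_title_url"[^>]*?>(.*?)<\/a>#is', $html, $m)) {
                $this->post['title'] = trim(strip_tags($m[1]));
            }
            if (preg_match('#(<div\s*id="cnblogs_post_body"[^>]*?>.*)#is', $html, $m)) {
                $this->post['body'] = $m[1];
            }
            $this->post['url'] = $url;
        }

        public function save()
        {
            if (!empty($this->post['title'])) {
                $this->db->insert('blogs', $this->post);   // assumed insert() helper
            }
        }
    }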
At this point you can go grab whatever you like. The crawl speed is not great: with 10 processes running on an ordinary PC, it took several hours to collect just over 400,000 records. Now look at the crawled content with the display slightly polished, adding Blog Park's base CSS, so you can compare the effect.
The crawled content, slightly restyled:
Original content
Then look at the project's directory structure, generated with a home-made directory-listing tool (a guess at the autoloader that ties these folders together follows the tree):
+myblogs-master
+controller
|_blog.php
|_blogcnblogs.php
|_spider.php
|_spidercnblogs.php
+core
|_autoload.php
+interface
|_blog.php
+lib
|_mysql.php
+model
|_blog.php
|_app.php
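core/autoload.php presumably maps class names onto this layout; a minimal sketch of such an autoloader (my own guess, not the file from the repository):

    // Map a class name like "controller\spidercnblogs" onto controller/spidercnblogs.php,
    // assuming class names mirror the folder and file names above.
    spl_autoload_register(function ($class) {
        $path = __DIR__ . '/../' . strtolower(str_replace('\\', '/', $class)) . '.php';
        if (is_file($path)) {
            require $path;
        }
    });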
The result is pretty good. Let me also guess at how a crawling-focused site like Tuicool (push cool) works: a resident process periodically fetches an entry page (such as a home page); if there is fresh content it stores it, otherwise it gives up on that round and waits for the next fetch. If the interval is small enough, it can pick up "fresh" content with almost nothing missed.
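A bare-bones sketch of that resident-process idea (purely my speculation, reusing the placeholder helpers from above):

    // Poll one entry page forever; handle only links we have not seen before.
    $seen = array();                                   // md5(url) => true, links already handled
    while (true) {
        $html = fetch_html('http://www.cnblogs.com/'); // the entry page being watched
        foreach (extract_links($html) as $link) {
            $key = md5($link);
            if (!isset($seen[$key])) {
                $seen[$key] = true;                    // fresh content: grab and store it here
            }
        }
        sleep(60);                                     // a small interval means hardly any "fresh" post is missed
    }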
This is the GitHub address:
Github--myblogs
Copyright belongs to the author iforever (luluyrt@163.com). Reprinting in any form without the author's consent is prohibited; any reprint must credit the author and link to the original in a prominent place on the page, otherwise the author reserves the right to pursue legal liability.
That covers the blog crawling system described above. I hope it is helpful to friends interested in PHP.