Blog crawling system

Introduction

I had nothing to do over the weekend and got bored, so I used PHP to build a blog crawling system. I visit cnblogs often, so naturally I started with the blog garden (which I like very much). The idea is simple: fetch the page content, extract the parts I want with regular expression matching, and save them to the database. Of course, some problems came up along the way, but I had thought about them beforehand and designed the system to be extensible: if I want to add CSDN, 51CTO, or Sina Blog some day, it will be very easy to do.

What can be captured?

First of all, this is simple crawling, so not everything on a web page can be captured. Some items cannot be crawled at all, such as the following:

The read count, the number of comments, recommendations, objections, the comments themselves, and so on are filled in dynamically by JavaScript through AJAX calls, so they cannot be obtained this way. In fact, if you open a web page and right-click to view its source, you will see that these values are simply not there, which is why this kind of simple crawling misses them. To capture content filled in by AJAX you have to use other methods. One article I read loads the page in a browser first and then filters the whole DOM (that article also admitted this is very inefficient); alternatively, you can reconstruct and replay the AJAX requests yourself, which works but can be quite troublesome.

Concept of crawling

First, depth

For example, with a depth of 1, only the content of the current link is retrieved. With a depth of 2, links are matched from that content according to the specified rule, and each matched link is then fetched with a depth of 1, and so on. Depth, in other words, is how many levels of links are followed; it is what lets the crawler actually "crawl".
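To make this concrete, here is a minimal sketch of a depth-limited crawl. It is illustrative only: the real project wraps this logic in its Spider classes, and the link-matching pattern below is just a placeholder.

<?php
// Minimal depth-limited crawl sketch (illustrative; not the project's actual Spider code).
function crawl($url, $depth, &$seen)
{
    $key = md5($url);
    if (isset($seen[$key])) {         // already visited, skip to avoid duplicates
        return;
    }
    $seen[$key] = 1;                  // mark as fetched

    $html = @file_get_contents($url); // fetch the page content
    if ($html === false) {
        return;
    }

    // ... extract and save the article content from $html here ...

    if ($depth <= 1) {                // depth 1: only the current link, do not go deeper
        return;
    }

    // match further links by some rule; this pattern is only a placeholder
    if (preg_match_all('#<a[^>]+href="(http://www\.cnblogs\.com/[^"]+)"#i', $html, $m)) {
        foreach ($m[1] as $link) {
            crawl($link, $depth - 1, $seen);  // each matched link is crawled with depth - 1
        }
    }
}

$seen = array();
crawl('http://www.cnblogs.com/', 2, $seen);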

Of course, if you crawl from a single link for a specific kind of content, the crawler is very limited and may die before it really gets going (later levels may simply not match the content rules), so you can set multiple start links for a crawl. You will also run into many duplicate links while crawling, so the captured links have to be marked to prevent the same content from being fetched repeatedly and causing redundancy. Several variables cache this information, in the following formats:

The first is a hash array whose keys are the md5 values of the URLs and whose values are a status flag, initially 0. It maintains a duplicate-free set of URLs, in the following form:

Array
(
    [bc790cda87745fa78a2ebeffd8b48145] => 0
    [9868e03f81179419d5b74b5ee709cdc2] => 0
    [4a9506d20915a511a561be80986544be] => 0
    [818bcdd76aaa0d41ca88491812559585] => 0
    [9433c3f38fca129e46372282f1569757] => 0
    [f005698a0706284d4308f7b9cf2a9d35] => 0
    [e463afcf13948f0a36bf68b30d2e9091] => 0
    [23ce4775bd2ce9c75379890e84fadd8e] => 0
    ......
)
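A minimal sketch of how such a hash array keeps the URL set free of duplicates (the variable and function names here are made up for illustration):

<?php
// $seenUrls maps md5(url) => status (0 = queued, 1 = fetched)
function queueUrl($url, &$seenUrls, &$queue)
{
    $key = md5($url);
    if (isset($seenUrls[$key])) {     // duplicate link: ignore it
        return false;
    }
    $seenUrls[$key] = 0;              // remember it with status 0 (not fetched yet)
    $queue[] = $url;
    return true;
}

$seenUrls = array();
$queue    = array();
queueUrl('http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html', $seenUrls, $queue);
queueUrl('http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html', $seenUrls, $queue); // ignored, already seen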

The second is the array of URLs to be fetched. This part could also be optimized: at the moment I collect all the links into this array first and then loop over it to fetch the content, which means everything except the deepest level gets downloaded twice. The content could instead be grabbed directly while collecting the next level of links, with the status in the hash array above then set to 1 (fetched), which would improve efficiency. Here is what the array that stores the links looks like:

Array
(
    [0] => Array
        (
            [0] => http://zzk.cnblogs.com/s?t=b&w=php&p=1
        )

    [1] => Array
        (
            [0] => http://www.cnblogs.com/baochuan/archive/2012/03/12/2391135.html
            [1] => http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html
            [2] => http://www.cnblogs.com/zuoxiaolong/p/java1.html
            ......
        )

    [2] => Array
        (
            [0] => http://www.cnblogs.com/ohmygirl/category/623392.html
            [1] => http://www.cnblogs.com/ohmygirl/category/619019.html
            [2] => http://www.cnblogs.com/ohmygirl/category/619020.html
            ......
        )
)

Finally, all the links are combined into one array and returned, and the program loops over it to fetch the content. As it stands, the level-0 and level-1 content has already been downloaded once just to extract the level-1 and level-2 links, so when the content is actually fetched it gets downloaded again, and the status flag in the hash array above is never used... (to be optimized).
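For illustration, merging the per-level arrays into the flat list that the final fetch loop walks could look like this (a sketch, not the project's exact code; $levels stands for the array shown above):

<?php
// $levels has the shape shown above: level number => array of URLs at that level
$levels = array(
    0 => array('http://zzk.cnblogs.com/s?t=b&w=php&p=1'),
    1 => array('http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html'),
);

$allUrls = array();
foreach ($levels as $level => $urls) {
    $allUrls = array_merge($allUrls, $urls);  // combine every level into one flat list
}
// the fetch loop then walks $allUrls and downloads each page (again) to grab its content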

There are also the regular expressions for extracting the article itself. By analyzing the HTML of articles on the blog garden, I found that the title and the body can each be obtained with a regular expression.

The title HTML always has the same structure and can be matched easily with a regular expression like the following:

#<a[^>]*?>(.*?)<\/a>#is
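Applied with preg_match, title extraction looks roughly like this. Note that the id used to narrow the match is an assumption about the post-page markup, so verify it against the real HTML:

<?php
$html = file_get_contents('http://www.cnblogs.com/ohmygirl/p/internal-variable-1.html');

// "cb_post_title_url" is assumed to be the id of the title link on a post page;
// without some attribute filter the pattern would match every anchor on the page
if (preg_match('#<a[^>]*?id="cb_post_title_url"[^>]*?>(.*?)<\/a>#is', $html, $m)) {
    $title = trim(strip_tags($m[1]));
    echo $title, "\n";
}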

The body is trickier. It could in principle be extracted with the regular expression engine's balancing-group feature, but after fiddling with it for half a day it seems PHP does not support balancing groups well, so I dropped that approach. Looking at the HTML source, I found that a regular expression like the following matches the article body easily enough, since basically every article contains the body container:

#(<div\s+id="cnblogs_post_body"[^>]*?>.*)#is

The publishing time of a post can also be extracted, although for some articles it cannot be found; I will not go into that here. With these pieces in place, you can crawl the content.
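Putting the pieces together, grabbing a single article can be sketched like this. It is a simplified illustration, not the project's actual classes: the title and body patterns are the ones shown above, and the publish-time match is a deliberately crude assumption, which also shows why the time cannot be found for some articles:

<?php
// Simplified sketch of grabbing one article (the real project wraps this in its classes).
function grabArticle($url)
{
    $html = @file_get_contents($url);
    if ($html === false) {
        return null;
    }

    $article = array('url' => $url, 'title' => '', 'body' => '', 'published' => '');

    // title: anchor narrowed by an assumed id (see above)
    if (preg_match('#<a[^>]*?id="cb_post_title_url"[^>]*?>(.*?)<\/a>#is', $html, $m)) {
        $article['title'] = trim(strip_tags($m[1]));
    }
    // body: everything from the body container onwards; trim and clean before saving
    if (preg_match('#(<div\s+id="cnblogs_post_body"[^>]*?>.*)#is', $html, $m)) {
        $article['body'] = $m[1];
    }
    // publish time: any "YYYY-MM-DD HH:MM" string; some articles simply do not expose one
    if (preg_match('#\d{4}-\d{2}-\d{2}\s+\d{1,2}:\d{2}#', $html, $m)) {
        $article['published'] = $m[0];
    }

    return $article;
}

$post = grabArticle('http://www.cnblogs.com/zuoxiaolong/p/java1.html');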

Start crawling

I started crawling with the depth set to 2 and the blog garden homepage as the start page, and found that I could not capture much content. Then I noticed that the homepage has page-number navigation at the bottom.

So I tried splicing the page number into the URL, in the form http://www.cnblogs.com/#p2, and used each page as a start page with a depth of 2. I happily started several processes and let the program run for a long time, capturing hundreds of thousands of entries, only to discover that they were all duplicates: everything had been captured from the first page. The reason is that when you click a page-number button on the blog garden homepage (other than page 1), the content is fetched through an AJAX request.... It seems the blog garden has thought this through: most people only open the homepage and rarely click through to the following pages (I only occasionally click the next page myself), so to fend off naive crawlers and for performance, the first page is served as a static page whose cache is valid for a few minutes (or is refreshed based on posting frequency, or a combination of the two). That is probably also why newly published articles sometimes only show up after a short delay (my guess ^_^).

So is there no way to crawl a lot of content at once? Later I found that the blog garden's search site, zzk.cnblogs.com (the entry used in the code below), serves everything as static pages.

Everything retrieved from there is static, including all the pages reachable from the page-number navigation at the bottom, and the filter conditions on the right of the search page help improve the quality of what gets captured. With this entry point you can collect a lot of high-quality articles. Below is the code that crawls 100 result pages in a loop.

for ($i = 1; $i <= 100; $i++) {
    echo "PAGE{$i}*************************[begin]***************************\r";
    // fetch the search result page and extract the article URLs on it
    $spidercnblogs = new C\Spidercnblogs("http://zzk.cnblogs.com/s?t=b&w=php&p={$i}");
    $urls = $spidercnblogs->spiderUrls();
    foreach ($urls as $key => $value) {
        // $cnblogs is created elsewhere (not shown in this snippet)
        $cnblogs->grap($value);
        $cnblogs->save();
    }
}

At this point I could capture what I wanted. The crawling speed is not very fast: with 10 processes running on an ordinary PC, it took several hours to capture the first 400,000+ entries. After lightly tidying the captured content and adding the blog garden's basic CSS, the rendered result looks very close to the original content.

Here is the project's directory structure, generated with the directory-listing tool I made earlier:

+ MyBlogs-master
+ Controller
|_ Blog.php
|_ Blogcnblogs.php
|_ Spider.php
|_ Spidercnblogs.php
+ Core
|_ Autoload.php
+ Interface
|_ Blog.php
+ Lib
|_ Mysql.php
+ Model
|_ Blog.php
|_ App.php

The effect is quite good. For crawling a single specific site like this, I imagine a resident process would work well: fetch an entry page (such as the homepage) at a fixed interval, and if the database already has that content, discard what was just fetched and wait for the next round. With a small enough interval you can capture content while it is still "fresh".
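A minimal sketch of such a resident process, assuming an in-memory "seen" set in place of the real database check and a placeholder link pattern:

<?php
// Resident-process sketch: re-fetch an entry page every $interval seconds and
// only process links that have not been stored before. In the real system the
// "seen" check would be a lookup against the database instead of this array.
$interval = 60;                        // polling interval in seconds
$seen     = array();                   // md5(url) => 1 once stored

while (true) {
    $html = @file_get_contents('http://www.cnblogs.com/');
    if ($html !== false &&
        preg_match_all('#<a[^>]+href="(http://www\.cnblogs\.com/[^"]+)"#i', $html, $m)) {
        foreach ($m[1] as $url) {
            $key = md5($url);
            if (isset($seen[$key])) {  // nothing fresh here, discard and move on
                continue;
            }
            $seen[$key] = 1;
            // ... grab and save the new article here (see the loop above) ...
        }
    }
    sleep($interval);                  // a smaller interval means "fresher" captures
}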

This is the github address:

Github -- myBlogs

The copyright of this article belongs to the author, iforever (luluyrt@163.com). Reproduction in any form without the author's consent is prohibited; any repost must credit the author and link to the original in a prominent position on the page, otherwise the author reserves the right to pursue legal liability.

