Introduction to resource-based HTTP cache implementation

Last Update:2014-08-24 Source: Internet

Author: User

Tags knowledge base website performance

As we all know, browsers cache the web pages they have visited. When a browser requests a page by URL, it displays the content and also stores a copy on the local computer. If the page has not been updated, the browser does not download it again on the next visit to that URL but uses the locally cached copy instead. The page is downloaded again only when the website clearly indicates that the resource has been updated.

1. What is HTTP cache?

Everyone is familiar with this browser caching mechanism. Take the javaeye news feed, http://www.iteye.com/rss/news, as an example: when a browser or feed reader requests this URL, the javaeye server sends the following validators to the browser in the response headers:

Response headers

ETag: "427fe7b6442f2096dff4f92339305444"
Last-Modified: Fri, 04 Sep 2009 05:55:43 GMT

This tells the browser the last modification time and the ETag of the news feed resource. The browser stores these two values locally together with the page content. When the browser requests the javaeye news feed again, it sends the following two validators to the javaeye server:

Request headers

If-None-Match: "427fe7b6442f2096dff4f92339305444"
If-Modified-Since: Fri, 04 Sep 2009 05:55:43 GMT

This tells the server the last modification time and ETag of the locally cached page and asks, in effect: has this resource been updated since my last visit? The javaeye server checks and finds that no news has been published since the user's last visit, so there is no need to generate the RSS feed again. It simply tells the browser: "There is nothing new; use your cached copy." The server sends a 304 Not Modified response with no body and has nothing else to do.
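The exchange above can be sketched in plain Ruby. This is a simplified illustration, not how any particular server implements it: `conditional_status` is a hypothetical helper, it requires every validator the client sent to match, and it ignores details such as weak ETags or lists of ETags.

```ruby
require "time"

# Hypothetical helper (not part of any framework): decide how to answer a
# conditional GET, given the validators the server currently holds for the
# resource and the headers the client sent.
def conditional_status(current_etag, last_modified, request_headers)
  client_etag  = request_headers["If-None-Match"]
  client_since = request_headers["If-Modified-Since"]

  # No validators sent: this is a first visit, so send the full page.
  return 200 if client_etag.nil? && client_since.nil?

  # Every validator the client sent must still match the current resource.
  etag_ok  = client_etag.nil? || client_etag == current_etag
  since_ok = client_since.nil? || Time.httpdate(client_since) >= last_modified

  etag_ok && since_ok ? 304 : 200
end

etag     = '"427fe7b6442f2096dff4f92339305444"'
modified = Time.httpdate("Fri, 04 Sep 2009 05:55:43 GMT")
cached   = {
  "If-None-Match"     => etag,
  "If-Modified-Since" => "Fri, 04 Sep 2009 05:55:43 GMT"
}

first_visit  = conditional_status(etag, modified, {})      # no cache yet
revisit      = conditional_status(etag, modified, cached)  # nothing changed
after_update = conditional_status('"new-etag"',
                                  Time.httpdate("Sat, 05 Sep 2009 08:00:00 GMT"),
                                  cached)                   # resource changed
```

Here `first_visit` and `after_update` are 200 (a full response is needed), while `revisit` is 304.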

This is caching at the HTTP layer. This resource-based cache mechanism not only saves a great deal of server processing, but also reduces the number of page downloads and saves a lot of network bandwidth.

2. What is the role of HTTP cache?

In typical dynamic website programming, server programs simply ignore the If-None-Match and If-Modified-Since headers sent by browsers: a page is generated and sent whenever a request arrives. Users generally do not refresh the same page endlessly, so it is tempting to think this resource-based cache has little effect. That is not the case:

1. Intelligent web crawlers such as Google's recognize the validators of a resource. With this cache mechanism in place, far fewer pages need to be generated for crawlers.

For example, Google crawls the javaeye website about 150,000 times a day, yet no more than 10,000 javaeye pages are updated each day. Because much of the content is updated quickly, Google keeps crawling, which wastes a lot of resources. With HTTP cache, a page is regenerated and downloaded only when its content has changed; otherwise we can answer Google's crawler with 304 Not Modified directly. This reduces server load and the bandwidth consumed by crawlers, and it also makes Google's crawler much more efficient. Everyone benefits.

2. Many pages are updated infrequently. Even though users do not refresh them often, HTTP cache still pays off over a long period of time.

For example, some historical posts were discussed months ago and their content rarely changes. Users still reach these pages from time to time through search, bookmarks, and related-article links. Once a user has visited such a page, the server can answer all of that user's later requests with 304 Not Modified instead of generating the page.

3. Using HTTP cache for historical posts avoids the cost of repeated crawling by crawlers.

For example, in the javaeye forum post list, few users ever open pages beyond page 20, but the server logs show that a large number of crawlers crawl all the way to the last pages every day. Because users rarely click these pages, they are usually not in the application's memcached cache, so each access is expensive, and crawlers requesting them every time put a heavy burden on the server. With HTTP cache, the server can return 304 Not Modified directly no matter how many times a crawler has already crawled the page, which greatly reduces server load.
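To get a feel for the Google figures quoted above, here is a rough back-of-the-envelope estimate. It rests on one simplifying assumption that is not from the original measurements: each updated page needs exactly one full response per day, and every other crawl request can be answered with 304 Not Modified.

```ruby
crawls_per_day  = 150_000   # crawl requests per day, as quoted above
updated_per_day = 10_000    # upper bound on updated pages per day, as quoted above

# Assumption for this estimate only: one full response per updated page per
# day; all remaining crawl requests hit unchanged content.
full_responses = [updated_per_day, crawls_per_day].min
not_modified   = crawls_per_day - full_responses
share          = (not_modified * 100.0 / crawls_per_day).round(1)

puts "#{not_modified} of #{crawls_per_day} crawl requests (#{share}%) could be 304"
```

Under that assumption, roughly 140,000 of the 150,000 daily crawl requests would not require generating a page. The real ratio depends on how often pages change between crawls.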

3. How to use HTTP cache in applications

Implementing HTTP cache in our own programs is very simple, especially in Rails, where only a little code is needed. For the javaeye news feed above, a single line is enough:

Ruby code

def news
  fresh_when(:last_modified => News.last.created_at, :etag => News.last)
end

The latest news article is used as the ETag, and its creation time (created_at) is used as the last modification time of the resource. If the validators sent by the browser match those on the server, the content has not been updated and 304 Not Modified is sent directly. If they do not match, the content has been updated and the browser's local cache is stale, so the server has to generate the page.
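Passing a model object as the ETag works because the framework turns the object into a string key and digests it. The sketch below imitates that idea in plain Ruby. `NewsItem`, `etag_for`, and the key format are hypothetical and chosen only for illustration; the exact key Rails builds is not shown in this article.

```ruby
require "digest/md5"

# Hypothetical stand-in for the News model; only the fields used here.
NewsItem = Struct.new(:id, :created_at)

# Assumed key format for illustration: record id plus creation timestamp.
# The ETag is the quoted MD5 digest of that key.
def etag_for(item)
  key = "news/#{item.id}-#{item.created_at.utc.strftime('%Y%m%d%H%M%S')}"
  %("#{Digest::MD5.hexdigest(key)}")
end

latest     = NewsItem.new(101, Time.utc(2009, 9, 4, 5, 55, 43))
etag_now   = etag_for(latest)
etag_again = etag_for(latest)                                             # same record
etag_newer = etag_for(NewsItem.new(102, Time.utc(2009, 9, 5, 8, 0, 0)))  # newer article
```

As long as the latest article stays the same, `etag_now` and `etag_again` are identical and a revisit can be answered with 304. Once a newer article is published, the ETag changes and the feed is regenerated.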

The above is only the simplest example. It is easy to do more work based on the resource state. Take the javaeye blog RSS feed, http://robbin.iteye.com/rss, as an example:

Ruby code

@blogs = @blog_owner.last_blogs
@hash = @blogs.collect { |b| { b.id => b.post.modified_at.to_i + b.posts_count } }.hash
if stale?(:last_modified => (@blog_owner.last_blog.post.modified_at || @blog_owner.last_blog.post.created_at), :etag => @hash)
  render :template => "rss/blog"
end

This implementation is a little more complicated. We need to determine whether any of the articles output in the blog feed has been updated. We therefore compute a hash from the last modification time of each blog article and its number of comments, and use the hash value as the resource's ETag. As soon as any of these blog posts is modified or receives a new comment, the ETag changes, which tells the browser that the content has been updated.
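One caveat when reusing this idea today: the Ruby documentation for `Object#hash` notes that hash values may differ between Ruby invocations, so an ETag source built with `.hash` may not agree across several server processes. A minimal sketch of an alternative, assuming a hypothetical `BlogEntry` struct and `feed_etag` helper, builds the key from the field values themselves and digests it:

```ruby
require "digest/md5"

# Hypothetical stand-in for a blog entry; field names follow the code above.
BlogEntry = Struct.new(:id, :modified_at, :posts_count)

# Build the ETag source from explicit field values rather than Object#hash,
# so every server process computes the same string for the same data.
def feed_etag(blogs)
  key = blogs.map { |b| "#{b.id}:#{b.modified_at.to_i}:#{b.posts_count}" }.join(",")
  Digest::MD5.hexdigest(key)
end

blogs = [
  BlogEntry.new(1, Time.utc(2009, 9, 1), 3),
  BlogEntry.new(2, Time.utc(2009, 9, 3), 0)
]
before = feed_etag(blogs)

blogs[1].posts_count += 1        # a new comment arrives on the second post
after_comment = feed_etag(blogs)
```

The digest stays the same while the data is unchanged and differs once a comment count or modification time changes, which is the property the ETag needs.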

Besides RSS feeds, the javaeye website has many other places where HTTP cache fits. For example, users who like to hang around in the forum may refresh a board's post list page frequently to check for new posts, so there is no need to run the program and generate the page on every request. If there is no new post, we can answer 304 Not Modified directly. The board action before HTTP cache is added:

Ruby code

def board
  @topics = @forum.topics.paginate...
  @announcements = (params[:page] || 1).to_i == 1 ? Topic.find :all, :conditions => ...
  render :action => 'show'
end

After HTTP cache is added, the code is as follows:

Ruby code

def board
  @topics = @forum.topics.paginate...
  if logged_in? || stale?(:last_modified => @topics[0].last_post.created_at, :etag => @topics.collect { |t| { t.id => t.posts_count } }.hash)
    @announcements = (params[:page] || 1).to_i == 1 ? Topic.find :all, :conditions...
    render :action => 'show'
  end
end

HTTP cache is not used for logged-in users, because they need to receive message notifications and subscription notifications in real time; we can use HTTP cache only for anonymous users. We then build a hash from the IDs and reply counts of all posts on the current page and use it as the ETag. The page is regenerated whenever any post on the current page changes or receives a new reply; otherwise there is no need to regenerate it.

HTTP cache can also be used on the forum post page, but the hash algorithm for the ETag is a little more complicated: any change to the post must change the hash value. The sample code is as follows:

Ruby code

def show
  @topic = Topic.find params[:id]
  user_session.update_... if logged_in?
  Topic.increment_counter(...) if ......
  @posts = @topic.post_by_page params[:page]
  posts_hash = @posts.collect { |p| { p.id => p.modified_at } }.hash
  topic_hash = @topic.forum_id + @topic.sys_tag_id.to_i + @topic.title.hash + @topic.status_flag.hash
  ad_hash = ... # ad hash algorithm, omitted
  if logged_in? || stale?(:etag => [posts_hash, topic_hash, ad_hash])
    render
  end
end

We hash all replies on the page, the topic's own fields, and the ad content on the post page, and compute a single ETag value so that any change produces a new ETag. That is all there is to it. This kind of post caching is very effective: it spares Rails from rendering the page and the client from downloading it, greatly reducing server load and bandwidth.
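One detail worth checking in a scheme like this is how the parts are combined. `topic_hash` adds its parts together, and a sum can stay the same when two fields change in opposite directions. The sketch below, using a hypothetical `TopicFields` struct and helper names, contrasts adding with keeping each field in its own position before digesting:

```ruby
require "digest/md5"

# Hypothetical topic fields, mirroring the names used in the code above.
TopicFields = Struct.new(:forum_id, :sys_tag_id, :title, :status_flag)

# Adding the numeric parts together, as topic_hash does.
def summed(topic)
  topic.forum_id + topic.sys_tag_id.to_i
end

# Keeping each part in its own position before digesting.
def joined(topic)
  parts = [topic.forum_id, topic.sys_tag_id.to_i, topic.title, topic.status_flag]
  Digest::MD5.hexdigest(parts.join("|"))
end

a = TopicFields.new(1, 2, "HTTP cache", 0)
b = TopicFields.new(2, 1, "HTTP cache", 0)   # forum and tag swapped

same_sum    = summed(a) == summed(b)   # true: 1 + 2 == 2 + 1
same_digest = joined(a) == joined(b)   # false: "1|2|..." differs from "2|1|..."
```

Whether such collisions matter in practice depends on how the fields actually change; passing the parts as an array, as the final `stale?` call already does for the three hashes, avoids the issue.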

Another example has a special requirement. For the page that recommends related articles from a knowledge base search, we want to answer 304 Not Modified based only on how recently the client fetched the page. Rails has no ready-made facility for this, so we need to know a little about how Rails works and write it ourselves. The sample code is as follows:

Ruby code

def topic
  @topic = Topic.find(params[:id])
  unless logged_in?
    if request.not_modified?(5.days.ago)
      head :not_modified
    else
      response.last_modified = Time.now
    end
  end
end

On each request, we check whether the user has fetched the page within the last five days. If so, 304 Not Modified is returned directly. If the user has never fetched the page, or the last fetch was more than five days ago, we set the last modification time to the current time and generate the page for the user. Simple, isn't it?
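The five-day window can be illustrated without Rails. In this sketch, `within_window?` is a hypothetical helper and `now` is passed in explicitly so the behavior is easy to follow; it assumes the check amounts to comparing the client's If-Modified-Since value against a cutoff five days in the past.

```ruby
require "time"

FIVE_DAYS = 5 * 24 * 60 * 60   # seconds

# Hypothetical helper mirroring the check above: the cached copy is accepted
# when the If-Modified-Since value is no older than five days before `now`.
def within_window?(if_modified_since, now)
  return false if if_modified_since.nil?   # first visit: nothing cached yet
  Time.httpdate(if_modified_since) >= now - FIVE_DAYS
end

now = Time.utc(2009, 9, 10, 12, 0, 0)

first_visit = within_window?(nil, now)                               # no header sent
two_days    = within_window?("Tue, 08 Sep 2009 12:00:00 GMT", now)   # fetched 2 days ago
six_days    = within_window?("Fri, 04 Sep 2009 05:55:43 GMT", now)   # fetched 6 days ago
```

Only `two_days` is true, so only that request would receive 304. The other two get a freshly generated page whose Last-Modified is set to the current time, which restarts the window.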

After HTTP cache was added to all RSS feed outputs of the javaeye website, we observed that more than half of the RSS requests were served from cache with a direct 304 Not Modified, so the effect is very noticeable. Because the javaeye website receives more than 100,000 dynamic RSS feed requests every day, adding HTTP cache relieves a good deal of server load and bandwidth consumption. HTTP cache can also be added to the news article pages, the entire forum channel, and the knowledge base related-article recommendation pages. By a rough estimate, if all javaeye pages used HTTP cache, overall website performance could improve by at least 10%.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
