This is an informational document intended to make Web cache concepts easier for developers to understand and apply in real application environments. For the sake of simplicity, some implementation details are simplified or omitted. If you are mainly interested in those details, you do not need to read this article closely; the reference documentation and further reading are more likely to be what you need.
A Web cache sits between one or more Web servers (the content source servers) and one or more clients. It saves copies of the responses that pass through it, such as HTML pages, images, and files (collectively referred to as copies). When a later request arrives for the same URL, the cache answers it with the stored copy instead of sending the request to the source server again.
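The idea can be sketched in a few lines of Python. This is only an illustration, not how any real cache is implemented: `SimpleCache` and `fetch_from_origin` are hypothetical names, and the sketch ignores the freshness and validation rules discussed later.

```python
from typing import Callable, Dict


class SimpleCache:
    """Toy cache keyed by URL. It never expires or validates its copies."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        # fetch_from_origin is a hypothetical callable that contacts the source server.
        self._fetch = fetch_from_origin
        self._copies: Dict[str, bytes] = {}
        self.origin_requests = 0  # how many times the source server was contacted

    def get(self, url: str) -> bytes:
        if url in self._copies:
            # Same URL as an earlier request: answer from the stored copy.
            return self._copies[url]
        # First request for this URL: go to the source server and keep a copy.
        self.origin_requests += 1
        body = self._fetch(url)
        self._copies[url] = body
        return body
```

Requesting the same URL twice through `SimpleCache.get` contacts the source server only once; the second response comes from the stored copy.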
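There are two main reasons for using a cache: to reduce latency, because a copy served from a nearby cache arrives sooner, and to reduce network traffic, because a reused copy does not have to be transferred from the source server again.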
Modern Web browsers (such as IE and Firefox) expose cache settings in their settings dialog. The browser sets aside space on your computer's hard disk to store copies of remote content you have already viewed. The browser cache works by fairly simple rules: it checks that a cached copy is fresh enough, usually once per session (that is, once until the browser is closed). This cache is particularly useful when users click "back" or follow a link to a page they have just visited. Likewise, if the same images appear on several pages during browsing, they are displayed almost instantly from the browser cache.
Proxy Server Cache
Web proxy caches work on the same principle, but at a much larger scale. A proxy serves hundreds of users in the same way. Large companies and ISPs often set up proxy caches on their firewalls or as standalone cache devices.
Because proxy caches are part of neither the client nor the source server, but sit out on the network, requests must somehow be routed to them. One method is to configure your browser manually to use the proxy. The other is to use an intercepting intermediate server: it handles all Web requests and forwards them to the network behind it, so users do not have to configure a proxy or even know that it exists.
A proxy cache is a shared cache: it is used not by a single user but by a large number of users. That makes it effective at reducing response time and bandwidth usage, because the same copy is reused many times.
Gateway Cache
A gateway cache is also called a reverse proxy cache or surrogate cache. It is also an intermediate server, but instead of being deployed by network administrators to save bandwidth, it is generally deployed by website administrators themselves to make their sites easier to scale and better performing.
Requests can be routed to gateway caches in several ways; typically one or more load balancers are used so that the gateway caches look like the source server to clients.
Content delivery networks (CDNs) distribute gateway caches throughout the Internet (or part of it) and sell the caching service to interested websites. Speedera and Akamai are typical CDNs.
This article focuses on browser and proxy caches, although some of the information applies equally to gateway caches.
Is Web Cache harmless? Why encourage caching?
Web caching is one of the most common technologies on the Internet, yet website administrators are often afraid that it costs them control of their sites. Because a proxy cache "hides" their users from them, it becomes harder to monitor who is using the site.
Unfortunately for them, even without Web caches there are too many variables on the Internet for administrators to track exactly how users use their sites. If this is your concern, this article shows how to obtain the statistics you need without making your site cache-unfriendly.
Another complaint is that caches may serve users expired or stale data. This article shows how to configure your server to control how your content is cached.
CDNs are an interesting development. Unlike other proxy caches, a CDN's gateway caches serve the interests of the website being cached, so these concerns do not apply. Even if you use a CDN, you still need to consider the proxy server caches and browser caches downstream.
On the other hand, if your website is well planned, caching helps it respond faster and reduces server load and the demand on your Internet connection. The difference can be dramatic: a page that is hard to cache may take several seconds to load, while a page served from cache appears almost instantly. Users prefer fast websites and visit them more often.
Think of it this way: many large Internet companies have invested millions of dollars in server clusters around the world to make access as fast as possible for their users. Client caches serve the same goal, only closer to the end user, and best of all, you do not have to pay for them.
In fact, whether you like it or not, proxy servers and browsers will use caching. If you do not configure caching correctly for your website, your content will be cached according to the defaults or the cache administrator's policy.
How the cache works
All caches use a set of rules to decide when to serve a cached copy (assuming one is available). Some rules are defined in the protocols (HTTP 1.0 and 1.1), and some are set by the cache administrator (the browser user or the proxy server administrator).
Generally speaking, caches follow a few basic rules (don't worry if you do not know all the details; they are explained later).
In short, freshness and validation are the two most important ways a cache decides whether its content can be used:
If the copy is fresh enough, it can be served from the cache immediately;
If the cache validates that the original has not changed, it avoids transferring the content from the source server again.
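The two checks above can be sketched as a small decision function. This is a simplified illustration under stated assumptions, not the full HTTP rule set: `CachedCopy` and `decide` are hypothetical names, and the freshness lifetime is assumed to have been computed already from the response headers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedCopy:
    stored_at: float            # time the copy was stored (seconds since the epoch)
    freshness_lifetime: float   # seconds the copy may be served without checking
    validator: Optional[str]    # e.g. an ETag or Last-Modified value, if any


def decide(copy: Optional[CachedCopy], now: float) -> str:
    """Return "serve", "validate", or "fetch" for a request at time `now`."""
    if copy is None:
        return "fetch"          # nothing cached: ask the source server
    age = now - copy.stored_at
    if age < copy.freshness_lifetime:
        return "serve"          # fresh enough: answer from the cache
    if copy.validator is not None:
        return "validate"       # stale, but the server can confirm it is unchanged
    return "fetch"              # stale and nothing to validate with
```

A result of "validate" means the cache asks the source server whether the copy changed; only if it did is the full content transferred again.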
Many tools help designers and website administrators tune how caches treat their sites. You may need to get your hands dirty with the server configuration, but it is well worth it. For details on using these tools, see the implementation sections below.
HTML meta tag and HTTP header information
HTML authors can add meta tags to a document to describe its attributes. Meta tags are easy to use, but they are not effective: only a few browser caches honor them (those that actually read the HTML), and proxy caches do not, because they almost never parse the HTML content of a document. It can be tempting to add a Pragma: no-cache meta tag to a web page, but the tag does not guarantee that the page will be refreshed.
If your website is hosted at an ISP or in a data center that does not let you control HTTP headers (such as Expires and Cache-Control), complain loudly: you need these mechanisms to do your job.
HTTP headers, on the other hand, let you control how browsers and proxy servers handle your copies. They are not visible in the HTML code and are generally generated automatically by the Web server. However, depending on the server you use, you can control them to some extent. The following sections describe the interesting HTTP headers and how to deploy them on your site.
HTTP headers are sent before the HTML and are seen only by the browser and any intermediate caches. A typical HTTP 1.1 response begins like this:
HTTP/1.1 200 OK
The HTML output follows the headers, separated from them by an empty line. For details on how to set HTTP headers, see the implementation sections.
Pragma HTTP header information (why does it not work)
Many people believe that setting Pragma: no-cache in the HTTP response headers makes content uncacheable. This is not necessarily the case: the HTTP specification does not define Pragma for response headers at all; it discusses Pragma only as a request header (that is, a header the browser sends to the server). A few caches honor this header in responses, but most do not. Instead of Pragma, use the headers described below.
Using the Expires (expiration time) HTTP header to control freshness
The Expires header is the basic means of controlling caches in HTTP. It tells every cache how long the associated copy stays fresh. After that time, the cache asks the source server whether the document has been modified. Almost all caches support the Expires header.
Most Web servers let you set Expires in several ways. Generally, you can set an absolute expiration time, a time based on when the client last viewed the copy (last access time), or a time based on when the document was last modified on the server.
Expires headers are particularly useful for caching static images (such as navigation bars and image buttons). Because these images rarely change, you can give them a very long expiration time, which makes your website feel much faster to users. They are also useful for pages that change on a regular schedule. For example, if you update a news page every morning, you can set the copy to expire at that time, so caches know when to fetch a new version without users having to press the "refresh" button in the browser.
The only valid value for an Expires header is a date and time in HTTP format; anything else will most likely be interpreted as a time in the past, so the copy is treated as already expired. Remember that HTTP dates must be in Greenwich Mean Time (GMT), not local time. Example:
Expires: Fri, 30 Oct 1998 14:19:41 GMT
Because of this, make sure your web server's clock is set correctly if you use the Expires header. One way to do so is with the Network Time Protocol (NTP); your system administrator can tell you more.
Although the Expires header is very useful, it has some limitations. First, because it involves a date, the clocks of the Web server and the cache must be synchronized. If they are not, cached content either expires too early or is not refreshed when it should be.
Another problem is easy to overlook: when you set a fixed expiration time and do not move it forward before it passes, every request after that time goes back to the source Web server, which increases load and response time.
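As an illustration of the date format, the following Python sketch builds an Expires header line with the standard library. The function name `expires_header` is hypothetical, and the sketch assumes it is given a timezone-aware `datetime`; it is one possible way to produce the header, not a required one.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(now: datetime, lifetime: timedelta) -> str:
    """Return an Expires header line for a copy that stays fresh for `lifetime`.

    `now` is assumed to be timezone-aware; it is converted to UTC because
    HTTP dates must be expressed in GMT, never in local time.
    """
    expiry = now.astimezone(timezone.utc) + lifetime
    # usegmt=True renders the zone as "GMT" instead of "+0000".
    return "Expires: " + format_datetime(expiry, usegmt=True)
```

In a live server the first argument would typically be `datetime.now(timezone.utc)`, which is only as accurate as the server's clock.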
HTTP 1.1 introduces another set of headers, the Cache-Control response headers, which give website publishers more comprehensive control over their content and address the limitations of Expires.
Useful cache-control response header information includes:
Example:
Cache-Control: max-age=3600, must-revalidate
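To make the structure of the header value in the example above concrete, here is a minimal Python sketch that splits it into directives. `parse_cache_control` is a hypothetical helper, and it is deliberately simplified: it splits on commas, so it would mishandle a quoted argument that itself contains a comma.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-Control value into {directive: argument or None}.

    Simplification: a quoted argument containing a comma is not handled.
    """
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        # Directive names are case-insensitive, so normalize to lowercase.
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives
```

For the example header, the result maps `max-age` to the string `3600` and `must-revalidate` to `None`, since that directive takes no argument.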
If you plan to use Cache-Control headers, take a look at the HTTP 1.1 specification; see References and Further Reading.
Validators and Validation
As described in the section on how the cache works, validation is how servers and caches communicate about whether a copy has changed. With it, a cache that holds a local copy but is not sure the copy is still fresh can avoid downloading the entire original again.
Validators, the values a cache uses to perform this check, are very important. If a response has no validator and no freshness information (Expires or Cache-Control), caches will not store it at all.
The most common validator is the time the document was last modified, conveyed in the Last-Modified header. When a cached copy has a Last-Modified value, the cache can send an If-Modified-Since request header to ask the server whether the copy has been modified since that time.
HTTP 1.1 introduces another validator, the ETag. An ETag is a unique identifier that the server generates and that changes every time the copy changes. Because the server controls how the ETag is generated, a cache that sends an If-None-Match request and gets a match can be sure its copy is exactly the same as the original.
Almost all caches use Last-Modified times to determine whether a copy is fresh, and ETag validation is becoming increasingly common.
Modern Web servers automatically generate ETag and Last-Modified headers for static content (such as files) without any configuration. However, a server cannot know how to generate them for dynamic content (for example, CGI, ASP, or database-driven sites). For more information, see the section on cache-friendly scripts.
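The exchange can be sketched from the server's side as follows. This is an illustration under simplifying assumptions, not how any particular server works: `make_etag` and `respond` are hypothetical names, the ETag is derived from a hash of the body (servers are free to generate it any way they like), only a single ETag in If-None-Match is handled, and If-Modified-Since is compared as an exact string rather than as a date.

```python
import hashlib
from typing import Dict, Tuple


def make_etag(body: bytes) -> str:
    # Assumption for this sketch: the identifier is a truncated content hash,
    # so it changes whenever the content changes.
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(request_headers: Dict[str, str], body: bytes,
            last_modified: str) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a possibly conditional request."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Last-Modified": last_modified}
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When an ETag is supplied, it takes precedence over the date check.
        unchanged = if_none_match == etag
    else:
        unchanged = request_headers.get("If-Modified-Since") == last_modified
    if unchanged:
        return 304, headers, b""    # Not Modified: the cache reuses its copy
    return 200, headers, body       # changed or unconditional: send everything
```

A 304 response carries no body, which is where the bandwidth saving comes from: the cache keeps serving the copy it already holds.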
Beyond freshness information and validation, there are many other ways to make your website cache-friendly.
By default, scripts return neither validators (Last-Modified or ETag headers) nor freshness information (Expires or Cache-Control). Some scripts really are dynamic (they return different content on every request), but many more (search engines, database-driven sites) can still benefit from being cache-friendly.
Generally, if a script produces output that can be reproduced by the same request minutes or days later, that output is cacheable. If the output depends only on the URL, it is cacheable. If it varies with a cookie, authentication information, or other external conditions, it is probably not cacheable.
For specifics, see the implementation sections.
What are the most important things to make cacheable?
A good strategy is to identify the most popular and largest copies (especially images) and set up caching for them first.
How can we get the fastest response of pages through cache?
The most cacheable copy is one that stays fresh for a long time. Validation does speed up responses, but it still requires contacting the source server to check whether the content is fresh. If the cache already knows the content is fresh, it can return it directly.
I understand that cache is good, but I have to count how many people have accessed my website!
If you need to know every time a page is accessed, select one small element on the page (or the page itself) and make it uncacheable with suitable headers. For example, you can place a 1x1 pixel transparent image on each page; the Referer header of each request for the image identifies the page that contained it.
Be aware that even this will not give you truly accurate statistics about your users, and it is unfriendly to the Internet and to your users: it consumes extra bandwidth and forces users to fetch content that cannot be cached. For more information, see the references on access statistics.
Many browsers let you see the Expires and Last-Modified headers in the page properties or a similar interface. Where available, this is a page information menu that lists the page and its associated files along with their details.
To see the complete headers, you can use Telnet to connect to the web server manually.
To do so, you may need to enter the port (80 by default) in a separate field, or connect to www.example.com:80 or www.example.com 80 (note the space). See your Telnet client's documentation for details.
Once the connection is open, type the request for the copy you want. For example, to see the headers for http://www.example.com/foo.html, connect to port 80 of www.example.com and type:
Press the Enter key wherever [Press enter] appears, and press it twice at the end. The headers and the complete page are then output. To see only the headers, replace GET with HEAD.
My pages are password-protected. How does the Proxy Cache Server handle them?
By default, pages protected with HTTP authentication are considered private and are not kept by shared caches. However, you can set Cache-Control: public to make authenticated pages cacheable; caches that comply with HTTP 1.1 will then treat them as cacheable.
If you want such pages to be cacheable but still require every user to authenticate, combine the Cache-Control: public and no-cache headers. The cache must then submit each new client's authentication information to the source server before serving a copy. The setting is as follows:
Cache-Control: public, no-cache
Either way, it is best to keep the use of authentication to a minimum. For example, if your images are not confidential, put them in a separate directory and configure the server not to require authentication for it. The images will then be cacheable by default.
Do we have to worry about security if users access my site through a cache?
SSL pages are not cached by proxy caches (and caching them is not recommended), so you do not have to worry about them. However, because caches store non-SSL requests and the URLs fetched through them, be aware that an unsecured site may be exposed to an unscrupulous administrator, especially through its URLs.
In fact, any administrator on the network between your server and your clients can collect this kind of information. CGI scripts that pass user names and passwords in the URL are a particular problem, because they can leak those user names and passwords.
If you understand the basics of Internet security, cache servers should hold no surprises for you.
That is hard to say. Generally, the more complex the system, the harder it is to cache. The worst systems generate all content dynamically and provide no validators, so none of their content can be cached. Contact your vendor's technical staff and refer to the implementation notes below.
My image expires after 1 month, but I need to update it now.
An expiration time cannot be overridden. Unless the cache (browser or proxy server) runs out of space and deletes the copy, the cached copy will be used until it expires.
The best approach is to change the links to them, so that completely new copies are downloaded from the source server. Remember that the pages referring to them are cached as well. For this reason, it is best to make static images and similar content very cacheable while keeping the HTML pages that refer to them fresh with short expiration times.
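One common way to change a link whenever the content changes is to embed a fingerprint of the content in the file name. The Python sketch below illustrates that idea; it is a technique chosen here for illustration rather than something the article prescribes, and `versioned_url` is a hypothetical helper.

```python
import hashlib
import posixpath


def versioned_url(path: str, content: bytes) -> str:
    """Insert a short content hash before the file extension of a URL path.

    New content yields a new URL, so caches fetch a new copy, while the old
    URL can keep its long expiration time.
    """
    digest = hashlib.sha256(content).hexdigest()[:8]
    stem, ext = posixpath.splitext(path)
    return f"{stem}.{digest}{ext}"
```

For example, `versioned_url("/images/logo.gif", image_bytes)` returns a path of the form `/images/logo.<hash>.gif`. The HTML pages that reference the image must then be regenerated with the new link, which is why those pages need short expiration times.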
If you want to reload a copy from a specific cache, you can force a reload (in Firefox, holding down the Shift key while reloading sends the Pragma: no-cache request header mentioned above), or you can ask the cache administrator to delete the copy through their interface.
If you use Apache, consider allowing your users to use .htaccess files and providing the relevant documentation.
Alternatively, you can establish predefined caching policies in different areas of each virtual host. For example, you could designate a /cache-1m directory whose content is cached for one month after access, and a /no-cache directory whose content is served with headers that tell caches not to store it.
Either way, you should make content cacheable for sites with many users. For large websites, the savings in bandwidth and server load are significant.
Caches are not required to keep a copy and reuse it; they are only required not to keep or use copies under certain conditions. All caches decide which copies to keep based on their size, their type (for example, image versus page), and the space they have left. Your page may simply not be considered worth keeping compared with more popular or larger files.
Some caches do let administrators prioritize which copies to keep by file type, or keep some copies permanently or for a long time.
Generally, it is best to use the latest version of your web server program. Newer versions not only include more caching features, they also tend to improve performance and security.
Apache HTTP Server
Apache uses optional modules to set headers, including Expires and Cache-Control. These modules are available in version 1.2 and later.
The modules must be built into Apache. Although they are included in the distribution, they are not enabled by default. To find out whether they are enabled, locate the httpd program and run httpd -l, which lists the available modules. The modules we need are mod_expires and mod_headers.
Once the modules are enabled, you can use mod_expires in .htaccess files or in the server's access.conf file to specify when copies expire. You can set the expiration time relative to either the access time or the file modification time, and apply it to a particular file type or as a default. See the module documentation for more information, or ask an Apache expert near you if you run into problems.
To apply Cache-Control headers, use mod_headers, which lets you set arbitrary HTTP headers. See the mod_headers documentation for more information.
Here is an example of how to use the header information:
Configuration for Apache 2.0 is similar to that for Apache 1.3. For more information, see the mod_expires and mod_headers documentation.
Microsoft IIS server
Microsoft IIS makes it easy to set headers. Note that this applies only to IIS 4.0, which runs only on NT Server.
To set headers for an area of a site, select it in the Administration Tools interface and open its properties. Select the HTTP Headers tab and you will see two interesting areas: Enable Content Expiration and Custom HTTP Headers. The first is self-explanatory, and the second can be used to apply Cache-Control headers.
To set headers in ASP pages, see the ASP section; headers can also be set from ISAPI modules. For details, see MSDN.
As of version 3.6, Netscape/iPlanet Enterprise Server provides no obvious way to set Expires headers. However, it has supported HTTP 1.1 features since version 3.0, which means that HTTP 1.1 caches (proxy servers and browsers) can take advantage of its Cache-Control settings.
To use Cache-Control headers, select Content Management | cache setting directory in the administration server. Then use the resource selector to choose the directory where you want to set the headers. After setting the headers, click "OK". For more information, see the Netscape/iPlanet Enterprise Server manual.
Keep in mind that it may be easier to set HTTP headers with your Web server than in a scripting language, but you should be able to use both.
Because server-side scripting is mostly used for dynamic content, it does not produce cacheable pages by default, even when the content could in fact be cached. If your content changes often but not on every request, consider setting a Cache-Control: max-age header. Most users revisit the same pages within a short period. For example, when a user clicks the "back" button and the page carries no validator or freshness information, it has to be downloaded from the server again even though nothing has changed.
CGI scripts are one of the most popular ways to generate content. You can easily add HTTP headers before the content is sent. Most CGI implementations already require you to write the Content-Type header, as in this Perl script:
#!/usr/bin/perl
Because the headers are plain text, you can easily generate Expires and other date-based headers with built-in functions. It is even simpler with Cache-Control: max-age:
print "Cache-Control: max-age=600\n";
This makes the script's output cacheable for 10 minutes after the request. If the user presses the "back" button within that time, the request is not sent again.
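For comparison, the same idea might look like this in a Python CGI script. This is a sketch, not a translation of the Perl script: the function name `cgi_response` and the HTML body are illustrative assumptions, and only the two headers mentioned in the text are emitted.

```python
def cgi_response(body: str, max_age: int = 600) -> str:
    """Build the text a CGI script writes to standard output.

    The headers come first, then an empty line, then the content.
    """
    headers = [
        "Content-Type: text/html",
        f"Cache-Control: max-age={max_age}",  # fresh for max_age seconds (600 = 10 minutes)
    ]
    return "\n".join(headers) + "\n\n" + body
```

A script would write the result to standard output, for example with `sys.stdout.write(cgi_response("<html><body>Hello</body></html>\n"))`.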
The CGI specification also makes the request headers sent by the client available in the script's environment, each prefixed with 'HTTP_'. So if a client sends an If-Modified-Since request, it appears as HTTP_IF_MODIFIED_SINCE.
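A script can use that environment variable to answer validation requests itself. The Python sketch below shows one way to do it under stated assumptions: `conditional_cgi_response` is a hypothetical helper, the caller supplies the content's modification time as a timezone-aware `datetime`, and the environment is passed in as a mapping (in a real script this would be `os.environ`).

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Mapping


def conditional_cgi_response(environ: Mapping[str, str],
                             last_modified: datetime, body: str) -> str:
    """Return CGI output: a bare 304 if the client's copy is current, else the page."""
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_modified = last_modified.astimezone(timezone.utc).replace(microsecond=0)
    since = environ.get("HTTP_IF_MODIFIED_SINCE")
    if since:
        try:
            if last_modified <= parsedate_to_datetime(since):
                # In CGI, the Status header sets the response status.
                return "Status: 304 Not Modified\n\n"
        except (TypeError, ValueError):
            pass  # unparseable or zone-less date: fall through and send the page
    return ("Content-Type: text/html\n"
            "Last-Modified: " + format_datetime(last_modified, usegmt=True)
            + "\n\n" + body)
```

Sending Last-Modified with the full response is what gives caches a validator to use on later requests.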
See also the cgi_buffer library, which automatically handles ETag generation and validation, generates the Content-Length header, and gzip-compresses the content. It takes only one added line in a Python script.
Server Side Includes
SSI (usually identified by the .shtml extension) is one of the earliest ways for website publishers to put dynamic content into pages. It is a form of scripting embedded in HTML through special tags in the page.
In most SSI implementations, validators are not set, so the pages are not cacheable. However, Apache lets users specify which SSI files can be cached by setting the group execute permission on the appropriate files, combined with the XBitHack full directive for the directory. For more information, see the mod_include documentation.