Original: https://www.yangqiu.cn/sinobeauty/1616327.html
I. Overview of Caching
Caching is an important component of distributed systems. It mainly addresses the performance of hotspot data access under high concurrency and large data volumes, providing fast, high-performance access to data.
1. Caching Principles
Use storage (devices) that read and write data faster;
Cache data at the location closest to the application;
Cache data at the location closest to the user.
2. Cache classification
Caching is used very widely in distributed systems. From a deployment perspective, it falls into the following categories.
CDN Cache;
Reverse proxy caching;
Distributed cache;
Local application cache.
3. Cache Media
Common middleware: Varnish, Nginx, Squid, Memcached, Redis, Ehcache, etc.;
Cached content: files, data, objects;
Cache media: CPU, memory (local, distributed), disk (local, distributed).
4. Cache Design
Caching design needs to address the following issues:
What to cache?
Which data needs caching: 1. hotspot data; 2. static resources.
Where to cache?
CDN, reverse proxy, distributed cache server, local (memory, hard disk).
How to cache?
Expiration policy (a minimal sketch of both policies follows this list):
Fixed time: for example, entries are cached for a specified 30 minutes;
Relative time: for example, evict data not accessed in the last 10 minutes.
Synchronization mechanism:
Real-time write (push);
Asynchronous refresh (push-pull).
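As a concrete illustration of the two expiration policies, here is a minimal in-memory sketch; the class and its default TTLs are hypothetical, chosen to match the 30-minute and 10-minute examples above:

```python
import time

class TTLCache:
    """Minimal in-memory cache illustrating the two expiration policies."""

    def __init__(self, fixed_ttl=1800, sliding_ttl=600):
        self.fixed_ttl = fixed_ttl      # fixed time: 30 minutes after write
        self.sliding_ttl = sliding_ttl  # relative time: 10 minutes since last access
        self.store = {}                 # key -> (value, written_at, last_access)

    def set(self, key, value):
        now = time.time()
        self.store[key] = (value, now, now)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, written_at, last_access = entry
        now = time.time()
        # Fixed time: expire a set interval after the value was cached.
        if now - written_at > self.fixed_ttl:
            del self.store[key]
            return None
        # Relative time: expire if the entry was not accessed recently.
        if now - last_access > self.sliding_ttl:
            del self.store[key]
            return None
        self.store[key] = (value, written_at, now)  # refresh last access
        return value
```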
II. CDN Cache
A CDN caches data at the location closest to the user, generally static resource files (pages, scripts, images, videos, files, etc.). Domestic networks in China are unusually complex, and cross-carrier network access is slow. To solve cross-carrier and regional access problems, CDN nodes can be deployed in major cities so that users fetch the content they need from nearby, reducing network congestion and improving response speed and hit rate.
1. CDN Principles
The basic principle of a CDN is to deploy many cache servers and distribute them to the regions or networks where user visits are relatively concentrated. When a user visits the website, global load balancing directs the user's access to the nearest working cache server, and that cache server responds to the request directly.
(1) Before deploying a CDN
Network request path:
Request: local network (LAN) → carrier network → application server room
Response: application server room → carrier network → local network (LAN)
Ignoring network complexity, a round trip from request to response traverses 3 nodes and takes 6 steps to complete one user access.
(2) After deploying a CDN
Network path:
Request: local network (LAN) → carrier network
Response: carrier network → local network (LAN)
Ignoring network complexity, a round trip from request to response traverses 2 nodes and takes 2 steps to complete one user access.
Compared with no CDN, this removes 1 node and 4 steps, greatly improving the system's response speed.
2. CDN Advantages and Disadvantages
Advantages (excerpt from Baidu Encyclopedia):
Local cache acceleration: improves access speed, especially for sites with many images and static pages.
Mirroring service: removes the bottleneck of interconnection between carriers, accelerating cross-carrier access and ensuring good access quality for users on different networks.
Remote acceleration: DNS load balancing automatically selects the fastest cache server for remote users, speeding up remote access.
Bandwidth optimization: automatically generated remote mirror cache servers serve remote users' reads, reducing remote-access bandwidth, sharing network traffic, and lightening the load on the origin site's web servers.
Cluster attack resistance: widely distributed CDN nodes plus intelligent inter-node redundancy can effectively prevent hacker intrusion and reduce the impact of DDoS attacks on the website, while maintaining good quality of service.
Disadvantages:
Caching dynamic resources raises freshness concerns;
Solution: cache mainly static resources, and build multi-level caches or near-real-time synchronization for dynamic resources.
Consistency and timeliness of data must be traded off;
Solutions:
(1) set a cache expiration time (e.g., 1 hour, eventual consistency);
(2) use data version numbers.
3. CDN Architecture Reference
Excerpt from "Yunze Video CDN System"
4. CDN in Practice
At present, small and medium-sized internet companies, for overall cost reasons, generally rent third-party CDN services, while large internet companies build their own CDNs or combine self-built with third-party services. For example, Taobao started with a third party; when traffic grew so large that the third-party provider could no longer support it, Taobao finally built its own CDN.
Taobao's CDN architecture is shown below (image from the web):
III. Reverse Proxy Cache
A reverse proxy is a proxy server deployed in the web server room that provides load balancing, data caching, security control, and other functions.
1. Caching Principles
The reverse proxy sits in the application server room and handles all requests to the web server. If the page a user requests is cached on the proxy server, the proxy sends the cached content directly to the user. If it is not cached, the proxy first requests the page from the web server, caches it locally, and then sends it to the user. Reducing the number of requests to the web server lowers its load.
A reverse proxy generally caches static resources; dynamic resources are forwarded to the application server for processing. Common caching servers are Varnish, Nginx, and Squid.
2. Squid Example
A Squid reverse proxy typically caches only static resources; dynamic pages are not cached by default. It caches static pages based on the HTTP headers returned by the web server. The four most important HTTP headers are:
Last-Modified: tells the reverse proxy when the page was last modified;
Expires: tells the reverse proxy when the page should be evicted from the cache;
Cache-Control: tells the reverse proxy whether the page should be cached;
Pragma: carries implementation-specific directives, most commonly Pragma: no-cache.
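As a rough illustration of how a cache can decide freshness from these headers, here is a simplified sketch; it is not Squid's actual logic, just the general rules the headers encode:

```python
from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def is_fresh(headers, fetched_at):
    """Simplified freshness check using the headers above.

    headers: dict of response headers stored with the cached copy;
    fetched_at: timezone-aware datetime of when the copy was cached.
    """
    cache_control = headers.get("Cache-Control", "").lower()
    pragma = headers.get("Pragma", "").lower()
    # Pragma: no-cache / Cache-Control: no-cache forbid serving from cache.
    if "no-cache" in cache_control or "no-store" in cache_control \
            or "no-cache" in pragma:
        return False
    now = datetime.now(timezone.utc)
    # Cache-Control: max-age takes precedence over Expires.
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            max_age = int(directive.split("=", 1)[1])
            return (now - fetched_at).total_seconds() < max_age
    expires = headers.get("Expires")
    if expires:
        return now < parsedate_to_datetime(expires)
    return False  # no explicit lifetime: revalidate with the origin
```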
Squid reverse proxy site acceleration example:
DNS round-robin distributes a client's request to one of the Squid reverse proxy servers;
If that Squid has cached the requested resource, it returns the resource directly to the user;
Otherwise, Squid forwards the uncached request, according to the configured rules, to a neighboring Squid or to the backend web server for processing;
This reduces the load on backend web servers and improves the performance and security of the whole website.
3. Proxy Cache Comparison
Commonly used proxy caches are Varnish, Squid, and Nginx. A brief comparison:
(1) Varnish and Squid are dedicated cache services; Nginx needs third-party module support;
(2) Varnish caches in memory, avoiding frequent swapping of files between memory and disk, so its performance is higher than Squid's;
(3) Because Varnish is a memory cache, it handles small files such as CSS, JS, and small images very well; Squid or ATS can serve as the backend persistent cache layer;
(4) Squid is full-featured and heavyweight, suitable for caching all kinds of static files; it usually runs as multiple instances behind an HAProxy or Nginx load balancer;
(5) Nginx caches through the third-party module ncache, with performance roughly reaching Varnish's; it is generally used as a reverse proxy and can implement simple caching.
IV. Distributed Cache
CDN and reverse proxy caches mainly handle static files or user-requested resources; the data source is generally a static file or a dynamically generated file (with cache headers).
A distributed cache, by contrast, caches data that users access frequently, with the database as the data source. It generally serves hotspot data access and relieves database pressure.
Distributed cache design is now an essential element of large website architectures. Common middleware includes Memcached and Redis.
1. Memcached
Memcached is a high-performance distributed in-memory object caching system. By maintaining one huge hash table in memory, it can store data of various formats, including images, videos, files, and database query results. Simply put, data is loaded into memory and then read from memory, greatly increasing read speed.
Memcached features:
(1) Uses physical memory as the cache and can run standalone on a server. Each process can use at most 2 GB; to cache more data, start more memcached processes (on different ports) or cache distributedly across physical or virtual machines.
(2) Stores data as key-value pairs. This single-index, structured data organization gives O(1) query time for data items.
(3) Simple protocol: a line-based text protocol. You can manipulate data on a memcached server directly via telnet, which is simple and makes it easy for other caches to adopt the protocol.
(4) High-performance communication based on libevent: libevent is a C library that wraps event-handling facilities such as BSD's kqueue and Linux's epoll behind one interface, improving performance over traditional select.
(5) Built-in memory management: all data lives in memory, so access is faster than disk. When memory fills up, unused entries are automatically evicted via LRU. There is no durability: restarting the service loses all data.
(6) Distributed: memcached servers do not communicate with each other; each accesses its own data independently and shares nothing. The server has no distributed functionality; distributed deployment depends on the memcached client.
(7) Caching policy: memcached's eviction policy is LRU (least recently used) plus expiration. When storing an item you can specify its expiration time in the cache, which defaults to permanent. When a memcached server exhausts its allocated memory, expired data is replaced first, then the least recently used data. Memcached uses lazy expiration: it does not monitor whether stored key/values have expired, but checks the record's timestamp when the key is fetched, which reduces server load.
How Memcached works:
The memcached workflow is as follows (a minimal cache-aside sketch follows this list):
(1) First check whether the client's requested data is in memcached. If it is, return it directly without touching the database.
(2) If the requested data is not in memcached, query the database, return the result to the client, and cache a copy in memcached (the memcached server is not responsible for this; the client program must implement it).
(3) Whenever the database is updated, update the data in memcached as well to keep it consistent.
(4) When memcached's allocated memory is used up, it applies the LRU (least recently used) policy plus expiration: expired data is replaced first, then the least recently used data.
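This read/update flow is the classic cache-aside pattern. Below is a minimal, hypothetical sketch using the pymemcache client; the db.query_user/db.update_user helpers and the TTL are illustrative assumptions, not part of the original text:

```python
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))
CACHE_TTL = 300  # seconds; an illustrative choice

def get_user(user_id, db):
    """Steps (1)-(2): check memcached first, fall back to the database on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached.decode()                 # hit: the database is never touched
    row = db.query_user(user_id)               # hypothetical database helper
    if row is not None:
        cache.set(key, row, expire=CACHE_TTL)  # backfill: done by the client program
    return row

def update_user(user_id, new_value, db):
    """Step (3): update the database and refresh the cache to keep them consistent."""
    db.update_user(user_id, new_value)         # hypothetical database helper
    cache.set(f"user:{user_id}", new_value, expire=CACHE_TTL)
```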
Memcached cluster
Memcached is called a "distributed" cache server, yet there is no "distributed" functionality on the server side. Each server is a completely independent, isolated service. Memcached's distribution is implemented by the client program.
When storing or fetching a key against a memcached cluster, the memcached client program uses an algorithm to compute which server the key lives on, then saves the key to or reads it from that server.
Accessing data takes two steps: first select the server, then access the data.
Distributed algorithms (server selection):
There are two ways to select the server: computing the distribution by remainder, or computing it with a consistent hashing algorithm.
Remainder algorithm:
Compute an integer hash of the key, divide it by the number of servers, and pick the server by the remainder.
Advantages: simple computation, high efficiency.
Disadvantage: when memcached servers are added or removed, almost all cached keys miss. (See the sketch below.)
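A tiny sketch of remainder-based selection, and of why resizing breaks it; the server names and the CRC32 hash are illustrative choices:

```python
import zlib

def pick_server(key, servers):
    # Integer hash of the key; the remainder selects the server.
    return servers[zlib.crc32(key.encode()) % len(servers)]

servers = ["cache1", "cache2", "cache3"]
print(pick_server("user:42", servers))
# Adding a server changes len(servers), so almost every key now maps
# to a different server and the existing cache entries all miss.
print(pick_server("user:42", servers + ["cache4"]))
```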
Hash algorithm (consistent hashing):
First compute the hash of each memcached server and place it on a circle of 0 to 2^32. Then hash the key of the data to store with the same method, map it onto the circle, and search clockwise from the data's position, saving the data to the first server found. If no server is found after passing 2^32, the data is saved to the first memcached server.
If a memcached server is added, only the keys located between the new server's position on the circle and the first server counter-clockwise from it are affected.
The consistent hash algorithm solves the remainder algorithm's problem that adding a node invalidates hits; in theory, inserting one physical node affects hits on roughly (number of virtual nodes)/2 of the node data. A sketch follows.
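A minimal consistent hashing sketch with virtual nodes, assuming MD5 as the hash and illustrative server names; it mirrors the circle-and-clockwise-search description above:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing sketch: servers and keys share one 0..2^32 circle."""

    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (point, server)
        for server in servers:
            for i in range(vnodes):  # virtual nodes smooth the distribution
                point = self._hash(f"{server}#{i}")
                bisect.insort(self.ring, (point, server))

    @staticmethod
    def _hash(value):
        # Map onto the 0 .. 2^32 - 1 circle.
        return int(hashlib.md5(value.encode()).hexdigest(), 16) % (1 << 32)

    def pick_server(self, key):
        point = self._hash(key)
        # Walk clockwise to the first virtual node at or after the key ...
        idx = bisect.bisect_left(self.ring, (point, ""))
        if idx == len(self.ring):  # ... wrapping past 2^32 to the first node
            idx = 0
        return self.ring[idx][1]

ring = ConsistentHashRing(["cache1", "cache2", "cache3"])
print(ring.pick_server("user:42"))  # only nearby keys move when a server is added
```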
2. Redis
Redis is an open-source (BSD-licensed), in-memory, multi-data-structure store. It can serve as a database, a cache, and a message broker. It supports many data structures, such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, HyperLogLogs, and geospatial indexes with radius queries.
It has built-in replication, Lua scripting, LRU eviction, transactions, and different levels of on-disk persistence, and it provides high availability through Redis Sentinel and automatic partitioning with Redis Cluster.
Common Redis data types
String
Common commands: SET, GET, DECR, INCR, MGET.
Scenario: string is the most commonly used type, similar to Memcached's key-value storage.
Implementation: Redis stores a string value as a plain string by default, referenced by a redisObject; when operations such as INCR and DECR are applied, the value is converted to a number for the computation, and the redisObject's encoding field becomes int.
Hash
Common commands: HGET, HSET, HGETALL.
Application scenario: storing a user-information object, for example:
Implementation: the value of a Redis hash is internally a hashmap, with two different implementations.
(1) When the hash has few members, Redis compacts storage in a one-dimensional-array-like layout to save memory instead of using a real hashmap structure, and the value's redisObject encoding is zipmap.
(2) When the number of members grows, it automatically converts to a real hashmap, and the encoding becomes ht.
List
Common commands: LPUSH, RPUSH, LPOP, RPOP, LRANGE.
Application scenario: the Redis list has many uses and is one of the most important Redis data structures; for example, Twitter's following list and follower list can be implemented with the list structure.
Implementation: the Redis list is implemented as a doubly linked list, supporting reverse lookup and traversal, which makes operations convenient at some extra memory cost. Many internal Redis structures, including send-buffer queues, also use this data structure.
Set
Common commands: SADD, SPOP, SMEMBERS, SUNION.
Application scenario: a Redis set provides functionality similar to a list, except that a set deduplicates automatically. When you need to store a list of data without duplicates, a set is a good choice. A set also provides an important interface for testing whether a member belongs to the set, which a list does not offer.
Implementation: a set is internally a hashmap whose values are always null; deduplication is done quickly by hashing, which is also why a set can test membership.
Sorted Set
Common commands: ZADD, ZRANGE, ZREM, ZCARD.
Usage scenario: a Redis sorted set is like a set, except that a set is not automatically ordered, while a sorted set sorts its members by a user-supplied priority parameter (the score), keeping them ordered on insertion. Choose a sorted set when you need an ordered, duplicate-free collection, for example Twitter's public timeline stored with publication time as the score, automatically sorted by time.
Implementation: a sorted set internally uses a hashmap and a skip list to keep the data stored and ordered: the hashmap maps members to scores, while the skip list stores all members sorted by those scores. This structure achieves high search efficiency and is relatively simple to implement.
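To make the commands above concrete, here is a brief tour of the five types using the redis-py client; the connection details and keys are illustrative assumptions:

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# String: key-value storage; INCR/DECR turn a string into a counter.
r.set("page:views", 0)
r.incr("page:views")

# Hash: one user-information object stored field by field.
r.hset("user:1", mapping={"name": "alice", "age": "30"})
print(r.hgetall("user:1"))

# List: doubly linked list, e.g. a following/follower timeline.
r.lpush("timeline:1", "post-3", "post-2", "post-1")
print(r.lrange("timeline:1", 0, -1))

# Set: automatic deduplication plus a membership test.
r.sadd("tags:1", "redis", "cache", "redis")  # duplicate is ignored
print(r.sismember("tags:1", "cache"))

# Sorted set: scored by publication time, kept ordered automatically.
r.zadd("public:timeline", {"post-1": 1700000000, "post-2": 1700000100})
print(r.zrange("public:timeline", 0, -1))
```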
Redis Cluster
(1) High availability via keepalived
Failover process:
When the master goes down, the VIP drifts to the slave; keepalived on the slave tells Redis to execute slaveof no one, and the slave starts serving traffic;
When the old master comes back up, the VIP stays where it is; the old master's keepalived tells Redis to execute slaveof <slave IP> <port>, and it begins synchronizing data as a slave;
And so on.
When master and slave go down at the same time:
Unplanned outages are not considered; in practice this problem rarely arises.
For a planned restart, operations should dump the master's data before rebooting. Note the order:
Shut down all Redis instances on one machine so that all masters are cut over to the other machine (multi-instance deployment, with masters and slaves on a single machine), then shut that machine down;
Dump the master Redis data;
Shut down the master;
Start the master and wait for data loading to complete;
Start the slaves;
Delete the dump file (to avoid slow loading on restart).
(2) Cluster scheme using Twemproxy
Twemproxy is an open-source proxy in C from Twitter that supports both memcached and Redis; the latest version is 0.2.4 and development is ongoing: https://github.com/twitter/twemproxy. Twitter uses it mainly to reduce the number of network connections between front-ends and cache services.
Features: fast, lightweight, reduces the number of connections to backend cache servers, easy to configure, supports ketama, modula, random, and other common hash sharding algorithms.
Here keepalived provides a highly available active-standby pair to remove the proxy's single point of failure.
Advantages:
To the client, the Redis cluster is transparent, the client stays simple, and dynamic scaling is convenient;
When the proxy is a single point handling the consistent hashing, cluster node availability detection avoids split-brain problems;
It is high-performance and CPU-bound, while Redis node clusters have spare CPU capacity, so the proxy can be deployed on the Redis node cluster without extra hardware.
3. Memcached vs. Redis
(1) Data structures: Memcached supports only key-value storage; Redis supports more data types, such as key-value, hash, list, set, and zset;
(2) Threading: Memcached is multi-threaded while Redis is single-threaded, so Memcached utilizes CPUs better than Redis;
(3) Persistence: Memcached does not support persistence; Redis does;
(4) Memory utilization: Memcached's is higher, Redis's lower (though with compression Redis can exceed Memcached);
(5) Expiration policy: after a key expires, Memcached does not delete the entry proactively, which can cause problems the next time the data is fetched; Redis has dedicated threads that clear expired cache data.
V. Local Cache
Local caching means caching inside the application. A typical distributed system is generally composed of multiple cache levels, and the local cache is the level closest to the application: it can cache data on disk or in memory.
1. Hard Disk Cache
Data is cached on the local hard disk and read from it on access. The principle is to read local files directly, cutting network transfer costs: it is faster than reading the database over the network. It suits scenarios with modest speed requirements but large cache storage needs.
2. Memory Cache
Storing data directly in local memory and maintaining the cached objects in the program is the fastest way to access data. A minimal sketch follows.
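For an in-process memory cache in Python, the standard library's functools.lru_cache is a minimal option; the load_dictionary function below is a hypothetical stand-in for an expensive lookup:

```python
from functools import lru_cache

@lru_cache(maxsize=1024)  # evicts least recently used entries when full
def load_dictionary(name):
    # Hypothetical expensive lookup (database or file read);
    # after the first call the result is served from process memory.
    print(f"loading {name} from the slow source ...")
    return {"name": name}

load_dictionary("countries")  # slow path, prints
load_dictionary("countries")  # served from the local memory cache
```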
VI. Caching Architecture Example
Division of responsibilities:
CDN: stores HTML, CSS, JS, and other static resources;
Reverse proxy: separates static from dynamic content, caching only the static resources users request;
Distributed cache: caches hotspot data from the database;
Local cache: caches frequently used data such as application dictionaries.
Request flow (a condensed sketch of steps (5)-(7) follows this list):
(1) The browser initiates a request; if the CDN has a cached copy, it is returned directly;
(2) If the CDN has no cache, the request goes to the reverse proxy server;
(3) If the reverse proxy server has a cached copy, it is returned directly;
(4) If the reverse proxy has no cache, or the request is dynamic, it goes to the application server;
(5) The application server checks the local cache; if the data is there, it is returned to the proxy server, which caches it (dynamic requests are not cached);
(6) If the local cache has no data, the application server reads the distributed cache, returns the data, and caches part of it locally;
(7) If the distributed cache has no data either, the application reads from the database and puts the data into the distributed cache.
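The seven steps amount to a chain of lookups with backfill at each level. Below is a condensed, hypothetical sketch of steps (5) to (7) inside the application server; all cache and database helpers are assumed interfaces:

```python
def read_data(key, local_cache, distributed_cache, db):
    """Steps (5)-(7): local cache -> distributed cache -> database, with backfill."""
    value = local_cache.get(key)        # (5) nearest, fastest level
    if value is not None:
        return value
    value = distributed_cache.get(key)  # (6) e.g. Redis or Memcached
    if value is not None:
        local_cache.set(key, value)     # backfill part of the data locally
        return value
    value = db.query(key)               # (7) last resort: the database
    if value is not None:
        distributed_cache.set(key, value)  # populate the distributed cache
    return value
```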
VII. Common Caching Problems
1. Data Consistency
A cache is a node that sits in front of persistent storage; it places hotspot data on media closer to the user or faster to access, speeding up data access and shortening response times.
Because a cache holds a copy of the persisted data, data inconsistency is unavoidable, causing dirty reads or missing reads. Inconsistency usually stems from network instability or node failures. Depending on the order of operations on the data, there are several main scenarios.
Scenarios
(1) Write the cache first, then the database
As shown below:
If the cache write succeeds but the database write fails or responds slowly, the next (concurrent) cache read is a dirty read.
(2) Write the database first, then the cache
As shown below:
If the database write succeeds but the cache write fails, the next (concurrent) cache read finds no data.
(3) Asynchronous cache refresh
Here the database operation and the cache write are not in the same operational step, for example in distributed scenarios where writing the cache at the same time is impossible, or where an asynchronous refresh (repair) is needed.
In this case, the main concern is the timeliness of data writes and cache refreshes, for example, how often to refresh the cache without hurting user access to the data.
Solutions
First scenario: writing the cache before the database is itself the wrong order; write the persistent medium first, then the cache.
Second scenario (a sketch of this write-then-read-repair flow follows the three scenarios):
(1) Roll back the database operation based on the cache write's response if it fails; this increases program complexity and is not recommended;
(2) When the cache is read and misses, read the database first, then write the value back to the cache.
Third scenario:
(1) First determine which data suits such a scenario;
(2) From experience, determine a reasonable data-inconsistency window and a user data refresh interval.
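A small sketch of the recommended order for the second scenario: write the database first, treat the cache write as best effort, and repair the cache on the next read miss (all helpers are assumed interfaces):

```python
class CacheError(Exception):
    """Placeholder for whatever error the cache client raises."""

def write(key, value, db, cache):
    db.update(key, value)        # persist first: the database is the source of truth
    try:
        cache.set(key, value)    # best effort; failure here only causes a later miss
    except CacheError:
        pass                     # tolerated: the next read repairs the cache

def read(key, db, cache):
    value = cache.get(key)
    if value is None:            # miss: earlier cache write failed or entry expired
        value = db.query(key)    # read the database first,
        cache.set(key, value)    # then write the value back to the cache
    return value
```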
Other methods
(1) Timeout: set a reasonable expiration time;
(2) Refresh: periodically refresh a range of data (by time or version number).
The above are simplified read/write scenarios; in practice the problem further divides into:
(1) consistency between the cache and the database;
(2) consistency between multiple cache levels;
(3) consistency between cache replicas.
2. Cache High Availability
The industry holds two views. The first: a cache is just a cache, temporary data storage, so it does not need high availability. The second: caches gradually evolve into important storage media and need to be made highly available.
My view: whether a cache needs high availability depends on the actual scenario. The critical point is whether a cache failure would impact the backend database.
The decision should be based on a comprehensive evaluation of cluster scale (data, cache), cost (servers, operations), and system performance (concurrency, throughput, response time).
Solutions
Cache high availability is typically achieved through distribution and replication: distribution handles massive volumes of cached data, and replication makes cache data nodes highly available. The architecture diagram is as follows: