Codis author Huang Dongxu: the design of a distributed Redis architecture and the pitfalls we hit

Source: Internet
Author: User
Tags: redis, cluster


The content of this share consists of five parts:

    • Redis, Redis Cluster and Codis;

    • We love consistency more;

    • Experience and pitfalls of using Codis in production;

    • Some views on distributed databases and distributed architecture;

    • Q & A session.


Codis is a distributed Redis solution. Unlike the official peer-to-peer design, Codis takes a proxy-based approach. Today I will introduce the design of Codis and of RebornDB, the next major version, along with some tips from using Codis in practice. Finally, I would like to share some of my views on distributed storage, and I welcome your comments and corrections.


I. Redis, Redis Cluster and Codis


Redis: Redis is by now an essential component in almost everyone's architecture. Its rich data structures, very high performance and simple protocol make it an excellent cache layer in front of the database. But a single Redis instance is a single point whose capacity is bounded by memory. When performance requirements are high, ideally we want all the data in memory and no requests hitting the database, so it is natural to look for other solutions — for example, trading memory for SSD to get more capacity, or, more naturally, turning Redis into a horizontally scalable distributed cache service. Before Codis, the only option in the industry was Twemproxy, but Twemproxy is a static sharding scheme: scaling out or in demands very heavy operations work, and smooth expansion is hard to achieve. Codis's goal is to stay as compatible with Twemproxy as possible while adding data migration, so that capacity can be expanded and shrunk online, eventually replacing Twemproxy. In the end that is exactly what happened at Wandoujia: Codis completely replaced Twemproxy there, running a cluster of about 2 TB of memory.



Redis Cluster: the official cluster solution, whose stable release came out around the same period as Codis. I think it has both pros and cons; as an architect I would not use it in a production environment, for two reasons:


    • The data storage module and the distributed logic module are coupled together. The benefit is that deployment is extremely simple, all-in-the-box, without as many concepts, components and dependencies as Codis. The downside is that it is hard to upgrade without pain for the business. For example, if one day Redis Cluster's distributed logic turns out to have a serious bug, how do you upgrade? There is no good way other than a rolling restart of the whole cluster, which is quite painful for operations.

    • The protocol has been changed significantly, which is unfriendly to clients. Many existing clients have become de facto standards with plenty of code written against them, so asking every business to switch Redis clients is not realistic. Moreover, it is hard to say which Redis Cluster client has been validated in a large-scale production environment. The Redis Cluster proxy open-sourced by Hunantv suggests the impact is still considerable — otherwise they would simply have applications use a cluster-aware client.


Codis: Unlike Redis Cluster, Codis uses a stateless proxy layer and puts the distributed logic in the proxy. The underlying storage engine is Redis itself (albeit with some patches on top of Redis 2.8.13), the data distribution state is stored in ZooKeeper (or etcd), and the underlying data store is a pluggable part. The benefits hardly need elaborating: every component can be scaled horizontally and dynamically. This is especially significant for dynamic load balancing across the stateless proxies, and it enables some interesting tricks — for example, if the data in certain slots is relatively cold, those slots can be assigned to a dedicated server group backed by persistent storage to save memory, and migrated back to an in-memory server group when the data becomes hot again, all transparently to the business. Interestingly, after Twitter abandoned Twemproxy internally, they developed a new distributed Redis solution that still takes the proxy-based route, though it has not been open-sourced. The pluggable storage engine is one of the things RebornDB, Codis's next-generation product, is working on. BTW, RebornDB and its persistence engine are fully open source; see https://github.com/reborndb/reborn and https://github.com/reborndb/qdb. Of course, the drawback of this design is the extra network hop through the proxy, which seemingly costs some performance. But remember: the proxies can be scaled out dynamically, so the QPS of the whole service is not bounded by the performance of a single proxy (which is why I recommend LVS/HAProxy or Jodis in production) — every proxy is interchangeable.





II. We love consistency


Many friends ask me why we do not support read/write splitting. The reason is simple: our business scenarios cannot tolerate inconsistent data. Redis's replication model is asynchronous master-slave replication — after a write succeeds on the master, there is no guarantee the data can be read on a slave, and pushing the consistency problem onto the business side is cumbersome. Besides, the performance of a single Redis instance is already quite high; unlike a real database such as MySQL, there is no need to squeeze out a little more read QPS at the cost of confusing the business — the roles are simply different. So, as you may have noticed, Codis's HA does not actually guarantee zero data loss: because replication is asynchronous, if the master dies before some data has been synchronized to the slave, then promoting that slave to master loses the writes that had not yet been replicated. In RebornDB, however, we will try to support synchronous replication for the persistent storage engine (QDB), which services with stronger requirements for data consistency and safety can use.



Speaking of consistency, this is also why Codis's MGET/MSET cannot guarantee the atomic semantics of a single instance: the keys involved in an MSET may be distributed across different machines. Guaranteeing the original semantics — all succeed together or all fail together — is a distributed transaction problem, and Redis has no notion of a WAL or rollback, so even the simplest two-phase commit strategy is hard to implement, and performance would suffer even if it were. In Codis, MGET/MSET therefore behaves like issuing multiple SET/GET calls from your own threads, just batched and repackaged by the server; we added support for these commands mainly to better accommodate businesses that previously used Twemproxy.
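To make the loss of atomicity concrete, here is a rough sketch — not Codis's actual proxy code, and the routing function is invented for illustration — of how a proxy might split one MSET into independent per-group batches. If one backend fails partway through, the batches already sent to other groups have committed, which is exactly why the single-instance atomic semantics cannot be preserved.

```python
import zlib
from collections import defaultdict

NUM_SLOTS = 1024  # Codis preshards keys into 1024 slots

def slot_of(key: str) -> int:
    # The hash Codis states it uses: CRC32(key) % 1024
    return zlib.crc32(key.encode()) % NUM_SLOTS

def split_mset(pairs, slot_to_group):
    """Group an MSET's key/value pairs by the server group owning each slot.

    Each group receives an independent batch; there is no cross-group
    rollback, so the overall MSET is not atomic across groups.
    """
    batches = defaultdict(list)
    for key, value in pairs.items():
        batches[slot_to_group(slot_of(key))].append((key, value))
    return dict(batches)

# Toy routing table: 1024 slots spread evenly over 4 server groups.
route = lambda slot: slot * 4 // NUM_SLOTS
batches = split_mset({"a": "1", "b": "2", "c": "3", "d": "4"}, route)
print(len(batches), "independent batches")  # 1 to 4, depending on the hashes
```

A real proxy would then issue each batch as a separate MSET to its backend and merge the replies, which is the "packaged back by the server" behavior described above.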



In real-world scenarios, many friends use Lua scripts to extend Redis's functionality. Codis does support this, but keep in mind that in this scenario Codis only does forwarding: it does not guarantee that the data your script operates on lives on the correct node. For example, if your script touches multiple keys, Codis sends the script to the machine that owns the first key in the parameter list. So you need to ensure yourself that all the keys your script uses are located on the same machine, which you can do with hashtags.



For example, suppose a script operates on several pieces of a user's information stored in keys shaped like uid1age, uid1sex, uid1name. Without hashtags these keys may be scattered across different machines. With a hashtag — curly braces delimiting the part of the key used for the hash calculation — the keys become {uid1}age, {uid1}sex, {uid1}name, which guarantees they are all placed on the same machine. This syntax was introduced by Twemproxy, and we support it as well.
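The hashtag rule above can be sketched as follows — a minimal model assuming Twemproxy-style semantics (only the substring inside the first `{...}` pair is hashed) combined with the CRC32-mod-1024 slot formula stated later in the Q&A; it is not Codis source code:

```python
import zlib

NUM_SLOTS = 1024  # Codis preshards data into 1024 slots

def hash_key(key: str) -> str:
    """Return the portion of the key used for slot hashing.

    With the hashtag syntax, only the substring inside the first {...}
    pair is hashed, so related keys can be pinned to the same slot.
    """
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end > start + 1:  # ignore empty tags like "{}"
            return key[start + 1:end]
    return key

def slot_of(key: str) -> int:
    return zlib.crc32(hash_key(key).encode()) % NUM_SLOTS

# Without the tag, uid1age / uid1sex / uid1name may land in different
# slots; with {uid1} they are guaranteed to share one.
tagged = ["{uid1}age", "{uid1}sex", "{uid1}name"]
print({k: slot_of(k) for k in tagged})
```

Since all three keys hash on the string "uid1", they map to the same slot, and a Lua script forwarded by the first key's location will find the others on the same machine.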



After open-sourcing Codis, we received a lot of community feedback. Most comments focused on the ZooKeeper dependency, the modifications to Redis, and why a proxy layer is needed at all. We have been asking ourselves whether these things are really necessary. The benefits of these components are real, as explained above, but is there a more elegant way? So in the next stage we will go a step further and implement the following designs:


    • Use Raft built into the proxy instead of external ZooKeeper. To us, ZK is just a strongly consistent store, and Raft can do the same job. Embedding Raft in the proxy to synchronize routing information reduces external dependencies.

    • Abstract the storage engine layer, and let the proxy or a third-party agent manage the storage engine's life cycle. Concretely, today Codis still requires manually deploying the underlying Redis or QDB instances and configuring their master-slave relationships; in the future we will hand this over to an automated agent, or even integrate the storage engine into the proxy. The benefits are that we can minimize proxy-forwarding overhead (for example, the proxy could start the Redis instance locally) as well as manual misoperations, improving the automation of the whole system.

    • Replication-based migration. As is well known, Codis currently migrates data by modifying the underlying Redis and adding an atomic single-key migration command. The advantage is that the implementation is simple and the migration is imperceptible to the business. But the downsides are obvious too: it is slow, it is intrusive to Redis, and maintaining per-slot information inside Redis brings extra memory overhead — for workloads dominated by small key-values, roughly a 1:1.5 ratio against vanilla Redis, so the memory cost is considerable.


In RebornDB we will try to provide replication-based migration: when migration starts, record the write operations on the slot while syncing a snapshot to the slave in the background; once the slave catches up, replay the recorded operations; when the replay has nearly drained, briefly stop writes on the master, finish leveling, and then modify the routing table to switch the migrating slot over to the new master — using the master-slave (semi-)synchronous replication mentioned earlier.
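The migration flow described above can be sketched roughly as follows. This is a simplified in-memory model, not RebornDB's real implementation; the class name `SlotMigration` and the single-threaded sequencing are assumptions made purely for illustration:

```python
class SlotMigration:
    """Toy model of replication-based slot migration:
    1) start recording writes to the slot, 2) bulk-sync a snapshot to the
    new node in the background, 3) replay the recorded log, 4) briefly
    block writes to drain the tail, 5) flip the routing table.
    """
    def __init__(self, old_node: dict, new_node: dict, routing: dict, slot: int):
        self.old, self.new, self.routing, self.slot = old_node, new_node, routing, slot
        self.log = []             # writes recorded during migration
        self.recording = False
        self.writes_blocked = False

    def write(self, key, value):
        if self.writes_blocked:
            raise RuntimeError("writes briefly paused while draining the log")
        self.old[key] = value
        if self.recording:
            self.log.append((key, value))

    def migrate(self):
        self.recording = True
        self.new.update(dict(self.old))    # step 2: background bulk sync
        while self.log:                    # step 3: replay recorded writes
            k, v = self.log.pop(0)
            self.new[k] = v
        self.writes_blocked = True         # step 4: stop writes, drain tail
        self.routing[self.slot] = "new"    # step 5: switch routing
        self.writes_blocked = False

routing = {42: "old"}
m = SlotMigration({}, {}, routing, slot=42)
m.write("k1", "v1")
m.migrate()
print(routing[42], m.new.get("k1"))  # → new v1
```

The key property the design is after is visible here: writes are only blocked for the short window between the log draining and the routing-table flip, instead of for the whole bulk copy.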


III. Experience and pitfalls of using Codis in production


Here are some tips. As a development engineer, my first-line operations experience is certainly no match for that of the ops folks, so let's dig deeper together in the discussion.



About multi-product-line deployment: Many friends ask how best to deploy Codis when there are multiple projects. At Wandoujia, each product line got a full Codis deployment, but they all shared one ZK, with different Codis clusters distinguished by different product names. Codis itself has no namespace concept: one Codis cluster corresponds to exactly one product name, and Codis clusters with different product names do not interfere with each other on the same ZK.



About ZK: Codis depends strongly on ZK. If connection jitter between a proxy and ZK causes a session expiry, that proxy stops serving, so try to ensure the proxies and ZK are deployed in the same datacenter. In production, ZK must run on an odd number of machines, at least 3; 5 physical machines are recommended.



About HA: HA here splits into two parts: HA of the proxy layer, and HA of the underlying Redis. First, the proxy layer. As mentioned, the proxy itself is stateless, so its HA is relatively easy: connecting to any live proxy is equivalent. In production we use Jodis, a Jedis connection pool we developed. It is very simple: it watches the list of live proxies on ZK and hands back Jedis objects accordingly, achieving load balancing and HA. Some friends use LVS or HAProxy for load balancing in production, which also works. As for Redis itself — meaning the masters of the server groups underneath Codis — Codis originally left this part of HA out of the design, because if a slave is promoted directly after the master dies, data can become inconsistent: recent writes on the master may not yet have been synchronized to the slave, in which case an administrator must repair the data manually. Later we found the demand was real — many friends asked for it — so we built a simple HA tool, codis-ha, which monitors the liveness of each server group's master and, if a master dies, directly promotes one of that group's slaves to be the new master. The project lives at https://github.com/ngaut/codis-ha.
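The promote-on-failure policy just described can be modeled in a few lines. To be clear, this is not the real codis-ha code (which is linked above); it is a minimal sketch of the policy, with invented data shapes, that also shows why unsynchronized writes are lost on failover:

```python
def check_and_failover(group: dict) -> dict:
    """Minimal model of a codis-ha-style policy (not the real tool):
    if the group's master is unreachable, promote the first live slave.
    Because Redis replication is asynchronous, any writes the dead
    master had not yet shipped to that slave are lost on promotion.
    """
    if group["master"]["alive"]:
        return group                      # healthy, nothing to do
    for i, slave in enumerate(group["slaves"]):
        if slave["alive"]:
            new_master = group["slaves"].pop(i)
            return {"master": new_master, "slaves": group["slaves"]}
    raise RuntimeError("no live slave to promote")

group = {
    "master": {"addr": "10.0.0.1:6379", "alive": False},
    "slaves": [{"addr": "10.0.0.2:6379", "alive": True}],
}
print(check_and_failover(group)["master"]["addr"])  # → 10.0.0.2:6379
```

The real tool does the liveness check by probing each master and then calls the Codis admin API to perform the promotion; the decision logic is the same shape.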



About the dashboard: The dashboard plays a very important role in Codis. All changes to cluster information are initiated through the dashboard (the design is a bit like Docker's): it exposes a set of RESTful API endpoints, and both the web management tool and the command-line tool operate by calling these HTTP APIs, so make sure the dashboard has network connectivity to the other components. For example, when users see the cluster's OPS showing 0 in the dashboard, it is usually because the dashboard cannot reach the proxy machines.



About the Go environment: In production, use a Go 1.3.x release if at all possible; Go 1.4's performance is worse, and it feels more like an intermediate release that has not reached production-ready status. Many friends criticize Go's GC. Without getting philosophical here: choosing Go was the result of a trade-off among many factors, and Codis is a middleware-style product without many small objects resident in memory, so there is basically no GC pressure and we do not worry about it.



About queue design: Simply put, follow the principle of not putting all your eggs in one basket: try not to pile data into a single key. Codis is a distributed cluster, and if you only ever operate on one key, you have degenerated to a single Redis instance. Many friends use Redis as a queue, but Codis does not provide the BLPOP/BLPUSH interface. That is fine: you can logically split one list into multiple list keys and poll them periodically on the business side (unless your queue requires strict ordering), so that different Redis instances share the access pressure of one logical list. Also, an oversized single key may block migration: Redis is single-threaded, so migrating a huge key blocks normal access.
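The split-one-list-into-N-keys idea can be sketched like this. This is an illustrative in-memory stand-in, not a Codis API; a real version would issue LPUSH/RPOP against the N list keys through any ordinary Redis client, and the key-naming scheme (`name:i`) is my own assumption:

```python
import itertools

class ShardedQueue:
    """Split one logical queue across N list keys (jobs:0 .. jobs:N-1)
    so different Redis instances share the access pressure. Pop polls
    the shards in turn, which gives no strict global ordering -
    acceptable unless the queue needs strict FIFO semantics.
    """
    def __init__(self, name: str, shards: int = 4):
        self.keys = [f"{name}:{i}" for i in range(shards)]
        self.lists = {k: [] for k in self.keys}   # stand-in for Redis lists
        self._rr = itertools.cycle(self.keys)     # round-robin for pushes

    def push(self, value):
        self.lists[next(self._rr)].append(value)  # real code: LPUSH key value

    def poll(self):
        for key in self.keys:                     # real code: RPOP each key
            if self.lists[key]:
                return self.lists[key].pop(0)
        return None                               # empty - caller retries later

q = ShardedQueue("jobs", shards=4)
for i in range(8):
    q.push(i)
drained = [q.poll() for _ in range(8)]
print(sorted(drained))  # all items come back, though not in push order
```

Because each shard key hashes to its own slot, the shards spread across server groups, and no single key grows large enough to stall a migration.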



About master and BGSAVE: Codis itself is not responsible for maintaining Redis's master-slave relationships. In Codis, "master" and "slave" are only conceptual: the proxy routes requests to the "master", and when the master dies, codis-ha promotes a "slave" to master. The actual master-slave replication must be configured manually when the underlying Redis instances are started. In production I recommend not enabling BGSAVE on the master's machine and not casually running the SAVE command; do data backups on the slave whenever possible.



About cross-datacenter / multi-active deployment: Don't even think about it... Codis has no concept of multiple replicas, and Codis mostly targets caching scenarios, where the business's traffic hits the cache directly. Building a cross-datacenter architecture at this layer makes both performance and consistency hard to guarantee.



About proxy deployment: You can actually deploy the proxy very close to the client — on the same physical machine, say — which helps reduce latency. Note, however, that the current Jodis does not pick the nearest proxy instance based on locality; that would require modification.


IV. Some views on distributed databases and distributed architectures (one more thing)


That wraps up the Codis-related content. Next I want to talk about my views on distributed databases and distributed architectures. Architects are greedy: every single point has to become distributed, and as transparently as possible :P. Take MySQL: from the earliest single instance, to master-slave read/write splitting, then to Alibaba's Cobar and TDDL, which achieved distribution and scalability at the expense of what the business can rely on, and later to OceanBase. Redis went from a single instance, to Twemproxy, to Codis, to Reborn. In the end the storage is unrecognizable compared to where it started, but the protocols and interfaces persist — such as SQL and the Redis protocol.



NoSQL systems arrived one after another, from HBase to Cassandra to MongoDB, solving data scalability by tailoring the storage and query models to the business and making trade-offs within CAP. But almost all of them gave up cross-row transactions (as an aside, the work adding cross-row transactions on top of HBase is a nice piece of work).



I think that, setting aside the details of the underlying storage, KV, SQL queries (relational database support) and transactions can be said to be the storage primitives that business systems are built from. Why is the memcached/Redis + MySQL combination so popular? Precisely because with this combination all of those primitives are available: a business can conveniently implement all kinds of storage needs and easily write "correct" programs. The problem is that once the data grows past a certain point, the hardest thing to carry through the evolution from single machine to distributed is transactions. SQL support can still be provided through various MySQL proxies, and KV goes without saying — it is born distribution-friendly.



So we end up in a world without (cross-row) transaction support, and in many business scenarios we can only sacrifice business correctness to keep implementation complexity in check. Take a very simple need: updating follow counts on Weibo. The most straightforward, most natural implementation would commit the change to the follower's "following" count and the followee's "followers" count in the same transaction — both succeed together or both fail together. But in practice, for the sake of performance and complexity, the common approach is asynchronous modification assisted by a queue, or writing to the cache first and bypassing transactions altogether, and so on.



But some scenarios with strong transactional requirements are not so easy to work around (we are only discussing open-source architectures here). The common practice is to shard the critical path by some user attribute onto a single MySQL instance, or to use MySQL XA — but then performance drops too badly.



Google later ran into this problem in its advertising business, needing high performance, distributed transactions, and consistency all at once. The business had previously been supported by a large sharded MySQL cluster, which was too hard to operate and scale. An ordinary company would probably just endure it, but Google is not an ordinary company: it fixed the problem with Spanner, using atomic clocks, and then built the SQL query layer F1 on top of Spanner. When I first saw this system I was absolutely stunned; it should be the first publicly described design that truly deserves to be called NewSQL. So: BigTable (KV) + F1 (SQL) + Spanner (high-performance distributed transaction support), with Spanner also providing a very important feature — replication and consistency guarantees across datacenters (implemented via Paxos). Multi-datacenter support completes Google's infrastructure database stack, making it easy for Google to develop almost any type of business system. I think that is the direction of the future: a scalable KV database (as cache and simple object storage), plus a distributed relational database with high-performance distributed transaction support and a SQL query interface, providing table semantics.


V. Q & A


Q1: I have not looked at Codis. You said Codis has no concept of multiple replicas — what does that mean?



A1: Codis is a distributed Redis solution. It uses presharding to conceptually divide the data into 1024 slots and forwards requests for different keys to different machines through the proxy. Replication of the data, if any, is provided by Redis itself.



Q2: Codis's information is stored in ZK. Does ZK play any other role in Codis? And why not use Sentinel for master-slave switchover?



A2: Codis's defining feature is dynamic expansion and shrinkage that is transparent to the business. Besides storing routing information, ZK also serves as the medium for event synchronization, such as master changes and data migration: every proxy must watch for the relevant ZK events. You could say we use ZK as a reliable RPC channel: only an admin making a cluster change posts an event on ZK; the proxies act on it, reply in ZK, and the admin waits for every proxy's reply before continuing. Cluster changes themselves are infrequent, so the data volume is small. For Redis master-slave switchover, codis-ha checks the liveness of each server group's master recorded on ZK to decide whether to issue the command that promotes a new master.



Q3: For data sharding, is consistent hashing used? Please elaborate, thank you.



A3: No, presharding. The hash algorithm is CRC32(key) % 1024.
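For reference, the stated formula can be reproduced with the standard-library CRC32 — a sketch of the formula as given, not an excerpt of Codis source:

```python
import zlib

def codis_slot(key: bytes, num_slots: int = 1024) -> int:
    """Slot id per the stated presharding scheme: CRC32(key) % 1024."""
    return zlib.crc32(key) % num_slots  # zlib.crc32 is unsigned in Python 3

print(codis_slot(b"user:1001"))  # a fixed value in [0, 1024)
```

Unlike consistent hashing, the key-to-slot mapping never changes; rebalancing moves whole slots between server groups instead of remapping keys.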



Q4: How are permissions managed?



A4: Codis currently has no authentication-related commands; the AUTH command is being added in RebornDB.



Q5: How do you prevent ordinary users from connecting to Redis and destroying data?



A5: Same as above — Codis currently has no AUTH; it will be added in the next version.



Q6: What is the plan for cross-datacenter Redis?



A6: There is no good way. We position Codis as a cache service within a single datacenter. For a service like Redis, cross-datacenter replication has two problems: the latency is high, and consistency is hard to guarantee. For a cache service with high performance requirements, I don't think going cross-datacenter is a good choice.



Q7: Can one cluster be the slave of another (e.g. cluster S is the slave of cluster M, where S and M may have different node counts and may not be in the same datacenter)?



A7: Codis is just a proxy-based middleware and is not responsible for data replication. That is, there is only one copy of the data, inside Redis.



Q8: From everything you have introduced, I conclude that you have no multi-tenancy concept and no high availability. Is that fair? You mostly designed Codis as a cache.



A8: Yes. In fact, our internal multi-tenancy is handled by running multiple Codis clusters; Codis is more of a Twemproxy replacement, and high availability is achieved through a third-party tool. Redis mainly solves caching; Codis solves Redis's single-point and horizontal-scaling problems. Quoting the Codis introduction:

    • Auto rebalance
    • Extremely simple to use
    • Support both Redis or RocksDB transparently
    • GUI dashboard & admin tools
    • Supports most of the Redis commands; fully compatible with Twemproxy (https://github.com/twitter/twemproxy)
    • Native Redis clients are supported
    • Safe and transparent data migration, easily add or remove nodes on demand

That is the problem space: how to dynamically scale the cache layer without stopping the business is what Codis cares about.



Q9: Do you have experience migrating cold Redis data? For hot data, the MIGRATE command can move data between two Redis processes, but if the target side has a password, MIGRATE breaks (I have already submitted a patch for this to Redis upstream).



A9: For cold data, we have now implemented the complete Redis sync protocol along with a RocksDB-based disk storage engine: the standby's cold data all lives on disk, attached directly as a slave to the master. (A separate question from the session: with 3 groups holding the same number of keys, one group's OPS is double the other two's — what could be the cause? An even key count does not mean requests are evenly distributed; you probably have a few particularly hot keys, and the load necessarily falls on the machines that actually store those keys.) The RocksDB storage engine just mentioned is https://github.com/reborndb/qdb; once started it is effectively a redis-server that supports the PSYNC protocol, so it can be used directly as a Redis slave. It is a good way to save memory on replicas.



Q10: When a Redis instance uses more than 50% of memory, executing BGSAVE will block if virtual memory support is enabled, and directly return an error if it is not — right?



A10: Not necessarily. It depends on how frequently the data is modified after BGSAVE starts. Internally, BGSAVE relies on the operating system's copy-on-write mechanism: if almost all of the data is modified during the save, the OS has no choice but to copy all of it out, and then memory blows up.



Q11: Just finished reading — liked it a lot. Could you introduce how Codis's auto-rebalance is implemented?



A11: The algorithm is fairly simple: https://github.com/wandoulabs/codis/blob/master/cmd/cconfig/rebalancer.go#L104. Code talks :) Essentially, slots are allocated according to the memory ratio of each instance.



Q12: Any experience with reducing the impact of data migration on the online service?



A12: Codis's migration approach is already quite gentle — one key migrated atomically at a time — and if you are worried about jitter you can even add a delay between keys. The upside is that the business basically does not notice; the downside is that it is slow.






Thanks to Mr. Liu for recording and organizing this talk, to Shu Tao and Chen Gang for proofreading, and to the many other volunteers on the editorial team who contributed to this article.






