Redis and Distributed Locks


SETNX: SET if Not eXists

If the key does not exist, it is set to the specified value; if the key already exists, nothing is done.

Return value:

1: the key was set (the key did not exist before)

0: the key was not set (the key already existed)

Starting with Redis 2.6.12, the SET command can replace SETNX, SETEX, and PSETEX.

The SET command accepts the following options:

EX seconds: set the specified expire time, in seconds

PX milliseconds: set the specified expire time, in milliseconds

NX: only set the value if the key does not already exist

XX: only set the value if the key already exists

Return value: a simple string reply. "OK" means the SET executed correctly; a Null reply means the SET operation was not performed because the NX or XX condition was not met.

Consider an interface that queries a database. Because it is called very frequently, a cache is added and refreshed after it expires. The problem is that under high concurrency, without a locking mechanism, the instant the cache expires a large number of concurrent requests penetrate the cache and query the database directly, producing an avalanche effect. Instead, we can ensure that only one request updates the cache while the other requests either wait or use the expired value:

$ok = $redis->setnx($key, $value);
if ($ok) {
    $cache->update();
    $redis->del($key);
}

Issue 1: the validity time of the lock

If the SETNX succeeds but, for some unexpected reason, the DEL is never executed, or a network partition makes it impossible to keep communicating with the Redis node, the lock lives on forever and the cache is never updated again. So we need to add an expiration time and execute both commands in a transaction.

This expiration time is called the lock validity time. The client that acquires the lock must complete its access to the shared resource within this time.

$redis->multi();
$redis->setnx($key, $value);
$redis->expire($key, $ttl);
$redis->exec();

Because SETNX cannot set an expiration time by itself, we have to set one with EXPIRE.

But there is another problem: when multiple requests arrive, only one request's SETNX succeeds, yet every request's EXPIRE succeeds. That means the expiration time can be refreshed even by requests that did not obtain the lock; if requests are frequent enough, the expiration keeps getting refreshed and the lock never expires. So we need to implement this with a Lua script instead:

local key = KEYS[1]
local value = ARGV[1]
local ttl = ARGV[2]
local ok = redis.call('setnx', key, value)
if ok == 1 then
    redis.call('expire', key, ttl)
end
return ok
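Assuming the Lua source above is stored in a PHP string $luaScript, a phpredis invocation might look like the sketch below; the variable names and the argument layout (one key followed by two script arguments) are illustrative assumptions:

// KEYS[1] = $key; ARGV[1] = $value; ARGV[2] = $ttl.
// The final parameter tells phpredis how many leading entries
// of the argument array are KEYS (here, just one).
$ok = $redis->eval($luaScript, array($key, $value, $ttl), 1);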

Since 2.6.12, SET covers the functionality of SETEX as well: SET itself can set the expiration time, so everything we need can be implemented with SET alone.

SET resource-name anystring NX EX max-lock-time

$ok = $redis->set($key, $value, array('nx', 'ex' => $ttl));
if ($ok) {
    $cache->update();
    $redis->del($key);
}

Issue 2: the random string

A random string my_random_value is needed to ensure that a client only releases a lock that it actually holds.

Is the code perfect now? Not quite. Imagine that a cache update takes a long time (because of GC pauses, for example), longer even than the lock's expiration time, so the lock expires while the update is still in progress. Another request then acquires the lock. When the first request finishes updating the cache, deleting the key without checking it would delete the lock created by the other request. So we need to introduce a random value when creating the lock:

$ok = $redis->set($key, $random, array('nx', 'ex' => $ttl));
if ($ok) {
    $cache->update();
    if ($redis->get($key) == $random) {
        $redis->del($key);
    }
}

This code still has a problem, which leads to Issue 3: releasing the lock must be implemented with a Lua script.

Releasing the lock actually involves three steps: GET, compare, and DEL, and a Lua script is needed to make these three steps atomic. Otherwise, if the three steps are done in client logic, an execution sequence similar to the one in Issue 2 can occur:

1. Client 1 acquires the lock successfully.

2. Client 1 accesses the shared resource.

3. Client 1 prepares to release the lock and performs a GET to read the random string's value.

4. Client 1 finds the value equal to the one it expects.

5. Client 1 is blocked for a long time for some reason.

6. The lock expires and is released automatically.

7. Client 2 acquires the lock for the same resource.

8. Client 1 recovers from the blockage, performs the DEL, and releases the lock now held by client 2.

Even if the client is never blocked, a large network delay can produce a similar execution sequence.

The following Lua script solves Issue 3:

if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
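As a sketch of how this release script might be invoked from phpredis (assuming its source is held in $releaseScript and that $key and $random are the same values used when acquiring the lock):

// KEYS[1] = $key; ARGV[1] = $random. Returns 1 if our own lock
// was deleted, 0 if the key held another client's value (or no value).
$released = $redis->eval($releaseScript, array($key, $random), 1);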

If the Redis node goes down, no client can acquire the lock and the service becomes unavailable. To improve availability, we can attach a slave to this Redis node, and when the master becomes unavailable the system automatically fails over to the slave. However, because Redis master-slave replication is asynchronous, this can break the safety of the lock during failover. Consider the following sequence of events:

1. Client 1 acquires the lock from the master.

2. The master goes down before the key storing the lock is replicated to the slave.

3. The slave is promoted to the new master.

4. Client 2 acquires the lock for the same resource from the new master.

As a result, client 1 and client 2 hold a lock on the same resource at the same time: the safety of the lock is broken. To solve this problem, Antirez designed the Redlock algorithm.

"Other questions"

How should the lock validity time in the algorithm above be set? If it is too short, the lock may expire before the client finishes accessing the shared resource, losing its protection; if it is too long and a client holding the lock fails to release it, all other clients are unable to acquire the lock for a long time and the service stops working. It seems to be a dilemma.

Furthermore, in the earlier analysis of the random string my_random_value, Antirez admitted in his article that the case where prolonged client blocking causes the lock to expire does have to be considered; if it happens, the shared resource is left unprotected. Can the Redlock that Antirez designed solve these problems?

The distributed lock algorithm Redlock

Reference URLs:

https://mp.weixin.qq.com/s/JTsJCDuasgIJ0j95K8Ay8w

https://mp.weixin.qq.com/s/4CUe7OpM6y1kQRK8TOC_qQ

https://redis.io/topics/distlock

Since a distributed lock based on a single Redis node cannot remain safe across failover, Antirez proposed a new distributed lock algorithm, Redlock, based on N fully independent Redis nodes (N is usually set to 5).

A client running the Redlock algorithm performs the following steps to acquire the lock:

1. Get the current time, in milliseconds.

2. Perform the lock-acquisition operation on each of the N Redis nodes in turn. Acquisition on a single node works the same way as the single-node lock described earlier: it includes a random string my_random_value and an expiration time (such as PX 30000, the lock validity time). To keep the algorithm running when a Redis node is unavailable, each per-node acquisition has a timeout that is much smaller than the lock validity time (on the order of tens of milliseconds). After failing to acquire the lock on one node, the client should immediately try the next one. "Failure" here includes any kind of failure, for example the node being unreachable, or the lock on that node already being held by another client (note: the original Redlock description only mentions the node being unavailable, but other failures should be included as well).

3. Compute the total time spent acquiring the lock by subtracting the time recorded in step 1 from the current time. The client considers the lock finally acquired only if it acquired the lock on a majority of the nodes (>= N/2+1) and the total time spent does not exceed the lock validity time.

4. If the lock was finally acquired, its validity time should be recalculated: the original validity time minus the acquisition time computed in step 3.

5. If acquisition finally failed (either fewer than N/2+1 nodes granted the lock, or the total acquisition time exceeded the lock validity time), the client should immediately issue a release-lock operation to all N Redis nodes (the Lua delete script described earlier).

The above describes the process of acquiring the lock. Releasing it is simpler: the client issues the release-lock operation to all N Redis nodes, regardless of whether it believed acquisition succeeded on a given node at the time. Both operations are sketched below.
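Here is a minimal, hedged PHP sketch of the acquisition and release steps, assuming $nodes is an array of already-connected phpredis clients; the function names, the exception handling, and the exact bookkeeping are illustrative assumptions, not code from the original article:

// Minimal Redlock sketch. Each client in $nodes should be configured
// with a connect/read timeout much smaller than $ttlMs.
function redlockAcquire(array $nodes, $key, $random, $ttlMs) {
    $quorum = (int) floor(count($nodes) / 2) + 1;
    $startMs = microtime(true) * 1000;              // step 1: record the current time
    $acquired = 0;
    foreach ($nodes as $node) {                     // step 2: try each node in turn
        try {
            if ($node->set($key, $random, array('nx', 'px' => $ttlMs))) {
                $acquired++;
            }
        } catch (Exception $e) {
            // Node unavailable: count it as a failure and move on.
        }
    }
    $elapsedMs = microtime(true) * 1000 - $startMs; // step 3: total acquisition time
    $validityMs = $ttlMs - $elapsedMs;              // step 4: remaining validity time
    if ($acquired >= $quorum && $validityMs > 0) {
        return $validityMs;
    }
    redlockRelease($nodes, $key, $random);          // step 5: failed, release everywhere
    return false;
}

function redlockRelease(array $nodes, $key, $random) {
    // Release on ALL nodes, even those where acquisition appeared to fail.
    $script = 'if redis.call("get", KEYS[1]) == ARGV[1] then
                   return redis.call("del", KEYS[1])
               else
                   return 0
               end';
    foreach ($nodes as $node) {
        try {
            $node->eval($script, array($key, $random), 1);
        } catch (Exception $e) {
            // Ignore: the remaining nodes still receive the release.
        }
    }
}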

Because Redlock keeps working as long as a majority of the N Redis nodes are healthy, it is theoretically more available. The lock-invalidation problem of the single-node distributed lock discussed earlier does not exist in Redlock, but a node that crashes and reboots can still affect the safety of the lock. The extent of the impact depends on how Redis persistence is configured.

Suppose there are 5 Redis nodes in total: A, B, C, D, and E, and imagine the following sequence of events:

1. Client 1 successfully locks A, B, and C, and thereby acquires the lock (D and E are not locked).

2. Node C crashes and restarts, and the lock client 1 set on C is lost because it was not persisted.

3. After node C restarts, client 2 locks C, D, and E, and also acquires the lock.

Now both client 1 and client 2 hold a lock on the same resource.

By default, Redis's AOF persistence writes to disk (performs an fsync) once per second, so in the worst case about one second of data can be lost. To minimize data loss, Redis can be configured to fsync on every data modification, but that degrades performance. Of course, data can still be lost even when the fsync is executed (this depends on the operating system rather than on the Redis implementation). So the lock failure caused by a node reboot in the analysis above can always occur. To deal with this problem, Antirez also proposed the concept of delayed restarts: when a node crashes, it is not restarted immediately, but only after a delay longer than the lock validity time. That way, every lock the node participated in before the crash has expired by the time it restarts, and the restart cannot affect existing locks.
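The relevant redis.conf settings look roughly like this (a sketch of the stock AOF options, not tuning advice):

appendonly yes          # enable AOF persistence
appendfsync everysec    # default: fsync about once per second (may lose ~1s)
# appendfsync always    # fsync on every write: safest, but slower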

There is one more small detail in Redlock worth analyzing. Regarding the release of the lock, Antirez specifically emphasizes in the algorithm description that the client should issue the release-lock operation to all Redis nodes: even a node where acquisition appeared to fail must not be skipped. Why? Imagine this situation: the client sends an acquire-lock request to a Redis node, the request reaches the node, and the node performs the SET successfully, but the response packet back to the client is lost. From the client's point of view the acquisition failed with a timeout, yet on the Redis side the lock was set successfully. Therefore, when releasing the lock, the client should also send the request to the nodes where acquisition seemed to fail. This situation is entirely possible in an asynchronous communication model: communication from client to server works fine, but the opposite direction has a problem.

"Other questions"

When discussing the distributed lock for a single Redis node, we ended by raising the issue that if the client is blocked for a long time, causing the lock to expire, it will access the shared resource without the protection of the lock. Does Redlock improve on this? Clearly, the same problem still exists in Redlock.

In addition, after the lock is successfully acquired in step 4 of the algorithm, if acquiring it took a long time and the recalculated remaining validity time is short, there may not be enough time left to finish accessing the shared resource. If we decide the remaining time is too short, we should immediately release the lock. But how short is too short? That is yet another tuning choice.
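Concretely, this becomes a threshold check after acquisition. Continuing the sketch above, with $minWorkMs as an illustrative, caller-supplied estimate of the time the protected work needs:

// If the remaining validity time cannot cover the work, give the lock back.
$validityMs = redlockAcquire($nodes, $key, $random, $ttlMs);
if ($validityMs !== false && $validityMs < $minWorkMs) {
    redlockRelease($nodes, $key, $random);
    $validityMs = false;
}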

Martin's analysis

On 2016-02-08, Martin Kleppmann published a blog post titled "How to do distributed locking", available at:

https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html

In this article, Martin addresses many fundamental issues of distributed systems (especially the asynchronous model of distributed computation); it is well worth reading for anyone working on distributed systems. The article can be roughly divided into two parts:

The first half has nothing to do with Redlock specifically. Martin points out that even a perfectly implemented distributed lock (with automatic expiration) does not give enough safety unless the shared resource itself provides some kind of fencing mechanism.

The second half is a critique of Redlock itself. Martin points out that because Redlock is essentially built on a synchronous model, it places very strong timing assumptions on the system, so its safety cannot be guaranteed.

First, let's discuss the key points of the first half. Martin gives a sequence diagram to illustrate:

[Sequence diagram: client 1 acquires a lease, pauses for a long GC, the lease expires, client 2 acquires the lock, and client 1's delayed write then conflicts with client 2's.]

In the sequence diagram, assume that the lock service itself works correctly: it always guarantees that at most one client holds the lock at any moment. The word "lease" in the diagram can be taken as equivalent to a lock with automatic expiration. Client 1 experiences a long GC pause after acquiring the lock; during the pause its lock expires, and client 2 acquires the lock. When client 1 recovers from the GC pause, it does not know that its lock has expired, and it still sends a write request to the shared resource (a storage service in the diagram). But the lock is now held by client 2, so the two clients' write requests can conflict: the lock's mutual exclusion has failed.

At first glance, one might say: since client 1 does not know its lock has expired when it recovers from the GC pause, it could simply check whether the lock has expired before accessing the shared resource. But that does not help at all, because a GC pause can occur at any time, perhaps immediately after the check completes.
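This is the failure mode the fencing mechanism mentioned earlier is meant to stop: the lock service hands out a monotonically increasing fencing token with each lock grant, and the storage service rejects any write whose token is not newer than the last one it accepted. A minimal sketch follows; the $storage helpers (getLastToken, setLastToken, write) are hypothetical, and in practice the check and the write must be performed atomically on the storage side:

// Storage-side fencing check (sketch; helper methods are hypothetical).
function writeWithFencing($storage, $resource, $data, $token) {
    // Highest fencing token seen so far for this resource.
    $lastToken = $storage->getLastToken($resource);
    if ($token <= $lastToken) {
        return false; // a newer lock holder already wrote: reject the stale writer
    }
    $storage->setLastToken($resource, $token);
    $storage->write($resource, $data);
    return true;
}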

It has also been suggested that if the client is written in a language without GC, the problem disappears. Martin points out that the system environment is far too complex for that: there are still many other reasons why a process can pause.
