A few months after VM started to work, my feeling about it started
to be not very good. I stated in the blog, and privately on IRC,
especially talking with Pieter, that VM was not the way to go for
the future of Redis, and that the new path we were taking, using less
memory, was a much better approach, together with cluster.
However there are a number of different models for dealing with
datasets bigger than RAM in Redis. Just to cite a few:

1) Virtual memory, where we swap values on disk as needed (the Redis
Virtual Memory way).

2) Storing data on disk, in a complex form so that operations can be
implemented directly on the on-disk representation, and using the OS
cache as a cache layer for the working set (let's call it the MongoDB
way).

3) Storing data on disk, but not for direct manipulation, using
memory as a cache of active objects, and flushing writes to disk
when these objects change.
It is now clear that VM is not the right set of tradeoffs: it was
designed to be pretty fast, but on the other hand there was too big
a price to pay for all the rest: slow restarts, slow saving, and in
turn slow replication, very complex code, and so forth.
If you want pure speed with Redis, in memory is the way to go. So as a
reaction to the email sent by Tim about his unhappiness with VM, I used
a few vacation days to start implementing a new model, the one listed
above as number 3.
The new set of tradeoffs is very different. The result is called
diskstore, and this is how it works, in a few easy to digest points:
- In diskstore, key-value pairs are stored on disk.
- Memory works as a cache for live objects. Operations are only
performed on in-memory keys, so data on disk does not need to be
stored in complex forms.
- The cache-max-memory limit is strict. Redis will never use more RAM,
even if we have 2 MB of max memory and 1 billion keys. This works
since now we don't need to keep all the keys in memory.
- Data is flushed to disk asynchronously. If a key is marked as dirty,
an I/O operation is scheduled for this key.
- You can control the delay between modifications of a key and disk
writes, so that if a key is modified many times in a short time, it
will be written only once on disk.
- Setting the delay to 0 means: sync it as fast as possible.
- All I/O is performed by a single dedicated thread, which is
long-running and not spawned on demand. The thread is woken up with a
condition variable.
- The system is much simpler and saner than the VM implementation, as
there is no need to "undo" operations on race conditions.
- Zero start-up time... as objects are loaded on demand.
- There is negative caching. If a key is not on disk we remember it
(if there is memory to do so), so we avoid accessing the disk again
and again for keys that are not there.
- The system is very fast if we mostly access our working set, and
this working set happens to fit in memory. Otherwise the system is
much slower (I/O bound).
- The system does not currently support bgsave, but will support it,
and, what is cool, with minimal overhead and memory usage in the saving
child, as data on disk is already written using the same serialization
format as .rdb files. So our child will just copy files to obtain the
.rdb. In the meantime the objects in the cache are not flushed, so the
system may use more memory, but it's not like copy-on-write: it will
use very little additional memory.
- Persistence is *per key*! This means there is no point-in-time
persistence.
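The dirty-key scheduling and write coalescing described above can be
sketched in a few lines of Python. This is only an illustration of the
idea, not the actual dscache.c code: the names IOScheduler, mark_dirty
and flush_delay are mine, and a plain dict stands in for the on-disk
store. A single long-running I/O thread sleeps on a condition variable,
and re-dirtying a key before its delay expires just replaces the
pending value, so many modifications become one write.

```python
import threading
import time

class IOScheduler:
    """Sketch: one dedicated I/O thread, woken by a condition variable.

    Dirty keys are flushed to the (dict-like) store after flush_delay
    seconds; re-dirtying a key before it is flushed just replaces the
    pending value, so many modifications become a single disk write.
    """

    def __init__(self, store, flush_delay=0.0):
        self.store = store              # stands in for the on-disk KV store
        self.flush_delay = flush_delay  # seconds between dirtying and writing
        self.dirty = {}                 # key -> (pending value, dirty time)
        self.cond = threading.Condition()
        self.stopping = False
        # one long-running thread, never spawned on demand
        self.thread = threading.Thread(target=self._io_loop, daemon=True)
        self.thread.start()

    def mark_dirty(self, key, value):
        with self.cond:
            self.dirty[key] = (value, time.monotonic())  # resets the timer
            self.cond.notify()                           # wake the I/O thread

    def _io_loop(self):
        while True:
            with self.cond:
                while not self.dirty and not self.stopping:
                    self.cond.wait()
                if self.stopping and not self.dirty:
                    return
                now = time.monotonic()
                ready = [k for k, (_, t) in self.dirty.items()
                         if now - t >= self.flush_delay]
                if not ready:
                    # nothing old enough yet; sleep at most one delay period
                    self.cond.wait(timeout=self.flush_delay)
                    continue
                batch = {k: self.dirty.pop(k)[0] for k in ready}
            for k, v in batch.items():   # write outside the lock, per key
                self.store[k] = v

    def shutdown(self):
        with self.cond:
            self.stopping = True
            self.cond.notify()
        self.thread.join()               # pending writes are flushed first
```

With flush_delay set to 0 every modification is synched as fast as
possible; with a larger delay a hot key is rewritten on disk only once
per window, which is exactly the tradeoff the delay setting controls.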
I think that the above points may give you an idea about how it works.
But let me stress the per-key persistence point a bit.
lpush a 0
lpush b 1
lpush a 2

After these commands we may have two scheduled I/O operations
pending: one for "a" and one for "b".

Now imagine "a" is saved, and then the server goes down, Redis is
brutally killed, or alike. The database will contain a consistent
version of "a" and "b", but the version of "b" will be the old one,
without the "1" pushed.
Also, currently MULTI/EXEC is not transactional, but this will be
fixed: at least inside a MULTI/EXEC block there will be the guarantee
that either all values or none will be synched to disk (this will be
obtained using a journal file for transactions).
Some more details. The system is composed of two layers:

diskstore.c -- implements a trivial on-disk key-value store
dscache.c   -- implements the more complex caching layer
diskstore.c is currently a filesystem-based KV store. It can be
replaced with a B-tree or something like that in the future if this
turns out to be needed. However, even if the current implementation
has a big overhead, it's pretty cool to have data as files, with very
little chance of losing data or corruption (rename is used for writes).
But well, if this does not scale well enough we'll drop it and replace
it with something better.
The current implementation is similar to bigdis. 256 directories
containing 256 directories each are used, for a total of 65536 dirs.
Every key is put inside the dir addressed by SHA1(key) translated in
hex; for instance key "foo" is:

/0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
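The hashing scheme is easy to reproduce. Here is a sketch in Python
(key_path is my name for it, not the diskstore.c function, and the
"/var/redis" root is just an example) mapping a key to its on-disk
location:

```python
import hashlib
import os.path

def key_path(root, key):
    """Sketch of diskstore's layout: two directory levels taken from
    the first two bytes of SHA1(key) in hex, then the full hex digest
    as the file name. Names here are illustrative, not the C API."""
    h = hashlib.sha1(key.encode()).hexdigest()
    return os.path.join(root, h[0:2], h[2:4], h)

print(key_path("/var/redis", "foo"))
# → /var/redis/0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
```

Splitting on the first two bytes is what spreads keys evenly across
the 65536 directories, keeping each directory small.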
The cool thing is, diskstore.c exports a trivial interface to Redis,
so it's very simple to replace it with something else without touching
too many internals.
Stability: the system is obviously in alpha stage, however it works
pretty well, without obvious crashes. But warning: it will crash with
an assert if you try to bgsave.
To try it, download the "unstable" branch, edit redis.conf to enable
diskstore, and play with it. Enjoy a Redis instance that starts in no
time even when it's full of data.
Feedback is really appreciated here. I want to know what you think,
what your impressions are on the design, the tradeoffs, and so forth,
and how it feels when you experiment with it. If you want to see the
inner workings, set the log level to "debug".
The goal is to ship 2.4 ASAP with VM replaced with a good
implementation of diskstore.
Cheers,
Salvatore