Recently we needed to store a large amount of small records, on the order of billions in total. Each record is a 6-digit number plus a 32-character MD5 (hex representation). Since each record is tiny, the total volume is not that large, so we plan to shard by MD5 across multiple Redis instances, about 100 million records per instance. The raw data per instance is roughly (6+32) * 10^8 = 3.8GB, an amount of storage that Redis handles very well.
1 Loading data into Redis quickly
Redis is already very fast, up to 100,000 (10w) writes per second, but at that rate 100 million records still take nearly 20 minutes. With pipelining Redis can go even faster, up to 400,000 (40w) writes per second, which makes it easy to write 100 million records in about 5 minutes.
The --pipe option of the bundled redis-cli provides fast bulk loading, but we first have to convert our data into the Redis protocol. We set --pipe-timeout to 0 so that redis-cli does not exit prematurely when Redis is slow to respond. The Perl script in the example below does the protocol conversion. Perl's performance is the weak link: it hits 100% CPU before Redis reaches its throughput limit. To work around this we run 20 processes, each handling 5 million records (for example, by splitting the input into 20 chunks and feeding each through its own redis-pipe-1.pl | redis-cli --pipe pipeline), which pushes Redis CPU utilization to 100% and completes the load in about 5 minutes.
We use ps -eo 'pid rss pmem cmd' | grep redis and the Redis INFO command to check Redis memory usage.
2 The most intuitive way to store it
time head -n 5000000 data | ./redis-pipe-1.pl | redis-cli --pipe --pipe-timeout 0

The core of redis-pipe-1.pl is:

print join("\r\n", "*3", '$3', "SET", '$' . $keylen, $key, '$1', 1), "\r\n";

where the key is the 6-digit number plus the 32-character MD5 hex string, 38 bytes in total.
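For reference, here is a minimal sketch of what redis-pipe-1.pl as a whole might look like. The input format (a 6-digit number and the hex MD5 per line, whitespace-separated) is inferred from the later scripts, so treat this as an assumption rather than the original code.

#!/usr/bin/perl
use strict;
use warnings;

# Read "number md5hex" lines from stdin and emit RESP SET commands for redis-cli --pipe.
while (my $line = <STDIN>) {
    chomp $line;
    my ($appid, $md5) = split /\s/, $line;   # assumed input format
    my $key    = $appid . $md5;              # 6 digits + 32 hex chars = 38 bytes
    my $keylen = length($key);
    # RESP array of 3 bulk strings: SET <key> 1
    print join("\r\n", "*3", '$3', "SET", '$' . $keylen, $key, '$1', 1), "\r\n";
}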
Memory usage
5490 9033980 6.8 redis-server *:6379
used_memory_human:8.45G
db0:keys=100000000,expires=0,avg_ttl=0
In terms of memory usage, roughly 8GB is a bit more than twice the 3.8GB of raw data. Given Redis's internal data structures, where a single pointer alone is 8 bytes, plus the overhead of wrapping each tiny value and the allocator's slab-style memory allocation, a 2x blowup is not particularly surprising.
3 Using binary storage
An MD5 is really just 16 unsigned chars (16 bytes); it is expanded to 32 characters only to make it readable as hex. We first considered Base64, which would get it down to 24 characters, but then remembered that Redis is binary-safe, so why not store the 16 raw bytes directly? In ./redis-pipe-2.pl the 32-character hex representation is converted into 16 bytes of binary data:
# collect the raw MD5 bytes: every two hex characters become one byte
my @chars = ();
my $hex = "";
foreach (split //, $md5) {
    $hex .= $_;
    if (length($hex) == 2) {
        push(@chars, chr(hex($hex)));
        $hex = "";
    }
}
# key = 6-digit number + 16 raw MD5 bytes = 22 bytes
$key = $appid . join("", @chars);
$keylen = length($key);
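As a side note, Perl's built-in pack with the "H*" template does the same hex-to-bytes conversion in a single call. A sketch, reusing the $appid and $md5 variables from the surrounding script:

# pack("H*", ...) turns a string of hex digits into the corresponding raw bytes
$key    = $appid . pack("H*", $md5);
$keylen = length($key);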
Memory usage
12343 7437316 5.6 redis-server *:6379
used_memory_human:6.96G
db0:keys=100000000,expires=0,avg_ttl=0
Memory usage dropped by about 1.5GB, close to the expected 16 * 10^8 = 1.6GB (16 bytes saved per record).
So far we have only reduced memory usage on the data side. Since the earlier analysis suggests Redis's own data structures account for roughly half of the memory, how can we reduce that structural overhead?
4 Mixing SET and HSET to organize the data
First, look at two very interesting configuration options designed specifically for small hashes (created with HSET): when a hash has fewer than 512 entries and every value is shorter than 64 bytes, Redis uses a special internal encoding that can save around 5x memory on average.
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
We can reduce memory usage by splitting the flat key-value structure into a key -> small-hash structure:
my ($appid, $md5) = split /\s/, $line;
# collect the raw MD5 bytes: every two hex characters become one byte
my @chars = ();
my $hex = "";
foreach (split //, $md5) {
    $hex .= $_;
    if (length($hex) == 2) {
        push(@chars, chr(hex($hex)));
        $hex = "";
    }
}
# hash key = 6-digit number + first 3 MD5 bytes; field = remaining 13 bytes
my $hash = $appid . join("", @chars[0 .. 2]);
my $hashlen = length($hash);
my $key = join("", @chars[3 .. $#chars]);
my $keylen = length($key);
# RESP array of 4 bulk strings: HSET <hash> <field> 1
print join("\r\n", "*4", '$4', "HSET", '$' . $hashlen, $hash, '$' . $keylen, $key, '$1', 1), "\r\n";
Three unsigned chars give 2^24 = 16,777,216 possible hash keys, so with 100 million records each hash holds an average of about 6 fields (10^8 / 16,777,216 ≈ 6).
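One practical consequence of this layout is that a reader has to apply exactly the same split to find a record. Below is a minimal lookup sketch; the CPAN Redis client module, the connection details, and the example values are assumptions, not part of the original scripts.

use Redis;

my ($appid, $md5) = ("123456", "0cc175b9c0f1b6a831c399e269772661");  # hypothetical record
my $r = Redis->new(server => '127.0.0.1:6379');   # assumed address
my $bin     = pack("H*", $md5);                   # 16 raw MD5 bytes
my $hashkey = $appid . substr($bin, 0, 3);        # same hash key as above: number + first 3 bytes
my $field   = substr($bin, 3);                    # same field: remaining 13 bytes
my $found   = $r->hget($hashkey, $field);         # 1 if the record was loaded
# While hashes stay small, $r->object('encoding', $hashkey) should report "ziplist".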
Memory usage
8593 4052120 3.0 redis-server *:6379
used_memory_human:3.31G
db0:keys=16733972,expires=0,avg_ttl=0
Memory usage is 3.31GB, only about 50% more than the bare data, (6+16) * 10^8 = 2.2GB. Besides saving memory, this layout has another advantage: memory consumption no longer grows linearly with the number of records. Because there are at most 16,777,216 top-level hash keys, even with 200 million records each hash would only grow to an average of about 12 fields.
5 Additional issues to note
1. We originally planned to convert the 6-digit number into a 4-byte integer, which would save another 200MB or so, but gave up: once the number becomes a packed int, interoperability between languages gets tricky (see the sketch after this list).
2. Our Redis read load is not particularly heavy. We still need to test how the compact (ziplist) hash encoding affects read performance, but I expect little impact, since this encoding is enabled by default.
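For completeness, the optimization rejected in point 1 would have looked roughly like the sketch below; pack "N" stores the number as a 4-byte big-endian unsigned integer, and the variable names are illustrative only.

# Hypothetical variant of the hash-key construction from section 4:
# store the 6-digit number as 4 raw bytes instead of 6 ASCII digits.
my $md5_bytes = pack("H*", $md5);                      # 16 raw MD5 bytes, as before
my $hashkey   = pack("N", $appid) . substr($md5_bytes, 0, 3);
# 2 bytes saved per record x 10^8 records = roughly 200MB per instance,
# but every client in every language must unpack the 4-byte integer the same way.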
Source: Kuibu Blog