Micro-Architecture Design: Weibo Counter Design

Source: Internet
Author: User (@cydu)

Original posts:
http://qing.weibo.com/1639780001/61bd0ea133002460.html
http://qing.weibo.com/1639780001/61bd0ea1330025sq.html

Background:

Behind every post and every comment is an endless string of stories, but today's topic is the counting service: the service that records, for every Weibo post, its number of comments and its number of reposts (and, increasingly, other engagement numbers as well).

Data volume: the total number of Weibo posts is on the order of 100 billion, growing fast every second. Each post has a unique 64-bit ID.

Traffic: millions of requests per second, still growing steadily. All access is by Weibo ID.

Main interface:
- increment the comment count (default 0)
- increment the repost count (default 0)
- get the comment count
- get the repost count
- get the comment count and the repost count together (this is the most heavily used interface)

Both the comment count and the repost count can be treated as 32-bit integers; they are never negative, and the default is 0.

Requirements: users are extremely sensitive to these numbers (imagine the pain of working hard to attract followers while the follower count refuses to rise), so the data must be very accurate, the latency very low (within 1 s), and the service very stable (the counts must not be lost because some auntie sweeping the floor knocked a plug out of its socket...). As architects, we of course also have to consider the cost of the architecture in all its aspects and make trade-offs. The main costs considered here are machine cost, development cost, and maintenance cost.
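Read as code, that interface might look like the following minimal C sketch (the names and types here are illustrative assumptions, not the actual Weibo API):

    #include <stdint.h>

    /* Illustrative counter API, assuming 64-bit post IDs and 32-bit counts. */
    typedef struct {
        uint32_t comment_num;
        uint32_t repost_num;
    } weibo_counts;

    void counter_incr_comment(int64_t weibo_id);      /* starts at 0, then +1 */
    void counter_incr_repost(int64_t weibo_id);       /* starts at 0, then +1 */
    uint32_t counter_get_comment(int64_t weibo_id);
    uint32_t counter_get_repost(int64_t weibo_id);
    weibo_counts counter_get_both(int64_t weibo_id);  /* the hottest call */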

Interested architects (and prospective architects) can think along with us: how would you use the fewest machines to develop the most maintainable counter system in the shortest possible time, while still meeting the data volume, performance, and high-availability requirements above?

If you are very interested in this problem and have good ideas or suggestions, please send your resume by private message to @cydu or @Weibo Platform Architecture; we have plenty of similar problems waiting for you to solve! Of course, you are also welcome to simply comment and join the discussion.

PS: below I give our own understanding of the problem and the line of thinking behind our solution. I look forward to optimizing it together with everyone.

Update 1: added the material on data persistence and consistency guarantees; thanks to @lihan_harry, @Zheng Zheng, @51 Leo Liu and other classmates for the reminders.
Update 2: added the weibo_id key optimization: with prefix compression we can save nearly half the space. Thanks to @Wu Tingbin and @drdrxp for the advice!
Update 3: added the value optimization using two-dimensional arrays and multi-column compressed encoding; thanks again to @Wu Tingbin.
Update 4: revised the estimate of memory usage under the Redis scheme; thanks to @Liu Hao Bupt for the reminder.

I dug a pit last week ([Micro-Architecture Design] Weibo counter design (part 1), http://qing.weibo.com/1639780001/61bd0ea133002460.html). Although the motive for digging it was impure (it is obviously a recruiting piece, and it is very gratifying that it has already brought in a lot of solid resumes; may they keep coming!), the discussion with everyone was a great harvest in itself, and I met a lot of new friends along the way.

For a simple counting service, things really are simple, and there are many possible solutions.

Scenario One: Go directly to MySQL

This needs no explanation: simple and brute-force enough. In the early stage of rapid product development it solves a lot of problems, and at that stage it is a perfectly good solution.

What if the data volume is too large? For data on the scale of hundreds of millions, splitting into multiple tables solves a lot of problems. For the Weibo counter there are at least two classic ways to split:

I. Shard by ID modulo N, dividing the data across N tables. The tragedy of this scheme is poor scalability: once the tables fill up, adding more is very depressing. You can create extra tables in advance, but for a business growing as fast as Weibo, re-sharding under that growth has a significant impact.

II. Shard by the time embedded in the ID, creating a new table whenever the current one fills up. The tragedy of this scheme is the hot/cold imbalance: the most recent posts are necessarily accessed the most, while the old tables see almost no traffic. This can be mitigated with a mixed deployment of hot and cold storage, but the deployment and maintenance costs are significant.

Once the data volume goes from billions to hundreds of billions, the nature of the problem changes: maintaining thousands of tables, each with different hotspots that demand constant switching and adjustment, is a very tragic affair... (A minimal sketch of the two routing rules follows below.)
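As a concrete illustration of the two splitting rules, here is a minimal C sketch; the table-name pattern, the shard count, and the bit position of the time segment are all illustrative assumptions, not Weibo's actual scheme:

    #include <stdio.h>
    #include <stdint.h>

    #define N_TABLES 64  /* assumed fixed shard count for scheme I */

    /* Scheme I: route by ID modulo N. */
    void table_by_mod(int64_t weibo_id, char *buf, size_t len) {
        snprintf(buf, len, "weibo_count_%03d", (int)(weibo_id % N_TABLES));
    }

    /* Scheme II: route by the time segment embedded in the ID, assuming
     * (hypothetically) that the high bits of the ID grow with time. */
    void table_by_time(int64_t weibo_id, char *buf, size_t len) {
        int64_t segment = weibo_id >> 40;  /* illustrative split point */
        snprintf(buf, len, "weibo_count_t%lld", (long long)segment);
    }

    int main(void) {
        char t[32];
        table_by_mod(5612814510546515491LL, t, sizeof t);
        printf("%s\n", t);
        table_by_time(5612814510546515491LL, t, sizeof t);
        printf("%s\n", t);
        return 0;
    }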
What if the traffic is too high? There are several classic ways to handle traffic:

I. Add a cache (e.g. memcache): query the cache first, and fall back to MySQL on a miss. There are two depressing points here. First, empty data must also be cached: more than half of all posts have no reposts and no comments, yet they still receive plenty of queries, which badly hurts cache utilization (half the cache would hold zeros). Second, the cache is invalidated very frequently: counts update constantly, so entries must be invalidated and re-populated over and over, which also opens the door to inconsistency. And for such a basic service, this makes clients more complex: they have more things to worry about.

II. Better hardware: Fusion-io + HandlerSocket + big-memory optimization. Throwing hardware at it also works, but this is the most typical scale-up scenario. Although there is no development at all, the hardware cost is not low, and it is hard to cope with more complex requirements and with rapidly growing traffic.

Advantages:
I. No development; the code farmers can use their coding time to go out and chase girls.
II. A mature ecosystem: data replication, management, and repair tooling are all very mature.

Disadvantages:
I. Support for very large data volumes and very high concurrency is weak.
II. Maintenance and hardware costs are high.

Overall: MySQL sharding plus cache/hardware acceleration is a great solution while data size and traffic stay moderate, but it becomes very inappropriate once the volume gets huge. If MySQL doesn't work, what about NoSQL?

Scenario Two: Redis

As an in-memory data-structure server, Redis offers a very convenient access interface and quite good single-machine performance; a counter built on INCR is a simple, easy-to-use pattern. With sharding at the upper layer, added slaves, and a pile of machines, it can also handle the large data volume and the high concurrency.

But Redis is purely in-memory (the VM mechanism is immature and about to be abandoned; nobody dares use it in production!), so the cost is not low. Let us roughly estimate the storage cost, based on the Redis 2.4.16 implementation on a 64-bit system with 8-byte pointers, assuming an 8-byte key and a 4-byte value stored via INCR:

- The value creates a robj through createStringObjectFromLongLong. Because the value lies between LONG_MIN and LONG_MAX it can be stored directly in the ptr pointer, so it occupies sizeof(robj) = 16 bytes.
- The key (the Weibo ID) is at most 64 bits, i.e. up to about 19 decimal digits (e.g. 5612814510546515491), but it is stored as a string via sdsdup, requiring at least 8 (struct sdshdr) + 19 + 1 = 28 bytes.
- To go into Redis's dict it needs a dictEntry object: another 3 * 8 = 24 bytes.
- The pointer slot in db->dict->ht[0]->table: another 8 bytes.

So to store one count with a 64-bit key and a 32-bit value, Redis consumes at least 16 + 28 + 24 + 8 = 76 bytes. Holding 100 billion keys entirely in memory therefore needs at least 100 G * 76 B = 7.6 TB of RAM (in other words, about one hundred machines with 76 GB each!).

Our actual payload is only 100 billion * 32 bits = 400 GB, yet it takes 7.6 TB to store, so the effective memory utilization is roughly 400 GB / 7600 GB = 5.3%.

Even so, a single copy of the hot data is not enough: single-machine performance is insufficient and system stability cannot be guaranteed (what do we do when a machine goes down?), so multiple replicas are needed. Then add the memory overhead jemalloc pays to avoid fragmentation, the temporary memory required by dictExpand, and the memory the operating system itself uses... a conservative estimate comes to more than 300-400 machines.

Overall: Redis is an excellent in-memory data-structure server, convenient and easy to use, and for a counting service with small-to-medium data volume and medium-to-high traffic it is a fine choice. But for an extreme application scenario like the Weibo counter, the cost is unacceptable!
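Collecting that arithmetic in one place, here is a back-of-the-envelope check (the per-key overheads are the Redis 2.4.16 figures quoted above; nothing else is measured):

    #include <stdio.h>

    int main(void) {
        /* Per-key overhead in Redis 2.4.16 as estimated above (bytes). */
        long long robj = 16, sds_key = 8 + 19 + 1, dict_entry = 24, slot = 8;
        long long per_key = robj + sds_key + dict_entry + slot;      /* 76 */

        long long keys    = 100LL * 1000 * 1000 * 1000;              /* 100 billion */
        long long total   = per_key * keys;                          /* ~7.6 TB */
        long long payload = 4 * keys;                                /* ~400 GB */

        printf("per key:     %lld B\n", per_key);
        printf("total:       %.1f TB\n", total / 1e12);
        printf("payload:     %.0f GB\n", payload / 1e9);
        printf("utilization: %.1f%%\n", 100.0 * payload / total);
        return 0;
    }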
Some students suggested other NoSQL schemes such as Cassandra and MongoDB; whether from the standpoint of maintainability or of machine utilization, these are very hard to accept (interested students can run the analysis carefully). If ordinary NoSQL can't do it, then what? Let's try customizing our own counter!

Update 4: //@Liu Hao Bupt: @cydu I just read the section on Redis capacity estimates in the article; it has two small flaws. 1. For value storage, the article estimates 16 bytes, but this cost can actually be saved: createStringObjectFromLongLong does not allocate extra space for values below REDIS_SHARED_INTEGERS (default 10000, and it can be raised), which covers the majority of counts. //@Liu Hao Bupt: @cydu 2. It is worth evaluating the memory utilization achievable with zipmap. Redis is not just string->string KV storage; there are things to be mined. Instagram described on its engineering blog (http://t.cn/S7EUKe) using zipmap to store 1M of data, with memory consumption optimized from 70M down to 16M. Given Sina Weibo's massive use of Redis, a customized Redis-based service is also a line of thought.

Thanks to @Liu Hao Bupt for pointing out that my Redis capacity estimate was inaccurate: the REDIS_SHARED_INTEGERS mechanism built into Redis really can save most of the memory attributed to values. But since that mechanism relies on a pointer referencing the shared integer, it does not transfer well to Scenario Three below. The zipmap optimization idea is also quite good, and we will keep watching it for general-purpose Redis use.

Scenario Three: A Custom Counter

A counter is an ordinary, basic service, but here the data volume is so large that quantitative change becomes qualitative change. So one guiding idea in building our counter is: sacrifice some generality, and optimize specifically for the characteristics of Weibo reposts and comments, namely huge data volume and high-concurrency access.

1. Sparse data. A large share of posts (more than half) have no reposts, or no comments, or neither. Optimization for this: abandon the storage + cache split. Those all-zero rows would have to enter the cache anyway (whether by bypass or by read-through), because they still get plenty of queries, and that devastates cache utilization (half of the cached data would be empty). Instead we use a design where the storage is the cache (the storage itself lives in memory) and simply do not store this class of data at all: when a key is not found, return 0. With this, we can cut the 100 billion numbers by 3/5, leaving at most 40 billion. This is the most classic sparse-array storage optimization.

2. Correlated values. The comment count and the repost count of a post are highly correlated: they share the same primary key, heavily reposted posts generally have comments, and heavily commented posts generally have a sizable repost count. Moreover, the most-visited feed page almost always fetches the repost count whenever it fetches the comment count. Optimization for this: store the comment count and the repost count together, which saves a large amount of key storage: from (weibo_id -> comment count) plus (weibo_id -> repost count) to a single (weibo_id -> comment count + repost count) structure.
PS: this optimization conflicts slightly with the previous one: a post with reposts but no comments now stores a 0 for the comments. But after evaluating the data we found the optimization well worth it anyway: (a) the key takes more space than a count, 8 bytes versus 4; (b) at the application layer, the combined request reduces request pressure and improves response time. (The exact numbers are not convenient to disclose, but you can verify the conclusion by sampling a random set of public posts.)

3. Data structure optimization. The analysis of Redis memory use in Scenario Two showed it to be very "extravagant": a large number of repeated pointers and reserved fields, plus a lot of memory lost to fragmentation, although Redis pays this price largely for generality. For our case, our colleague @Fruit Daddy designed a lighter, simpler data structure that uses memory much better. The core ideas:

A. Store the repost and comment counts in the following item structure:

    struct item {
        int64_t weibo_id;
        int     repost_num;
        int     comment_num;
    };

We store numbers, not strings, and there are no extra pointers to store, so the two counts together take only 16 bytes.

B. At program start, allocate one large block of memory (table_size * sizeof(item)) and zero it.

C. On insert: compute h1 = hash1(weibo_id) and h2 = hash2(weibo_id). If slot h1 % table_size is empty, store the item there; otherwise set s = 1 and probe slot (h1 + h2 * s) % table_size; if that is also occupied, increment s and keep probing until an empty slot is found...

D. On query: probe in the same order as insertion; when an item is found, compare weibo_id with item.weibo_id: if they match, it is a hit; if an empty slot is reached first, the value is 0.

E. On delete: find the slot and set a special tombstone flag; a later insert can reuse a tombstoned slot.

We measured that with 200 million items in an array of this kind, keeping occupancy below 95%, the collision rate is acceptable (in the most tragic case a lookup may need a few hundred memory probes to find its slot, which is still entirely acceptable performance). A minimal sketch of this table appears after point 4 below.

After this optimization, our total data volume becomes 40 billion * 16 B = 640 GB: less than one tenth of Scenario Two!

4. Value optimization for reposts and comments. Observing further, we found that although a great many posts do have reposts and comments, the values are generally small, in the hundreds or low thousands; posts with more than tens of thousands are very rare (our data survey shows below one in ten thousand). So we upgraded the item to:

    struct item {
        int64_t        weibo_id;
        unsigned short repost_num;
        unsigned short comment_num;
    };

For posts whose repost or comment count exceeds 65535, we store the special flag 0xFFFF in that field and look the real value up in a separate dict (which gets none of these optimizations). In truth, unsigned short could be squeezed further for extreme cases, e.g. a 12-bit field, but that is more complex for marginal benefit, so we stayed with unsigned short.

After this optimization, our total data volume becomes 40 billion * 12 B = 480 GB, an amount that almost fits within the storage capacity of a single machine. Queries per second drop from 1,000,000 to 500,000 (one lookup now returns both counts), while updates stay at only tens of thousands per second, so the update volume can be ignored for now.
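Here is a minimal, self-contained sketch of the double-hashing table described in points A-E. The hash functions, table size, and packing pragma are illustrative assumptions, not the production values; the side dict for counts above 65535 is omitted:

    #include <stdint.h>
    #include <stdio.h>

    #define TABLE_SIZE (1u << 20)   /* illustrative; production is far larger */
    #define OVERFLOW   0xFFFF       /* counts >= 65535 go to a side dict */

    /* 12-byte item; packing avoids the 4 padding bytes a compiler would add. */
    #pragma pack(push, 1)
    struct item {
        int64_t        weibo_id;    /* 0 marks an empty slot (IDs are never 0) */
        unsigned short repost_num;
        unsigned short comment_num;
    };
    #pragma pack(pop)

    static struct item table[TABLE_SIZE];  /* "one large block, zeroed" */

    /* Two independent hash functions; these mixers are illustrative. */
    static uint64_t hash1(int64_t id) { return (uint64_t)id * 0x9E3779B97F4A7C15ull; }
    static uint64_t hash2(int64_t id) { return ((uint64_t)id >> 17) * 0xC2B2AE3D27D4EB4Full | 1; }

    static struct item *find_slot(int64_t weibo_id) {
        uint64_t h1 = hash1(weibo_id), h2 = hash2(weibo_id);
        for (uint64_t s = 0; s < TABLE_SIZE; s++) {
            struct item *it = &table[(h1 + h2 * s) % TABLE_SIZE];
            if (it->weibo_id == weibo_id || it->weibo_id == 0)
                return it;               /* hit, or the first empty slot */
        }
        return NULL;                     /* table full (load kept < 95%) */
    }

    void incr_repost(int64_t weibo_id) {
        struct item *it = find_slot(weibo_id);
        if (!it) return;                 /* production would resize here */
        it->weibo_id = weibo_id;         /* no-op if the slot was already ours */
        if (it->repost_num < OVERFLOW) it->repost_num++;
    }

    /* An empty slot means the count is 0 (the sparse-data optimization). */
    unsigned short get_repost(int64_t weibo_id) {
        struct item *it = find_slot(weibo_id);
        return (it && it->weibo_id == weibo_id) ? it->repost_num : 0;
    }

    int main(void) {
        incr_repost(5612814510546515491LL);
        incr_repost(5612814510546515491LL);
        printf("reposts: %u\n", (unsigned)get_repost(5612814510546515491LL)); /* 2 */
        printf("unknown: %u\n", (unsigned)get_repost(42));                    /* 0 */
        return 0;
    }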
4.1 Further value optimization

@Wu Tingbin: Also, the 64-bit value could be recompressed with a UTF-8-like variable-length idea. And since CPU/memory is not the bottleneck, you could put weibo_id and its value in two separate arrays linked by the same index. You would then notice that most bits of the value array are 0, so you could consider compressing the value data in 1K units before it goes to memory; the compression ratio should be amazing.

@Wu Tingbin: Reply @cydu: the values could use a two-dimensional array: compress in 1K units, each row representing 1K of data, written compressed. Each row might then take only on the order of 100 bytes.

@cydu: This really works. Variable-length encoding becomes worthwhile, and the CPU should not be the bottleneck anyway: on update, re-encode the whole block; on read, fetch and decompress the whole block. Another advantage is that adding a column becomes much more convenient, whereas right now the cost of adding one is actually very high. Early on I also wanted variable-length compression, but my thinking was confined to compressing within a single value. With only two columns we chose fixed-length storage, partly because variable-length encoding has its own overhead (length tags cost bits), and partly because the fixed-length layout combines with the key optimization below and keeps the number of fields small.

With @Wu Tingbin's two-dimensional layout, the benefit of block compression shows up. I can store the keys separately and pack the values, 1024 values or even more, into one mini block. At fixed length this mini block is 1024 * 32 bits = 4 KB, but it actually contains a huge number of zero bits, so I do not even need my own clever variable-length coding: just run the 4 KB through LZF compression and store only the compressed bytes, decompressing first on read. The ratio depends on the data, but compressing ordinary data to 50% should be very easy; that means saving at least 40 billion * 2 B = 80 GB of memory.

The biggest benefit of this scheme is not even the 80 GB saved, but:

1. The handling of counts above 65535 mentioned earlier can be simplified: longer values no longer hurt, so the whole scheme gets simpler. (This needs validation against real data to see which variant is better.)

2. [Quite important!!] For the Weibo count we will inevitably need to add columns, e.g. other comment-like counts. Under my original proposal the cost of that is quite high: a whole new large array, with a hint value set up in advance (and for a new business the hint value is hard to choose, yet its effect on performance and memory use is deadly!). Under this scheme, no matter how many columns are added it hardly matters: memory grows only with the amount of actual data!

With this optimization I conservatively estimate that we can save another 80 GB of memory on top of the previous figure.
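A minimal sketch of the mini-block idea, assuming liblzf (lzf.h, the same compressor Redis bundles) is available; the block geometry, struct layout, and the store/load split are illustrative:

    #include <stdint.h>
    #include <string.h>
    #include <stdlib.h>
    #include "lzf.h"   /* liblzf: lzf_compress / lzf_decompress */

    #define BLOCK_VALUES 1024                  /* values per mini block */
    #define RAW_BYTES    (BLOCK_VALUES * 4)    /* 1024 * 32 bits = 4 KB */

    struct value_block {                       /* assumed zero-initialized */
        uint16_t comp_len;                     /* 0 means stored uncompressed */
        unsigned char *data;                   /* comp_len bytes, or RAW_BYTES raw */
    };

    /* Re-encode one block after an update: compress 4 KB of values; keep the
     * raw form only if LZF cannot shrink it (lzf_compress returns 0 when the
     * output does not fit in the given buffer). */
    void block_store(struct value_block *b, const uint32_t values[BLOCK_VALUES]) {
        unsigned char tmp[RAW_BYTES];
        unsigned int n = lzf_compress(values, RAW_BYTES, tmp, RAW_BYTES - 1);
        free(b->data);
        if (n > 0) {
            b->comp_len = (uint16_t)n;
            b->data = malloc(n);
            memcpy(b->data, tmp, n);
        } else {
            b->comp_len = 0;
            b->data = malloc(RAW_BYTES);
            memcpy(b->data, values, RAW_BYTES);
        }
    }

    /* On read: decompress the whole block, then index into it. */
    void block_load(const struct value_block *b, uint32_t values[BLOCK_VALUES]) {
        if (b->comp_len)
            lzf_decompress(b->data, b->comp_len, values, RAW_BYTES);
        else
            memcpy(values, b->data, RAW_BYTES);
    }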

5. Key optimization

@Wu Tingbin: Very good article. The weibo_id is 8 bytes, and compression can press it close to 4 bytes. If a batch of keys is AB, AC, AD, AE, AF, XE, XY, XZ, put all the A-prefixed ones together in one chunk of memory and the X-prefixed ones at the start of another, storing only B, C, D, E, F and Y, Z inside. That basically saves 4 bytes per key. Could that save 40 billion * 4 B = 160 GB?

@drdrxp: Divide the storage into 2^24 partitions, with weibo_id % 2^24 giving the partition number. Within a record, use 40 bits to store weibo_id / 2^24, another 12 bits for the repost count, and 12 bits for the comment count; one record then totals 8 bytes, and 480 GB can be optimized down to 320 GB. If you look at the actual distribution of repost/comment counts, it could be optimized further.
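As a rough sketch of the packing @drdrxp describes (the field order and helper names are illustrative; counts above 4095 would need an overflow path like the 0xFFFF trick above):

    #include <stdint.h>

    #define PARTITIONS (1u << 24)   /* 2^24 partitions; partition = id % 2^24 */

    /* One 8-byte record: 40 bits of id-high, 12 bits reposts, 12 bits comments. */
    typedef uint64_t packed_rec;

    static uint32_t partition_of(uint64_t weibo_id) { return weibo_id % PARTITIONS; }

    static packed_rec pack(uint64_t weibo_id, uint32_t repost, uint32_t comment) {
        uint64_t id_high = weibo_id / PARTITIONS;   /* fits in 40 bits */
        return (id_high << 24)
             | ((uint64_t)(repost  & 0xFFF) << 12)
             |  (uint64_t)(comment & 0xFFF);
    }

    static uint64_t id_high_of(packed_rec r) { return r >> 24; }
    static uint32_t repost_of(packed_rec r)  { return (r >> 12) & 0xFFF; }
    static uint32_t comment_of(packed_rec r) { return r & 0xFFF; }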

Thanks to @Wu Tingbin and @drdrxp for these proposals; the optimization space here really is big. As mentioned below, we split the big table into smaller tables by time period / by Weibo ID range (mainly so that the colder data can be serialized out to disk). The weibo_ids within one small table are therefore close together, e.g. 5612814510546515491 and 5612814510546515987 share their high 32 bits, so we can pull the shared high 32 bits out as a property (prefix) of the small table instead of storing them in every item. That saves at least 4 bytes of every 8-byte key:

    struct item {
        int            weibo_id_low;
        unsigned short repost_num;
        unsigned short comment_num;
    };

After this optimization, our total data volume becomes 40 billion * 8 B = 320 GB ^_^ Thanks also to @drdrxp's suggestion: we had indeed considered 12-bit fields for comments and reposts, which really would optimize a lot more, but we were not sure what to do with the leftover bits, so we did not pursue it, hehe. Your suggestion and @Wu Tingbin's both work their magic on the key: great!

6. Batch queries

For the feed page we fetch N posts at a time and then query their counts, so batch queries can optimize the response time. If each batch fetches the counts of 10 posts, the pressure on the counter becomes 50k requests/second carrying 1M keys/second; for a simple full-memory service, a single machine can basically carry 50k+ requests/second.

7. Hot and cold data

Looking further at these 40 billion numbers, we found access hotspots to be highly concentrated: a large share of last year's (and older) posts are never visited at all. Instinct suggests the classic cache design of hot data in memory, cold data on disk. But introducing LRU would mean enlarging struct item and spending more memory, and the zero-valued data would have to be cached as well... So we designed a very simple memory/disk eviction policy: evict by weibo_id range (which in effect means by time). Posts older than six months are dumped to disk; posts within six months stay in memory. When the occasional user goes grave-digging (visiting and reposting/commenting on very old posts), we read from disk and put the result into a cold cache.

To make dumping old data to disk convenient, we split the big table into smaller tables, each holding the counts for a different time range, and dump in units of small tables. To make disk queries efficient, the data is sorted before the dump and indexed in memory, with the index built per block rather than per key. A block can be 4 KB or larger; via the index, a query needs at most one random IO to pull in its block, and the rest of the lookup completes in memory. (A minimal sketch of this block index follows at the end of this point.)

After this optimization, nearly 200 GB of the data moves to disk, leaving 120 GB resident in memory. And even as the number of posts keeps growing, we still only need to keep about 120 GB in memory: the data on disk keeps growing and the hotspots keep shifting, but the total amount of hot data changes very little!
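A minimal sketch of the per-block disk index, assuming records sorted by weibo_id_low and fixed 4 KB blocks (the layout and names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096
    #define ITEMS_PER_BLOCK (BLOCK_SIZE / 8)   /* 8-byte items after key optimization */

    struct disk_index {                        /* in memory: one entry per block */
        uint32_t *first_key;                   /* smallest weibo_id_low per block */
        uint32_t  n_blocks;
    };

    /* Binary-search the in-memory index for the block that could hold the key,
     * then spend the single random IO reading just that 4 KB block. */
    static long find_block(const struct disk_index *idx, uint32_t key) {
        long lo = 0, hi = (long)idx->n_blocks - 1, ans = -1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            if (idx->first_key[mid] <= key) { ans = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        return ans;                            /* -1: key below the table's range */
    }

    int read_block(FILE *f, long block_no, void *buf) {
        if (fseek(f, block_no * BLOCK_SIZE, SEEK_SET) != 0) return -1;
        return fread(buf, 1, BLOCK_SIZE, f) == BLOCK_SIZE ? 0 : -1;
    }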
8. Data persistence

Once the sorted part of the data has been flushed to disk, it is only ever read, never modified, unless a merge with colder blocks rewrites it (this merge logic is not implemented yet, because the need is not pressing). For the in-memory data, we periodically dump blocks to disk as unsorted blocks, and every memory operation is also written to an append log. If a machine fails, we reload the blocks from disk and then replay the operations in the append log to recover the data (a minimal sketch of the log record and replay appears at the end of this scenario). And at the level of the whole architecture, if the counter crashes or some other serious error corrupts the data, we can also recount from the actual data held by the storage service and restore the result into the counter. Of course, such a recount is very expensive (think how scary it would be to recount for someone with as many followers as Chen Yao), so we also maintain some secondary indexes and other simple optimizations.

9. Consistency guarantees

@lihan_harry: the article above mentions that counting has high correctness requirements, yet counting is not idempotent; how is that problem solved? @cydu reply @lihan_harry: right: there is a message queue in front, and deduplication via something like a transaction ID avoids both double-adding and dropped increments. Beyond that, the main reliance is the master-slave structure with INCR accumulation, so that even mere eventual consistency will not be too outrageous; in addition, we periodically verify the counter against the actual data in the underlying storage. @Zheng Zheng: it seems there is still a single point for write requests; also, old data gets deleted from disk; with multi-datacenter redundancy, data should not be lost when machines die; and deleting a post must also clear its related counts. @cydu reply @Zheng Zheng: yes, for the accuracy of INCR we still use a master-slave structure, so the master single-point problem remains; we rely on master-slave failover, plus after-the-fact data repair, to improve the accuracy of the data.

10. Distribution

For data redundancy and stability, and given how fast Weibo data is now growing (the count will foreseeably reach 150 billion, 200 billion, or even higher), we also do some simple partitioning at the upper layer: by weibo_id, into 4 sets (mainly to allow for subsequent data growth), each set being one master with 2 slaves behind it. This both spreads the read pressure and provides disaster tolerance (when the master dies, the slaves keep serving reads and one can be switched in). Even so, I still cannot carry 100 billion numbers and 1M queries/second by myself, and had to cheekily ask the boss to approve a dozen or so machines.

Advantages: single-machine performance is really good, memory utilization is high, and support for subsequent expansion is also quite good.
Disadvantages: less time for chasing girls, since we have to write the code ourselves... but then, if we didn't write code, what would us code farmers do?

In short: for this extreme scenario, we optimized in an equally extreme way, sacrificing part of the generality.
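As referenced in point 8, a minimal sketch of the append log and the recovery replay; the record layout and function names are illustrative assumptions:

    #include <stdint.h>
    #include <stdio.h>

    /* One log record per memory operation; fixed-size for easy replay. */
    struct log_rec {
        int64_t weibo_id;
        uint8_t op;            /* 0 = incr repost, 1 = incr comment */
    };

    /* Append one operation; fflush (fsync in production) bounds the loss window. */
    void log_append(FILE *log, int64_t weibo_id, uint8_t op) {
        struct log_rec r = { weibo_id, op };
        fwrite(&r, sizeof r, 1, log);
        fflush(log);
    }

    /* Recovery: after reloading the dumped blocks, replay every logged op.
     * incr_repost/incr_comment are the counter's normal update entry points. */
    void log_replay(FILE *log,
                    void (*incr_repost)(int64_t),
                    void (*incr_comment)(int64_t)) {
        struct log_rec r;
        while (fread(&r, sizeof r, 1, log) == 1)
            (r.op == 0 ? incr_repost : incr_comment)(r.weibo_id);
    }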


Scenario Four: A Counter Service

With Scenario Three, the Weibo counting problem is solved. But we also have users who care about their follower count, friend count, member count... A digital society naturally has lots of numbers, and behind every number is a string of stories.

In view of this, we turned the counter module into a full Counter Service, with support for dynamic schema modification (mainly additions). The core interface of the service looks roughly like this:

    // create a counter named "weibo"
    add counter weibo

    // add a column to the "weibo" counter: name weibo_id, typically 64 bits,
    // at most 64 bits, default 0; this column is the primary key
    add column weibo weibo_id hint=64 max=64 default=0 primarykey

    // add a column comment_num: typically 16 bits, at most 32, default 0
    add column weibo comment_num hint=16 max=32 default=0 suffix=cntcm

    // add a column repost_num: typically 16 bits, at most 32, default 0
    add column weibo repost_num hint=16 max=32 default=0 suffix=cntrn

    // add a column attitude_num: typically 8 bits, at most 32, default 0
    add column weibo attitude_num hint=8 max=32 default=0 suffix=cntan

    ...

    // set the counts for weibo_id=1234: comment_num, repost_num, attitude_num
    set weibo 1234 111 222 333

    // get all counts for weibo_id=1234
    get weibo 1234

    // get only comment_num for weibo_id=1234
    get weibo 1234.cntcm

    // increment comment_num for weibo_id=1234
    incr weibo 1234.cntcm

When a column is added with add column, we allocate another large array (table_size * the hint-sized value width), but this array stores no keys, only values: an entry is located by the same slot index as the item in the original key table. Values that exceed the hint-sized field continue to go to the overflow storage.

The biggest advantage of turning the counter into a service is that when we later need a new count, possibly one whose volume is not nearly so large, it can be created very quickly.

Disadvantages: key names that are not numeric degrade to string storage, though their space can be shortened with mechanisms such as base64; and frequent modification of old data can bloat the cold cache, which can be mitigated by regular merges (similar to the leveldb mechanism).
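A minimal sketch of how a dynamically added column can share the key table's slot index: one parallel value array per column, so no keys are duplicated. All names and the hint handling are illustrative assumptions:

    #include <stdint.h>
    #include <stdlib.h>

    #define TABLE_SIZE (1u << 20)   /* must match the key table's size */

    /* Each extra column is a parallel value array indexed by the same slot
     * number that locates the item in the original key table. */
    struct column {
        const char *name;
        size_t      value_bytes;    /* derived from the column's hint */
        void       *values;         /* TABLE_SIZE * value_bytes, zeroed */
    };

    struct column *add_column(const char *name, size_t value_bytes) {
        struct column *c = malloc(sizeof *c);
        c->name = name;
        c->value_bytes = value_bytes;
        c->values = calloc(TABLE_SIZE, value_bytes);  /* defaults to 0 */
        return c;
    }

    /* Read a column's value for the item at a given key-table slot. */
    uint32_t column_get(const struct column *c, size_t slot) {
        if (c->value_bytes == 1) return ((uint8_t  *)c->values)[slot];
        if (c->value_bytes == 2) return ((uint16_t *)c->values)[slot];
        return ((uint32_t *)c->values)[slot];
    }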

Scenario Five: Your Plan

For engineering problems there is never a standard answer: 1000 architects can give 10,000 designs, and none of them is the standard answer; there is only the one that suits you best! Above I have simply shared my own thinking process and the core concerns at each stage, and everyone is welcome to discuss it together. I look forward to your ideas and solutions! And to your resume: please private-message @cydu or @Weibo Platform Architecture. Of course, beyond the counter, which is just one typical small case, the Weibo platform has many more challenges waiting for your solutions!
