1. Problem description

Last weekend a friend who works on live video streaming brought me a question about concurrent access to a huge number of in-memory objects. The scenario: during a live broadcast, viewers must be able to chat freely with one another (much like a QQ group). This means the server side has to maintain a list of online users, and that list can be very large (up to 3 million people watching and chatting). Their approach splits the backend service into two tiers:
Figure-1

gate handles client connections and message distribution; chat handles user identification, user management, and message forwarding. The user list therefore lives in chat. Their problem: once the user list grows large, chat's processing capacity drops sharply. I asked him in detail about the data structure and the concurrency control around the user list, which is where the problem begins.
2. Analysis of the problem

Let us first analyze their implementation. They use C++ and the STL, and anyone familiar with C++/STL will immediately think of std::map for managing the list. Right, that was their idea. Here is a simplified sketch of their implementation:
    #include <stdint.h>
    #include <pthread.h>
    #include <map>

    class User {
    public:
        uint64_t user_id;
        /* todo: other basic user information */
        pthread_mutex_t mutex;   /* protects this user's data under concurrent access */
    };

    std::map<uint64_t, User*> user_map;
    pthread_mutex_t user_map_mutex;  /* protects user_map under concurrent access */
The map-managed user list must support add, delete, modify, lookup, and traversal. For example, to operate on a single user:
    #define LOCK(m)   pthread_mutex_lock(&(m))
    #define UNLOCK(m) pthread_mutex_unlock(&(m))

    LOCK(user_map_mutex);
    std::map<uint64_t, User*>::iterator it = user_map.find(id);
    if (it != user_map.end()) {
        User* u = it->second;
        UNLOCK(user_map_mutex);
        LOCK(u->mutex);
        operate(u);   /* may take a long time: sending network messages,
                         writing to disk, RPC calls, and so on */
        UNLOCK(u->mutex);
    } else {
        UNLOCK(user_map_mutex);
    }
Other operations are similar. There are several serious concurrency problems with this implementation:
1. Every operation must lock user_map.
2. Every per-user operation must additionally lock that user's mutex.
3. The function applied to a user may run for a long time, e.g. a socket send or an RPC call.
3. Optimizing contention on user objects

Because chat is a single-node, multithreaded concurrent system, lock contention between threads becomes severe when there are many network events. The most obvious culprits are problems 2 and 3, which are really the same problem. To solve it, we just have to remove the mutex from User. How? My idea is to use a reference count on the user object. For example:
    class User {
    public:
        uint64_t user_id;
        /* other user information */
        int ref;   /* when the reference count reaches 0, the object is freed */
    };

    void add_ref(User* u)
    {
        __sync_add_and_fetch(&u->ref, 1);        /* atomic: a plain ++ would race */
    }

    void release_ref(User* u)
    {
        if (__sync_sub_and_fetch(&u->ref, 1) <= 0)
            delete u;
    }
Operation rules for reference counting:
When a user is added to the user list, call add_ref.
When a user is removed from the user list, call release_ref.
When a thread takes a reference to the user object, call add_ref.
When a thread is finished with that reference, call release_ref.
Then an operation on a user becomes:
    LOCK(user_map_mutex);
    std::map<uint64_t, User*>::iterator it = user_map.find(id);
    if (it != user_map.end()) {
        User* u = it->second;
        add_ref(u);
        UNLOCK(user_map_mutex);
        operate(u);   /* may take a long time */
        release_ref(u);
    } else {
        UNLOCK(user_map_mutex);
    }
The reference count solves the per-user lock problem nicely, but it introduces a new issue: when multiple threads modify the same user's information at the same time, the data itself is left unprotected. Our way of handling this is very simple: whether the operation is an add, a modify, or a delete, first remove the user's existing entry from user_map, then insert the new information. For example, the modify operation:
    LOCK(user_map_mutex);
    std::map<uint64_t, User*>::iterator it = user_map.find(id);
    if (it != user_map.end()) {
        User* u = it->second;
        user_map.erase(it);            /* remove the old entry */
        User* update_u = new User();
        copy(update_u, u);             /* copy the old information */
        release_ref(u);                /* drop the map's reference to the old object */
        update(update_u);              /* modify the user data */
        add_ref(update_u);
        user_map.insert(std::make_pair(update_u->user_id, update_u));
        UNLOCK(user_map_mutex);
    } else {
        UNLOCK(user_map_mutex);
    }
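Add and delete implementations follow the same pattern. For completeness, here is a minimal sketch of both paths under the same conventions; make_user() is a hypothetical helper of mine (not in the original code) that allocates a User and fills in its fields:

    /* add */
    LOCK(user_map_mutex);
    User* new_user = make_user(id);    /* hypothetical: allocate and fill in the User */
    add_ref(new_user);                 /* the map now holds a reference */
    user_map.insert(std::make_pair(new_user->user_id, new_user));
    UNLOCK(user_map_mutex);

    /* delete */
    LOCK(user_map_mutex);
    std::map<uint64_t, User*>::iterator it = user_map.find(id);
    if (it != user_map.end()) {
        User* u = it->second;
        user_map.erase(it);
        release_ref(u);                /* frees u once the last reference is gone */
    }
    UNLOCK(user_map_mutex);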
Object reference counting solves the contention on per-user data locks well, and as long as user_map holds a modest number of users (say, under 10,000) it avoids the concurrency problems of the add, delete, modify, and lookup operations. But it solves neither the concurrency of full-map scans nor the contention on per-user operations once user_map grows large. The reason is that both a full scan and any single-user operation must lock user_map, which is exactly problem 1. Under heavy concurrent load this user_map lock generates intense contention and wastes resources.
4. Giving up std::map

To get rid of this lock we return to the first problem in the analysis. It is well known that std::map does not support multithreaded concurrency, and its operations are not friendly to the CPU cache. To replace the global lock with finer-grained locks, std::map has to go. For large amounts of data, the usual choice is to organize it with a hash table or a btree (anyone who has studied database storage engines knows this, hehe!). For simplicity, the analysis below uses a hash table as the example.
Figure-2
Figure 2 shows the structure of the hash table: the hash buckets form an array, and each array slot holds a pointer to a chain of user structures. With the hash table structure understood, we return to narrowing the lock granularity. Suppose we define a hash table with 1024 buckets and a pthread_mutex_t array of length 256. Reducing the lock granularity is then simple:
The first mutex in the array is responsible for buckets 0, 256, 512, and 768; the second for buckets 1, 257, 513, and 769; and so on. Computing which mutex protects a given bucket is just:
    mutex_index = bucket_seq % mutex_array_size;
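To make the scheme concrete, here is a minimal sketch of a lock-striped lookup under the assumptions above (1024 buckets, 256 mutexes); the struct layout, the identity hash, and the names are illustrative choices of mine, not the original code:

    #include <stdint.h>
    #include <pthread.h>

    #define BUCKET_COUNT 1024
    #define MUTEX_COUNT  256

    struct Node {
        User* user;          /* the User class defined earlier */
        Node* next;          /* chain of colliding keys */
    };

    struct HashTable {
        Node*           buckets[BUCKET_COUNT];
        pthread_mutex_t mutexes[MUTEX_COUNT];  /* initialize each with pthread_mutex_init() */
    };

    User* find_user(HashTable* ht, uint64_t id)
    {
        uint64_t bucket_seq = id % BUCKET_COUNT;   /* any reasonable hash(key) works here */
        pthread_mutex_t* m = &ht->mutexes[bucket_seq % MUTEX_COUNT];

        pthread_mutex_lock(m);   /* blocks only 1/256th of the table */
        User* found = NULL;
        for (Node* n = ht->buckets[bucket_seq]; n != NULL; n = n->next) {
            if (n->user->user_id == id) {
                found = n->user;
                add_ref(found);  /* pin the object before dropping the stripe lock */
                break;
            }
        }
        pthread_mutex_unlock(m);
        return found;            /* caller calls release_ref() when done */
    }

Two threads now contend only when their keys land on the same mutex stripe, so contention drops by roughly a factor of 256 compared with one global lock.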
This implementation is easy to understand. For operations on the user objects themselves we still use reference counting. With the lock granularity subdivided like this, the whole user list achieves very good concurrency; and because the buckets are a contiguous array, the structure is also friendly to the CPU L1/L2 cache, which greatly improves the cache hit rate. Optimizing to this point covers perhaps 90% of the work. But a few questions remain:
1. Why use pthread_mutex_t? Doesn't it cause unnecessary OS context switches under high concurrency?
2. Besides a hash table, what other data structures can support fine-grained locking?
For the first question, we can use CPU atomic operations to implement a simple spinlock. For example:
    #include <sched.h>

    /* cpu_pause: the x86 PAUSE instruction, a hint that we are spin-waiting */
    #define cpu_pause() __asm__ __volatile__("pause")

    void LOCK(int* q)
    {
        int i;
        while (__sync_lock_test_and_set(q, 1)) {
            for (i = 0; i < 32; i++)   /* spin briefly; the original spin count was
                                          lost in extraction, 32 is an assumption */
                cpu_pause();
            sched_yield();   /* yield the CPU and let the OS reschedule this thread */
        }
    }

    #define UNLOCK(q) __sync_lock_release((q))
Then the pthread_mutex_t array can be dropped and replaced with a plain int array.
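A hedged sketch of the substitution, reusing the names from the hash table sketch above (all of them illustrative):

    int bucket_locks[MUTEX_COUNT];   /* zero-initialized at file scope: 0 = unlocked */

    /* in find_user() and friends, instead of pthread_mutex_lock/unlock: */
    int* q = &bucket_locks[bucket_seq % MUTEX_COUNT];
    LOCK(q);
    /* ... operate on the bucket chain ... */
    UNLOCK(q);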
Why does this work? Doesn't the spinning inside LOCK() waste CPU? Consider it together with our hash table: an add, delete, lookup, or update on the hash table generally completes within a few hundred CPU instruction cycles (not counting the time to compute the hash function, since hash(key) can be computed before taking the lock). In other words, waits for the lock are short. Compared with the context-switch losses the operating system incurs with a pthread lock, letting the CPU spin-wait is well worth it. You can test this yourself, hehe.
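A rough back-of-envelope check, using round figures I am assuming rather than numbers from the article: at 2.4 GHz, a few hundred cycles is on the order of 100 ns, while a context switch triggered by a blocking mutex typically costs on the order of microseconds. Spinning through a critical section that short is therefore one to two orders of magnitude cheaper than sleeping and being rescheduled.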
I ran a preliminary concurrency test of this hash table structure. Test machine: 4-core 2.4 GHz CPU, 16 GB memory. The program ran 8 threads against a hash table holding 8 million user records, and sustained roughly 1 million queries per second and roughly 500,000 add/delete/update operations per second.
5. Thinking
Back to the original problem: it is really about managing a huge number of objects in memory, and that is not new technology. In database storage engines such solutions are everywhere, e.g. memcache's index implementation, InnoDB's adaptive hash index and btree implementations, and the memtable implementation in LSM trees; all of them solve problems of exactly this kind. From the analysis of this problem we can draw the following conclusions:
1. C++ practitioners should use STL/boost with caution in high-concurrency designs; many of their data structures are not friendly to multicore concurrency. This point applies specifically to C++.
2. Many seemingly very hard problems already have good solutions in other systems fields. As a C++ practitioner you should study database kernels, operating system kernels, and programming language runtimes (JVM/Go runtime); these three places hold more technical treasure than can ever be dug out.
3. C++'s facilities for multicore concurrency control can fairly be called primitive. As C++ practitioners we should learn more about how the CPU works and about the C++ memory model, which helps us analyze system bottlenecks and optimize.
4. Giving something up can mean gaining more: abandon C++ and write systems like this in a language better suited to concurrent programming, such as Go, Scala, or Erlang.
Exercises left for study:
1. How would you implement lock-free concurrency for the hash table using CPU CAS plus memory barriers? Try implementing it.
2. Besides a hash table, the huge user list could also be built on a skip list or a btree. How? And how do skip list, btree, and hash table compare in strengths and weaknesses?
3. Does a hash table have any drawbacks when managing a huge user list? What are they?