How to implement a long-connection messaging system that supports hundreds of millions of users | A Golang high-concurrency case



This article is based on a talk Zhou Yang gave in the "High-Availability Architecture" group; please credit the source when forwarding.


Zhou Yang, technical manager and architect for the mobile phone assistant, is responsible for the long-connection message system and for the development and maintenance of the mobile phone assistant's architecture.


I am not sure when our group's name changed to "Python High-Availability Architecture group", so I have to say it is an honor to spend the next hour discussing Golang in a Python group...






Message System Introduction


The message system is, more specifically, a long-connection push system. It currently serves internal products as well as thousands of apps on the development platform, and it also supports some chat scenarios. It multiplexes multiple apps over a single channel, supports upstream data, and provides access to upstream data and user-status callbacks at different granularities.

At present, the whole system is divided into nine full-featured clusters by business, deployed across multiple IDCs (each cluster spans several IDCs), with real-time online users on the order of hundreds of millions. Typically, the push messages for our products on PCs, phones, and even smart hardware are sent from this system.


Comparison and performance indicators of push systems



Many peers care about how Go performs when implementing a push system: how does single-machine performance compare with similar systems implemented in other languages? And for a start-up, which third-party cloud push platform is recommended?






In fact, most major companies have similar push systems, and the market offers cloud services with similar functionality. Our company also had early systems implemented in Erlang and Node.js, and I have been asked to make such comparisons before. My feeling is that when discussing comparative data it is hard to guarantee identical environments and requirements, so I can only share my own experience here; the data is real, but it carries a lot of context with it.



The first important indicator: number of connections per machine



Peers who work on long connections will know from experience that when connections are stable, the raw connection count means little unless it is compared against network throughput. Maintaining an idle connection consumes very little CPU; each connection's TCP stack takes roughly 4 KB of memory. With system parameters tuned, our single-machine tests reached 3 million long connections per instance. Pushing the number higher than that is, in my opinion, of little value.



In a real network environment, a single instance holding 3 million long connections faces a theoretically heavy load. In a weak mobile network the disconnection rate is very high; assume one in every thousand users disconnects and reconnects per second. With 3 million connections, new connections reach 30,000 per second, and each of those 30,000 users has to register, load offline storage, and make other internal RPC calls. On top of that, the 3 million connections must keep their heartbeats alive; assuming one heartbeat every 300 seconds, that is 10,000 heartbeat packets per second. Add unicast, multicast, and broadcast forwarding, which themselves respond to internal RPC calls, plus the GC pressure of 3 million connections, and keeping internal interface latency stable becomes difficult. Concentrating all of this in one instance makes availability a challenge, so in production a single instance does not hold that many long connections; the actual number should be decided by the network conditions of the clients being served.



The second important indicator: memory usage of the messaging system



Here Go does carry some extra overhead because of goroutines. But a few things need to be pinned down before comparing two push systems. For example, does the system need to be full-duplex (can reads and writes happen simultaneously)? If half-duplex, in theory a single goroutine per connection is enough (though disconnect detection may then be delayed); if full-duplex, each connection needs one goroutine for reading and one for writing. The memory overhead of the two schemes differs.



In addition, the size of the test messages often determines how large the read/write buffers on each connection are set, and whether they are globally reused, exclusive to each connection, or allocated dynamically. Whether the connection is full-duplex also determines how buffers are allocated. Different strategies can behave very differently under different test conditions.
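
To make the half-duplex versus full-duplex distinction concrete, here is a minimal sketch of the two connection-handling models in Go. It assumes a plain net.Conn and per-connection buffers; the function names, buffer sizes, and timeout values are illustrative assumptions, not the production code.

```go
package main

import (
	"bufio"
	"net"
	"time"
)

// Half-duplex: one goroutine per connection alternates read and write.
// Cheapest in goroutines and buffers, but a slow write delays disconnect detection.
func handleHalfDuplex(c net.Conn) {
	defer c.Close()
	buf := make([]byte, 4096) // exclusive per-connection buffer; could also come from a shared pool
	for {
		c.SetReadDeadline(time.Now().Add(5 * time.Minute))
		n, err := c.Read(buf)
		if err != nil {
			return
		}
		if _, err := c.Write(buf[:n]); err != nil { // reply on the same goroutine
			return
		}
	}
}

// Full-duplex: one goroutine reads while another drains an outbound queue,
// so reads and writes never block each other, at the cost of twice the
// goroutines (stacks) and buffers per connection.
func handleFullDuplex(c net.Conn, outbox <-chan []byte) {
	defer c.Close()
	go func() { // writer goroutine
		w := bufio.NewWriter(c)
		for msg := range outbox {
			if _, err := w.Write(msg); err != nil {
				return
			}
			w.Flush()
		}
	}()
	r := bufio.NewReader(c)
	buf := make([]byte, 4096)
	for { // reader goroutine
		if _, err := r.Read(buf); err != nil {
			return
		}
	}
}
```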



The third important indicator: message throughput per second



This, too, depends on the QoS level we require for message delivery (differences in ACK/reply policy) and on the architectural model, since each strategy fits different scenarios: is it pure push, or a push-pull combination? Is the message log even enabled? How is the logging library implemented, how large are its buffers, and what is its flush policy? All of these affect the throughput of the whole system.



In addition, HA increases internal communication cost, and compensation strategies for brief disconnections are provided to guard against low-probability events; these should all be taken into account. Only if all of this is stripped away are you really comparing the performance of the underlying libraries.



So I can only give approximate figures: on a 24-core, 64 GB server, with at-least-once QoS, pure push, and message bodies of 256 B to 1 KB, a single instance holding 1 million real users (2 million+ goroutines) peaks at 20,000-50,000 QPS. Memory stabilizes around 25 GB, and GC pauses are around 200-800 ms (with room for further optimization).



In production we normally keep a single instance under 800,000 users, with at most two instances per machine. In fact, for push workloads the bottleneck is usually not speeding output up but limiting it, to prevent the push system's instantaneous throughput from turning into a DDoS attack on the business servers behind it. So as far as performance goes, I think Go can be used with confidence; at our scale it has stood the test, and with the arrival of Go 1.5 the earlier investment feels like it has paid off even more.



Introduction to the Message system architecture



Here is a general introduction to the message system. Some of you may have seen the talk shared at Gopher China; below is a brief explanation of the architecture and the role of each component, plus some information that was missing at the time:



The architecture diagram is shown below; all services are written in Go.






The most important components are described below:



The Dispatcher service returns a set of IPs to the client based on the client's request information, choosing long-connection servers appropriate to the client's network and region. The client then establishes a long connection to the gateway using the returned IPs.



The Room service is the long-connection gateway: it holds user connections, registers users into the Register service, and also applies access security policies such as whitelists and IP restrictions.



The Register service is our global session store; it stores and indexes users' connection information for access and queries.



The Coordinator service forwards users' upstream data, including callbacks of user status information to subscribers, and also coordinates asynchronous operations between components, such as kicking a user, which requires asynchronously removing the user from the Register.



The Saver service is the storage access layer that handles Redis and MySQL operations and also provides some business-related in-memory caches; for example, broadcast information loaded at send time can be cached in Saver. Other policies also live there: for instance, if a client SDK has been maliciously or accidentally modified and never replies with an ACK after loading a message, the server never deletes the message and it gets loaded over and over, forming an endless loop. Saver can detect and break such loops. (The client is never to be trusted.)



The Center service is the internal API server, exposing unicast and broadcast interfaces, state query interfaces, and other APIs, including operations and management APIs.



Two common examples illustrate the mechanism. To send a unicast to a user, Center first queries Register to obtain the connection channel identity the user registered earlier and the address of the Room instance holding it, then delivers the message to the user through that Room service. For heavy work such as a full broadcast, the task has to be broken down into a series of sub-tasks distributed to all Center instances; each sub-task fetches all online and offline users and then batch-pushes to the Room services. The whole cluster is usually under a lot of pressure at that moment.
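
As a rough illustration of the unicast path just described, the sketch below wires Register and Room together behind hypothetical interfaces; none of these type or method names are the real internal APIs.

```go
package main

// Session is what the global session store returns for an online user.
type Session struct {
	RoomAddr string // which long-connection (Room) instance holds the user
	ConnID   uint64 // channel identity of the connection on that instance
}

// RegisterClient and RoomClient stand in for RPC stubs to the two services.
type RegisterClient interface {
	Lookup(userID string) (Session, bool)
}

type RoomClient interface {
	PushTo(addr string, connID uint64, msg []byte) error
}

// Unicast looks the user up in Register, then asks the Room instance that
// owns the connection to deliver the message.
func Unicast(reg RegisterClient, room RoomClient, userID string, msg []byte) error {
	s, online := reg.Lookup(userID)
	if !online {
		// Offline: in the real system the message would be handed to Saver
		// for offline storage instead of being dropped.
		return nil
	}
	return room.PushTo(s.RoomAddr, s.ConnID, msg)
}
```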



The deployd/agent services are used to deploy and manage processes and to collect status and metrics for each component; ZooKeeper and Keeper handle configuration management and simple scheduling for the whole system.



About the server-side architecture of push



Common push models include long-polling pull, direct server push (our message system is mainly this), and a push-pull combination (the push only sends a notification, and the client pulls the messages after receiving it).



There is not much to say about the pull approach; it is rarely used now. Early implementations were mostly nginx+lua+redis long polling. The main problems are relatively high cost and poor timeliness, with few optimization options.



The direct push model is what our message system uses: messages are consumable, and the same user is not allowed to consume a message twice; if multiple devices need to consume it separately, they have to be abstracted as different users.



The advantage of push is good real-time behavior and low overhead: the message is sent directly to the client, and the client does not need to actively pull through the access layer down to the storage layer.



But the pure push model has a big problem: because the system is asynchronous, message ordering cannot be strictly guaranteed. That is fine for push requirements, but reusing a push system for IM-style communication may not be appropriate.



For systems with strict ordering requirements where messages can be consumed repeatedly, the model is now push plus pull: our push system only sends a notification carrying an ID and other data used for client-side decisions, and the client, based on the pushed key, actively pulls the messages from the business server. When there is master-slave replication delay, subsequent pushed keys trigger a deferred pull. At the same time, some messages can still use the pure-push QoS, for example low-priority "typing" indicators, which do not need an active pull and are consumed directly from the push.



What factors determine the effect of the push system?



The first is the maturity of the client SDK: its strategies and details often determine the final push quality in weak network environments.



SDK routing strategy. Some basic points: some open-source services hash a user to a fixed IP in an access region; in the domestic network environment this does not really work. A good allocator (dispatcher) returns a group of IPs rather than one, and ports must be included; when necessary the client retries across the group, and when none of the group is reachable it asks again and gets servers in a different IDC. We frequently observe that different users in the same region can have different connectivity to IPs within the same IDC, and even different connectivity to different ports of the same IP, so the client's routing strategy must be flexible and the policy complete. In addition, during routing the client should cache the long-connection IPs per network condition; when the network environment switches (WiFi, 2G, 3G), it should re-request the allocator and cache long-connection IPs separately for each network environment.
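
A hedged sketch of the client-side routing cache described above: the dispatcher returns a group of ip:port candidates, and the client keeps a separate cache per network type so a WiFi/2G/3G switch triggers a fresh dispatch. The types and function names here are purely illustrative.

```go
package main

type NetworkType string

const (
	WiFi NetworkType = "wifi"
	G2   NetworkType = "2g"
	G3   NetworkType = "3g"
)

// RouteCache keeps candidate "ip:port" endpoints per network environment.
type RouteCache struct {
	byNetwork map[NetworkType][]string
}

// Endpoints returns cached candidates for the current network, asking the
// dispatcher again only when this network type has no usable entries left.
func (rc *RouteCache) Endpoints(nt NetworkType, dispatch func() []string) []string {
	if eps, ok := rc.byNetwork[nt]; ok && len(eps) > 0 {
		return eps
	}
	eps := dispatch() // e.g. an HTTP call to the dispatcher service
	if rc.byNetwork == nil {
		rc.byNetwork = make(map[NetworkType][]string)
	}
	rc.byNetwork[nt] = eps
	return eps
}
```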



Client-side heartbeat and read/write timeout settings, and a solid disconnection detection and reconnection mechanism



For different network environments, and for how actively the client itself sends and receives messages, the heartbeat interval should adapt and be negotiated with the server to keep the link alive. In a weak network environment, besides network switches (WiFi to 3G) and read/write errors, deciding when to re-establish the link is also a problem: after the client sends a ping, how long it waits without a response on a given network before concluding the network is broken and rebuilding the link is a trade-off. Likewise, reading messages of different lengths on different networks deserves different tolerance times; one size does not fit all. Good heartbeat and read/write timeout settings let the client detect network problems as quickly as possible, re-establish the link, and still complete large transfers during network jitter.
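
The following is a minimal sketch of a client-side heartbeat with a pong deadline, assuming a framed ping/pong exchange on the connection; sendPing, waitPong, and the interval values are illustrative placeholders, not the real SDK.

```go
package main

import (
	"errors"
	"net"
	"time"
)

// keepAlive pings the server every interval and treats a missing pong within
// pongTimeout as a dead link, so the caller can rebuild the connection.
// Both durations would be negotiated per network type (WiFi vs 2G/3G).
func keepAlive(c net.Conn, interval, pongTimeout time.Duration) error {
	for {
		time.Sleep(interval)
		if err := sendPing(c); err != nil {
			return err // write failed: reconnect
		}
		c.SetReadDeadline(time.Now().Add(pongTimeout))
		if err := waitPong(c); err != nil {
			return errors.New("pong timeout, reconnect")
		}
	}
}

// Placeholders for the protocol-specific ping/pong framing.
func sendPing(c net.Conn) error { _, err := c.Write([]byte{0x1}); return err }
func waitPong(c net.Conn) error { buf := make([]byte, 1); _, err := c.Read(buf); return err }
```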



Combining with server-side strategy



In addition, the client can cooperate with the server on some special strategies. For example, when routing we try to map the same user to the same service instance, and on disconnection the client first retries the address it last connected to successfully. This mainly makes it easier for the server to handle brief disconnections: the user's state is staged on the instance during the flash disconnect, and on reconnection it can be restored within that single instance, reducing latency and load.



Client KeepAlive Policy



Many start-ups are willing to build their own push system; it is not hard to implement. With a complete protocol (at the simplest, the server keeps any data the client has not ACKed), the server can guarantee messages are not lost. The real question is why the arrival rate within the message's validity period still falls short. Often it is because the app's own push service cannot stay alive. If you choose a cloud platform or a big vendor, their SDKs usually implement keep-alive strategies such as coexisting with other apps and waking each other up, which is one reason cloud push services are more reliable. I believe that for many cloud platforms, multiple apps sharing the same SDK can wake each other up to keep the service alive. The push SDK itself being a single connection multiplexed by multiple apps also adds new challenges to the SDK implementation.



In summary, when choosing a push platform I would first look at how mature the client SDK is. For the server side the criteria are simpler: the more access points (IDCs) deployed and the finer the routing strategy, the more reassuring it is. As for how many points each cloud service actually has, group members from different regions could team up and test it.



Go language development problems and solutions



Below are the challenges we hit developing in Go and the optimization strategies we used. Here is a picture from back then, taken the day before the first version's optimization plan went live:






As you can see, peak memory consumption reached 69 GB, and GC pauses on a single instance reached 3-6 seconds. Imagine the unlucky request that passes through several components all in the middle of a GC: it is bound to time out, and the access retries triggered by GC add further load to the system. At that time, in the worst case, the whole system was being restarted every two or three days.



The problems we had at the time can be summarized roughly as the following points:



1. I/O, buffers, and objects scattered across goroutines are not reused



At that time (2012), because Go's GC efficiency was limited, we were not very restrained: the program created large numbers of short-lived goroutines, and many I/O operations for internal communication were made asynchronous by spawning a separate goroutine, simply because we did not want to block the main loop or the response logic. This puts a lot of burden on the GC.



To deal with this, goroutine creation should be controlled. A long-connection application already has millions of concurrent goroutines; in many situations there is no need to do asynchronous I/O inside each of them, because the degree of parallelism is limited anyway, and in theory performing blocking operations inside a goroutine is not a problem.



Where something truly must run asynchronously, for example when running it inline would delay the user's heartbeat or leave a waiting response unanswered, it is best to hand it to a task pool: a set of resident goroutines consumes the tasks, processes them, and passes the result back to the caller through a channel (a sketch follows below). Task pooling has the extra benefits of letting you batch requests, improve throughput, and add rate-limiting policies.
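
Here is a minimal sketch of that task-pool idea: a fixed set of resident worker goroutines consumes tasks from a channel and hands each result back to the caller over a per-task channel. The names and sizes are assumptions for illustration, not the production implementation.

```go
package main

// Task couples a unit of work with the channel its result is returned on.
type Task struct {
	Do     func() interface{}
	Result chan interface{}
}

// TaskPool owns a bounded queue and a fixed number of resident workers.
type TaskPool struct {
	tasks chan *Task
}

func NewTaskPool(workers, queue int) *TaskPool {
	p := &TaskPool{tasks: make(chan *Task, queue)}
	for i := 0; i < workers; i++ {
		go func() {
			for t := range p.tasks {
				t.Result <- t.Do() // run on a resident goroutine instead of spawning a new one
			}
		}()
	}
	return p
}

// Submit enqueues work; the caller reads the result from the returned channel,
// so the connection goroutine is not held up by slow internal RPCs.
func (p *TaskPool) Submit(do func() interface{}) <-chan interface{} {
	t := &Task{Do: do, Result: make(chan interface{}, 1)}
	p.tasks <- t
	return t.Result
}
```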



2. Poor network conditions cause goroutine surges



Compared with older high-concurrency programs, goroutines, if not well controlled, can proliferate. In the early days we also found that memory on some hosts was much higher than on other servers, yet by the time we looked, all the main profiling metrics were normal.



We later realized that in a system with a lot of internal communication, network jitter and blocking are unavoidable (even on the intranet). The system keeps accepting new external requests, but during processing, internal communication congestion causes large numbers of goroutines to be created while the business goroutines waiting on communication results are not released, so there is often an instantaneous goroutine explosion. Even after the system stabilizes again, VIRT and RES are never fully returned; they drop somewhat but stay high.



To deal with this, flow-control policies need to be added. Flow control can be done in the RPC library or in the task pool mentioned above; I actually think the task pool is the more reasonable place. The RPC library can limit read/write volume, but it does not know the specific throttling policy, i.e. whether to retry, log, or cache the request onto a designated queue. The task pool is tied to the business logic, so it can clearly express the flow-control and throttling policies each interface needs (see the sketch below).
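
Extending the TaskPool sketch above, flow control at the pool boundary can be as simple as a non-blocking enqueue that surfaces overload to the caller, which then applies its interface-specific policy (retry, log, or queue). ErrOverload and TrySubmit are illustrative names, not the real implementation.

```go
package main

import "errors"

var ErrOverload = errors.New("task pool overloaded")

// TrySubmit enqueues a task only if there is room, so a burst of internal RPC
// latency cannot build an unbounded backlog of blocked goroutines.
func (p *TaskPool) TrySubmit(do func() interface{}) (<-chan interface{}, error) {
	t := &Task{Do: do, Result: make(chan interface{}, 1)}
	select {
	case p.tasks <- t:
		return t.Result, nil
	default:
		// Business-specific policy goes here: retry later, write a log entry,
		// or push the request onto a dedicated overflow queue.
		return nil, ErrOverload
	}
}
```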



3. Inefficient and costly RPC framework



The early RPC framework was relatively simple and used short connections for internal communication. That sounds like it should have overhead and bottlenecks beyond expectation; short-connection I/O is indeed less efficient, but port resources were sufficient and throughput met our needs, so it was usable. Many layered systems likewise use short HTTP connections internally.



But with early Go versions, a program written that way could not hold up at a certain scale. The huge number of temporary objects and temporary buffers created by short connections was unbearable in a program that already had millions of goroutines. So we later made two rounds of adjustments to our RPC framework.



The second version of the RPC framework uses a connection pool and communicates internally over long connections (reusing resources on both client and server: codec buffers, request/response objects), which greatly improves performance.



But with this model a connection is still occupied for the whole request/response. If the network is fine this is not a problem and it meets the need, but imagine a Room instance having to talk to hundreds of Register, Coordinator, Saver, Center, and Keeper instances behind it: you need a large number of resident connections, dozens per target machine, and thousands of connections end up occupied.



With occasional jitter (sustained jitter cannot be solved no matter what), or with a high-latency request, if there are few connections to the target IP, a large number of requests block instantly and the connections cannot be fully utilized. The third version added pipelining: pipelining brings some extra overhead, but by exploiting TCP's full-duplex nature it completes RPC calls to every service cluster with a minimal number of connections (see the sketch below).
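
A rough sketch of the pipelining idea: many callers share one long connection, each request carries a sequence number, and a single read loop matches responses back to the waiting callers, so the connection is never held for a full round trip. The framing helpers and field names are assumptions, not the actual RPC framework.

```go
package main

import (
	"net"
	"sync"
)

type pipelineConn struct {
	mu      sync.Mutex
	conn    net.Conn
	seq     uint64
	pending map[uint64]chan []byte
}

// Call registers a pending slot, writes the framed request, and waits for its
// reply; many Calls can be in flight on the same connection at once.
func (c *pipelineConn) Call(payload []byte) []byte {
	ch := make(chan []byte, 1)
	c.mu.Lock()
	c.seq++
	id := c.seq
	c.pending[id] = ch
	c.conn.Write(encodeFrame(id, payload)) // error handling elided; the lock covers only the write, not the round trip
	c.mu.Unlock()
	return <-ch
}

// readLoop runs in one goroutine per connection: it decodes each response
// frame and wakes the caller that owns that sequence number.
func (c *pipelineConn) readLoop() {
	for {
		id, resp, err := readFrame(c.conn)
		if err != nil {
			return
		}
		c.mu.Lock()
		ch, ok := c.pending[id]
		delete(c.pending, id)
		c.mu.Unlock()
		if ok {
			ch <- resp
		}
	}
}

// encodeFrame/readFrame stand in for the protocol-specific framing.
func encodeFrame(id uint64, payload []byte) []byte { return payload }
func readFrame(c net.Conn) (uint64, []byte, error) { return 0, nil, nil }
```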



4. GC pauses are too long



Go's GC is still improving, and creating large numbers of objects and buffers still puts a heavy burden on it, especially in a program occupying around 25 GB. The Go team has said on the mailing lists that goroutines will become cheaper to use and that in theory the application layer should not need extra strategies just to relieve the GC.



One way to improve is splitting into more instances: if the company has no port restrictions you can quickly deploy many instances and shorten GC pauses; it is the most direct method. For us, however, the external network usually only uses ports 80 and 443, so normally only two instances can be opened per machine. Many people have suggested SO_REUSEPORT, but our kernel version is quite old and we have not tried it in practice.



We could also imitate nginx and fork multiple processes listening on the same port; we do not do that at the moment, mainly because our current process management, the monitoring of externally listening programs on different ports, and the supporting internal communication and management ports, instance management, and upgrades would all need to be adjusted.



The other two means of relieving GC pressure are memory pools and object pools, but they deserve careful evaluation and testing: using them also means weighing code readability against overall efficiency.



This kind of scheme reduces parallelism, because resources in the pool have to be protected by mutexes or by atomic CAS operations; in our measurements atomic operations are usually faster. CAS can be thought of as a finer-grained, operable lock (you can layer policies on top of CAS, such as giving up after a number of attempts to avoid busy-spinning). The problem is that the program starts to read more and more like C: malloc on every acquisition, free at every place after use, and for object pools a reset before free. I once tried to build a tiered "lock-free queue" at the application layer:






Each slot in the array on the left is actually a list: memory is bucketed by size and then managed with atomic CAS operations. Looking at the test data, pooling does significantly reduce temporary objects and memory allocation/release, and GC time drops, but whether the parallelism lost to the locking still lets overall throughput improve over a given period has to be tested and traded off...
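
The experiment above used a size-classed array of free lists managed with CAS; as a simpler, hedged illustration of the same idea, the sketch below buckets buffers by size class and reuses them through sync.Pool instead of a hand-rolled lock-free list. The size classes are arbitrary examples.

```go
package main

import "sync"

var classes = []int{256, 1024, 4096} // size classes, smallest first

// one pool per size class, each producing buffers of exactly that size
var pools = func() []*sync.Pool {
	ps := make([]*sync.Pool, len(classes))
	for i, size := range classes {
		size := size
		ps[i] = &sync.Pool{New: func() interface{} { return make([]byte, size) }}
	}
	return ps
}()

// GetBuf returns a reusable buffer of at least n bytes, falling back to a
// plain allocation for oversized requests.
func GetBuf(n int) []byte {
	for i, size := range classes {
		if n <= size {
			return pools[i].Get().([]byte)[:n]
		}
	}
	return make([]byte, n)
}

// PutBuf returns a buffer to its size class so fewer short-lived objects
// reach the garbage collector.
func PutBuf(b []byte) {
	c := cap(b)
	for i, size := range classes {
		if c == size {
			pools[i].Put(b[:c])
			return
		}
	}
}
```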



In our message system, short of some black magic, having millions of goroutines spin to acquire reusable buffers and objects would be very costly, especially under the M:N threading model where so much depends on Go's own scheduling. Unless more policies are added to the pool, such as backing off when it is busy, it feels like doing the runtime's job at the application layer, which is not an elegant implementation; in general the cost outweighs the benefit.



However, for RPC libraries or codec libraries, and inside the task pool, where a fixed number of goroutines process data in a centralized way, this kind of transformation is worth trying.



For fixed objects such as the fixed heartbeat packet, reusing global objects is worth considering; for application-layer data, object pools designed specifically for that data, reusing only the parts that can be reused, are likely to evaluate better than an indiscriminate general-purpose pool.



Operation and testing of message system



The following is a brief introduction to how the message system's architecture has iterated and some of the lessons learned. Since this has been shared elsewhere before with the relevant links, only a short summary is given here; those interested can follow the links for details.



Architecture iteration: splitting by business and by cluster enables partial grayscale deployment and online testing, reduces interference between point-to-point and broadcast products, and allows independent optimization of specific functions.



For the message system's architecture and cluster splitting, the most basic step is splitting into multiple instances, followed by classifying by the resource profile of each business type, and then by users' access networks and IDC point requirements (we currently lack the conditions for the latter, so all products are deployed to all IDCs).









For system testing, Go has a unique advantage when writing concurrent test tools.









For stress testing we mainly target designated servers: idle production servers are selected for long-connection load tests, and the system's state during the test is analyzed with the help of visualization. The tool we used early on did the job, but its statistical reporting fell somewhat short of what I wanted. I think some recently released Go open-source projects fit this scenario well; the ease Go brings to writing concurrent network programs lets everyone reduce complexity by splitting the work into layered, cooperating components and then combining them.



Q&A



Q1: How large is the protocol stack, and what principles guide the timeout customization?



The timeout on mobile networks is driven by product requirements: usually 5 minutes on 2G/3G and 5-8 minutes on WiFi. For individual scenarios where requests must get fast responses, if the connection is idle for more than 1 minute a ping/pong is exchanged to detect disconnection so reconnection can happen as soon as possible.



Q2: Is the message persisted?



Messages are persisted; the usual pattern is store-then-send. Storage uses Redis, and the data is landed in MySQL; MySQL is only used for recovery after failures.



Q3: How are message storms handled?



On the sending side, ordinary products do not need rate limiting; for larger products the sending queue controls the rate, issuing messages at a per-second rate based on audience size, sending the next batch only after the previous one has been sent successfully.



Q4: What about Golang's toolchain support? I have written a few small programs of a few thousand lines and it is indeed very good, but I do not know how the debugging and profiling tools hold up as the codebase grows. The share above says Go's own profiling tools are good, but what about debugging? The official debugger is not out yet and GDB support is imperfect; what do you use?



Normally we just use println; I can locate basically all of my problems that way, though I cannot rule out concurrency issues that println cannot reproduce, which for now we can only handle by experience. As long as you analyze the common concurrency patterns, they can be found. Go will launch debugging tools soon~



Q5: Is the protocol stack based on TCP?



Is there a protocol extension capability? The protocol stack is TCP, and the whole system runs on TCP long connections; we have not considered extending it. If you have good experience with that, please share~



Q6: When the system receives uplink data, it forwards it to the corresponding system for processing. How is it forwarded, and if a reply needs to be returned to the client, how is that handled?



Uplink data is forwarded according to the protocol header, which marks the product and forwarding type; the Coordinator dispatches by product and forwarding type and calls back the subscriber. If the user needs to block and wait for a reply, the follow-up is sent as another message routed back to the user, because the entire system is fully asynchronous.
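
A hedged sketch of that header-based dispatch: the uplink frame's header carries the product and forwarding type, and the Coordinator routes the payload to whichever handler is registered for that pair. The field layout and names are illustrative only.

```go
package main

// Header is the part of the uplink frame the Coordinator routes on.
type Header struct {
	Product     uint16 // which product/app the frame belongs to
	ForwardType uint8  // how the frame should be forwarded (callback, chat, ...)
	Seq         uint32 // lets a later downstream message be routed back to the waiting user
}

// dispatch hands the payload to the handler registered for this product and
// forwarding type; any reply is sent later as an ordinary downstream message,
// since the whole system is asynchronous.
func dispatch(h Header, payload []byte, handlers map[uint16]map[uint8]func([]byte)) {
	if byType, ok := handlers[h.Product]; ok {
		if fn, ok := byType[h.ForwardType]; ok {
			fn(payload)
		}
	}
}
```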



Q7: A question about the push SDK. With the push SDK being a single connection multiplexed by multiple apps, how are the following issues handled: 1) Won't the system's traffic accounting attribute all traffic to the app that started the connection? And the app that starts the connection is not fixed, right? 2) Different apps may embed different versions of the same push SDK, so the exposed interfaces may have version conflicts; how is that solved in single-connection mode?



Traffic can indeed only be attributed to the app that launched the connection, but an app with such a high installation rate can generally bear it, and a widely used app is also less likely to be killed off; in addition, the message volume is strictly controlled, so overall this also saves the user power and traffic. Partly for this reason we keep the push SDK as small as possible: it does very limited work, abstracts out only the common features, and as a pure push system the client-side strategy currently does very little.



Q8: Is the profiling of the production system always open?



Not always; each cluster has sampled instances, but turning it on is controlled from the management backend. Profiling is invoked through an internal interface.
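
For reference, a minimal way to expose Go's built-in profiling behind an internal switch looks like the sketch below; the port and the idea of triggering it from a management endpoint are assumptions, since the answer does not describe the actual interface.

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

// startProfiling would be called from an internal management endpoint; the
// pprof handlers then serve CPU, heap, and goroutine profiles on demand.
func startProfiling() {
	go func() {
		// Bind to an internal address only; operators fetch profiles with
		// `go tool pprof http://host:6060/debug/pprof/profile`, etc.
		http.ListenAndServe("127.0.0.1:6060", nil)
	}()
}
```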



Q9: Can the message consumers in front of the system be grouped, similar to Kafka?



Clients can subscribe to messages from different products and receive different groupings; the bind or unbind operation happens at access time.



Q10: Why did you give up Erlang and choose Go; was there a particular reason? We are currently using Erlang.



There is nothing wrong with Erlang; after we went live, another team built an Erlang version, and QA ran a department-level comparison test. Since there was no significant performance gain, the company chose to keep the Go version of push as the base service.



Q11: Regarding flow control, have you run into idle-connection problems caused by NIC configuration?



Our flow control is at the business level. Before going live we tested the internal network's traffic limits, and requests are then throttled in the RPC library so the internal communication overhead stays below that limit; flow control kicks in before the upper limit is reached.



Q12: Why use ZK for service coordination and scheduling; have you considered a Raft implementation? There are plenty of Raft implementations in Go, such as Consul and etcd.



Three years ago nobody had heard of the latter two. ZK was the company's mature internal solution at the time, but now we are planning custom development combined with our system: replacing ZK with our own Keeper, which turns configuration files into data structures automatically and syncs those structures to the designated processes, while also supporting many custom discovery and control strategies. Clients embed Keeper's SDK to implement all of the monitoring data and profiling data collection mentioned above, configuration updates, startup/shutdown, and other callbacks. Communication between the SDK and Keeper is fully abstracted, and Raft is being considered between Keeper nodes.



Q13: Is load policy handled on both the server and client sides (the dispatcher returns a set of IPs)? Also, how are the consistency and availability of the connection state between the long-connection server and the Register server guaranteed? Are there any special concerns for server-side keepalive? Is security based on TLS plus application-layer encryption?



Some of it is done on the server side; for example, during a restart a command-type message is issued and the client takes the corresponding action. Some messages use encryption policies, a custom RSA+DES scheme, which has to satisfy our security teams; many custom security and encryption strategies have also been developed. Consistency is handled by cold standby. We considered double writes early on, but synchronously double-writing real-time state costs too much and easily produces dirty data; if a Register goes down, all Room instances are called and re-flush their state into the designated Register to recover.



Q14: Are there plans to open-source this Keeper?



It is still being written. If it does not end up too coupled to our system's features, it will certainly be open-sourced; the main issue is that this would mean the SDK libraries we bind in would also have to be open-sourced~



Q15: Out of curiosity, which license would it use if open-sourced?



The FreeBSD license.



This article was planned by Guo Jun, edited by Liu Wei, and proofread and published by Tim, with contributions from many other volunteers. For more architecture content, follow the "High Availability Architecture" public account by searching for "archnotes". When reposting, please credit "High Availability Architecture (archnotes)".

