Designing a million-connection message push system


Original address: https://my.oschina.net/crossoverjie/blog/2208192

Preface

First of all, happy Mid-Autumn Festival to everyone.

I haven't posted for over a week; I've actually been saving up a big piece to share some material that I hope interests you.

Given my recent work, I used this three-day holiday to finish it (truth be told, I played for two of the days).

First, the background for this topic: I have recently been doing IoT-related development work, which inevitably involves interacting with devices.

The main task is to build a system that supports device access and pushes messages to devices, while also handling a very large number of concurrent device connections.

So the content of this share is not limited to the IoT domain; it also supports the following scenarios:

    • Web-based chat systems (point-to-point and group chat).
    • Web applications that need server-side push.
    • SDK-based message push platforms.
Technology selection

We need to support a large number of connections and full-duplex communication, with performance guaranteed as well.

In the Java technology stack, traditional blocking IO is naturally ruled out first.

That leaves NIO, and at that level there are not many choices. Considering community activity, documentation, and maintenance, the final choice is Netty.

The final architecture diagram is as follows:

It is fine if it does not fully make sense yet; each part is described below.

Protocol resolution

Since this is a messaging system, we naturally need a protocol format agreed upon with the client.

The common and simple choice is the HTTP protocol, but one of our requirements is two-way interaction, and HTTP mostly serves browsers. What we need is a leaner protocol that cuts unnecessary data transfer.

So I think it is best to tailor a private protocol to the business needs; in my case, there is a standard IoT protocol available.

In other scenarios you can borrow ideas from the currently popular RPC frameworks and define a custom private protocol, making communication between the two sides more efficient.

However, based on my experience over this period: whichever way you go, reserve fields in the protocol for security-related data.

The protocol contents themselves are not discussed further here; they come up later alongside more concrete applications.

Simple implementation

First consider how to implement the functionality, then think about the million-connection case.

Registration authentication

Before handling any real upstream or downstream messages, the first issue to consider is authentication.

Just as with any app you use, the first step is logging in; it must not be possible for just anyone to connect to the platform directly.

So the first step is registration.

This is the 注册/鉴权 (registration/authentication) module in the architecture diagram above. Typically, the client passes a unique identifier in an HTTP request; after verifying it, the backend responds with a token and maintains the token-client relationship in Redis or a database.

The client also saves the token locally and carries it in every subsequent request. Once the token expires, the client has to request a new one.
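As a rough sketch of that token flow (all class and method names here are illustrative, not from the original system, and a plain map stands in for the Redis/DB store):

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the registration/authentication step: issue a token
// for a client's unique id and validate it on later requests. In the article's
// architecture the token-client mapping lives in Redis or a DB; a map stands
// in for that store here.
public class AuthService {
    private static final long TOKEN_TTL_MS = 30 * 60 * 1000L; // 30 min, arbitrary

    private final Map<String, String> tokenToClient = new ConcurrentHashMap<>();
    private final Map<String, Long> tokenExpireAt = new ConcurrentHashMap<>();

    /** Called by the HTTP registration endpoint: verify identity, return a token. */
    public String register(String clientId) {
        String token = UUID.randomUUID().toString();
        tokenToClient.put(token, clientId);
        tokenExpireAt.put(token, System.currentTimeMillis() + TOKEN_TTL_MS);
        return token;
    }

    /** Each subsequent request carries the token; an expired token forces re-registration. */
    public boolean validate(String token) {
        Long expireAt = tokenExpireAt.get(token);
        return expireAt != null && expireAt > System.currentTimeMillis();
    }

    public String clientOf(String token) {
        return tokenToClient.get(token);
    }
}
```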

After authentication passes, the client establishes a TCP long-lived connection (TCP 长连接) directly to the push-server module in the diagram.

This module is what actually handles upstream and downstream messages.

Save channel Relationships

After the connection is established, we need to maintain the relationship between the current client and its Channel before doing any real business processing.

Assuming the client's unique identifier is a phone number, we keep a Map from phone number to the current Channel.

This is similar to the earlier post on SpringBoot integration with a long-connection heartbeat mechanism.

We also set an attribute on the Channel, so that the client's unique identifier (the phone number) can be retrieved from the Channel itself:

public static void putClientId(Channel channel, String clientId) {
    channel.attr(CLIENT_ID).set(clientId);
}

And to read the phone number back:

public static String getClientId(Channel channel) {
    return (String) getAttribute(channel, CLIENT_ID);
}

This lets us log the relevant information when a client goes offline:

String telNo = NettyAttrUtil.getClientId(ctx.channel());
NettySocketHolder.remove(telNo);
log.info("client offline, telNo=" + telNo);

One thing to note here: the Map that holds the client-Channel relationship is best given a preset size (to avoid frequent resizing), because it will be the most frequently used and most memory-hungry object in the system.
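A minimal sketch of such a holder, assuming a preset capacity and a generic channel type (in the real system the channel would be Netty's io.netty.channel.Channel; class and field names here are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the holder described above: phone number -> Channel, with a preset
// capacity to avoid rehashing. Generic over the channel type C so the sketch
// stays self-contained.
public class SocketHolder<C> {
    // Preset initial capacity; derive it from your expected per-node connection count.
    private static final int EXPECTED_CONNECTIONS = 16 * 1024;

    private final Map<String, C> channels =
            new ConcurrentHashMap<>(EXPECTED_CONNECTIONS);

    public void put(String telNo, C channel) { channels.put(telNo, channel); }

    public C get(String telNo) { return channels.get(telNo); }

    /** Called when a client goes offline, as in the log snippet above. */
    public void remove(String telNo) { channels.remove(telNo); }

    public int online() { return channels.size(); }
}
```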

Message upstream

Next comes the real business data upload. Usually the first step is to decide which business type an uploaded message belongs to.

In a chat scenario, for example, clients may upload text, images, video, and other content.

So we have to distinguish them and handle each differently; this comes back to the protocol agreed with the client.

    • You can use a field in the message header to differentiate.
    • Simpler still: use a JSON message and distinguish the different types by one of its fields.

Either way works, as long as the types can be told apart.
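For illustration, a dispatch keyed on such a type field might look like this (the type names are hypothetical, not from the original protocol):

```java
// Sketch of distinguishing upstream message types by a field carried in the
// protocol. The enum values and return strings are illustrative only.
public class MessageDispatcher {
    enum MsgType { PING, TEXT, IMAGE, VIDEO }

    /** After decoding, branch on the message's type field. */
    public String dispatch(MsgType type, String payload) {
        switch (type) {
            case PING:  return "pong"; // heartbeat, answered immediately
            case TEXT:  return "text handled: " + payload;
            case IMAGE: return "image handled";
            case VIDEO: return "video handled";
            default:    throw new IllegalArgumentException("unknown type " + type);
        }
    }
}
```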

Message parsing and business decoupling

Once the message is parsed, the business logic follows: writing to the database, calling other services, and so on.

We all know that messages in Netty are typically handled in the channelRead() method.

That is where the message can be parsed and its type determined.

But if our business logic is written there as well, that method's content becomes enormous.

Worse, if several developers handle different business features in the same place, conflicts multiply and maintenance becomes difficult.

So it is essential to completely separate message parsing from business processing.

This is where interface-oriented programming comes in.

The core code here is consistent with the "building a wheel" post on cicada (a lightweight web framework):

define an interface for processing the business logic, then, after parsing the message, create a concrete object by reflection and invoke its processing method (处理函数).

With this split, developers working on different business features only need to implement this interface and write their own business logic.

The pseudo code is as follows:
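The original snippet did not survive in this copy; a minimal sketch of the interface-plus-reflection idea, with all names hypothetical, might look like:

```java
// Each business feature implements this interface; channelRead() only parses
// the message, chooses a handler class, and hands off. All names illustrative.
interface MsgHandler {
    String handle(String msg);
}

class ChatHandler implements MsgHandler {
    @Override
    public String handle(String msg) {
        // real logic would write to the DB, call other services, etc.
        return "chat:" + msg;
    }
}

public class HandlerFactory {
    /** Create a concrete handler by reflection from a class name chosen per message type. */
    public static MsgHandler create(String className) throws Exception {
        return (MsgHandler) Class.forName(className)
                .getDeclaredConstructor()
                .newInstance();
    }
}
```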

To find out more about cicada's implementation, see:

Https://github.com/TogetherOS/cicada

One more thing to note on the upstream side: since this is a long-lived connection, the client needs to send heartbeat packets periodically to keep the connection alive. The server performs the corresponding check: if no message arrives within N intervals, it proactively closes the connection to save resources.

This can be achieved with an IdleStateHandler; for more, see Netty (1): SpringBoot integration with a long-connection heartbeat mechanism.
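Conceptually, the server-side check amounts to: record the last read time per client and evict anything silent for too long. The following is only a plain-Java model of what IdleStateHandler provides in the pipeline (names illustrative); in real code Netty fires an IdleStateEvent and the handler closes the Channel:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Pure-Java model of the server-side idle check described above.
public class IdleChecker {
    private final long maxIdleMs;
    private final Map<String, Long> lastReadAt = new ConcurrentHashMap<>();

    public IdleChecker(long maxIdleMs) { this.maxIdleMs = maxIdleMs; }

    /** Call on every heartbeat / upstream message from a client. */
    public void touch(String clientId, long nowMs) {
        lastReadAt.put(clientId, nowMs);
    }

    /** Run periodically: evict clients silent for too long; returns evicted count. */
    public int evictIdle(long nowMs) {
        int evicted = 0;
        Iterator<Map.Entry<String, Long>> it = lastReadAt.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMs - it.next().getValue() > maxIdleMs) {
                it.remove(); // the real server would close the Channel here
                evicted++;
            }
        }
        return evicted;
    }
}
```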

Message downlink

Where there is upstream there is naturally downstream. For example, in the chat scenario, two clients are connected to the push-server and need point-to-point communication.

The flow is then:

    • A sends the message to the server.
    • The server receives it, learns that the message is addressed to B, and looks up B's Channel in memory.
    • A's message is forwarded out through B's Channel.

That is one downstream flow.

Even an administrator pushing a system notification to all online users works similarly:

traverse the Map that holds the channel relationships and send the message to each one. This is also the main reason the relationships need to be stored in a Map in the first place.

The pseudo code is as follows:
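The original pseudo code is missing from this copy; here is an illustrative model of the downstream flow instead. The channel is modelled as a send callback so the sketch is self-contained; in the real code it is a Netty Channel and sending is channel.writeAndFlush(msg):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Sketch of the downstream flow: look up the receiver's channel by id and
// forward, or traverse the whole map for a system broadcast.
public class Downstream {
    private final Map<String, BiConsumer<String, String>> channels = new ConcurrentHashMap<>();

    public void online(String clientId, BiConsumer<String, String> sendFn) {
        channels.put(clientId, sendFn);
    }

    /** A -> server -> B: find B's channel in memory and forward A's message. */
    public boolean sendTo(String toId, String msg) {
        BiConsumer<String, String> ch = channels.get(toId);
        if (ch == null) return false; // receiver not online on this node
        ch.accept(toId, msg);
        return true;
    }

    /** Admin broadcast: traverse the map and send to everyone online. */
    public int broadcast(String msg) {
        channels.forEach((id, ch) -> ch.accept(id, msg));
        return channels.size();
    }
}
```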

For details, refer to:

https://github.com/crossoverJie/netty-action/

Distributed Solutions

With the standalone version implemented, now focus on how to achieve millions of connections.

"Millions of connections" is really just shorthand; the point is a distributed solution that can scale out horizontally to support ever more connections.

Before that, we first have to figure out how many connections our standalone version can support. Many factors affect this:

    • The server's own configuration: memory, CPU, network card, the maximum number of open files allowed by Linux, and so on.
    • The application's own configuration: Netty itself relies on off-heap memory, but the JVM also needs its own share, for example for the large Map storing channel relationships. This has to be tuned to your own situation.

Combining the above, you can load-test the maximum number of connections a single node supports.

However you optimize, a single machine has a ceiling; that is the main problem the distributed solution addresses.

Architecture Introduction

Before going further, we first need to talk through the overall architecture diagram posted above.

Start from the left side.

The 注册/鉴权 (registration/authentication) module mentioned above is also deployed as a cluster, load-balanced by a front-facing Nginx. As mentioned before, its main purpose is to authenticate and return a token to the client.

But once push-server becomes a cluster, this module gains one more job: returning an available push-server to the current client.

On the right side, 平台 (platform) generally refers to the management platform, which can show the current real-time online count, push a message to a specified client, and so on.

Pushing a message goes through a push route (push-route) to find the push-server node that actually holds the connection.

The remaining middleware, Redis, Zookeeper, Kafka, and MySQL, all exist to support these functions; see the implementations below.

Registration and discovery

The first problem to solve is 注册发现 (registration and discovery): once there is more than one push-server, how do we select an available node for the client?

This topic was covered in detail in Distributed (1): service registration and discovery.

All push-server instances need to register their own information in Zookeeper when they start.

The 注册鉴权 (registration/authentication) module subscribes to the nodes in Zookeeper, so it always has the latest service list. The structure is as follows:

Here is some pseudo-code:

The application registers itself with Zookeeper on launch.

The 注册鉴权 module, for its part, only needs to subscribe to this Zookeeper node:
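The original snippets are missing from this copy; as an illustrative stand-in, the flow can be modelled in memory like this. With Zookeeper, register() would create an ephemeral node (for example via a client such as Curator) that disappears automatically when the push-server dies, and the subscriber would watch the parent node for child changes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.CopyOnWriteArraySet;

// In-memory model of registration/discovery. With Zookeeper, register() maps
// to creating an ephemeral child node (e.g. /route/{ip:port}), unregister()
// happens automatically when the session dies, and serviceList() is what a
// watch on the parent node keeps up to date.
public class ServiceRegistry {
    private final Set<String> nodes = new CopyOnWriteArraySet<>();

    /** push-server calls this at startup with its own ip:port. */
    public void register(String ipPort) {
        nodes.add(ipPort);
    }

    /** Models the ephemeral node vanishing when a push-server crashes. */
    public void unregister(String ipPort) {
        nodes.remove(ipPort);
    }

    /** Subscriber side: the latest service list, as seen via the watch. */
    public List<String> serviceList() {
        return new ArrayList<>(nodes);
    }
}
```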

Routing Policy

Now that we can obtain the full service list, how do we choose a push-server that is just right for the client?

This process focuses on the following points:

    • Try to ensure each node carries an even share of the connections.
    • Rebalance when nodes are added or removed.

First, several algorithms can provide balance:

    • Round-robin: assign nodes to clients one after another. However, newly added nodes end up unevenly loaded.
    • Hash-modulo: works like HashMap bucketing, but shares round-robin's problem. You could of course rebalance the way HashMap does on resize and have all clients reconnect, but that interrupts every connection at once, at great cost.
    • Consistent hashing (一致性 Hash): addresses most of the hash-modulo problem, though a portion of clients still needs to rebalance when nodes change.
    • Weighting: manually adjust each node's share of the load, or even automate it: based on monitoring, lower the weight of heavily loaded nodes and raise the weight of lightly loaded ones.
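A compact sketch of the consistent-hashing option, using a TreeMap ring with virtual nodes (the virtual-node count and node addresses are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Consistent hashing on a TreeMap ring: each client maps to the first node
// clockwise from its hash, so adding or removing a node only moves the
// clients in the affected arc. Virtual nodes smooth the distribution.
public class ConsistentHashRouter {
    private static final int VIRTUAL_NODES = 100; // illustrative

    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRouter(List<String> nodes) {
        for (String node : nodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(node + "#" + i), node);
            }
        }
    }

    /** Pick the push-server for a client id (e.g. a phone number). */
    public String route(String clientId) {
        SortedMap<Long, String> tail = ring.tailMap(hash(clientId));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF); // first 8 digest bytes as a long
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```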

One more question:

when we restart some instances for an upgrade, what happens to the clients connected to those nodes?

Thanks to the heartbeat mechanism, when heartbeats fail the node can be considered down. The client then re-requests the 注册鉴权 module to obtain an available node. The same applies under weak network conditions.

If the client happens to be sending a message at that moment, the message needs to be saved locally and re-sent once a new node is available.

Stateful connection

In such a scenario, unlike stateless HTTP, we always have to know the relationship between a client and its connection.

In the standalone version above, we saved this relationship in a local cache, but that obviously does not work in a distributed environment.

For example, when the platform pushes a message to a client, it must first know which node holds that client's Channel.

Drawing on earlier experience, the natural move is to introduce third-party middleware to store this relationship.

This is the 存储路由关系的 Redis (Redis storing routing relationships) in the architecture diagram: when a client connects, push-server stores the client's unique identifier and its own service node's ip+port in Redis.

When the client goes offline, the corresponding entry must also be deleted from Redis.

Ideally, the Map in each node's memory should exactly mirror the data in Redis.

The pseudo code is as follows:

Storing the routing relationship involves concurrency problems; it is best done in a Lua script.
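As an illustrative model of that store (a ConcurrentHashMap stands in for Redis; the atomic putIfAbsent mirrors what SET key value NX, or a Lua script doing the check and set in one step, would guarantee when two nodes race to register the same client):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Model of the routing store: clientId -> "ip:port" of the push-server that
// holds the connection. In production this map is Redis.
public class RouteStore {
    private final Map<String, String> routes = new ConcurrentHashMap<>();

    /** Returns true if this node won the registration for the client. */
    public boolean saveRoute(String clientId, String ipPort) {
        return routes.putIfAbsent(clientId, ipPort) == null;
    }

    /** Remove the route when the client goes offline on that node. */
    public void removeRoute(String clientId) {
        routes.remove(clientId);
    }

    public String lookup(String clientId) {
        return routes.get(clientId);
    }
}
```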

Push routing

Imagine a scenario in which an administrator needs to push a system message to a batch of recently registered clients.

Combining this with the architecture diagram:

suppose this batch has 100K clients. First, the 平台 (platform) sends the batch of identifiers through Nginx to a push-route.

To improve efficiency, you can even split this batch across several push-route instances.

Each push-route takes its share of identifiers and, using multiple threads, looks up in the routing Redis which push-server each client is connected to.

The real message is then sent by an HTTP call to that push-server (Netty also supports the HTTP protocol well).

After the push succeeds, the results need to be written back to the database; clients that were offline can be re-pushed later, depending on the business.
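A sketch of that batch flow, with the Redis lookup and the HTTP call injected as functions so the example stays self-contained (all names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Sketch of the push-route step: with a thread pool, look up each client's
// push-server in the routing store and push over HTTP, counting successes.
public class PushRouter {
    private final Function<String, String> routeLookup; // clientId -> push-server ip:port
    private final Function<String, Boolean> httpPush;   // performs the HTTP call

    public PushRouter(Function<String, String> routeLookup, Function<String, Boolean> httpPush) {
        this.routeLookup = routeLookup;
        this.httpPush = httpPush;
    }

    /** Push to all clients with the given parallelism; returns the success count. */
    public int pushAll(List<String> clientIds, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> futures = new ArrayList<>();
            for (String id : clientIds) {
                futures.add(pool.submit(() -> {
                    String node = routeLookup.apply(id);
                    return node != null && httpPush.apply(node + "/" + id);
                }));
            }
            int ok = 0;
            for (Future<Boolean> f : futures) {
                if (f.get()) ok++; // the real system writes each result back to the DB
            }
            return ok;
        } finally {
            pool.shutdown();
        }
    }
}
```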

Message Flow

In some scenarios the client's upstream messages are very important, need to be persisted, and arrive in very large volume.

Having push-server do that business work directly is clearly inappropriate; Kafka is a natural choice for decoupling here.

Simply drop all upstream data into Kafka and move on.

A consumer program then pulls the data out of Kafka and into the database.

This area is also well worth discussing on its own; for a start, read: Did even the mighty Disruptor hit a memory overflow?

A detailed introduction to Kafka will follow later.

Distributed issues

Going distributed solves the performance problem but brings other problems with it.

Application Monitoring

For example: how do we know the health status of the dozens of push-server nodes running in production?

This is where a monitoring system comes in. We need to know each node's current memory usage and GC behavior,

as well as the operating system's own memory usage, since Netty makes heavy use of off-heap memory.

We also need to monitor the number of clients currently online on each node, and the online count in Redis; in theory these two numbers should be equal.

This also shows how utilized the system is, so the number of nodes can be adjusted flexibly.

Log processing

Logging also becomes extremely important. For example, when a client fails to connect, you need to be able to find out where the problem is.

It is best to attach a traceId to every request, so the logs show exactly where a request got stuck on which node.

Tools like ELK are therefore a must.

Summarize

This post combines my recent day-to-day experience; some pitfalls I may not have hit at work yet, and some points are surely missing.

For now, building a truly stable push system is actually rather involved; it touches many details, and you only really appreciate them once you have built one.

If you found this useful after reading, please feel free to share it.

Welcome to follow my public account for further discussion:

