This article describes the reconstruction and evolution of the nice server architecture. The server is written in PHP, and the source code involved in this article is also PHP,
so I am posting it in the PHP forum. If moderator zuning thinks it does not belong here, please help move it...
To avoid any suspicion of advertising, the company introduction section has been removed.
The talk also included a Q&A session, which I have not reposted here.
The full original article can be read on the public account.
At this stage, the nice server mainly faces the following challenges:
System design: the design must be change-oriented and support the diversity of requirements in the "product exploration stage".
Stability: avoid hurting existing users through stability problems, and cope with sudden business growth.
Collaboration: the server side is the bridge between teams such as client, recommendation strategy, big data, QA, operations, and product. The challenge is how to connect these parties through technical or non-technical means.
"Re-launch": The Road to nice reconstruction
As soon as I joined nice, I was given a very challenging task: rebuild the overall service and framework of the server. Like many startup teams, we had accumulated a series of technical debts as we grew.
The old system was written with the CI framework. There was no module division and no explicit layering; business logic was handled directly at the entry point, so almost nothing was reused.
API versions were managed by copying directories. As the business grew, it became hard to maintain more than 10 versions of the interface code at the same time.
The code was full of compatibility logic such as if ($isAndroid && $appVersion >= 3...).
Client and server developers also had no structured way to do joint debugging. I believe many startups have experienced these problems.
Old System architecture
The old system was the most typical monolithic application architecture: the backend, HTML5 pages, and interfaces were all bundled together.
Facing this situation, we first analyzed what had to be solved and concluded that the following three problems were critical.
Structure problems. The code structure is messy and cannot be reused.
Client difference management. Copying interface versions produced a huge amount of duplicated code, most of it there to handle special requirements for specific clients, grayscale releases, small-traffic experiments, and similar cases.
Client/Server RD collaboration issues.
Layered and modular
First, a simple two-layer architecture solves the first problem.
At the code level, the system is divided into an application layer and a service layer:
The application layer handles entry-level concerns: the interaction protocol, authentication, Antispam and other common service access, and the customizations needed by each client.
The service layer solves the problem of business logic and vertically divides the service layer into modules according to the business.
By dividing layers and modules, code management becomes clearer and logic reusability is greatly improved. At the same time, business division also provides basic support for business isolation and hierarchical management.
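The split between the two layers can be sketched in a few lines of PHP. This is only an illustration under assumptions: the class names UserService and UserProfileApi, the request fields, and the in-memory data are hypothetical and do not come from nice's code base. It uses PHP 8.0 syntax.

```php
<?php
declare(strict_types=1);

// Service layer: business logic for one vertical module ("user").
// Hypothetical class; the in-memory array stands in for real storage.
final class UserService
{
    /** @var array<int, string> */
    private array $names = [1 => 'alice', 2 => 'bob'];

    public function getName(int $uid): ?string
    {
        return $this->names[$uid] ?? null;
    }
}

// Application layer: entry-level concerns only (authentication, protocol),
// then delegation to the service layer.
final class UserProfileApi
{
    public function __construct(private UserService $users) {}

    /** @param array<string, mixed> $request */
    public function handle(array $request): array
    {
        if (empty($request['token'])) {
            return ['code' => 401, 'data' => null];   // authentication failed
        }
        $name = $this->users->getName((int) ($request['uid'] ?? 0));
        if ($name === null) {
            return ['code' => 404, 'data' => null];
        }
        return ['code' => 200, 'data' => ['name' => $name]];   // response protocol
    }
}

$api = new UserProfileApi(new UserService());
$ok = $api->handle(['token' => 't', 'uid' => 1]);
$denied = $api->handle(['uid' => 1]);
```

Because UserService knows nothing about tokens or response envelopes, another entry point (an HTML5 page or an admin tool, for example) could reuse it unchanged.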
Two basic components
The client difference management and client/server collaboration problems described above are solved by two basic components of the framework:
ClientAdapter: client adapter, used to handle all the logic problems caused by client differences.
CKCR: CheckAndCorrect, used to control the input/output protocol and solve the technical issues of client/server RD collaboration.
ClientAdapter component
First, let's look at the application scenario of a ClientAdapter.
The configuration in this example implements rules of the following common kind.
The "greeting" feature is supported by all clients at version 3.1.0 and later.
The "emoticon" feature is supported by all clients at version 3.1.0 and later, and by the grayscale channel abc36032 at version 3.1.0.
As the nice code shows, the business logic deals with "Features" rather than with specific client environments. This is one application scenario of ClientAdapter.
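Rules of this kind can be sketched as follows. This is not the ClientAdapter source or its rule syntax: ClientEnv, FeatureGate, and the condition names (minVersion, afterVersion, version, channel) are hypothetical, and the "emoticon" rule encodes one possible reading of the rule above (general release after 3.1.0, plus 3.1.0 on the grayscale channel). The matches() method plays the role the article assigns to checkEnv(). PHP 8.0 syntax is assumed.

```php
<?php
declare(strict_types=1);

// Hypothetical description of the requesting client.
final class ClientEnv
{
    public function __construct(
        public string $os,
        public string $appVersion,
        public string $channel = ''
    ) {}
}

// Hypothetical feature gate: each feature lists alternative rules,
// and a rule matches when every condition in it holds.
final class FeatureGate
{
    /** @param array<string, array<int, array<string, string>>> $features */
    public function __construct(private array $features) {}

    public function matches(array $rule, ClientEnv $env): bool
    {
        foreach ($rule as $key => $expected) {
            $ok = match ($key) {
                'minVersion'   => version_compare($env->appVersion, $expected, '>='),
                'afterVersion' => version_compare($env->appVersion, $expected, '>'),
                'version'      => version_compare($env->appVersion, $expected, '=='),
                'channel'      => $env->channel === $expected,
                default        => false,   // unknown conditions never match
            };
            if (!$ok) {
                return false;
            }
        }
        return true;
    }

    public function supports(string $feature, ClientEnv $env): bool
    {
        foreach ($this->features[$feature] ?? [] as $rule) {
            if ($this->matches($rule, $env)) {
                return true;
            }
        }
        return false;
    }
}

$gate = new FeatureGate([
    'greeting' => [['minVersion' => '3.1.0']],
    'emoticon' => [
        ['afterVersion' => '3.1.0'],                       // assumed reading
        ['version' => '3.1.0', 'channel' => 'abc36032'],   // grayscale channel
    ],
]);

$canGreet = $gate->supports('greeting', new ClientEnv('android', '3.1.0'));
```

Business code then asks only whether a Feature is supported, and the version and channel comparisons stay in one place.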
Overall structure of ClientAdapter
The core of ClientAdapter is an abstraction of the "client runtime environment", which describes the client that initiates each request: for example, the operating system, App version, IP address, network type, network operator, and geographic location. It also provides a simple rule syntax for describing a constrained client environment.
To applications it exposes only the checkEnv() interface, which checks whether the current client matches a given description rule. On top of this infrastructure, nice has built many higher-level applications.
For example, client differences used to force nice into complicated client adaptation; with NiceFeature, RD now works with product-iteration Features instead. Another example is NiceUrl, where nice implements unified CDN scheduling: through ClientAdapter, we can flexibly apply different scheduling policies to different regions.
In addition, for user traffic splitting, the flexibility of this mechanism lets nice select experiment users along various dimensions.
ClientAdapter open source address
The implementation of the ClientAdapter component is very simple, only 200 lines. The code has been extracted and put on github for your reference at http://t.cn/rgnqnpj.
In addition, the design of ClientAdapter borrows a common practice from C projects: in the autoconf stage, information about the system environment is defined as HAVE_XXX macros. In this way, the complexity of the environment is decoupled from the actual business code.
CKCR introduction
Problem 3 above, client/server RD collaboration, is divided into two parts.
Technical level: protocol-layer constraints.
Process level: how to work together.
At the technical level, the problem is how to guarantee that the interface protocol is actually followed. On the input side, we do not trust the client: once the protocol is agreed, input must follow it, so that a client controlled by "bad guys" cannot damage the service. On the output side, we make sure that the data returned by the business layer reaches the client as the protocol specifies, to avoid surprises on the client such as crashes caused by type mismatches.
In short, the goal is to check and correct both the input data and the output data. For this, nice introduced a component named CKCR, whose full name is ChecK & CorRect.
CKCR implementation
CKCR implements a small descriptive syntax. The syntax describes how data is checked and corrected, and the check and correction behaviors can be freely extended. The following is an example of its application.
In this example, $data is the data to be processed, and $ckcrDesc is the description string written in this syntax. It describes the following rules.
The top-level data is a KV array (mapping), of which only the user and shows subkeys are kept.
user is also a KV array, whose id is of type int and whose name is of type str.
shows is an array in which each element is a KV array. Only its id and url subkeys are kept, and the custom imgCdnUrl processing (which implements CDN scheduling) is applied.
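The three rules above can be illustrated with a small check-and-correct function. This is a sketch under assumptions, not CKCR itself: the real component uses a description string whose syntax is not shown here, so the sketch uses a nested PHP array as the descriptor. The function checkAndCorrect(), the 'map' and 'list' keys, the placeholder host names, and the choice to apply imgCdnUrl to the url subkey are all hypothetical. PHP 8.0 syntax is assumed.

```php
<?php
declare(strict_types=1);

/**
 * Check and correct $data against $desc.
 * $desc is a scalar rule name ('int', 'str', or a key of $custom),
 * ['map' => [key => desc, ...]], or ['list' => desc].
 */
function checkAndCorrect(mixed $data, array|string $desc, array $custom = []): mixed
{
    if (is_string($desc)) {
        $scalar = is_scalar($data) ? $data : '';
        if ($desc === 'int') {
            return (int) $scalar;
        }
        if ($desc === 'str') {
            return (string) $scalar;
        }
        if (isset($custom[$desc])) {
            return $custom[$desc]((string) $scalar);   // extension point
        }
        throw new InvalidArgumentException("unknown rule: $desc");
    }

    $data = is_array($data) ? $data : [];
    $out = [];
    if (isset($desc['map'])) {
        foreach ($desc['map'] as $key => $sub) {
            if (array_key_exists($key, $data)) {
                $out[$key] = checkAndCorrect($data[$key], $sub, $custom);
            }
        }
        return $out;   // keys not named in the descriptor are dropped
    }
    if (!isset($desc['list'])) {
        throw new InvalidArgumentException('descriptor needs "map" or "list"');
    }
    foreach ($data as $item) {
        $out[] = checkAndCorrect($item, $desc['list'], $custom);
    }
    return $out;
}

$desc = ['map' => [
    'user'  => ['map' => ['id' => 'int', 'name' => 'str']],
    'shows' => ['list' => ['map' => ['id' => 'int', 'url' => 'imgCdnUrl']]],
]];
$custom = [
    // Placeholder CDN rewrite; a real one would pick a CDN per client region.
    'imgCdnUrl' => fn (string $u): string => str_replace('origin.example', 'cdn.example', $u),
];
$data = [
    'user'     => ['id' => '7', 'name' => 123, 'password' => 'secret'],
    'shows'    => [['id' => '1', 'url' => 'http://origin.example/a.jpg', 'debug' => true]],
    'internal' => 'x',
];
$clean = checkAndCorrect($data, $desc, $custom);
```

The same function serves both directions: applied to request parameters it enforces the input protocol, and applied to the business layer's return value it fixes types before the data reaches the client.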
CKCR has similarities to and differences from protobuf and Thrift schemas. The main difference is that CKCR provides a general, extensible way to check and correct data. Because of this extensibility, more can be done at the common data layer. For example, imgCdnUrl in the example above is the unified hook for nice's CDN scheduling.
CKCR internal structure and syntax rules
The basic functions of CKCR are described above. Later, we found that across the system's scenarios, most of the data structures output for core data are the same. Reuse of CKCR description strings therefore became a problem.
To solve this, nice introduced a preprocessing step before CKCR compilation: a special syntax references a fixed data structure description. This mechanism brought an additional benefit, namely that the system's core data structures are consolidated in one place.
With this component, nice solves the client/server collaboration problem at the technical level. So how do we solve the human side of collaboration?
First, CKCR description rules are concise, so they can be output directly as the interface document.
Second, with the interface document as the basis, there is a clear process for collaboration between nice server and client RD.
Agreement: both sides discuss and agree on the interface protocol and produce the document, then each side enters its own design phase.
Mock data: server RD quickly provides a mock interface that returns fake data, so that client RD can self-test basic functionality.
Real interface: server RD provides the real interface for joint debugging.
This step-by-step approach basically solves the problem of "joint debugging by shouting". The work of the two sides is largely decoupled, and neither side's progress blocks the other.
CKCR open source address
The source code of the CKCR component has been extracted and put on github at: http://t.cn/rgn57xk.
Stage 1 summary
The above are the solutions for nice to deal with the three problems in the first stage:
Hierarchy and modularization: solves structural problems through two-layer architecture.
Client adapter: solves the problem of client difference.
CKCR: solves the problem of client/server RD collaboration through CKCR and the collaboration process.
At this stage, we mainly solved development problems through a complete reconstruction and paved the way for later architecture adjustments.
Our choice at the time was to "start over". Looking back now, it was the right choice. If we had not rebuilt then and cleared the problems the old system had left behind, doing so after the system grew larger would have been much harder.
Still, a "re-launch" reconstruction is full of risks. Before making such a decision, make a thorough assessment of resources and risks.
Stability pitfalls
After the complete reconstruction, nice entered a period of rapid business growth. Another month-long SpecialForce sprint was launched; almost everyone in the R&D team stayed near the company and worked very long hours. During that time, the sprint raised key metrics such as daily active users for our products, and interface PV peaked at 500 million per day.
Up to August 2015, the stability of the service was severely tested. To be honest, we were close to breaking down. The most important problems were as follows.
1. MySQL could not hold up
nice's MySQL cluster was initially a single instance with a single database, master-slave replication, and mechanical hard drives. In April 2015, like many teams with fast business growth, OP switched all the databases to SSD to solve the problem quickly.
In addition, with a single database the services cannot be isolated from each other, and writes to the master database also became a bottleneck. Around March 2015, nice began to consider database and table sharding. The database split was mainly done vertically, by business.
Technical debt really is not something you want to owe!
Splitting the databases took about half a year. After that, splitting the system's core large tables took about another quarter.
From MySQL, the lessons are as follows.
Solving the problem with hardware is very cost-effective.
Dividing databases by business line, estimating table sizes, and so on: doing this up front may cost only a few extra days. Putting it off may mean, as it did for us, spending more than a year cleaning up the mess.
On the other hand, nice also depends heavily on Redis. Part of the data is typical cache usage: when an online service misses the cache, it automatically falls back to the database and refills the cache. Another part is quasi-persistent data, for which our online business does not fall back to the DB.
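The first usage, a cache that falls back to the database on a miss, follows the common cache-aside pattern. A minimal sketch, assuming a hypothetical ArrayCache class as an in-memory stand-in for Redis and a closure as a stand-in for the MySQL query (neither is nice's code):

```php
<?php
declare(strict_types=1);

// In-memory stand-in for Redis so the sketch runs on its own.
final class ArrayCache
{
    /** @var array<string, string> */
    private array $items = [];

    public function get(string $key): ?string
    {
        return $this->items[$key] ?? null;
    }

    public function set(string $key, string $value): void
    {
        $this->items[$key] = $value;
    }
}

function getWithFallback(ArrayCache $cache, string $key, callable $loadFromDb): ?string
{
    $value = $cache->get($key);
    if ($value !== null) {
        return $value;                // cache hit
    }
    $value = $loadFromDb($key);       // cache miss: fall back to the database
    if ($value !== null) {
        $cache->set($key, $value);    // refill the cache for later requests
    }
    return $value;
}

// Stand-in for a database query that also counts how often it is called.
$dbReads = 0;
$db = function (string $key) use (&$dbReads): ?string {
    $dbReads++;
    return $key === 'user:1' ? 'alice' : null;
};

$cache = new ArrayCache();
$first = getWithFallback($cache, 'user:1', $db);    // goes to the database
$second = getWithFallback($cache, 'user:1', $db);   // served from the cache
```

Quasi-persistent data has no such fallback path, which is why losing it in a Redis fault requires a separate recovery procedure.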
2. Redis could not hold up
nice's Redis was also a single cluster. Around May 2015, as the business iterated quickly and many new features went live, the pressure on Redis rose rapidly and Redis faults began to occur from time to time. nice then began to split the Redis services. Because the business model is relatively simple, splitting Redis went faster.
At the same time, because data had been lost in the faults, we decided to do some in-house development on Redis high availability, mainly targeting smooth scaling and automatic failover.
Our experience here was not good and led to many problems. The most serious one: no problems were found during the online trial run, but after the full rollout several cluster nodes failed one after another. After holding out for two days, we had to switch back. But we were short of machine resources at the time, so we could not switch everything back at once and had to do it cluster by cluster. Meanwhile, several RDs worked in parallel to write recovery scripts for almost all of the quasi-persistent data.
The lessons from Redis were painful. From the pitfalls of 2015, we learned the following.
Experience with load and capacity evaluation
Load-related problems should be solved through isolation and splitting.
Basic service monitoring is a must. Monitoring CPU, memory, disk, bandwidth, and other basic resources costs little, but it often helps identify problems early.
Capacity evaluation of services must be done carefully. For online businesses whose cache falls back to the DB, the risk of requests penetrating to the DB after a cache failure must be taken into account.
Data-related experience
Quasi-persistent data: if Redis has no disaster recovery solution, prepare a full data recovery plan in advance. If you only start writing the code once something has gone wrong, you will be paralyzed.
Any change to online data must come with a rollback plan. Otherwise, a problem can turn into a disaster.
Finally, a word on in-house development. If your technical scale does not call for it, we suggest using existing mature solutions. Even when the conditions are met, be cautious at every step.
3. Front-end servers could not hold up
The last problem is the pressure on the front-end server.
First of all, while a startup is small there are usually few access-layer problems, but we still recommend taking precautions or preparing a plan against "bad guys".
Let's look at the front-end machine cluster. The problems we have encountered can be classified into two categories.
Client failures, such as faulty polling, or user behavior send a large number of requests to the server.
A backend service failure makes the front-end server slow or unable to connect, so the front-end server's process pool fills up easily.
Because we have already divided service resources and business modules, it is relatively easy to handle this issue. The general idea is as follows.
Front-end machines are grouped by business. The core business usually changes less, so this keeps faults caused by changes in other businesses from affecting the core business.
Downgrade: we have prepared downgrade plans that can be activated immediately. There are two main policies.
A. Request side: for problems with a particular user group or business;
B. Backend service: for the case where a dependent backend service is faulty.
Backend service disaster tolerance: at present, nice has added an LVS layer for backend service failover (for internal MySQL/Redis services). In addition, on the front-end server application side, nice provides a backend service scoring mechanism, so that problems detected on the application side are not ignored.
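The article does not describe how the scoring mechanism works, so the following is only one possible shape for it, with every name and number an assumption: each call reports success or failure for a backend, failures lower its score faster than successes raise it, and a backend below the threshold is treated as unusable so that the caller takes its downgrade path. PHP 8.0 syntax is assumed.

```php
<?php
declare(strict_types=1);

// Hypothetical application-side health score for backend services.
final class BackendScore
{
    /** @var array<string, int> */
    private array $scores = [];

    public function __construct(
        private int $max = 10,
        private int $threshold = 5,
        private int $penalty = 3
    ) {}

    // Record the outcome of one call to the named backend.
    public function report(string $backend, bool $success): void
    {
        $score = $this->scores[$backend] ?? $this->max;
        $score += $success ? 1 : -$this->penalty;
        $this->scores[$backend] = max(0, min($this->max, $score));
    }

    // A backend is usable while its score is at or above the threshold.
    public function isUsable(string $backend): bool
    {
        return ($this->scores[$backend] ?? $this->max) >= $this->threshold;
    }
}

$scores = new BackendScore();
$scores->report('redis-feed', false);   // score drops from 10 to 7
$scores->report('redis-feed', false);   // score drops from 7 to 4
$degraded = !$scores->isUsable('redis-feed');

// Downgrade path: return an empty result instead of waiting on a sick backend.
$feed = $degraded ? [] : ['item-1', 'item-2'];
```

In a real PHP deployment the scores would have to live in storage shared between worker processes; the sketch keeps them in one object only so that it runs on its own.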
After these two phases, the current architecture of our server is as follows.
Nice's pain points and next steps
After August 2015, with the joint efforts of RD and OP, service stability was brought under control. At the same time, the technical team has been growing quickly and now includes full data and strategy teams. As more parties integrate with each other, service orientation becomes increasingly important.
In addition, the team keeps expanding, and all the online business code lives in one code base. Sometimes more than 10 people need to change the code at the same time, and coupling in the code base has become an obvious problem.
Currently, the most important problems nice faces are service orientation and code splitting.
To solve this problem, we have the following basic ideas.
Service orientation without a one-size-fits-all approach: support both remote service calls and co-located deployment. Avoid splitting into too many services and introducing service governance too early.
Code splitting and an application development framework upgrade: separate libraries, frameworks, and services so that each is maintained on its own, and import them through a dependency management tool.
Over more than a year of working at a startup, I have gained many insights.
Entrepreneurship is a way of life. You may face all kinds of problems at any time, some that you are good at and some that you are not. Either way, you have to step up. Having chosen this path, you have to carry it.
This talk focused on practical problems. I hope friends who are also on the startup road will read it and avoid a few detours.
Reply to discussion (solution)
Learn something!
Not clear at all
Yes !!!!
Some do not understand, it seems that there are still a lot to learn
Not understandable
Well written.