How Large B2C Websites Implement a High-Performance, Scalable Architecture
As China's largest B2C website, its architecture has had to bear the pressure of rapidly growing data volumes and traffic. To sustain that load and preserve a good user experience, a scalable, high-performance website architecture is essential.
I. Stateless applications
The scalability of a system depends on how application state is managed. Imagine that we keep a large amount of client state in the session: what happens when the server holding that state goes down? We usually solve this with a cluster. Generally speaking, a cluster provides not only load balancing but, more importantly, failover: for example, Tomcat uses broadcast replication among cluster nodes, while JBoss uses paired replication of session state. However, restoring state inside the cluster has a serious drawback: it hurts the system's scalability. The system cannot scale out well simply by adding machines, because session replication traffic between cluster nodes grows as nodes are added. To make the application itself scalable, we need the application to be stateless, so that every node in the cluster is identical and the system can scale horizontally.
The importance of statelessness has been described above; how do we get there? This is where a session framework comes in. It is usually implemented through cookies or centralized session management. Concretely, multiple stateless application nodes connect to a session server; the session server keeps sessions in a cache, and behind it sits a persistent data source such as a database or file system.
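The centralized approach described above can be sketched as follows. This is a minimal illustration, not a real session framework: the `SessionServer` class, its in-memory `cache`, and the plain dict standing in for the persistent backing store are all hypothetical.

```python
import time
import uuid

class SessionServer:
    """Hypothetical centralized session store: an in-memory cache
    backed by a persistent store (here a plain dict standing in for
    a database or file system)."""

    def __init__(self, ttl_seconds=1800):
        self.cache = {}        # session_id -> (expires_at, data)
        self.persistent = {}   # stand-in for the underlying database
        self.ttl = ttl_seconds

    def create(self):
        sid = uuid.uuid4().hex  # the value handed back to the client in a cookie
        self.cache[sid] = (time.time() + self.ttl, {})
        return sid

    def get(self, sid):
        entry = self.cache.get(sid)
        if entry and entry[0] > time.time():
            return entry[1]
        # cache miss or expired entry: fall back to the persistent store
        return self.persistent.get(sid)

    def set(self, sid, key, value):
        expires, data = self.cache.setdefault(sid, (time.time() + self.ttl, {}))
        data[key] = value
        self.persistent[sid] = data  # write-through to the backing store

# Any stateless application node can look sessions up here,
# so the nodes themselves hold no state and remain interchangeable.
server = SessionServer()
sid = server.create()
server.set(sid, "user_id", 42)
```

Because the session lives behind the session server rather than on any application node, a node can die and be replaced without losing the user's session.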
II. Effective use of cache
Everyone building Internet applications knows how important caching is: browser caches, reverse-proxy caches, page caches, partial-page (fragment) caches, and object caches are all caching scenarios.
Caches can generally be divided into local caches and remote caches according to their distance from the application. A system normally uses either local or remote caching, not both; using the two together makes data consistency between them much more troublesome to handle.
In most cases the cache we talk about is a read cache, but there is another kind: the write cache. For data with a low read/write ratio and low durability requirements, we can cache writes to reduce access to the underlying database. For example, to count product page views or API calls, we can first write to an in-memory cache and persist to the database later, which greatly reduces the write pressure on the database.
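The write-cache idea for counters can be sketched like this. It is a minimal write-behind illustration under stated assumptions: the `WriteBehindCounter` class and the dict standing in for the database are hypothetical, and the flush trigger here is a simple per-key threshold (a real system might flush on a timer instead).

```python
from collections import defaultdict

class WriteBehindCounter:
    """Hypothetical write-behind cache for counters (e.g. product page
    views): increments accumulate in memory and are flushed to the
    database in batches, trading a little durability for far fewer writes."""

    def __init__(self, db, flush_threshold=100):
        self.db = db                     # stand-in for the real database
        self.pending = defaultdict(int)  # item_id -> not-yet-persisted delta
        self.flush_threshold = flush_threshold

    def incr(self, item_id):
        self.pending[item_id] += 1
        if self.pending[item_id] >= self.flush_threshold:
            self.flush(item_id)

    def flush(self, item_id):
        # One database write replaces up to flush_threshold writes.
        self.db[item_id] = self.db.get(item_id, 0) + self.pending.pop(item_id, 0)

db = {}
counter = WriteBehindCounter(db, flush_threshold=10)
for _ in range(25):
    counter.incr("product-1")
# 25 increments produce only two database writes; 5 remain in memory
```

The trade-off is exactly the one the paragraph describes: if the process dies before a flush, the in-memory deltas are lost, which is acceptable for view counts but not for money.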
III. Application splitting
Before explaining application splitting, let's first review the problems a small system runs into as it grows; through these problems we will see why splitting matters when building a large system.
At the early stage after launch, a system has few users, and all of the logic may live in a single application running as one process. With few users and little traffic, this is not a problem. But, as everyone knows, it does not last. As the number of users grows, so does the access pressure; and as new features are added to meet user needs, the system grows ever more complex, harder to maintain and extend, and its scalability and availability suffer. How do we solve these problems? The wise answer is splitting (which is also decoupling): divide the original system into subsystems according to some criterion, such as business relevance, with each subsystem responsible for a different function. After splitting, each subsystem can be extended and maintained on its own, improving scalability and maintainability. At the same time, horizontal scalability (scale-out) improves greatly, because we can scale out only the subsystems under heavy load without affecting the others, whereas before splitting, when pressure rose we had to scale the entire system at much greater cost. In addition, splitting reduces the coupling between subsystems: when one subsystem is temporarily unavailable, the system as a whole can remain available, greatly improving overall availability.
Therefore, a large Internet application must be split: only after splitting do scalability, maintainability, and availability improve. But splitting also poses a new question for the system: how do the subsystems communicate with each other? There are broadly two ways: synchronous and asynchronous communication. Here we first discuss synchronous communication; the later topic on asynchronous communication will cover the message-based approach. Since the subsystems must talk to one another, a high-performance remote call framework becomes very important.
All of the above are the advantages of splitting, but splitting inevitably brings new problems. Besides the inter-subsystem communication just mentioned, the most noteworthy one is dependency management: once there are many systems, the dependencies between them become complex. We therefore need to pay close attention to the splitting criteria, for example whether mutually dependent functions can be grouped together so that each system stays as self-contained (vertical) as possible; this is also how the company currently organizes its systems vertically. We must also watch out for circular dependencies between systems and be careful when a dependency loop appears, because it can cause chained startup failures.
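A dependency loop of the kind warned about above can be detected before it causes a chained startup failure. The sketch below is a standard depth-first-search cycle check; the `deps` map and the subsystem names are hypothetical.

```python
def find_cycle(deps):
    """Hypothetical dependency checker: deps maps each subsystem to the
    subsystems it depends on. Returns one cycle as a list of names, or
    None if the systems can be started in some order."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / on current path / done
    color = {node: WHITE for node in deps}

    def visit(node, path):
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, ()):
            if color.get(dep, WHITE) == GRAY:
                # dep is already on the current path: a loop
                return path[path.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                cycle = visit(dep, path)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in list(deps):
        if color[node] == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None

# trade depends on user, user depends on trade: a startup deadlock
assert find_cycle({"trade": ["user"], "user": ["trade"]}) is not None
assert find_cycle({"trade": ["user"], "user": []}) is None
```

Running such a check against the declared dependencies of each subsystem at build time catches the loop long before anyone tries to start the cluster.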
From the above we can see that a large system must be split to remain maintainable, extensible, and scalable; splitting in turn brings the problems of inter-system communication and inter-system dependency management.
IV. Database splitting
In the previous topic, "Application splitting", we said that a large Internet application needs to be split well, but we only covered splitting at the application level. Besides that, an Internet application has another important level of splitting: storage. This topic therefore covers how to split the storage system, which usually means the RDBMS.
With the topic settled, let's again review the problems an Internet application encounters as it grows from small to large; these problems will lead us to the importance of splitting the RDBMS.
At the beginning, right after launch, the system has few users and all data lives in a single database; under that light pressure, one database is enough to cope. But thanks to the hard work and promotion of the operations folks, one day the number of users suddenly surges, the database can no longer stand it, and, just when everyone is celebrating, it finally goes down. Investigating the cause, we find the read pressure on the database is too high. Everyone knows the remedy: read/write splitting. We configure one server as the master node and several slave nodes, so that reads are distributed across the slaves while writes go to the master, and the system returns to normal operation. But the good times do not last. One day we find that the master can no longer keep up either: its load is too high, it is sweating, and it could fall over at any moment. Now we need vertical partitioning (also called database sharding): store product information, user information, and transaction information in separate databases, and each of those databases, such as the product-information database, can itself use the master/slave pattern. After this split by function, the write pressure is spread across different servers and the database finally returns to normal. But does it end there? No, and that "no" is not mine; it is the experience our predecessors have accumulated.
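The read/write splitting step above can be sketched as a tiny router. This is a rough illustration only: the connection objects are plain strings standing in for real connections, and the SQL classification is deliberately naive (a real router must also handle transactions and replication lag).

```python
import itertools

class ReadWriteRouter:
    """Hypothetical read/write splitting: all writes go to the master,
    reads are spread round-robin across the slave replicas."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = itertools.cycle(slaves)  # endless round-robin over replicas

    def connection_for(self, sql):
        # Very rough classification: SELECTs go to a slave,
        # everything else goes to the master.
        if sql.lstrip().upper().startswith("SELECT"):
            return next(self.slaves)
        return self.master

router = ReadWriteRouter("master-db", ["slave-1", "slave-2"])
```

With such a router in front of the connection pool, the application code stays unchanged while read load is fanned out across the slaves.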
As the number of users keeps growing, you will find that some tables in the system become huge, for example a friend-relationship table or a shop's parameter-configuration table. Reading from or writing to such tables is very laborious for the database, so we need horizontal partitioning (this is what is commonly called sharding).
All of the above amounts to one fact: the database is the part of the system least amenable to scaling out. A large Internet application will inevitably go from a single DB server, to master/slave, to vertical partitioning (database sharding), and then to horizontal partitioning (table sharding). In this process, master/slave and vertical partitioning are relatively easy and have little impact on applications, but table sharding brings some hard problems, such as being unable to join and query data across multiple partitions, and how to balance the load across shards. We therefore need a generic DAL (data access layer) framework to shield application logic from the underlying data layout, so that access to the underlying data is transparent to the application.
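At the heart of such a DAL layer sits a routing rule that maps a shard key to a physical table or database. A minimal hash-based sketch, assuming a hypothetical `friend_relation` table split into four shards:

```python
import zlib

def shard_for(shard_key, shard_count=4):
    """Hypothetical horizontal-partitioning rule: hash the shard key so
    the DAL layer, not the application, decides which physical shard a
    row lives in. CRC32 is used only as a cheap, stable hash."""
    return zlib.crc32(str(shard_key).encode("utf-8")) % shard_count

def table_for(user_id):
    # e.g. friend_relation_0 .. friend_relation_3
    return "friend_relation_%d" % shard_for(user_id)
```

The rule must be deterministic (the same key always lands on the same shard), and this is also where the hard problems show up: a join between rows whose keys hash to different shards cannot be done in one SQL statement any more.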
V. Asynchronous communication
When introducing the remote call framework, we said that a large system must be split for the sake of scalability and maintainability, and that after splitting, communication between subsystems becomes the primary concern. The remote-call-framework discussion covered synchronous communication in a large distributed system; this section covers asynchronous communication, which is where message middleware comes in. Adopting asynchronous communication is closely tied to system scalability and to decoupling the subsystems as much as possible.
When it comes to asynchronous communication, note that it must be driven by business characteristics: we asynchronize for the business. It usually suits loosely coupled scenarios; for business systems with a high degree of mutual relevance, we still need synchronous communication.
OK, next let's talk about the benefits asynchronization brings to the system. First, consider a system with two subsystems, A and B, that communicate synchronously: to improve overall scalability we must scale both A and B together, which constrains the scale-out of the whole system. Second, synchronous calls also hurt availability: if A calls B synchronously, then "A is available" requires "B is available", and by the contrapositive, if B is unavailable then A is unavailable too, which greatly affects system availability. Third, once systems communicate asynchronously, response time improves greatly: each request returns sooner and the user experience improves. So asynchronization improves system scalability and availability while greatly shortening request response time (though the total processing time of a request may not actually decrease).
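The A/B hand-off described above can be sketched with an in-process queue standing in for real message middleware. The subsystem names and the order event are hypothetical; the point is only that A returns without waiting for B.

```python
import queue
import threading

# Hypothetical asynchronous hand-off between subsystems A and B:
# A enqueues a message and returns immediately; B consumes it later,
# so A's response time and availability no longer depend on B.

message_queue = queue.Queue()
processed = []

def subsystem_a_place_order(order_id):
    message_queue.put({"event": "order_created", "order_id": order_id})
    return "accepted"  # A answers the user without waiting for B

def subsystem_b_worker():
    while True:
        msg = message_queue.get()
        if msg is None:  # shutdown signal for this demo
            break
        processed.append(msg["order_id"])
        message_queue.task_done()

worker = threading.Thread(target=subsystem_b_worker)
worker.start()
subsystem_a_place_order(1001)
subsystem_a_place_order(1002)
message_queue.put(None)
worker.join()
```

If B is down, the messages simply wait in the queue (a real broker would also persist them), which is exactly how asynchrony breaks the "B unavailable implies A unavailable" chain.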
VI. Unstructured data storage
In a large Internet application, we find that not all data is structured. Configuration files, a user's activity feed, or the snapshot of a transaction are generally not suited to an RDBMS; they fit a key-value structure better. There is also data that is large in volume but not demanding in real-time access, which needs a different kind of storage, as do large static files such as product images and product descriptions: putting these in an RDBMS causes read-performance problems and drags down reads of other data. Such information therefore needs to be stored separately from the rest; Internet application systems generally keep it in a distributed file system.
With the development of the Internet, NoSQL has become a popular concept in the industry. As we all know, according to the CAP theorem, consistency, availability, and partition tolerance cannot all be satisfied at once; at most two can. Traditional relational databases adopt ACID transactions, which emphasize strong consistency at the cost of availability. Internet applications, however, often value availability above consistency. In that case we move away from ACID transaction policies toward the BASE strategy, which is short for Basically Available, Soft state, Eventual consistency. With BASE, we improve system availability by settling for eventual consistency. This is the strategy adopted by many NoSQL products, including Cassandra (originally from Facebook), Apache HBase, and Google's Bigtable. These products suit unstructured, key-value-style data well and have the great advantage of easy horizontal scaling. At present the company is also researching and adopting some mature NoSQL products.
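One simple way replicas in a BASE-style key-value store converge is a last-write-wins merge, sketched below. This is only an illustration of eventual consistency under stated assumptions: the replica format (key mapped to a timestamp/value pair) and the sample data are hypothetical, and real systems often use richer schemes such as vector clocks.

```python
def merge_lww(replica_a, replica_b):
    """Hypothetical last-write-wins merge for a BASE-style key-value
    store: each replica maps key -> (timestamp, value); when two
    replicas disagree, the entry with the newer timestamp wins."""
    merged = dict(replica_a)
    for key, (ts, value) in replica_b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return merged

# Two replicas that diverged while the network was partitioned:
a = {"cart": (100, ["book"]), "name": (50, "alice")}
b = {"cart": (120, ["book", "pen"])}
```

During a partition both replicas stay available for reads and writes (sacrificing consistency); once the partition heals, exchanging and merging entries brings them to the same final state, which is the "eventual" in eventual consistency.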
VII. Monitoring and warning system
The only reliable thing about a large system is that it is unreliable.
A large distributed system inevitably involves all kinds of hardware, such as network switches, commodity PCs, network adapters, hard disks, and memory, and when these components are very numerous, the probability of failure rises accordingly. We therefore need to monitor the system's state at all times, and at different granularities. At a coarse granularity we monitor the application system as a whole: current network traffic, memory utilization, I/O, CPU load, service access pressure, response times, and so on. At a finer granularity we monitor individual functions within an application: the traffic to a particular URL, the PV of each page, the daily bandwidth each page consumes, page render time, and, finer still for static resources, the daily bandwidth consumed by a single image. A monitoring system is therefore indispensable.
We have covered the importance of a monitoring system; once one is in place, it is even more important to integrate it with an early warning system. For example, when traffic to a page surges, the system should issue a warning automatically; likewise when a server's CPU or memory usage suddenly rises, or when concurrent requests are being dropped heavily. The combination of monitoring and warning lets us respond to system problems quickly and improves system stability and availability.
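The monitoring-plus-warning loop can be sketched as a simple threshold check. Everything here is hypothetical: the `ThresholdMonitor` class, the metric names, and the limits; a real system would also track trends over time, deduplicate warnings, and actually notify people.

```python
class ThresholdMonitor:
    """Hypothetical monitor: record a metric sample and raise a warning
    when it crosses its configured threshold."""

    def __init__(self, thresholds):
        self.thresholds = thresholds  # metric name -> max allowed value
        self.alerts = []              # warnings raised so far

    def record(self, metric, value):
        limit = self.thresholds.get(metric)
        if limit is not None and value > limit:
            self.alerts.append("%s=%s exceeds %s" % (metric, value, limit))

monitor = ThresholdMonitor({"cpu_load": 0.8, "response_ms": 500})
monitor.record("cpu_load", 0.95)    # over the limit: triggers a warning
monitor.record("response_ms", 120)  # within limits: no warning
```

Coarse-grained metrics (CPU, memory, I/O) and fine-grained ones (per-URL traffic, per-page bandwidth) both fit the same record-and-check pattern; only the metric names and thresholds differ.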
VIII. Unified configuration management
A large distributed application usually consists of many nodes. If adding a node requires changing the configuration of every other node, and removing one does too, this is bad for system maintenance and management and easily introduces errors. In addition, many systems in the cluster share the same configuration; without unified management, the same configuration must be maintained separately on every system, which makes configuration management and maintenance troublesome. Unified configuration management solves these problems effectively: when a node is added or removed, the configuration management system notifies every node to update its configuration, so that all nodes stay consistent, which is both convenient and error-free.
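The notify-on-change behavior described above can be sketched as follows. This is a toy in-process model, not a real configuration service: the `ConfigService` and `Node` classes are hypothetical, and in practice the "push" would travel over the network (or nodes would watch a store such as ZooKeeper).

```python
class ConfigService:
    """Hypothetical unified configuration service: nodes register
    themselves, and a single update() pushes the new value to every
    node, keeping all of them consistent without per-node editing."""

    def __init__(self):
        self.config = {}
        self.nodes = []

    def register(self, node):
        self.nodes.append(node)
        node.config = dict(self.config)  # a new node receives the current config

    def update(self, key, value):
        self.config[key] = value
        for node in self.nodes:          # notify every registered node
            node.config[key] = value

class Node:
    def __init__(self, name):
        self.name = name
        self.config = {}

service = ConfigService()
n1, n2 = Node("app-1"), Node("app-2")
service.register(n1)
service.register(n2)
service.update("db_host", "10.0.0.5")  # one change, every node updated
```

One `update()` call replaces editing the configuration on every node by hand, which is exactly the consistency-plus-convenience point the section makes.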
From: http://www.cnblogs.com/FredChan/archive/2010/07/27/1786226.html