High-performance server architecture ideas


In server-side development, performance has always been a focus of attention. The industry has a large number of frameworks, components, and libraries that are well known precisely because they sell themselves on performance. Yet the basic ways of thinking about performance that a server-side program should embody are rarely mentioned in the documentation of these projects. This article introduces the fundamental strategies and classic practices for solving performance problems on the server side, divided into several parts:

1. The concept of caching strategy, with examples

2. The difficulty of caching strategy: cleanup mechanisms for cached data with different characteristics

3. The concept of distribution strategy, with examples

4. The difficulty of distribution strategy: the balance between shared-data safety and code complexity

The concept of caching strategy

When we talk about server-side performance problems, the discussion is often confused, because whenever access to a service feels stuck or slow, users call it a "performance problem." In reality these problems have different causes, even though they all show up as long delays or even interruptions for client requests. The first cause is high concurrency: too many clients issue requests at the same time, so requests beyond the server's capacity are denied service; this often shows up as the server running out of memory. The second cause is that individual requests take too long to process: the handling time of some requests exceeds what users can tolerate, and this usually shows up as the CPU sitting at 100%.

In server development, the hardware we most commonly deal with is the CPU, memory, the disk, and the network card. The CPU represents the computer's processing time; disk space is usually plentiful, but reading and writing the disk introduces relatively large delays; memory and network cards are limited by storage capacity and bandwidth. When a server has performance problems, one or several of these pieces of hardware are overloaded. These four resources can be abstracted into two categories: time resources, such as CPU and disk I/O, and space resources, such as memory and network bandwidth. So when we hit a performance problem, there is one basic idea to reach for: trading time for space, or space for time. A few examples will illustrate it.

[A dam is an example of using reservoir space to buy time against the flow]

When we visit a website, the URL we enter is mapped by the server to a file on disk and read. If many users visit the site, every request triggers a disk read, which may overwhelm the disk and make it impossible to return the file contents promptly. But if we write the program so that the file content is read once and kept in memory for a long time, a later request for the same file can be served directly from memory, with no disk read at all. Because the files users access tend to be highly concentrated, a large share of requests can be answered from the in-memory copy, which greatly increases the traffic the server can carry. This practice trades memory space for disk read time; it is a space-for-time strategy.
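
As a minimal sketch of this idea (the cache API and the commented-out file path below are illustrative assumptions, not taken from the article):

```python
import os

class FileCache:
    """Keep file contents in memory so repeated reads skip the disk."""

    def __init__(self):
        self._cache = {}  # path -> bytes

    def read(self, path):
        # Serve from memory when we already hold a copy.
        if path in self._cache:
            return self._cache[path]
        # Otherwise pay the disk read once and remember the result.
        with open(path, "rb") as f:
            data = f.read()
        self._cache[path] = data
        return data

cache = FileCache()
# The first call reads the disk; later calls for the same path hit memory.
# html = cache.read("/var/www/index.html")
```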

[Instant noodles pre-cache most of the cooking work]

Another example: suppose we write a server program for an online game that saves player data through reads and writes to a database. If many players enter the server, there will be a great deal of player-data change: levelling up, acquiring weapons, and so on. If every change goes straight through the database, the database process may become overloaded and players cannot complete their in-game actions promptly. Looking closer, most of the reads in a game target static data, such as level data and the detailed attributes of weapons and items, while many of the writes simply overwrite each other: experience points, for example, may rise by a few dozen with every monster killed, but only the final value needs to be recorded, not every intermediate step. So we can again apply the space-for-time strategy: read the static game data into memory once, so that later reads never touch the database; and instead of writing player data to the database on every change, keep a copy of it in memory, apply all writes to that in-memory structure first, and have the server periodically write it back to the database. Many database writes thus collapse into one, saving a large amount of write overhead. This, too, is a space-for-time strategy.
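
A sketch of such a write-back (delayed flush) cache, assuming a hypothetical save_to_db() standing in for the real database layer:

```python
import threading
import time

def save_to_db(player_id, data):
    # Placeholder for the real database write.
    print(f"flushing {player_id}: {data}")

class PlayerCache:
    """Hold player data in memory; flush dirty entries to the DB periodically."""

    def __init__(self, flush_interval=30.0):
        self._data = {}      # player_id -> dict of fields
        self._dirty = set()  # player_ids changed since the last flush
        self._lock = threading.Lock()
        self._interval = flush_interval

    def update(self, player_id, **fields):
        with self._lock:
            self._data.setdefault(player_id, {}).update(fields)
            self._dirty.add(player_id)   # later writes overwrite earlier ones

    def flush(self):
        with self._lock:
            dirty, self._dirty = self._dirty, set()
            snapshot = {pid: dict(self._data[pid]) for pid in dirty}
        for pid, data in snapshot.items():
            save_to_db(pid, data)        # one DB write per player per interval

    def run_flusher(self):
        while True:
            time.sleep(self._interval)
            self.flush()

cache = PlayerCache()
cache.update(1001, exp=120, hp=85)
cache.update(1001, exp=150)  # overwrites exp; only the final value is flushed
cache.flush()
```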

[Flat-pack furniture saves a lot of shipping space, but assembly takes time]

Finally, an example of trading time for space. Suppose we are building a data storage system for a corporate address book, and the customer asks us to record every add, modify, and delete operation, i.e. the full history of changes, so that the data can be rolled back to any past point in time. The simplest approach is to copy the whole data set every time it changes. But that wastes an enormous amount of disk space, because each change usually touches only a small part of the data while each copy is large. Instead we can write down one record per change describing what changed: "inserted a contact with phone number xxx", "deleted the contact with phone number xxx", and so on. We then store only the deltas rather than many full copies. To restore the data to some point in time, we simply replay these records against the data up to the record at that point. Recovery may take a while, but it saves a great deal of storage. This is a strategy that spends CPU time to save disk space, and it is used by the logs of MySQL's InnoDB engine and by source-control stores such as SVN.
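
A toy sketch of this change-log idea (the operation format and the replay function are illustrative assumptions):

```python
class JournaledStore:
    """Store only a log of changes; rebuild any past state by replaying it."""

    def __init__(self):
        self.log = []  # list of (op, key, value) tuples, in order

    def set(self, key, value):
        self.log.append(("set", key, value))

    def delete(self, key):
        self.log.append(("del", key, None))

    def state_at(self, version):
        """Replay the first `version` records to reconstruct that snapshot."""
        state = {}
        for op, key, value in self.log[:version]:
            if op == "set":
                state[key] = value
            elif op == "del":
                state.pop(key, None)
        return state

store = JournaledStore()
store.set("alice", "555-0100")
store.set("bob", "555-0101")
store.delete("alice")
print(store.state_at(2))  # {'alice': '555-0100', 'bob': '555-0101'}
print(store.state_at(3))  # {'bob': '555-0101'}
```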

In addition, when a web server sends HTML content, it often gzip-compresses it first and sends the compressed bytes to the browser, which decompresses them before rendering. This spends CPU time on both the server and the client in exchange for network bandwidth.
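
A minimal illustration of that trade with Python's zlib (real web servers negotiate gzip via HTTP headers; this only shows the CPU-for-bandwidth exchange):

```python
import zlib

html = b"<html>" + b"<p>hello world</p>" * 1000 + b"</html>"

compressed = zlib.compress(html, 6)     # server spends CPU time here
restored = zlib.decompress(compressed)  # client spends CPU time here

assert restored == html
print(len(html), "->", len(compressed), "bytes on the wire")
```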

Inside a computer system the idea of caching is everywhere. The CPU has L1 and L2 caches, which use small amounts of fast storage to avoid waiting on the comparatively slow main memory. Graphics cards likewise contain large caches to hold the results of rendering computations.

[Traffic jams on the suburban roads that lead to plenty of parking space]

The essence of caching, besides "do not re-process data that has already been processed", is the strategy of "read and write from fast storage instead of slower storage". When we choose a caching strategy to convert between time and space, we must confirm that the conversion is actually worthwhile. For example, some early systems cached web files on a networked distributed disk (such as NFS); but because accessing a disk over the network is itself a slow operation, and it also consumes network bandwidth that may already be scarce, such a cache can make performance worse rather than better.

When designing a caching mechanism we are also exposed to another risk: the programming work of getting data into and out of the cache. If the data we want to cache cannot simply be read and written as raw bytes but has to be loaded into memory and handled as structures or objects in some language, we run into the problem of serialization and deserialization. If we cache data by directly copying memory, then as soon as those bytes are accessed from another process, or from another language, any pointers, IDs, or handles inside them become invalid, because those "marker" values mean nothing in a different address space. A deep-copy scheme can follow the pointers, locate the target memory, and copy the data they point to, but the more modern practice is a serialization scheme: define the structure with an explicit, well-specified "copy format", so that everyone knows exactly what gets copied and memory-address data such as raw pointers simply does not appear. The well-known Protocol Buffers, for example, makes it convenient to cache data in memory, on disk, or across the network; JSON is also used by some systems as a cache data format.
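
A small sketch of serializing a cached record to JSON instead of copying raw memory (the record fields are made up for illustration):

```python
import json

# An in-memory object may hold things (object references, handles) that are
# meaningless in another process.  Serialization keeps only plain values.
player = {"id": 1001, "name": "alice", "level": 12, "items": [3, 7, 42]}

blob = json.dumps(player).encode("utf-8")    # serialize: safe to cache or send
restored = json.loads(blob.decode("utf-8"))  # deserialize in any process/language

assert restored == player
```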

One thing to note is that cached data and the data our program actually manipulates usually have to be copied and converted into one another; this is exactly serialization and deserialization, and it can be fast or slow. So when choosing a data format for the cache, pay attention to conversion time, otherwise the copying and converting may eat up the gains of caching, or in bad cases leave you worse off than not caching at all. Generally, the closer the cached format is to the in-memory structure the program uses, the faster the conversion: Protocol Buffers' TLV encoding is slower than a raw memcpy of a C struct, but considerably faster than encoding to plain-text formats such as XML or JSON, whose encoding and decoding require complex table lookups and list manipulation.

The difficulty of caching strategy

Although using a cache sounds simple, the caching mechanism has one central difficulty: cache cleanup. Caching means keeping a copy of some data, but the data tends to change, and cleaning up the stale ("dirty") copies as it changes is not always easy.

Let us start with the simplest cached data: static data. This kind of data does not change while the program runs, for example the HTML files a web server caches in memory. In fact, any data not uploaded by users at runtime counts as this kind of "runtime-static data". There are generally two ways to build such a cache: the first is to read all the static data from files or the database into memory at program startup; the second is to load nothing at startup and only load a piece of data the first time a user accesses it, the so-called lazy load. The first approach is simpler to program, and the program's memory usage is stable once it has started, so memory problems are easy to spot: if too much is cached, the process runs out of memory right after startup and exits, which makes the problem obvious. The second approach starts faster, but you must limit or plan how much space the cache is allowed to consume; otherwise, caching too much data can exhaust memory and take the online service down.
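
A sketch of a lazy-load cache with a hard cap on the number of entries (the cap and the loader function are illustrative assumptions):

```python
class LazyCache:
    """Load entries on first access, and refuse to grow past a fixed cap."""

    def __init__(self, loader, max_entries=10_000):
        self._loader = loader          # function: key -> value (e.g. read a file)
        self._max = max_entries
        self._cache = {}

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        value = self._loader(key)      # pay the slow load only on first access
        if len(self._cache) < self._max:
            self._cache[key] = value   # beyond the cap, serve but do not cache
        return value

# Example loader: pretend each key maps to an expensive computation.
cache = LazyCache(loader=lambda k: k.upper(), max_entries=2)
print(cache.get("a"), cache.get("b"), cache.get("c"))  # "c" is not cached
```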

In principle, static data does not become "dirty", because no user writes into the cache. In practice, however, online services often need to change some cached data "immediately". For example, when a news item is published on a portal, we want every visitor to see it at once. The crudest way is simply to restart the server process so the in-memory cache disappears. For a static cache whose content changes very rarely this is acceptable, but a news site cannot restart its web servers every few minutes, since that would disrupt a large number of online users. There are two common strategies for this kind of problem:

The first is to use control commands. In short, the server process opens a real-time command channel. We can send it a command message over the network (for example a UDP packet), or a Linux signal (for example kill -SIGUSR2 <pid>), telling the process to start cleaning its cache. The cleanup may be the crude "clear everything", or it can be more fine-grained by attaching "the ID of the data to clean" to the command, for instance a URL string in the packet sent to the web server indicating which HTML file's cache to discard. The advantage of this approach is that cleanup is precise and we control exactly when and what gets cleaned. The drawback is that it is cumbersome: sending these commands by hand is tedious, so the command is usually wired into the tools that upload static data, such as the site's content-publishing system, which automatically sends a cleanup message to the web servers whenever an editor submits a news item.
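
A sketch of such a command channel, combining a SIGUSR2 handler ("clear everything") with a small UDP port that accepts per-URL cleanup messages (the port number and message format are assumptions for illustration):

```python
import signal
import socket

cache = {"/index.html": b"<old content>"}

def clear_all(signum, frame):
    cache.clear()                      # kill -SIGUSR2 <pid> wipes the whole cache

signal.signal(signal.SIGUSR2, clear_all)

def command_loop(host="127.0.0.1", port=9099):
    """Listen for UDP packets whose payload is the URL to evict."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        payload, _addr = sock.recvfrom(4096)
        url = payload.decode("utf-8").strip()
        cache.pop(url, None)           # precise, per-item cleanup

# A publishing tool might send:  echo -n "/index.html" | nc -u 127.0.0.1 9099
```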

The second is to use a freshness check in the serving logic. Before every read of the cache, the server process uses some characteristic of the data to quickly decide whether the in-memory copy and the source data are still consistent (i.e. whether the cache is dirty), and if not, it automatically evicts that entry. This costs some CPU, but requires no manual cleanup and is highly automated. The mechanism between browsers and web servers works this way: checking the file's MD5, or the file's last-modified time. Concretely, every time the browser requests a URL it also sends the MD5 checksum of its cached copy of that file, or the last-modified time the server reported for it (both obtained when the file was first downloaded). The server compares that checksum or timestamp against the file on disk; if they match, the file has not changed (the cache is not dirty) and the cached copy can be used directly; otherwise the new content of the file is read and returned to the browser. This check does cost the server something, so it is often paired with another cleanup mechanism such as a "timeout check": instead of validating on every request, we first look at how long the entry has been cached, and only revalidate (recompute the MD5 or recheck the last-modified time) once that timeout has elapsed. The price is that the cache may serve dirty data within the timeout window.

[Web server static cache example]
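
A sketch of such a freshness check, comparing a cached MD5 against the file on disk and only revalidating after a timeout (the timeout value and entry structure are illustrative):

```python
import hashlib
import time

CHECK_TIMEOUT = 5.0  # seconds between revalidations of one entry

# path -> {"body": bytes, "md5": str, "checked_at": float}
cache = {}

def file_md5(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def get(path):
    entry = cache.get(path)
    now = time.time()
    if entry is not None:
        # Within the timeout window, trust the cache (it may be briefly dirty).
        if now - entry["checked_at"] < CHECK_TIMEOUT:
            return entry["body"]
        # Otherwise revalidate: an unchanged MD5 means the cache is still clean.
        if file_md5(path) == entry["md5"]:
            entry["checked_at"] = now
            return entry["body"]
    # Missing or dirty: reload from disk and refresh the cache entry.
    with open(path, "rb") as f:
        body = f.read()
    cache[path] = {"body": body,
                   "md5": hashlib.md5(body).hexdigest(),
                   "checked_at": now}
    return body
```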

So much for cleaning up runtime-static caches; now consider cache data that changes at runtime. If, while the server runs, interaction between users and the server causes cached data to change, we call it a "runtime-changing cache". For example, in an online game a character's data is read from the database at login into the server's cache (heap memory, memcached, shared memory, and so on), and as we keep playing, that data keeps being modified; this is a runtime-changing cache. Such data raises cleanup problems on both the read side and the write side. Because the cached data changes, another process reading your character data straight from the database will see something inconsistent with the live game; and if the server process dies suddenly, the level you just gained may vanish from the memory cache, wasting hours of effort, which is a problem of write-back (cleanup of cached writes). The same situation is common in e-commerce. The classic case is online train-ticket sales: the ticket data cached in memory must have a proper cleanup mechanism, otherwise two people end up buying the same seat; but without the cache, the server cannot survive a crowd of users all grabbing tickets at once. So runtime-changing caches need dedicated cleanup strategies.

In real operation, runtime-changing data grows with the number of users, so the first thing to consider is that the cache space may not be enough. We cannot put all the data into the cache, nor can we clean everything up at once when cleanup is needed, so we generally partition the data. Two partitioning strategies are common: by importance, and by how the data is used (its access frequency).

Take "partitioning by importance" first. In an online game, even within one character's data, some changes are written back to the database (the write cache is cleaned) immediately, others after a delay, and some only when the character logs out. A player's level change (levelling up) and the acquisition or consumption of weapons and equipment, the data players care most about, are written back essentially at once; these are the most important cached items. Changes to experience points or to current HP and MP can be written back after a delay, because even if the cached value is lost the player will not mind much. Finally, things like the character's X/Y coordinates in a room or zone and chat logs may be written back only on logout, or not at all. That covers cleaning the "write cache" by importance; next comes cleaning the "read cache" by how hot the data is.

Suppose we build an online shop with many products. Some products are retrieved by users frequently, they are hot, and others are not. The stock, sales figures, and reviews of best-sellers change often, while slow-moving items barely change. So when designing the cache we should decide which products' data to cache according to how often each is accessed. The cache structure should also carry counters for reads and writes: if some entry is read or written too rarely, or sits idle (nobody reading or writing it) for a long time, the cache should proactively evict it so that new data can take its place. This is often called a "hot/cold exchange" strategy. The key to implementing it is a sensible hot/cold statistic. Fixed indicators and algorithms often fail to adapt to different hardware and network conditions, so dynamic algorithms are generally used; Redis, for example, adopted five such eviction behaviours:

1. Among keys with an expiry set, evict the one that has gone unused the longest

2. Among keys with an expiry set, evict the one closest to expiring

3. Among keys with an expiry set, evict one at random

4. Evict a key at random, whether or not it has an expiry

5. Ignore expiry and evict according to the LRU principle. LRU means "least recently used": the entry that has gone the longest without being accessed. The idea is that if a piece of data has not been accessed recently, the chance of it being accessed soon is small. LRU is a common principle in operating systems, for example in page-replacement algorithms (alongside FIFO, LFU, and others). Implementing LRU takes a little care; a minimal sketch follows this list, and the keyword "LRU" is worth searching for more complete treatments.
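
A minimal LRU cache sketch using an ordered dictionary (the capacity and API are illustrative; production systems such as Redis use approximations rather than an exact LRU list):

```python
from collections import OrderedDict

class LRUCache:
    """Evict the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()   # oldest entries at the front

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # "a" becomes most recently used
cache.put("c", 3)       # evicts "b"
print(cache.get("b"))   # None
```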

There are far more cache-cleanup strategies than the ones above. To use caching well, you must study the characteristics of the data to be cached, how its reads and writes are distributed, and how its pieces differ from one another, and then use knowledge of the business domain to design the most reasonable cleanup strategy. There is no universally optimal cache-cleanup strategy; there are only strategies optimized for a particular business domain, which requires programmers to understand that domain and discover the patterns behind the data.

The concept of distribution strategy

The performance of any single server is limited; facing the massive demands of internet-scale traffic, it is impossible for one server, or one CPU, to bear it all. So from the very beginning of a runtime architecture design we consider how to spread the load across multiple CPUs and multiple servers; this is the distribution strategy. The idea of distributed servers is simple, but implementing it is more involved, because the programs we write are usually designed around one CPU and one block of memory, and making many programs run at the same time and cooperate requires considerably more low-level work.

The first technology to support the distributed idea was multi-processing. In the DOS era a computer could run only one program at a time; writing code while listening to MP3s was impossible. Under Windows 95 you could open several windows at once, each backed by a running program, and Unix and later Linux have long supported multi-process operation. Multi-processing means the operating system can run several of our programs at the same time, each running as if it had the CPU and memory to itself. With a single CPU the machine actually time-slices among processes, switching the CPU between them; with multiple CPUs or cores, several processes genuinely run at once. A process is thus a run-time "program box" provided by the operating system into which we can put any program we want to run. Once we master the operating system's multi-process facilities, we can split the work a server does into several parts, write them as separate programs, and use multiple CPUs, multiple cores, or even multiple servers to carry the load.

[Multiple processes making use of multiple CPUs]

An architecture split into multiple processes typically follows one of two strategies. The first splits by function: one process handles networking, another handles the database, a third runs the business logic. The second makes every process identical, differing only in which share of the computation it takes on. With the first strategy, the operating system's own diagnostic tools give an intuitive view of how much each functional module consumes, because along with the "process box" the OS provides full process monitoring: CPU usage, memory consumption, disk and network I/O, and so on. The downside is that deployment and operation are slightly more complex: if any process fails to start, or is misconfigured so that it cannot communicate with the others, the whole system fails. The second strategy, where every process is the same, is very simple to deploy: when capacity runs short, find a few more machines and start more processes, which is what we call parallel (horizontal) scaling.

More complex modern distributed systems combine the two strategies: the system contains several kinds of function-specific processes, and each kind can be scaled out in parallel. Of course, such systems are harder to develop and operate than either "by function" or "purely parallel" alone. With a large number of processes, configuring the whole cluster through configuration files becomes impractical: each running process may talk to many others, and when one process changes its address, the configuration of every peer is affected. So we centralize the management of process addresses, and a change needs to be made in only one place. A large cluster of processes also raises fault-tolerance and scaling problems: when a server in the cluster fails, some processes disappear; when we need more capacity, we add servers and processes. In a long-running system these are routine tasks. If the distributed system has a central coordinating process that monitors the state of all processes and, whenever one joins or leaves the cluster, immediately updates the configuration of all the others, we get a dynamic multi-process management system. The open-source ZooKeeper gives us a ready implementation that can act as the hub of such a dynamic cluster, and since ZooKeeper itself can be scaled in parallel, it is also fault-tolerant. More and more distributed systems now use this dynamic process-management strategy with ZooKeeper at the centre.

[Dynamic process cluster]

When calling services spread across many processes, we also have a choice of strategies, of which three are best known: dynamic load balancing, read-write separation, and consistent hashing. Dynamic load balancing typically collects the load of the candidate processes and dispatches each request to the least-loaded one; it suits homogeneous processes. Read-write separation targets the performance of persistent data, such as database access: a group of processes is dedicated to serving reads, while one (or a few) processes handle writes and propagate each write to the data areas (possibly separate databases) behind the read-serving processes, so that more hardware can be used to serve external traffic. Consistent hashing looks at each task, asks which piece of data it reads or writes and whether that data has some cacheable characteristic, then runs a "consistent hash" over the data's ID or characteristic value to choose the process that will handle it. This dispatch strategy makes very good use of per-process caches (when they exist). For example, if an online game is served by 100 processes, we can use the player's ID as the consistent-hash key for dispatch: if the target process has that player's data cached, all of that player's requests land on it, and the cache hit rate rises sharply. The reason for using a consistent hash rather than an ordinary hash or a modulo is that if some service processes fail, the caches of the surviving processes remain mostly valid, instead of every process's cache in the cluster being invalidated at once. Interested readers can search for "consistent hashing".
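
A compact consistent-hash ring sketch (the number of virtual nodes and the use of MD5 are implementation choices made here for illustration, not prescribed by the article):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to nodes so that removing a node only remaps that node's keys."""

    def __init__(self, nodes, vnodes=100):
        self._ring = []          # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                h = self._hash(f"{node}#{i}")
                self._ring.append((h, node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(str(key))
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["game-proc-%d" % i for i in range(100)])
print(ring.node_for(1001))   # all requests for player 1001 go to one process
```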

Using many processes to exploit many servers, and the many CPU cores within each server, is very effective. But multi-processing brings extra programming complexity. A common rule of thumb is one process per CPU core, which uses the hardware best: run too many processes at once and the operating system wastes a lot of CPU time switching between them. However, many of the APIs we inherited are blocking: file I/O, network reads and writes, database operations, and so on. If only a limited number of processes run programs full of such blocking calls, the CPU is wasted, because the blocked calls leave those few processes idle, waiting for results. To handle more tasks we would have to start more processes to soak up the blocking time, but because a process is the "box" supplied by the operating system, the box is heavy and switching it is costly, so a huge number of parallel processes needlessly consumes server resources. Moreover, process memory is generally isolated, and exchanging data between processes requires operating-system facilities such as sockets, which cost additional performance. We therefore want a parallel technology that is cheaper, easier to communicate within, and simpler to program, and this is where multithreading appears.

[Thread box inside the process box]

Multithreading is characterized by cheap switching and shared access to the same memory. At programming time we can hand any function to a new thread, passing it any variables or pointers as arguments, and to communicate with a running thread we simply read and write the variables those pointers refer to. When a program performs many blocking operations we can start many threads: the idle CPU time is used better, and because switching threads costs much less than switching processes, far more of the CPU goes to useful work. A thread is a smaller "program box" than a process; it holds a single function call rather than a whole program. In principle, multiple threads inside one process might only soak up the idle time of a single core rather than exploit several cores in parallel; but in virtual-machine languages such as Java and C#, the multithreading implementation maps threads as far as possible onto whatever units the operating system schedules, so that the cores are used. For example, on Linux 2.6 and later, which provides the NPTL kernel-thread model, the JVM maps Java threads to NPTL kernel threads and thereby exploits multicore CPUs; on Windows the thread itself is the system's smallest scheduling unit, so multithreading likewise uses multiple cores. So when programming in Java or C#, multithreading often gives us both benefits at once: the multicore utilization of multi-processing and the low switching overhead of threads.

Some early chat-room services combined multi-processing and multithreading. The program started several broadcast chat processes, each representing one room; each user connecting to a room got a thread of their own, which blocked reading that user's input stream. In an environment where only blocking APIs are available, this model is simple and effective.
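
A sketch of that thread-per-connection model for a single room, using blocking sockets (the port and the line-per-message protocol are illustrative assumptions):

```python
import socket
import threading

clients = []                 # sockets of everyone in this room
clients_lock = threading.Lock()

def handle_user(conn):
    """One thread per user: block on their input and broadcast each line."""
    with conn:
        for line in conn.makefile("rb"):
            with clients_lock:
                for other in clients:
                    if other is not conn:
                        other.sendall(line)
    with clients_lock:
        clients.remove(conn)

def room_server(port=7000):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", port))
    srv.listen()
    while True:
        conn, _addr = srv.accept()
        with clients_lock:
            clients.append(conn)
        threading.Thread(target=handle_user, args=(conn,), daemon=True).start()

# room_server()  # each chat room would be one such process
```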

When we use threads extensively, we find that despite their many advantages they still have two clear drawbacks: their memory footprint is large and not very controllable, and multiple threads operating on the same data must deal with the complicated problem of "locks". Because a thread runs a function call in parallel, that function may call many sub-functions, and every level of call occupies new stack memory; with many threads running at once there are many stacks, and together they can add up to a large memory footprint. Server-side programs want their resource usage to be as predictable as possible rather than swinging widely, because you never know when memory will run out and the machine will go down; in a heavily threaded program, the stacks may grow beyond the expected memory consumption and take the service down. As for the "lock" problem around shared memory, it has always been a thorny topic: many threading libraries ship large collections of "lock-free" or "thread-safe" containers and class libraries for coordinating threads, and the sheer complexity of these tools is itself evidence of how tricky memory use under multithreading is.

[Multiple threads running at the same time are parallel]

Because multithreading still has these drawbacks, many programmers reached for a more drastic idea: we use multithreading largely because blocking APIs exist, for example a read() call that stops the current thread until data arrives, so why not make these operations non-blocking? select/epoll are the non-blocking, readiness-notification APIs that Linux offers for this. With non-blocking functions we no longer need many threads waiting on blocking calls concurrently; a single thread can loop, checking the state of each operation, handling whatever has a result, and looping on for whatever does not. Such programs tend to end up with one big loop called the main loop, inside which the programmer dispatches each event and the handling logic for each state. The CPU no longer switches between many threads and there are no complicated parallel data locks to manage, because only one thread is running. This is what is usually meant by the "concurrency" (asynchronous, event-driven) model.
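
A sketch of such a main loop using Python's selectors module (which wraps epoll on Linux); the echo behaviour is just a stand-in for real request handling:

```python
import selectors
import socket

sel = selectors.DefaultSelector()   # epoll on Linux, kqueue on BSD/macOS

def accept(server_sock):
    conn, _addr = server_sock.accept()
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ, handle)

def handle(conn):
    data = conn.recv(4096)          # will not block: readiness was reported
    if data:
        conn.sendall(b"echo: " + data)
    else:
        sel.unregister(conn)
        conn.close()

def main_loop(port=8080):
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", port))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)
    while True:                     # the single-threaded "main loop"
        for key, _events in sel.select():
            callback = key.data     # accept() or handle(), stored at register()
            callback(key.fileobj)

# main_loop()
```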

[A waiter taking orders and serving dishes for many tables at once is concurrency]

In fact, the lower levels of the computer have long used a concurrency strategy. For external devices (disk, network card, graphics card, sound card, keyboard, mouse) the computer uses "interrupt" technology; early PC users even had to configure IRQ numbers. The point of interrupts is that the CPU does not block waiting for a device's data; instead, when the data is ready, the device sends the CPU an "interrupt signal" and the CPU goes to handle it. Non-blocking programming behaves similarly: the CPU does not block in an API call waiting on I/O, but handles other logic first and proactively checks the state of those I/O operations on each pass through the main loop.

The most famous comparison of the threaded and asynchronous approaches is Apache versus Nginx in the web-server world. Apache uses a multi-process/multi-threaded model: it starts a batch of processes as a process pool and, when user requests arrive, assigns processes from the pool to individual requests. This saves the cost of creating and destroying processes and threads, but when a large number of requests arrive it still pays a heavy price in process/thread switching. Nginx uses epoll, a non-blocking approach that lets one process handle a large number of concurrent requests without switching back and forth. Under heavy user traffic Apache ends up with a large number of processes, while Nginx makes do with a limited number (for example, one per CPU core), saving a great deal of process-switching overhead compared with Apache, so its concurrency performance is better.

[Nginx: a fixed set of processes, each process handling many clients asynchronously]

[Apache: dynamically many processes, each process handling one client]

In modern server-side software, Nginx's model is simpler to operate and manage and wastes slightly less performance, so it has become the most popular process architecture. But this benefit carries a cost: non-blocking code is more complex to write.

The complexity of distributed programming

Previously our code ran from top to bottom, each line consuming some CPU time, the execution order essentially matching the order in which it was written, and any given line being the only task executing at that moment. When we write distributed programs, our code is no longer as simple as single-process, single-threaded code: we have to write, in one body of code, different pieces that run at the same time. It is like writing the parts for every instrument in a symphony orchestra onto a single sheet of paper. To cope with this programming complexity, the industry has developed a variety of coding forms.

In the multi-process coding model, the fork() function is a very typical representative. In a single piece of code, the part after the fork() call may be executed in a new process; to tell which process the current code is running in, we check fork()'s return value. This effectively merges the code of several processes into one piece and then splits it again using flag variables. It is convenient when the processes are mostly identical ("homogeneous processes"); what it handles worst is a large amount of different logic spread across different processes, in which case the only remedy is discipline around the code near the fork(). A typical pattern is to make the code near fork() a dispatcher: put the different functional code into separate functions and use a flag variable decided before the fork to choose what to invoke.

[Code pattern for dynamic multi-process]
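
A sketch of that dispatcher pattern with os.fork() (POSIX only; the worker functions are placeholders invented for illustration):

```python
import os

def network_worker():
    print(f"[{os.getpid()}] handling network I/O")

def db_worker():
    print(f"[{os.getpid()}] handling database access")

WORKERS = {"net": network_worker, "db": db_worker}

def spawn(role):
    pid = os.fork()
    if pid == 0:                 # child: the code after fork() also runs here
        WORKERS[role]()          # dispatch on the flag decided before fork()
        os._exit(0)              # never fall back into the parent's loop
    return pid                   # parent: remember the child's pid

if __name__ == "__main__":
    children = [spawn(role) for role in ("net", "db")]
    for pid in children:
        os.waitpid(pid, 0)       # parent waits for both workers to finish
```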

With multi-threading APIs things get much better: we can use a function pointer, or an object with a callback method, as the body of a thread, and control the threads through handles or objects. As developers, a small set of APIs for starting and stopping threads is enough to control parallel threads, which is more intuitive than fork()'s multi-process style. But we must keep clear the difference between calling a function directly and creating a new thread to call it: starting the thread returns almost immediately and does not execute the function in sequence; rather, the code inside that function may run interleaved with the code that follows the thread's launch.

Once multithreading turns "a parallel task" into an explicit programming concept wrapped in handles and objects, we naturally want finer and more elaborate control over it, hence the many threading-related tools: thread pools, thread-safe containers, locks, and so on. A thread pool manages threads for us in the form of a "pool": rather than thinking about how to create and recycle threads, we give the pool a policy and submit the task functions to execute, and the pool runs them, for example keeping a set number of threads running concurrently, or keeping a certain number of idle threads around to save the cost of creation and destruction. Unlike processes, whose memory is fully separated, threads can access the same memory, that is, read and write the same variables on the heap, which can produce results the programmer did not anticipate (because we write code assuming it runs sequentially). Containers such as hash tables and queues can be corrupted internally when several threads operate on them at once, so people have built containers that can safely be used by many threads concurrently, plus so-called "atomic" operations, to solve these problems. Some languages, such as Java, provide a keyword to "lock" a variable at the syntax level, guaranteeing that only one thread manipulates it at a time. Because parallel tasks often have a required blocking order, many kinds of locks have been invented: countdown latches, queue-based locks, and so on. The java.util.concurrent library is a large collection of multithreading tools well worth studying. Yet this arsenal of tools also proves that multithreading itself is not an easy technology to wield, only one for which we have not yet found a clearly better replacement.

[Multi-threaded object model]
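
A small sketch of the thread-pool and lock ideas using Python's standard library (the task and the shared counter are made up for illustration):

```python
from concurrent.futures import ThreadPoolExecutor
import threading

counter = 0
counter_lock = threading.Lock()   # protects the shared heap variable below

def task(n):
    global counter
    result = n * n                # the "work"; could be a blocking I/O call
    with counter_lock:            # only one thread updates the counter at a time
        counter += 1
    return result

# The pool keeps a fixed number of worker threads and reuses them for all tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(task, range(100)))

print(counter, sum(results))      # 100 tasks ran; the counter was not corrupted
```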

In multithreaded code, apart from the places where threads are started, most of the code still reads much like single-threaded code in its execution order. Under asynchronous concurrency, however, the code has to be chopped up into callback functions, and from the way the code is organised it is almost impossible to see the order in which those callbacks are expected to run; usually that can only be reconstructed at run time from breakpoints or logs, which is a big obstacle to reading the code. That is why more and more programmers pay attention to coroutines: a technique that lets you write asynchronous programs in a roughly synchronous style, without splitting the code across callback functions. The most important feature of coroutines is a keyword usually called yield. Like return, it hands control back at that line, but it also means that at some later time the program will resume executing from the point of the yield. This frees the code that would otherwise live in callbacks and lets it simply follow the yield. In many client-side game engines, code is written per frame and executed repeatedly at, say, 30 frames per second; to let long tasks be spread across frames rather than blocking and "dropping frames", coroutines are the most natural and convenient tool, and Unity3D ships with coroutine support.
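
A sketch of the yield idea with a Python generator driven by a simple frame loop (the frame scheduler here is a toy, not Unity3D's):

```python
import time

def load_level():
    """A long task split across frames: each yield hands control back."""
    for step in range(5):
        print("loading chunk", step)   # do one slice of work
        yield                          # resume from here on the next frame

def frame_loop(coroutine, fps=30):
    frame_time = 1.0 / fps
    while True:                        # endless frame loop
        try:
            next(coroutine)            # run the task up to its next yield
        except StopIteration:
            break                      # the task finished
        time.sleep(frame_time)         # the rest of the frame's work would go here

frame_loop(load_level())
```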

In a multithreaded synchronous program, the call stack of a thread represents the state of one sequence of processing. In single-threaded asynchronous callback programming, however, a callback function cannot easily know which request's processing it belongs to, so we often have to maintain that state ourselves. The most common practice is to give each concurrent task a serial number (seqid) when it starts and pass that seqid through every callback involved in the task, so each callback knows which task it is serving. If different callbacks need to exchange data, say function B needs the result produced by function A, the seqid can be used as the key to store the result in a shared hash-table container, from which B retrieves it; such a piece of per-task data is usually called a "session". With coroutines, these sessions may not need to be maintained by hand at all, because the coroutine's own stack is the session container: when execution switches back to a coroutine, the local variables on its stack are precisely the results of the earlier processing.
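
A sketch of the seqid/session bookkeeping for callback-style code (the callback names and the fake, instantly-completing I/O are illustrative assumptions):

```python
import itertools

seq = itertools.count(1)
sessions = {}                       # seqid -> per-request state ("session")

def start_request(user):
    seqid = next(seq)
    sessions[seqid] = {"user": user}
    fake_db_query(seqid)            # kick off step A; result arrives via callback

def fake_db_query(seqid):
    on_db_result(seqid, rows=[1, 2, 3])   # pretend the I/O completed at once

def on_db_result(seqid, rows):      # callback A: stash the result in the session
    sessions[seqid]["rows"] = rows
    on_render(seqid)                # hand off to step B

def on_render(seqid):               # callback B: read A's result via the seqid
    state = sessions.pop(seqid)
    print(f"reply to {state['user']}: {len(state['rows'])} rows")

start_request("alice")
```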

[Code characteristics of coroutines]

To tame the complexity of asynchronous callbacks, the industry has also invented many other devices, such as lambda expressions, closures, and the promise model. All of them aim to let code that runs at several different points in time be organised, on the surface of the source, in the order of the business logic.

Finally, a word about functional programming. In the multithreaded model, the greatest complexity parallel code brings is concurrent manipulation of heap memory; that is why we have the locking mechanism and a pile of strategies for avoiding deadlock. Functional programming, because it essentially does not use heap memory that way, needs no locks at all, which makes everything much simpler; the only thing that has to change is our habit of putting state on the heap. Functional languages such as Lisp and Erlang have the linked list as their core data structure, a structure that can represent almost any other. We can put all the state into such a "train" of list data and have a function process the whole string of data, which is another way of carrying program state along. This idea of replacing the heap with the stack and the data being passed is very valuable in the context of multithreaded concurrency.

Distributed programming has always come with a great deal of complexity that affects how we read and maintain code, and we have a range of techniques and concepts that try to simplify it. We may never find one universal solution, but by understanding what each approach is aiming at we can choose the one that best fits our scenario:

• Dynamic multi-process (fork): homogeneous parallel tasks

• Multithreading: parallel tasks whose logic is complex but can be laid out clearly

• Asynchronous callbacks: high performance requirements, but parallel tasks with few blocking points in the middle of processing

• Coroutines: concurrent tasks written in a synchronous style, though not well suited to launching complex, dynamic parallelism

• Functional programming: parallel processing modelled as data flow

Distributed data communication

In distributed programming, dividing up CPU time slices is not itself the hard part; the hardest part is how the many code fragments running in parallel communicate with each other, because no piece of code can do its job entirely alone: it depends to some degree on others. With dynamic multi-processing, we can often only seed shared initial data through the parent process's memory; once running, processes can communicate only through operating-system facilities: sockets, signals, shared memory, pipes, and so on. Whichever one we pick, it brings a pile of fiddly coding. Most of these mechanisms resemble file operations: one process writes, another reads. So many people designed the model called the "message queue", which exposes a put-message and a take-message interface and can be implemented underneath with sockets, shared memory, or even files. It can handle almost any data-communication scenario, and some implementations can also persist the messages. The drawback is that every message must be encoded, decoded, received, and unpacked, and that processing adds a certain amount of latency.
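
A sketch of inter-process communication through a queue, using Python's multiprocessing module (which handles the underlying pipe and the serialization of each message):

```python
from multiprocessing import Process, Queue

def producer(q):
    for i in range(3):
        q.put({"task_id": i, "payload": "work"})   # message is serialized here
    q.put(None)                                    # sentinel: no more messages

def consumer(q):
    while True:
        msg = q.get()                              # blocks until a message arrives
        if msg is None:
            break
        print("consumer got", msg)

if __name__ == "__main__":
    q = Queue()                                    # backed by a pipe + feeder thread
    procs = [Process(target=producer, args=(q,)),
             Process(target=consumer, args=(q,))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```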

When communicating between threads we can read and write variables on the shared heap directly, which is the fastest and most convenient option. The drawback is that several threads may use the same variable at once, producing unpredictable results, so we designed the "lock" mechanism for variables; but using locks well is its own problem, because of the possibility of deadlock. So we generally lean on "thread-safe" containers as the medium of communication between threads, together with the many kinds of "tool locks" that can coordinate the order in which threads execute.

Under single-threaded asynchronous concurrency, communication between sessions can also be done by reading and writing variables directly, with no locking problem, because at any moment essentially only one piece of code is manipulating a variable. We still need to plan and organise those variables, though; otherwise pointers and global variables scattered through the code become a breeding ground for bugs. A common approach is to turn the concept of a "session" into a data container: each piece of code treats its session container as an "inbox", and any other concurrent task that needs to talk to this task puts data into that inbox. In web development, the server-side session mechanism that pairs with cookies is a typical implementation of this idea.

Distributed cache strategy

In a distributed architecture, if we want more stability from the overall system, supporting process fail-over and dynamic scaling, the hardest problem to solve is the state held in each process's memory. Once a process dies, its in-memory state disappears, and it is hard not to affect the service it was providing. So we need a way to make the memory state of a process matter less to the overall service, ideally turning processes into "stateless" services. Of course, unless the state is written to disk, some process still has to host it. In today's popular web development model, PHP + Memcached + MySQL, PHP is stateless because the state lives in Memcached; PHP processes can therefore be destroyed or created at any time, but the Memcached process has to stay stable, and Memcached, being a separate process, adds communication latency of its own. We would like a more flexible and general scheme for keeping process state, which we call a "distributed cache" strategy: reads should be as fast as possible, ideally close to reading and writing the process's own heap, while the cached data is spread over multiple processes, providing high-throughput service in a distributed form. The crux of the problem is synchronizing the cached data.

[PHP commonly uses Memcached as the cache]

To solve this problem, the first step is to break it down:

First, the cache should be objects in some defined form, not variables of arbitrary kinds, because we need to manage these caches in a standard way. Although C++ provides operator overloading and we could redefine what the "=" assignment means for writes to a variable, that practice is no longer recommended. The most common model, and one that suits the cache concept well, is the hash table: everything stored in a hash table (or anything with a map interface) is split into a key and a value, so we put the data we want to cache into the "table" as values, take it out again by key, and the "table" object itself represents the cache.

Second, we need this "table" to be usable across multiple processes. If the data in each process were unrelated, the problem would be simple; but if we want to write data into the cache in process A and read it back in process B, it gets more complicated: our "table" must be able to synchronize data between A and B. Three strategies are generally used: lease cleanup, lease forwarding, and modification broadcast (a rough sketch of the lease bookkeeping follows the list below).

• Lease cleanup. The process that caches the data for some key is said to hold the "lease" on that key, and the lease is registered somewhere all processes can reach, for example a ZooKeeper cluster. When a read or write occurs and the local process has no cached copy, it first looks up the lease; if another process holds it, that process is told to "clean up", which usually means deleting the cached copy of the data to be read, or writing the data back to the database or other persistent storage, after which normal reads and writes may rebuild the cache on the new process. This strategy performs best when the cache hit rate is high, because most operations proceed directly without consulting the lease; but if the hit rate is low, the cache keeps "migrating" between processes, which severely reduces the system's throughput.

• Lease forwarding. As above, the process that caches a key's data holds the "lease" on it, registered in the cluster's shared-data process. Unlike lease cleanup, when a process finds that the lease is held elsewhere, it forwards the entire read/write request over the network to the lease holder and waits for the result to come back. Because every operation has to look up the lease, this approach performs somewhat worse per operation, but when the hit rate is not high it can beat lease cleanup: the cache operations are shared across many processes and the cache never has to be torn down and rebuilt.

• Modification broadcast. Both strategies above must maintain a lease for the cached data, and lease bookkeeping itself costs performance. Sometimes a simpler strategy that tolerates some inconsistency is acceptable. For reads, every time a node reads the cache it checks whether a preset read cool-down X has elapsed; if so, it discards the entry and rebuilds it from persistent storage. For writes, the node checks a preset write cool-down Y before triggering a cleanup. Cleanup comes in two flavours: if the data is small, broadcast the modified data itself; if it is large, write it back to persistent storage and broadcast only a cleanup notice. This carries a risk of inconsistency, but if the data's requirements are not too strict and the hit rate is reasonably assured (for example the caching process is chosen by a consistent hash on the key), then data actually going inconsistent because a write broadcast arrived late is rare. The strategy is very simple to implement: no central process maintaining leases, no complex synchronization logic, just the ability to broadcast plus some configuration of the write path, and you get an efficient cache service. So "modification broadcast" is the most common tool wherever near-real-time synchronization is needed but strict consistency is not. The familiar DNS cache is close to this strategy: a change to the IP behind a domain name does not take effect on every DNS server in the world immediately, but propagates to the other servers over some time, and each DNS server holds a large cache of other domains' records.
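
As a rough, in-process illustration of the lease bookkeeping behind the first two strategies (a real system would keep the lease table in something like ZooKeeper and actually flush the holder's copy; everything here is a simplification with made-up names):

```python
class LeaseTable:
    """Records which node currently holds the cache lease for each key."""

    def __init__(self):
        self._owner = {}             # key -> node name

    def acquire(self, key, node):
        holder = self._owner.get(key)
        if holder is None or holder == node:
            self._owner[key] = node
            return None              # no conflict: this node may cache locally
        return holder                # someone else holds the lease

    def release(self, key, node):
        if self._owner.get(key) == node:
            del self._owner[key]

def load_from_db(key):
    return f"value-of-{key}"         # stand-in for persistent storage

leases = LeaseTable()

def read(key, node, local_cache, forward=False):
    holder = leases.acquire(key, node)
    if holder is None:
        return local_cache.setdefault(key, load_from_db(key))
    if forward:
        # "Lease forwarding": send the request to the holder (simulated here).
        return f"(forwarded to {holder})"
    # "Lease cleanup": in a real system the holder would flush its copy first;
    # here we only reassign the lease and rebuild the cache locally.
    leases.release(key, holder)
    leases.acquire(key, node)
    return local_cache.setdefault(key, load_from_db(key))

cache_a, cache_b = {}, {}
print(read("player:1", "A", cache_a))                # A takes the lease and caches
print(read("player:1", "B", cache_b, forward=True))  # B forwards to A
```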

Summary

In high-performance server architecture, caching and distribution are usually used together. Both strategies have countless concrete forms and have grown into many technical schools of thought, but only by clearly understanding the principles behind these techniques, and combining them with real business scenarios, can we build a high-performance architecture that truly fits the requirements.
