Explanations and Q&A on "Application Scenarios of Multi-threaded Servers"
Author: Chen Shuo (giantchen_at_gmail)
2010 March 3 - rev 01
After the article "Application Scenarios of Multi-threaded Servers" (hereinafter "Application Scenarios") was published on my blog, enthusiastic readers raised a number of questions, and I also felt that the original article did not explain its points thoroughly. This article tries to answer those questions with some examples. I had planned to revise the original article, but readers who have already read it might not notice the changes, so I decided to write a new article instead. For ease of reading, it is presented in Q&A form. This article may be revised and expanded repeatedly; please note the version number above.
The definition of "multi-threaded server" in this article is the same as in the previous one; for details, see "Common Programming Models for Multi-threaded Servers" (hereinafter "Common Models"). "Connection" and "port" below refer to the TCP protocol.
1. How many threads can Linux start at the same time?
On 32-bit Linux, the address space of a process is 4 GB, of which user mode can access about 3 GB. The default stack size of a thread is 10 MB, so a quick calculation shows that a process can start at most about 300 threads at the same time. If the thread stack size is not changed, about 300 is an upper bound, because other parts of the program (data segment, code segment, heap, dynamic libraries, and so on) also occupy memory (address space).
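As a rough check of the estimate above, the sketch below simply divides the usable address space by the per-thread stack size. The 3 GB and 10 MB figures are the ones quoted in the text; the function name `max_threads` is made up for illustration, and the result ignores everything else that lives in the address space.

```python
GB = 1024 ** 3
MB = 1024 ** 2

def max_threads(user_space_bytes: int, stack_bytes: int) -> int:
    """Upper bound on the thread count if stacks were the only users of address space."""
    return user_space_bytes // stack_bytes

# Figures quoted in the text: about 3 GB of user space, 10 MB of stack per thread.
limit = max_threads(3 * GB, 10 * MB)
print(limit)  # 307
```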
On 64-bit systems, the number of threads can be much larger. I have not tested the exact number, because I have never actually needed that many threads.
The following discussion takes 32-bit Linux as an example.
2. Can multithreading improve concurrency?
If "concurrency" means the number of concurrent connections, no.
As question 1 shows, with the thread-per-connection model the maximum number of concurrent connections is about 300, far lower than what an event-based single-threaded program easily achieves (tens of thousands, even hundreds of thousands). "Event-based" here means the programming model built on an IO multiplexing event loop, also known as the Reactor pattern, which was introduced in "Common Models".
So what about the event loop per thread model recommended in "Common Models"? It is at least no worse than a single-threaded program.
Conclusion: thread per connection does not scale and is not suitable for high-concurrency scenarios. The concurrency of event loop per thread is no worse than that of a single-threaded program.
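To make the "IO multiplexing event loop" idea concrete, here is a minimal sketch in Python using the standard `selectors` module. It is not taken from "Common Models": the handler name `on_readable` is hypothetical, and `socket.socketpair()` is used only as an in-process stand-in for accepted TCP connections so that the example is self-contained.

```python
import selectors
import socket

sel = selectors.DefaultSelector()

def on_readable(conn: socket.socket) -> None:
    """Event handler: read what is available and reply without blocking the loop."""
    data = conn.recv(1024)
    if data:
        conn.sendall(data.upper())

# Two in-process "connections"; socketpair() stands in for accepted TCP sockets.
pairs = [socket.socketpair() for _ in range(2)]
for server_side, _client_side in pairs:
    server_side.setblocking(False)
    sel.register(server_side, selectors.EVENT_READ, on_readable)

for i, (_server_side, client_side) in enumerate(pairs):
    client_side.sendall(f"hello {i}".encode())

# The event loop: one thread waits on all connections and dispatches the ready ones.
handled = 0
for _ in range(10):                      # bounded so the demo always terminates
    if handled == len(pairs):
        break
    for key, _events in sel.select(timeout=1):
        key.data(key.fileobj)            # call the handler registered for this socket
        handled += 1

replies = [client_side.recv(1024) for _server_side, client_side in pairs]
print(replies)  # [b'HELLO 0', b'HELLO 1']

for server_side, client_side in pairs:
    sel.unregister(server_side)
    server_side.close()
    client_side.close()
sel.close()
```

A single thread serves both connections here; the number of connections is limited by file descriptors rather than by thread stacks.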
3. Can multithreading increase throughput?
For compute-intensive services, no.
Suppose there is a time-consuming computing service that takes 0.8 s in a single thread. On an 8-core machine, we can start eight threads to serve requests together (if memory is sufficient, starting eight processes works just as well). A single computation still takes 0.8 s, but since these computations can run at the same time, the throughput can ideally rise from the single-threaded 1.25 cps (calc per second) to 10 cps. (In practice you will have to discount this figure, possibly heavily.)
Suppose we switch to a parallel algorithm and use all 8 cores for one computation. In theory, if it is perfectly parallel with a speedup of 8, the computing time is 0.1 s and the throughput is still 10 cps, but the response time of the first request drops considerably. In reality, according to Amdahl's law, even if 95% of the algorithm is parallelizable, the speedup on 8 cores is only about 6, the computing time is about 0.133 s, and the throughput falls to about 7.5 cps. In some applications, however, paying this price for a better response time is worthwhile.
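The arithmetic above can be reproduced with a few lines of Python. This is only a worked check of the numbers in the text; `amdahl_speedup` is a helper name chosen here. Note that the exact speedup is about 5.93, which the text rounds to 6, so the exact time and throughput come out as roughly 0.135 s and 7.4 cps instead of 0.133 s and 7.5 cps.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)

single_thread_time = 0.8   # seconds per computation, from the text
cores = 8

# Eight independent threads (or processes), each doing a whole computation.
multi_instance_throughput = cores / single_thread_time

# One computation spread over eight cores, 95% parallelizable.
speedup = amdahl_speedup(0.95, cores)
parallel_time = single_thread_time / speedup
parallel_throughput = 1.0 / parallel_time

print(f"independent workers: {multi_instance_throughput:.2f} cps, 0.800 s per request")
print(f"parallel algorithm : {parallel_throughput:.2f} cps, {parallel_time:.3f} s per request")
print(f"speedup            : {speedup:.2f}")
```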
This also answers Question 4.
If the thread-per-request model is used, where each client request is handled by its own thread, throughput drops once the number of concurrent requests exceeds a certain threshold, because more threads mean more context-switching overhead (for analysis and data, see "A Design Framework for Highly Concurrent Systems" by Matt Welsh et al.). Thread per request is the simplest way to use threads and the easiest to program: a multi-threaded program is treated as a bunch of serial programs and written sequentially in synchronous style. For example, in a Java servlet, a page request is handled synchronously by a single HttpServlet#service(HttpServletRequest req, HttpServletResponse resp) function.
To keep throughput stable when the number of concurrent requests is very high, we can use a thread pool. The pool size should follow the "impedance matching principle"; see question 7.
A thread pool is not a cure-all. If responding to a request involves a fair amount of computing (for example, computing accounts for 1/5 or more of the total response time), using a thread pool is reasonable and simplifies programming. If the thread spends most of a request waiting for IO, other programming models such as Proactor are often used to push throughput further; see question 8.
4. Can multithreading reduce the response time?
Yes, if the design is reasonable and makes full use of multiple cores. The improvement is especially noticeable for bursts of requests.
Example 1: processing input with multiple threads.
Take the memcached server as an example. A memcached request/response can be divided into three steps:
- Read and parse the client input
- Operate on the hash table
- Return the result to the client
In single-threaded mode, these three steps are executed serially. When multithreading is enabled, memcached starts multiple input threads (4 by default) and assigns each new connection to one of them in round-robin fashion; this is exactly the event loop per thread model I mentioned. Step 1 can then run in parallel on multiple threads, which improves the response speed for multiple users on a multi-core machine. Step 2 uses a global lock and is therefore still effectively single-threaded, which is a point worth improving further.
For example, if two users send requests at the same time and their connections happen to be assigned to two different IO threads, step 1 of the two requests can run in parallel on the two threads before the requests converge on step 2 for serial execution. The total response time is then shorter than with fully serial execution (the effect is more obvious when "read and parse" takes a large share of the time). Please continue with the next example.
Example 2: load balancing across multiple threads.
Suppose we want to build a Sudoku-solving service (see "Talking about Data Independence"). The service accepts requests on port 9981. The input is one line of 81 digits (cells to be filled are written as 0), and the output is the 81 filled-in digits (1 to 9), or "NO\r\n" if there is no solution.
Since the input format is very simple, a single thread is enough for IO. Assume each solve takes 10 ms of computing. Calculated as in question 3, the maximum throughput of a single-threaded program is 100 req/s; on an 8-core machine, if a thread pool does the computing, the maximum throughput is 800 req/s. Next we look at how multithreading reduces response time.
Suppose a user sends 10 requests within a very short time. With the single-threaded "handle each one as it comes" model, these requests are processed one after another in a queue (this queue is the operating system's TCP buffer, not a task queue in the program). Ignoring network latency, the response time of the 1st request is 10 ms. The 2nd request gets the CPU only after the 1st has finished, so after another 10 ms of computing its response time is 20 ms. By the same reasoning, the response time of the 10th request is 100 ms, and the average response time of the 10 requests is 55 ms.
If the Sudoku service starts timing when each request arrives, it will measure a response time of 10 ms for every request, whereas from the user's point of view the average response time of the 10 requests is 55 ms. Readers may want to think about where the difference comes from.
Now use multiple threads: one IO thread plus eight computing threads (a thread pool), communicating through a BlockingQueue. With the same 10 concurrent requests, the 1st request is assigned to computing thread 1, the 2nd to computing thread 2, and so on, until the 8th request is taken by computing thread 8. The 9th and 10th requests wait in the BlockingQueue until a computing thread becomes idle. (Note that the assignment is actually done by the operating system, which picks one of the waiting threads, not necessarily in round-robin order.)
The response time of each of the first eight requests is then almost exactly 10 ms. The last two requests form a second batch with a response time of about 20 ms, so the overall average response time is 12 ms, much faster than with a single thread.
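The two averages (55 ms and 12 ms) can be checked with a small simulation. This is an idealized model, not a measurement: all 10 requests are assumed to arrive at time 0, each takes exactly 10 ms, and a free worker picks up the next queued request immediately. The function name `response_times` is chosen here for illustration.

```python
import heapq

def response_times(num_requests: int, workers: int, service_ms: float) -> list:
    """Response time of each request when all arrive at t = 0 and are served in order."""
    free_at = [0.0] * workers            # time at which each worker becomes idle
    heapq.heapify(free_at)
    times = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)   # earliest available worker takes the request
        finish = start + service_ms
        times.append(finish)
        heapq.heappush(free_at, finish)
    return times

single = response_times(10, workers=1, service_ms=10)
pooled = response_times(10, workers=8, service_ms=10)

print(sum(single) / len(single))  # 55.0
print(sum(pooled) / len(pooled))  # 12.0
```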
Because Sudoku puzzles differ in difficulty, a simple puzzle may be solved in 1 ms while a complex one takes up to 10 ms. In that case the thread pool approach has an even clearer advantage: it effectively lowers the chance that simple tasks are held up behind complex ones.
The examples above are all compute-intensive, that is, the thread does not wait for IO while responding to a request. More complex cases are discussed below.
5. How does a multi-threaded program overlap IO and "computing" to reduce latency?
The basic idea is to hand the IO operation (usually a write) to another thread through a BlockingQueue, so that the current thread does not have to wait.
Example 1: Logging
In multi-threaded server programs, logging is critical. This example considers only writing to a log file, not a log server.
Responding to one request may involve writing several log messages. If the file is written synchronously (fprintf or fwrite), performance is likely to suffer because:
- File operations are generally slow, so the service thread waits on IO, leaving the CPU idle and increasing the response time.
- Even with a buffer, things do not improve much. Multiple threads write at the same time, and a lock is usually needed so that their output does not get interleaved in the buffer. This makes service threads wait for each other and reduces concurrency. (Using several log files at once is not a solution either: unless you have several disks and make sure the log files are spread across them, you are still constrained by the disk IO bottleneck.)
The solution is to have a dedicated logging thread write the disk file and expose its interface through one or more BlockingQueues. When another thread wants to write a log message, it prepares the message (a string) and puts it into the queue, which involves essentially no waiting. The computing of the service threads thus overlaps with the disk IO of the logging thread, reducing the response time of the service threads.
Although logging is very important, it is not the main logic of the program, so the less it affects the program structure the better. Ideally it should be as simple as a printf statement, with no need to worry about performance overhead. A good multi-threaded asynchronous logging library can do this for us. (Apache log4cxx and log4j both support AsyncAppender for asynchronous logging.)
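A minimal sketch of the dedicated logging thread, written in Python with `queue.Queue` playing the role of the BlockingQueue. The class name `AsyncLogger` and its methods are hypothetical, and an in-memory `io.StringIO` stands in for the log file; a real logging library would add timestamps, levels, batching, and error handling.

```python
import io
import queue
import threading

class AsyncLogger:
    """Hypothetical helper: service threads enqueue messages, one thread writes them."""

    _STOP = object()  # sentinel that tells the writer thread to exit

    def __init__(self, stream):
        self._stream = stream
        self._queue = queue.Queue()       # plays the role of the BlockingQueue
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def log(self, message: str) -> None:
        # Called by service threads: only enqueues, never touches the file.
        self._queue.put(message)

    def close(self) -> None:
        # Everything queued before the sentinel is written before the thread exits.
        self._queue.put(self._STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            item = self._queue.get()      # blocks until a message arrives
            if item is self._STOP:
                break
            self._stream.write(item + "\n")
        self._stream.flush()

sink = io.StringIO()                      # stands in for the log file
logger = AsyncLogger(sink)
workers = [
    threading.Thread(target=logger.log, args=(f"request {i} handled",))
    for i in range(4)
]
for w in workers:
    w.start()
for w in workers:
    w.join()
logger.close()

lines = sorted(sink.getvalue().splitlines())
print(lines)
```

Only the logging thread ever writes to the stream, so the service threads need no lock around the file itself; the queue is the only shared structure.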
Example 2: memcached Client
Suppose we use memcached to store the time of each user's most recent post. Every time we respond to a posting request, the program has to set a value in memcached. If this step uses synchronous IO, latency increases.
For a write-only idempotent operation such as "set a value", we do not actually need to wait for memcached to return the result, and here we do not care if the set fails, so we can use multithreading to reduce response latency. For example, we can write a multi-threaded memcached client: for a set operation, the caller only prepares the key and value and calls asyncSet(), which puts the data on a BlockingQueue and returns immediately, so latency is very low. The rest is left to the memcached client's own thread, and the service thread is never blocked.
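The fire-and-forget set described above might look roughly like the following Python sketch. Everything here is an assumption made for illustration: `AsyncSetClient` and `async_set` are hypothetical names, and `backend_set` is a placeholder callable standing in for a real blocking memcached set, so that the example runs without a memcached server.

```python
import queue
import threading

class AsyncSetClient:
    """Hypothetical fire-and-forget client; backend_set stands in for a blocking set call."""

    def __init__(self, backend_set):
        self._backend_set = backend_set
        self._queue = queue.Queue()       # plays the role of the BlockingQueue
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def async_set(self, key, value) -> None:
        # Returns immediately; the caller never waits for the network.
        self._queue.put((key, value))

    def drain(self) -> None:
        # For the demo only: wait until every queued set has been handled.
        self._queue.join()

    def _run(self) -> None:
        while True:
            key, value = self._queue.get()
            try:
                self._backend_set(key, value)   # the slow, blocking call happens here
            except Exception:
                pass                            # failures are deliberately ignored
            finally:
                self._queue.task_done()

store = {}                                      # stands in for the memcached server
client = AsyncSetClient(store.__setitem__)
client.async_set("user:42:last_post", "2010-03-03T12:00:00")
client.drain()
print(store)  # {'user:42:last_post': '2010-03-03T12:00:00'}
```

Swallowing the exception is acceptable only because the text states that a failed set does not matter in this scenario.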
In fact, all network writes can be made asynchronous in this way. The drawback is that every asyncWrite has to pass data between threads. If the TCP buffer is empty, we can simply write in the current thread without bothering the dedicated IO thread; JBoss Netty uses this technique to reduce latency further.
All of the above covers only the "fire and forget" case. If the operation is, say, getting a value from memcached, overlapping IO cannot reduce the response time, because we have to wait for memcached's reply in any case. In that situation other techniques can be used to improve concurrency; see question 8. (Even though the response time cannot be reduced, there is no reason to leave a thread idle, right?)
In addition, the examples above show that BlockingQueue is a powerful tool for building multi-threaded programs.
6. Why do third-party libraries often use their own threads?
The event loop model usually has no standard implementation. In code you write yourself, you can use the recommended Reactor approach throughout, but a third-party library may not fit into your event loop framework. Sometimes a thread is needed to do some serial/parallel conversion.
For Java this problem is less severe, because Java has a standard thread pool implementation called ExecutorService. If a third-party library supports thread pools, it can share one ExecutorService with the main program instead of creating a pile of threads of its own (for example, by being passed the main program's object at initialization). For C++ the situation is much more troublesome: there is no standard library for either Reactor or thread pools.
Example 1: libmemcached supports only synchronous operations
libmemcached supports so-called "non-blocking operations", but it does not expose a file descriptor that can be used with select/poll/epoll, and its memcached_fetch always blocks. It claims that memcached_set can be non-blocking, but what this really means is that the call does not wait for the result to come back; the function actually calls write() synchronously and may still block on network IO.
If we call libmemcached functions inside our Reactor event handler, the latency is worrying. If we want to keep using libmemcached, we can wrap it in a thread: following example 2 of question 5, a dedicated extra thread does the memcached IO while the main body of the program remains a Reactor. We can even inject memcached's "data ready" as an event into our event loop to improve concurrency further. (An example is left for question 8.)
Fortunately, the memcached protocol is very simple, so you can write a Reactor-based client yourself. Database clients are not so lucky.
Example 2: the official MySQL C API does not support asynchronous operations
The official MySQL client supports only synchronous operations. For operations such as UPDATE, INSERT, and DELETE, when the execution result is ignored, we can use a separate thread to do them and so reduce the latency of the service thread, just as in the memcached_set example above. SELECT is the troublesome one: making it asynchronous requires a more complex pattern; see question 8.
By contrast, libpq, the C client for PostgreSQL, is much better designed. We can use PQsendQuery() to initiate a query, wait on PQsocket with the standard select/poll/epoll, call PQconsumeInput when data becomes readable, use PQisBusy to determine whether the query result is ready, and finally call PQgetResult to fetch the result. With this asynchronous API, we can easily write a wrapper around libpq and integrate it into the Reactor model the program uses.
7. What is the "impedance matching principle" for thread pool size?
I mentioned the "impedance matching principle" in "Common Models". Here is a rough description.
If the threads in the pool spend a fraction P (0 < P <= 1) of their time on intensive computing while executing a task, and the system has C CPUs in total, then to keep the C CPUs fully loaded without overloading them, the empirical formula for the thread pool size is T = C/P. (T is only a hint. Since the estimate of P is not very accurate, the best value of T may float up or down by 50%.)
I will explain where this empirical formula comes from another time. For now, let us check that it behaves correctly at the boundary conditions.
Assume C = 8 and P = 1.0, so the tasks in the thread pool are purely compute-intensive. Then T = 8: just eight active threads are enough to saturate the eight CPUs, and adding more is useless because the CPU resources are already used up.
Assume C = 8 and P = 0.5, so tasks in the thread pool spend half their time computing and half waiting on IO. Then T = 16. Given that the operating system can flexibly and sensibly schedule sleeping, writing, and running threads, about 16 "50% busy" threads can keep 8 CPUs busy. Starting more threads does not increase throughput; it lowers performance by adding context-switching overhead.
If P < 0.2, the formula no longer applies, and T can be set to a fixed value such as 5 * C.
In addition, C in the formula is not necessarily the total number of CPUs; it can be the number of CPUs "allocated" to this task. For example, if four cores of an 8-core machine are assigned to a task, then C = 4.
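The rule of thumb can be written down as a small helper. This is a sketch of the empirical formula exactly as stated above, including the fixed fallback of 5 * C when P < 0.2; the function name `thread_pool_size` is chosen here, and the result is a starting hint to be tuned, not a guaranteed optimum.

```python
def thread_pool_size(cpus: int, compute_fraction: float) -> int:
    """Hint for the pool size T given C CPUs and compute fraction P (0 < P <= 1)."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    if not 0 < compute_fraction <= 1:
        raise ValueError("compute_fraction must be in (0, 1]")
    if compute_fraction < 0.2:
        return 5 * cpus                   # the formula does not apply; use a fixed value
    return round(cpus / compute_fraction) # T = C / P

print(thread_pool_size(8, 1.0))  # 8
print(thread_pool_size(8, 0.5))  # 16
print(thread_pool_size(4, 0.1))  # 20
```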
8. Apart from the Reactor + thread pool you recommend, are there other non-trivial multi-threaded programming models?
If a single request/response involves dealing with other processes several times, the Proactor model can often achieve higher concurrency. The price, of course, is that the code becomes fragmented and hard to understand.
Take an HTTP proxy as an example. If a request to the proxy misses the local cache, the proxy will most likely:
- Resolve the domain name (do not underestimate this step; resolving an unfamiliar domain name may take half a second)
- Establish a connection
- Send the HTTP request
- Wait for the response
- Return the result to the client
These five steps involve three round trips with two servers:
- Ask the DNS server to resolve the domain name and wait for the answer;
- Initiate a connection to the other side's HTTP server and wait for the TCP three-way handshake to complete;
- Send the HTTP request to the other side and wait for its response.
The HTTP proxy itself does very little computing. If a thread pool is used, the number of threads in the pool will be huge, which is bad for the operating system's management and scheduling.
At this point we have two options:
- Turn "domain name resolved", "connection established", and "response received from the other side" into events and keep programming in Reactor style. A client request can then no longer be handled by one function running from start to finish; it has to be split into several stages, and the request state ("which step are we at now?") has to be managed.
- Use callback functions to chain the tasks together. For example, when a user request arrives and misses the local cache, immediately start an asynchronous DNS resolution with startDNSResolve() and tell the system to call DNSResolved() when resolution completes; in DNSResolved(), initiate a connection and tell the system to call connectionEstablished() once the connection is established; in connectionEstablished(), send the HTTP request and tell the system to call httpResponsed() when the response arrives; finally, in httpResponsed(), return the result to the client. The Begin/End operations that .NET uses extensively follow this programming model as well. Of course, the code looks ugly to those unfamiliar with this style. For examples of the Proactor pattern, see the Boost.Asio documentation.
The Proactor pattern relies on the operating system or a library to schedule these subtasks efficiently. No subtask blocks, so a small number of threads can achieve a high degree of IO concurrency.
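The callback-chaining style for the HTTP proxy example can be illustrated with a toy, single-threaded sketch. Nothing here performs real DNS or network IO: `post` is a hypothetical helper that pretends an asynchronous operation has completed and schedules its completion callback, the Python function names are stand-ins for the callbacks described above, and the address and response strings are placeholders.

```python
from collections import deque

pending = deque()   # toy event loop: completion callbacks waiting to run
trace = []          # records the order in which the stages execute

def post(callback, *args) -> None:
    """Pretend an asynchronous operation finished; schedule its completion callback."""
    pending.append((callback, args))

def handle_request(host: str) -> None:
    trace.append("request received, cache miss")
    post(dns_resolved, host, "192.0.2.1")            # stands in for an async DNS lookup

def dns_resolved(host: str, address: str) -> None:
    trace.append("DNS resolved")
    post(connection_established, host, address)      # stands in for an async connect

def connection_established(host: str, address: str) -> None:
    trace.append("connection established")
    post(http_responded, host, "HTTP/1.1 200 OK")    # stands in for send + async receive

def http_responded(host: str, response: str) -> None:
    trace.append("result returned to client")

handle_request("example.com")
while pending:                      # each stage runs to completion and never blocks
    callback, args = pending.popleft()
    callback(*args)

print(trace)
```

Each stage only starts the next operation and returns, which is why one thread can interleave many requests; the cost, as noted above, is that the logic of a single request is scattered across several functions.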
Proactor can increase throughput but cannot reduce latency, so I have not studied it in depth.
9. How do I choose between model 2 and model 3a?
Here "model" does not mean pattern; unfortunately the two words are translated into Chinese the same way. As described in "Application Scenarios", model 2 is a single multi-threaded process, and model 3a is multiple identical single-threaded processes.
In my view, other conditions being equal, you can choose according to the size of the working set, that is, the amount of memory the service program accesses when responding to one request.
If the working set is large, use multiple threads to avoid the CPU cache being swapped in and out and hurting performance; otherwise, use multiple single-threaded processes and enjoy the convenience of single-threaded programming.
For example, memcached, a heavy memory consumer, is better run as one multi-threaded server than as multiple memcached instances on the same machine. (Unless you are running 32-bit memcached on a machine with 16 GB of memory, in which case multiple instances are necessary.)
As another example, solving Sudoku does not use much memory. If single-threaded programming is more convenient, you can use multiple single-threaded processes and put a single-threaded load balancer in front, modeled on lighttpd + FastCGI.
Threads cannot reduce the amount of work, that is, they cannot reduce CPU time. If solving a problem takes 100 million instructions (not a large number, do not be scared), using multiple threads only increases that number. However, by sensibly distributing those 100 million instructions across multiple cores, we can finish the job ahead of schedule. This sounds like the overall planning method, and that is exactly what it is.
Please note that I usually do not reply to questions from anonymous users in the comments of my CSDN blog. If you would like me to answer a question, please: 1. write to me; 2. follow @bnu_chenshuo on Twitter; or 3. log in before posting a comment. Sorry for any inconvenience.