-- Improve performance while saving you more than 10x in cost
From: http://blog.sina.com.cn/iyangjian
I. How to save CPU
II. How to use memory
III. Reduce disk I/O
IV. Optimize your NIC
V. Adjust kernel parameters
VI. Performance metrics of a web server
VII. Development history of the NBA JS live broadcast system
VIII. Historical issues left over from the Sina Finance real-time quote system (7 bytes = 106,800 RMB/year)
-----------------------------------------------------------------------------------------
I. How to save CPU
1. Select a good I/O model (epoll, kqueue)
Three years ago we were still worrying about the C10K problem; with today's hardware it is no longer an issue. But making a Pentium III 900 box support 50,000+ connections still takes some skill.
epoll is best at watching a large number of idle connections and returning the available descriptors in batches, which is what makes millions of connections on a single machine possible. epoll is supported on Linux 2.6 and later; FreeBSD has kqueue, but I personally prefer Linux and have not paid much attention to kqueue.
Edge-triggered (ET) and level-triggered (LT) modes:
Early documentation said ET was very efficient but a bit adventurous. In fact I suffered with this for more than a month, putting up with 99% CPU utilization; perhaps I simply was not handling it well. Later Zhongying helped change the event-driven mode over to ET, which proved both efficient and stable.
Put simply: under LT, as long as there is data you have not fetched, epoll keeps nagging you to fetch it. ET tells you once and then leaves fetching up to you; unless new data arrives, it will not remind you again.
Let's focus on ET in non-blocking mode.
The man page says that when ET notifies you that data is readable, you should keep reading until read() returns EAGAIN or EWOULDBLOCK, but I did not do that in my implementation; I optimized for my application instead. Most current systems use a maximum transmission unit of 1500 bytes: MTU 1500 - IP header 20 - TCP header 20 = 1460 bytes of payload per frame.
An HTTP request header without cookies is usually only 500-odd bytes, which leaves about 512 bytes for the URI: basically enough, with some to spare.
What if a request header is larger than that, say 2050 bytes?
There are two scenarios: 1. the data arrives back to back and a single read gets it all; 2. there is some interval between the two Ethernet frames.
My method is to read the header with a large buffer, say 1 MB. If you are sure your clients' requests are smaller than 1460 bytes, read once and be done. If a request arrives across several Ethernet frames, that is, a new packet lands just after you finished reading, ET simply fires again and you process it again.
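For reference, the loop the man page recommends looks roughly like this (a sketch of the generic pattern; the author deliberately did not use it, reading into one large buffer instead; handle_data is a hypothetical placeholder):

#include <errno.h>
#include <unistd.h>

/* Hypothetical handler; a real server would parse the request here. */
static void handle_data(const char *buf, size_t n) { (void)buf; (void)n; }

/* Drain an ET-mode, non-blocking fd the way epoll(7) recommends:
 * keep reading until the kernel reports EAGAIN/EWOULDBLOCK. */
static int drain_fd(int fd)
{
    char buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0)
            handle_data(buf, (size_t)n);
        else if (n == 0)
            return 0;                      /* peer closed the connection */
        else if (errno == EAGAIN || errno == EWOULDBLOCK)
            return 1;                      /* drained; wait for the next event */
        else if (errno != EINTR)
            return -1;                     /* real error */
        /* EINTR: interrupted by a signal, just retry */
    }
}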
A word about writing data: you can generally push a dozen or so KB into the kernel buffer with one write.
So for services that mostly send small files, there is no need to allocate a send buffer per connection.
Memory is allocated only when a send does not complete: the leftover data is saved and sent the next time the descriptor is writable.
This avoids memory copies and saves memory.
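A sketch of that shape (struct conn and its fields are hypothetical, not the author's code):

#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write directly from a shared buffer; only if the kernel cannot take
 * everything do we allocate per-connection storage for the remainder. */
struct conn {
    int    fd;
    char  *pending;       /* unsent tail, NULL when nothing is queued */
    size_t pending_len;
};

static int send_response(struct conn *c, const char *buf, size_t len)
{
    ssize_t n = write(c->fd, buf, len);
    if (n < 0) {
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;    /* real error */
        n = 0;            /* kernel buffer full, nothing written */
    }
    if ((size_t)n < len) {   /* save the tail, send it on the next EPOLLOUT */
        c->pending_len = len - (size_t)n;
        c->pending = malloc(c->pending_len);
        if (!c->pending)
            return -1;
        memcpy(c->pending, buf + n, c->pending_len);
    }
    return 0;
}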
Choosing epoll does not mean you have a good I/O model; use it badly and you will not even catch up with select. That is the truth.
I have said a lot about epoll. For the details of descriptor management, see my earlier forum post on using the epoll model and its descriptor-exhaustion problem; the discussion ran to about 18 pages, and I put the solution in the first post. If you are interested in epoll, there is also a simple epoll-based web server example.
Beyond that, use multiple threads or multiple processes, whichever you are more familiar with; each has its advantages.
In multi-process mode a single process crash does not affect the others, and you can bind each worker to its own CPU, leaving some cores free to handle interrupts and system events; multi-threading makes data sharing easy and consumes fewer resources. Fix the number of processes or threads somewhere between (CPU cores - 1) and 2 x CPU cores: more than that causes frequent time-slice switching, fewer fails to exploit multi-core concurrency.
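On Linux the binding itself is a small call; a minimal sketch using sched_setaffinity(2) (the core number is an arbitrary example, not from the original post):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling process/thread to one core, leaving other cores
 * free for interrupts and system events as suggested above. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);   /* 0 = current task */
}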
How to accept() is also an art: there is no best way, only the most suitable way, and you need to run a lot of experiments to find the most efficient approach for your own case. With a good I/O framework in place, it is hard to be inefficient even if you try. This sets the overall shape of the implementation.
For more information about network I/O models, see <Scalable Network Programming> (a Chinese translation is available).
Beyond all this, it must be emphasized that code and structure should be concise and efficient, and that you must analyze your own concrete situation: there is no all-purpose rule; tailor everything to your service.
2. Disable unnecessary standard input and standard output

close(0);  /* stdin  */
close(1);  /* stdout */

If you are careless and leave printf debugging output behind, it is a certain performance killer.
A high-performance server should produce no output at all unless something goes wrong, so nothing delays the real work.
Doing this also saves you at least two descriptors.
3. Avoid locks (i++ or ++i)
Locking in multi-threaded programs is common, to the point of habit.
Ideally, though, each thread should run independently, with no synchronization mechanism at all.
Locks consume resources and cause queueing, even deadlock; avoid them wherever you can.
When you cannot avoid shared state, for example when computing each thread's load in real time requires multiple threads to write a global variable,
use ++i, because it is an atomic operation.
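Whether a bare ++i actually compiles to a single atomic instruction depends on the platform and compiler; if you want the increment guaranteed atomic without a lock, a GCC builtin is one option (a sketch, not from the original post):

/* A guaranteed-atomic increment via a GCC builtin, independent of how
 * a bare ++i happens to compile on a given platform. */
static volatile long req_count;

static void count_request(void)
{
    __sync_add_and_fetch(&req_count, 1);
}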
4. Reduce system calls
System calls are expensive: each one dives into the kernel and climbs back out.
Avoid crossings between user space and kernel space wherever possible.
For example, I want to stamp each request with a time for timeout accounting, so I call time() just once after each batch of ready descriptors is returned, rather than once per request. time() is only accurate to the second anyway, and a batch of requests is processed within milliseconds, so per-request calls are pointless; besides, what harm can a one-second error in a timeout calculation do?
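The idea in a sketch (struct conn and handle_event are hypothetical placeholders):

#include <sys/epoll.h>
#include <time.h>

struct conn { int fd; time_t last_active; };   /* hypothetical per-connection state */
static void handle_event(struct conn *c) { (void)c; /* read/write here */ }

/* One time() call per epoll_wait() batch; every ready descriptor is
 * stamped with the cached value instead of issuing its own syscall. */
static void event_loop(int epfd, struct epoll_event *events, int maxevents)
{
    for (;;) {
        int n = epoll_wait(epfd, events, maxevents, -1);
        time_t now = time(NULL);               /* single syscall per batch */
        for (int i = 0; i < n; i++) {
            struct conn *c = events[i].data.ptr;
            c->last_active = now;
            handle_event(c);
        }
    }
}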
5. Connection: close or keep-alive?
You cannot discuss httpd implementation without persistent connections, i.e. keep-alive.
Keep-alive was added in HTTP 1.1, and by now 99.99% of browsers support it.
First, what keep-alive is:
it rides on a TCP connection, i.e. one descriptor (fd); it does not mean a dedicated process or thread. In non-blocking mode a single thread can maintain thousands of persistent connections.
A complete HTTP 1.0 request and response:
Establish a TCP connection (SYN; ACK+SYN; ACK: three segments complete the handshake)
Request
Response
Disconnect (FIN; ACK; FIN2; ACK2: four segments close the connection)
And an HTTP 1.1 request and response:
Establish a TCP connection (SYN; ACK+SYN; ACK: three segments complete the handshake)
Request
Response
...
...
Request
Response
Disconnect (FIN; ACK; FIN2; ACK2: four segments close the connection)
Even if each request and each response fits in a single segment, HTTP 1.0 must transmit at least 11 segments to deliver one segment of payload (supplement: each request and response also needs an ACK confirmation; 3 handshake + request + ACK + response + ACK + 4 close = 11). Persistent connections make full use of the established connection, avoiding the cost of repeated setup and teardown and reducing network congestion.
I ran a test on a 2 CPU * 4 core server: accepting connections continuously and closing them immediately without any processing, the ceiling was about 70,000 accepts per second. That is the limit. So what do you do if you want to handle more than 100,000 HTTP requests per second?
At present the only and best choice: persistent connections.
For example, opening the NBA JS live page sends 6 HTTP requests to my JS server, then generates on average 2 more every 10 seconds. Many of our pages also embed several images from the static pool. If every request were independent (connect, then close), that would waste a lot of resources.
Persistent connections are good, but the keep-alive timeout depends on your application. In the NBA JS broadcast a client is certain to send a request within 10 seconds, so I set the timeout to 15 seconds; if a connection is idle for 15 seconds, that user has presumably wandered off, and the resources must come back to me. Set the timeout too long and the accumulated connections can crush your server.
Then why, on some heavily loaded Apache servers, does turning keep-alive off actually reduce the load?
Apache has two working modes, prefork and worker; Apache 1.x supports only prefork.
Prefork is a typical process pool, creating processes in batches, and Apache is implemented on select. With few users, persistent connections are genuinely useful: they save segments and improve response time. But past a certain tipping point, keeping many persistent connections alive means keeping too many processes alive; the system becomes overwhelmed, memory runs short, and the CPU is eaten by process overhead. At that point it is cheaper for Apache to set up a fresh connection each time than to hold a process hostage to a long connection.
6. Preprocessing (pre-compression, prefetched last-modified, prefetched MIME type)
The principle of preprocessing: never compute a result a second time when it can be known in advance.
Pre-compression: we started using pre-compression two or three years ago to save CPU, and the mighty Microsoft began using it in IIS 7. Pre-compression means the data is already compressed at the source; it stays compressed while being synchronized across IDCs, all the way through the web server's output, and is finally decompressed by the user's browser.
Prefetched last-modified: if a file has not been updated, do not fetch its last-modified time a second time; do not forget that the fstat system call is expensive.
Prefetched MIME type: if you serve no more than 256 file types, one byte can identify the type and the header can be emitted by array subscript. There is no need, on seeing a .js file, to strcmp() through a hundred-odd extensions before you discover you should output Content-Type: application/x-javascript; that approach burns more CPU as the number of types grows. You could write a hash function instead, but that still costs at least a function call plus some value arithmetic, and the hash table ends up several times the size of the actual data.
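A minimal sketch of the one-byte lookup; the enum values and table contents are illustrative, not the original implementation:

/* Up to 256 types fit in one byte; the type byte is computed once per
 * file (e.g. when it enters the cache), never per request. */
enum { MIME_JS, MIME_HTML, MIME_PNG };

static const char *mime_header[256] = {
    [MIME_JS]   = "Content-Type: application/x-javascript\r\n",
    [MIME_HTML] = "Content-Type: text/html\r\n",
    [MIME_PNG]  = "Content-Type: image/png\r\n",
};

/* Per request: a single array lookup, no strcmp chain, no hash arithmetic. */
static const char *content_type(unsigned char type_byte)
{
    return mime_header[type_byte];
}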
How to make better use of the CPU L1 cache
Data decomposition
CPU hard-affinity settings
To be supplemented ...
II. How to use memory
1. Avoid memory copies (strcpy, memcpy)
Memory is fast, but in the hot paths that run most often, avoid copying altogether if you can. When you must copy, prefer memcpy to sprintf and strcpy, because it does not care whether it meets a '\0'. Memory copies and HTTP responses both involve string-length calculations: if you know the length in advance, keep it in a variable instead of recomputing it, and do not use strlen() to track a growing value, because it counts all the way to the '\0' every time. And wherever sizeof() can do the job, do not use strlen: sizeof is an operator, replaced by a constant at compile time rather than evaluated at run time.
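Both rules in a small sketch (names are illustrative):

#include <string.h>

/* Keep a computed length in a variable; use sizeof (a compile-time
 * constant) for string literals. */
static void build_response(char *out, const char *body)
{
    static const char hdr[] = "HTTP/1.1 200 OK\r\n";
    size_t body_len = strlen(body);            /* counted once, reused */

    memcpy(out, hdr, sizeof(hdr) - 1);         /* sizeof resolved at compile time */
    memcpy(out + sizeof(hdr) - 1, body, body_len + 1);   /* +1 carries the '\0' */
}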
2. Avoid copies between kernel space and user space (sendfile, splice and tee)
sendfile: it provides a mechanism, known as "zero copy", for moving data through the Linux network stack: TCP frames can be transmitted directly from host memory to the NIC buffers, avoiding two context switches. For more information, see <Use sendfile() to optimize data transmission>. According to a colleague's tests, SSDs are very efficient at random reads of small files; for image services that are rarely updated, heavily read, and serve files that are not very large, sendfile + SSD should be a perfect match.
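For reference, a minimal sketch of how sendfile(2) might push one static file to a connected socket (error paths simplified):

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Zero-copy push of one whole file to a connected socket; short-write
 * handling is omitted for brevity. */
static ssize_t send_file(int sock_fd, const char *path)
{
    struct stat st;
    off_t off = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    ssize_t n = sendfile(sock_fd, fd, &off, st.st_size);  /* no user-space copy */
    close(fd);
    return n;
}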
splice and tee: the real concept behind splice is a "random kernel buffer" exposed to user space. "That is, splice and tee operate on a user-controlled kernel buffer; splice moves data from an arbitrary file descriptor into the buffer (or from the buffer to a descriptor), while tee copies the data in one buffer to another. So in a very real (if abstract) sense, splice is equivalent to read/write on the kernel buffer, and tee is equivalent to memcpy from one kernel buffer to another." I think this technique is a good fit for a proxy, because data can move directly from one socket to another without ever crossing between user and kernel space, which sendfile cannot do. For more information, see <splice and tee in Linux kernels after 2.6.17>; for a complete example program, see man 2 tee.
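A sketch of the proxy idea (error-path cleanup simplified; in a real server the pipe would be reused, not created per call):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Forward up to len bytes from one socket to another through a kernel
 * pipe buffer; the payload never crosses into user space. */
static ssize_t proxy_bytes(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t n;

    if (pipe(p) < 0)
        return -1;
    n = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
    if (n > 0)
        n = splice(p[0], NULL, out_fd, NULL, (size_t)n, SPLICE_F_MOVE);
    close(p[0]);
    close(p[1]);
    return n;
}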
3. How should you clear a buffer? (memset?)
Say we have buffer[1024*1024] and want to clear it before strcat()-ing strings into it (in many cases strcat can itself be replaced by recording the write position and using memcpy).
In fact there is no need to clear the whole thing with memset(buffer, 0x00, sizeof(buffer)); memset(buffer, 0x00, 1) achieves the goal. I prefer buffer[0] = '\0'; instead, which also saves the overhead of a function call.
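The three options side by side (a trivial sketch):

static char buffer[1024 * 1024];

static void reset_buffer(void)
{
    /* memset(buffer, 0x00, sizeof(buffer));   clears all 1 MB: wasted work   */
    /* memset(buffer, 0x00, 1);                one byte is enough for strcat  */
    buffer[0] = '\0';                       /* same effect, no function call  */
}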
4. Memory reuse (must we allocate memory for every response?)
For the NBA JS service, everything we return is compressed data, 99% of it under 15 KB, and essentially every response goes out in one write, so there is no need to allocate memory per response; a shared public buffer is enough. If big data really does turn up, I write out what I can first and keep the rest in memory, waiting for the next chance to send.
5. Avoid frequent dynamic allocation/release of memory (malloc)
This hardly needs saying: if you want a server to run for years, do not keep dynamically allocating and freeing memory. The reasons are simple: 1. it avoids memory leaks; 2. it avoids excessive fragmentation; 3. malloc/free cost performance. Generally you allocate one large block of memory up front and run your own allocation algorithm on top. Buffers allocated to HTTP sessions have the nice property that they can be reclaimed when the fd closes, which avoids leaks. A server author should also know the program's memory consumption at the maximum load it supports.
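One simple shape this can take (a sketch, not the author's allocator; MAX_FD and SLOT_SIZE are illustrative capacity limits):

#include <stdlib.h>

/* One malloc for the process lifetime; a connection's buffer is located
 * by its fd, so closing the fd implicitly recycles the slot. */
#define MAX_FD    65536
#define SLOT_SIZE 4096

static char *pool;

static int pool_init(void)
{
    pool = malloc((size_t)MAX_FD * SLOT_SIZE);
    return pool ? 0 : -1;
}

static char *conn_buffer(int fd)   /* no per-request malloc/free, no fragmentation */
{
    return pool + (size_t)fd * SLOT_SIZE;
}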
6. Byte alignment
Look at the difference between the following two structs:

struct A {
    short size;
    char *ptr;
    int left;
};

struct B {
    char *ptr;
    short size;
    int left;
} b;

Structure B only changes the member order:
On 32-bit Linux, alignment is to 32/8 = 4 bytes: sizeof(A) = 12, sizeof(B) = 12.
On 64-bit Linux, alignment is to 64/8 = 8 bytes: sizeof(A) = 24, sizeof(B) = 16.
A and B come out the same size on a 32-bit machine, but change the int to a short and the effect is different.
If I want to force 2-byte alignment, I can do this:

#pragma pack(2)
struct A {
    short size;
    char *ptr;
    int left;
};
#pragma pack()

Note that the argument to pack() can only be smaller than the machine's native alignment standard, never greater.
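You can verify the sizes with a few printfs (a sketch; the output depends on the platform, per the text above):

#include <stdio.h>

struct A { short size; char *ptr; int left; };

#pragma pack(2)
struct A2 { short size; char *ptr; int left; };
#pragma pack()

int main(void)
{
    /* On 64-bit Linux the text above predicts 24 for the default layout;
     * the packed variant shrinks because ptr no longer needs 8-byte alignment. */
    printf("sizeof(struct A)  = %zu\n", sizeof(struct A));
    printf("sizeof(struct A2) = %zu\n", sizeof(struct A2));
    return 0;
}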
7. Memory safety
A fun example: we never assign to a, only to b:

#include <stdio.h>
#include <string.h>

int main()
{
    char a[8];
    char b[8];
    memcpy(b, "1234567890\0", 10);
    printf("a = %s\n", a);
    return 0;
}

The program outputs a = 90.
This is a typical overflow: the neighbouring memory happened to be free, so it got used; it is far worse when what you overwrite is someone else's live data.
Incoming user data must be strictly checked and guaranteed not to cross its bounds; when everyone plays by the rules, all is well.
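The usual defensive shape of that check (a sketch; the function name is illustrative):

#include <string.h>

/* Refuse user data that would overflow a fixed-size destination
 * instead of trusting the sender. */
static int copy_user_data(char *dst, size_t dst_size,
                          const char *src, size_t src_len)
{
    if (src_len >= dst_size)
        return -1;             /* would cross the boundary: reject */
    memcpy(dst, src, src_len);
    dst[src_len] = '\0';
    return 0;
}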
8. Yunfeng (Cloud Wind)'s memory-management theory (from the SD2C conference; blog & PPT)
There are no unchanging principles
Big principles change slowly
There is no once-and-for-all solution
Memory access is cheap, but it is not free
Reducing the number of memory accesses is worthwhile
Random memory access is slower than sequential access
Keep data physically contiguous
Concentrated memory access beats scattered access
Store related data as close together as possible
Independent memory accesses beat dependent memory accesses
Consider the possibility of parallelism, even if your program itself uses no parallel mechanism
Bound the size of data under periodic intensive access
Trade time for space when necessary
Reading memory is faster than writing memory
Code occupies memory too, so keep it lean
Physical laws
Transistor layout
Reclaim memory in batches
Or do not free it at all and leave it to the system
list / map / vector (100 calls produced 13 memory allocations and releases)
Hash strings and access them by pointer
Control memory paging directly
III. Reduce disk I/O
The point here is to use memory as aggressively as possible to improve performance and cut I/O. Disk I/O can be reduced effectively at every level, from the system read/write buffers down to your own cache in user space: keep data in your own buffers and read/write in large batches. A cache is essential, whether you build it on shared memory or use an off-the-shelf BDB. You are welcome at my non-profit site berkeleydb.net, though I am less welcoming of people who ask a question and then vanish. BDB's default cache is only 256 KB; you can raise it, or use memory-only mode. If nothing has to be fetched from disk a second time, the disk is liberated. BDB fetches roughly ... data records per second (tested on 2 CPU * 2 core Xeon(R) E5410 @ 2.33 GHz, with single records of a few dozen bytes); if you want higher performance than that, I recommend writing your own store.
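For reference, a sketch of raising the cache with the Berkeley DB C API before open; the 512 MB figure is an arbitrary illustration, not a recommendation from the original post:

#include <db.h>

/* Raise BDB's cache from the 256 KB default so hot records are
 * served from memory, not disk. */
static DB *open_cached_db(const char *file)
{
    DB *dbp;
    if (db_create(&dbp, NULL, 0) != 0)
        return NULL;
    dbp->set_cachesize(dbp, 0, 512 * 1024 * 1024, 1);  /* gbytes, bytes, ncache */
    if (dbp->open(dbp, NULL, file, NULL, DB_BTREE, DB_CREATE, 0664) != 0) {
        dbp->close(dbp, 0);
        return NULL;
    }
    return dbp;
}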
IV. Optimize your NIC
First run ethtool ethX to check that your egress interface reports Speed: 1000Mb/s.
On a multi-core server, run top and press 1 to see each core's usage. If the CPU with id 0 is clearly busier than the others, it may well become your bottleneck later. Then use mpstat (not installed by default) to see how interrupts are distributed across the system, and cat /proc/interrupts to see the NIC interrupt distribution.
The following data shows the interrupt distribution on a server we have already optimized:
[yangjian2@d08043466 ~]$ mpstat -P ALL 1
Linux 2.6.18-53.el5PAE (d08043466)      12/15/2008

01:51:27  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
01:51:28  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   1836.00
01:51:28    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    179.00
01:51:28    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00    198.00
01:51:28    2    1.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00