Some experiences in Server programming

Source: Internet
Author: User
Tags: epoll

Due to my limited level, the following are only personal experiences, which I hope can serve as a reference for newcomers. Also, written under time pressure, the text is a bit disorganized, and some points may not relate strongly to server programming.

Performance problems

1. Use various pools.

A) mempool

For example, a memory pool improves memory allocation efficiency. When the scenario is simple, you can write your own pool management; when the pool design would get complex, use jemalloc or tcmalloc directly.

B) socket pool

For example, DNS resolution generally runs over UDP. To improve performance and avoid the overhead of repeatedly creating and destroying sockets, create a UDP socket pool and register the FDs' EPOLLIN events with epoll in advance. When a DNS request arrives, take an FD from the pool and send on it, then wait for the epoll input event; after recv completes, return the FD to the pool. A complete DNS IPv4 address resolution example is at http://code.oa.com/v2/weima/detail/7373.

2. Process network requests in batches wherever possible; this can greatly improve overall transmission performance.

3. Avoid unnecessary memset of network send/recv buffers. memset only wastes CPU; in practice you just need to track the valid-length field. I once saw code that allocated a very large buffer and memset it to 0 unconditionally; the memset alone took several hundred milliseconds.

4. Avoid repeated work. I have seen a lot of code like: for (int i = 0; i < strlen(str); ++i) { ... }. Instead, compute strlen(str) once, store it in a variable, and use that variable, eliminating the unnecessary performance loss.

5. Avoid new and delete where possible. Besides the performance gain, this reduces coding complexity (no explicit delete is needed) and, most importantly, the risk of memory leaks. GCC supports variable-length arrays, and the default stack on current machines is several MB, so in many scenarios you can simply declare an array: int arr[num]. I have seen a buffer new'd and delete'd repeatedly inside the same function; even where the stack cannot be used, a single buffer allocated once could at least be reused.

6. Avoid unnecessary copies; use references. Copying large objects is very time-consuming, so avoid it as much as possible. But do not overuse references either: for an int parameter, passing by value is better than by reference, because a reference is essentially a pointer and the indirection has some cost. Still, misusing a reference is far less harmful than misusing a value copy.

7. Using EPOLLOUT events properly is a skill; used badly, they easily drive CPU usage up. There are two workable approaches:

A) Listen for EPOLLOUT only once, during the asynchronous connect, and remove it after the connection succeeds. Because send() merely copies data into the socket send buffer, it succeeds immediately in most cases. Even when it cannot take everything, you can put the unsent data into a queue and retry after processing the next round of epoll events; by then the send is very likely to succeed, since the queued data will be fully or partly drained, and whatever remains is simply handled again after the following round.

B) Similar to the above: listen for EPOLLOUT once during the asynchronous connect and remove it after the connection succeeds. On each send, first try to send everything immediately; often it all goes out. If some data cannot be sent, put the remainder in a queue and add an EPOLLOUT event to epoll; when the OUT event arrives, send the remaining data.

8. When encapsulating a log interface, it is best to wrap it in a macro. One advantage is that it can automatically embed file, line, and function information. Another is that the macro can first check whether the requested log level is enabled and generate no call at all when it is not. Otherwise, even with debug logging disabled, the call still pushes its arguments onto the stack, and some arguments involve expensive computation, wasting a lot of CPU time.

9. In many cases you need a (hash) map, and performance suffers when the key is a complex object. If the scenario permits, first hash the object and use the hash value as the key for comparisons; this can greatly improve performance. The xxh64 interface (from xxHash) gives good performance with an extremely low collision rate.

10. Trade space for time. Sometimes a large number of integer-to-string conversions are needed. You can precompute a dictionary and then perform the conversions by lookup. The dictionary size is limited, so when a relatively large integer is encountered, first divide it by some value (such as 100000), then look up the pieces in the dictionary and splice them together.

11. Reduce critical resources; make data thread-private as much as possible. Do not put operations on non-critical resources inside a lock: that only lengthens the lock holding time and increases the probability of contention. If the critical section involves only lightweight CPU operations, use atomic operations, spin locks, sequence locks, CAS, and so on. For spin locks, Intel's TBB spin lock is recommended: with many threads it is more efficient than pthread_spin_lock, and TBB also provides a spin read/write lock, which pthread does not.

12. Use persistent connections when the scenario permits; this can sometimes greatly improve performance.

13. When using STL containers, reserve capacity first where possible (e.g. vector::reserve); this can greatly reduce the number of memory allocations and data copies.

14. Reduce system calls: for example, use accept4 to accept a connection and sendfile to transfer files. (Why does accept have a corresponding accept4 while socket has no socket4? If the system offered an interface like fd_ctrl(fd, sndsize, rcvsize, sndtimeout, rcvtimeout), it could do four things in one call.) Also enlarge the send/recv buffer sizes to reduce the number of send and recv calls.

15. When transmitting a large amount of data, setting TCP_CORK can improve TCP sending efficiency.

16. When latency of small packets matters, set TCP_NODELAY to disable the Nagle algorithm.

17. When necessary, bind threads to CPUs; this avoids context switches and reduces cache misses.

18. Auto-adapt the number of workers. Machines in an Internet service fleet may have different configurations, so the server should size its thread pool automatically from the number of CPUs.

 

Traps

1. When erasing through a container in a loop, pay special attention: associative containers (map, set) traditionally require erase(it++), while sequential containers (vector, list) require it = c.erase(it). C++11 unifies the two: use it = c.erase(it) for both.

2. Use TCP fast recycling of TIME_WAIT connections (the tcp_tw_recycle option) with caution; it can easily cause TCP resets. Lowering the net.ipv4.tcp_fin_timeout parameter instead is more reliable.

3. The select system call is a pitfall: problems appear once an FD exceeds 1024. This bites especially when a child process implicitly inherits the parent's FDs, where the failures become inexplicable. Use poll or epoll instead wherever possible.

4. When encapsulating a log function, add the __attribute__((format(printf, x, y))) attribute so that argument types are checked against the format string at compile time. Otherwise a call like log("%s", 123) compiles fine but coredumps at runtime.

5. Ordering pitfalls. In a multi-threaded environment, pay special attention to when close(fd) happens. If there is a global array indexed by FD, close(fd) must come last: do not call close(fd) first and then clean up array[fd]. The reason is that the FD closed by this thread may quickly be reassigned to another thread, and two threads operating on the array[fd] object at the same time will cause problems.

6. If the MTU is set too large, packets may fail to pass through some routers. Do not raise it to chase performance.

7. Asynchronous connects must have timeout management. We once ran asynchronous connects without timeouts, and some events were never delivered to epoll; sockets kept being created, which leaked socket handles.

 

Causes of TCP connection failure:

1. The network is unreachable. Check the iptables firewall rules to see whether requests are being dropped.

2. Network fluctuation. Use ping to check for heavy packet loss.

3. The client cannot allocate a port.

A) If the log shows "Cannot assign requested address", check cat /proc/sys/net/ipv4/ip_local_port_range and enlarge the range if necessary.

B) Reduce tcp_fin_timeout (recommended) or enable fast recycling (not recommended).

4. The server is too busy or too weak to accept() connections in time.

5. Check whether the server's listen queues (kernel parameters) are configured too small:

cat /proc/sys/net/ipv4/tcp_max_syn_backlog

cat /proc/sys/net/core/somaxconn

If this turns out to be the problem, increase both values.

 

Debugging problems

1. Use strace to trace system calls; strace -e trace=... lets you monitor selected calls only.

2. Add the necessary debug logs, disabled in normal operation; when a problem occurs, enable them to track it down easily.

3. You can name threads with prctl(PR_SET_NAME, name) to ease debugging and tracking. Run ps -eLo nlwp,vsz,sz,stat,wchan,%cpu,%mem,ppid,pid,tid,comm=THREADS,lstart,cmd to view detailed information, including the thread names.

 

Push services

1. A push server usually maintains a huge number of persistent connections, and memory is often the bottleneck, so the kernel parameters need tuning (some basics below).

A) Increase the file-handle limit: fs.file-max

B) Enlarge the connection queues: net.ipv4.tcp_max_syn_backlog, net.core.somaxconn

C) Reduce the default receive and send buffer sizes: net.core.rmem_default and net.core.wmem_default. The receive buffer should be no less than 1 KB, otherwise problems may occur.

2. Use a newer kernel

Newer kernel versions optimize the per-socket cache into a per-task cache, which greatly reduces the memory idle connections occupy and makes holding a huge number of sockets feasible.

3. SO_REUSEPORT

To fully exploit multiple CPUs and multi-queue NICs, one accept thread may not be enough, but opening multiple ports is clearly inconvenient. Newer kernels (starting with 3.9) therefore support port reuse (SO_REUSEPORT): different processes can listen on the same port, and can even accept concurrently, without the thundering-herd problem. The rough principle: the kernel hashes each new connection's (sip, sport, dip, dport) tuple and maps the hash value to one of the listening sockets, so connections are spread evenly across the processes.

 

Security

1. Defer accept: the connection is handed to accept only after the first data arrives on top of the completed three-way handshake. This screens out empty connections and can also improve performance.

2. Use a firewall to block ports that do not need to be open, and to block blacklisted IP addresses and ports.

3. Avoid single points of failure. Make servers stateless where possible. If there is state, provide an active/standby pair, or an active/active pair where both nodes are primary, each handling different task units and exchanging heartbeats; if one finds the other down, it takes over the other's tasks in real time. This requires reserved capacity.

4. Separate the active/standby heartbeat logic from business reads and writes. The heartbeat can be implemented simply over UDP, independent of the business-logic TCP connections.

 

Monitoring and alarms

1. Add detailed logs and flow logs at the key points; otherwise some problems cannot be diagnosed.

2. Report traffic, failure rate, latency distribution, and error codes. When there are many statistics items, per-item counters written to shared memory give higher statistics performance.

3. Configure alarms on these statistics in the network-management and monitoring systems: trigger an alarm when the failure rate exceeds a threshold, or when the request volume surges or plummets.

 

A few coding tips

1. C++ has destructors to clean up resources conveniently. C does not, but GCC extends C with the cleanup attribute, which releases a resource automatically when the variable goes out of scope. For example (note that the cleanup handler receives a pointer to the variable):

#include <stdlib.h>

void free_ptr(char **ptr)
{
    if (*ptr != NULL)
        free(*ptr);
}

void func(void)
{
    char *array __attribute__((cleanup(free_ptr))) = (char *)malloc(1024);
    /* array is automatically released when the function exits */
}

2. To run initialization automatically before main, put it in a function declared with the constructor attribute: __attribute__((constructor)) void init() { ... }.

3. typeof: GCC's typeof keyword can simplify coding. Given

std::map<int, std::map<int, int> > dict;

you can write typeof(dict.end()) it = dict.begin(); instead of std::map<int, std::map<int, int> >::iterator it = dict.begin(); (in C++11, auto achieves the same).

 

Others

1. Protocols must be extensible. Choose JSON, protobuf, or another protocol according to your business scenario.

2. Timeout management:

In complex scenarios, timeout management can be implemented with a heap, a red-black tree, or the timerfd mechanism.

In most scenarios, however, a simpler method is enough and often more efficient. For example, keep the connection information in a map and scan the map periodically (say, once per second). The frequency must be controlled carefully here, or frequent timeout checks will burn a lot of CPU. My usual approach: after each epoll event, read the TSC (the CPU cycle counter), subtract the previous reading, and only when the difference reaches the threshold (e.g. one second) run the timeout scan. Reading the TSC is very fast and the scan runs at most once a second, so the timeout detection stays cheap.

3. Workers communicate through lock-free queues. The producer pushes messages to the queue; after processing a round of epoll events, the consumer peeks at the queue and, if it has data, takes it all out and handles it. If the queue is empty, it does nothing and goes back to epoll_wait or other work in the thread loop.

