Performance Tuning Strategy


Original address: http://www.csdn.net/article/2012-06-21/2806814

Performance optimization is a big topic. In "12306.cn Talk about website performance Technology" I covered, from the business and design angle, some of the available techniques and their advantages and disadvantages. Today I would like to talk about the technical details of performance optimization, mainly code-level techniques and methods. This article is based on my own experience and knowledge and is not necessarily all correct; corrections and additions are welcome.

Before reading on, you may want to look at the previously published "Code optimization Summary". That article basically tells you one thing: to optimize, you have to find the performance bottleneck first! But before I talk about how to locate a system's performance bottleneck, let me cover the definition and testing of system performance, because without those two things there is nothing for the later locating and optimizing to rest on.

I. Definition of system performance

Let us first say what system performance is. This definition is critical: if we don't know what system performance is, we won't be able to locate problems with it. I've seen a lot of friends who think this is easy, but a careful look shows that they have no systematic approach, so here I want to explain how to locate performance problems systematically. Overall, system performance comes down to two things:

Throughput. That is, the number of requests or tasks that can be processed per second.
Latency. That is, the delay the system incurs in processing one request or one task.

In general, the performance of a system is constrained by both of these; neither can be left out. For example, if my system can sustain 1 million concurrent requests but its latency is more than 2 minutes, that load of 1 million is meaningless. Likewise, very short latency with very low throughput is meaningless. A good performance test therefore has to measure both at the same time. Experienced readers will know how the two relate:

The higher the throughput, the worse the latency becomes. Because the volume of requests is too large, the system is too busy, so response slows down.
The better the latency, the higher the throughput that can be supported. Short latency means processing is fast, so more requests can be handled.

II. System performance testing

From the above, we know that to test the performance of a system we need to collect its throughput and latency values.

First, define the acceptable latency. For example, the response time of a website must be within 5 seconds (some real-time systems may need something much shorter, such as within 5 ms; this depends on the business).
Second, develop the performance testing tools: one tool to generate high throughput, and another to measure latency. For the first, you can refer to "10 Free web stress test Tool". As for measuring latency, you can measure it in the code, but this affects the execution of the program and only captures latency inside the program. Real latency covers the whole system, including the operating system and the network; you can capture network packets with Wireshark to measure it. How exactly to build these two tools is left for you to think about.
Finally, start the performance test. Keep increasing the test throughput and observe the load on the system; if the system holds up, look at the latency. This way you can find the maximum load the system can take and know what its response delay is at that load.

A few more points

About latency: if throughput is very low, this value is likely to be very stable; as throughput grows, latency starts to fluctuate wildly. So when we measure latency, we need to look at its distribution, that is, what percentage falls within the range we allow, what percentage exceeds it, and what percentage is completely unacceptable. The average latency may meet the target while only 50% of requests actually fall within our acceptable range. That is meaningless as well.
A performance test also needs a defined duration, for example 15 minutes at a given throughput. When load arrives the system becomes unstable and only settles after a minute or two. It is also possible that your system behaves normally for the first few minutes under load and then becomes unstable or even collapses. So the test has to last that long. We call this value the peak limit.
Performance testing also needs a soak test, which checks that the system can run stably for a week or even longer at a given throughput. We call this value the system's normal operating load limit.
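
To make the point about latency distribution concrete, here is a minimal sketch in Python. The sample values and the 500 ms limit are made up for illustration, and the nearest-rank percentile is only one of several common definitions.

```python
import math

def percentile(sorted_samples, p):
    """Nearest-rank percentile of an already sorted list (p is an integer 0..100)."""
    if not sorted_samples:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_samples) / 100))
    return sorted_samples[rank - 1]

def latency_report(samples_ms, limit_ms):
    """Summarize a latency distribution against an acceptable limit."""
    ordered = sorted(samples_ms)
    within = sum(1 for s in ordered if s <= limit_ms)
    return {
        "mean_ms": sum(ordered) / len(ordered),
        "p50_ms": percentile(ordered, 50),
        "p99_ms": percentile(ordered, 99),
        "within_limit": within / len(ordered),
    }

# Hypothetical samples: most requests are fast, a few are very slow.
samples = [10] * 90 + [2000] * 10
report = latency_report(samples, limit_ms=500)
print(report)
```

With these made-up numbers the mean stays under the limit even though one request in ten takes two seconds, which is why the average alone says little.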

Performance testing involves many very complex things, such as burst tests. I won't go into detail here, only into what relates to performance tuning. In short, performance testing is painstaking, tiring work.

III. Locating performance bottlenecks

With the groundwork above, we can measure the performance of the system. Before tuning, let us first talk about how to find the performance bottleneck. I've seen a lot of friends who think this is easy, but a careful look shows that they have no systematic approach.

3.1 Viewing operating system load

First of all, when the system has a problem, do not rush to investigate the code; that is pointless at this stage. The first thing to look at is the operating system's report: CPU utilization, memory usage, operating system I/O, network I/O, the number of network connections, and so on. Perfmon on Windows is a great tool, and Linux has many related commands and tools, such as SystemTap, LatencyTOP, vmstat, sar, iostat, top, and tcpdump. By observing this data, we can tell roughly where our software's performance problem lies. For example:

1) First look at CPU utilization. If CPU utilization is not high but throughput and latency cannot be improved, our program is not busy computing but busy with something else, such as I/O. (CPU utilization should also be split into kernel mode and user mode; once kernel-mode usage goes up, the performance of the whole system goes down. On multi-core CPUs, CPU 0 is critical: if CPU 0 is heavily loaded, it affects the other cores, because scheduling across cores is done by CPU 0.)

2) Then look at whether I/O is heavy. I/O and CPU usually move in opposite directions: when CPU utilization is high, I/O is low, and when I/O is high, CPU utilization is low. For I/O we look at three things: disk file I/O, driver I/O (for example, the network card), and the memory paging rate. All three affect system performance.

3) Then look at network bandwidth usage. On Linux you can use iftop, iptraf, ntop, or tcpdump, or look at it with Wireshark.

4) If CPU utilization, I/O, memory usage, and network bandwidth usage are all low but system performance still will not go up, the problem is in your program; for example, it is blocked. It may be waiting for a lock, waiting for a resource, or busy switching contexts.

By understanding how the operating system is performing, we can identify the performance problem, such as insufficient bandwidth, insufficient memory, or TCP buffers that are too small. Often there is no need to change the program at all; adjusting the hardware or the operating system configuration is enough.


3.2 Testing with a profiler

Next, we need performance analysis tools, that is, a profiler, to look at the performance of our program. Examples include JProfiler/TPTP/CodePro Profiler for Java, GNU gprof, IBM PurifyPlus, Intel VTune, AMD CodeAnalyst, and OProfile/perf on Linux. The last two let you optimize your code down to the CPU micro-instruction level; if you care about tuning for the CPU's L1/L2 cache, consider VTune. With these profilers you can obtain many metrics for each module, function, and even instruction in your program, such as run time, number of calls, and CPU utilization. These are very useful to us.

We focus on the functions and instructions that run longest and are called most often. Note that for functions that are called many times but run only briefly, a slight improvement may be enough to raise performance noticeably (for example, if a function is called 1 million times a second, think about how much performance you gain by shaving 0.01 milliseconds off each call).

There is one problem with profilers we need to be aware of: a profiler slows your program down. Tools like PurifyPlus insert a lot of code into your program, which makes it run less efficiently, so you cannot test the system's performance at high throughput. For that situation there are generally two ways to locate a system bottleneck:

1) Collect your own statistics in the code: use a microsecond timer and a function call counter, and write the statistics to a log file every 10 seconds.

2) Comment out your code block by block: let some functions run empty or replace them with hard-coded mocks, then test whether throughput and latency change qualitatively. If they do, the commented-out function is a performance bottleneck; then comment out code inside that function in the same way until you find the statement that costs the most.

Finally, in performance testing, different throughput levels produce different results, and so does different test data. The data used for performance testing therefore matters a great deal, and we need to look at the results at several throughput levels.

IV. Common system bottlenecks

These come from my own experience; they may not be complete and may not all be right, so please add to them. I am only trying to get the discussion started. For performance tuning at the system architecture level, see "12306.cn Talk about website performance technology"; for web performance tuning, see the performance chapter of "What to learn in Web development". I will not say anything about design or architecture here.

In general, performance optimization comes down to the following strategies:

Trade space for time. All kinds of caches, from the CPU's L1/L2 cache and RAM down to the hard disk, trade space for time. The strategy is basically to save or cache the results of computation step by step so that they do not have to be recomputed every time, as with data buffering, CDNs, and so on. This strategy also shows up as data redundancy, such as data mirroring and load balancing.
Trade time for space. Sometimes using less space gives better performance. Take network transmission: if you compress the data with some algorithm (such as the "Huffman Encoding Compression algorithm" or the "rsync Core algorithm"), the algorithm itself costs time, but because the bottleneck is network transmission, trading time for space actually saves time overall.
Simplify the code. The most efficient program is one that executes no code at all, so the less code runs, the higher the performance. University textbooks have plenty of examples of code-level optimization: reduce the nesting depth of loops, reduce recursion, declare fewer variables inside loops, allocate and free memory less often, hoist expressions out of loop bodies where possible, order the conditions in a compound conditional expression carefully, prepare as much as possible at program startup, watch function call overhead, watch the cost of temporary objects in object-oriented languages, and use exceptions with care (do not use exceptions to check for acceptable, frequently occurring errors). All of this requires us to know the programming language and common libraries very well.
Parallel processing. If the CPU has only one core, using multiple processes or threads makes compute-intensive software slower (because of operating system scheduling and switching overhead); only with multiple cores do multi-process and multi-threaded designs show their advantage. Parallel processing requires our programs to be scalable; programs that cannot scale horizontally or vertically cannot be processed in parallel. In architectural terms this comes down to one question: can performance be improved just by adding machines, without changing the code?
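
As a small illustration of trading space for time, the sketch below caches the result of a computation so that repeated requests do not recompute it. The function expensive and the cache size are made up for the example; functools.lru_cache is Python's standard memoization decorator.

```python
from functools import lru_cache

calls = 0  # counts how many times the real computation runs

@lru_cache(maxsize=1024)  # memory spent on cached results buys back CPU time
def expensive(n):
    global calls
    calls += 1
    return sum(i * i for i in range(n))

first = expensive(10_000)   # computed
second = expensive(10_000)  # served from the cache
print(calls, expensive.cache_info())
```

The same idea applies at every level the list mentions, from CPU caches to a CDN: the cost is extra memory plus the need to decide when cached data is stale.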

In short, by the 80/20 rule, 20% of the code consumes 80% of your performance; find that 20% and you can optimize 80% of the performance. What follows is from my own experience. I only list the tuning methods I consider most valuable, for your reference; additions are welcome.

4.1 Algorithm Tuning

Algorithms are very important; a good algorithm gives better performance. Here are a few examples from projects I have worked on.

One is a filtering algorithm. The system needs to filter incoming requests, and we keep what to filter in or out in a configuration file. The original algorithm traversed the filter configuration. Later we found a way to sort the configuration so that binary search could be used for filtering, and system performance increased by 50%.
One is a hashing algorithm. Computing the hash function was inefficient: on the one hand the computation took too long, and on the other the collision rate was too high, and with many collisions a hash table performs like a singly linked list (see the hash collision DoS problem). Algorithms are closely tied to the data they process; even bubble sort, which everyone derides, can be more efficient than other sorting algorithms in some cases (when most of the data is already sorted). Hash algorithms are the same. The widely known hash algorithms were tested on English dictionaries, but our business data has its own characteristics, so you need to choose a hash algorithm that fits your own data. On one of my earlier projects, an expert in the company gave me a hash algorithm that improved our system performance by 150%. (On hash algorithms in general, be sure to read the StackExchange article on various hash algorithms.)
Divide and conquer, and preprocessing. There used to be a program that generated monthly reports, and each run took a long time, sometimes nearly a whole day. So we found a way to make the algorithm incremental: every day I compute that day's data and merge it with the previous day's report. This saves a great deal of computation time: the daily calculation takes only 20 minutes, whereas computing the whole month at once takes more than 10 hours (SQL statements slow down sharply on large data volumes). This divide-and-conquer approach helps performance on large data sets, just as in merge sort. Performance optimization of SQL statements and databases follows the same strategy, for example using a nested select instead of a Cartesian-product select, or using views.
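
The filtering example can be sketched as follows: sort the configuration once, then answer each lookup with a binary search instead of a full traversal. The class name SortedFilter and the sample addresses are hypothetical; this only illustrates the idea and is not the code from the project described above.

```python
from bisect import bisect_left

def is_blocked_linear(blocked, key):
    """Original approach: walk the whole configuration, O(n) per request."""
    return any(item == key for item in blocked)

class SortedFilter:
    """Sort the configuration once, then use binary search, O(log n) per request."""

    def __init__(self, blocked):
        self._blocked = sorted(set(blocked))

    def is_blocked(self, key):
        i = bisect_left(self._blocked, key)
        return i < len(self._blocked) and self._blocked[i] == key

# Hypothetical filter configuration.
blocked = ["10.0.0.7", "192.168.1.9", "172.16.0.3"]
filt = SortedFilter(blocked)
print(filt.is_blocked("172.16.0.3"), filt.is_blocked("8.8.8.8"))
```

In Python a set would give constant-time lookups; binary search is shown here because it is the technique the text describes and it also carries over to range-based filter rules.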

4.2 Code Tuning

String operations. These are among the most expensive operations in a system, whether strcpy, strcat, or strlen, and substring matching deserves the most attention. So wherever an integer can be used, use an integer. A few examples. First, years ago when I worked at a bank, my colleagues liked to store dates as strings (for example, 2012-05-29 08:30:02); a select ... where ... between statement over such a column is very time-consuming. Second, a former colleague handled some status codes as strings, on the grounds that they could be displayed directly in the UI. During later performance tuning I changed these status codes to integers and checked the state with bit operations. Because one function called 150K times per second checked the state in three places, the change raised overall system performance by about 30%. Third, one product I worked on had a coding standard requiring every function to define its own name, such as const char fname[] = "functionname()", for convenient logging. But why not declare it static?
Multithreading tuning. Some people say threads are evil, and at times they really are a problem for system performance. The bottlenecks of multithreading are the locks used for mutual exclusion and synchronization and the cost of thread context switches, so using fewer locks, or none, is fundamental (for example, the optimistic locking described in "multi-version concurrency control (MVCC) in the application of distributed systems" can solve performance problems). Read-write locks can also solve the concurrency performance problem for most read operations. A bit more on C++: we may use a thread-safe smart pointer such as AutoPtr, or some other thread-safe container. Anything thread-safe takes a lock no matter what, and locking is a very expensive operation, so using AutoPtr drags system performance down quickly. If you can guarantee there is no concurrent access, you should not use AutoPtr. I remember a colleague once removing the reference counting from smart pointers, which improved system performance by more than 50%. As for Java's object reference counting, if I have guessed right, there are locks everywhere, which is why Java's performance has always been a problem. Also, more threads is not better: scheduling and context switching between threads are expensive too. Do as much as possible within one thread and avoid synchronizing threads where you can. This will gain you a lot of performance.
Memory allocation. Do not underestimate memory allocation in a program. Calls such as malloc/realloc/calloc are time-consuming, especially when memory is fragmented. My previous company had exactly this problem: at a customer site, our program stopped responding one day. Attaching GDB showed the system hanging inside a malloc call that did not return for 20 seconds; restarting some of the systems fixed it. This was a memory fragmentation problem. It is also why many people complain that the STL has serious memory fragmentation problems: too many small allocations are made and freed. Many people think a memory pool solves this, but in fact they just reinvent the memory management of the C runtime or the operating system, which does not help at all. Memory fragmentation can indeed be addressed with memory pools, but specifically with a series of pools for different block sizes (this is left for you to think about). Of course, it is best to do less dynamic memory allocation in the first place. Speaking of memory pools, pooling in general deserves a mention, such as thread pools and connection pools. Pooling is very effective for short jobs such as HTTP services, because it reduces the overhead of creating connections and threads and thus improves performance.
Asynchronous operations. UNIX file operations come in blocking and non-blocking forms, and some system calls block, such as select on sockets or the wait functions on Windows (WaitForSingleObject and the like). If our program operates synchronously, this hurts performance a great deal. We can change it to asynchronous operation, but that makes the program more complex. Asynchronous designs usually go through a queue, so watch the performance of the queue. Asynchronous status notification is also commonly a problem, whether done with message events or with callbacks, and these mechanisms may affect performance too. In general, though, asynchronous operation improves throughput considerably, at the expense of response time (latency). The business has to be able to accept that.
Languages and libraries. We need to be familiar with the performance of the language and of the libraries or class libraries we use. For example, many STL containers keep the memory they have allocated even after you delete elements, which gives the false impression of a memory leak and may cause memory fragmentation. As another example, size() == 0 and empty() are not equivalent for some STL containers, because size() can be O(n) while empty() is O(1); be careful here. JVM tuning in Java relies on these parameters: -Xms -Xmx -Xmn -XX:SurvivorRatio -XX:MaxTenuringThreshold. Also pay attention to the JVM's GC. Everyone knows how overbearing GC is, especially full GC (which also compacts memory fragments): it is like the time-stop power in an old sci-fi TV show, and when it runs, time stops for the whole world.
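
The status-code example from the string operations item can be sketched like this: keep the state as an integer, test it with bit operations on the hot path, and convert to text only when something has to be displayed. The flag names and the can_process rule are invented for illustration.

```python
# Hypothetical status bits; the integer is what gets stored and passed around.
ACTIVE = 0b001
LOCKED = 0b010
EXPIRED = 0b100

# Text is produced only at the display boundary.
LABELS = {ACTIVE: "active", LOCKED: "locked", EXPIRED: "expired"}

def can_process(status):
    """Hot-path check: two bit tests, no string comparison."""
    return (status & ACTIVE) != 0 and (status & (LOCKED | EXPIRED)) == 0

def describe(status):
    """Slow path, used only when the state is shown to a user."""
    return [name for bit, name in LABELS.items() if status & bit]

print(can_process(ACTIVE), can_process(ACTIVE | LOCKED))
print(describe(ACTIVE | EXPIRED))
```

This keeps the argument for strings (they can be shown directly) while moving the formatting cost off the code path that runs for every request.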

4.3 Network tuning

There is a lot to say about network tuning, especially TCP tuning (you can find many articles online with those two keywords). Just look at how many TCP/IP parameters Linux has. (By the way, you may not like Linux, but you cannot deny that it gives us a great deal of power to tune the kernel.) I strongly recommend reading "TCP/IP Illustrated, Volume 1: The Protocols". Here I will only cover a few conceptual points.

A) TCP Tuning

We know that TCP connections carry a lot of overhead: each one occupies a file descriptor and needs buffers. A system can generally support only a limited number of TCP connections, and we need to be clear that TCP connections are expensive for the system. Because TCP is resource-intensive, many attacks work by opening a large number of TCP connections on your system to exhaust its resources, such as the well-known SYN flood attack. So pay attention to the keepalive parameters. They define an interval: if no data has been transmitted on a connection for that long, the system sends a probe packet; if no response comes back, TCP considers the connection broken and closes it, so that the system resources can be reclaimed. (Note: the HTTP layer also has a keep-alive parameter.) For short-lived connections such as HTTP, setting a keepalive of 1 to 2 minutes is important. It can prevent DoS attacks to a certain extent. The parameters are as follows (the values are for reference only):

net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl =
net.ipv4.tcp_fin_timeout =

About the TCP TIME_WAIT state: the side that actively closes the connection enters TIME_WAIT, which lasts for 2 MSL (Maximum Segment Lifetime), 4 minutes by default, and resources held in TIME_WAIT cannot be reclaimed during that time. Large numbers of TIME_WAIT connections typically appear on HTTP servers. Two parameters are relevant here:

net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_recycle=1

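The former enables reuse of TIME_WAIT sockets, and the latter enables fast recycling of TIME_WAIT resources. (Note that tcp_tw_recycle is known to break connections from clients behind NAT and was removed in Linux 4.12.)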

TCP also has an important concept called RWIN (TCP Receive Window Size): the maximum amount of data that I, as the receiving end of a TCP connection, can accept without sending an ACK to the sender. Why does this matter? Because if the sender does not receive an ACK from the receiver, it stops sending and waits for a while, and if that times out, it retransmits. This is what makes a TCP connection reliable. Retransmission is not the worst of it: when a packet is lost, TCP's bandwidth usage is affected immediately (it is blindly halved); lose another packet and it is halved again; if no more packets are lost, it gradually recovers. The relevant parameters are as follows:

net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

In general, the theoretical RWIN should be set to throughput × round-trip time. The sender's buffer should be the same size as RWIN, because after sending data the sender has to wait for the receiver's acknowledgment. If network latency is high and the buffer is too small, acknowledgments are needed more often, so performance and network utilization are both low. In other words, a high-latency network needs a large buffer, which means fewer ACKs and more data; a fast network can use a smaller buffer. If packets are lost (no ACK received), an overly large buffer can be a problem, because TCP then retransmits all of the data in it, which hurts network performance. (Of course, on a poor network, forget about high performance.) So for a high-performance network the important thing is to keep the packet loss rate very, very low (basically a LAN). If the network is basically reliable, a larger buffer gives better transmission performance (too many round trips hurt performance badly).
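
As a worked example of the throughput × round-trip time rule, the sketch below computes the window for two made-up links. The link speeds and round-trip times are assumptions chosen for illustration; integer arithmetic is used to avoid rounding surprises.

```python
def window_bytes(bandwidth_bits_per_s, rtt_ms):
    """Bytes that must be in flight to keep the link full: throughput x round-trip time."""
    return bandwidth_bits_per_s * rtt_ms // 8000  # /1000 for ms, /8 for bits to bytes

lan = window_bytes(1_000_000_000, 1)    # hypothetical 1 Gbit/s LAN, 1 ms round trip
wan = window_bytes(100_000_000, 100)    # hypothetical 100 Mbit/s WAN, 100 ms round trip
print(lan, wan)
```

The slower but higher-latency link needs a window ten times larger, which matches the point above that high-latency networks call for large buffers.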

Also, think about this: if the network quality is very good and basically no packets are lost, and our business can tolerate occasionally losing a few packets, then why not use the faster UDP? Have you considered this question?

B) UDP tuning

When it comes to UDP tuning, the one thing I want to focus on is the MTU, the maximum transmission unit (this actually applies to TCP as well, because it is a link-layer matter). Think of the maximum transmission unit as a bus on a road. Suppose a bus seats at most 70 people, and bandwidth is like the number of lanes. If a road can hold at most 100 buses, I can transport at most 7,000 people; but if the buses are not full, say 20 people per bus on average, I transport only 2,000 people and waste road resources (bandwidth). So we try to make each UDP packet as close as possible to the MTU before putting it on the network, which maximizes bandwidth utilization. The MTU is 1500 bytes for Ethernet, 4352 bytes for fiber (FDDI), and 7981 bytes for 802.11 wireless. When we send packets with TCP or UDP, however, our payload must be smaller than this value, because IP adds 20 bytes and UDP adds 8 bytes (TCP adds more). So in general the maximum payload of one UDP packet on Ethernet should be 1500 - 8 - 20 = 1472 bytes; that is the size of your data. With fiber, of course, this value can be larger. (By the way, with some high-end gigabit fiber Ethernet NICs, if the NIC hardware finds that your packet exceeds the MTU, it fragments the packet for you and reassembles it at the destination, so you do not have to handle that in your program.)
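
The payload arithmetic above can be written down as a small helper. The header sizes assume IPv4 without options; the function names and the 1,000,000-byte message are made up for the example.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
UDP_HEADER_BYTES = 8

def max_udp_payload(mtu):
    """Largest UDP payload that fits in one packet without IP fragmentation."""
    return mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES

def datagrams_needed(total_bytes, mtu):
    """Number of full-size datagrams needed to carry total_bytes."""
    payload = max_udp_payload(mtu)
    return -(-total_bytes // payload)  # ceiling division

print(max_udp_payload(1500))              # Ethernet
print(datagrams_needed(1_000_000, 1500))  # a hypothetical 1,000,000-byte message
```

Sending the same message in 512-byte pieces would take almost three times as many packets, each paying the same 28 bytes of headers, which is the half-empty bus from the analogy.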

One more thing: in socket programming you can use setsockopt() to set key options such as the SO_SNDBUF/SO_RCVBUF sizes, the TTL, and keepalive. There are many more; check the socket manual for details.
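
A minimal sketch of those setsockopt() calls, using Python's socket module, which wraps the same system call. The buffer size of 256 KB and the TTL of 32 are arbitrary example values, and the kernel is free to round or cap the buffer sizes it actually applies.

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Ask for larger send and receive buffers (the kernel may adjust the values).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 256 * 1024)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 256 * 1024)
    # Enable TCP keepalive probes on this connection.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Limit how many hops outgoing packets may travel.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, 32)

    # Read the effective values back.
    sndbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    rcvbuf = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    keepalive = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
    ttl = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TTL)
finally:
    sock.close()

print(sndbuf, rcvbuf, keepalive, ttl)
```

Reading the values back is worthwhile because the system-wide limits (such as net.core.rmem_max on Linux) cap what a single socket can request.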

Finally, one of the biggest benefits of UDP is multicast, which is very convenient and efficient when you need to notify multiple nodes on your intranet. Multicast also helps with scaling out (newly added machines only need to listen for the multicast messages).

C) network card tuning

The network card itself can also be tuned, which is necessary for gigabit and faster NICs. On Linux we can use ifconfig to view the NIC statistics. If the overrun counter is non-zero, we may need to increase txqueuelen (the default is usually 1000), for example: ifconfig eth0 txqueuelen 5000. Linux also has a command called ethtool that can be used to set the NIC's buffer sizes. On Windows, the relevant parameters can be adjusted in the Advanced tab of the network adapter (such as Receive Buffers and Transmit Buffers; different NICs expose different parameters). Enlarging the buffers is very effective for transfers of large amounts of data.

D) Other network performance

On multiplexing, that is, using one thread to manage all TCP connections, there are three system calls to know. The first is select, which supports only 1024 connections. The second is poll, which breaks the 1024 limit. But select and poll are both essentially polling mechanisms, and polling performs poorly when there are many connections because it is an O(n) algorithm. Hence epoll, which is supported by the operating system kernel: the operating system calls back only when a connection becomes active. It is driven by operating system notifications and is available only from Linux kernel 2.6 onward (to be precise, it was introduced in 2.5.44). Of course, if all connections are active, excessive use of epoll_ctl may make performance worse than polling, though not by much.
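
Here is a minimal sketch of readiness-based multiplexing using Python's selectors module, whose DefaultSelector picks the most efficient mechanism available on the platform (epoll on Linux). Two local socket pairs stand in for client connections; only the one that has data is reported as ready.

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, otherwise kqueue/poll/select

# Two stand-in connections; only the first will receive data.
a, b = socket.socketpair()
idle_a, idle_b = socket.socketpair()
for conn in (b, idle_b):
    conn.setblocking(False)
    sel.register(conn, selectors.EVENT_READ)

a.sendall(b"ping")

# One call reports just the connections that are ready to read.
ready = sel.select(timeout=1.0)
received = [key.fileobj.recv(16) for key, _events in ready]

sel.close()
for conn in (a, b, idle_a, idle_b):
    conn.close()

print(received)
```

A real server would run sel.select() in a loop and dispatch each ready connection to a handler; the idle connection costs nothing per iteration, which is the advantage over scanning every connection.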

Be careful with system calls that do DNS lookups, such as gethostbyaddr/gethostbyname. These functions can be quite time-consuming, because they go out to the network to resolve the name; DNS's recursive queries can cause serious delays, and there is no parameter for setting a timeout. You can speed things up by configuring the hosts file, or by managing a lookup table in memory yourself that is filled when the program starts rather than at run time. Under multithreading gethostbyname has an even more serious problem: if one thread blocks in gethostbyname, other threads calling gethostbyname block too, which is rather nasty and needs attention. (You can try GNU's gethostbyname_r(), which performs better.) You can find a lot of material on such things online. For example, if your Linux system uses NIS or NFS, some user-related or file-related system calls become very slow, so be careful.


4.4 System Tuning

A) I/O model

Earlier we discussed the three system calls select/poll/epoll. As we all know, Unix/Linux treats all devices as files for I/O, so those three calls should be regarded as I/O-related system calls. That brings us to I/O models, which matter a great deal for I/O performance. The classic Unix/Linux I/O models are as follows (on Linux I/O models, see the article "Using asynchronous I/O to improve performance"):

First, synchronous blocking I/O. This needs no explanation.

Second, synchronous non-blocking I/O. It is enabled by setting O_NONBLOCK with fcntl.

Third, select/poll/epoll. With these three the I/O itself does not block; the blocking happens on the event. This model is asynchronous I/O with synchronous event notification.

Fourth, AIO. This is a model in which I/O is processed in parallel with the application. An I/O request returns immediately, indicating that the request was initiated successfully. When the I/O operation completes in the background, the application is notified in one of two ways: a signal is generated, or a thread-based callback function is executed to complete the I/O processing.

Because the fourth model blocks on neither I/O nor event notification, it lets you make full use of the CPU. Its advantage over the second, synchronous non-blocking model is that with the second you have to poll over and over. Nginx is so efficient because it uses epoll and AIO for I/O.
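
The second model, synchronous non-blocking I/O, can be demonstrated with a pipe. This is a Unix-only sketch using Python's fcntl module to set O_NONBLOCK; a pipe is used only because it needs no external resources.

```python
import fcntl
import os

read_fd, write_fd = os.pipe()

# Switch the read end to non-blocking mode with fcntl and O_NONBLOCK.
flags = fcntl.fcntl(read_fd, fcntl.F_GETFL)
fcntl.fcntl(read_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)

# Nothing has been written yet: a blocking read would hang here,
# a non-blocking read fails immediately instead.
try:
    os.read(read_fd, 1024)
    first_attempt = "data"
except BlockingIOError:
    first_attempt = "would block"

os.write(write_fd, b"hello")
second_attempt = os.read(read_fd, 1024)  # data is available now

os.close(read_fd)
os.close(write_fd)
print(first_attempt, second_attempt)
```

The caller has to keep retrying after each "would block" result, which is exactly the repeated polling that the event-driven and AIO models avoid.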

Now for the I/O models under Windows:

a) The first is the WriteFile system call. It can be synchronous blocking or synchronous non-blocking, depending on whether the file was opened in overlapped mode. For synchronous non-blocking use you have to set its last parameter, the overlapped structure (Microsoft calls this Overlapped I/O), and you need WaitForSingleObject to find out whether the write has finished. You can imagine how this system call performs.

b) The second is a system call named WriteFileEx. It implements asynchronous I/O and lets you pass in a callback function that is called when the I/O completes. However, Windows puts the callback into the APC (Asynchronous Procedure Call) queue, and it is only invoked once the application's current thread enters an alertable state. A thread becomes alertable only while it is inside one of these functions: WaitForSingleObjectEx, WaitForMultipleObjectsEx, MsgWaitForMultipleObjectsEx, SignalObjectAndWait and SleepEx. As you can see, this model still involves waiting, so its performance is not high either.

c) The third is IOCP, the I/O Completion Port. IOCP places the results of I/O in a queue, but the queue is not watched by the main thread. It is watched by one or more dedicated threads (on older platforms you have to create these threads yourself; on newer platforms you can create a thread pool). IOCP is a thread pool model. It is similar to the AIO model under Linux, but implemented in a completely different way.

Of course, the real way to improve I/O performance is to minimize the number of I/O operations against peripherals, ideally to none at all. For reads, an in-memory cache generally improves performance, because memory is much faster than any peripheral. For writes, cache the data to be written so that it is written out in fewer operations. The problem with caching is timeliness: latency grows, so we have to trade off the number of writes against the response time.

B) Multi-core CPU tuning

As for multi-core CPUs, we know that CPU0 is critical. If CPU 0 is pushed too hard, the performance of the other CPUs falls as well, because CPU0 has a coordinating role. So we should not simply leave load balancing to the operating system. We know our own programs better, so we can assign CPU cores to them by hand, both to avoid occupying CPU0 too much and to avoid squeezing our critical processes together with a bunch of other processes.

For Windows, right-click the process under "Processes" in Task Manager and choose "Set Affinity..." to set and limit which cores the process may run on.
For Linux, you can use the taskset command (you can install it through schedutils: apt-get install schedutils).
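The same pinning can also be done from inside a program. The sketch below uses os.sched_getaffinity and os.sched_setaffinity, which Python exposes on Linux only. pin_to_one_cpu is a hypothetical helper, and preferring a core other than CPU 0 is an assumption taken from the advice above, not a general rule.

```python
import os


def pin_to_one_cpu():
    """Pin the current process to a single core, avoiding CPU 0 if possible.

    Returns the chosen CPU number, or None where the affinity API is missing.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # not Linux: nothing to do in this sketch
    allowed = os.sched_getaffinity(0)  # cores this process may currently use
    # Prefer any allowed core other than CPU 0; fall back to whatever exists.
    candidates = sorted(allowed - {0}) or sorted(allowed)
    target = candidates[0]
    os.sched_setaffinity(0, {target})  # pid 0 means the calling process
    return target


if __name__ == "__main__":
    print("pinned to CPU", pin_to_one_cpu())
```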

Multi-core CPUs also come with a technology called NUMA (Non-Uniform Memory Access). Traditional multi-core systems use the SMP (Symmetric Multi-Processor) model, in which multiple processors share one centralized memory and I/O bus. This raises the problem of keeping memory access consistent, and consistency usually costs performance. In NUMA mode the processors are divided into several nodes, and each node has its own local memory. For technical details about NUMA you can read the article "NUMA Technology for Linux". Under Linux the command for NUMA tuning is numactl. The following command, for example, runs "myprogram arg1 arg2" on node 0 with its memory allocated on nodes 0 and 1:

numactl --cpubind=0 --membind=0,1 myprogram arg1 arg2

Of course, the command above is not a good one, because the memory spans two nodes, which is bad for performance. The best approach is to let the program run on, and access memory from, one and the same node, for example:

$ numactl --membind 1 --cpunodebind 1 --localalloc myapplication

C) file system tuning

The file system has its own cache, so to get the most out of it the first thing is to give the machine enough memory. This is very important. Under Linux you can use the free command to view free/used/buffers/cached; ideally buffers and cached together should be around 40%. Next comes a fast disk controller: SCSI will be much better. Fastest of all is an Intel SSD, which is super fast but supports only a limited number of writes.

Next we can tune the file system configuration. For Linux ext3/4, one option that helps in almost every case is turning off access time updates: check in /etc/fstab whether your file system has the noatime option (in general it should). Another is delalloc, which lets the system optimize writes by deciding only at the last moment which blocks to use when writing a file. There are also three journaling modes: data=journal, data=ordered and data=writeback. The default, data=ordered, gives the best balance between performance and protection.

Of course, for all of these, the ext4 defaults are already close to the best choice.
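As a small aid for the noatime check described above, the following sketch scans fstab-style text and lists the ext3/ext4 mount points that lack the option. The function name missing_noatime and the sample devices and mount points are made up for illustration; only the six-column fstab layout is taken as given.

```python
def missing_noatime(fstab_text: str):
    """Return mount points of ext3/ext4 entries that do not set noatime."""
    result = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in ("ext3", "ext4") and "noatime" not in options.split(","):
            result.append(mount_point)
    return result


# Hypothetical sample entries, used only to exercise the function.
SAMPLE = """\
# device      mount  type  options            dump pass
/dev/sda1     /      ext4  defaults,noatime   0    1
/dev/sdb1     /data  ext4  defaults           0    2
/dev/sdc1     /tmp2  xfs   defaults           0    2
"""

if __name__ == "__main__":
    print(missing_noatime(SAMPLE))
```

On a real machine the text would come from reading /etc/fstab instead of the sample string.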

One Linux command for viewing I/O is worth mentioning here: iotop, which shows the disk read and write load of each process.

There are further tuning options for NFS and XFS; you can search Google for the relevant optimization articles. For a comparison of the individual file systems, you can read the article "Linux Journaling File Systems and Performance Analysis".

4.5 Database Tuning

Database tuning is not my strong point, so I can only offer what my very limited knowledge allows. Note that what follows is not necessarily correct: under different business scenarios and different database designs you may reach exactly the opposite conclusions. I will therefore only make some general remarks here, and specific problems need specific analysis.

A) Database Engine tuning

I am not very familiar with database engines, but there are a few things that I think you must understand.

How the database locks. This is very, very important. Under concurrency, locks have a very large impact on performance: the various isolation levels, row locks, table locks, page locks, read/write locks, transaction locks, and the various write-priority or read-priority mechanisms. The highest performance comes from not locking at all, so splitting databases and tables, keeping redundant data, and relaxing the consistency requirements of transactions can all improve performance effectively. NoSQL sacrifices consistency and transaction processing, and keeps redundant data, in order to achieve distribution and high performance.
How the database stores data. You need to understand not only how each field type is stored but, more importantly, how the database stores its data, how it partitions it and how it manages it, for example Oracle's data files, tablespaces, segments and so on. Understanding this mechanism can relieve a lot of I/O load. For example, under MySQL you can use show engines to see which storage engines are supported. Different storage engines have different emphases, and they perform differently for different businesses or database designs.
How the database is distributed. The simplest approach is replication or mirroring. You need to understand distributed consistency algorithms, as well as master-master and master-slave synchronization. Once you understand how these mechanisms work, you can scale horizontally at the database level.

B) SQL statement optimization

When optimizing SQL statements, the first step is to use tools such as MySQL SQL Query Analyzer, Oracle SQL Performance Analyzer, or Microsoft SQL Query Analyzer. Basically every RDBMS has a tool of this kind that lets you see the performance problems of the SQL in your application. You can also use EXPLAIN to see what the final execution plan of an SQL statement looks like.
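To show what inspecting an execution plan looks like in practice, here is a small sketch that uses SQLite through Python's sqlite3 module as a stand-in. SQLite's statement is EXPLAIN QUERY PLAN, and its output format differs from EXPLAIN in MySQL or Oracle. The table, the index name idx_user_lastname and the sample rows are all made up for the illustration.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan for a statement as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, LastName TEXT)")
    conn.executemany(
        "INSERT INTO user (LastName) VALUES (?)",
        [("name%d" % i,) for i in range(1000)],
    )
    sql = "SELECT * FROM user WHERE LastName = ?"
    before = query_plan(conn, sql, ("name42",))  # no index yet: table scan
    conn.execute("CREATE INDEX idx_user_lastname ON user (LastName)")
    after = query_plan(conn, sql, ("name42",))  # now an index search
    conn.close()
    return before, after


if __name__ == "__main__":
    for plan in demo():
        print(plan)
```

The first plan reports a scan of the whole table, while the second reports a search that uses the index, so the difference shows up before the query is run against real data.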

Another important point is that database operations need a lot of memory, so the server must have enough of it. This matters most for SQL statements that query multiple tables, which consume a great deal of memory.

Based on my limited knowledge of SQL, here are a few kinds of SQL that tend to have performance problems:

Full table scan. For example: SELECT * FROM user WHERE LastName = "XXXX". An SQL statement like this is basically a full table lookup with linear complexity O(n): the more records, the worse the performance (for example, if a lookup over 100 records takes 50 ms, then at the same linear rate 1 million records take about 500 seconds, more than 8 minutes). There are two ways to improve performance in this case: one is to split the table to reduce the number of records, and the other is to build an index (on LastName). An index is like a key-value data structure in which the key is the column used in the WHERE clause and the value is the physical row location. The search complexity of an index is basically O(log(n)), since indexes are implemented with a B-tree (for example, if a lookup over 100 records takes 50 ms, 1 million records need only about 150 ms, because log(n) merely triples).
Index. For indexed fields it is best not to perform calculations, type conversions, functions, NULL checks or field concatenation on them, since these operations defeat the index. Indexes are generally used by the WHERE or ORDER BY clause, so in WHERE and ORDER BY clauses it is best not to compute on the fields, add NOT, or apply functions to them.
Multiple table queries. The most common operation in a relational database is the multi-table query. There are three keywords for it: EXISTS, IN and JOIN (for the various joins, see the illustrated SQL JOIN article). Modern database engines generally optimize SQL statements quite well, so although JOIN and IN/EXISTS differ in their results, their performance is about the same. Some people say that EXISTS performs better than IN and IN performs better than JOIN. I think this depends on your data, your schema and the complexity of your SQL statement. For ordinary simple cases they are all similar. So do not nest too deeply and do not make your SQL too complex: it is better to use several simple SQL statements than one huge statement nested N levels deep. Others say that if the two tables hold a similar amount of data, EXISTS may perform better than IN and IN better than JOIN, and that if one table is large and the other small, the subquery should use EXISTS with the large table and IN with the small table. I have not verified this, so I put it here for discussion. There is also an article about SQL Server that you can read: "IN vs. JOIN vs. EXISTS".
JOIN operation. Some people say that the order of the joined tables affects performance. In fact, as long as the result set of the join is the same, performance is independent of the join order, because the database engine optimizes it for us behind the scenes. There are three algorithms for implementing a join: nested loop, sort merge and hash join. (At the time of writing, MySQL supported only the first.)
(1) Nested loop. This is just like the nested loops we write all the time. Note that, as mentioned under indexes above, the database's index lookup is a B-tree, an O(log(n)) algorithm, so joining a table of n rows to an indexed table of m rows costs roughly O(n * log(m)): one index lookup for every row of the outer table.
(2) Hash join. This addresses the O(log(n)) cost of each lookup in the nested loop by building a temporary hash table for matching.
(3) Sort merge. The two tables are sorted by the query field and then merged. Indexed fields, of course, are generally already in order.
As always, you have to look at what kind of data and what kind of SQL statement you have before you can tell which method is best.
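The difference between the nested loop and the hash join can be sketched outside any database. The two functions below are simplified in-memory illustrations over lists of dictionaries, not how a particular engine implements them; the nested loop here has no index, so it compares every pair of rows. All names and sample rows are hypothetical.

```python
def nested_loop_join(left, right, key):
    """Compare every left row with every right row: O(n * m) comparisons."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(left, right, key):
    """Build a temporary hash table on one side, then probe it per row."""
    buckets = {}
    for r in right:  # build phase: one pass over the right table
        buckets.setdefault(r[key], []).append(r)
    # Probe phase: one hash lookup per left row, O(1) on average.
    return [(l, r) for l in left for r in buckets.get(l[key], [])]


if __name__ == "__main__":
    users = [{"uid": 1, "name": "a"}, {"uid": 2, "name": "b"}]
    orders = [{"uid": 2, "item": "x"}, {"uid": 2, "item": "y"}]
    print(hash_join(users, orders, "uid"))
```

Both functions return the same pairs. The hash join trades extra memory for the temporary table against fewer comparisons, which is why it suits equality joins on large inputs.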
Partial result set. We know the LIMIT keyword in MySQL, ROWNUM in Oracle and TOP in SQL Server, all of which restrict the result to the first few rows. This gives the database engine a lot of room to optimize. In general, returning the top N records requires ORDER BY, and note that the ORDER BY field needs an index. With an indexed ORDER BY, the performance of the SELECT statement is not affected by the number of records. Using this technique, the front end usually shows the data page by page, with OFFSET in MySQL and FETCH NEXT in SQL Server. This kind of fetch is not good because it has linear complexity. If we know the starting value of the ORDER BY field for the second page, we can select directly with a >= expression in the WHERE clause. This technique is called seek rather than fetch, and seek performs much better than fetch.
Strings. As I said earlier, string operations are a nightmare for performance, so store data as numbers wherever you can, for example times, employee numbers and so on.
Full-text search. Do not use LIKE and the like for full-text search. If you want full-text search, you can try Sphinx.
Other.
(1) Do not use SELECT *; name the fields explicitly. If there is more than one table, always put the table name in front of the field name so that the engine does not have to work it out.
(2) Do not use HAVING, because it has to traverse all the records. Its performance is about as bad as it gets.
(3) Replace UNION with UNION ALL wherever possible.
(4) With too many indexes, INSERT and DELETE become slower. UPDATE also slows down if it touches most of the indexes, but if it touches only one, only that one index is affected.
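The seek technique described under partial result sets can be sketched with SQLite through Python's sqlite3 module. The post table, its columns and the helper names are hypothetical, and LIMIT/OFFSET is SQLite and MySQL syntax; SQL Server would use OFFSET ... FETCH NEXT instead.

```python
import sqlite3


def make_db(n=100):
    """Create an in-memory table with ids 1..n (the primary key is indexed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany(
        "INSERT INTO post (id, title) VALUES (?, ?)",
        [(i, "post %d" % i) for i in range(1, n + 1)],
    )
    return conn


def page_by_offset(conn, page, size):
    """Offset paging: the engine still walks past page * size rows."""
    return conn.execute(
        "SELECT id, title FROM post ORDER BY id LIMIT ? OFFSET ?",
        (size, page * size),
    ).fetchall()


def page_by_seek(conn, start_id, size):
    """Seek paging: jump straight to the first key of the page via the index."""
    return conn.execute(
        "SELECT id, title FROM post WHERE id >= ? ORDER BY id LIMIT ?",
        (start_id, size),
    ).fetchall()


if __name__ == "__main__":
    conn = make_db()
    print(page_by_seek(conn, 11, 3))
```

Seek paging needs the caller to remember the key where the next page starts, so it fits "next page" navigation better than jumping to an arbitrary page number.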

That is all for now. You are welcome to add more.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
