Performance tuning: a thread pool problem found during the 5.28 big-promotion load test

Directory:

1. Introduction to the Environment

2. Symptoms

3. Diagnostics

4. Conclusion

5. Solution

6. Comparison with Java

Without further ado, this article shares a performance problem the blogger solved during the 5.28 big-promotion load test. I found it quite interesting and worth summarizing and sharing.

The department I work in is the common service platform, which supports all of the upper-level business systems (2C, UGC, live streaming, and so on). One of the cores of the platform is the order domain and its related services: the order placement service, the order query service, and the payment callback service. The checkout page is also our responsibility; it is the link that connects ordering and settlement and jumps to the payment center. Before each big promotion, every business party on the platform runs a routine load test so that everyone can feel confident.

In the first half of the load test there were a few problems, none of them particularly strange, and locating them stayed within the planned time. The order placement service, the order query service, and the checkout page all passed the test successfully. But when the payment callback service was put under load, a strange problem appeared.

1. Introduction to the Environment

We basically run big promotions twice a year: 5.28 and Double 12. The gap between them is only about half a year, so every big load test makes us a little nervous; it is essentially an exam of the day-to-day work. In practice, no major performance problems usually appear in those six months, on the premise that we run a performance test every time a major project goes live, so load testing has gradually become routine and automated and there should not be many performance problems left undiscovered. Performance indicators need attention in normal times, not last-minute cramming before the big promotion; by then it is too late and you can only rob Peter to pay Paul.

Application server configuration: physical machine, 32-core i7, 64-bit, 168 GB RAM, gigabit NIC, gigabit bandwidth on the load-test network, IIS 7.5, .NET 4.0. This test server is quite powerful.

We used JMeter to reproduce and troubleshoot the problem locally. Since this article is not about how to run a load test, details unrelated to the problem, such as load-test network isolation and the configuration and number of load-generator machines, are not covered.

Our requirement for the top-level services is that at 200 concurrent users the average response time must not exceed 50 milliseconds and TPS should reach about 3000. For the first-tier services, that is, the lowest-level services such as the commodity system, the promotion system, and the card/voucher system, the average response time basically has to stay within 20 milliseconds to be acceptable, because the response time of the first-tier services directly determines the response time of the upper-tier services, once some other call overhead is subtracted.

2. Symptoms

The symptom of this performance problem was quite strange. The setup was 200 concurrent users, 2000 loops, about 400,000 requests in total. In the first few seconds the speed was relatively fast, with TPS reaching about 2500 and the server's CPU at around 60%, which looks fairly normal. But after a few seconds the processing speed dropped and TPS slowly fell. From the server monitoring, CPU consumption fell to 0%. It was scary, and it did not come back on its own. TPS fell to just over 100, and it was clear it would keep falling. After waiting a little less than 4 minutes, the CPU suddenly came back up and TPS returned to around 2000.

We analyzed this carefully, starting with JMeter's throughput figure. Throughput is calculated from the average response time of your requests, so what looks here like TPS slowly decreasing actually means processing has basically stopped: if your average response time is 20 milliseconds, the throughput per unit of time can basically be worked out from that.
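As a rough back-of-the-envelope check (an approximation, assuming the load generator keeps all of its threads busy): throughput ≈ concurrency / average response time. With 200 concurrent threads and a 50 ms average response time that gives 200 / 0.05 = 4000 requests per second, which lines up with the ~3000 TPS target above; at a 20 ms average it would be up to 200 / 0.02 = 10,000. Conversely, once the average response time climbs to around 2 seconds, the same arithmetic pushes the reported TPS down toward 200 / 2 = 100, which is exactly the curve we saw in JMeter.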

Those are the main symptoms; next we diagnose the problem.

3. Diagnostics

We started by walking through the code to see whether anything stood out.

This is the payment callback service. There is not much business processing around it in the code: authentication checks, updating the order's payment status, triggering the payment-completed event, calling delivery, and notifying peripheral business systems (part of it has to stay compatible with old code and old interfaces). We first looked at the external dependencies and found the Redis read/write code, so we commented out the Redis part and ran the load test again. Suddenly everything was normal. Strange: this Redis instance is shared with our other load-tested services, so why was there no problem in the earlier tests? We could not think about it too deeply; perhaps it was just a different order of code execution, which in the world of concurrency is normal enough.

We then logged how long the Redis calls took to see where the time was going. The results showed that processing speed was uneven: the early calls were very fast, while later calls took 5 to 6 seconds. Uneven, but very regular.
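The timing we added was along the lines of the following minimal sketch (redisDb, orderKey and Log are placeholder names, not our actual wrapper code, which sits two layers above StackExchange.Redis):

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
var value = redisDb.StringGet(orderKey);   // synchronous read via the StackExchange.Redis IDatabase API
stopwatch.Stop();
// note: the elapsed time measured here includes any time the command spends queued on the client,
// not just the Redis server's processing time
Log.Info(string.Format("redis GET {0} took {1} ms", orderKey, stopwatch.ElapsedMilliseconds));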

So we all assumed it was a Redis problem and started digging into Redis: monitoring the TCP connections with Wireshark, checking the Redis server's slowlog for slow commands, and reading the source of the Redis client library (the Redis client is two layers of wrappers around the native StackExchange.Redis, three layers in total), focusing on locking and thread waits. We also ruled out network problems by pinging the Redis server during the test to see whether there was any latency. (By this time it was about 9 pm; you can imagine the state our brains were in.)

This was carpet-bombing style troubleshooting, which is sure to locate the problem eventually. But we ignored the layering of the code and dove straight into the details, ignoring the overall architecture (the development architecture, that is; the code was not written by us and we were not very familiar with the surrounding code).

First we looked at the connections to the Redis server. TCP packet capture showed that connections were established normally, there was no packet loss, and the speed was fast. Redis's own processing speed was fine; the slowlog showed basic GET key commands taking less than 1 millisecond. (Note that what we logged as Redis processing time also includes time spent waiting in a queue. The slowlog only shows the Redis server's processing time and not the blocking time, which also includes the time a Redis command spends queued on the client.)

So the slow Redis times we printed were not purely Redis server processing time; there were still several links in the middle to rule out.

After some tossing around, the problem was still not located. It was late at night, we were seriously short of energy, and the last subway train was about to leave, not to be missed. We went home; I made the last train with barely three minutes to spare.

We rethought things and continued troubleshooting the next day.

We narrowed the problem down to the Redis client's connection: if the connection is warmed up first, when the application starts (in the global Application_Start), performance is fine.
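The warm-up was along these lines; this is a minimal sketch assuming a bare StackExchange.Redis client and a Global.asax application class (the real code goes through two more wrapper layers, and the host name is a placeholder):

using System.Web;
using StackExchange.Redis;

public class Global : HttpApplication
{
    public static ConnectionMultiplexer RedisConnection;

    protected void Application_Start()
    {
        // establish the Redis connection once at application start instead of on the first request,
        // so the TCP handshake and client initialization are not paid for under load
        RedisConnection = ConnectionMultiplexer.Connect("redis-host:6379");
        RedisConnection.GetDatabase().Ping();   // one round trip to finish warming up
    }
}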

The scope was narrowed further: the problem was the connection. Here we paused to reflect (after a night's sleep the brain was clearer): why had our previous load tests never hit this problem? We must not get carried away by technical mania. At this point the symptom could be worked around, but if the remaining clues could not be strung together it would always feel uncomfortable. (By now it was the afternoon, almost evening, of the second day.) Technical people will understand this urge to conquer the problem.

We restored the scene and then brought out the big gun: dumping the process. We took several dump files at different points in time and pulled them down for local analysis.

First we looked at the thread situation with !runaway and found that most threads had been executing for quite a long time. Then we used ~XXs to switch to one of the threads and look at its call stack, and found it waiting on a monitor lock. We switched to several other threads to see whether they were all waiting on the same lock. They were.

The conclusion: half of the threads were waiting on that monitor lock, and as time went on it was not far from all of them waiting on it. Rather strange.
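For reference, the dump analysis was roughly the following WinDbg/SOS command sequence (a sketch from memory; the thread number is only an example, and the notes after the $$ comment markers paraphrase what we saw rather than quoting real output):

.loadby sos clr   $$ load the SOS extension so the managed (CLR) commands are available
!runaway          $$ user-mode CPU time per thread; most threads had accumulated a lot of time
~5s               $$ switch to one of the suspicious threads (5 is only an example)
!clrstack         $$ managed call stack of that thread: blocked in Monitor.Enter, waiting for a lock
!syncblk          $$ shows which thread owns the monitor and how many threads are waiting on it
!threadpool       $$ CLR thread pool status; this is where Idle = 0 later showed up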

This lock is used in the third layer of the Redis wrapper when acquiring a Redis connection. We commented out this lock directly, continued load testing and continued dumping, and then found another monitor lock, this time inside StackExchange.Redis itself. That code cannot be digested in a short while, so we only checked the main code paths and their surroundings, with no time to review the whole picture (time was tight). For the moment we chose to fully trust the third-party library and instead reviewed the various Redis connection-string parameters: whether timeouts, connection pool size, and so on could be tuned. Still not resolved.

We went back to the dump and looked at the CLR thread pool with !threadpool, and suddenly saw the problem.

[Figure: !threadpool output from one of the dump files, showing the state of the CLR worker thread pool.]

We continued through the other dump files: Idle was 0, that is, the CLR thread pool had no idle threads left to handle requests. At the very least, the rate at which the CLR thread pool creates threads could not keep up with the concurrency.

The CLR thread pool typically creates new threads at a rate of about 2 per second; the exact creation rate over a sliding window is less clear. The size of the thread pool can be set in C:\Windows\Microsoft.NET\Framework64\v4.0.30319\Config\machine.config and is auto-configured by default; the minimum number of threads is typically the number of CPU cores of the machine. You can also set it through the ThreadPool methods ThreadPool.SetMaxThreads() and ThreadPool.SetMinThreads().
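As an illustration (a sketch only; the value 200 is arbitrary and not what we shipped, since the real fix in section 5 is to stop consuming an extra thread per request), the minimum can be raised in code at startup so the pool does not have to ramp up slowly under a sudden burst:

// e.g. at the top of Application_Start
int workerMin, iocpMin;
ThreadPool.GetMinThreads(out workerMin, out iocpMin);
// raise the floor for worker and IO completion-port threads;
// beyond the configured minimum the CLR only injects new threads gradually
ThreadPool.SetMinThreads(Math.Max(workerMin, 200), Math.Max(iocpMin, 200));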

Then we went back through the code and found where the Action delegate is defined: this Action wraps the asynchronous part of the processing, and the Redis reads and writes mentioned above happen inside it. At that point all the clues linked up.

4. Conclusion

The .NET CLR thread pool is a shared thread pool; that is, ASP.NET request processing runs on the same thread pool behind the scenes. The pool contains two kinds of threads: worker threads and IOCP threads (I/O completion port threads).
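A quick way to watch both kinds of threads during a test is to log something like the following periodically (a sketch; Log is a placeholder for whatever logger is at hand):

int workerAvailable, iocpAvailable, workerMax, iocpMax;
ThreadPool.GetAvailableThreads(out workerAvailable, out iocpAvailable);
ThreadPool.GetMaxThreads(out workerMax, out iocpMax);
// threads in use = maximum minus currently available, split by worker vs IO completion-port threads
Log.Info(string.Format("worker in use {0}/{1}, IOCP in use {2}/{3}",
    workerMax - workerAvailable, workerMax,
    iocpMax - iocpAvailable, iocpMax));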

Now the clues line up:

1. The slowly decreasing throughput seen in JMeter from the very beginning was an illusion; at that point processing had already essentially stopped and the server's CPU was at 0%. It only looks like a gradual slowdown because the request latency keeps increasing.

2. The Redis TCP connections were fine: Wireshark showed nothing abnormal and the slowlog showed no problem. The Redis commands were slow because they were blocked on the client side.

3. The other services we load tested were all fine because they call Redis synchronously, and once the first TCP connection is established the speed comes up.

4. The Action makes the speed look fine, but every Action runs on a CLR thread pool thread; it only looks fast as long as the CLR thread pool has not yet become the bottleneck.

Action asyncAction = () =>
{
    // read/write Redis
    // send mail
    // ...
};

// simplified; the real code executes this delegate asynchronously,
// which occupies another CLR thread pool thread
asyncAction();

5. There was no ramp-up delay in the JMeter test and the application was not warmed up beforehand, so everything (IIS, .NET, and so on) gets initialized under load. This also contributes to the illusion of being fast at first and then slowly dropping.

Summary: establishing the first TCP connection takes time. With concurrency this high, threads pile up in wait, and waking all those waiting threads back up means heavy thread context switching, which is itself a significant overhead. Once the CLR thread pool's threads are all used up, throughput starts to plunge. Each request actually occupies two threads: one thread pool thread handling the request itself, and one more for the Action delegate. Just when you think there are plenty of threads, the thread pool is in fact already full.

5. Solution

We solved the problem with queue-based processing: effectively a work queue abstracted on top of the CLR thread pool. The number of threads consuming the queue is kept within a fixed limit; a default consumer thread is started at initialization, and an interface is exposed that can create at most 6 threads in total, which can be called when the queue is not being drained fast enough. The approximate code is as follows (it has been modified appropriately and is not the original source; for reference only):

Service section:

private static readonly ConcurrentQueue<PayNoticeParamEntity> _asyncNotifyPayQueue = new ConcurrentQueue<PayNoticeParamEntity>();
private static int _workThread;

static ChangeOrderService()
{
    StartWorkThread();
}

public static int GetPayNoticeQueueCount()
{
    return _asyncNotifyPayQueue.Count;
}

public static int StartWorkThread()
{
    // cap the number of consumer threads at 6
    if (_workThread > 5) return _workThread;

    ThreadPool.QueueUserWorkItem(WaitCallbackImpl);
    _workThread += 1;

    return _workThread;
}

public static void WaitCallbackImpl(object state)
{
    while (true)
    {
        try
        {
            PayNoticeParamEntity payParam;
            _asyncNotifyPayQueue.TryDequeue(out payParam);

            if (payParam == null)
            {
                // nothing queued: back off instead of spinning
                Thread.Sleep(5000);
                continue;
            }

            // get order details
            // carry-over and apportionment
            // send SMS
            // send message
            // delivery
        }
        catch (Exception exception)
        {
            // log
        }
    }
}

The original call site is changed to simply enqueue:

private void AsyncNotifyPayCompleted(PayNoticeParamEntity payNoticeParam)
{
    _asyncNotifyPayQueue.Enqueue(payNoticeParam);
}

Controller Code:

public class WorkQueueController : ApiController
{
    [Route("worker/server_work_queue")]
    [HttpGet]
    public HttpResponseMessage GetServerWorkQueue()
    {
        var payNoticeCount = ChangeOrderService.GetPayNoticeQueueCount();

        var result = new HttpResponseMessage()
        {
            Content = new StringContent(payNoticeCount.ToString(), Encoding.UTF8, "application/json")
        };

        return result;
    }

    [Route("worker/start-work-thread")]
    [HttpGet]
    public HttpResponseMessage StartWorkThread()
    {
        var count = ChangeOrderService.StartWorkThread();

        var result = new HttpResponseMessage()
        {
            Content = new StringContent(count.ToString(), Encoding.UTF8, "application/json")
        };

        return result;
    }
}

The code above is not abstracted or encapsulated; it is for reference only. The idea stays the same: maximize thread utilization, keep deferrable tasks from consuming too many threads, separate CPU-bound and IO-bound work, and decouple the parts whose speeds do not match.

After the optimization, TPS reached about 7000, nearly three times the original.

6. Comparison with Java

This problem would probably be harder to hit in Java: the Java thread pool facilities are more powerful and the concurrency libraries are richer. In Java this can be done in two lines of code.

ExecutorService fixedExecutorService = Executors.newFixedThreadPool(THREAD_COUNT);

This directly constructs a thread pool with the specified number of threads. Of course, we can also configure the thread pool's queue type and size, as well as the rejection policy for when the queue and the thread pool are both full. All of this is quite convenient to use.
