Server performance is now an issue for many developers who previously wrote only desktop applications. The success of the Component Object Model (COM) and componentware has produced an unexpected result: if you use an application server such as ASP (an extension of IIS), you no longer have to write the hosting code yourself. The catch is that most of this component code was never written with a real server environment in mind. There are many important differences between the desktop environment and the server environment, and these differences can have unpredictable effects on performance.
Desktop Applications vs. Server Applications
The factors that affect desktop application performance are well known. Long instruction paths mean slower code. Using large amounts of resources bloats your application and leaves fewer resources available for other applications in the system. Slow startup times irritate users. An oversized working set increases the machine's page fault rate, making the whole system feel sluggish. Server applications are affected by all of these factors, and by the additional factors described below:
A server application typically handles dozens or even hundreds of clients at the same time. For a desktop application, responding to the user within a tenth of a second feels instantaneous. But if an operation takes a full 100 ms, the application can perform only 10 operations per second, and most server applications need far more throughput than 10 requests per second. High network latency (the time it takes a message to travel) adds to the response time, which means the server must respond even faster to meet its requirements.
Server applications often handle very large data sets. Inefficiencies that waste running time, tolerable on small inputs, become crippling when you must process millions of records.
Server machines are more powerful than desktop machines: they have more memory, larger disks, faster CPUs, and usually multiple processors. Even so, that is not enough. A desktop machine deals with sporadic bursts of activity and is idle most of the time, while a server's load is continuous. Server machines are expensive, and they must earn their keep.
A server application needs uptimes measured in months. Its performance must not degrade over time through the accumulation of resource leaks or cruft (data structures and accumulated statistics that require periodic cleanup).
Most server applications need a multithreaded architecture. Consider a server that processes only one request at a time: a single-threaded server spends most of its time blocked on I/O, and its performance is unacceptable. A pool of threads can process several requests concurrently, putting otherwise idle processor cycles to work. To take full advantage of a multiprocessor system, a server application must be multithreaded. Unfortunately, multithreaded applications are hard to write, hard to debug, and hard to get right, especially on multiprocessor systems. But done correctly, their performance far exceeds that of the equivalent single-threaded application, and from that point of view the effort is worthwhile.
Single-threaded applications are relatively simple and easy to understand: only one thing happens at a time. In a multithreaded application, bugs arise from complex interactions between threads, and their effects are hard to predict. Worse, these interactions, catastrophic or not, are hard to reproduce. Desktop applications rarely have more than one thread, and even when they do, the extra threads are used only for discrete background tasks such as printing.
The flexibility and performance of IIS
Internet Information Server (IIS) is an application server. In many ways it is like a virtual operating system, because many ASP and ISAPI applications run within its process space.
IIS uses a pool of I/O threads to handle all incoming requests. Requests for static files (.htm, .jpg) are satisfied immediately, while requests for dynamic content are dispatched to the appropriate ISAPI extension dynamic-link library (DLL). The ASP extension runs ASP pages on a pool of worker threads. Because ASP is COM-based, components execute in-process. This is both good and bad. It is great for developers because it allows simple reuse of components; that flexibility is what has made ASP and IIS so successful. However, the same flexibility causes performance problems: many components were written for desktop systems, and many of the components created specifically for ASP are written by people with little experience writing high-performance server components.
The same is true of ISAPI extensions and filters. And there can be serious interactions between different components, and between different instances of the same component.
The commandments below apply to IIS, and most of them apply to other server applications as well.
10 Commandments to stifle server performance
Each of the following commandments will effectively hurt the performance and scalability of your code. In other words, avoid following them at all costs! Below each one, I explain how to break it in order to improve performance and scalability.
You should allocate and free as many objects as possible
In fact, you should avoid excessive memory allocation, because heap allocation is costly. Freeing blocks of memory can be even more expensive, because most heap allocators try to coalesce adjacent free blocks into larger ones. Prior to Windows NT 4.0 Service Pack 4, the system heap performed poorly under multithreading: it was protected by a single global lock and did not scale on multiprocessor systems.
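One way to break this commandment is to recycle memory rather than hitting the heap on every request. Below is a minimal sketch of a per-class free list; the Request structure, its size, and the class names are hypothetical, chosen only for illustration:

// Sketch: a simple free list that recycles fixed-size request objects,
// avoiding a heap allocation and free on every request.
// All names here are illustrative, not from the original article.
#include <windows.h>

struct Request {
    char buffer[512];   // per-request scratch space
    Request* next;      // intrusive free-list link
};

class RequestPool {
public:
    RequestPool() : m_free(NULL) { InitializeCriticalSection(&m_lock); }
    ~RequestPool() { /* free remaining blocks, omitted for brevity */
                     DeleteCriticalSection(&m_lock); }

    Request* Acquire() {
        EnterCriticalSection(&m_lock);
        Request* r = m_free;
        if (r != NULL)
            m_free = r->next;      // reuse a recycled object
        LeaveCriticalSection(&m_lock);
        if (r == NULL)
            r = new Request();     // hit the heap only when the list is empty
        return r;
    }

    void Release(Request* r) {
        EnterCriticalSection(&m_lock);
        r->next = m_free;          // push back for reuse instead of freeing
        m_free = r;
        LeaveCriticalSection(&m_lock);
    }

private:
    Request* m_free;
    CRITICAL_SECTION m_lock;
};

A real server would also cap the list's length and preallocate a batch of objects at startup, but the idea is the same: the steady-state cost of a request no longer includes a heap allocation and a free.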
You should give no thought to the processor cache
Most people know that a hard page fault generated by the virtual memory subsystem is expensive and best avoided, but many believe that all other memory accesses cost the same. That view has been wrong since the 80486. Modern CPUs are so much faster than RAM that they need at least two levels of memory cache: a fast L1 cache that (on a Pentium) holds 8 KB of data and 8 KB of instructions, and a slower L2 cache that holds several hundred kilobytes of mixed code and data. A reference to a memory location found in the L1 cache costs one clock cycle, a reference that hits the L2 cache costs 4 to 7 clock cycles, and a reference to main memory costs many processor cycles, a figure that will soon exceed 100. In many ways, the cache hierarchy is like a small, high-speed virtual memory system.
The basic unit of memory the cache works with is not a byte but a cache line. The Pentium's cache lines are 32 bytes wide; the Alpha's are 64 bytes. That means the Pentium's L1 cache has only 512 slots for code and data. If data that is used together (temporal locality) is not stored together (spatial locality), performance suffers. Arrays have excellent locality, while linked lists and other pointer-based data structures tend to have poor locality.
Packing related data into the same cache line usually improves performance, but on multiprocessor systems it can also wreck it. The memory subsystem must work hard to keep the caches coherent across processors. If a piece of read-only data used by all processors shares a cache line with data that one processor updates frequently, the caches will spend a long time shuttling their copies of that line back and forth. This high-speed game of ping-pong is often called "cache sloshing". If the read-only data lives in a different cache line, the sloshing is avoided.
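Here is a minimal sketch of keeping frequently written data off the cache line that holds shared read-only data; the 64-byte line width and the structure contents are assumptions for illustration, not from the original text:

// Sketch: separating per-processor "hot" data from shared read-only data
// so they do not share a cache line (avoiding cache sloshing).
// The 64-byte line width is an assumption; use your target CPU's width.
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

// Read-only configuration shared by every thread: it is safe for all
// processors to keep their own cached copy of this line.
struct alignas(kCacheLine) SharedConfig {
    std::uint32_t maxConnections;
    std::uint32_t timeoutMs;
};

// One frequently updated counter per worker thread. The alignment pads
// each element onto its own cache line, so updates by one processor do
// not invalidate the line held in other processors' caches.
struct alignas(kCacheLine) PerThreadCounter {
    std::uint64_t requestsHandled;
};

SharedConfig g_config = { 1000, 30000 };
PerThreadCounter g_counters[16];   // indexed by worker thread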
Optimizing code for space is more effective than optimizing it for speed. Less code means fewer pages, a smaller working set, fewer page faults, and fewer cache lines occupied. However, a few core functions should be optimized for speed; use a profiler to identify them.
Never cache frequently used data.
Software caching is useful in all kinds of applications. When a computation is expensive, you keep a copy of the result. This is the classic space-time tradeoff: sacrifice some storage to save time. Done well, it can be extremely effective.
You must cache the right things, though. Cache the wrong data and you waste storage. Cache too much and too little memory is left for everything else. Cache too little and efficiency suffers, because you keep recomputing data that missed the cache. Cache time-sensitive data for too long and it goes stale. Servers generally care more about speed than space, so they cache more aggressively than desktop systems. Be sure to evict unused cache entries periodically, or you will have working-set problems.
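As a sketch of these tradeoffs, the bounded LRU cache below evicts the least recently used entry when it reaches capacity. The class and its string keys are hypothetical; a real server cache would also need a lock (or partitioned locks, as discussed later) and an expiry time for stale data:

// Sketch: a bounded LRU result cache. Evicting the least recently used
// entry keeps the cache from growing without limit and causing the
// working-set problems described above. Illustrative only.
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

class LruCache {
public:
    explicit LruCache(std::size_t capacity) : m_capacity(capacity) {}

    // Returns true and fills 'value' on a hit; promotes the entry
    // to most-recently-used so it survives longer.
    bool Get(const std::string& key, std::string& value) {
        auto it = m_index.find(key);
        if (it == m_index.end())
            return false;                       // miss: caller recomputes
        m_order.splice(m_order.begin(), m_order, it->second);
        value = it->second->second;
        return true;
    }

    void Put(const std::string& key, const std::string& value) {
        auto it = m_index.find(key);
        if (it != m_index.end()) {              // update an existing entry
            it->second->second = value;
            m_order.splice(m_order.begin(), m_order, it->second);
            return;
        }
        if (m_order.size() == m_capacity) {     // evict least recently used
            m_index.erase(m_order.back().first);
            m_order.pop_back();
        }
        m_order.emplace_front(key, value);
        m_index[key] = m_order.begin();
    }

private:
    std::size_t m_capacity;
    std::list<std::pair<std::string, std::string>> m_order;  // MRU first
    std::unordered_map<std::string,
        std::list<std::pair<std::string, std::string>>::iterator> m_index;
};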
Multiple threads should be created, the more the better.
In fact, it is critical to tune the number of threads working in the server. If a thread is I/O-bound, it spends much of its time blocked waiting for I/O to complete, and a blocked thread does no useful work. Adding threads can increase throughput, but adding too many degrades the server, because context switching becomes a significant overhead. The context-switch rate should be kept low, for three reasons: context switches are pure overhead that contributes nothing to the application's work; they burn valuable clock cycles; and, worst of all, they fill the processor's caches with useless data, which is expensive to replace.
Much depends on your threading architecture. One thread per client is flatly inappropriate: it does not scale to large numbers of users, because context switching becomes unbearable and Windows NT runs out of resources. A thread-pool model works far better: a pool of worker threads services a queue of requests, and Windows 2000 provides APIs for exactly this, such as QueueUserWorkItem.
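A minimal sketch of the thread-pool model with QueueUserWorkItem follows; the Request structure and the HandleRequest logic are placeholders for real request handling:

// Sketch: handing a request to the Windows 2000 thread pool with
// QueueUserWorkItem instead of creating one thread per client.
// The Request structure and HandleRequest body are illustrative.
#include <windows.h>
#include <stdio.h>

struct Request {
    int clientId;
    // ... request parameters ...
};

// Runs on a pool thread; the pool decides how many threads to keep busy.
DWORD WINAPI HandleRequest(LPVOID context)
{
    Request* req = static_cast<Request*>(context);
    printf("handling request from client %d\n", req->clientId);
    // ... do the actual work ...
    delete req;          // the work item owns its request
    return 0;
}

void OnRequestArrived(int clientId)
{
    Request* req = new Request;
    req->clientId = clientId;
    // WT_EXECUTEDEFAULT queues the item to a non-I/O pool thread.
    if (!QueueUserWorkItem(HandleRequest, req, WT_EXECUTEDEFAULT)) {
        delete req;      // queueing failed; handle the error
    }
}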
Global locks should be used on data structures
The easiest way to make data thread-safe is to put one big lock around it. For simplicity, everything is protected by the same lock. But this approach has a problem: serialization. Every thread that touches the data must queue up to acquire the lock, and a thread blocked on a lock is doing nothing useful. Under light load this is rarely a problem, because usually only one thread wants the lock at a time. Under heavy load, fierce contention for the lock can become a serious bottleneck.
Imagine an accident on a multi-lane highway that diverts all traffic into a single narrow lane. With light traffic, the effect on the flow rate is negligible. With heavy traffic, the jam stretches for miles as vehicles slowly merge into the one open lane.
Several techniques can reduce lock contention:
· Don't over-protect. Lock data only when it actually needs protection, hold the lock only as long as necessary, and avoid taking locks around large blocks of code or around frequently executed code.
· Partition the data so that each partition can be protected by its own lock. For example, a symbol table can be partitioned by the first letter of the identifier, so that modifying the value of a symbol whose name begins with Q does not block reading a symbol whose name begins with H.
· Use the Interlocked family of APIs (InterlockedIncrement, InterlockedCompareExchangePointer, and so on) to modify data atomically without taking a lock at all.
· Use multi-reader/single-writer locks when the data is read far more often than it is modified. You get better concurrency, although the lock operations themselves cost more and you risk starving the writer.
· Use spin counts with critical sections; see the SetCriticalSectionSpinCount API introduced in Windows NT 4.0 Service Pack 3.
· If you can't afford to block, use TryEnterCriticalSection, which returns immediately instead of waiting when the lock is unavailable, leaving the thread free to do other useful work. (Several of these techniques are sketched in the code below.)
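Below is a minimal sketch of three of these techniques on Win32: an Interlocked counter, a spin count on a critical section, and TryEnterCriticalSection for opportunistic locking. The statistics structure and the work done under the lock are hypothetical:

// Sketch: three ways to lower lock contention on Win32.
// The g_stats structure and its bookkeeping are illustrative.
#include <windows.h>

static volatile LONG g_requestCount = 0;           // hot shared counter
static CRITICAL_SECTION g_statsLock;               // protects g_stats
static struct { int errors; int timeouts; } g_stats = { 0, 0 };

void InitLocks()
{
    // Spin briefly before blocking (available since NT 4.0 SP3);
    // 4000 is only a starting point, to be tuned with profiling.
    InitializeCriticalSectionAndSpinCount(&g_statsLock, 4000);
}

void OnRequest(BOOL timedOut)
{
    // 1. Atomic update: a simple counter needs no lock at all.
    InterlockedIncrement(&g_requestCount);

    // 2. Opportunistic locking: if the lock is busy, skip the
    // bookkeeping for now rather than blocking the worker thread.
    if (TryEnterCriticalSection(&g_statsLock)) {
        if (timedOut)
            ++g_stats.timeouts;
        LeaveCriticalSection(&g_statsLock);
    }
    // else: defer the update, or fold it into a per-thread tally
    // that is merged later.
}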