A collection of Windows heap and stack knowledge

Basic heap and stack knowledge

Heap and stack are two basic concepts that C/C++ programmers inevitably encounter. Both can be found in any data structures book: both are basic data structures, although the stack is the simpler of the two. In the context of C/C++ programming, however, the two concepts are not parallel. A look at the underlying machine code reveals that the stack is a data structure provided by the machine architecture, while the heap is provided by the C/C++ runtime library.

Specifically, modern (serially executing) computers support the stack data structure directly at the instruction level: a dedicated register points to the current stack address, and dedicated machine instructions push data onto and pop data off the stack. This mechanism is highly efficient, but it supports only the data types the hardware knows about, such as integers, pointers, and floating-point numbers; it does not directly support other data structures. Because of these characteristics, the stack is used very frequently in programs. Subroutine calls are performed directly with the stack: the machine's CALL instruction implicitly pushes the return address onto the stack and then jumps to the subroutine's address, and the RET instruction at the end of the subroutine implicitly pops the return address off the stack and jumps back to it. Automatic variables in C/C++ are a direct use of the stack, which is why a function's automatic variables become invalid when the function returns.

Unlike the stack, the heap data structure is not supported by the system (neither the machine architecture nor the operating system); it is provided by the runtime library. The basic malloc/realloc/free functions maintain an internal heap data structure. When a program asks these functions for new memory, they first try to find available space in the internal heap; if there is none, they use system calls to grow the program's data segment. The newly obtained space is first organized into the internal heap and then returned to the caller in an appropriate form. When the program frees previously allocated memory, the space is returned to the internal heap structure and may be further processed (for example, merged with adjacent free space into a larger free block) so that it is better suited to the next allocation request. This fairly elaborate allocation mechanism is, in effect, a cache (buffer pool) for memory allocation. There are several reasons for such a mechanism:

1. System calls may not support allocations of arbitrary size. Some system calls support only requests of a fixed size and multiples of it (allocation by page). Serving many small requests this way would be wasteful.
2. System calls for memory may be expensive, since they can involve switching between user mode and kernel mode.
3. Unmanaged allocation easily fragments memory when many blocks of varying sizes are allocated and released.

Comparing the two: the stack is a facility provided by the system; it is fast and efficient, but limited and inflexible in the data it can hold. The heap is a facility provided by the runtime library; it is flexible, convenient, and adapts to a wide range of data, at some cost in efficiency. The stack is a system data structure and is unique to each process/thread.
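To make the lifetime difference concrete, here is a minimal C sketch (the function names are illustrative, not from any library): an automatic variable lives in the stack frame and dies when the function returns, while a malloc'd block lives in the heap until it is explicitly freed.

    #include <stdio.h>
    #include <stdlib.h>

    /* Returns the address of an automatic (stack) variable: the storage
     * is reclaimed when the function returns, so the pointer dangles. */
    int *bad_stack_pointer(void)
    {
        int local = 42;      /* lives in this function's stack frame */
        return &local;       /* WRONG: invalid once the frame is popped */
    }

    /* Returns heap storage obtained from malloc: it stays valid until
     * the caller explicitly frees it. */
    int *good_heap_pointer(void)
    {
        int *p = malloc(sizeof *p);   /* lives in the heap */
        if (p != NULL)
            *p = 42;
        return p;                     /* caller must free(p) */
    }

    int main(void)
    {
        int *hp = good_heap_pointer();
        if (hp != NULL) {
            printf("%d\n", *hp);
            free(hp);        /* precise malloc/free matching */
        }
        return 0;
    }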
The heap, by contrast, is an internal data structure of the runtime library, and a process is not limited to a single heap; memory allocated from different heaps cannot be mixed (a block must be freed back to the heap it came from).

Stack space can be allocated statically or dynamically. Static allocation is done by the compiler, as with automatic variables. Dynamic stack allocation is done by the alloca function. The stack never needs to be freed explicitly (it is released automatically), so there is no corresponding release function. For the sake of program portability, dynamic stack allocation is discouraged. Heap allocation is always dynamic: although all data space is released back to the system when the program ends, precisely matching each allocation with a release is a basic element of a good program.

Using the heap well

1. Notes on heap implementation

Traditionally, the operating system and the runtime library ship with an implementation of the heap. At the start of a process, the operating system creates a default heap called the "process heap". The process heap is used for allocating blocks if no other heap is used. A language runtime can also create separate heaps within a process (for example, the C runtime creates a heap of its own). Besides these dedicated heaps, an application, or any of the many loaded dynamic link libraries (DLLs), can create and use private heaps. Win32 provides a rich set of APIs for creating and using private heaps (see MSDN for detailed descriptions of the heap functions). When an application or DLL creates a private heap, the heap lives in the process's address space and is accessible throughout the process. Any memory allocated from a given heap must be freed back to that same heap (you cannot allocate from one heap and free to another).

In all virtual memory systems, heaps sit on top of the operating system's virtual memory manager. A language runtime's heap likewise resides on top of virtual memory. In some cases these heaps are layered over the operating system heap, while the language runtime heap manages its own memory by allocating large blocks; bypassing the operating system heap and using the virtual memory functions directly can serve some heaps better for allocating and using blocks.

A typical heap implementation consists of a front-end and a back-end allocator. The front-end allocator maintains free lists of fixed-size blocks. On an allocation call, the heap first tries to find a free block in the front-end lists; if that fails, it is forced to allocate a large block from the back end (reserving and committing virtual memory) to satisfy the request. Common implementations have a per-allocation overhead that costs execution cycles and also reduces the available storage. Knowledge Base article Q120058, "Managing Memory with calloc() and malloc()", contains more background on these topics. A detailed discussion of heap implementation and design can also be found in "Dynamic Storage Allocation: A Survey and Critical Review" by Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles, International Workshop on Memory Management, Kinross, Scotland, UK, September 1995 (http://www.cs.utexas.edu/users/oops/papers.html).

The Windows NT implementation (Windows NT version 4.0 and later) uses 127 free lists of 8-byte-aligned blocks ranging in size from 8 to 1,024 bytes, plus a "big blocks" list. The "big blocks" list (free list[0]) holds blocks larger than 1,024 bytes.
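As a sketch of the Win32 private-heap APIs mentioned above (HeapCreate, HeapAlloc, HeapFree, HeapDestroy, GetProcessHeap), the following minimal example creates a private heap, allocates and frees a block on it, and shows that each block is freed on the heap it came from. The sizes and flags here are arbitrary illustrations, not recommendations.

    #include <windows.h>

    int main(void)
    {
        /* Create a private, growable heap (64 KB initial size, no maximum). */
        HANDLE hHeap = HeapCreate(0, 64 * 1024, 0);
        if (hHeap == NULL)
            return 1;

        /* Allocate a zero-initialized block from the private heap. */
        char *block = (char *)HeapAlloc(hHeap, HEAP_ZERO_MEMORY, 256);
        if (block != NULL) {
            /* ... use the block ... */
            /* A block must be freed on the heap it was allocated from. */
            HeapFree(hHeap, 0, block);
        }

        /* The default "process heap" is always available, too. */
        HANDLE hProcHeap = GetProcessHeap();
        void *p = HeapAlloc(hProcHeap, 0, 128);
        if (p != NULL)
            HeapFree(hProcHeap, 0, p);

        HeapDestroy(hHeap);  /* releases the private heap and all its blocks */
        return 0;
    }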
The free lists hold blocks linked together in a doubly linked list. By default, the process heap performs coalescing (coalescing combines adjacent free blocks into one larger block). Coalescing costs extra cycles but reduces internal fragmentation of heap blocks. A single global lock protects the heap against multithreaded use (see the first killer in "Server Performance and Scalability Killers" by George Reilly on the MSDN Online Web Workshop site: http://msdn.microsoft.com/workshop/server/iis/tencom.asp). This single global lock essentially serializes access to the heap data structure against arbitrary access from multiple threads; if heap operations are frequent, the single global lock badly hurts performance.

2. What are the common heap performance problems?

The following are the most common problems you run into when using a heap:

Slowdown as a result of allocation. Plain allocation simply takes too long. The most likely cause is that the free lists hold no suitable blocks, so the allocator code spends a long time at run time looking for larger free blocks, or allocating fresh blocks from the back-end allocator.

Slowdown as a result of freeing. Free operations take many cycles, mainly when coalescing is enabled. During coalescing, each free operation "finds" its neighboring blocks, pulls them out, constructs a larger block, and then reinserts the larger block into the free list. During that search, memory is touched at random, causing cache misses and degrading performance.

Slowdown as a result of heap contention. Contention occurs when two or more threads try to access data at the same time and one must wait for the other to finish before proceeding. Contention is always trouble; it is also the biggest problem on today's multiprocessor systems. When applications or DLLs that use a lot of memory blocks run with multiple threads (or on a multiprocessor system), everything slows down. The use of a single lock, the common solution, means all heap operations are serialized; while waiting for the lock, serialization causes threads to context-switch. Picture the slowdown caused by stop-and-go traffic at a red light. Contention usually causes context switches between threads and processes. Context switches are expensive, but more expensive still is losing the data from the processor cache and rebuilding that data when the thread is later revived.

Slowdown as a result of heap corruption. Corruption is caused by an application using heap blocks incorrectly. Typical problems include freeing an already freed block, using a freed block, and writing past a block's boundary. (Corruption is beyond the scope of this article; see the Microsoft Visual C++(R) debugging documentation for details on memory overwrites and leaks.)

Slowdown as a result of frequent allocations and reallocations. This is very common in scripting engines: strings are repeatedly allocated, grown by reallocation, and released. Do not do this; if possible, allocate large strings and use a buffer (a sketch follows shortly). Another alternative is to minimize concatenation. Contention is what slows down both allocation and freeing; ideally, you want a heap with no contention and fast allocation/free.
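Here is one possible shape of the "use a buffer" advice above, written as a minimal C sketch (the strbuf type and function names are hypothetical, not from the article): the buffer doubles its capacity geometrically, so appending many pieces costs a handful of reallocations instead of one per append.

    #include <stdlib.h>
    #include <string.h>

    /* A growable string that doubles its capacity instead of
     * reallocating on every append, reducing heap traffic. */
    typedef struct {
        char  *data;
        size_t len;
        size_t cap;
    } strbuf;

    static int strbuf_init(strbuf *sb, size_t initial_cap)
    {
        if (initial_cap == 0)
            initial_cap = 16;
        sb->data = malloc(initial_cap);
        if (sb->data == NULL)
            return -1;
        sb->data[0] = '\0';
        sb->len = 0;
        sb->cap = initial_cap;
        return 0;
    }

    static int strbuf_append(strbuf *sb, const char *s)
    {
        size_t n = strlen(s);
        if (sb->len + n + 1 > sb->cap) {
            size_t newcap = sb->cap;
            while (sb->len + n + 1 > newcap)
                newcap *= 2;              /* geometric growth */
            char *nd = realloc(sb->data, newcap);
            if (nd == NULL)
                return -1;
            sb->data = nd;
            sb->cap  = newcap;
        }
        memcpy(sb->data + sb->len, s, n + 1);  /* copies terminator too */
        sb->len += n;
        return 0;
    }

    static void strbuf_free(strbuf *sb)
    {
        free(sb->data);
        sb->data = NULL;
        sb->len = sb->cap = 0;
    }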
Unfortunately, no such general-purpose heap exists yet, though perhaps it will in the future. In all server systems (such as IIS, MSProxy, database stacks, network servers, Exchange, and others), the heap lock is a big bottleneck; the more processors, the worse the contention.

Trim your heap usage. Now that you understand the problems of using a heap, don't you wish you had a magic wand that made them all go away? So do I. But there is no magic that speeds up the heap, so don't expect a dramatic change in the last week before you ship. If you plan your heap strategy early, however, the situation improves greatly. Adjusting how you use the heap, and reducing the number of heap operations, is a sound way to improve performance.

How can you reduce the number of heap operations? You can exploit locality within your data structures to reduce heap operations. Consider the following example:

    struct objectA {
        // objectA data
    };

    struct objectB {
        // objectB data
    };

    // objectA and objectB are used together.

    //
    // Option 1: use a pointer
    //
    struct objectB {
        struct objectA *pObjA;
        // objectB data
    };

    //
    // Option 2: embed
    //
    struct objectB {
        struct objectA objA;
        // objectB data
    };

    //
    // Option 3: aggregate - use objectA and objectB inside another object
    //
    struct objectX {
        struct objectA objA;
        struct objectB objB;
    };

Avoid linking two data structures with pointers. With pointers, objects A and B in the example above are allocated and freed separately, which adds overhead; avoid this approach.

Embed pointed-to child objects within parent objects. A pointer inside an object means a dynamically allocated element and a new location to dereference. Embedding the child object increases locality and reduces the need for further allocation and freeing, which improves application performance.

Merge small objects into larger ones (aggregation). Aggregation reduces the number of blocks allocated and freed. If several developers each design different parts of a system, you end up with many small objects that could be merged. The challenge in integrating them is finding the right aggregation boundaries.

Use inline buffers that satisfy 80 percent of needs (the 80-20 rule). In some cases a memory buffer is needed to hold string or binary data, and the total number of bytes is not known in advance. Estimate a size and inline a buffer that satisfies 80 percent of needs; for the remaining 20 percent, allocate a new buffer and keep a pointer to it. This reduces allocation and free calls and increases data locality, fundamentally improving the code's performance.

Allocate objects in chunks (chunking). Chunking allocates several objects at a time as a group. If you keep track of items in a list, for example a list of {name, value} pairs, you have two options: option one allocates one node per name-value pair; option two allocates a structure that holds, say, five pairs. If, for example, four pairs are typically stored, you reduce the number of nodes, and if additional space is needed, an extra linked-list pointer takes over. Chunking is processor-cache friendly, especially for the L1 cache, because it provides increased locality; not to mention that with chunk allocation, several data blocks end up on the same virtual page. A sketch of the chunking idea follows.
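The chunking option can be sketched in C as follows (PAIRS_PER_CHUNK, the struct names, and the helper function are illustrative assumptions, not from the article): each heap block holds up to five name-value pairs plus one link, so a new allocation occurs only once per five insertions.

    #include <stdlib.h>

    /* Chunked list: each heap block holds up to five name-value pairs
     * plus one link, instead of one heap block per pair. */
    #define PAIRS_PER_CHUNK 5

    struct pair {
        const char *name;
        const char *value;
    };

    struct pair_chunk {
        struct pair        pairs[PAIRS_PER_CHUNK];
        int                used;    /* pairs occupied in this chunk */
        struct pair_chunk *next;    /* extra chunk only when needed */
    };

    /* Appends a pair, allocating a new chunk only once every
     * PAIRS_PER_CHUNK inserts. */
    static int pair_list_add(struct pair_chunk **head,
                             const char *name, const char *value)
    {
        struct pair_chunk *c = *head;
        if (c == NULL || c->used == PAIRS_PER_CHUNK) {
            c = calloc(1, sizeof *c);
            if (c == NULL)
                return -1;
            c->next = *head;
            *head = c;
        }
        c->pairs[c->used].name  = name;
        c->pairs[c->used].value = value;
        c->used++;
        return 0;
    }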
Use _amblksiz correctly. The C runtime (CRT) has its own custom front-end allocator, which allocates blocks in chunks of _amblksiz from the back end (the Win32 heap). Setting _amblksiz to a higher value can potentially reduce the number of calls to the back end. This applies only to programs that use the CRT extensively.

The benefit of the techniques above varies with object type, size, and workload, but gains in performance and scalability can always be had. On the other hand, the code becomes somewhat specialized, though with careful thought it remains manageable.

3. Other techniques that improve performance

Use the Windows NT 5 heap. Thanks to the effort and hard work of several colleagues starting in early 1998, Microsoft Windows(R) 2000 contains several major improvements:

Improved locking inside the heap code. The heap code uses one lock per heap. This global lock protects the heap data structure against multithreaded use. Unfortunately, under heavy traffic the heap can still get bogged down in this global lock, causing high contention and poor performance. In Windows 2000, the critical region inside the locking code has been minimized, which reduces the probability of contention and improves scalability.

Use of "lookaside" lists. The heap data structure uses a fast cache for all free items of block sizes from 8 through 1,024 bytes (in 8-byte increments). The fast cache was originally protected by the global lock. Now, lookaside lists are used to access these fast-cache free lists. These lists require no locking; they use 64-bit interlocked operations instead, improving performance.

The internal data structure algorithms have been improved as well. These improvements make a separate allocation cache unnecessary in some cases, but they do not preclude other optimizations. Evaluate your code with the Windows NT 5 heap; it is best for blocks smaller than 1,024 bytes (1 KB), that is, blocks served by the front-end allocator. GlobalAlloc() and LocalAlloc() are built on the same heap and are generic mechanisms for accessing the per-process heap. For high localized performance, use the Heap* APIs to access the per-process heap, or create your own heaps for your allocation operations. You can also use VirtualAlloc()/VirtualFree() directly for large-block operations.

These improvements shipped in Windows 2000 Beta 2 and Windows NT 4.0 SP4. After the improvements, the heap lock contention rate dropped significantly. This benefits all direct users of Win32 heaps. The CRT heap is built on top of the Win32 heap, but it uses its own small-block heap and so does not benefit from the Windows NT improvements. (Visual C++ version 6.0 also has an improved heap allocator.)

Use an allocation cache. An allocation cache holds allocated blocks for future reuse. This reduces the number of allocate/free calls against the process heap (or the global heap), and lets blocks that have been allocated once be reused to the maximum extent. In addition, an allocation cache allows statistics gathering, so you can understand object usage at a higher level.

Typically, a custom heap allocator is implemented on top of the process heap. A custom heap allocator behaves much like the system heap; the main difference is that it provides a cache on top of the process heap for allocated objects. The caches are designed for a fixed set of sizes (such as 32 bytes, 64 bytes, and 128 bytes). This is a good strategy, but such a custom heap allocator loses the "semantic information" about the objects being allocated and freed.
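A minimal sketch of such a fixed-size-class cache, layered here over plain malloc/free rather than any specific Win32 heap (all names are hypothetical, and thread safety is omitted for brevity): freed blocks of 32, 64, or 128 bytes are kept on per-size free lists for reuse instead of going back to the process heap.

    #include <stdlib.h>

    #define NUM_CLASSES 3
    static const size_t class_size[NUM_CLASSES] = { 32, 64, 128 };
    /* Each freed block stores a next pointer in its own first bytes,
     * which is safe because every class is at least pointer-sized. */
    static void *free_list[NUM_CLASSES];

    static int size_class(size_t n)
    {
        int i;
        for (i = 0; i < NUM_CLASSES; i++)
            if (n <= class_size[i])
                return i;
        return -1;                      /* too large: fall through to the heap */
    }

    void *cache_alloc(size_t n)
    {
        int c = size_class(n);
        if (c >= 0 && free_list[c] != NULL) {      /* reuse a cached block */
            void *blk = free_list[c];
            free_list[c] = *(void **)blk;
            return blk;
        }
        /* Back end: the process heap, always rounded up to the class size. */
        return malloc(c >= 0 ? class_size[c] : n);
    }

    void cache_free(void *blk, size_t n)
    {
        int c = size_class(n);          /* n must match the cache_alloc call */
        if (c >= 0) {                   /* keep the block for reuse */
            *(void **)blk = free_list[c];
            free_list[c] = blk;
        } else {
            free(blk);
        }
    }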
In contrast with a custom heap allocator, "allocation caches" are implemented as one allocation cache per class. Besides providing all the benefits of a custom heap allocator, they retain rich semantic information. Each allocation cache handler is associated with one target binary object and can be initialized with a set of parameters indicating the concurrency level, the object size, the number of elements to keep in the free list, and so on. An allocation cache handler object maintains its own private pool of freed entities (never exceeding the specified threshold) and uses its own private guarding lock. Together, the allocation cache and the private lock reduce the traffic to the main system heap, giving increased concurrency, maximal reuse, and higher scalability. (A sketch of such a handler appears at the end of this section.)

A scavenger routine is needed to periodically check the activity of all allocation cache handlers and reclaim unused resources. If no activity is found, the pool of allocated objects is released, which improves performance.

Each allocation/free activity can be audited. The first level of information includes the total counts of objects, allocation calls, and free calls. You can examine the statistics to derive the semantic relationships among objects; such relationships can then be used to reduce memory allocation with one of the techniques described above. An allocation cache also acts as a debugging aid, helping you track down objects that were not cleaned up completely. By looking at the dynamic stack back-traces and the signatures kept beside the objects not cleaned up, you can even find the exact offending callers.

Use the MP heap. The MP heap is a multiprocessor-friendly distributed allocation package available in the Win32 SDK (Windows NT 4.0 and later). Originally implemented by JVert, this heap abstraction is built on top of the Win32 heap package. The MP heap creates several Win32 heaps and tries to distribute allocation calls across the heaps to reduce contention on any single lock. This package is a good step: an improved, MP-friendly custom heap allocator. However, it provides no semantic information and lacks statistics features. The MP heap is typically used as an SDK library. You gain greatly if you build a reusable component with this SDK; if, however, you build the SDK library into every DLL, you increase your working set.

Rethink algorithms and data structures. To scale on multiprocessor machines, algorithms, implementations, data structures, and hardware must all scale dynamically. Look at the data structures that are allocated and freed most often and ask, "Can I get this done with a different data structure?" For example, if a list of read-only items is loaded at application initialization, the list need not be a linearly linked list; a dynamically allocated array does fine. A dynamically allocated array reduces the number of heap blocks and the fragmentation in memory, thereby enhancing performance.

Reduce the number of small objects you need, to lighten the load on the heap allocator. For example, we used five distinct objects in the critical processing path of our server, each allocated and freed separately. Caching the objects together reduced the heap calls from five to one, significantly decreasing heap load, especially when handling more than 1,000 requests per second.
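Here is a minimal sketch of a per-type allocation cache handler in the spirit described above, using a private CRITICAL_SECTION as the guarding lock and a bounded free list (the request type, the field names, and the threshold parameter are illustrative assumptions, and the scavenger and statistics are omitted):

    #include <windows.h>
    #include <stdlib.h>

    typedef struct request request_t;     /* hypothetical server object */
    struct request {
        request_t *next_free;             /* link used only while cached */
        char payload[240];
    };

    typedef struct {
        CRITICAL_SECTION lock;            /* private lock: no global heap lock */
        request_t       *free_head;
        int              free_count;
        int              max_free;        /* threshold a scavenger would enforce */
    } request_cache_t;

    void request_cache_init(request_cache_t *c, int max_free)
    {
        InitializeCriticalSection(&c->lock);
        c->free_head  = NULL;
        c->free_count = 0;
        c->max_free   = max_free;
    }

    request_t *request_alloc(request_cache_t *c)
    {
        request_t *r = NULL;
        EnterCriticalSection(&c->lock);
        if (c->free_head != NULL) {       /* reuse: no trip to the system heap */
            r = c->free_head;
            c->free_head = r->next_free;
            c->free_count--;
        }
        LeaveCriticalSection(&c->lock);
        return r != NULL ? r : (request_t *)malloc(sizeof *r);
    }

    void request_release(request_cache_t *c, request_t *r)
    {
        EnterCriticalSection(&c->lock);
        if (c->free_count < c->max_free) { /* keep up to the threshold */
            r->next_free = c->free_head;
            c->free_head = r;
            c->free_count++;
            r = NULL;
        }
        LeaveCriticalSection(&c->lock);
        if (r != NULL)                     /* over threshold: back to the heap */
            free(r);
    }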
If you use "Automation" structures heavily, consider removing "Automation BSTRs" from your mainline code, or at least avoid repeated BSTR operations. (BSTR concatenation leads to excessive reallocation and allocation/free operations; a sketch of the alternative appears at the end of this article.)

Summary: heap implementations tend to be expensive on all platforms. Each piece of code has its own specific requirements, but every design can apply the basic ideas discussed in this article to reduce heap interaction:

1. Evaluate the heap usage in your code.
2. Improve your code to use fewer heap calls: analyze the critical paths and fix the data structures.
3. Quantify the cost of heap calls with measurements before implementing a custom wrapper.
4. If you are dissatisfied with the performance, ask the OS group to improve the heap. More such requests mean more attention to improving the heap.
5. Ask the C runtime group to provide a thin allocation wrapper over the heap supplied by the OS. The cost of CRT heap calls will then drop as the OS heap improves.

The operating system (the Windows NT family) improves the heap continuously; watch for and take advantage of these improvements.
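As referenced above, here is a sketch of one way to avoid repeated BSTR concatenation: compute the total length first, allocate the BSTR once with SysAllocStringLen, and fill it in place. The string fragments and the function name are illustrative assumptions.

    #include <windows.h>
    #include <oleauto.h>
    #include <wchar.h>
    #include <string.h>

    /* Builds one BSTR from several fragments with a single allocation,
     * instead of one reallocation per concatenation. */
    int build_bstr(BSTR *out)
    {
        const OLECHAR *parts[] = { L"alpha ", L"beta ", L"gamma" };
        UINT total = 0, pos = 0, i;

        for (i = 0; i < 3; i++)
            total += (UINT)wcslen(parts[i]);

        /* One allocation, sized up front; passing NULL leaves the
         * buffer uninitialized but sets the BSTR length prefix. */
        *out = SysAllocStringLen(NULL, total);
        if (*out == NULL)
            return -1;

        for (i = 0; i < 3; i++) {
            UINT n = (UINT)wcslen(parts[i]);
            memcpy(*out + pos, parts[i], n * sizeof(OLECHAR));
            pos += n;
        }
        (*out)[total] = L'\0';   /* terminate explicitly; caller uses SysFreeString */
        return 0;
    }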
