Designing and Optimizing Microsoft Windows CE .NET for Improved Real-Time Performance (Part 2)

Source: Internet
Author: User
Contents
  • Memory and real-time performance
  • Virtual Memory
  • Heap memory
  • Stack memory
  • Page table pool size
  • Interrupt Timing Analysis (iltiming)
  • iltiming command-line parameters
  • Measurement of external interrupt response
  • Scheduler Timing Analysis (osbench)
Memory and real-time performance

The kernel supports several types of kernel objects, such as processes, threads, critical sections, mutexes, events, and semaphores. Because the OS uses virtual memory, all kernel objects are allocated in virtual memory, and the memory for these objects is allocated on demand. Because on-demand allocation can affect performance, kernel objects should be allocated when the process starts. Note: once the kernel has allocated memory for a kernel object, it does not release that memory back to the system when the object is freed. Instead, the kernel keeps the memory in an available pool, reuses memory from the pool when needed, and allocates more only when the pool is insufficient.

There are three types of memory that can affect real-time performance: virtual memory, heap memory, and stack memory.

Virtual Memory

Windows CE is a virtual-memory-based OS that makes full use of the memory management unit (MMU). Because memory allocation and deallocation are based on virtual memory, they can affect real-time performance. The lowest level of memory management is virtual memory, accessed through the virtual memory APIs such as VirtualAlloc. During a virtual memory allocation, the kernel searches for available physical memory and maps it into the virtual address space of the process. Because virtual memory is allocated on 64-KB boundaries, the virtual memory APIs should not be used for small allocations. While servicing an allocation request, the kernel searches its physical memory pool; depending on how much memory is in use and how fragmented it is, the time this search takes varies. In a running system, all processes draw memory from the same physical memory pool.

To reduce the impact of virtual memory allocation, the process should allocate and submit all virtual memory before proceeding to normal processing.

Heap memory

The next level above virtual memory is heap memory and its related APIs. The heap APIs rely on the virtual memory APIs for their low-level services. Heap allocations are not tied to 64-KB boundaries; the heap is intended for allocations smaller than 192 KB. Any allocation of 192 KB or more causes the heap manager to go directly to virtual memory and create a separate heap for that allocation.

A default heap is created when a process is created. The application can use the process heap for its memory allocations, and the heap grows and shrinks accordingly. However, if the number and sizes of the allocations made in the default heap fragment it, performance suffers: a fragmented heap takes longer to search for space for each new allocation. To prevent fragmentation, the OEM should create separate heaps for similar objects, keeping the allocation pattern under control. If a heap contains objects of a single size, the heap manager can easily find an available block to satisfy a new allocation. Even though fragmentation can be managed with separate heaps, however, every allocation still carries memory-management overhead.

To reduce the impact of heap memory allocation, the process should allocate heap memory before proceeding to normal processing.

Stack memory

When a new thread is created, the kernel reserves memory for its stack. The amount reserved is determined by the /STACK option passed to the linker when the module is built; the default stack size for Windows CE components is 64 KB. When the thread is scheduled for the first time, its stack memory is committed one page at a time, and only as needed.

To prevent the initial stack commits from affecting performance, make sure the thread has been scheduled at least once before real-time processing begins. In the IST case this usually happens naturally, because the thread performs several operations before entering its WaitForSingleObject loop. If the thread then needs no more than its initial stack allocation, further stack commits are avoided. Because Windows CE environments typically have little memory, the kernel sometimes scavenges stack memory from threads that no longer need it. The OEM controls when this happens through the parameters passed to SetOOMEvent. If the thresholds set by SetOOMEvent trigger a scavenge of stack memory, the kernel suspends each thread that has reclaimable stack while the pages are removed. To avoid this situation, set the SetOOMEvent memory-usage parameters carefully.

Page table pool size

Windows CE .NET introduces a new OEM variable called dwOEMPTPoolSize. On x86 platforms, this variable lets the OEM define the size of the virtual memory page-table data structure the kernel uses to maintain the memory page list for each process. The default value is 16 pages and the maximum is 512 pages. Depending on the system, the OEM must increase this value until the right size is found. The size of the page-table pool affects how often page-table pages must be zeroed so they can be remapped to a different region of the same process or to a different process: when no free entry remains in the page-table structure, pages are zeroed and reused. For security reasons, a page must be zeroed before it can be assigned to a new process, and zeroing several pages can negatively affect IST latency on a real-time platform. The OEM must balance this against the cost of a larger structure, which increases average process-switch time because the kernel must reset the permissions on every page pointer in the structure. Because this variable is available only in Windows CE .NET and later, a Quick Fix Engineering (QFE) update was created for developers using Windows CE 3.0; it sets a static page-table pool size. For more information, go to the Microsoft Web site, open the readme.txt file, and search for QFE 33.

Real-time measurement tools

Windows CE 3.0 kernel updates include two kernel-level tools: the interrupt timing analysis tool (iltiming) and the scheduler timing analysis tool (osbench). They are used to test the real-time performance of the kernel and to measure specific latencies. Performance numbers are hardware-specific, depending on CPU type and speed, memory architecture, and cache organization and size.

Interrupt Timing Analysis (iltiming)

Measurement of ISR and IST latencies is combined in the iltiming test tool (formerly called intrtime), which is provided in source code and distributed with Microsoft Platform Builder. The system clock timer is used for the measurements so that iltiming can run on all hardware platforms supporting Windows CE, because some platforms do not provide a separate, unused timer.

Under normal circumstances, the system clock interrupts the kernel at a fixed interval. The system timer ISR then handles the tick and returns either SYSINTR_NOP, instructing the kernel to ignore the tick, or SYSINTR_RESCHED, waking the scheduler.

The iltiming test tool measures latency by having the system timer ISR signal a special SYSINTR_TIMING interrupt identifier every N system-clock ticks; N defaults to 5, that is, every fifth tick. The main thread of the iltiming application waits on the SYSINTR_TIMING interrupt event and thereby acts as the IST. The ISR and IST measurements are the counter values of a high-resolution timer since the last system-clock tick, derived from timestamps.

Because iltiming requires only small modifications to the OAL and none to the kernel, it can easily be adapted to run on any OEM platform.

iltiming command-line parameters

The following behavior can be changed through the iltiming command-line parameters:

  • Set the IST to run at various priorities
  • Flush or leave the cache unflushed after each interrupt
  • Change the ISR rate and the number of interrupts captured
  • Print the collected results or write them to a file

You can use the following iltiming command prompt parameters:

Usage: iltiming [ options ]
Options:
   -p num    Priority of the IST (default 0; highest)
   -ni       No idle priority thread
   -i0       No idle thread (same as -ni)
   -i1       Run idle thread type 1
   -i2       Run idle thread type 2
   -i3       Run idle thread type 3
   -i4       Run idle thread type 4
   -i5       Run idle thread type 5
   -t num    SYSINTR_TIMING interval (default 5)
   -n num    Number of interrupts (default 10)
   -all      Print all data (default: print summary only)
   -o file   Output to file (default: output to debug)
   -h        Display the help screen

The IST can run at different priority levels (-p). By default, the application flushes the cache before each measurement; an option disables this CacheSync call. The -t option sets the ISR rate: every t clock ticks, the system clock ISR returns SYSINTR_TIMING.

iltiming can also create one or more idle threads running in the background. These affect IST latency by sometimes leaving the kernel inside a non-preemptible kernel call that must complete before the IST can run. Five types of idle threads are available:

Idle thread 1: a thread that runs continuously, doing nothing

Idle thread 2: a thread that calls SetThreadPriority to run at idle priority

Idle thread 3: two threads that alternately call the SetEvent and WaitForSingleObject functions, with a 10-second timeout

Idle thread 4: two threads that alternately call the SetEvent and WaitForSingleObject functions, with an infinite timeout

Idle thread 5: a thread that calls VirtualAlloc and VirtualFree in a loop, with an infinite timeout; used to flush the cache and TLB

Measurement of external interrupt response

To evaluate a system's routine real-time performance quickly, the interrupt timing analysis tool is sufficient for determining ISR and IST interrupt latencies. This convenient method works on all supported processors, but it relies on the device's own timer, which may affect the measurement.

Therefore, a more elaborate setup can be used to measure ISR and IST latencies accurately. It involves the following two machines:

A workstation, used to generate external interrupts and measure the time taken to receive the acknowledgment signals from the ISR and IST routines.

The device under test, used to receive the external interrupts and toggle output lines when the ISR and IST routines run.

The test is executed under various levels of stress, running anywhere from one to several hundred threads of different priorities on the device under test.

A workstation based on Microsoft Windows NT 4.0, equipped with a National Instruments PC-TIO-10 digital I/O timer/counter card, generates the interrupts and times the responses; a CEPC target platform with the same card responds to the interrupts. The Windows NT software uses the driver library provided by National Instruments, while the Windows CE software was written by Microsoft.

The theory of operation is simple. The PC-TIO-10 card has two sets of timers, with five timers in each set. Each set contains one high-resolution timer (on the order of hundreds of nanoseconds), while the other timers have microsecond granularity. In addition, the card contains two groups of eight digital I/O lines, each group providing a line that can be used for edge- or level-triggered interrupts. An output line of the Windows NT-based workstation is connected to the external interrupt pin of the CEPC target platform, and output lines from the CEPC are routed back to timers on the card in the Windows NT-based workstation.

When the Windows NT-based workstation asserts an output line, it generates an interrupt on the CEPC target platform and starts the ISR and IST timers on the Windows NT card. The ISR on the target CEPC platform acknowledges receipt of the interrupt by asserting an output line on its card, which stops the ISR timer on the Windows NT-based workstation, and notifies the kernel to schedule the IST. When the IST starts running, it asserts a different output line, stopping the second timer on the Windows NT-based workstation. At that point, the workstation can read the timer counters to determine the interval between the generated interrupt and each response from the target CEPC platform. Once the Windows NT-based workstation has read the counter values, it immediately sends another interrupt, which the target CEPC platform uses to return all output lines to their standby state in preparation for another cycle.

The preliminary results collected with this measurement method were used to validate the accuracy of the iltiming test results.

Scheduler Timing Analysis (osbench)

Osbench (formerly known as cebench) is a performance tool shipped with Windows CE 3.0 and later. It tests the timing of the scheduler, focusing on the time required to perform basic kernel operations (for example, synchronization operations: how long it takes to acquire a critical section, how long it takes to schedule a thread waiting on an event just set by another thread, and so on). Where appropriate, the tests are run for two sets of metrics: intraprocess thread to thread, and interprocess thread to thread. A stress package can also be applied during the tests if appropriate.

In Windows CE, osbench collects timing samples for the following performance metrics:

Acquiring/releasing critical sections (fast and traditional)

Event set/wakeup (manual-reset and auto-reset)

Semaphores

Mutex

Voluntary yield using Sleep(0)

Distinct from the yield and synchronization cases above are the metrics for the interlocked APIs and system call overhead: interlocked increment/decrement, interlocked exchange, and the overhead of a system (PSL) API call.

The following listing shows the osbench command-line parameters and test IDs.

Usage: osbench [ options ]
Options:
   -all      Run all tests (default: run only those specified by -t option)
   -t num    ID of test to run (need separate -t for each test)
   -n num    Number of samples per test (default = 100)
   -m addr   Virtual address to write marker values to (default = )
   -list     List test IDs with descriptions
   -v        Verbose: show all measurements
   -o file   Output to CSV file (default: output only to debug)
   -h        Display the help screen

OSBench -list
TestId 0 : CriticalSections
TestId 1 : Event set-wakeup
TestId 2 : Semaphore release-acquire
TestId 3 : Mutex
TestId 4 : Voluntary yield
TestId 5 : PSL API call overhead

As with iltiming, QueryPerformanceCounter calls are used to obtain timing information. In addition to QueryPerformanceCounter, osbench can write specific marker values to a virtual address. You enable this hardware-verification feature by supplying a virtual address on the command line when osbench starts. The markers written to that address can then be monitored by an analyzer (timed with external equipment), and the results used to cross-check the QueryPerformanceCounter timing accuracy. An external measurement setup similar to the one used for interrupt latency can serve this purpose.

When using QueryPerformanceCounter to obtain timestamps, you must account for the counter frequency on the specific platform and for the overhead of the call itself, taking care to exclude the measurement overhead from the final timing numbers correctly. Before each test, the QueryPerformanceCounter call is repeated for multiple iterations and the average overhead is deducted from the final results.

When an operation completes in a very short time, the overhead of the QueryPerformanceCounter call becomes significant. In these cases, the operation is performed in a loop a fixed number of times (the iterations per sample, or IPS, stated explicitly by each test), and the result is averaged. If hardware verification is enabled, special sub-marker values are provided for these cases. A side effect of this looping is that the cache cannot be flushed between iterations of the operation. For the other tests, with IPS equal to 1, the test runs twice: once with and once without flushing the cache on each iteration.

The following listing shows sample osbench test output.

==================================================
|   1.00   |   IP =   NO   |   CS =   NO   |        1 IPS
--------------------------------------------------
Event intraprocess :
Time from SetEvent in one thread to a blocked
WaitForSingleObject() waking in another thread in the same process.
--------------------------------------------------
|   Max Time =          10.057 us
|   Min Time =           5.867 us
|   Avg Time =           6.823 us
==================================================

In test 1.00 (output shown above), the operation being timed is intraprocess event synchronization. IPS is 1; CacheSync is not called after each run (CS = NO); and the interprocess flag (IP = NO) shows that the test uses two threads in the same process rather than a second process. Unless otherwise specified on the command line, the maximum, minimum, and average of 100 operations are reported in microseconds. The basic test suite and overall layout of osbench make it easy to add new test cases and measurements that target specific kernel functions of particular interest.
