Multithreading and memory (HEAP)

Source: Internet
Author: User

Pay close attention to memory in multithreaded programs, or the program will keep hitting inexplicable errors. If you assume that something "should" not happen in a given situation, or that certain steps always run in the normal order, you are making a serious mistake, for one simple reason: you are ignoring the facts and trying to outguess the CPU!

The XX SDK's crash-and-exit problem has not been effectively solved since last year. Recently it has shown up frequently, so it was time to face it head on, and I went to XX n times over the previous two weeks. Even now I am not sure whether the problem has been solved.

SDK symptom:
My own test demo shows no problem, but the platform does.
Later tests proved that this was not because the demo was correct. The demo only ran against my own test devices. The number of devices was small and the environment was not realistic, so the real situation could not be tested effectively, which is why the demo never showed a problem.

Running the program directly produces the kind of error dialog that offers to debug the program. Running the platform under the VC debugger triggers a user breakpoint. Running the SDK itself under the VC debugger exposes the real nature of the error:
Heapdomainpingtai.exe]: Heap: Free heap block 2be3000 modified at 2be3200 after it was freed
I had never encountered this type of error before. Investigation showed that it is reported when the heap has been corrupted. Everything indicates that the content of a block of memory was modified after the block had been deleted, which corrupts the heap and triggers the error. The following program simulates it:

#include <cstring>

void fun1()
{
    char* p = new char[128];
    delete[] p;

    // Deliberate bug: writes into the block after it has been freed
    strcpy(p, "abcdefghijklmnopqrstuvwxyz");
}

During my simulation I observed the following:
1. If strcpy() copies fewer than 13 bytes, no error is reported, although this is not guaranteed.
2. No error occurs when fun1() returns. The error is reported when the program exits.
3. If any other function is called after fun1() returns, the error is reported at that point.

For the descriptions that follow, I write the error in this general form:
Heap[<EXE>]: Heap: Free heap block <addr1> modified at <addr2> after it was freed

Analyzing the error reported by the simulation program leads to the following conclusion:
The block at <addr1> has been deleted, and afterwards the content at <addr2> was modified. <addr2> lies inside the block that starts at <addr1>, laid out like this:
| ------------------------------ |
| <addr1> | <addr2> |
| ------------------------------ |
In the simulation program one would expect <addr1> = p, with <addr2> >= p and <addr2> < p + 128.
In real memory, however, <addr1> is not equal to p but slightly lower than p. It does not correspond to any memory that the program code refers to. It is the bookkeeping block created when new allocates p, and it is stored and maintained by Windows (note: I believe the address of p can be computed from it). The actual layout is:
| ------------------------------ |
| <addr1> | p |
| ------------------------------ |
<addr1> is the memory used by Windows, while p is the memory used by the program.

Now that the memory behaviour is clear and we know how this error is produced, the next step is to use <addr2> to find which part of the program has the problem, working backwards. First, use <addr2> to find the deleted memory that was written to again (entirely or in part). Then make sure the deletion and the write can no longer conflict.
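
The backwards lookup can be sketched as a small address-range log. This is only an illustration, not the code used in the SDK: `AllocationLog`, `record`, and `find_owner` are hypothetical names, and the log stands in for the printed list of addresses and lengths. It compares addresses only and never dereferences them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical helper (not from the SDK): remembers the address range of every
// allocation so an address from the heap error can be traced to its owner.
// Entries are deliberately kept after the memory is freed, because the
// address being hunted lies in a block that was already deleted.
class AllocationLog {
public:
    // Record that `size` bytes starting at `p` were allocated for `label`.
    void record(const void* p, std::size_t size, const std::string& label) {
        assert(p != nullptr && size > 0);
        blocks_[reinterpret_cast<std::uintptr_t>(p)] = Block{size, label};
    }

    // Return the label of the recorded block that contains `addr`,
    // or an empty string if no recorded block contains it.
    std::string find_owner(std::uintptr_t addr) const {
        auto it = blocks_.upper_bound(addr);  // first block starting above addr
        if (it == blocks_.begin()) return "";
        --it;                                 // last block starting at or below addr
        if (addr - it->first < it->second.size) return it->second.label;
        return "";
    }

private:
    struct Block {
        std::size_t size;
        std::string label;
    };
    std::map<std::uintptr_t, Block> blocks_;  // keyed by start address
};
```

Each allocation site would call `record(p, size, "some label")`, and `find_owner(<addr2>)` then names the allocation to inspect. The log itself is not synchronized, so in a multithreaded program the calls would need their own lock.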

The program's logical order looked fine: nothing is assigned to, or otherwise used, after it is deleted. So I started from the allocations themselves, mainly char* buffers. I printed the address and length of every new char[] and searched tens of thousands of lines of output one by one, but none of them contained <addr2>.
This was strange. Could the execution order be wrong after all? Impossible.
Next I printed the address and memory range of every class instance created with new and went through them one by one ......
It turned out to be the CAccept class! Is it still in use after being deleted? According to the memory analysis above it must be, otherwise ... So, go and find it!
Because the CAccept class is used in few places, it was easy to track down. It is used mainly in two places: the completion port's endless loop and the task check's endless loop. The conflict had to be between these two.
How the completion port loop works:
while (true)
{
    if (!getcompleteio())
        break;
    if (isaccept())
    {
        CAccept* pAcc = getaccept();  // obtain from the list
        // use pAcc, including assigning to its members
        ......
        // delete pAcc
        safe_delete(pAcc);
    }
    else
    {
        // send and receive data and handle other messages on the socket
    }
}

How the task check loop works:
while (true)
{
    wait(5000);  // wait 5 seconds before starting the task check
    lock();
    while (not_end_accept_list)
    {
        CAccept* pAcc = nextaccept();
        if (pAcc->toolongtime)
            safe_delete(pAcc);  // delete pAcc
    }
    unlock();

    // connection check
    ...
}

The helper functions for accept objects (createaccept(), getaccept(), deleteaccept(), and so on) all run inside the critical section. On careful inspection, however, the completion port's use of the CAccept object is not inside the critical section, even though getaccept() itself is, and even though the task check's work on the accept list is. The problem had to be here. To verify this, I printed the deletions in the task check and the uses in the completion port. The output showed that a CAccept obtained by the completion port was deleted by the task check loop before the completion port had finished using it!
This situation rarely occurs in normal operation. To make it show up easily, I added a delay to every operation inside the critical section, for example 50 or 100 milliseconds (previously there was no wait at all). This leaves many operations pending, and the more devices are connected, the more operations are pending. With this method some problems can be reproduced easily.
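
A delay like this could be injected with a small lock wrapper. The sketch below is an assumption about how one might do it in standard C++, not the author's code (the SDK presumably used Windows critical sections). `DelayedLock` is a hypothetical name.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical debugging aid: takes the lock like std::lock_guard, then sleeps
// while holding it, so other threads pile up behind the critical section and
// rare interleavings become much more likely.
class DelayedLock {
public:
    DelayedLock(std::mutex& m, std::chrono::milliseconds delay) : guard_(m) {
        assert(delay.count() >= 0);
        std::this_thread::sleep_for(delay);  // the lock is already held here
    }

private:
    std::lock_guard<std::mutex> guard_;      // released when DelayedLock is destroyed
};
```

In a debug build, `DelayedLock lock(list_mutex, std::chrono::milliseconds(50));` would replace the ordinary lock guard, where `list_mutex` is whatever mutex protects the list.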
Once the problem was found, it was fairly easy to fix (some parts are tricky, such as processing the socket messages in the completion port's else branch): just move the completion port's use of the CAccept object into the corresponding critical section. After the change and testing, the CAccept heap problem no longer appeared. This part of the problem is now fully solved. What a relief that was goes without saying ......
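
The shape of that fix can be sketched in standard C++. This is an illustration under assumptions, not the SDK's code: `AcceptList` and its member functions are hypothetical, the `CAccept` fields are invented stand-ins, and `std::mutex` takes the place of the critical section.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>

// Stand-in for the real CAccept class; the fields are assumptions.
struct CAccept {
    bool too_long = false;  // set when the accept has been pending too long
    int  state = 0;
};

class AcceptList {
public:
    ~AcceptList() {
        for (CAccept* a : items_) delete a;
    }

    void add(bool too_long) {
        std::lock_guard<std::mutex> guard(mutex_);
        CAccept* a = new CAccept;
        a->too_long = too_long;
        items_.push_back(a);
    }

    // Completion port side: obtain, use, and delete one object without ever
    // leaving the critical section, so the task check cannot delete it midway.
    bool use_and_delete_next() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return false;
        CAccept* a = items_.front();
        assert(a != nullptr);
        a->state = 1;        // "use pAcc", including writes
        items_.pop_front();
        delete a;
        return true;
    }

    // Task check side: delete every expired object under the same lock.
    std::size_t delete_expired() {
        std::lock_guard<std::mutex> guard(mutex_);
        std::size_t deleted = 0;
        for (auto it = items_.begin(); it != items_.end();) {
            if ((*it)->too_long) {
                delete *it;
                it = items_.erase(it);
                ++deleted;
            } else {
                ++it;
            }
        }
        return deleted;
    }

private:
    std::mutex mutex_;
    std::list<CAccept*> items_;
};
```

The completion port thread would call `use_and_delete_next()` and the task check thread `delete_expired()`. Because both hold the same lock for the whole time they touch an object, neither can free it while the other is using it. The cost is that the two loops are serialized, so the work done under the lock should stay short.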

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> >>>>
Two other problems remain: one concerns calls into the SDK from the upper layer, and the other is inside the SDK.
Both are essentially the same as the one above: something is still used after it has been deleted!
1. Upper-layer SDK calls
The SDK has notified the upper layer that a device is disconnected, but after receiving the message the upper layer still calls the corresponding interface. This seems harmless: when the upper layer calls the SDK, the SDK looks up the connection in the device list and then operates on the device through that connection. If the device has disconnected, it has already been removed from the list, so the lookup inside the API fails and an error is returned directly. That's right, isn't it? That's exactly how it ...... Stop! Think it over carefully before reading on.
======================================
Inside the SDK I cannot know whether the upper layer calls the SDK and receives its messages on one thread or on two, and the SDK does not know in what order the upper layer processes SDK messages and calls the API. One case is easy to understand. When the SDK sends many notifications, the upper layer may put them in a message queue and work through them slowly, one at a time, so the message it is processing may be 10 seconds old (an exaggeration). Suppose that message is a disconnect notification, and that 3 seconds before processing it the upper layer, still believing the device is online, issues a command to the device. The command fails, and the error returned says that the device is not online! In this case the program does not crash, because by then 7 seconds have passed, which is long enough for the SDK to have removed the connection from the connection list.
Unfortunately it does not always go this way, and something worse can happen. After the API function has obtained the connection and while it is sending the command to the device, the connection check in the task check loop may delete that connection from the connection list (for example, because no data was received for one minute). If the function that sends the command only reads the object and never assigns to it, the error should be an invalid memory read exception. Otherwise it should be the same heap error as above.
This kind of error may be easier to solve, for example by also putting the use inside the critical section. That works, but with 100 APIs you have to add the critical section 100 times. That is not the important part. More importantly, because the critical section is now taken in the outermost layer, you must be careful that the critical sections do not conflict with one another. Otherwise the program stops responding and has to be ended from Task Manager. If the API writes to the object, this may be the only solution. If it does not, things may be simpler. At the time I had only seen this error in the outer layer of the API, so I wrapped it in try... catch (). If the instance is deleted during execution, the API simply returns failure. The specific solution can differ from case to case. Here I only want to lay out my own thinking.
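
A different technique from the two described above, shown here only as a hedged alternative, is to let the list hand out reference-counted pointers, so the lock is held for the lookup but not for the whole API call. Everything in this sketch is hypothetical (`Connection`, `ConnectionTable`, `api_send_command`); it is not how the SDK was written.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for the SDK's per-device connection object.
struct Connection {
    bool send(const std::string& command) { return !command.empty(); }
};

class ConnectionTable {
public:
    void add(int device_id) {
        std::lock_guard<std::mutex> guard(mutex_);
        assert(table_.count(device_id) == 0);  // each device is listed once
        table_[device_id] = std::make_shared<Connection>();
    }

    // Called by the connection check when a device times out.
    void remove(int device_id) {
        std::lock_guard<std::mutex> guard(mutex_);
        table_.erase(device_id);  // the object is freed once its last user lets go
    }

    // Returns an owning reference, or null if the device is not in the list.
    std::shared_ptr<Connection> find(int device_id) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = table_.find(device_id);
        if (it == table_.end()) return nullptr;
        return it->second;
    }

private:
    std::mutex mutex_;
    std::map<int, std::shared_ptr<Connection>> table_;
};

// API entry point: the lock is held only for the lookup, not for the command.
bool api_send_command(ConnectionTable& table, int device_id,
                      const std::string& command) {
    std::shared_ptr<Connection> conn = table.find(device_id);
    if (!conn) return false;     // device is not online
    return conn->send(command);  // conn keeps the object alive even if remove() runs now
}
```

This only guarantees that the object's memory stays valid while an API call holds it. If several threads can use the same connection at once, the connection's own members still need their own synchronization.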
2. Another internal problem of the SDK
It is exactly the same kind of problem as with CAccept, but much more complex. It touches almost every part of the program, so it is tricky to fix. Moreover, because the SDK has to communicate with the upper layer, a careless fix causes a critical section conflict (in the flow below, everything except the upper-layer code runs inside the SDK's critical section):
[receive socket message] --> [notify upper layer] --> [upper layer calls SDK API] --> [API calls SDK internal functions]
In this flow, after the SDK notifies the upper layer it must wait for the upper layer to return before taking the next step. The upper layer, in turn, has to call an API before it can return, and the API calls internal functions that may need to enter the critical section before they can do anything. The result is a critical section conflict.
So it is dangerous to process received socket messages inside the critical section, and finding a good solution here is tricky.
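
One common way to avoid this kind of conflict is to update shared state under the lock, release the lock, and only then notify the upper layer. The sketch below shows that ordering in standard C++. It is not the author's solution: `Sdk`, `set_callback`, `api_last_message`, and `on_socket_message` are hypothetical names. `std::mutex` is not recursive, so in this sketch calling the handler while still holding the lock would block forever even on a single thread. The source does not say which threads were involved in its conflict.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <utility>

class Sdk {
public:
    using Callback = std::function<void(const std::string&)>;

    // Register the upper layer's handler (assumed to happen before messages arrive).
    void set_callback(Callback callback) {
        assert(callback != nullptr);  // an empty handler is treated as a bug
        callback_ = std::move(callback);
    }

    // API called by the upper layer; it needs the critical section.
    std::string api_last_message() {
        std::lock_guard<std::mutex> guard(mutex_);
        return last_message_;
    }

    // Called when a socket message arrives.
    void on_socket_message(const std::string& message) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            last_message_ = message;        // touch shared state under the lock
        }                                   // lock released here
        if (callback_) callback_(message);  // the handler may now call the API
    }

private:
    std::mutex mutex_;
    std::string last_message_;
    Callback callback_;
};
```

The price of this ordering is that the shared state can change between the unlock and the callback, so the handler has to tolerate slightly stale information, much like the delayed messages described in problem 1.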

For lack of time, I will leave it at that.
