The past week has been busy. The main task was to build a "keyboard wizard": in short, load a large amount of data into memory for fast retrieval, then find the 10 records that best match the user's input and display them. It is similar to the following feature found in stock-trading software:
The data lives in a text file, and the volume is large: nearly 200,000 records, each with several delimiter-separated fields. With a test set of 60,000 records, the text file is nearly 10 MB; once the module loads it into memory and builds its caches, it occupies roughly 70-80 MB. After I took over the module, my main task was to reduce memory consumption and improve matching efficiency.
I. Avoid creating unnecessary objects
After getting the code, the first step was to read the design document, then step through the code with breakpoints to understand the logic. A few problems stood out. The previous code roughly processed a query as follows:
1. Read the file into memory and instantiate the records
2. Retrieve the records matching the condition and store them in result set 1
3. Compute the match degree for everything in result set 1 and store the results in result set 2
4. Sort result set 2 by match degree, take the 10 best-matched records, and return them
The process itself is reasonable, but it has serious problems. The biggest is that temporary variables hold far too many intermediate results, and all of those objects are discarded as soon as a query completes, so the flood of short-lived objects puts heavy pressure on the GC. For example, when the user types 1 into the input box, and assuming Contains is used for matching, more than 40,000 of the 60,000 records may contain a 1. Those 40,000+ records are stored in a temporary variable, their match degrees are computed, and the results are stored in something like a collection of KeyValuePairs keyed by match degree; the collection is then sorted by key and the 10 best records are taken. A large number of temporary objects are created along the way, memory spikes, the objects are reclaimed almost immediately, and the GC is stressed.
The design document only requires returning the 10 best-matched records, and the previous solution did not take advantage of that. So the first step after taking over was to streamline the process as follows:
1. Read the file into memory and instantiate the records
2. For each record that matches the condition:
   - Calculate its match degree
   - Insert it, keyed by match degree, into a SortedList with a capacity of only 11
   - If the SortedList holds more than 10 entries after the insert, remove the last element, so it always keeps the 10 smallest (best-matched) records
3. After the traversal completes, return the collection object
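The streamlined process above can be sketched roughly as follows. This is a minimal C++ analogue of the .NET SortedList approach, and `Score()` is a hypothetical match function (smaller value = better match), since the real scoring logic is not published:

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical match score: position of the query inside the code field,
// or -1 when the query does not appear at all. Smaller = better match.
int Score(const std::string& code, const std::string& query) {
    std::size_t pos = code.find(query);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}

// Keep only the 10 best matches while scanning all records once.
std::vector<std::string> FindBest(const std::vector<std::string>& codes,
                                  const std::string& query) {
    const std::size_t kCapacity = 10;
    // Key (score, sequence) keeps the map ordered best-first and makes
    // equal scores distinct; the value is the matching record.
    std::map<std::pair<int, int>, std::string> best;
    int seq = 0;
    for (const std::string& code : codes) {
        int score = Score(code, query);
        if (score < 0) continue;                // no match at all
        best.emplace(std::make_pair(score, seq++), code);
        if (best.size() > kCapacity)
            best.erase(std::prev(best.end())); // drop the current worst
    }
    std::vector<std::string> result;
    for (const auto& kv : best) result.push_back(kv.second);
    return result;
}
```

Only the small fixed-capacity container ever holds intermediate results, so no large temporary collections are created per query.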
After this modification, far less temporary data lives in memory. Throughout the whole process I only use a SortedList with a capacity of 11 to hold intermediate results: each time an element is inserted, the SortedList keeps the entries ordered, and the worst match, the last element, is removed (the sort is ascending, and the better the match, the smaller its key). The main cost is the SortedList's insertion, internal sorting, and removal.

On the choice between SortedList and SortedDictionary: I looked into it, and SortedDictionary is implemented internally as a red-black tree, while SortedList uses an ordered array. Lookup is O(log n) for both, but SortedDictionary's O(log n) insertion and removal beat SortedList's, at the cost of more memory. It is essentially a trade-off between query speed and memory allocation; since only 11 objects are stored here, the difference between the two is negligible. In fact neither structure is strictly necessary: any collection would do, adding an element, sorting, and removing the largest each time. .NET just makes this easy because so many powerful data structures are built in.
After this small modification, the memory footprint was cut roughly in half, from the original 70-80 MB down to 30-40 MB. This is the most basic principle of reducing memory cost: avoid creating unnecessary objects.
II. Optimize data types and algorithms
Pushing memory lower from there gets harder and harder. Looking further at the code revealed other problems, such as a large number of objects instantiated into memory at startup and kept alive. Each record carries many fields, but only the four shown below are actually useful for search matching; instantiating the whole record also deserializes the unused fields, so a lot of memory is occupied by useless data.
Stock code | Chinese name | Pinyin | Market type | ...
600000 | Pudong Development Bank | pfyh | Shanghai A-share | ...
So the first step was to keep only the four key searchable fields in memory, storing each record as a string[] rather than a class or other structure. I also tried a struct, but with four string fields, a large data volume, and records passed around as parameters, it ended up costing more than a class, so I settled on a plain array.
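The idea of keeping only the four searchable fields can be sketched like this (a C++ analogue of the string[] approach; the field order and `|` delimiter are assumptions for illustration):

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Parse one delimited line but keep only the first four fields
// (code, Chinese name, pinyin, market type); the rest of the line
// is never materialized into objects.
std::array<std::string, 4> ParseRecord(const std::string& line, char delim) {
    std::array<std::string, 4> fields;
    std::istringstream in(line);
    std::string field;
    for (std::size_t i = 0;
         i < fields.size() && std::getline(in, field, delim); ++i)
        fields[i] = field;
    return fields;
}
```

Everything after the fourth delimiter is simply ignored, so the unused fields never occupy long-lived memory.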
Beyond that, to improve search efficiency the original code also partitioned the data into cached blocks keyed by the leading character, 0-9 and a-z, so that when the user types 0, data is read directly from the 0 block. That makes lookups faster, but the large cache also increases memory consumption: the cached data is roughly as large as the raw data loaded into memory. The search itself was also exhaustive: for 170,000 records with four fields each, every query performed 170,000 x 4 traversal comparisons to find the 10 best matches.
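The first-character bucketing can be sketched like this (a C++ analogue; note that the bucket map holds a second copy of the data, which is exactly the memory cost described above):

```cpp
#include <map>
#include <string>
#include <vector>

// Group records by their leading character so a query beginning with
// '6' only has to scan the '6' bucket instead of the whole data set.
std::map<char, std::vector<std::string>> BuildBuckets(
        const std::vector<std::string>& codes) {
    std::map<char, std::vector<std::string>> buckets;
    for (const std::string& code : codes)
        if (!code.empty())
            buckets[code[0]].push_back(code);
    return buckets;
}
```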
To address this I introduced an early-terminating search: the securities are classified in advance by type (stocks, funds, bonds, and so on), and each category is sorted by security code. When the user has set a search priority, the categories are searched in turn, and as soon as 10 records meeting the criteria are found, the search returns immediately. Because the data is pre-sorted by security type and code, no better match can appear later in the scan, which directly improves query efficiency. Searching ordered data is generally more efficient than searching unordered data; common algorithms such as binary search require the collection to be sorted in the first place.
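A minimal sketch of that early-exit search, under the assumption that matching is a simple code-prefix test (the real match criteria are not published):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One category of securities (stocks, funds, bonds, ...), with its
// codes sorted in advance.
struct Category {
    std::string name;
    std::vector<std::string> codes;
};

// Scan the categories in the user's priority order and stop as soon as
// `limit` matching records have been collected; because the data is
// pre-sorted by type and code, nothing found later can rank higher.
std::vector<std::string> Search(const std::vector<Category>& byPriority,
                                const std::string& prefix,
                                std::size_t limit = 10) {
    std::vector<std::string> hits;
    for (const Category& cat : byPriority) {
        for (const std::string& code : cat.codes) {
            if (code.compare(0, prefix.size(), prefix) == 0)
                hits.push_back(code);
            if (hits.size() == limit)
                return hits;                    // early exit
        }
    }
    return hits;
}
```

Since each category is already sorted, the inner linear scan could also be replaced with `std::lower_bound` to jump straight to the first code with the given prefix.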
III. Write the data-processing logic in unmanaged code
The two changes above reduced the memory footprint by nearly 50-60%, but still not enough to satisfy the requirements. I then measured and compared the memory footprint of loading the data with different representations: reading the file directly into strings, arrays, structs, and classes. Reading the file straight into a string consumed the least, yet a 10 MB data file still occupied 20-30 MB in memory, before even counting the temporary variables used during processing. Checking with tools such as dotTrace and CLR Profiler confirmed that the bulk of the footprint really was the raw data. Searching the web for "how to reduce the memory usage of .NET applications" turned up this answer on StackOverflow:
The answerer pointed out that a .NET application can have a larger memory footprint than an equivalent program written in native code, and that if memory overhead is your main concern, .NET may not be the best choice. Part of a .NET application's memory behavior comes from garbage collection, and some data structures, such as List, allocate redundant spare capacity. The suggestions for reducing the footprint included using value types instead of reference types where appropriate and avoiding large objects to prevent memory fragmentation.
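The "redundant space" point is easy to demonstrate. The sketch below uses `std::vector`, but .NET's List&lt;T&gt; over-allocates the same way when it grows; on typical implementations the blindly grown container ends up with spare capacity, while reserving the known size up front leaves none:

```cpp
#include <cstddef>
#include <vector>

// Slack (unused capacity) left after building a vector of n ints,
// either by blind repeated push_back or after an up-front reserve.
std::size_t SlackAfterBuild(std::size_t n, bool reserveFirst) {
    std::vector<int> v;
    if (reserveFirst)
        v.reserve(n);                 // one exact allocation up front
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i));
    return v.capacity() - v.size();   // redundant space the vector holds
}
```

In .NET the equivalent is passing a capacity to the List&lt;T&gt; constructor when the final size is known.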
Even after applying all of this, memory was still above the requirement, so I started looking for ways to call unmanaged code, which allows memory allocation and destruction to be controlled more flexibly. The whole program is written in .NET, and rewriting it all in C or C++ was unrealistic, so there were only two options: use unsafe code, or implement the data loading and retrieval module in C or C++ and call it from .NET via P/Invoke.
At first I used unsafe code, putting the data loading and retrieval directly inside unsafe blocks. But the result was messy; mixing different styles of code in one place was ugly, and the loading and retrieval logic was fairly complex. So I switched to the second approach: write the data loading and retrieval logic in C++ and call it from .NET.
Before starting, I ran some measurements, such as loading the same 10 MB of data into memory and storing it as strings. In .NET it occupied 20-30 MB, while in C++ it took only about 9-10 MB, with very little fluctuation. That was the result I needed.
Not being familiar with C++, I crammed, reading the chapters on strings and the STL in C++ Primer Plus, and asked another development team for help defining the basic interface. For the demonstration I created two projects: a C++ Win32 DLL project named SecuData, and a C# WinForms program to test the library, named SecuDataTest.
I defined four methods in C++: one to initialize and load the data, one to set the search priority, one to find matches, and one to unload the data. The actual algorithm cannot be posted for work reasons; this is just a simple example. The method names and project structure are shown below:
Then the methods defined in the C++ DLL are imported into .NET using P/Invoke.
With that, the methods can be called from .NET. Note that the input value here is passed as a string, while the second parameter, a StringBuilder, is the method's real return value; the method's int return value only indicates whether it executed successfully. When calling the lookup method, the StringBuilder parameter must be initialized to the size of the largest possible query result, because C++ writes the result into this object, and if it is uninitialized or too small an exception is thrown. Of course, a struct could be returned directly instead, but that requires an extra definition, so I simply return a string and parse it on the .NET side.
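To make the buffer contract concrete, here is a compilable stub of what the exported lookup function might look like on the C++ side. The name and signature are assumptions (the real code cannot be posted); the stub only illustrates the caller-allocated buffer that the pre-sized StringBuilder maps to:

```cpp
#include <cstdio>
#include <cstring>

// Export macro: dllexport on Windows, plain extern "C" elsewhere.
// extern "C" prevents C++ name mangling so P/Invoke can find the entry
// point by name.
#if defined(_WIN32)
#define SECUDATA_API extern "C" __declspec(dllexport)
#else
#define SECUDATA_API extern "C"
#endif

// The caller allocates `result` (resultSize bytes); the function writes
// the matched records into it and returns 0 on success, or -1 when the
// buffer is missing or too small.
SECUDATA_API int FindMatch(const char* input, char* result, int resultSize) {
    if (input == nullptr || result == nullptr) return -1;
    // A real implementation would search the loaded data; this stub only
    // demonstrates the buffer contract with a placeholder payload.
    int needed = std::snprintf(nullptr, 0, "match-for:%s", input);
    if (needed < 0 || needed >= resultSize) return -1;  // would overflow
    std::snprintf(result, resultSize, "match-for:%s", input);
    return 0;
}
```

On the .NET side this is exactly why the StringBuilder must be constructed with a sufficient capacity before the call: the native function has no way to grow the buffer itself.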
Note that to debug the C++ code, you need to specify the DLL's output directory and the debug target, and set the C++ project as the startup project. I set the DLL's output directory to the directory containing the SecuDataTest.exe generated by the SecuDataTest project, and set the debugging target to SecuDataTest.exe. With breakpoints set in the C++ project, starting the .NET WinForms program lets you step through the C++ code as soon as a P/Invoke call hits a breakpoint.
For release, it is best to change the default runtime library setting from dynamic to static, so that VS links the dependent C++ runtime libraries into the generated DLL and deployment to client machines causes no problems. The properties of the SecuData project are set as in the following figure:
After switching to this P/Invoke approach, loading the 10 MB of data into memory costs only about 10 MB, a big reduction from the earlier 30-40 MB under .NET, and memory fluctuation is small. The memory footprint requirement is finally met.
This "mix-and-match" approach has real benefits: you keep .NET's rapid development while gaining C++'s flexible manual allocation and destruction of memory, plus some code protection. In many cases, memory-sensitive, data-heavy processing logic can be handled in C++, using flexible manual memory management to reduce the footprint; writing the core data structures and algorithms in C++ also improves code security by making the program harder to decompile.
IV. Conclusion
.NET applications need to load the CLR and some common class libraries, and they carry a garbage collection mechanism, so their footprint is larger than that of native languages such as C and C++. Creating an empty WinForms application in .NET can already occupy nearly 10 MB of memory, and the footprint grows as development continues. Much of the excess, though, comes from developers not understanding .NET's underlying mechanisms: using reference types where value types would do, creating large numbers of short-lived temporary objects, keeping too many static variables and members alive so memory is occupied long-term and never reclaimed, and overlooking internals such as the extra capacity that collection objects allocate. Because .NET's GC frees us from worrying about destroying objects, we tend to create new objects "generously" and reach for heavyweight built-in types, resulting in an oversized footprint. Addressing these habits can eliminate a significant portion of a .NET application's unnecessary memory usage.
Beyond understanding the .NET Framework itself, good ideas and efficient data structures and algorithms can also simplify the problem and reduce memory overhead.
Finally, for modules with particularly tight memory requirements, you can write them in a language with flexible manual memory management and call them from .NET via P/Invoke to save further memory.
That is my small bit of practice and summary on reducing the memory footprint of .NET applications; I hope it helps.