Anyone who programs is no stranger to string processing. Neither low-level nor high-level languages can do without it, and the common development tools give as much help as they can for character operations: Standard C has its string functions (strcpy, strcmp, strcat, ...), the C++ Standard Library provides the more powerful string, and the CString in Microsoft's MFC is more powerful still. Is there any reason left to write your own string class? Happy programmers are all alike: they can pick from the existing class libraries. An old programmer like me is less fortunate, because I have too many reasons to write my own string class. On the surface, it is for the sake of ultimate optimization. In fact, it is because I am poor and have no money to buy more and better servers.
First, let me explain the background. At some point I suddenly developed a deep interest in search engine technology. The passion was hard to suppress, so I reported my thoughts to my wife and strongly requested financial support. In my wife's eyes I have never had any really bad hobbies, so this time she was fairly generous and granted me the "huge sum" of 10,000 yuan. The 10,000 yuan had to cover both buying the server and hosting it. Hosting costs more than 7,000 yuan a year, which leaves only a little over 2,000 yuan for the server itself (I am not sure a server cheaper than an ordinary PC can still be called a server).
A complete search engine needs several major modules: data crawling, data analysis, Chinese word segmentation, index creation, deduplication, automatic clustering, large-scale data storage, data compression and decompression, query and ranking, web page generation, a web server, and so on. If you have no feel for the computing workload, I can tell you that one company needs two servers running for three days to automatically cluster a few hundred thousand products. I have to do all of this work in real time on one low-end "shanzhai server". That is like stuffing an elephant into a refrigerator. Whatever Yuan Fang thinks of it, I am afraid the conventional approach will not work; only optimization, and more optimization, can keep the project alive. Across the whole search engine, crawling, analysis, word segmentation, web page assembly, and so on are all operations on strings. It is fair to say that more than 90% of the time is spent on string operations, so apart from the framework design, the efficiency of string processing directly determines the overall performance of the search engine. That leaves me with two questions to answer: how efficient are the ready-made string classes, and is there any room left for optimization?
After studying the various string classes in detail, I found that although the ready-made ones perform well, they are only "good", and there is still a lot of room for optimization. Below are the optimizations I made for my own application.
1. Memory management is one of the most important tasks for a server
In a search engine service, operations on all kinds of data, and on strings in particular (construction, destruction, clearing, substring replacement, substring extraction, and length resetting), are extremely frequent, so memory is allocated and released almost every millisecond. What does that lead to? I once wrote on Sina Weibo: "How can a chaotic system become ordered by itself? It cannot. Without human intervention, everything develops from order to chaos. In a server system that runs continuously for a long time, memory is repeatedly requested and released, which inevitably produces memory fragmentation. You may not notice it, but it happens all the same." Once you realize what memory fragmentation does to a server, the first thing you think of is a memory pool. Correct. A well-designed memory pool reduces fragmentation and greatly improves performance. In other words, the first thing our string class has to consider is integration with the memory pool. How? Many ready-made string classes provide an interface for custom memory management (there is plenty of material on the Internet; search for it if you are interested), and our string class needs one too. Because of the way the search engine works, the threads run independently and rarely access each other's data, so all operations on a given string object happen within one thread. That makes it quite easy to adopt lock-free programming. For performance reasons, the memory pool should not request and release its underlying memory with the system's new and delete (both of these operations take a lock) but should use private heaps instead. Although this has a great deal to do with string performance, it really belongs to memory management, and I will discuss it in detail later.
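The idea of a lock-free pool owned by a single thread can be sketched roughly as follows. This is only a minimal illustration, not the author's actual memory pool: the names BlockPool and localPool are hypothetical, the pool serves blocks of one fixed size only, and std::malloc stands in for a real private heap when a new chunk is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool. It contains no locks at all, so it is
// only safe when every call comes from the one thread that owns it.
class BlockPool {
public:
    explicit BlockPool(std::size_t blockSize, std::size_t blocksPerChunk = 256)
        : blockSize_(roundUp(blockSize)), blocksPerChunk_(blocksPerChunk),
          freeList_(nullptr) {}

    ~BlockPool() {
        for (void* chunk : chunks_) std::free(chunk);
    }

    BlockPool(const BlockPool&) = delete;
    BlockPool& operator=(const BlockPool&) = delete;

    // Pop a block from the free list; fetch a new chunk only when it is empty.
    void* allocate() {
        if (freeList_ == nullptr) grow();
        Node* n = freeList_;
        freeList_ = n->next;
        return n;
    }

    // Push the block back onto the free list; nothing returns to the system.
    void release(void* p) {
        freeList_ = ::new (p) Node{freeList_};
    }

    std::size_t chunkCount() const { return chunks_.size(); }

private:
    struct Node { Node* next; };

    // Each block must be able to hold a Node and stay suitably aligned.
    static std::size_t roundUp(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + a - 1) / a * a;
    }

    // Assumption: std::malloc stands in for a private heap here.
    void grow() {
        chunks_.reserve(chunks_.size() + 1);
        char* chunk = static_cast<char*>(std::malloc(blockSize_ * blocksPerChunk_));
        if (chunk == nullptr) throw std::bad_alloc();
        chunks_.push_back(chunk);
        for (std::size_t i = 0; i < blocksPerChunk_; ++i)
            release(chunk + i * blockSize_);
    }

    std::size_t blockSize_;
    std::size_t blocksPerChunk_;
    Node* freeList_;
    std::vector<void*> chunks_;
};

// One pool per thread, so no synchronization is needed between threads.
inline BlockPool& localPool() {
    thread_local BlockPool pool(64);
    return pool;
}
```

A string class could then obtain its small buffers through localPool().allocate() and hand them back with localPool().release(p), as long as each string is only ever touched by the thread that created it.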
2. Supporting copy-on-write is essential to improving performance
What is copy-on-write? Copy-on-write (COW) simply means that when an object is copied, the original is not really copied to another location in memory. Instead, a pointer to the location of the source object is placed in the new object's memory mapping table, and that memory is marked read-only. A read on the new object does not change the data and is executed directly. A write on the new object, however, first copies the real object to a new memory address, updates the new object's mapping table to point to that new location, and then performs the write there. Copy-on-write is common at the lowest levels of an operating system. For example, when several processes in the same system use the same object data, only one copy of it exists in physical storage. When a process tries to write to that region, the kernel allocates a new physical page, copies the content of the region being written into it, and then performs the write on the new page. This way a write in one process does not affect the others, and a great deal of physical memory is saved. A string class does not sit at the same level as the operating system, but the same principle applies. String objects are inevitably passed as parameters and returned as values from function to function, and across a large number of string operations copy-on-write greatly reduces unnecessary memory allocation, release, and copying. The string classes I could find fall into basically three types.
The first kind, provided by some libraries, does not support copy-on-write. The second kind supports copy-on-write but uses locks to guarantee thread safety, which makes it slow. The third kind supports copy-on-write without locks, but its memory management interface is effectively global, so it does not support "lock-free programming" either. As I have said all along, the premise is that we already have an efficient, lock-free memory pool based on private heaps. If our string class did not take advantage of it, that would be a sheer waste.
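A copy-on-write string without locks can be sketched as below. This is a simplified illustration under stated assumptions, not the author's class: CowString, Rep, setChar, and sharesBufferWith are hypothetical names, the reference count is a plain integer (so an object and its copies must stay within one thread), and std::malloc stands in for the memory pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

// Hypothetical single-threaded copy-on-write string.
class CowString {
public:
    CowString(const char* s = "") : rep_(Rep::create(s, std::strlen(s))) {}

    // Copying only bumps the reference count; no characters are copied.
    CowString(const CowString& other) : rep_(other.rep_) { ++rep_->refs; }

    CowString& operator=(const CowString& other) {
        ++other.rep_->refs;  // increment first so self-assignment is safe
        drop();
        rep_ = other.rep_;
        return *this;
    }

    ~CowString() { drop(); }

    const char* c_str() const { return rep_->data(); }
    std::size_t size() const { return rep_->len; }
    bool sharesBufferWith(const CowString& other) const { return rep_ == other.rep_; }

    // A write detaches from the shared buffer first. Caller ensures i < size().
    void setChar(std::size_t i, char c) {
        detach();
        rep_->data()[i] = c;
    }

private:
    // Header stored immediately in front of the characters.
    struct Rep {
        std::size_t refs;  // plain counter: no atomics, no locks
        std::size_t len;

        char* data() { return reinterpret_cast<char*>(this + 1); }

        static Rep* create(const char* s, std::size_t n) {
            void* raw = std::malloc(sizeof(Rep) + n + 1);  // stand-in for the pool
            if (raw == nullptr) throw std::bad_alloc();
            Rep* r = ::new (raw) Rep{1, n};
            std::memcpy(r->data(), s, n);
            r->data()[n] = '\0';
            return r;
        }
    };

    void drop() {
        if (--rep_->refs == 0) std::free(rep_);
    }

    // Make a private copy only if somebody else still shares the buffer.
    void detach() {
        if (rep_->refs > 1) {
            Rep* copy = Rep::create(rep_->data(), rep_->len);
            --rep_->refs;
            rep_ = copy;
        }
    }

    Rep* rep_;
};
```

Passing a CowString by value or returning it from a function then costs one counter increment, and the character data is duplicated only when one of the copies is actually modified.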
3. Optimizing the formatting functions
String formatting is almost everywhere. It covers small operations, such as formatting integers, floating-point numbers, dates, and times, and large ones, such as filling real data into a template when generating a web page. Traditional string classes, such as CString in MFC, provide the Format and AppendFormat methods, but in my application their performance was poor. A CPU hog lurks inside these two functions: GetFormattedLength. What does it do? It calculates the length of the storage needed before the formatting is actually done, and it is quite time-consuming. I applied the idea of trading space for time and added a length-estimate parameter to Format and AppendFormat. The caller estimates the length before the call (generally well above the length that could actually occur; this may waste a little space, but the return in performance is worth it). The formatting code then does not need to compute the length again internally, and performance more than doubles.
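The length-estimate idea can be sketched with the standard vsnprintf as follows. The function name formatWithEstimate and its signature are hypothetical, and the sketch uses std::string rather than the author's class. It also adds a fallback pass for the case where the estimate turns out too small, which the text above does not describe.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: the caller supplies an upper estimate of the result
// length, so the common case needs a single formatting pass and no separate
// length calculation beforehand.
std::string formatWithEstimate(std::size_t estimatedLen, const char* fmt, ...) {
    std::string out(estimatedLen + 1, '\0');

    va_list args;
    va_start(args, fmt);
    va_list retry;
    va_copy(retry, args);  // kept in case the estimate was too small

    // vsnprintf returns the length the full result needs (excluding '\0').
    int n = std::vsnprintf(&out[0], out.size(), fmt, args);
    va_end(args);

    if (n < 0) {  // encoding error
        va_end(retry);
        return std::string();
    }

    if (static_cast<std::size_t>(n) >= out.size()) {
        // Estimate too small: pay for a second pass with the exact size.
        out.resize(static_cast<std::size_t>(n) + 1);
        std::vsnprintf(&out[0], out.size(), fmt, retry);
    }
    va_end(retry);

    out.resize(static_cast<std::size_t>(n));
    return out;
}
```

A call such as formatWithEstimate(64, "%d-%s", id, name) formats in one pass as long as the result fits in 64 characters; a generous estimate trades a little memory for skipping the length calculation.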
4. Optimizing the "+=" operation
Do not tell me you have never seen the "+=" operation on strings. String concatenation is more common than pretty girls on the street, and writing "Str1 = Str1 + Str2" instead of "Str1 += Str2" is like asking a beauty to give up her elegance and spit along with you. Thanks to copy-on-write, "Str1 = Str1 + Str2" is inelegant but does not perform too badly. Still, it consists of two operations, an "add" and an "assign", which makes later optimization harder. For a web server, assembling a web page of tens or even hundreds of KB is something that happens every minute and every second. Traditional string classes give this no special treatment, so frequent, consecutive "+=" operations keep allocating, releasing, and copying memory. For this scenario I adopted two strategies. The first is the "space for time" strategy mentioned above: allocate a large enough buffer for the string object at the start. The second exists because "large enough" is hard to estimate. To avoid moving large blocks of memory too often, I made a strategic optimization to "+=": newly appended data is first kept in a list of small memory blocks instead of being spliced onto the end of the original string immediately, and memory is allocated and the real concatenation is done only when the pending data reaches a certain size or another operation needs the result. In my application environment, this improved performance by an order of magnitude.
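The deferred "+=" strategy can be sketched as below. This is only an illustration of the idea: LazyAppender, flushThreshold, and pendingCount are hypothetical names, and for brevity each pending piece is held in a std::string inside a std::vector, whereas the text describes a list of small memory blocks that would normally come from the memory pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical string builder that defers concatenation.
class LazyAppender {
public:
    explicit LazyAppender(std::size_t flushThreshold = 4096)
        : flushThreshold_(flushThreshold), pendingBytes_(0) {}

    // Appending only queues the piece; the big buffer is not touched yet.
    LazyAppender& operator+=(const std::string& piece) {
        pending_.push_back(piece);
        pendingBytes_ += piece.size();
        if (pendingBytes_ >= flushThreshold_) flush();
        return *this;
    }

    // Any operation that needs the real contents forces the splice.
    const std::string& str() {
        flush();
        return joined_;
    }

    std::size_t pendingCount() const { return pending_.size(); }

private:
    // Splice all queued pieces with at most one reallocation of the buffer.
    void flush() {
        if (pending_.empty()) return;
        const std::size_t need = joined_.size() + pendingBytes_;
        if (need > joined_.capacity())
            joined_.reserve(std::max(need, joined_.capacity() * 2));
        for (const std::string& piece : pending_) joined_ += piece;
        pending_.clear();
        pendingBytes_ = 0;
    }

    std::size_t flushThreshold_;
    std::size_t pendingBytes_;
    std::string joined_;
    std::vector<std::string> pending_;
};
```

With a threshold of a few KB, hundreds of small appends while assembling a page cause only a handful of moves of the large buffer instead of one potential reallocation per append.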
5. Special handling for small strings
In shopping search, small strings are scattered everywhere like stars in the sky. They are small but numerous: millions of keywords and tens of millions of user query phrases. If we applied the "space for time" idea described above to them, the wasted memory would leave you in a very ugly spot. (Spending an extra eight hundred yuan when you buy a house does not matter. Spend an extra eight hundred yuan on dinner every day and just wait for your wife to come after you.) For small strings, on the one hand the growth granularity of memory allocation has to be fine, and on the other hand the memory pool has to be tuned for allocating and releasing small blocks. It sounds simple, but in implementation there is no fixed pattern; it is a matter of experience.
The above are the optimizations to the string class that stay close to conventional approaches, and the results in practice are quite satisfactory. If you are interested, you can take a look at my work (it is not a product): Grain shell shopping search. I call these optimizations "close to conventional" because I have also been crazy enough to work on a "zero-memory" string class that processes large amounts of string data while barely increasing memory use, for example by building a search tree over the keywords to speed up lookups. Since that is not a conventional approach, it only suits specific application scenarios.
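The article gives no details of the "zero-memory" class or of its keyword search tree. As one common way to build a search tree over keywords, a plain prefix tree (trie) is sketched below. KeywordTrie and its members are hypothetical names, and this is a generic technique, not the author's design.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical keyword trie: keywords that share a prefix share the nodes
// for that prefix, and a lookup costs one step per character.
class KeywordTrie {
public:
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->isKeyword = true;
    }

    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) return false;
            node = it->second.get();
        }
        return node->isKeyword;  // a mere prefix of a keyword does not count
    }

private:
    struct Node {
        bool isKeyword = false;
        std::map<char, std::unique_ptr<Node>> children;
    };

    Node root_;
};
```

A std::map per node keeps the sketch short but is not memory-frugal; a version aiming at the low memory footprint described above would need a far more compact node layout.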
This article is from the "Blue bee" blog. Please be sure to keep this source: http://bluebee.blog.51cto.com/661175/1033101