In Delphi XE (2009 and later), string is UnicodeString, which uses UTF-16 (or UCS-2), the character encoding of the Windows NT family of kernels. Like UTF-16LE, UTF-8 is an encoding of the Unicode character set and can represent the same range of characters. The main difference between the two is the encoding method: UTF-16 can mostly be treated as fixed-width, while UTF-8 is variable-width. In UTF-16 a character takes at least 2 bytes, and some rare characters take 4 bytes (2 and 4 are the only possible lengths; the 4-byte characters are uncommon, and even fonts that support them are hard to find). UTF-8 takes at least 1 byte per character. According to Unicode 6.0 (http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt), which just came out in October, the highest assigned code point is currently U+10FFFD, so a UTF-8 character takes at most 4 bytes (Chinese characters generally take 3 bytes). Both encode the same Unicode character set, but their byte representations differ, and neither is compatible with the other.
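The byte counts above are easy to check. The following is a minimal Python sketch, not Delphi code; the helper name encoded_sizes is made up for this illustration, and it relies only on Python's built-in "utf-16-le" and "utf-8" codecs.

```python
def encoded_sizes(ch):
    """Return (UTF-16 byte count, UTF-8 byte count) for one character."""
    return len(ch.encode("utf-16-le")), len(ch.encode("utf-8"))


# Illustrative sample characters (chosen for this sketch).
samples = [
    "A",           # ASCII letter
    "\u00e9",      # Latin small letter e with acute
    "\u4e2d",      # a common Chinese character
    "\U00020000",  # a rare character outside the Basic Multilingual Plane
]

for ch in samples:
    utf16_len, utf8_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-16 = {utf16_len} bytes, UTF-8 = {utf8_len} bytes")
```

The rare character is the only one that needs 4 bytes in UTF-16, while the UTF-8 length grows from 1 to 4 bytes as the code point gets larger.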
Because the UTF-8 encoding does not depend on the machine's byte order (endianness) and its integrity is easy to verify, UTF-8 has a great advantage for transmission. In addition, because many texts consist mostly of ASCII characters, UTF-8 also has a size advantage in some cases (especially for text from the US and Europe).
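Both points can be sketched in a few lines of Python. This is only an illustration under stated assumptions: is_valid_utf8 is a hypothetical helper name, and the check simply relies on Python's strict UTF-8 decoder rejecting malformed input.

```python
text = "\u4e2d"  # one Chinese character

# UTF-16 has two byte orders, so the receiver must know which one was used.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-8 has a single byte sequence regardless of the machine's endianness.
utf8 = text.encode("utf-8")


def is_valid_utf8(data):
    """Return True if data decodes as well-formed UTF-8 (strict decoding)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf16_le.hex(), utf16_be.hex(), utf8.hex())
print(is_valid_utf8(utf8))       # complete sequence
print(is_valid_utf8(utf8[:-1]))  # sequence truncated in transit
```

A truncated or corrupted multi-byte sequence fails the check, because lead bytes and continuation bytes use distinct bit patterns.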
On the Windows platform, displaying characters through the WinAPI ultimately uses a UTF-16 string (the A versions of the API functions convert to Unicode internally). Therefore, in general, it is most convenient to use UnicodeString directly, for example with visual controls, and to convert to UTF-8 only for storage and transmission.
Whether to use UTF8String in a library must be weighed carefully; it is not simply a matter of one being better than the other. In some cases the core processing functions live inside the library and are time-consuming. Migrating them to a Unicode (UTF-16) version is difficult or uses more memory, and the result is not necessarily good. In that case it is better to keep the original ANSI version of the library and make small modifications so that it handles UTF-8. This way Unicode is supported, memory usage stays the same, and the code changes very little.
The most common example is string matching, such as substring search based on the B-M (Boyer-Moore) algorithm. By the nature of the algorithm, if a UTF-16 matcher does not use a shift table of 64K integers (one entry per 16-bit code unit), then even a search for pure ASCII text cannot guarantee the maximum shift at every step, and the algorithm cannot reach its full performance. A 256 KB table, however, wastes memory and takes a long time to initialize, and loading that much data into the CPU cache also hurts the performance of subsequent searches. With UTF-8, the table still takes only 256*4 bytes, the initialization time is unchanged, and each shift is generally still long. For this reason, the more complex string algorithms rarely work on UTF-16 directly when adding Unicode support; most of them use UTF-8.
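To make the table-size argument concrete, here is a minimal Python sketch of byte-level search over UTF-8. It uses the Boyer-Moore-Horspool simplification (bad-character shifts only), not the full B-M algorithm, and the function names build_shift_table, horspool_find, and find_text are made up for this illustration.

```python
def build_shift_table(pattern):
    """Build a 256-entry shift table, one entry per possible byte value."""
    m = len(pattern)
    table = [m] * 256  # bytes absent from the pattern allow the maximum shift
    for i in range(m - 1):
        table[pattern[i]] = m - 1 - i
    return table


def horspool_find(haystack, pattern):
    """Return the byte offset of the first match of pattern, or -1."""
    m, n = len(pattern), len(haystack)
    if m == 0:
        return 0
    table = build_shift_table(pattern)
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == pattern:
            return pos
        # Shift according to the last byte of the current window.
        pos += table[haystack[pos + m - 1]]
    return -1


def find_text(text, needle):
    """Search on the UTF-8 bytes; the result is a byte offset, not a character index."""
    return horspool_find(text.encode("utf-8"), needle.encode("utf-8"))


print(find_text("abc\u4e2d\u6587def", "\u6587"))
print(find_text("abc\u4e2d\u6587def", "xyz"))
```

The table has 256 entries whatever the text contains. A comparable table indexed by UTF-16 code units would need 65536 entries, which is the 64K-integer table discussed above. Because UTF-8 lead bytes and continuation bytes never overlap, a byte-level match of a well-formed pattern in well-formed text starts on a character boundary.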
Memory usage also has to be considered. In scenarios where the vast majority of the content is ASCII characters, UTF-16 wastes too much space. Common examples are lexical processing tasks, such as compilers or lexical scanners for HTML and XML.
Regular expressions fit all of the cases described above. Constructing and transforming the automata is very complex (lexical scanning also uses automata), so migrating from ANSI to UTF-16 is difficult, and the space complexity also increases greatly.
Question from http://bbs.2ccc.com/topic.asp?Topicid=369604