[Problem] Delphi: an introduction to UTF8String

Last Update:2018-12-07 Source: Internet

Author: User


In Delphi 2009 and later (including the XE series), string is UnicodeString, which uses UTF-16 (or UCS-2), the character encoding of the Windows NT kernel family. Like UTF-16LE, UTF-8 is an encoding of the Unicode character set and covers the same range of characters.
The main difference between the two lies in the encoding method: UTF-16 can basically be regarded as fixed length, while UTF-8 is variable length. In UTF-16 a character takes at least 2 bytes, and some very rare characters take 4 bytes (2 and 4 bytes are the only possible lengths; the 4-byte characters are seldom used, and fonts that support them are hard to find). In UTF-8 a character takes at least 1 byte. According to the Unicode 6.0 data (http://www.unicode.org/Public/6.0.0/ucd/UnicodeData.txt), which had just come out in October at the time of writing, the highest listed code point is U+10FFFD, so at present a character takes at most 4 bytes (Chinese characters generally take 3 bytes). Both encode the same Unicode character set, only in different ways, and neither is compatible with the other.
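The byte counts above can be checked with a few lines of code. The following is a minimal sketch in Python rather than Delphi, chosen only for brevity; the helper name `encoded_sizes` and the sample characters are illustrative assumptions, not part of the original article.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# An ASCII letter, an accented Latin letter, a common Chinese character,
# a rare supplementary-plane ideograph, and the highest code point cited above.
samples = ["A", "\u00e9", "\u4e2d", "\U00020000", "\U0010FFFD"]

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Only the supplementary-plane characters need 4 bytes in UTF-16 (a surrogate pair), while UTF-8 ranges from 1 to 4 bytes depending on the code point.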
Because UTF-8 encoding does not depend on the machine's byte order (endianness) and its integrity is easy to verify, UTF-8 has a great advantage in transmission. In addition, because many texts consist largely of ASCII characters, UTF-8 also has a size advantage in some cases (especially for text from the US and Europe).
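Both properties can be illustrated briefly. This is a hedged sketch in Python (not Delphi); the helper `is_valid_utf8` and the sample text are assumptions made for the example.

```python
text = "\u4e2dA"  # one Chinese character followed by an ASCII letter

# UTF-16 exists in two byte orders, so the receiver must know which was used.
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# UTF-8 is defined as a byte sequence, so there is only one possible order.
utf8 = text.encode("utf-8")

def is_valid_utf8(data):
    """Return True if data is a complete, well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Cutting a multi-byte sequence short is detected immediately.
truncated = utf8[:2]

# Reading big-endian UTF-16 as little-endian raises no error here;
# it silently yields different characters.
misread = utf16_be.decode("utf-16-le")

print(utf16_le != utf16_be, is_valid_utf8(utf8), is_valid_utf8(truncated), misread != text)
```

A damaged UTF-8 stream tends to fail validation, whereas a byte-order mistake in UTF-16 can go unnoticed until the text is displayed.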
On Windows, any program that displays characters through the WinAPI ends up using a UTF-16 string (the "A" versions of the API functions convert to Unicode internally). Therefore, in general, it is most convenient to use UnicodeString directly, for example in visual controls, and to convert to UTF-8 only for storage and transmission.
Whether to use UTF8String inside a library must be weighed carefully; it is not simply a matter of one type being better than the other. In some cases the core processing functions live inside the library and account for most of the running time. Migrating them to a Unicode (UTF-16) version is difficult or takes more memory, and the result is not necessarily good. In that situation it is better to keep the original ANSI version of the library and modify it slightly to support UTF-8: Unicode is then supported, the memory usage stays the same, and little code has to change.
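The reason a byte-oriented (ANSI-style) routine can often be reused on UTF-8 is that lead bytes and continuation bytes never overlap, so a byte-level match of a valid UTF-8 pattern always starts on a character boundary. The sketch below shows this in Python rather than Delphi; `find_utf8` is a hypothetical helper written for this illustration.

```python
def find_utf8(haystack, needle):
    """Find needle in haystack using a plain byte search on UTF-8 data.

    Returns the character index of the first match, or -1 if there is none.
    """
    hay_bytes = haystack.encode("utf-8")
    byte_pos = hay_bytes.find(needle.encode("utf-8"))  # byte-oriented search
    if byte_pos < 0:
        return -1
    # The match starts on a character boundary, so the prefix is valid UTF-8
    # and its decoded length is the character index of the match.
    return len(hay_bytes[:byte_pos].decode("utf-8"))

sample = "Delphi \u5b57\u7b26\u4e32 string"
print(find_utf8(sample, "string"))
print(find_utf8(sample, "\u7b26\u4e32"))
```

The search itself never needs to know that the data is Unicode; only the final offset conversion is encoding-aware.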
The most common application is string matching, for example substring matching based on the B-M (Boyer-Moore) algorithm. By the principle of the algorithm, matching UTF-16 without a shift table of 64 K Integer entries cannot guarantee, unlike a pure ASCII search, that every shift covers the maximum distance, so the performance of the algorithm cannot be fully exploited. However, a 256 KB table wastes memory and takes a long time to initialize, and loading that much data into the CPU cache also hurts the performance of subsequent searches. If UTF-8 is used, the table still takes only 256*4 bytes, the initialization time is unchanged, and each shift can basically be guaranteed to be long. For this reason, few of the more complex string algorithms that support Unicode work on UTF-16 directly; more of them work on UTF-8.
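To make the table-size argument concrete, here is a minimal sketch of the Horspool simplification of Boyer-Moore (bad-character rule only, not the full B-M algorithm) running over UTF-8 bytes. It is written in Python rather than Delphi, the function names are illustrative, and the size constants assume a 4-byte Integer as in the text.

```python
# Table sizes discussed above, assuming 4-byte Integer entries.
UTF16_TABLE_BYTES = 65536 * 4   # one entry per 16-bit code unit: 256 KB
UTF8_TABLE_BYTES = 256 * 4      # one entry per byte value: 1 KB

def build_shift_table(pattern):
    """Bad-character table with one entry for each of the 256 byte values."""
    m = len(pattern)
    table = [m] * 256                   # bytes absent from the pattern: full shift
    for i in range(m - 1):              # the last pattern byte is skipped on purpose
        table[pattern[i]] = m - 1 - i
    return table

def horspool_find(text, pattern):
    """Return the byte offset of the first match of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    table = build_shift_table(pattern)
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            return pos
        # Shift by the table entry for the byte aligned with the pattern's end.
        pos += table[text[pos + m - 1]]
    return -1

haystack = "\u5728 Delphi \u4e2d\u4f7f\u7528 UTF8String".encode("utf-8")
needle = "\u4f7f\u7528".encode("utf-8")
print(horspool_find(haystack, needle))
```

The table is indexed by a single byte, so it stays at 256 entries regardless of which Unicode characters appear in the pattern.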
Memory usage must also be considered. Under normal circumstances the vast majority of the content is ASCII, and UTF-16 wastes too much space. Common applications are lexical processing, such as compilers or lexical scanners for HTML and XML.
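As a rough illustration of the space difference for markup-heavy input, the following Python sketch (used here instead of Delphi for brevity) measures one small HTML fragment in both encodings; the fragment and the helper name `markup_sizes` are made up for the example.

```python
def markup_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes for a piece of text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Mostly ASCII tags and attributes wrapped around two Chinese characters.
html = '<html><body><p class="note">\u4f60\u597d, world</p></body></html>'

utf8_size, utf16_size = markup_sizes(html)
print(f"{len(html)} characters: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

Every ASCII character costs 1 byte in UTF-8 and 2 bytes in UTF-16, so the more markup a document contains, the larger the gap becomes.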
Regular expressions fit every case described above. Constructing and transforming automata is very complex (lexical scanning also uses automata), which makes migration from ANSI to UTF-16 difficult, and the space complexity also increases greatly.
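The space argument for automata can be sketched with a tiny table-driven matcher. The example below, in Python rather than Delphi, builds a trie-shaped DFA over UTF-8 bytes with dense 256-entry rows; a dense row indexed by UTF-16 code unit would need 65,536 entries instead. The functions `build_dfa` and `accepts` and the keyword list are hypothetical, and a real regular expression engine is far more involved.

```python
NO_TRANSITION = -1

def build_dfa(keywords):
    """Build a keyword-matching DFA over UTF-8 bytes.

    Returns (rows, accepting): rows[state][byte] is the next state or
    NO_TRANSITION, and accepting is the set of states that end a keyword.
    State 0 is the start state.
    """
    rows = [[NO_TRANSITION] * 256]
    accepting = set()
    for word in keywords:
        state = 0
        for byte in word.encode("utf-8"):
            if rows[state][byte] == NO_TRANSITION:
                rows.append([NO_TRANSITION] * 256)
                rows[state][byte] = len(rows) - 1
            state = rows[state][byte]
        accepting.add(state)
    return rows, accepting

def accepts(rows, accepting, text):
    """Return True if text, read as UTF-8 bytes, is exactly one of the keywords."""
    state = 0
    for byte in text.encode("utf-8"):
        state = rows[state][byte]
        if state == NO_TRANSITION:
            return False
    return state in accepting

rows, accepting = build_dfa(["begin", "end", "\u5f00\u59cb"])
print(accepts(rows, accepting, "\u5f00\u59cb"), accepts(rows, accepting, "beginx"))
```

A byte-level automaton needs a few extra states for multi-byte characters, but every row keeps the same width as in an ANSI implementation.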

Problem from http://bbs.2ccc.com/topic.asp?Topicid=369604

This article is an English version of an article originally published in Chinese on aliyun.com.
