Principle and code for splitting a UTF-8 encoded string into individual characters and counting the characters in a UTF-8 string

Last Update:2014-05-20 Source: Internet

Author: User

Tags 0xc0


1. Character encoding

1. ASCII code

Inside a computer, all information is ultimately represented as binary. Each binary bit has two states, 0 and 1, so eight bits can be combined into 256 states. Such a group of eight bits is called a byte. In other words, a single byte can represent 256 different states, each corresponding to one symbol, from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that defines the mapping between English characters and binary values. It is called ASCII and is still in use today.
ASCII defines 128 characters in total. For example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control codes that cannot be printed: 0 to 31, plus 127) occupy only the low seven bits of a byte; the highest bit is always 0.
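
The "highest bit is always 0" property can be tested directly. The following is a minimal sketch; the helper name `is_ascii` is made up for illustration and is not part of the original article.

```cpp
#include <string>

// Returns true when every byte in the string has its highest bit clear,
// i.e. the string contains only 7-bit ASCII codes (0 to 127).
// The name is_ascii is illustrative, not taken from the article.
bool is_ascii(const std::string &s) {
  for (unsigned char byte : s) {
    if (byte & 0x80)  // highest bit set: not an ASCII code
      return false;
  }
  return true;
}
```

For instance, `is_ascii("A b")` should hold, while a string containing the two-byte UTF-8 form of é should not pass.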

2. Unicode

128 symbols are enough to encode English, but not other languages. Many European countries therefore devised their own non-ASCII encodings, still using a single byte and using the range with the highest bit set to 1 (that is, 128 to 255) to extend the original ASCII code. One of the better-known examples is the IBM character encoding. In this way, the encoding systems used by these European countries can represent at most 256 symbols. This created a new problem, however: different countries have different letters, so even though they all use 256-symbol encodings, the same code represents different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, 0 to 127 represent the same symbols; only the range 128 to 255 differs.

Asian languages use far more characters; Chinese alone has about 100,000. A single byte can represent only 256 symbols, so these characters must be expressed with multiple bytes. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes per Chinese character and can therefore represent at most 256 x 256 = 65536 symbols in theory.

Many encodings exist in the world, and the same binary value can be interpreted as different symbols. To open a text file you therefore have to know its encoding; reading it with the wrong encoding produces garbled characters. Why do emails so often arrive garbled? Because the sender and the receiver use different encodings.
Now imagine a single encoding that includes every symbol in the world and gives each one a unique code. The garbling problem would disappear. That encoding is Unicode: as its name suggests, it is an encoding of all symbols.

Unicode is, naturally, a very large set: its code space has room for more than one million symbols, each with a different code. Note that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that code should be stored.

3. UTF-8

As the Internet became popular, a unified encoding was strongly needed. UTF-8 is the most widely used Unicode implementation on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one way of implementing Unicode.
The most important feature of UTF-8 is that it is a variable-length encoding. As originally designed, it uses 1 to 6 bytes per symbol, with the length depending on the symbol. (The current standard, RFC 3629, limits UTF-8 to 4 bytes, because Unicode ends at U+10FFFF.)
The UTF-8 encoding rules are simple. There are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining seven bits are the Unicode code of the symbol. For English letters, UTF-8 and ASCII are therefore identical.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the Unicode code of the symbol.
For example:
1 byte   0xxxxxxx
2 bytes  110xxxxx 10xxxxxx
3 bytes  1110xxxx 10xxxxxx 10xxxxxx
4 bytes  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
UTF-8 can therefore carry at most 31 bits of actual character code, namely the bits marked x in the table above. Apart from the control bits (the length marker at the start of the first byte and the 10 at the start of each following byte), the x bits correspond one-to-one to the bits of the Unicode code, in the same order.
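
To make the one-to-one correspondence concrete, here is a minimal decoding sketch that collects the x bits of one sequence back into a code. The name `utf8_decode_at` is hypothetical, the sketch covers only the 1 to 4 byte forms, and it assumes that `pos` points at the first byte of a complete, well-formed sequence; it performs no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the UTF-8 sequence that starts at input[pos] by gathering its
// payload ("x") bits. Assumes a complete, well-formed 1 to 4 byte sequence;
// no validation is done. Name and signature are illustrative only.
std::uint32_t utf8_decode_at(const std::string &input, std::size_t pos) {
  unsigned char lead = static_cast<unsigned char>(input[pos]);
  std::size_t len = 1;
  std::uint32_t code = lead;  // 0xxxxxxx: the byte itself is the code
  if (lead >= 0xF0) {         // 11110xxx: 3 payload bits in the lead byte
    len = 4;
    code = lead & 0x07;
  } else if (lead >= 0xE0) {  // 1110xxxx: 4 payload bits
    len = 3;
    code = lead & 0x0F;
  } else if (lead >= 0xC0) {  // 110xxxxx: 5 payload bits
    len = 2;
    code = lead & 0x1F;
  }
  for (std::size_t k = 1; k < len; ++k) {
    unsigned char next = static_cast<unsigned char>(input[pos + k]);
    code = (code << 6) | (next & 0x3F);  // 10xxxxxx: 6 payload bits each
  }
  return code;
}
```

For the three bytes E4 B8 A5, the payload bits are 0100, 111000 and 100101, which concatenate to 0x4E25.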
To convert a Unicode code to UTF-8, first drop its leading zero bits, then use the number of remaining bits to choose the shortest UTF-8 form that can hold them.
Consequently, characters in the basic ASCII set (Unicode is compatible with ASCII) need only one UTF-8 byte (7 bits).
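
The conversion just described can be sketched as follows. This is an illustration under stated assumptions rather than a complete encoder: the name `utf8_encode` is made up, the input is assumed to be at most 0x10FFFF (so only the 1 to 4 byte forms are produced), and surrogate codes are not rejected.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode code as UTF-8 using the shortest form that fits.
// Assumes code <= 0x10FFFF; surrogates are not rejected. Illustrative only.
std::string utf8_encode(std::uint32_t code) {
  std::string out;
  if (code < 0x80) {            // fits in 7 bits: 0xxxxxxx
    out += static_cast<char>(code);
  } else if (code < 0x800) {    // fits in 11 bits: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (code >> 6));
    out += static_cast<char>(0x80 | (code & 0x3F));
  } else if (code < 0x10000) {  // fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (code >> 12));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
  } else {                      // fits in 21 bits: 11110xxx plus three 10xxxxxx
    out += static_cast<char>(0xF0 | (code >> 18));
    out += static_cast<char>(0x80 | ((code >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((code >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (code & 0x3F));
  }
  return out;
}
```

As a worked case, 0x4E25 is 100111000100101 in binary (15 bits), so it needs the three-byte form and becomes the bytes E4 B8 A5.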

With this rule, it is easy to split a UTF-8 encoded string into a set of individual characters. The code is as follows:

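    #include <cstddef>
    #include <set>
    #include <string>

    // Splits a UTF-8 string into the set of distinct characters it contains
    // and returns the size of that set.
    size_t utf8_to_charset(const std::string &input, std::set<std::string> &output) {
      std::string ch;
      // Use i < length (not !=) so a truncated trailing sequence cannot
      // step past the end of the string.
      for (size_t i = 0, len = 0; i < input.length(); i += len) {
        unsigned char byte = static_cast<unsigned char>(input[i]);
        if (byte >= 0xFC)       // 1111110x: length 6
          len = 6;
        else if (byte >= 0xF8)  // 111110xx: length 5
          len = 5;
        else if (byte >= 0xF0)  // 11110xxx: length 4
          len = 4;
        else if (byte >= 0xE0)  // 1110xxxx: length 3
          len = 3;
        else if (byte >= 0xC0)  // 110xxxxx: length 2
          len = 2;
        else                    // ASCII (or a stray continuation byte)
          len = 1;
        ch = input.substr(i, len);
        output.insert(ch);
      }
      return output.size();
    }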

Here the string is converted into a set of individual characters because that is what my application needs. To keep each character's position in the string, simply replace the set with a vector.
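
A minimal sketch of that vector variant might look like this. The name `utf8_to_chars` is hypothetical; the length detection is the same as in the set-based function, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a UTF-8 string into its characters, keeping order and duplicates.
// Assumes well-formed input; the function name is illustrative only.
std::vector<std::string> utf8_to_chars(const std::string &input) {
  std::vector<std::string> output;
  for (std::size_t i = 0, len = 0; i < input.length(); i += len) {
    unsigned char byte = static_cast<unsigned char>(input[i]);
    if (byte >= 0xFC)
      len = 6;
    else if (byte >= 0xF8)
      len = 5;
    else if (byte >= 0xF0)
      len = 4;
    else if (byte >= 0xE0)
      len = 3;
    else if (byte >= 0xC0)
      len = 2;
    else
      len = 1;
    // substr clamps len to the end of the string if the last
    // sequence is truncated.
    output.push_back(input.substr(i, len));
  }
  return output;
}
```

Unlike the set, the vector keeps repeated characters, so the element at index k is the k-th character of the string.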

The following code gets the number of characters in a UTF-8 string (note: this is not the string's length in bytes):

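    #include <cstddef>
    #include <string>

    // Returns the number of characters (not bytes) in a UTF-8 string.
    size_t get_utf8_length(const std::string &input) {
      size_t length = 0;
      // Use i < length (not !=) so a truncated trailing sequence cannot
      // step past the end of the string.
      for (size_t i = 0, len = 0; i < input.length(); i += len) {
        unsigned char byte = static_cast<unsigned char>(input[i]);
        if (byte >= 0xFC)       // 1111110x: length 6
          len = 6;
        else if (byte >= 0xF8)  // 111110xx: length 5
          len = 5;
        else if (byte >= 0xF0)  // 11110xxx: length 4
          len = 4;
        else if (byte >= 0xE0)  // 1110xxxx: length 3
          len = 3;
        else if (byte >= 0xC0)  // 110xxxxx: length 2
          len = 2;
        else                    // ASCII (or a stray continuation byte)
          len = 1;
        length++;
      }
      return length;
    }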
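
An alternative way to count, not used in the original article, relies on the fact that every following byte has the form 10xxxxxx: counting the bytes that do not match that pattern gives one count per character. This is a sketch with a made-up name, and it assumes well-formed input.

```cpp
#include <cstddef>
#include <string>

// Counts characters by counting every byte that is NOT a continuation
// byte (10xxxxxx). Assumes well-formed UTF-8; the name is illustrative.
std::size_t utf8_count_lead_bytes(const std::string &input) {
  std::size_t count = 0;
  for (unsigned char byte : input) {
    if ((byte & 0xC0) != 0x80)  // top two bits are not 10: starts a character
      ++count;
  }
  return count;
}
```

Because this version inspects each byte on its own and never jumps ahead by a computed length, it cannot run past the end of a truncated string. For well-formed input it should give the same result as the length-skipping loop.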

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
