A detailed description of string segmentation

Last Update:2015-08-19 Source: Internet

Author: User

Dionysoslai 20150817

Some time ago I had a requirement to split a string into its individual characters. For example, "今天是情人节！" ("Today is Valentine's Day!") should be split into "今", "天", "是", "情", "人", "节", "！". Because the string can contain Chinese, English, special characters, and so on, the characters do not all occupy the same number of bytes: an English letter is encoded in a single byte, while a Chinese character takes several bytes (three in UTF-8). It is therefore necessary to determine how each character is encoded.
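To make the byte-count difference concrete, here is a minimal sketch that lists the bytes of a string. The helper name byteValues is made up for this illustration, and the Chinese character is written with byte escapes (the UTF-8 bytes of U+4E25) so the result does not depend on how the source file is saved.

```lua
-- Return the byte values of s as a table (hypothetical helper).
function byteValues(s)
    return { string.byte(s, 1, -1) }
end

print(#byteValues("A"))               -- 1: an English letter is one byte
print(#byteValues("\228\184\165"))    -- 3: U+4E25 takes three bytes in UTF-8
```

Splitting such a string one byte at a time would cut the Chinese character into three meaningless pieces, which is why the byte length of each character has to be determined first.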

Before dealing with this problem, it helps to understand the basics of character encoding: ASCII, Unicode, UTF-8, and so on. Here is a brief description; for the details, see this article: http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

ASCII encoding is the earliest character encoding. Each character is one byte long; a byte ranges over 00000000~11111111 and can represent 256 values. Because ASCII was defined in the United States in the 1960s without considering other languages, it specifies only 128 characters, and the first bit of every byte is 0.
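Since every ASCII byte has its first bit set to 0, its value is always below 128. A small sketch of a check built on that fact follows; the function name isAscii is an assumption for this illustration, not part of the original code.

```lua
-- Returns true when every byte of s is below 128, i.e. s is pure ASCII.
function isAscii(s)
    for i = 1, string.len(s) do
        if string.byte(s, i) > 127 then
            return false
        end
    end
    return true
end

print(isAscii("Hello!"))            -- true
print(isAscii("Hi \228\184\165"))   -- false: the last three bytes are above 127
```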

Non-ASCII encodings: because other languages have letters that ASCII lacks, the values 128~255 were put to use. However, those remaining 128 values are not enough for the characters of every country, so different countries ended up with different encodings for them. The characters 0~127 are, of course, the same in all of these encodings.

Unicode: the different encodings used in each country made communication difficult, so a single encoding was needed that gives every symbol one uniform code. That is how Unicode was born. For example, U+0639 denotes the Arabic letter Ain, U+0041 the English uppercase letter A, and U+4E25 the Chinese character 严 ("strict").

At this point, however, Unicode only specifies the code of each symbol, not how that code is stored. This leaves two problems. 1. How to tell Unicode apart from ASCII? For example, 严 needs two bytes to represent its code; how does the computer know whether those bytes represent one symbol or two? 2. Storage: if Unicode prescribed 3 bytes for every symbol, then for English letters the first 2 bytes would always be all 0, which wastes space.

UTF-8 is one implementation of Unicode; other implementations include UTF-16 and UTF-32. UTF-8 is a variable-length encoding that uses 1~4 bytes per symbol. It has two encoding rules (note: the principle we use to split the string follows directly from them):

1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English characters, therefore, the UTF-8 encoding is identical to the ASCII encoding, and the byte value lies in 0~127 (the code itself occupies a single byte);

2. For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 (this is the key point), bit n+1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code.

For example, a 2-byte encoding has the UTF-8 form 110xxxxx 10xxxxxx, so its first byte lies in 192~223. The 3-byte and 4-byte cases follow in the same way.
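The two rules can be sketched as a small encoder that turns a Unicode code into its UTF-8 bytes. This is an illustration only: the name encodeUtf8 is made up, the input is assumed to be a valid code no larger than U+10FFFF, and plain arithmetic is used instead of bit operators so that the sketch does not depend on a particular Lua version.

```lua
local unpack = table.unpack or unpack  -- Lua 5.2+ or Lua 5.1

-- Encode one Unicode code (a number) as a UTF-8 byte string.
function encodeUtf8(cp)
    if cp < 128 then
        return string.char(cp)          -- rule 1: 0xxxxxxx
    end
    local n, lead
    if cp < 2048 then
        n, lead = 2, 192                -- 110xxxxx
    elseif cp < 65536 then
        n, lead = 3, 224                -- 1110xxxx
    else
        n, lead = 4, 240                -- 11110xxx
    end
    local bytes = {}
    -- Fill the following bytes from the right; each 10xxxxxx carries 6 bits.
    for i = n, 2, -1 do
        bytes[i] = 128 + cp % 64
        cp = math.floor(cp / 64)
    end
    bytes[1] = lead + cp                -- the remaining high bits
    return string.char(unpack(bytes))
end

-- U+4E25 needs three bytes: E4 B8 A5
print(string.byte(encodeUtf8(0x4E25), 1, -1))   -- 228, 184, 165 (tab-separated)
```

Note how the first byte of the result, 228, falls inside the 3-byte range 224~239, which is exactly the property the splitting code relies on.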

According to this, the following table can be listed:

1 byte encoding: first byte in 0~127;

2 byte encoding: first byte in 192~223;

3 byte encoding: first byte in 224~239;

4 byte encoding: first byte in 240~247;
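The multi-byte ranges in this table can be derived from the bit patterns rather than memorized. In the sketch below (leadByteRange is a hypothetical helper), the first byte of an n-byte sequence consists of n one-bits, a zero bit, and 7 - n free bits; the smallest value has all free bits cleared and the largest has them all set.

```lua
-- Smallest and largest first byte of an n-byte UTF-8 sequence (n = 2, 3 or 4).
function leadByteRange(n)
    local low = 256 - 2 ^ (8 - n)        -- n leading ones, everything else zero
    local high = low + 2 ^ (7 - n) - 1   -- all free bits set to one
    return low, high
end

for n = 2, 4 do
    print(string.format("%d bytes: %d~%d", n, leadByteRange(n)))
end
-- 2 bytes: 192~223
-- 3 bytes: 224~239
-- 4 bytes: 240~247
```

The values 128~191 never appear as a first byte, because the pattern 10xxxxxx is reserved for the following bytes of a sequence.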

So the code looks like this (PS: I have been developing in Lua recently, so it is written in Lua):

function stringCut(str)
    local strCut = {};       -- resulting list of characters
    local cutIndex = 1;      -- byte position of the next character
    while true do
        if cutIndex > string.len(str) then
            break;
        end
        local curByte = string.byte(str, cutIndex)
        -- Any byte outside the ranges below (for example a stray
        -- continuation byte) is treated as a 1-byte character.
        local byteCount = 1;
        if curByte >= 0 and curByte <= 127 then          -- 1 byte encoding
            byteCount = 1;
        elseif curByte >= 192 and curByte <= 223 then    -- 2 byte encoding
            byteCount = 2;
        elseif curByte >= 224 and curByte <= 239 then    -- 3 byte encoding
            byteCount = 3;
        elseif curByte >= 240 and curByte <= 247 then    -- 4 byte encoding
            byteCount = 4;
        end
        local value = string.sub(str, cutIndex, cutIndex + byteCount - 1);
        table.insert(strCut, value);
        cutIndex = cutIndex + byteCount;
    end
    return strCut;
end
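As an alternative sketch (not the author's code), the same split can be written with a Lua pattern: one first byte from the table above, followed by any number of continuation bytes 128~191. The names CHAR_PATTERN and splitChars are made up for this illustration. It assumes the input is valid UTF-8 without NUL bytes; unlike the loop above, bytes that match no first-byte range at the start of the input are skipped rather than returned.

```lua
-- First byte 1~127 (single byte) or 192~247 (start of a multi-byte
-- sequence), followed by zero or more continuation bytes 128~191.
local CHAR_PATTERN = "[\1-\127\192-\247][\128-\191]*"

function splitChars(str)
    local chars = {}
    for ch in string.gmatch(str, CHAR_PATTERN) do
        table.insert(chars, ch)
    end
    return chars
end

-- "\228\184\165" is the UTF-8 encoding of U+4E25.
local parts = splitChars("Hi\228\184\165!")
print(#parts)   -- 4
```

The four pieces are "H", "i", the three-byte Chinese character, and "!".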

Note that the encoding assumed here is UTF-8. If your strings are encoded as UTF-16 or UTF-32, the principle is similar; please search for the details.
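To illustrate how similar the principle is, here is a rough sketch for UTF-16, where each character is one 2-byte unit, or two units (4 bytes) when the first unit lies in 0xD800~0xDBFF. The function name splitUtf16BE is hypothetical, and the sketch assumes big-endian byte order, no byte order mark, and well-formed input. For UTF-32 no inspection is needed at all, since every character occupies exactly 4 bytes.

```lua
-- Split a big-endian UTF-16 byte string into characters (illustrative only).
function splitUtf16BE(str)
    local chars = {}
    local i = 1
    local len = string.len(str)
    while i + 1 <= len do
        local hi, lo = string.byte(str, i, i + 1)
        local unit = hi * 256 + lo
        local size = 2
        -- A unit in 0xD800~0xDBFF is the first half of a 4-byte pair.
        if unit >= 0xD800 and unit <= 0xDBFF then
            size = 4
        end
        table.insert(chars, string.sub(str, i, i + size - 1))
        i = i + size
    end
    return chars
end

-- "A" (00 41), U+4E25 (4E 25) and U+1F600 (D8 3D DE 00)
local parts = splitUtf16BE("\0\65\78\37\216\61\222\0")
print(#parts)   -- 3
```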

Extended Reading

* http://www.joelonsoftware.com/articles/Unicode.html (the most basic knowledge about character sets)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
