A detailed description of string segmentation

Dionysoslai 20150817

Some time ago I needed to split a string into its individual characters. For example, "今天是情人节!" ("Today is Valentine's Day!") should be split into "今", "天", "是", "情", "人", "节", "!". Because the string mixes Chinese, English, special characters, and so on, the characters do not all occupy the same number of bytes: in UTF-8, for example, an English letter takes one byte while a Chinese character usually takes three. The code therefore has to work out how many bytes each character occupies.

Before dealing with this problem, it is necessary to understand the basics of character encoding: ASCII, Unicode, UTF-8, and so on. Here is a brief description; for the details, see this article: http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

ASCII is the earliest character encoding. Each character is one byte long; the byte values 00000000~11111111 could represent 256 characters. Because ASCII was drawn up in the United States in the 1960s without considering other languages, it defines only 128 characters, 0~127, and the highest bit of each byte is always 0.

Non-ASCII encodings: other languages have letters that ASCII lacks, so the values 128~255 were put to use. But the characters needed differ from country to country, and the remaining 128 values are not enough for all of them, so different countries ended up with different encodings. The characters 0~127, of course, are the same in all of them.

Unicode: the different encodings used in each country made exchanging text difficult. A single encoding was needed that gives every symbol one unique code, and that is how Unicode was born. For example, U+0639 is the Arabic letter Ain, U+0041 is the English uppercase letter A, and U+4E25 is the Chinese character "严".

At this point, however, Unicode only specifies the code for each symbol, not how that code is stored. This raises two problems. 1. How can Unicode be told apart from ASCII? For example, the code for "严" (4E25) needs two bytes; how does the computer know whether those two bytes represent one symbol or two? 2. Storage: if Unicode prescribed 3 bytes for every symbol, the first 2 bytes of every English letter would be all 0, which wastes space.
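The mismatch between bytes and symbols is easy to observe in Lua, where a string is a plain byte sequence and the length operator counts bytes. The lines below are a small illustration, assuming the text is stored as UTF-8 (described in the next paragraph), in which the two-byte code 4E25 is stored as three bytes. The character is written with decimal byte escapes so the result does not depend on the encoding of the source file.

```lua
-- "严" (U+4E25) stored as UTF-8 is the three bytes E4 B8 A5 = 228 184 165.
local yan = "\228\184\165"
local mixed = "A" .. yan

print(#yan)                       -- 3: one character, three bytes
print(#mixed)                     -- 4: two characters, four bytes
print(string.byte(mixed, 1, -1))  -- the byte values 65, 228, 184, 165
```

Nothing in the string itself records where one character ends and the next begins, which is why the splitting code has to inspect the byte values.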

UTF-8 is one implementation of Unicode; other implementations include UTF-16 and UTF-32. UTF-8 is a variable-length encoding that uses 1~4 bytes per symbol. It has 2 encoding rules (note: the principle we use to split the string comes from these rules):

1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is identical to the ASCII encoding, and the first (and only) byte has a value of 0~127.

2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 (this is the key point), bit n+1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code.

For example, a 2-byte encoding has the UTF-8 form 110xxxxx 10xxxxxx, so its first byte has a value of 192~223. The 3-byte and 4-byte encodings follow the same pattern.
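To make the two rules concrete, here is a minimal sketch of an encoder that turns a Unicode code into its UTF-8 bytes. The name utf8Encode is hypothetical and not part of the original article. The sketch uses plain arithmetic instead of bit operators so that it also runs on Lua 5.1, and it assumes the input is a valid code (it does not reject surrogates or values above 0x10FFFF).

```lua
-- Encode one Unicode code as a UTF-8 byte string.
-- utf8Encode is a hypothetical helper name used for illustration.
function utf8Encode(cp)
    if cp < 0x80 then
        -- Rule 1: 0xxxxxxx
        return string.char(cp)
    elseif cp < 0x800 then
        -- Rule 2, n = 2: 110xxxxx 10xxxxxx
        return string.char(192 + math.floor(cp / 64),
                           128 + cp % 64)
    elseif cp < 0x10000 then
        -- Rule 2, n = 3: 1110xxxx 10xxxxxx 10xxxxxx
        return string.char(224 + math.floor(cp / 4096),
                           128 + math.floor(cp / 64) % 64,
                           128 + cp % 64)
    else
        -- Rule 2, n = 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return string.char(240 + math.floor(cp / 262144),
                           128 + math.floor(cp / 4096) % 64,
                           128 + math.floor(cp / 64) % 64,
                           128 + cp % 64)
    end
end

-- U+4E25 becomes the three bytes 228, 184, 165 (E4 B8 A5).
print(string.byte(utf8Encode(0x4E25), 1, -1))
```

Dividing by 64, 4096, and 262144 shifts the code right by 6, 12, and 18 bits, and taking the remainder modulo 64 keeps the 6 bits that fit into each following byte.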

From these rules, the following table can be drawn up:

1-byte encoding: the first byte is 0~127;

2-byte encoding: the first byte is 192~223;

3-byte encoding: the first byte is 224~239;

4-byte encoding: the first byte is 240~247;
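The table can be written down directly as a lookup function. This is a sketch only; utf8SeqLen is a hypothetical helper name. It follows the ranges listed above and returns nil for bytes that cannot start a character. A strict validator would additionally reject 192, 193, and 245~247, which well-formed UTF-8 never uses.

```lua
-- Return the sequence length announced by a UTF-8 first byte, or nil for
-- bytes that cannot start a character (following bytes 128~191 and the
-- unused values 248~255). utf8SeqLen is a hypothetical helper name.
function utf8SeqLen(byte)
    if byte <= 127 then
        return 1        -- 0xxxxxxx
    elseif byte <= 191 then
        return nil      -- 10xxxxxx: a following byte, not a first byte
    elseif byte <= 223 then
        return 2        -- 110xxxxx
    elseif byte <= 239 then
        return 3        -- 1110xxxx
    elseif byte <= 247 then
        return 4        -- 11110xxx
    end
    return nil          -- 248~255 are not used by UTF-8
end

print(utf8SeqLen(string.byte("A")))  -- 1
print(utf8SeqLen(228))               -- 3 (first byte of "严")
```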

So the code looks like this (PS: I have recently been developing in Lua, so it is written in Lua):

function Stringcut(str)
    local strcut = {}
    local cutindex = 1
    while true do
        if cutindex > string.len(str) then
            break
        end
        -- The first byte of a character tells how many bytes it occupies.
        local curbyte = string.byte(str, cutindex)
        local byteCount = 1   -- fallback for any other byte value
        if curbyte >= 0 and curbyte <= 127 then        -- 1-byte encoding
            byteCount = 1
        elseif curbyte >= 192 and curbyte <= 223 then  -- 2-byte encoding
            byteCount = 2
        elseif curbyte >= 224 and curbyte <= 239 then  -- 3-byte encoding
            byteCount = 3
        elseif curbyte >= 240 and curbyte <= 247 then  -- 4-byte encoding
            byteCount = 4
        end
        -- Cut out the whole character and move on to the next one.
        local value = string.sub(str, cutindex, cutindex + byteCount - 1)
        table.insert(strcut, value)
        cutindex = cutindex + byteCount
    end
    return strcut
end
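As an alternative sketch, not the author's method, the same split can be expressed with a single Lua pattern: one byte that is not a following byte (128~191), then any number of following bytes. The name splitUtf8 is hypothetical. One difference from the loop above: stray following bytes at the start of a malformed string are skipped rather than returned as separate pieces.

```lua
-- Split a UTF-8 string into characters with string.gmatch.
-- Pattern: a byte outside 128~191, then zero or more bytes in 128~191.
-- splitUtf8 is a hypothetical helper name used for illustration.
function splitUtf8(str)
    local chars = {}
    for ch in string.gmatch(str, "[^\128-\191][\128-\191]*") do
        table.insert(chars, ch)
    end
    return chars
end

-- "Hi", then "严" (bytes 228 184 165), then "!"
local parts = splitUtf8("Hi\228\184\165!")
print(#parts)                       -- 4
print(parts[3] == "\228\184\165")   -- true
```

The decimal escapes keep the example independent of the source file's encoding; with a UTF-8 source file the character could be typed directly.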


Note that the encoding handled here is UTF-8. If the string is encoded as UTF-16 or UTF-32, please search for the details; the principle is similar.

Extended Reading

* http://www.joelonsoftware.com/articles/Unicode.html (the most basic knowledge about character sets)

Copyright notice: this is an original article by the blogger and may not be reproduced without the blogger's permission.
