UTF-8 substring extraction and character counting in Lua (reprint)

Last Update:2016-03-16 Source: Internet

Author: User


Reproduced from: GitHub:pangliang/pangliang.github.com

Demand is truncated by its literal number.

function (string, start position, intercept length) utf8sub ("Hello 1 world haha",2,5)    =good 1 World ha utf8sub ("1 Hello 1 world haha",2,5)    =Hello 1 World utf8sub ("Hello World 1 haha",1,5)    =Hello World 1utf8sub ("12345678",3,5)    =34567Utf8sub ("øpø Hello pix",2,5) = Pø Hello P

Incorrect methods

The algorithms that turn up in an online search are mostly wrong: they either produce garbled output, or they assume every Chinese character occupies one fixed number of bytes, which does not cover all cases.

1. string.sub(s, 1, length)

Many posts call string.sub(s, 1, length) directly with a byte count, which is wrong for mixed Chinese and English text. In the string '你好1世界' the characters occupy 3, 3, 1, 3 and 3 bytes. Taking 4 characters as 4 x 3 = 12 bytes gives 3 + 3 + 1 + 3 + 2, so only the first 2 bytes of '界' are included and the result ends in garbled output.
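
The effect is easy to reproduce. A minimal sketch, assuming the source file is saved as UTF-8, in which each of the Chinese characters below occupies 3 bytes:

```lua
-- Assumes this file is saved as UTF-8.
local s = "你好1世界"

-- The # operator and string.sub count bytes, not characters.
local byte_len = #s                       -- 3 + 3 + 1 + 3 + 3

-- Naive "4 characters x 3 bytes" truncation.
local naive = string.sub(s, 1, 4 * 3)

-- The first 10 bytes hold the 4 complete characters.
local complete = string.sub(naive, 1, 10)

-- The 2 leftover bytes are only a fragment of the 5th character.
local fragment = string.sub(naive, 11)

print(byte_len)     -- 13
print(#naive)       -- 12
print(complete)     -- 你好1世
print(#fragment)    -- 2
```

The two trailing bytes do not form a valid character on their own, which is why the truncated string displays as garbage at the end.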

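2. if byte > 128 then index = index + 4

This is also wrong, because it gives every non-ASCII character the same fixed width. UTF-8 characters outside ASCII occupy 2, 3 or 4 bytes, so the index drifts out of step as soon as the string contains a character of a different width.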

Key issues

1. UTF-8 characters are variable-length

2. The length of each character follows a fixed rule

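UTF-8 encoding rule

The first byte of each character indicates how many bytes that character occupies. Every following byte of a multi-byte character has the form 10xxxxxx (128 to 191).

0xxxxxxx - 0 to 127, 1 byte

110yxxxx - 192 and above, 2 bytes

1110yyyy - 224 and above, 3 bytes

11110zzz - 240 and above, 4 bytes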
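
The thresholds can be checked against real characters. In the sketch below, lead_byte_size is a hypothetical helper written for this illustration, and it assumes well-formed UTF-8 input. The sample characters are written as explicit byte escapes so the result does not depend on the file encoding.

```lua
-- Hypothetical helper: byte length of a UTF-8 character, derived from
-- the value of its first byte. Assumes well-formed UTF-8.
local function lead_byte_size(byte)
    if byte >= 240 then       -- 11110zzz
        return 4
    elseif byte >= 224 then   -- 1110yyyy
        return 3
    elseif byte >= 192 then   -- 110yxxxx
        return 2
    else                      -- 0xxxxxxx
        return 1
    end
end

-- Each entry: the raw bytes of one character, and a label for printing.
local samples = {
    { "A",                "U+0041" },   -- 1 byte
    { "\195\184",         "U+00F8" },   -- ø, 2 bytes
    { "\228\189\160",     "U+4F60" },   -- 你, 3 bytes
    { "\240\159\152\128", "U+1F600" },  -- an emoji, 4 bytes
}

for _, sample in ipairs(samples) do
    local first = string.byte(sample[1], 1)
    -- Prints: label, first byte, size from the rule, actual byte length
    print(sample[2], first, lead_byte_size(first), #sample[1])
end
```

The last sample has a first byte of exactly 240, so the comparison has to be "greater than or equal to" for 4-byte characters to be measured correctly.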

Correct algorithm

1 --2 --Lua3 --determine the UTF8 character byte length4 --0xxxxxxx-1 byte5 --110yxxxx-192, 2 byte6 --1110yyyy-225, 3 byte7 --11110zzz-240, 4 byte8 Local functionchsize (char)9     if  notChar ThenTen         Print("Not Char") One         return 0 A     ElseIfChar > -  Then -         return 4 -     ElseIfChar >225  Then the         return 3 -     ElseIfChar >192  Then -         return 2 -     Else +         return 1 -     End + End A  at --calculates the number of UTF8 string characters, each of which is calculated by one character - --For example Utf8len ("1 hello") = 3 - functionUtf8len (str) -     LocalLen =0 -     LocalCurrentindex =1 -      whileCurrentindex <= #str Do in         Localchar =String.byte(str, currentindex) -Currentindex = Currentindex +chsize (char) toLen = len +1 +     End -     returnLen the End *  $ --Intercept UTF8 stringPanax Notoginseng --str: The string to intercept - --Startchar: Start Word Poute, starting from 1 the --numChars: Length of characters to intercept + functionutf8sub (str, Startchar, NumChars) A     LocalStartIndex =1 the      whileStartchar >1  Do +         Localchar =String.byte(str, startIndex) -StartIndex = StartIndex +chsize (char) $Startchar = Startchar-1 $     End -  -     LocalCurrentindex =StartIndex the  -      whileNumChars >0  andCurrentindex <= #str DoWuyi         Localchar =String.byte(str, currentindex) theCurrentindex = Currentindex +chsize (char) -NumChars = NumChars-1 Wu     End -     returnStr:sub (StartIndex, Currentindex-1) About End $  - --Self Test - functionTest () -     --Test Utf8len A     assert(Utf8len ("Hello 1 world haha") ==7) +     assert(Utf8len ("Hello World 1 haha") ==8) the     assert(Utf8len ("Hello World 1 haha") ==9) -     assert(Utf8len ("12345678") ==8) $     assert(Utf8len ("øpø Hello pix") ==8) the  the     --Test Utf8sub the     assert(Utf8sub ("Hello 1 world haha",2,5) =="Good 1 world, huh?") the     assert(Utf8sub ("1 Hello 1 world haha",2,5) =="Hello 1 World") -     
assert(Utf8sub ("Hello 1 world haha",2,6) =="Hello 1 World") in     assert(Utf8sub ("Hello World 1 haha",1,5) =="Hello World 1") the     assert(Utf8sub ("12345678",3,5) =="34567") the     assert(Utf8sub ("øpø Hello pix",2,5) =="pø Hello P") About  the     Print("All test succ") the End the  +Test ()
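
As a cross-check on the counting logic, a different technique (not the one used in the article) is to count every byte that is not a continuation byte, that is, every byte outside the range 128 to 191. The sketch below assumes well-formed UTF-8 input, and the name utf8len_by_pattern is illustrative. On Lua 5.3 and later, the built-in utf8.len gives the same result for valid input.

```lua
-- Counts UTF-8 characters by counting the bytes that are NOT
-- continuation bytes (10xxxxxx, values 128..191).
-- Assumes the input is well-formed UTF-8.
local function utf8len_by_pattern(str)
    -- gsub returns the number of substitutions as its second result
    local _, count = string.gsub(str, "[^\128-\191]", "")
    return count
end

print(utf8len_by_pattern("12345678"))             -- 8
print(utf8len_by_pattern("\195\184p\195\184"))    -- 3 (the bytes of øpø)
print(utf8len_by_pattern("\228\189\160\229\165\189"))  -- 2 (the bytes of 你好)
```

Because every character has exactly one non-continuation byte (its first byte), the count matches what a first-byte walk produces on valid input, which makes this a convenient independent check.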


