UTF-8 string truncation and character counting in Lua (reprint)

Source: Internet
Author: User

Reproduced from: GitHub:pangliang/pangliang.github.com

Demand is truncated by its literal number.
function (string, start position, intercept length) utf8sub ("Hello 1 world haha",2,5)    =good 1 World ha utf8sub ("1 Hello 1 world haha",2,5)    =Hello 1 World utf8sub ("Hello World 1 haha",1,5)    =Hello World 1utf8sub ("12345678",3,5)    =34567Utf8sub ("øpø Hello pix",2,5) = Pø Hello P
Incorrect approaches

Most of the algorithms I found online are not quite correct: they either produce garbled output or only handle 3-byte Chinese characters, so they are not general.

1. string.sub(s, 1, length * 3)

Many posts simply call string.sub(s, 1, length * 3). This is wrong for strings that mix Chinese and ASCII. In "你好1世界" the characters take 3, 3, 1, 3 and 3 bytes. Cutting 4 characters takes 4 * 3 = 12 = 3 + 3 + 1 + 3 + 2 bytes, so only the first 2 bytes of "界" are kept and the output is garbled.

2. if byte > 128 then index = index + 3

This treats every non-ASCII character as 3 bytes, so it fails on 2-byte characters such as "ø" and on 4-byte characters.
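Both failure modes can be reproduced with a short sketch. It assumes the source file is saved as UTF-8 and that common Chinese characters take 3 bytes each; `naive_len` is a hypothetical name for the fixed-step approach, not code from the original post.

```lua
-- Approach 1: byte-based cut that assumes every character is 3 bytes.
local s = "你好1世界"                 -- byte lengths: 3, 3, 1, 3, 3 (13 bytes)
local cut = string.sub(s, 1, 4 * 3)  -- intended: the first 4 characters

-- The cut holds 12 bytes: "你好1世" (10 bytes) plus the first 2 bytes of "界".
local dangling = #cut - #"你好1世"

-- Approach 2: character count that steps 3 bytes for every non-ASCII byte.
local function naive_len(str)
    local index, count = 1, 0
    while index <= #str do
        if string.byte(str, index) > 127 then
            index = index + 3
        else
            index = index + 1
        end
        count = count + 1
    end
    return count
end

-- "øpø" has 3 characters, but "ø" is only 2 bytes, so the 3-byte step
-- jumps over the ASCII "p" and the count comes out too low.
local naive_count = naive_len("øpø")

print(#s, #cut, dangling, naive_count)  -- prints 13, 12, 2, 2
```

The two dangling bytes at the end of `cut` are what shows up as garbage when the truncated string is displayed.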

Key issues

1. UTF-8 characters are variable-length.

2. The byte length of each character follows a regular pattern.

UTF-8 encoding rules

The first byte of each character indicates how many bytes that character occupies:

0xxxxxxx: 0 to 127, 1 byte

110yxxxx: 192 to 223, 2 bytes

1110yyyy: 224 to 239, 3 bytes

11110zzz: 240 to 247, 4 bytes
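As a quick check of these ranges, the sketch below prints the first byte and the byte length of one sample character per encoded length. The sample characters are illustrative choices, and the file is assumed to be saved as UTF-8.

```lua
-- One sample character for each encoded length (illustrative choices).
local samples = { "a", "ø", "你", "😀" }

for _, ch in ipairs(samples) do
    -- string.byte with no index returns the first byte, i.e. the lead byte.
    -- #ch is the length in bytes, not in characters.
    print(ch, string.byte(ch), #ch)
end
-- a    97   1
-- ø    195  2
-- 你   228  3
-- 😀   240  4
```

Note that the lead byte of the emoji is exactly 240, so a length check has to treat the boundary values 192, 224 and 240 themselves as lead bytes of 2-, 3- and 4-byte characters.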

Correct algorithm
1 --2 --Lua3 --determine the UTF8 character byte length4 --0xxxxxxx-1 byte5 --110yxxxx-192, 2 byte6 --1110yyyy-225, 3 byte7 --11110zzz-240, 4 byte8 Local functionchsize (char)9     if  notChar ThenTen         Print("Not Char") One         return 0 A     ElseIfChar > -  Then -         return 4 -     ElseIfChar >225  Then the         return 3 -     ElseIfChar >192  Then -         return 2 -     Else +         return 1 -     End + End A  at --calculates the number of UTF8 string characters, each of which is calculated by one character - --For example Utf8len ("1 hello") = 3 - functionUtf8len (str) -     LocalLen =0 -     LocalCurrentindex =1 -      whileCurrentindex <= #str Do in         Localchar =String.byte(str, currentindex) -Currentindex = Currentindex +chsize (char) toLen = len +1 +     End -     returnLen the End *  $ --Intercept UTF8 stringPanax Notoginseng --str: The string to intercept - --Startchar: Start Word Poute, starting from 1 the --numChars: Length of characters to intercept + functionutf8sub (str, Startchar, NumChars) A     LocalStartIndex =1 the      whileStartchar >1  Do +         Localchar =String.byte(str, startIndex) -StartIndex = StartIndex +chsize (char) $Startchar = Startchar-1 $     End -  -     LocalCurrentindex =StartIndex the  -      whileNumChars >0  andCurrentindex <= #str DoWuyi         Localchar =String.byte(str, currentindex) theCurrentindex = Currentindex +chsize (char) -NumChars = NumChars-1 Wu     End -     returnStr:sub (StartIndex, Currentindex-1) About End $  - --Self Test - functionTest () -     --Test Utf8len A     assert(Utf8len ("Hello 1 world haha") ==7) +     assert(Utf8len ("Hello World 1 haha") ==8) the     assert(Utf8len ("Hello World 1 haha") ==9) -     assert(Utf8len ("12345678") ==8) $     assert(Utf8len ("øpø Hello pix") ==8) the  the     --Test Utf8sub the     assert(Utf8sub ("Hello 1 world haha",2,5) =="Good 1 world, huh?") the     assert(Utf8sub ("1 Hello 1 world haha",2,5) =="Hello 1 World") -     
assert(Utf8sub ("Hello 1 world haha",2,6) =="Hello 1 World") in     assert(Utf8sub ("Hello World 1 haha",1,5) =="Hello World 1") the     assert(Utf8sub ("12345678",3,5) =="34567") the     assert(Utf8sub ("øpø Hello pix",2,5) =="pø Hello P") About  the     Print("All test succ") the End the  +Test ()
