Reproduced from: GitHub:pangliang/pangliang.github.com
Demand is truncated by its literal number.
function (string, start position, intercept length) utf8sub ("Hello 1 world haha",2,5) =good 1 World ha utf8sub ("1 Hello 1 world haha",2,5) =Hello 1 World utf8sub ("Hello World 1 haha",1,5) =Hello World 1utf8sub ("12345678",3,5) =34567Utf8sub ("øpø Hello pix",2,5) = Pø Hello P
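Incorrect approaches
The algorithms I found online are not quite correct: they either produce garbled output or assume that every Chinese character is 4 bytes long, which does not cover the general case.
1. string.sub(s, 1, length)
Many posts online call string.sub(s, 1, length) directly. This is wrong for strings that mix Chinese and English, because string.sub counts bytes. In UTF-8 the characters of "你好1世界" are 3, 3, 1, 3 and 3 bytes long. If you try to take 4 characters as 4*3 = 12 bytes, you get 3+3+1+3+2, so only the first 2 bytes of "界" are kept and the result is garbled.
2. if byte > 128 then index = index + 4
This assumes every non-ASCII character has the same fixed byte width. In UTF-8 the width varies: "ø" takes 2 bytes and common Chinese characters take 3, so a fixed step lands in the middle of a character.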
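To make the failure concrete, here is a minimal sketch that inspects byte counts directly. It assumes the source file is saved as UTF-8; the variable names s and cut are only illustrative.

```lua
-- The # operator and string.sub both work on bytes, not characters.
local s = "你好1世界"
print(#s)                       -- 13 bytes: 3 + 3 + 1 + 3 + 3

-- Taking "4 characters" as 4 * 3 = 12 bytes cuts the last character short.
local cut = string.sub(s, 1, 12)
print(#cut)                     -- 12
print(cut == "你好1世")          -- false: two stray bytes of "界" remain

-- A fixed step per non-ASCII character cannot work either,
-- because the widths differ.
print(#"ø", #"你")              -- 2   3
```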
Key points
1. UTF-8 characters are variable-length.
2. The length of each character follows a fixed rule.
UTF-8 encoding rule
The first byte of each character indicates how many bytes that character occupies:
0xxxxxxx - below 128, 1 byte
110yxxxx - from 192, 2 bytes
1110yyyy - from 224, 3 bytes
11110zzz - from 240, 4 bytes
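The rule is easy to check by dumping the bytes of a few characters. The sketch below assumes the source file is saved as UTF-8; hexbytes is a hypothetical helper written for this illustration, and the last sample is U+1F600 spelled with decimal escapes.

```lua
-- Return the bytes of s as space-separated uppercase hex.
local function hexbytes(s)
    local out = {}
    for i = 1, #s do
        out[#out + 1] = string.format("%02X", string.byte(s, i))
    end
    return table.concat(out, " ")
end

print(hexbytes("1"))                 -- 31           (0xxxxxxx, 1 byte)
print(hexbytes("ø"))                 -- C3 B8        (110yxxxx lead, 2 bytes)
print(hexbytes("你"))                -- E4 BD A0     (1110yyyy lead, 3 bytes)
print(hexbytes("\240\159\152\128"))  -- F0 9F 98 80  (11110zzz lead, 4 bytes)
```

Every byte after the lead byte has the form 10xxxxxx (0x80 to 0xBF), so it can never be mistaken for the start of a character.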
Correct algorithm
1 --2 --Lua3 --determine the UTF8 character byte length4 --0xxxxxxx-1 byte5 --110yxxxx-192, 2 byte6 --1110yyyy-225, 3 byte7 --11110zzz-240, 4 byte8 Local functionchsize (char)9 if notChar ThenTen Print("Not Char") One return 0 A ElseIfChar > - Then - return 4 - ElseIfChar >225 Then the return 3 - ElseIfChar >192 Then - return 2 - Else + return 1 - End + End A at --calculates the number of UTF8 string characters, each of which is calculated by one character - --For example Utf8len ("1 hello") = 3 - functionUtf8len (str) - LocalLen =0 - LocalCurrentindex =1 - whileCurrentindex <= #str Do in Localchar =String.byte(str, currentindex) -Currentindex = Currentindex +chsize (char) toLen = len +1 + End - returnLen the End * $ --Intercept UTF8 stringPanax Notoginseng --str: The string to intercept - --Startchar: Start Word Poute, starting from 1 the --numChars: Length of characters to intercept + functionutf8sub (str, Startchar, NumChars) A LocalStartIndex =1 the whileStartchar >1 Do + Localchar =String.byte(str, startIndex) -StartIndex = StartIndex +chsize (char) $Startchar = Startchar-1 $ End - - LocalCurrentindex =StartIndex the - whileNumChars >0 andCurrentindex <= #str DoWuyi Localchar =String.byte(str, currentindex) theCurrentindex = Currentindex +chsize (char) -NumChars = NumChars-1 Wu End - returnStr:sub (StartIndex, Currentindex-1) About End $ - --Self Test - functionTest () - --Test Utf8len A assert(Utf8len ("Hello 1 world haha") ==7) + assert(Utf8len ("Hello World 1 haha") ==8) the assert(Utf8len ("Hello World 1 haha") ==9) - assert(Utf8len ("12345678") ==8) $ assert(Utf8len ("øpø Hello pix") ==8) the the --Test Utf8sub the assert(Utf8sub ("Hello 1 world haha",2,5) =="Good 1 world, huh?") the assert(Utf8sub ("1 Hello 1 world haha",2,5) =="Hello 1 World") - assert(Utf8sub ("Hello 1 world haha",2,6) =="Hello 1 World") in assert(Utf8sub ("Hello World 1 haha",1,5) =="Hello World 1") the assert(Utf8sub ("12345678",3,5) =="34567") the assert(Utf8sub ("øpø Hello 
pix",2,5) =="pø Hello P") About the Print("All test succ") the End the +Test ()
Substring and character count for UTF-8 strings in Lua (reprint)