The code is as follows:
-- Get the correct length of a UTF-8 encoded string
-- @param str
-- @return number
function utfstrlen(str)
    local len = #str
    local left = len
    local cnt = 0
    -- Lead-byte thresholds: >= 0xFC means a 6-byte sequence, >= 0xF8 a 5-byte
    -- sequence, ..., >= 0xC0 a 2-byte sequence, otherwise a single ASCII byte.
    local arr = {0, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC}
    while left ~= 0 do
        -- Read the lead byte of the next remaining character.
        local tmp = string.byte(str, -left)
        local i = #arr
        while arr[i] do
            -- Skip as many bytes as the lead byte says this character occupies.
            if tmp >= arr[i] then left = left - i; break end
            i = i - 1
        end
        cnt = cnt + 1
    end
    return cnt
end
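To see the difference this makes, here is a quick usage sketch (assuming the source file itself is saved as UTF-8): the # operator counts bytes, while utfstrlen counts characters.

-- "你好" takes 6 bytes in UTF-8 but is only 2 characters.
local s = "你好Lua"
print(#s)            --> 9  (6 bytes for the two Chinese characters + 3 ASCII bytes)
print(utfstrlen(s))  --> 5  (2 Chinese characters + 3 ASCII characters)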
The Lua string library does not support UTF-8 encoded Chinese characters, so handling Chinese text in Lua takes some extra work.
UTF-8 encoding rules (a small byte-classification sketch follows this list):
1. The first byte of a character is in the range 0x00-0x7F (0-127) or 0xC2-0xF4 (194-244); UTF-8 is compatible with ASCII, so 0-127 is encoded exactly as in ASCII.
2. The bytes 0xC0, 0xC1, and 0xF5-0xFF (192, 193, and 245-255) never appear in UTF-8 encoded text.
3. Bytes 0x80-0xBF (128-191) appear only in the second and subsequent bytes of a multi-byte sequence (for example, Chinese characters).
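As a quick illustration (the function name here is hypothetical), the rules above can be checked by walking a string byte by byte and classifying each value:

-- Print every byte of str together with its role in a UTF-8 sequence.
local function dumpbytes(str)
    for i = 1, #str do
        local b = string.byte(str, i)
        local kind
        if b <= 0x7F then
            kind = "ASCII / single-byte character"
        elseif b >= 0x80 and b <= 0xBF then
            kind = "continuation byte (2nd or later byte of a character)"
        elseif b >= 0xC2 and b <= 0xF4 then
            kind = "lead byte of a multi-byte character"
        else
            kind = "never appears in valid UTF-8"
        end
        print(string.format("byte %d: 0x%02X  %s", i, b, kind))
    end
end

dumpbytes("A中")   -- 0x41, then 0xE4 0xB8 0xAD (the UTF-8 bytes of "中")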
So we can use Lua's powerful pattern matching to get the result we want; the two key operations are (a combined sketch follows the list):
1. local _, count = string.gsub(str, "[^\128-\193]", "") gets the number of characters in str.
2. for uchar in string.gfind(str, "[%z\1-\127\194-\244][\128-\191]*") do tab[#tab+1] = uchar end puts each character of str into the table tab.
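Putting the two operations together, here is a minimal sketch (the function names are illustrative). Note that string.gfind is the Lua 5.0 name and string.gmatch is its replacement in Lua 5.1 and later; the %z class for the zero byte likewise follows Lua 5.1 conventions.

-- Use whichever iterator this Lua version provides.
local gmatch = string.gmatch or string.gfind

-- Count the characters in a UTF-8 string by ignoring continuation bytes.
local function utf8len(str)
    local _, count = string.gsub(str, "[^\128-\193]", "")
    return count
end

-- Split a UTF-8 string into a table with one character per entry.
local function utf8chars(str)
    local tab = {}
    for uchar in gmatch(str, "[%z\1-\127\194-\244][\128-\191]*") do
        tab[#tab + 1] = uchar
    end
    return tab
end

local s = "Lua中文"
print(utf8len(s))                        --> 5
print(table.concat(utf8chars(s), "|"))   --> L|u|a|中|文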