This is a fairly long C++ function. I converted it into a Delphi version:
function IsTextUTF8(lpstrInputStream: PAnsiChar; iLen: Integer): Boolean;
// PAnsiChar so that the buffer is indexed byte by byte, even where PChar is a wide type.
var
  i: Integer;
  cOctets: DWORD; // octets to go in this UTF-8 encoded character
  chr: Byte;
  bAllAscii: Boolean;
begin
  cOctets := 0;
  bAllAscii := True;
  for i := 0 to iLen - 1 do
  begin
    chr := Ord(lpstrInputStream[i]);
    if (chr and $80) <> 0 then
      bAllAscii := False;
    if cOctets = 0 then
    begin
      //
      // 7 bit ASCII after 7 bit ASCII is just fine. Handle start of encoding case.
      //
      if chr >= $80 then
      begin
        //
        // Count of the leading 1 bits is the number of characters encoded
        //
        chr := (chr shl 1) and $FF; // mask keeps the value in Byte range under range checking
        cOctets := cOctets + 1;
        while (chr and $80) <> 0 do
        begin
          chr := (chr shl 1) and $FF;
          cOctets := cOctets + 1;
        end;
        cOctets := cOctets - 1; // count includes this character
        if cOctets = 0 then
        begin
          Result := False; // must start with 11xxxxxx
          Exit;
        end;
      end;
    end
    else
    begin
      // Non-leading bytes must start as 10xxxxxx
      if (chr and $C0) <> $80 then
      begin
        Result := False;
        Exit;
      end;
      cOctets := cOctets - 1; // processed another octet in encoding
    end;
  end;
  //
  // End of text. Check for consistency.
  //
  if cOctets > 0 then // anything left over at the end is an error
  begin
    Result := False;
    Exit;
  end;
  if bAllAscii then // not UTF-8 if all ASCII. Forces caller to use code pages for conversion
  begin
    Result := False;
    Exit;
  end;
  Result := True;
end;
The following is the original C++:
/* IsTextUTF8
 *
 * UTF-8 is the encoding of Unicode based on Internet Society RFC2279
 * (See http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html)
 *
 * Basically:
 * 0000 0000-0000 007F - 0xxxxxxx (ASCII converts to 1 octet!)
 * 0000 0080-0000 07FF - 110xxxxx 10xxxxxx (2 octet format)
 * 0000 0800-0000 FFFF - 1110xxxx 10xxxxxx 10xxxxxx (3 octet format)
 * (This keeps going for 32 bit Unicode)
 *
 *
 * Return value: TRUE, if the text is in UTF-8 format.
 *               FALSE, if the text is not in UTF-8 format.
 * We will also return FALSE if it is only 7-bit ASCII, so the right code page
 * will be used.
 *
 * Actually for 7 bit ASCII, it doesn't matter which code page we use, but
 * Notepad will remember that it is UTF-8 and "Save" or "Save As" will store
 * the file with a UTF-8 BOM. Not cool.
 */
#include <windows.h> // LPSTR, DWORD, UCHAR, BOOL, TRUE, FALSE

int IsTextUTF8(LPSTR lpstrInputStream, int iLen)
{
    int i;
    DWORD cOctets; // octets to go in this UTF-8 encoded character
    UCHAR chr;
    BOOL bAllAscii = TRUE;

    cOctets = 0;
    for (i = 0; i < iLen; i++) {
        chr = *(lpstrInputStream + i);
        if ((chr & 0x80) != 0) bAllAscii = FALSE;
        if (cOctets == 0) {
            //
            // 7 bit ASCII after 7 bit ASCII is just fine. Handle start of encoding case.
            //
            if (chr >= 0x80) {
                //
                // Count of the leading 1 bits is the number of characters encoded
                //
                do {
                    chr <<= 1;
                    cOctets++;
                }
                while ((chr & 0x80) != 0);
                cOctets--; // count includes this character
                if (cOctets == 0) return FALSE; // must start with 11xxxxxx
            }
        }
        else {
            // Non-leading bytes must start as 10xxxxxx
            if ((chr & 0xc0) != 0x80) {
                return FALSE;
            }
            cOctets--; // processed another octet in encoding
        }
    }
    //
    // End of text. Check for consistency.
    //
    if (cOctets > 0) { // anything left over at the end is an error
        return FALSE;
    }
    if (bAllAscii) { // not UTF-8 if all ASCII. Forces caller to use code pages for conversion
        return FALSE;
    }
    return TRUE;
}
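The leading-bit count that the loop relies on can be looked at in isolation. The following is only a sketch in portable C++. `leading_ones` and `check_leading_ones` are made-up names that do not appear in the original code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the original): counts the leading 1 bits of a
// byte, the same way the do/while loop in IsTextUTF8 does.
int leading_ones(std::uint8_t b) {
    int n = 0;
    while ((b & 0x80) != 0) {
        b = static_cast<std::uint8_t>(b << 1); // drop the top bit
        ++n;
    }
    return n;
}

// Hypothetical self-check covering the formats listed in the header comment.
void check_leading_ones() {
    assert(leading_ones(0x41) == 0); // 0xxxxxxx: plain ASCII
    assert(leading_ones(0xC3) == 2); // 110xxxxx: lead byte of a 2 octet sequence
    assert(leading_ones(0xE4) == 3); // 1110xxxx: lead byte of a 3 octet sequence
    assert(leading_ones(0xA9) == 1); // 10xxxxxx: continuation byte, not a valid lead
}
```

A count of 1 means the byte is a continuation byte standing where a lead byte should be. That is the case in which cOctets drops to 0 after the decrement and the function returns false.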
However, IsTextUTF8 has a bug. The famous "联通" (China Unicom) bug in Microsoft Notepad is caused by this very code. I also just found that the "拉" of "拉萨" (Lhasa) has the same problem. If you don't believe it, create a new text file on the desktop, open it, type "拉", and save it. Open it again and you will find nothing there. So far I have not found a better way to identify UTF-8 text that has no BOM.
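The misdetection can be reproduced outside Notepad. The sketch below restates the heuristic with standard C++ types and feeds it the bytes these characters are assumed to have in the GBK code page (C1 AA CD A8 for "联通", C0 AD for "拉"). Both sequences happen to match the 110xxxxx 10xxxxxx pattern. The names `is_text_utf8_loose`, `is_text_utf8_stricter` and `demo_misdetection` are hypothetical. The stricter variant is only a partial mitigation, not a fix: it rejects bytes that never occur in well-formed UTF-8 (0xC0, 0xC1, 0xF5 to 0xFF), which catches these two cases, but other ANSI text can still pass.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable restatement of the heuristic (same logic, standard types).
bool is_text_utf8_loose(const std::uint8_t* p, std::size_t len) {
    unsigned octets = 0;   // continuation bytes still expected
    bool all_ascii = true;
    for (std::size_t i = 0; i < len; ++i) {
        std::uint8_t c = p[i];
        if ((c & 0x80) != 0) all_ascii = false;
        if (octets == 0) {
            if (c >= 0x80) {
                do {
                    c = static_cast<std::uint8_t>(c << 1);
                    ++octets;
                } while ((c & 0x80) != 0);
                --octets;                      // the count included the lead byte
                if (octets == 0) return false; // lone 10xxxxxx byte
            }
        } else {
            if ((c & 0xC0) != 0x80) return false;
            --octets;
        }
    }
    return octets == 0 && !all_ascii;
}

// Hypothetical stricter variant: also rejects bytes that can never appear
// in well-formed UTF-8 (0xC0, 0xC1 and 0xF5..0xFF).
bool is_text_utf8_stricter(const std::uint8_t* p, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        if (p[i] == 0xC0 || p[i] == 0xC1 || p[i] >= 0xF5) return false;
    }
    return is_text_utf8_loose(p, len);
}

// Hypothetical demonstration using the assumed GBK byte values.
void demo_misdetection() {
    const std::uint8_t liantong[] = {0xC1, 0xAA, 0xCD, 0xA8}; // "联通" in GBK (assumed)
    const std::uint8_t la[] = {0xC0, 0xAD};                   // "拉" in GBK (assumed)
    const std::uint8_t e_acute[] = {0xC3, 0xA9};              // U+00E9 in real UTF-8

    assert(is_text_utf8_loose(liantong, sizeof liantong));     // misdetected as UTF-8
    assert(is_text_utf8_loose(la, sizeof la));                 // misdetected as UTF-8
    assert(!is_text_utf8_stricter(liantong, sizeof liantong)); // 0xC1 is rejected
    assert(!is_text_utf8_stricter(la, sizeof la));             // 0xC0 is rejected
    assert(is_text_utf8_stricter(e_acute, sizeof e_acute));    // genuine UTF-8 still passes
}
```

The stricter check helps here only because both byte sequences start with 0xC0 or 0xC1. A pair such as CD A8 on its own is well-formed UTF-8, so ANSI text made of such pairs would still be misdetected.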