Detecting UTF-8 text without a BOM

Source: Internet
Author: User

This is a widely circulated C++ function. I converted it to Delphi:

// PAnsiChar keeps each element one byte wide, including in Delphi versions
// where PChar is PWideChar. DWORD and UCHAR come from the Windows unit.
function IsTextUTF8(lpstrInputStream: PAnsiChar; iLen: Integer): Boolean;
var
  i: Integer;
  cOctets: DWORD; // octets to go in this UTF-8 encoded character
  chr: UCHAR;
  bAllAscii: Boolean;
begin
  cOctets := 0;
  bAllAscii := True;
  for i := 0 to iLen - 1 do
  begin
    chr := Ord(lpstrInputStream[i]);

    if (chr and $80) <> 0 then
      bAllAscii := False;

    if cOctets = 0 then
    begin
      //
      // 7 bit ASCII after 7 bit ASCII is just fine. Handle start of encoding case.
      //
      if chr >= $80 then
      begin
        //
        // Count of the leading 1 bits is the number of characters encoded
        //
        // Mask to 8 bits so the shift cannot raise a range-check error.
        chr := (chr shl 1) and $FF;
        cOctets := cOctets + 1;
        while (chr and $80) <> 0 do
        begin
          chr := (chr shl 1) and $FF;
          cOctets := cOctets + 1;
        end;

        cOctets := cOctets - 1; // count includes this character
        if cOctets = 0 then
        begin
          Result := False; // must start with 11xxxxxx
          Exit;
        end;
      end;
    end
    else
    begin
      // non-leading bytes must start as 10xxxxxx
      if (chr and $C0) <> $80 then
      begin
        Result := False;
        Exit;
      end;
      cOctets := cOctets - 1; // processed another octet in encoding
    end;
  end;

  //
  // End of text. Check for consistency.
  //

  if cOctets > 0 then // anything left over at the end is an error
  begin
    Result := False;
    Exit;
  end;

  if bAllAscii then // not UTF-8 if all ASCII. Forces caller to use code pages for conversion
  begin
    Result := False;
    Exit;
  end;

  Result := True;
end;

The following is the original C++:

/* IsTextUTF8
 *
 * UTF-8 is the encoding of Unicode based on Internet Society RFC 2279
 * (See http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html)
 *
 * Basically:
 * 0000 0000-0000 007F - 0xxxxxxx                   (ASCII converts to 1 octet!)
 * 0000 0080-0000 07FF - 110xxxxx 10xxxxxx          (2 octet format)
 * 0000 0800-0000 FFFF - 1110xxxx 10xxxxxx 10xxxxxx (3 octet format)
 * (This keeps going for 32 bit Unicode)
 *
 *
 * Return value: TRUE, if the text is in UTF-8 format.
 *               FALSE, if the text is not in UTF-8 format.
 *               We will also return FALSE if it is only 7-bit ASCII, so the
 *               right code page will be used.
 *
 * Actually for 7 bit ASCII, it doesn't matter which code page we use, but
 * Notepad will remember that it is UTF-8 and "Save" or "Save As" will store
 * the file with a UTF-8 BOM. Not cool.
 */

#include <windows.h>  // INT, LPSTR, DWORD, UCHAR, BOOL

INT IsTextUTF8(LPSTR lpstrInputStream, INT iLen)
{
    INT   i;
    DWORD cOctets;  // octets to go in this UTF-8 encoded character
    UCHAR chr;
    BOOL  bAllAscii = TRUE;

    cOctets = 0;
    for (i = 0; i < iLen; i++) {
        chr = *(lpstrInputStream + i);

        if ((chr & 0x80) != 0) bAllAscii = FALSE;

        if (cOctets == 0) {
            //
            // 7 bit ASCII after 7 bit ASCII is just fine. Handle start of encoding case.
            //
            if (chr >= 0x80) {
                //
                // Count of the leading 1 bits is the number of characters encoded
                //
                do {
                    chr <<= 1;
                    cOctets++;
                }
                while ((chr & 0x80) != 0);

                cOctets--;                        // count includes this character
                if (cOctets == 0) return FALSE;   // must start with 11xxxxxx
            }
        }
        else {
            // non-leading bytes must start as 10xxxxxx
            if ((chr & 0xC0) != 0x80) {
                return FALSE;
            }
            cOctets--;                            // processed another octet in encoding
        }
    }

    //
    // End of text. Check for consistency.
    //

    if (cOctets > 0) {   // anything left over at the end is an error
        return FALSE;
    }

    if (bAllAscii) {     // not UTF-8 if all ASCII. Forces caller to use code pages for conversion
        return FALSE;
    }

    return TRUE;
}
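
The bit layouts listed in the header comment, and the leading-1-bit count that the do/while loop relies on, can be illustrated with a small portable sketch. The names encodeUtf8 and leadingOnes are hypothetical and not part of the original code. The sketch only covers code points up to U+FFFF and makes no attempt to reject surrogates.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encode one code point using the three layouts
// from the header comment. Returns an empty string above U+FFFF.
std::string encodeUtf8(std::uint32_t cp)
{
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}

// Hypothetical helper: number of leading 1 bits in a byte. For a lead
// byte this equals the total number of octets in the sequence.
int leadingOnes(unsigned char b)
{
    int n = 0;
    while (b & 0x80) {
        b = static_cast<unsigned char>(b << 1);
        ++n;
    }
    return n;
}
```

For a lead byte such as 0xE2, leadingOnes returns 3, so two continuation octets are still expected. That is the value cOctets holds after the decrement in the function above. A byte of the form 10xxxxxx yields 1, which is why the function rejects it as a lead byte.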

 
However, this code has a bug. The famous "联通" (China Unicom) bug in Microsoft Notepad is caused by this code: saved in the ANSI code page, those characters produce bytes that happen to match the 110xxxxx 10xxxxxx pattern, so the file is mistaken for UTF-8. I also just found that the character "拉" (as in 拉萨, Lhasa) has the same problem. If you do not believe it, create a new text file on the desktop, open it, type "拉", and save it. When you open the file again, the text is gone. So far, no better way has been found to reliably identify UTF-8 text without a BOM.
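
The false positive can be reproduced without Notepad. The sketch below is a portable restatement of the same byte check under the hypothetical name looksLikeUtf8, with the all-ASCII rule left out so the patterns can be tested in isolation. It assumes the text was saved in code page 936 (GBK), where "联通" is stored as C1 AA CD A8 and "拉" as C0 AD. The second function, looksLikeStrictUtf8, is one possible tightening and not part of the original code: it also rejects the octets C0, C1 and F5 to FF, which RFC 3629 says never appear in UTF-8.

```cpp
#include <cstddef>

// Hypothetical portable restatement of the byte-pattern check.
bool looksLikeUtf8(const unsigned char* p, std::size_t len)
{
    unsigned pending = 0;                    // continuation octets still expected
    for (std::size_t i = 0; i < len; ++i) {
        unsigned char c = p[i];
        if (pending == 0) {
            if (c >= 0x80) {
                do {                         // count the leading 1 bits
                    c = static_cast<unsigned char>(c << 1);
                    ++pending;
                } while (c & 0x80);
                --pending;                   // the count included the lead byte
                if (pending == 0) return false;   // stray 10xxxxxx
            }
        } else {
            if ((c & 0xC0) != 0x80) return false; // must be 10xxxxxx
            --pending;
        }
    }
    return pending == 0;                     // no truncated sequence at the end
}

// Assumed tightening: refuse octets that can never occur in UTF-8.
bool looksLikeStrictUtf8(const unsigned char* p, std::size_t len)
{
    for (std::size_t i = 0; i < len; ++i) {
        if (p[i] == 0xC0 || p[i] == 0xC1 || p[i] >= 0xF5) return false;
    }
    return looksLikeUtf8(p, len);
}
```

Both GBK byte strings pass looksLikeUtf8 and are rejected by looksLikeStrictUtf8, because their lead bytes C1 and C0 would only encode overlong forms. This narrows the problem but does not remove it: other GBK pairs with a lead byte in C2 to DF and a trail byte in A1 to BF still look like valid two-octet UTF-8.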
 
