Detecting UTF-8 text without a BOM

Last Update:2018-12-05 Source: Internet

Author: User

This C++ function is widely circulated. I converted it to Delphi:

function IsTextUTF8(lpstrInputStream: PAnsiChar; iLen: Integer): Boolean;
var
  i: Integer;
  cOctets: DWORD; // octets to go in this UTF-8 encoded character
  chr: Byte;
  bAllAscii: Boolean;

begin
  cOctets := 0;
  bAllAscii := True;
  for i := 0 to iLen - 1 do
  begin
    chr := Ord(lpstrInputStream[i]);

    if (chr and $80) <> 0 then
      bAllAscii := False;

    if cOctets = 0 then
    begin
      //
      // 7 bit ASCII after 7 bit ASCII is just fine. Handle start of encoding case.
      //
      if chr >= $80 then
      begin
        //
        // Count of the leading 1 bits is the number of characters encoded
        //
        chr := (chr shl 1) and $FF; // keep 8 bits, like the C++ UCHAR shift
        cOctets := cOctets + 1;
        while (chr and $80) <> 0 do
        begin
          chr := (chr shl 1) and $FF;
          cOctets := cOctets + 1;
        end;

        cOctets := cOctets - 1; // count includes this character
        if cOctets = 0 then
        begin
          Result := False; // must start with 11xxxxxx
          Exit;
        end;
      end;
    end
    else
    begin
      // Non-leading bytes must start as 10xxxxxx
      if (chr and $C0) <> $80 then
      begin
        Result := False;
        Exit;
      end;
      cOctets := cOctets - 1; // processed another octet in encoding
    end;
  end;

  //
  // End of text. Check for consistency.
  //

  if cOctets > 0 then // anything left over at the end is an error
  begin
    Result := False;
    Exit;
  end;

  if bAllAscii then // not UTF-8 if all ASCII. Forces caller to use code pages for conversion
  begin
    Result := False;
    Exit;
  end;

  Result := True;
end;

The following is the original C++:

/* IsTextUTF8
 *
 * UTF-8 is the encoding of Unicode based on Internet Society RFC2279
 * (See http://www.cis.ohio-state.edu/htbin/rfc/rfc2279.html)
 *
 * Basically:
 * 0000 0000-0000 007F - 0xxxxxxx (ASCII converts to 1 octet!)
 * 0000 0080-0000 07FF - 110xxxxx 10xxxxxx (2 octet format)
 * 0000 0800-0000 FFFF - 1110xxxx 10xxxxxx 10xxxxxx (3 octet format)
 * (This keeps going for 32 bit Unicode)
 *
 *
 * Return value: TRUE, if the text is in UTF-8 format.
 *               FALSE, if the text is not in UTF-8 format.
 *               We will also return FALSE if it is only 7-bit ASCII, so the right code page
 *               will be used.
 *
 *               Actually for 7 bit ASCII, it doesn't matter which code page we use, but
 *               Notepad will remember that it is UTF-8 and "Save" or "Save As" will store
 *               the file with a UTF-8 BOM. Not cool.
 */

INT IsTextUTF8(LPSTR lpstrInputStream, INT iLen)
{
    INT   i;
    DWORD cOctets; // octets to go in this UTF-8 encoded character
    UCHAR chr;
    BOOL  bAllAscii = TRUE;

    cOctets = 0;
    for (i = 0; i < iLen; i++) {
        chr = *(lpstrInputStream + i);

        if ((chr & 0x80) != 0) bAllAscii = FALSE;

        if (cOctets == 0) {
            //
            // 7 bit ASCII after 7 bit ASCII is just fine. Handle start of encoding case.
            //
            if (chr >= 0x80) {
                //
                // Count of the leading 1 bits is the number of characters encoded
                //
                do {
                    chr <<= 1;
                    cOctets++;
                }
                while ((chr & 0x80) != 0);

                cOctets--; // count includes this character
                if (cOctets == 0) return FALSE; // must start with 11xxxxxx
            }
        }
        else {
            // Non-leading bytes must start as 10xxxxxx
            if ((chr & 0xC0) != 0x80) {
                return FALSE;
            }
            cOctets--; // processed another octet in encoding
        }
    }

    //
    // End of text. Check for consistency.
    //

    if (cOctets > 0) { // anything left over at the end is an error
        return FALSE;
    }

    if (bAllAscii) { // not UTF-8 if all ASCII. Forces caller to use code pages for conversion
        return FALSE;
    }

    return TRUE;
}
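
The leading-bit count at the heart of the loop can be looked at in isolation. The following is a minimal C++ sketch, not part of the original source: the names continuationCount and checkContinuationCount are hypothetical, and the helper only mirrors the pattern rule used by the function above (it performs no further validation).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the original source): how many 10xxxxxx bytes
// should follow a given lead byte, using the same leading-1-bits rule.
// Returns 0 for 7-bit ASCII and -1 for a stray 10xxxxxx continuation byte.
int continuationCount(std::uint8_t lead)
{
    if ((lead & 0x80) == 0) return 0;      // 0xxxxxxx: single octet
    int ones = 0;
    while ((lead & 0x80) != 0) {           // count the leading 1 bits
        lead = static_cast<std::uint8_t>(lead << 1);
        ++ones;
    }
    if (ones == 1) return -1;              // 10xxxxxx cannot start a character
    return ones - 1;                       // the count includes the lead byte
}

// Small self-check against the table in the comment block.
void checkContinuationCount()
{
    assert(continuationCount(0x41) == 0);  // 'A', 1 octet format
    assert(continuationCount(0xC3) == 1);  // 110xxxxx, 2 octet format
    assert(continuationCount(0xE4) == 2);  // 1110xxxx, 3 octet format
    assert(continuationCount(0x80) == -1); // continuation byte in lead position
}
```

Note that the rule looks only at the bit pattern, so it accepts any byte of the form 110xxxxx as a two-octet lead, whatever the value of the remaining bits.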

However, the IsTextUTF8 function has a bug. The famous "联通" (China Unicom) bug in Microsoft Notepad is caused by this code. I also just found that the character "拉" (as in 拉萨, Lhasa) has the same problem. If you don't believe it, create a new text file on the desktop, open it, type "拉", and save it. When you open the file again, you will find that the character is gone. So far I have not found a better way to identify UTF-8 text that has no BOM.
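
The reason is that the ANSI (GBK) bytes of these characters happen to fit the 110xxxxx 10xxxxxx pattern: "联通" is stored as C1 AA CD A8 and "拉" as C0 AD. The sketch below, in C++, reproduces this with a pattern-only check equivalent to the function above, and then shows one partial mitigation that is not from the original article: rejecting the bytes C0, C1 and F5 to FF, which never occur in well-formed UTF-8. All function and constant names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Pattern-only check, same logic as the IsTextUTF8 function above.
bool looksLikeUtf8Loose(const unsigned char* p, std::size_t n)
{
    unsigned pending = 0;                  // continuation bytes still expected
    bool allAscii = true;
    for (std::size_t i = 0; i < n; ++i) {
        unsigned char c = p[i];
        if (c & 0x80) allAscii = false;
        if (pending == 0) {
            if (c >= 0x80) {
                do { c = static_cast<unsigned char>(c << 1); ++pending; } while (c & 0x80);
                --pending;                 // the count includes the lead byte
                if (pending == 0) return false;
            }
        } else {
            if ((c & 0xC0) != 0x80) return false;
            --pending;
        }
    }
    return pending == 0 && !allAscii;
}

// Assumed mitigation (not in the original): also reject bytes that can
// never appear in well-formed UTF-8. This narrows the bug, it does not remove it.
bool looksLikeUtf8Stricter(const unsigned char* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        if (p[i] == 0xC0 || p[i] == 0xC1 || p[i] >= 0xF5) return false;
    return looksLikeUtf8Loose(p, n);
}

const unsigned char kGbkLianTong[]  = {0xC1, 0xAA, 0xCD, 0xA8};             // "联通" in GBK
const unsigned char kGbkLa[]        = {0xC0, 0xAD};                         // "拉" in GBK
const unsigned char kGbkTong[]      = {0xCD, 0xA8};                         // "通" in GBK
const unsigned char kUtf8LianTong[] = {0xE8, 0x81, 0x94, 0xE9, 0x80, 0x9A}; // "联通" in UTF-8

void demoNotepadBug()
{
    assert(looksLikeUtf8Loose(kGbkLianTong, sizeof kGbkLianTong));     // misdetected
    assert(looksLikeUtf8Loose(kGbkLa, sizeof kGbkLa));                 // misdetected
    assert(!looksLikeUtf8Stricter(kGbkLianTong, sizeof kGbkLianTong)); // C1 rejected
    assert(!looksLikeUtf8Stricter(kGbkLa, sizeof kGbkLa));             // C0 rejected
    assert(looksLikeUtf8Stricter(kUtf8LianTong, sizeof kUtf8LianTong)); // real UTF-8 passes
    assert(looksLikeUtf8Stricter(kGbkTong, sizeof kGbkTong));          // still ambiguous
}
```

The last assertion shows the limit of this approach: "通" on its own (CD A8) is also a well-formed UTF-8 sequence, so no byte-level check can tell the two encodings apart for such input.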

This article is an English version of an article originally published in Chinese on aliyun.com.