Recognizing Unicode, UTF-8, and GB2312 Encodings

Source: Internet
Author: User

UCS defines a character named "ZERO WIDTH NO-BREAK SPACE" whose code is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.

This way, if the receiver sees the bytes FE FF first, the byte stream is big-endian; if it sees FF FE, the byte stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM (byte order mark).
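As a minimal sketch of this rule (in Python for brevity; the function name `utf16_byte_order` is a hypothetical helper, not something from the article), the two possible serializations of U+FEFF can be compared against the first two bytes of a stream:

```python
# U+FEFF serialized in each byte order. These two byte pairs are what the
# receiver looks for at the very start of the stream.
BOM_BIG_ENDIAN = "\ufeff".encode("utf-16-be")      # bytes FE FF
BOM_LITTLE_ENDIAN = "\ufeff".encode("utf-16-le")   # bytes FF FE


def utf16_byte_order(data: bytes) -> str:
    """Report the byte order announced by the first two bytes of `data`."""
    first_two = data[:2]
    if first_two == BOM_BIG_ENDIAN:
        return "big-endian"
    if first_two == BOM_LITTLE_ENDIAN:
        return "little-endian"
    return "unknown"  # no BOM present, so the order cannot be inferred here


print(utf16_byte_order(bytes([0xFE, 0xFF, 0x6C, 0x49])))
print(utf16_byte_order(bytes([0xFF, 0xFE, 0x49, 0x6C])))
```

Because FFFE is not a character, a receiver that reads FF FE can only be looking at a byte-swapped FEFF, which is what makes the check unambiguous.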

UTF-8 does not need a BOM to indicate byte order, but the BOM can be used to mark the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8 encoded.
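The verification can also be done mechanically. The sketch below (Python; `utf8_encode_three_bytes` is a hypothetical helper name) applies the standard three-byte UTF-8 layout 1110xxxx 10xxxxxx 10xxxxxx, which covers code points from 0800 to FFFF, to FEFF:

```python
def utf8_encode_three_bytes(code_point: int) -> bytes:
    """Encode a code point in the range 0x0800..0xFFFF as three UTF-8 bytes.

    Surrogates (0xD800..0xDFFF) are not valid characters; this sketch does
    not reject them, it only illustrates the bit layout.
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


encoded = utf8_encode_three_bytes(0xFEFF)
print(" ".join(f"{b:02X}" for b in encoded))  # prints "EF BB BF"
```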

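The leading bytes of each file format are therefore:

Unicode (little-endian): FF FE

Unicode big-endian: FE FF

UTF-8: EF BB BF

GB2312 (ANSI): no BOM; each two-byte character is stored high byte first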
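The signatures above are enough for a simple detector. The following is a sketch under stated assumptions, not a complete encoding detector: the names `BOM_SIGNATURES` and `sniff_bom` are hypothetical, the fallback to GB2312 for BOM-less data is a guess that suits this article's test files only, and UTF-32 (whose little-endian BOM also begins with FF FE) is deliberately ignored.

```python
import codecs

# (signature bytes, Python codec name) pairs for the formats discussed here.
BOM_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes, fallback: str = "gb2312") -> str:
    """Return a codec name based on the leading bytes of `data`.

    Data without a recognized BOM is assumed to be in `fallback`; that is
    an assumption, since BOM-less bytes carry no encoding label at all.
    """
    for signature, codec_name in BOM_SIGNATURES:
        if data.startswith(signature):
            return codec_name
    return fallback


print(sniff_bom(b"\xef\xbb\xbf\xe6\xb1\x89A"))
print(sniff_bom(b"\xba\xbaA"))
```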

 

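We can now run a test to verify the above.
1. Type "汉A" in Notepad and save it to the root of drive C: four times, once in each encoding that Notepad offers (ANSI, Unicode, Unicode big endian, UTF-8), as ansi.txt, unicode.txt, unicode_b.txt, and utf8.txt.

2. Run the following program: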


using System;
using System.Collections;
using System.IO;

public class MyClass
{
    // Despite its name, this method reads the file at `path` and prints its
    // bytes to the console, first in hexadecimal and then in binary.
    private static void WriteFile(string path)
    {
        FileStream fs = null;
        try
        {
            fs = new FileStream(path, FileMode.Open);
            byte[] bs = new byte[fs.Length];
            fs.Read(bs, 0, bs.Length);
            WL(BitConverter.ToString(bs));      // e.g. "FF-FE-49-6C-41-00"
            SixtTwo(BitConverter.ToString(bs));
        }
        catch (Exception ex)
        {
            // Pass the exception as an argument so that any braces in its
            // text are not treated as format placeholders.
            WL("{0}", ex);
        }
        finally
        {
            if (fs != null)
                fs.Close();
        }
    }

    public static void Main()
    {
        string path;
        WL("Byte stream in ANSI file format:");
        path = @"C:\ansi.txt";
        WriteFile(path);

        WL("Byte stream in Unicode file format:");
        path = @"C:\unicode.txt";
        WriteFile(path);

        WL("Byte stream in Unicode-big-endian file format:");
        path = @"C:\unicode_b.txt";
        WriteFile(path);

        WL("Byte stream in UTF-8 file format:");
        path = @"C:\utf8.txt";
        WriteFile(path);
        RL();
    }

    // Converts a dash-separated hexadecimal string (base 16) into
    // space-separated 8-bit binary groups (base 2) and prints them.
    public static void SixtTwo(string sixStr)
    {
        string[] tmp = sixStr.Split(new char[] { '-' });
        foreach (string s in tmp)
        {
            Console.Write(Convert.ToString(Convert.ToByte(s, 16), 2).PadLeft(8, '0') + " ");
        }
        WL("");
    }

    private static void WL(string text, params object[] args)
    {
        Console.WriteLine(text, args);
    }

    private static void RL()
    {
        Console.ReadLine();
    }

    private static void Break()
    {
        System.Diagnostics.Debugger.Break();
    }
}

3. The output is as follows:
Byte stream in ANSI file format:
BA-BA-41
10111010 10111010 01000001
Byte stream in Unicode file format:
FF-FE-49-6C-41-00
11111111 11111110 01001001 01101100 01000001 00000000
Byte stream in Unicode-big-endian file format:
FE-FF-6C-49-00-41
11111110 11111111 01101100 01001001 00000000 01000001
Byte stream in UTF-8 file format:
EF-BB-BF-E6-B1-89-41
11101111 10111011 10111111 11100110 10110001 10001001 01000001
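The same four byte streams can be reproduced without Notepad. This sketch (Python; the `dashed_hex` helper and the `streams` dictionary are hypothetical names) assumes the test text is U+6C49 followed by "A", which is what the bytes listed above decode to, and that the ANSI code page of the machine that saved the file was GB2312:

```python
import codecs

text = "\u6c49A"  # the character U+6C49 followed by the letter A


def dashed_hex(data: bytes) -> str:
    """Format bytes the way BitConverter.ToString does, e.g. 'FF-FE-49'."""
    return "-".join(f"{b:02X}" for b in data)


# Each stream is the BOM (if the format has one) plus the encoded text.
streams = {
    "ANSI": text.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode-big-endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8": codecs.BOM_UTF8 + text.encode("utf-8"),
}

for label, data in streams.items():
    print(f"Byte stream in {label} file format:")
    print(dashed_hex(data))
```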

From these results it is easy to see that BA BA is exactly the GB2312 encoding of "汉". My operating system, however, is a Traditional Chinese one. If I double-click the ANSI file, Notepad shows a different character in place of "汉": the file has no BOM, so the system decodes BA BA as Big5, and in Big5 those two bytes are the code of another character. The text therefore appears garbled.
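The effect is easy to reproduce. In this sketch (Python, using its built-in `gb2312` and `big5` codecs; the variable names are illustrative), the same two bytes are decoded under both code pages:

```python
raw = "\u6c49".encode("gb2312")   # the two bytes BA BA

as_gb2312 = raw.decode("gb2312")  # the intended character, U+6C49
as_big5 = raw.decode("big5")      # what a Big5 system shows instead

print(raw.hex().upper())
print(f"GB2312 reading: U+{ord(as_gb2312):04X}")
print(f"Big5 reading:   U+{ord(as_big5):04X}")
```

The bytes are identical in both cases; only the code page used to interpret them differs, which is why a file without a BOM cannot be decoded reliably without outside knowledge.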

Many other programs identify the encoding differently. A browser such as IE can read the encoding of an HTML file from its meta tag, and an XML file can declare its encoding through the encoding attribute, so their detection methods differ somewhat from the one described here.

Similarly, when writing a text file, writing these marker bytes at the start helps Notepad identify the file's encoding. (.NET provides classes for this, such as StreamWriter, which can save text directly in a given encoding.)
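As an illustration of writing the marker bytes, here is a sketch in Python rather than .NET. It relies on Python's `utf-8-sig` codec, which emits EF BB BF before the text; the helper names and the temporary file are assumptions made for the demo.

```python
import os
import tempfile


def write_with_bom(path: str, text: str) -> None:
    """Write `text` as UTF-8 preceded by the BOM bytes EF BB BF."""
    with open(path, "w", encoding="utf-8-sig") as handle:
        handle.write(text)


def read_raw(path: str) -> bytes:
    """Return the file's bytes exactly as stored on disk."""
    with open(path, "rb") as handle:
        return handle.read()


# Demo in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "utf8.txt")  # hypothetical file name
    write_with_bom(demo_path, "\u6c49A")
    written = read_raw(demo_path)

print("-".join(f"{b:02X}" for b in written))
```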

As for converting between encodings, there is little to add: the Convert, GetBytes, and GetString methods of the Encoding class make it easy.
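The idea behind such a conversion is the same in any language: decode the bytes to text with the source encoding, then encode the text with the target encoding. A minimal Python sketch (the function name `convert_encoding` is hypothetical and is not the .NET API):

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode `data` from the `source` codec to the `target` codec.

    Raises UnicodeDecodeError if `data` is not valid in `source`, and
    UnicodeEncodeError if a character has no representation in `target`.
    """
    text = data.decode(source)  # bytes to text
    return text.encode(target)  # text to bytes


gb_bytes = bytes([0xBA, 0xBA, 0x41])  # the ANSI file contents from the test
utf8_bytes = convert_encoding(gb_bytes, "gb2312", "utf-8")
print("-".join(f"{b:02X}" for b in utf8_bytes))  # prints "E6-B1-89-41"
```

Note that this produces the encoded text only; a BOM, if wanted, has to be written separately.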
