Character Encoding Analysis

Source: Internet
Author: User
Tags: ultraedit

I have been studying character encoding over the past few days and have learned a lot, so I will write up a detailed summary here. Please indicate the source when reprinting.

First, here is an interesting explanation of character encoding that I found on the Internet:


Q: How does a computer represent text?
A: It doesn't. It only speaks 0s and 1s.
So a group of people invented ASCII to map letters and symbols to fixed bit patterns: a fixed eight-bit arrangement of 0s and 1s represents one fixed character. For example, the ASCII code of the letter 'A' is 65, and its bit pattern is 01000001. Of course, ASCII is only one encoding; there are many others, but the principle is the same. When you send the string "A" to another machine, remember that you are not sending the text "A" itself; you are sending the bit string 01000001 (its ASCII encoding).
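As a quick check of the ASCII mapping described above, a few lines of C# (the class name here is my own) can print the code and bit pattern of 'A':

```csharp
using System;

class AsciiBits
{
    static void Main()
    {
        char c = 'A';
        byte code = (byte)c; // ASCII code point of 'A' is 65
        // Base-2 string of the byte, left-padded to 8 bits
        string bits = Convert.ToString(code, 2).PadLeft(8, '0');
        Console.WriteLine($"{c} = {code} = {bits}"); // A = 65 = 01000001
    }
}
```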


Now we need to understand the relationship between characters and bytes. A character can be represented by a varying number of bytes: one byte, or several. In ASCII, each character occupies one byte. GB2312 is compatible with ASCII and uses two bytes for each Chinese character; GBK is a superset of GB2312, i.e. an extension of the GB family of encodings. Unicode (whose default storage form here is little-endian UCS-2) stores every character in two bytes, writing the low-order byte of the character first; it achieves a loose compatibility with ASCII by padding the ASCII value with a zero byte.
Big endian is the other byte order, in which the high-order byte of the character is written first. Both of these two-byte Unicode storage forms waste space for ASCII text and are not byte-compatible with ASCII; UTF-8 solves both problems. Currently, the default encoding on (Chinese) Windows is ANSI, i.e. GBK, which is what VC 6.0 uses, while VS 2010 adopts UTF-8.
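The byte widths and byte orders described above can be verified directly in C#. This sketch relies on .NET's built-in Encoding classes, where Encoding.Unicode means UTF-16 little endian:

```csharp
using System;
using System.Text;

class EncodingWidths
{
    static void Main()
    {
        // UTF-16 little endian ("Unicode" in .NET): low byte first -> 41-00
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes("A")));
        // UTF-16 big endian: high byte first -> 00-41
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("A")));
        // UTF-8 stores ASCII in a single byte, identical to plain ASCII -> 41
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes("A")));
        // A Chinese character takes two bytes in UTF-16 but three in UTF-8
        Console.WriteLine(Encoding.Unicode.GetBytes("\u4e2d").Length); // 2
        Console.WriteLine(Encoding.UTF8.GetBytes("\u4e2d").Length);    // 3
    }
}
```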

Here, we can use UltraEdit's hexadecimal view to inspect the storage layout of files in different encodings. Save a text file with the same content as ANSI, Unicode, Unicode big endian, and UTF-8, open each file in UltraEdit, and switch to the hexadecimal view. As you can see, the ANSI file has no prefix, with one byte per ASCII character and two bytes per Chinese character. The Unicode-encoded file has the prefix FF FE, and the Unicode big endian file has the prefix FE FF; in both, every character takes two bytes, so there are many 00 bytes, which inevitably causes great inconvenience for string operations in C. The UTF-8 file has the prefix EF BB BF, and each Chinese character takes three bytes. In the same way, you can inspect the files generated by VC and VS to see their encodings.
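The prefixes seen in the hex view are byte order marks (BOMs). In C#, each Encoding reports the BOM it writes via GetPreamble(), so the values above can be confirmed without a hex editor:

```csharp
using System;
using System.Text;

class BomDemo
{
    static void Main()
    {
        // UTF8Encoding(true) emits a BOM, matching a file saved as "UTF-8" in Notepad
        Console.WriteLine(BitConverter.ToString(new UTF8Encoding(true).GetPreamble()));    // EF-BB-BF
        // UTF-16 little endian BOM
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        // UTF-16 big endian BOM
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF
    }
}
```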


Now suppose I have a UTF-8 encoded file containing both Chinese and English text. To print each character on its own line, you can use the following program:

string path = "C:\\Users\\Yelbosh\\Desktop\\tem.txt";
System.IO.StreamReader sr = new System.IO.StreamReader(path, Encoding.UTF8);
string tempstr = sr.ReadToEnd();
foreach (char tempc in tempstr)
{
    Console.WriteLine(tempc.ToString());
}
Other encodings can be used here as well. The key pieces are the foreach loop over the string and the StreamReader, which decodes the file content according to whichever encoding you pass in.
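Note that passing the wrong Encoding to the decoder produces garbled text (mojibake) rather than an error. This sketch misreads UTF-8 bytes with ISO-8859-1, a single-byte encoding that is always available in .NET:

```csharp
using System;
using System.Text;

class WrongEncoding
{
    static void Main()
    {
        string original = "\u4e2d\u6587";                    // "Chinese" (two CJK characters)
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original); // six bytes: E4 B8 AD E6 96 87
        // A single-byte decoder turns each byte into one garbled character
        string garbled = Encoding.GetEncoding("ISO-8859-1").GetString(utf8Bytes);
        Console.WriteLine(garbled.Length);                   // 6
        // Decoding with the correct encoding recovers the text
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes) == original); // True
    }
}
```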


C# also provides convenient ways to convert byte arrays between different encodings; the Encoding.Convert method does the conversion directly.

string str = "How are you? I am actually good at h";
System.Text.Encoding gb2 = System.Text.Encoding.GetEncoding("GB2312");
byte[] b1 = gb2.GetBytes(str);
/*
foreach (byte b in b1)
{
    Console.Write(b.ToString("X2")); Console.Write("\t");
}

Console.Write("\n");
*/
System.Text.Encoding uni = System.Text.Encoding.GetEncoding("Unicode");
string msg = uni.GetString(System.Text.Encoding.Convert(gb2, uni, b1));
Console.Write(msg);

Through these experiments we can verify the conclusions above. We can also see that how a string is displayed is decoupled from how it is stored: once decoded, the text is no longer tied to any particular encoding.


But C# does not expose its source code, so questions remain: how does a computer convert between different encodings? How does it recognize variable-length storage? And how would you implement conversions between character encodings in C++? Stay tuned to this blog~
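As a partial answer to the variable-length question, here is a minimal sketch (my own helper, not a library API) of how a UTF-8 decoder knows how many bytes a character occupies: the lead byte alone encodes the sequence length.

```csharp
using System;

class Utf8LeadByte
{
    // UTF-8 sequence length from the lead byte:
    // 0xxxxxxx -> 1 byte, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4
    public static int SequenceLength(byte lead)
    {
        if ((lead & 0x80) == 0x00) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        throw new ArgumentException("continuation or invalid lead byte");
    }

    static void Main()
    {
        Console.WriteLine(SequenceLength(0x41)); // 1: 'A'
        Console.WriteLine(SequenceLength(0xE4)); // 3: lead byte of a CJK character
    }
}
```

This is why UTF-8 files contain no stray 00 bytes for ASCII text and can be scanned character by character without a table lookup.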
