ZZ transcoding problem


http://blog.csdn.net/tge7618291/article/details/7599902

Original article by the blogger; do not reproduce without the blogger's permission.

<<Unicode and UTF-8 (C Language Implementation)>> Tags: encoding, c
1. Basic
1.1 ASCII code
We know that inside a computer, all information is ultimately represented as a binary string. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; such a group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that standardized the mapping between English characters and bit patterns. It is known as ASCII and is still in use today.
ASCII specifies the encoding of 128 characters; for example, the space character (SPACE) is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printable control characters: codes 0 to 31, plus 127) occupy only the low 7 bits of a byte; the highest bit is always 0.
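The "highest bit is always 0" property gives a simple test for pure ASCII data. The following is a minimal sketch; the function name `is_ascii` is hypothetical and not part of the original article.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if every byte in buf has its highest bit
 * clear (pure 7-bit ASCII), otherwise 0. An empty buffer counts as ASCII. */
int is_ascii(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80)      /* highest bit set: outside the ASCII range */
            return 0;
    }
    return 1;
}
```

For instance, the two bytes 65 and 32 ("A" and SPACE) pass this check, while a byte with value 130 does not.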
1.2 Non-ASCII encoding
128 symbols are enough to encode English, but not to represent other languages. French, for example, places accent marks above some letters, which ASCII cannot represent. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, the code for é in a French encoding is 130 (binary 10000010). In this way, the encodings used in these European countries can represent up to 256 symbols.
However, this creates a new problem. Different countries use different letters, so although they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, but the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding.
Note: In all of these encodings, 0-127 represent the same symbols; the encodings differ only in the range 128-255.
Asian languages use far more symbols; there are as many as about 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to express one symbol. For example, the common Simplified Chinese encoding GB2312 uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.
2. Unicode
2.1 Definition of Unicode
As mentioned in the previous section, there are many encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if it is interpreted with the wrong encoding, the text comes out garbled. Why do e-mails often appear garbled? Because the sender and the recipient use different encodings.
Now imagine a single encoding that includes every symbol in the world and gives each symbol a unique code. The garbling problem would disappear. This is Unicode: as its name indicates, an encoding of all symbols.
Unicode is also a character encoding, but it is designed by international organizations as a coding scheme that can accommodate all the world's languages. Its formal name is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".
Unicode is of course a very large set; it now has room for more than one million symbols. Each symbol has its own code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character "严" (strict). For the full symbol table, consult unicode.org or a dedicated Chinese character table.
2.2 Issues with Unicode
It is important to note that Unicode is only a set of symbols: it specifies the binary code of each symbol, but it does not specify how that binary code should be stored.
For example, the Unicode code of the Chinese character "严" is hexadecimal 4E25, which is a full 15 bits in binary (100111000100101), so representing this symbol takes at least 2 bytes. Larger symbols may take 3 or 4 bytes, or even more.
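The bit count in this example can be checked mechanically. The following is a minimal sketch; the names `significant_bits` and `raw_bytes_needed` are hypothetical, and "raw bytes" here means storing the code value directly, with no encoding scheme applied.

```c
/* Hypothetical helper: number of significant bits in code (0 when code is 0). */
int significant_bits(unsigned long code)
{
    int bits = 0;

    while (code != 0) {
        bits++;
        code >>= 1;
    }
    return bits;
}

/* Hypothetical helper: minimum number of whole bytes needed to hold the raw
 * code value. A value of 0 still occupies one byte. */
int raw_bytes_needed(unsigned long code)
{
    int bits = significant_bits(code);

    return bits == 0 ? 1 : (bits + 7) / 8;
}
```

Calling `significant_bits(0x4E25)` yields 15 and `raw_bytes_needed(0x4E25)` yields 2, matching the text, while `raw_bytes_needed(0x41)` yields 1 for the letter A.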
This raises two serious problems. First, how can Unicode be told apart from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would carry two or three leading zero bytes. That is a huge waste of storage: text files would become two or three times larger, which is unacceptable.
These problems had two results:
1) Several storage formats for Unicode appeared; that is, many different binary formats can be used to represent Unicode.
2) Unicode could not be widely adopted for a long time, until the advent of the Internet.
3. UTF-8
The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat: the relationship is that UTF-8 is one of the ways Unicode is implemented.
The most notable feature of UTF-8 is that it is a variable-length encoding. In the scheme described here it uses 1 to 6 bytes per symbol, with the length depending on the symbol. (The current standard, RFC 3629, limits UTF-8 to at most 4 bytes, which covers code points up to U+10FFFF.)
3.1 UTF-8 encoding rules
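The UTF-8 encoding rules are simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is identical to ASCII.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits are filled with the symbol's Unicode code.
The following table summarizes the encoding rules; the letter x marks the bits that are available for encoding.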
#txt---
   |  Unicode symbol range  |  UTF-8 encoding
 n |  (hex)                 |  (binary)
---+------------------------+------------------------------------------------------
 1 | 0000 0000 - 0000 007F  |                                              0xxxxxxx
 2 | 0000 0080 - 0000 07FF  |                                     110xxxxx 10xxxxxx
 3 | 0000 0800 - 0000 FFFF  |                            1110xxxx 10xxxxxx 10xxxxxx
 4 | 0001 0000 - 001F FFFF  |                   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 5 | 0020 0000 - 03FF FFFF  |          111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 6 | 0400 0000 - 7FFF FFFF  | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

                    Table 1. UTF-8 encoding rules
#txt---End
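The row lookup in the table can be sketched as a chain of range checks. This is only an illustration; the function name `utf8_length_for` is hypothetical. It assumes the 4-byte form covers its full 21 payload bits (up to 001F FFFF), which is what the three x bits in the lead byte plus three continuation bytes can hold.

```c
/* Hypothetical helper: number of UTF-8 bytes (1..6) that the table assigns
 * to a code value, or 0 if the value is above the 31-bit range. */
int utf8_length_for(unsigned long code)
{
    if (code <= 0x0000007FUL) return 1;   /* row 1:  7 payload bits */
    if (code <= 0x000007FFUL) return 2;   /* row 2: 11 payload bits */
    if (code <= 0x0000FFFFUL) return 3;   /* row 3: 16 payload bits */
    if (code <= 0x001FFFFFUL) return 4;   /* row 4: 21 payload bits */
    if (code <= 0x03FFFFFFUL) return 5;   /* row 5: 26 payload bits */
    if (code <= 0x7FFFFFFFUL) return 6;   /* row 6: 31 payload bits */
    return 0;                             /* not representable */
}
```

Under RFC 3629 only the first four rows are valid, so a stricter variant would also return 0 for anything above 0x10FFFF.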

Below, the Chinese character "严" is again used as an example to show how UTF-8 encoding works.
The Unicode code of "严" is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of "严" takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of "严", fill in the x positions of the format from back to front, and pad any remaining x positions with 0. The result is that the UTF-8 encoding of "严" is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
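The back-to-front filling described above can be written out for the three-byte case. This is a minimal sketch under the assumption that only the third row of the table is handled; the function name `encode_three_byte` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes a code value in the range 0800..FFFF into the
 * three-byte format 1110xxxx 10xxxxxx 10xxxxxx.
 * Returns 3 on success, or 0 if the value is outside that range. */
int encode_three_byte(unsigned long code, unsigned char out[3])
{
    assert(out != NULL);
    if (code < 0x0800UL || code > 0xFFFFUL)
        return 0;

    /* Fill from the back: the last 6 bits go into the last byte. */
    out[2] = (unsigned char)(0x80 | (code & 0x3F));
    /* The next 6 bits go into the middle byte. */
    out[1] = (unsigned char)(0x80 | ((code >> 6) & 0x3F));
    /* The remaining top 4 bits go into the lead byte (zero-padded if short). */
    out[0] = (unsigned char)(0xE0 | ((code >> 12) & 0x0F));
    return 3;
}
```

For the value 0x4E25 this produces the bytes E4, B8, A5, the same result as the manual derivation.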
4. Little endian and Big endian
Unicode codes can also be stored directly, for example in the UCS-2 format. Take the Chinese character "严": its Unicode code is 4E25, which needs two bytes of storage, one byte being 4E and the other 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.
    Big Endian (4E25)    Little Endian (254E)
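The two storage orders can be sketched as follows. The function names `store_ucs2_be` and `store_ucs2_le` are hypothetical; the sketch uses shifts rather than memory copies, so it behaves the same on any host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: stores a 16-bit code with the high byte first. */
void store_ucs2_be(unsigned int code, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)((code >> 8) & 0xFF);   /* high byte in front */
    out[1] = (unsigned char)(code & 0xFF);
}

/* Hypothetical helper: stores a 16-bit code with the low byte first. */
void store_ucs2_le(unsigned int code, unsigned char out[2])
{
    assert(out != NULL);
    out[0] = (unsigned char)(code & 0xFF);          /* low byte in front */
    out[1] = (unsigned char)((code >> 8) & 0xFF);
}
```

With the value 0x4E25, the first function writes 4E 25 and the second writes 25 4E.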
In other words, storing the first (high) byte in front is the "big endian" way, and storing the second (low) byte in front is the "little endian" way.
4.1 How does the computer know which byte order a particular file uses? (ZERO WIDTH NO-BREAK SPACE, FEFF)
The Unicode specification defines a character that is placed at the very beginning of a file to indicate the byte order. Its name is "ZERO WIDTH NO-BREAK SPACE" and its code is FEFF.
    Big Endian (FEFF)    Little Endian (FFFE)
Note: If the first two bytes of a text file are FE FF, the file uses big endian; if the first two bytes are FF FE, the file uses little endian.
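The check in the note can be sketched as a small function. The enum and the name `detect_byte_order` are hypothetical, and the sketch only looks at the two-byte marks described here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for the byte-order check. */
enum byte_order { ORDER_UNKNOWN = 0, ORDER_BIG_ENDIAN, ORDER_LITTLE_ENDIAN };

/* Hypothetical helper: inspects the first two bytes of a buffer.
 * FE FF means big endian, FF FE means little endian; anything else
 * (including a buffer shorter than two bytes) is reported as unknown. */
enum byte_order detect_byte_order(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);
    if (len < 2)
        return ORDER_UNKNOWN;
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return ORDER_BIG_ENDIAN;
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return ORDER_LITTLE_ENDIAN;
    return ORDER_UNKNOWN;
}
```

A file without such a mark yields `ORDER_UNKNOWN`, in which case the byte order has to be known some other way.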
5. Conversion between Unicode and UTF-8
Table 1 shows the relationship between Unicode and UTF-8 clearly. The conversion between the two is implemented in C below.
1) Convert the Unicode (UCS-2 and UCS-4) encoding of one character to UTF-8.
#c---
#include <assert.h>
#include <stddef.h>

/*****************************************************************************
 * Converts the Unicode (UCS-2 and UCS-4) encoding of one character to UTF-8.
 *
 * Parameters:
 *    unic     Unicode code value of the character
 *    poutput  pointer to the output buffer that receives the UTF-8 bytes
 *    outsize  size of the poutput buffer in bytes
 *
 * Return value:
 *    The number of bytes in the UTF-8 encoding of the character,
 *    or 0 if an error occurs.
 *
 * Notes:
 *    1. UTF-8 has no byte-order problem, but Unicode has a byte-order
 *       requirement. Byte order is either big endian or little endian.
 *       Intel processors are little endian (the low address holds the
 *       low byte).
 *    2. Make sure the poutput buffer is at least 6 bytes long!
 ****************************************************************************/
int enc_unicode_to_utf8_one(unsigned long unic, unsigned char *poutput,
        int outsize)
{
    assert(poutput != NULL);
    assert(outsize >= 6);

    if (unic <= 0x0000007F) {
        /* U-00000000 - U-0000007F:  0xxxxxxx */
        *poutput     = (unic & 0x7F);
        return 1;
    } else if (unic >= 0x00000080 && unic <= 0x000007FF) {
        /* U-00000080 - U-000007FF:  110xxxxx 10xxxxxx */
        *(poutput+1) = (unic & 0x3F) | 0x80;
        *poutput     = ((unic >> 6) & 0x1F) | 0xC0;
        return 2;
    } else if (unic >= 0x00000800 && unic <= 0x0000FFFF) {
        /* U-00000800 - U-0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx */
        *(poutput+2) = (unic & 0x3F) | 0x80;
        *(poutput+1) = ((unic >>  6) & 0x3F) | 0x80;
        *poutput     = ((unic >> 12) & 0x0F) | 0xE0;
        return 3;
    } else if (unic >= 0x00010000 && unic <= 0x001FFFFF) {
        /* U-00010000 - U-001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        *(poutput+3) = (unic & 0x3F) | 0x80;
        *(poutput+2) = ((unic >>  6) & 0x3F) | 0x80;
        *(poutput+1) = ((unic >> 12) & 0x3F) | 0x80;
        *poutput     = ((unic >> 18) & 0x07) | 0xF0;
        return 4;
    } else if (unic >= 0x00200000 && unic <= 0x03FFFFFF) {
        /* U-00200000 - U-03FFFFFF:
         * 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx */
        *(poutput+4) = (unic & 0x3F) | 0x80;
        *(poutput+3) = ((unic >>  6) & 0x3F) | 0x80;
        *(poutput+2) = ((unic >> 12) & 0x3F) | 0x80;
        *(poutput+1) = ((unic >> 18) & 0x3F) | 0x80;
        *poutput     = ((unic >> 24) & 0x03) | 0xF8;
        return 5;
    } else if (unic >= 0x04000000 && unic <= 0x7FFFFFFF) {
        /* U-04000000 - U-7FFFFFFF:
         * 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx */
        *(poutput+5) = (unic & 0x3F) | 0x80;
        *(poutput+4) = ((unic >>  6) & 0x3F) | 0x80;
        *(poutput+3) = ((unic >> 12) & 0x3F) | 0x80;
        *(poutput+2) = ((unic >> 18) & 0x3F) | 0x80;
        *(poutput+1) = ((unic >> 24) & 0x3F) | 0x80;
        *poutput     = ((unic >> 30) & 0x01) | 0xFC;
        return 6;
    }

    return 0;
}
#c---End
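The six branches of the encoder all follow one pattern: continuation bytes are filled from the back, six bits at a time, and whatever is left goes into the lead byte. The following sketch expresses that pattern as a loop. It is an alternative illustration, not the author's implementation; the name `utf8_encode_loop` and its `size_t` interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes one code value with the 1..6-byte scheme of
 * table 1. Returns the number of bytes written, or 0 if the value is out of
 * range or the buffer is too small for the encoding. */
size_t utf8_encode_loop(unsigned long code, unsigned char *out, size_t outsize)
{
    /* Lead-byte prefix for each length 1..6 (index 0 is unused). */
    static const unsigned char lead_prefix[7] =
        { 0, 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    /* Largest code value that fits in each length (index 0 is unused). */
    static const unsigned long max_code[7] =
        { 0, 0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL };
    size_t n, i;

    assert(out != NULL);

    /* Find the shortest length whose range contains the code value. */
    for (n = 1; n <= 6 && code > max_code[n]; n++)
        ;
    if (n > 6 || n > outsize)
        return 0;

    /* Fill the continuation bytes from the back, 6 bits at a time. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (code & 0x3F));
        code >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    out[0] = (unsigned char)(lead_prefix[n] | code);
    return n;
}
```

Unlike the function above, this sketch only requires the buffer to be as large as the encoding actually needs, so a 3-byte buffer is enough for a code value such as 0x4E25.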
2) Convert the UTF-8 encoding of one character to Unicode (UCS-2 and UCS-4).
#c---
#include <assert.h>
#include <stddef.h>

/* NOTE: enc_get_utf8_size() is called below but is not listed in this
 * article. This definition is an assumption inferred from how it is used:
 * it returns 0 for a single-byte (ASCII) lead byte, 2..6 for a multi-byte
 * lead byte, and -1 for a byte that cannot start a character. */
static int enc_get_utf8_size(unsigned char c)
{
    if (c < 0x80) return 0;     /* 0xxxxxxx */
    if (c < 0xC0) return -1;    /* 10xxxxxx: continuation byte */
    if (c < 0xE0) return 2;     /* 110xxxxx */
    if (c < 0xF0) return 3;     /* 1110xxxx */
    if (c < 0xF8) return 4;     /* 11110xxx */
    if (c < 0xFC) return 5;     /* 111110xx */
    if (c < 0xFE) return 6;     /* 1111110x */
    return -1;                  /* 0xFE and 0xFF never start a character */
}

/*****************************************************************************
 * Converts the UTF-8 encoding of one character to Unicode (UCS-2 and UCS-4).
 *
 * Parameters:
 *    pinput  pointer to the input buffer, UTF-8 encoded
 *    unic    pointer to the output value; it receives the Unicode code
 *            value as an unsigned long
 *
 * Return value:
 *    On success, the number of bytes consumed by the UTF-8 encoding of the
 *    character; on failure, 0.
 *
 * Notes:
 *    1. UTF-8 has no byte-order problem, but Unicode has a byte-order
 *       requirement. The value is assembled with shifts here, so the
 *       result does not depend on the byte order of the host.
 ****************************************************************************/
int enc_utf8_to_unicode_one(const unsigned char *pinput, unsigned long *unic)
{
    assert(pinput != NULL && unic != NULL);

    /* b1 is the first (highest) byte of the UTF-8 sequence in pinput,
     * b2 the second, and so on. */
    unsigned char b1, b2, b3, b4, b5, b6;

    *unic = 0x0;    /* initialize *unic to all zeros */
    int utfbytes = enc_get_utf8_size(*pinput);

    switch (utfbytes) {
    case 0:
        *unic = *pinput;
        utfbytes += 1;
        break;
    case 2:
        b1 = *pinput;
        b2 = *(pinput + 1);
        if ((b2 & 0xC0) != 0x80)
            return 0;
        *unic = ((unsigned long)(b1 & 0x1F) << 6) | (b2 & 0x3F);
        break;
    case 3:
        b1 = *pinput;
        b2 = *(pinput + 1);
        b3 = *(pinput + 2);
        if (((b2 & 0xC0) != 0x80) || ((b3 & 0xC0) != 0x80))
            return 0;
        *unic = ((unsigned long)(b1 & 0x0F) << 12)
              | ((unsigned long)(b2 & 0x3F) << 6) | (b3 & 0x3F);
        break;
    case 4:
        b1 = *pinput;
        b2 = *(pinput + 1);
        b3 = *(pinput + 2);
        b4 = *(pinput + 3);
        if (((b2 & 0xC0) != 0x80) || ((b3 & 0xC0) != 0x80)
                || ((b4 & 0xC0) != 0x80))
            return 0;
        *unic = ((unsigned long)(b1 & 0x07) << 18)
              | ((unsigned long)(b2 & 0x3F) << 12)
              | ((unsigned long)(b3 & 0x3F) << 6) | (b4 & 0x3F);
        break;
    case 5:
        b1 = *pinput;
        b2 = *(pinput + 1);
        b3 = *(pinput + 2);
        b4 = *(pinput + 3);
        b5 = *(pinput + 4);
        if (((b2 & 0xC0) != 0x80) || ((b3 & 0xC0) != 0x80)
                || ((b4 & 0xC0) != 0x80) || ((b5 & 0xC0) != 0x80))
            return 0;
        *unic = ((unsigned long)(b1 & 0x03) << 24)
              | ((unsigned long)(b2 & 0x3F) << 18)
              | ((unsigned long)(b3 & 0x3F) << 12)
              | ((unsigned long)(b4 & 0x3F) << 6) | (b5 & 0x3F);
        break;
    case 6:
        b1 = *pinput;
        b2 = *(pinput + 1);
        b3 = *(pinput + 2);
        b4 = *(pinput + 3);
        b5 = *(pinput + 4);
        b6 = *(pinput + 5);
        if (((b2 & 0xC0) != 0x80) || ((b3 & 0xC0) != 0x80)
                || ((b4 & 0xC0) != 0x80) || ((b5 & 0xC0) != 0x80)
                || ((b6 & 0xC0) != 0x80))
            return 0;
        *unic = ((unsigned long)(b1 & 0x01) << 30)
              | ((unsigned long)(b2 & 0x3F) << 24)
              | ((unsigned long)(b3 & 0x3F) << 18)
              | ((unsigned long)(b4 & 0x3F) << 12)
              | ((unsigned long)(b5 & 0x3F) << 6) | (b6 & 0x3F);
        break;
    default:
        return 0;
    }

    return utfbytes;
}
#c---End
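Decoding reverses the encoding pattern: the lead byte gives the length and the top bits, and each continuation byte contributes six more bits. The following loop-based sketch illustrates this. It is an alternative illustration under stated assumptions, not the author's implementation: the name `utf8_decode_loop` is hypothetical, and it takes an explicit input length so that it never reads past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decodes one character from in[0..len-1] using the
 * 1..6-byte scheme of table 1. On success stores the code value in *code
 * and returns the number of bytes consumed; returns 0 on a bad lead byte,
 * a bad continuation byte, or truncated input. */
size_t utf8_decode_loop(const unsigned char *in, size_t len,
                        unsigned long *code)
{
    size_t n, i;
    unsigned long value;

    assert(in != NULL && code != NULL);
    if (len == 0)
        return 0;

    /* The lead byte determines the length and supplies the top bits. */
    if      ((in[0] & 0x80) == 0x00) { n = 1; value = in[0]; }
    else if ((in[0] & 0xE0) == 0xC0) { n = 2; value = in[0] & 0x1F; }
    else if ((in[0] & 0xF0) == 0xE0) { n = 3; value = in[0] & 0x0F; }
    else if ((in[0] & 0xF8) == 0xF0) { n = 4; value = in[0] & 0x07; }
    else if ((in[0] & 0xFC) == 0xF8) { n = 5; value = in[0] & 0x03; }
    else if ((in[0] & 0xFE) == 0xFC) { n = 6; value = in[0] & 0x01; }
    else
        return 0;               /* 10xxxxxx, 0xFE or 0xFF */

    if (n > len)
        return 0;               /* input ends in the middle of a character */

    /* Each continuation byte must be 10xxxxxx and adds 6 bits. */
    for (i = 1; i < n; i++) {
        if ((in[i] & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (in[i] & 0x3F);
    }

    *code = value;
    return n;
}
```

Given the bytes E4 B8 A5, this sketch returns 3 and stores 0x4E25. Like the function above, it does not reject overlong encodings, which a production decoder should do.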
