Explanation of Common Encodings
Author: Li Jinnan

Abstract: After sorting through material of various kinds, this article describes the conversion algorithms of the common character encodings in detail.

I. Universal Character Set (UCS)
ISO/IEC 10646-1 [ISO-10646] defines a character set wider than 8 bits, known as the Universal Character Set (UCS), which contains most of the world's writing systems. Two multi-octet encodings are defined: UCS-4, in which each character occupies four octets, and UCS-2, in which each character occupies two octets. UCS-2 can only express the first 64K characters of the UCS; the parts beyond this range have not yet been assigned.

II. Basic Multilingual Plane (BMP)
ISO 10646 defines a 31-bit character set. So far, however, only the first 65534 code positions (0x0000 to 0xFFFD) of this huge coding space have been allocated. This 16-bit subset of the UCS is called the Basic Multilingual Plane (BMP).

III. Unicode
Historically there were two independent attempts to create a single universal character set: the ISO 10646 project of the International Organization for Standardization (ISO), and the Unicode project organized by a consortium of (mostly American) multilingual software manufacturers. Fortunately, around 1991 the participants of both projects realized that the world does not need two different universal character sets. They merged their work and cooperated on a single code table. Both projects still exist and publish their standards independently, but the Unicode Consortium and ISO/IEC JTC1/SC2 have agreed to keep the Unicode and ISO 10646 code tables compatible and to coordinate closely on any future extensions. The Unicode standard additionally defines a great deal of character-related semantics, which generally makes it the better reference for high-quality typesetting and publishing systems.
IV. UTF-8 Encoding
UCS-2 and UCS-4 are hard to use in many current applications and protocols, which assume that characters are 7-bit or 8-bit bytes; even newer systems that can handle 16-bit characters cannot process UCS-4 data. This situation led to the development of a family of formats called UCS Transformation Formats (UTF), each with different characteristics. UTF-8 (RFC 2279) uses all bits of each octet but preserves the full US-ASCII value range: US-ASCII characters are encoded as single octets with their usual US-ASCII values, so any octet with such a value can only stand for a US-ASCII character, never for anything else. UTF-8 has the following features:
1) Conversion between UTF-8 and either UCS-4 or UCS-2 is easy in both directions.
2) The first octet of a multi-octet sequence indicates the number of octets in the sequence.
3) The octet values FE and FF never appear.
4) Character boundaries are easy to find in a dense octet stream.

Definition of UTF-8: in UTF-8, a character is encoded as a sequence of 1 to 6 octets. In a single-octet sequence, the high bit of the octet is 0 and the other 7 bits encode the character value. In a sequence of n (n > 1) octets, the top n bits of the initial octet are 1, followed by a 0 bit; the remaining bits of this octet carry bits of the encoded character value. In every following octet the highest bit is 1 and the next bit is 0, leaving 6 bits per octet for the character value. The table below summarizes the octet formats; the letter x marks a bit taken from the UCS-4 character value.
UCS-4 range (hexadecimal)    UTF-8 sequence (binary)
0000 0000 - 0000 007F        0xxxxxxx
0000 0080 - 0000 07FF        110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF        1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 001F FFFF        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000 - 03FF FFFF        111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000 - 7FFF FFFF        1111110x 10xxxxxx ... 10xxxxxx

The encoding rules from UCS-4 to UTF-8 are as follows:
1) Determine the number of octets from the character value and the first column of the table above. Note that the rows of the table are mutually exclusive: a given UCS-4 character has exactly one valid encoding.
2) Prepare the high-order bits of each octet according to the second column of the table.
3) Fill the positions marked x from the bits of the character value, starting with the low-order bits: fill the last octet of the UTF-8 sequence first, then work toward the first octet, and repeat until all positions marked x are filled.

Here we implement only the conversion from Unicode to UTF-8. A Unicode character is two bytes, defined as:

typedef unsigned short WCHAR;

// The output UTF-8 sequence is at most three bytes.
int UnicodeToUTF8(WCHAR ucs2, unsigned char *buffer)
{
    memset(buffer, 0, 4);
    if (ucs2 <= 0x007F)                            // one UTF-8 byte
    {
        buffer[0] = (unsigned char)ucs2;
        return 1;
    }
    if ((0x0080 <= ucs2) && (ucs2 <= 0x07FF))      // two UTF-8 bytes
    {
        buffer[1] = 0x80 | (unsigned char)(ucs2 & 0x003F);
        buffer[0] = 0xC0 | (unsigned char)((ucs2 >> 6) & 0x001F);
        return 2;
    }
    if (0x0800 <= ucs2)                            // three UTF-8 bytes
    {
        buffer[2] = 0x80 | (unsigned char)(ucs2 & 0x003F);
        buffer[1] = 0x80 | (unsigned char)((ucs2 >> 6) & 0x003F);
        buffer[0] = 0xE0 | (unsigned char)((ucs2 >> 12) & 0x000F);
        return 3;
    }
    return 0;
}

In theory, the UCS-4 to UTF-8 algorithm above also covers UCS-2: simply extend each UCS-2 character with two zero-value octets.
However, UCS-2 values from D800 to DFFF (the surrogate pairs of Unicode) are actually UTF-16 encodings of UCS-4 characters and need special treatment: the UTF-16 transformation must be undone first, converting to the UCS-4 character, which is then encoded by the process above.

Decoding from UTF-8 to UCS-4 proceeds as follows:
1) Initialize all four octets of the UCS-4 character to 0.
2) Determine which encoding applies from the leading octet of the sequence and the second column of the table above (the bits marked x).
3) Distribute the bits of the encoded sequence into the UCS-4 character: start from the last octet of the sequence and proceed toward the left until all bits marked x are filled.
If the UTF-8 sequence is no longer than 3 octets, the decoded value can be assigned directly to a UCS-2 character.

WCHAR UTF8ToUnicode(unsigned char *buffer)
{
    WCHAR temp = 0;
    if (buffer[0] < 0x80)                             // one UTF-8 byte
    {
        temp = buffer[0];
    }
    if ((0xC0 <= buffer[0]) && (buffer[0] < 0xE0))    // two UTF-8 bytes
    {
        temp = buffer[0] & 0x1F;
        temp = temp << 6;
        temp = temp | (buffer[1] & 0x3F);
    }
    if ((0xE0 <= buffer[0]) && (buffer[0] < 0xF0))    // three UTF-8 bytes
    {
        temp = buffer[0] & 0x0F;
        temp = temp << 6;
        temp = temp | (buffer[1] & 0x3F);
        temp = temp << 6;
        temp = temp | (buffer[2] & 0x3F);
    }
    if ((0x80 <= buffer[0]) && (buffer[0] < 0xC0))    // not the first byte of a UTF-8 character
        return 0xFEFF;    // 0xFEFF never appears in ordinary text
    if (buffer[0] >= 0xF0)                            // sequence longer than 3 bytes
        return 0;
    return temp;
}

Note: a real implementation of this decoding algorithm must be protected against invalid sequences. For example, an implementation might (incorrectly) decode the invalid UTF-8 sequence 0xC0 0x80 as the character U+0000, which can cause security or other problems (such as treating 0 as an array terminator). More detailed algorithms and formulas can be found in [FSS_UTF], [UNICODE], or [ISO-10646] Appendix R.
V. UTF-7 Encoding
UTF-7: A Mail-Safe Transformation Format of Unicode (RFC 1642). This encoding represents Unicode using 7-bit ASCII and is designed to carry text through mail gateways that can only pass 7-bit data. UTF-7 represents English letters, digits, and common symbols directly, and encodes all other characters with a modified Base64; the symbols '+' and '-' switch the Base64 encoding on and off. So if a folder name shows garbled English words accompanied by '+' and '-' signs, it may well be UTF-7 encoded. The conversion rules defined in the protocol are:
1) Unicode characters in set D may be encoded directly as their ASCII-equivalent bytes. Characters in set O may optionally be encoded directly as ASCII-equivalent bytes, but keep in mind that many characters of set O are illegal in header fields or may not pass correctly through mail gateways.
2) Any Unicode sequence may be encoded with the characters of set B (the modified Base64 alphabet) by prefixing the shift character '+'. The '+' means that the following bytes are to be interpreted as elements of the modified Base64 alphabet until a character not in that alphabet is encountered. Such characters include control characters such as carriage return and line feed, so a shifted Unicode sequence always terminates at the end of a line. Note two special cases: "+-" stands for a literal '+', and in a sequence like "+...--" the first '-' terminates the Base64 run while the second is a real '-' character. In most cases the run needs no explicit terminator.
3) Space, tab, carriage return, and line feed may be represented directly as their ASCII-equivalent bytes.
With these rules we can define the algorithm.
First we define the arrays for the character sets:

typedef unsigned char byte;

// 64 characters for Base64 coding
byte base64Chars[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
// 8 characters that are as safe as the Base64 characters for mail gateways
byte safeChars[] = "'(),-.:?";
// 4 characters that all mean white space
byte spaceChars[] = " \t\n\r";

Note: during encoding we must determine the type of each byte in order to choose the processing rule. Comparing ranges is simple but inefficient, so we use the idea of a hash table: create a 256-entry array that assigns a type to every byte value, and look up each character's type by indexing with its value.

// mask values identifying the type of a byte
#define BASE64 0x01
#define SAFE   0x02
#define SPACE  0x04

byte byteType[256];      // lookup table giving the type of a byte
bool firstTime = true;   // on first use of the library the table must be initialized

// Note: decoding also needs a lookup table that maps a Base64 character
// directly to a number between 0 and 63:
byte base64Value[128];

Both tables must be initialized before use:

void InitUTF7Tables()
{
    byte *s;
    if (!firstTime)
        return;   // not necessary, but more robust
    memset(byteType, 0, 256);
    memset(base64Value, 0, 128);
    for (s = base64Chars; *s != '\0'; s++)
    {
        byteType[*s] |= BASE64;
        base64Value[*s] = s - base64Chars;  // the offset: a 6-bit value, 0-63
    }
    for (s = safeChars; *s != '\0'; s++)
        byteType[*s] |= SAFE;
    for (s = spaceChars; *s != '\0'; s++)
        byteType[*s] |= SPACE;
    firstTime = false;
}

During UTF-7 conversion the handling of the current character depends on a state, namely:
1) currently inside Base64 encoding;
2) currently in direct ASCII encoding;
3) the shift character '+' has just been placed in the UTF-7 buffer.
So we define the related constants:

// the state of the current character
#define IN_ASCII   0
#define IN_BASE64  1
#define AFTER_PLUS 2

Encoding under rule 2 uses Base64, which requires global auxiliary variables:

int state;                // state in which we are working
int nbits;                // number of bits in the bit buffer
unsigned long bitBuffer;  // used for Base64 coding

Here we first implement a simple helper that converts one Unicode character into a UTF-7 byte sequence, writes it into the provided buffer, and returns the number of bytes written. The function affects the three global variables state, nbits, and bitBuffer; before encoding the first character of a Unicode array they must be initialized:

state = IN_ASCII;
nbits = 0;
bitBuffer = 0;

int UnicodeToUTF7(WCHAR ucs2, byte *buffer)
{
    byte *s = buffer;
    int index;
    // an ASCII byte belonging to one of the character sets defined above
    if (((ucs2 & 0xff80) == 0) && (byteType[(byte)ucs2] & (BASE64 | SAFE | SPACE)))
    {
        byte temp = (byte)ucs2;
        if (state == IN_BASE64)   // switch out of Base64 coding here
        {
            if (nbits > 0)        // if some bits remain in the buffer, output them
            {
                index = (bitBuffer << (6 - nbits)) & 0x3f;
                *s++ = base64Chars[index];
            }
            if ((byteType[temp] & BASE64) || (temp == '-'))
                *s++ = '-';       // explicit terminator needed
            state = IN_ASCII;
        }
        *s++ = temp;
        if (temp == '+')
            *s++ = '-';           // "+-" stands for a literal '+'
    }
    else
    {
        if (state == IN_ASCII)
        {
            *s++ = '+';
            state = IN_BASE64;    // Base64 coding begins here
            nbits = 0;
            bitBuffer = 0;
        }
        bitBuffer <<= 16;
        bitBuffer |= ucs2;
        nbits += 16;
        while (nbits >= 6)
        {
            nbits -= 6;
            index = (bitBuffer >> nbits) & 0x3f;  // output the high 6 bits
            *s++ = base64Chars[index];
        }
    }
    return (s - buffer);
}

For a valid Unicode character array, you can pass its characters one by one, repeatedly calling the
above function to obtain a UTF-7 byte sequence. Note that the last Unicode character of the array should be one whose equivalent lies in the three character-set arrays above, since otherwise bits left in the Base64 buffer are never flushed.

Next we implement a simple decoding function: feed it one UTF-7 byte, and it may return a valid Unicode character, or it may not produce one yet, for example when a '+' is encountered or a character has not been completely assembled; in that case it returns the flag character 0xFEFF, which is otherwise used to mark Unicode-encoded text. Note: the function affects the three global variables state, nbits, and bitBuffer, which must be initialized before processing the first byte:

state = IN_ASCII;
nbits = 0;
bitBuffer = 0;

#define RET0 0xfeff

WCHAR UTF7ToUnicode(byte c)
{
    if (state == IN_ASCII)
    {
        if (c == '+')
        {
            state = AFTER_PLUS;
            return RET0;
        }
        else
            return (WCHAR)c;
    }
    if (state == AFTER_PLUS)
    {
        if (c == '-')
        {
            state = IN_ASCII;
            return (WCHAR)'+';   // "+-" stands for a literal '+'
        }
        else
        {
            state = IN_BASE64;
            nbits = 0;
            bitBuffer = 0;       // not strictly necessary
            // don't return yet; fall through to the IN_BASE64 handling
        }
    }
    // state == IN_BASE64
    if (byteType[c] & BASE64)
    {
        bitBuffer <<= 6;
        bitBuffer |= base64Value[c];
        nbits += 6;
        if (nbits >= 16)
        {
            nbits -= 16;
            return (WCHAR)((bitBuffer >> nbits) & 0x0000ffff);
        }
        return RET0;
    }
    // encountered a byte outside the Base64 character set: switch out of Base64 coding
    state = IN_ASCII;
    if (c != '-')   // a '-' right after the run is only the terminator
        return (WCHAR)c;
    return RET0;
}

For a UTF-7 sequence, you can feed in the bytes one after another, calling the function above and checking its return value, to obtain a Unicode character array.

VI. Recognizing Chinese Characters in GB2312 Encoding
The earliest code for Chinese characters divides them into 94 areas of 94 characters each. Areas 1-15 hold non-Chinese characters (symbols, graphics, and so on), areas 16-55 hold the level-1 Chinese characters, areas 56-87 hold the level-2 Chinese characters, and the areas above 87 are reserved for new characters.
Under Windows we use the default encoding GB2312 (the "basic set of the character set for information interchange" promulgated by the state in 1981). The national standard code is derived from the area-position code:

    national standard code = area-position code + 2020H

To keep Chinese character codes from being confused with ASCII codes when Chinese text is represented in a computer, a further transformation is applied:

    machine internal code = national standard code + 8080H

So the actual GB2312 encoding of Chinese characters on Windows is the machine internal code. Combining the two formulas:

    machine internal code = area-position code + A0A0H

The internal code of a Chinese character is therefore at least A0A0H, so when recognizing Chinese characters in a CString we may assume that a byte whose value is greater than A0H must be part of a Chinese character. In special cases, however, not both bytes of a Chinese character are greater than A0H; for example, one character is encoded as E946H, whose second byte does not satisfy the greater-than-A0H condition.

VII. Conversion Between Multi-Byte Encodings and Unicode in Windows
Windows provides API functions for converting a Unicode character array to a GB2312 string. The Unicode array ends with 0, the so-called null-terminated string. The function below obtains the required size of the output byte string, allocates the space, and performs the actual conversion; release of the pointer is left to the caller, or it can be managed inside a class, for example released in the destructor. The return value is the number of bytes written to the string.
int StringEncode::UnicodeToGB2312(char **dest, const WCHAR *src)
{
    char *buffer;
    // src is a null-terminated WCHAR buffer; first ask for the required size
    int size = ::WideCharToMultiByte(CP_ACP, 0, src, -1, NULL, 0, NULL, NULL);
    buffer = new char[size];
    int ret = ::WideCharToMultiByte(CP_ACP, 0, src, -1, buffer, size, NULL, NULL);
    if (*dest != 0)
        delete[] *dest;
    *dest = buffer;
    return ret;
}

Note: one sometimes sees implementations that allocate size + 1 bytes and write the last byte as '\0' to terminate the string. During debugging, however, I found that the size reported by the system already includes a byte for the '\0', and the terminating '\0' of the final string is written by the system API itself. (My experiment may be mistaken; this needs verification.)

Converting a Unicode character array to UTF-7 or UTF-8 works the same way with WideCharToMultiByte: just change the first parameter, the code page, to CP_UTF7 (65000) or CP_UTF8 (65001).

Likewise, there are functions for converting multi-byte strings into Unicode character arrays. As with the function above, you can first obtain the required size by passing an empty buffer and then allocate the space for the final character array. For efficiency, however, we can trade some space and simply provide a character array that is certainly large enough: in the extreme case (all ASCII) the number of characters equals the number of bytes.
int StringEncode::Gb2312ToUnicode(WCHAR **dest, const char *src)
{
    int length = strlen(src);               // src is a null-terminated buffer
    WCHAR *buffer = new WCHAR[length + 1];  // WCHAR means unsigned short, 2 bytes
    // length WCHARs is always enough room for the converted Unicode characters
    int ret = ::MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED, src, length, buffer, length);
    buffer[ret] = 0;
    if (*dest != 0)
        delete[] *dest;
    *dest = buffer;
    return ret;
}

Note: there is no need to test the buffer pointer before deleting it, because deleting a NULL pointer is harmless; the delete operator provides this guarantee.

VIII. URL Decoding
When IE sends a GET request, the URL is encoded as UTF-8 and then percent-escaped, so captured packet data must be decoded before analysis. The following functions are a simple implementation.

// value of one hex digit ('0'-'9', 'a'-'f', 'A'-'F'), or 0 otherwise
static BYTE CharToHex(char c)
{
    BYTE d;
    if ((c >= '0') && (c <= '9'))
        d = c - '0';
    else if ((c >= 'a') && (c <= 'f'))
        d = c - 'a' + 10;
    else if ((c >= 'A') && (c <= 'F'))
        d = c - 'A' + 10;
    else
        d = 0;
    return d;
}

CString CTestUrlDlg::UrlToString(CString url)
{
    CString str = "";
    int n = url.GetLength();
    for (int i = 0; i < n; i++)
    {
        char c = url[i];
        if ((c == '%') && (i + 2 < n))   // a %XX escape: take the next two hex digits
        {
            BYTE b = (CharToHex(url[i + 1]) << 4) | CharToHex(url[i + 2]);
            str += (char)b;
            i += 2;
        }
        else
            str += c;
    }
    return str;
}

static void UnicodeToGB2312(const WCHAR unicode, char *buffer)
{
    // a single character converts to at most 2 bytes plus the terminator
    ::WideCharToMultiByte(CP_ACP, 0, &unicode, 1, buffer, 3, NULL, NULL);
}

CString CTestUrlDlg::Uft8ToGB(CString url)
{
    CString str = "";
    char buffer[3];
    WCHAR unicode;
    unsigned char *p = (unsigned char *)(LPCTSTR)url;
    int n = url.GetLength();
    int t = 0;
    while (t < n)
    {
        // a variant of UTF8ToUnicode that reads one character starting at
        // p[t] and advances t past the bytes consumed
        unicode = UTF8ToUnicode(p, t);
        memset(buffer, 0, 3);
        UnicodeToGB2312(unicode, buffer);
        str += buffer;
    }
    return str;
}

Example:

CString str = "/mfc%E8%8B%B1%E6%96%87%E6%89%8B%E5%86%8C.CHM";
CString ret = UrlToString(str);
ret = Uft8ToGB(ret);   // the MFC manual: "/mfc" + four Chinese characters + ".CHM"

IX. Summary
Common algorithms also include MIME and others; because of space limits, and because many posts about them are already available on the Internet, I will not repeat them here.
Given my limited personal ability, omissions in this article are inevitable. I hope readers will point them out so that we can all make progress together.