1. First, check for a BOM
UTF-8 BOM: EF BB BF (decimal 239 187 191). If the first three bytes of the file match, the file encoding is UTF-8.
UTF-16LE BOM: FF FE (decimal 255 254). If the first two bytes of the file match, the encoding is UTF-16LE.
UTF-16BE BOM: FE FF (decimal 254 255). If the first two bytes of the file match, the encoding is UTF-16BE.
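The three BOM checks can be sketched as one small function. This is a minimal illustration only: the name DetectBom and the returned strings are assumptions made for this example, not part of the original article.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the encoding implied by a leading BOM,
// or an empty string when none of the three BOMs is present.
std::string DetectBom(const unsigned char* data, std::size_t size) {
    // UTF-8 BOM: EF BB BF (three bytes)
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return "UTF-8";
    }
    // UTF-16LE BOM: FF FE (two bytes)
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return "UTF-16LE";
    }
    // UTF-16BE BOM: FE FF (two bytes)
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return "UTF-16BE";
    }
    return "";  // no BOM: fall back to content-based detection
}
```

Call it on the first few bytes of the file and skip the BOM before decoding. Note that FF FE 00 00 is also the UTF-32LE BOM, which this sketch does not distinguish from UTF-16LE.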
2. Detection when there is no BOM
Without a BOM, UTF-8 has to be detected from the file content.
UTF-8 encoding rules:
Character length   Flag byte   Value bytes
One byte           0xxxxxxx
Two bytes          110xxxxx    10xxxxxx
Three bytes        1110xxxx    10xxxxxx 10xxxxxx
Four bytes         11110xxx    10xxxxxx 10xxxxxx 10xxxxxx
** Data used to test the flag byte **
Define an array bthead of length 4. It holds the decimal values the flag byte is compared against: 0, 192, 224, 240.
Define an array btbitandvalue of length 4. It holds the decimal masks that isolate the length bits of the flag byte: 128, 224, 240, 248.
** Data used to test the value bytes **
Define the variable btvaluehead to hold the decimal value of the value-byte marker: 128
Define the variable btfixvalueand to hold the decimal mask used to extract the marker from a value byte: 192
A. Read the file content as bytes into a byte array.
B. Loop over the bytes that were read.
For the current byte, perform a bitwise AND with each of the four values in btbitandvalue in turn, and compare each result with the value at the same index in bthead. The first index that matches gives the character length L. The following L-1 bytes are value bytes and are checked in step C; B then resumes at the byte after them. If no index matches, the byte is not a valid flag byte and the content is not UTF-8.
C. Check the value bytes. Perform a bitwise AND of the byte with btfixvalueand and compare the result with btvaluehead. If they are equal, repeat C on the next byte until C has run L-1 times. If they are not equal, the content is not UTF-8.
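Steps A through C can be sketched as follows, reusing the table names from the text. The function name LooksLikeUtf8 is an assumption for illustration. The sketch checks only the byte patterns described above; it does not reject overlong encodings or surrogate code points, and pure ASCII input passes even though it is equally valid in many other encodings.

```cpp
#include <cstddef>

// Hypothetical helper: returns true when every byte sequence in the buffer
// follows the UTF-8 flag-byte / value-byte patterns described above.
bool LooksLikeUtf8(const unsigned char* data, std::size_t size) {
    // Values the masked flag byte is compared against, and the masks themselves.
    static const unsigned char bthead[4] = {0, 192, 224, 240};
    static const unsigned char btbitandvalue[4] = {128, 224, 240, 248};
    // Marker (10xxxxxx) and mask for value bytes.
    const unsigned char btvaluehead = 128;
    const unsigned char btfixvalueand = 192;

    std::size_t i = 0;
    while (i < size) {
        // Step B: find the character length L from the flag byte.
        std::size_t length = 0;
        for (std::size_t k = 0; k < 4; ++k) {
            if ((data[i] & btbitandvalue[k]) == bthead[k]) {
                length = k + 1;
                break;
            }
        }
        if (length == 0) return false;         // not a valid flag byte
        if (i + length > size) return false;   // character is cut off at the end

        // Step C: the next L-1 bytes must all look like 10xxxxxx.
        for (std::size_t j = 1; j < length; ++j) {
            if ((data[i + j] & btfixvalueand) != btvaluehead) return false;
        }
        i += length;  // skip the value bytes and resume step B
    }
    return true;
}
```

Because a short or ASCII-only file tells you little, a result of true is best treated as "consistent with UTF-8" rather than proof.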
UTF-16 can be detected in a similar way once you know its encoding rules.
3. Converting the text encoding
APIs used for conversion:
Multi-byte to wide character: MultiByteToWideChar
Wide character to multi-byte: WideCharToMultiByte
MSDN provides detailed parameter descriptions.
Specific steps: first, determine the length needed to hold the conversion result by calling the function with the output buffer size parameter set to 0; the return value is the required length.
Second, allocate a buffer of that length.
Finally, call the function again to perform the conversion.
Remember to release the allocated memory when you are done with it.
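The two-call pattern above can be sketched as below. This is an illustration under assumptions: the wrapper names Utf8ToWide and WideToUtf8 are made up for this example, the code page is assumed to be CP_UTF8, and the code compiles only on Windows (hence the _WIN32 guard). Using std::wstring and std::string for the buffers lets the containers release the memory, which sidesteps the manual cleanup mentioned above.

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Hypothetical wrapper: UTF-8 bytes to a UTF-16 wide string.
// Returns an empty string on failure or for empty input.
// Assumes the input is small enough for its size to fit in an int.
std::wstring Utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int srcLen = static_cast<int>(utf8.size());
    // First call: output size 0 asks only for the required length.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), srcLen, nullptr, 0);
    if (len <= 0) return std::wstring();
    // Allocate the buffer, then call again to convert.
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    int written = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), srcLen, &wide[0], len);
    if (written != len) return std::wstring();
    return wide;
}

// Hypothetical wrapper: UTF-16 wide string to UTF-8 bytes.
std::string WideToUtf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
    const int srcLen = static_cast<int>(wide.size());
    // For CP_UTF8 the last two parameters must be null.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), srcLen,
                                  nullptr, 0, nullptr, nullptr);
    if (len <= 0) return std::string();
    std::string utf8(static_cast<std::size_t>(len), '\0');
    int written = WideCharToMultiByte(CP_UTF8, 0, wide.data(), srcLen,
                                      &utf8[0], len, nullptr, nullptr);
    if (written != len) return std::string();
    return utf8;
}
#endif  // _WIN32
```

If you allocate the buffer yourself with new[] or malloc instead, the same two calls apply, but the buffer must be freed on every exit path, including the failure ones.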
This article is from the CSDN blog; please credit the source when reposting: http://blog.csdn.net/yinzhiqing/archive/2009/11/18/4825789.aspx