File encoding detection (UTF-8, UTF-16) and conversion

1. Check for a BOM first

The UTF-8 BOM is EF BB BF (decimal 239 187 191). If the first three bytes of the file match it, the file is encoded as UTF-8.

The UTF-16LE BOM is FF FE (decimal 255 254). If the first two bytes of the file match it, the file is encoded as UTF-16LE.

The UTF-16BE BOM is FE FF (decimal 254 255). If the first two bytes of the file match it, the file is encoded as UTF-16BE.
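The BOM check above can be sketched in a few lines of Python. The names BOMS and detect_bom are illustrative and do not come from the original article. The sketch only covers the three BOMs listed here; note that FF FE is also the start of the UTF-32LE BOM (FF FE 00 00), which this sketch does not try to distinguish.

```python
# BOM signatures from the list above, as (signature bytes, encoding name) pairs.
# BOMS and detect_bom are hypothetical names chosen for this sketch.
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),   # 239 187 191
    (b"\xff\xfe", "utf-16-le"),   # 255 254
    (b"\xfe\xff", "utf-16-be"),   # 254 255
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if no BOM matches."""
    for signature, name in BOMS:
        if data.startswith(signature):
            return name
    return None
```

In practice only the first few bytes of the file need to be read before calling detect_bom.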

2. Detect the encoding when there is no BOM

Without a BOM, UTF-8 is detected by examining the file content.

UTF-8 encoding rules:

Character length: byte pattern

One byte: 0xxxxxxx

Two bytes: 110xxxxx 10xxxxxx

Three bytes: 1110xxxx 10xxxxxx 10xxxxxx

Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
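To see these patterns on real characters, the following small Python sketch prints the bits of each encoded byte. The helper name bit_patterns is illustrative only, and the sample characters are chosen here, not taken from the original article.

```python
def bit_patterns(ch: str):
    """Return the UTF-8 bytes of one character as 8-digit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


# One sample character for each encoded length (1 to 4 bytes).
for sample in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(sample):04X}", " ".join(bit_patterns(sample)))
```

Each printed row starts with a lead byte matching one of the four patterns, and every following byte starts with the bits 10.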

** Data used to classify the lead byte **

Define an array bthead of length 4 holding the decimal marker values expected in the lead byte: 0, 192, 224, 240.

Define an array btbitandvalue of length 4 holding the decimal masks that isolate the marker bits of the lead byte: 128, 224, 240, 248.

** Data used to check continuation bytes **

Define a variable btvaluehead holding the decimal marker of a continuation byte: 128.

Define a variable btfixvalueand holding the decimal mask that isolates the marker bits of a continuation byte: 192.

A. Read the file content as bytes into a byte array.

B. Loop over the bytes. For the current byte, perform a bitwise AND with each of the four values in btbitandvalue in turn and compare the result with the value at the same index in bthead. The first index at which they are equal gives the character length L (index 0 means one byte, index 3 means four bytes). If no index matches, the byte cannot start a UTF-8 character and the content is not UTF-8. Otherwise apply step C to the next L-1 bytes, skip past them, and repeat step B on the following byte.

C. Check the continuation-byte marker. Perform a bitwise AND on the byte with btfixvalueand and compare the result with btvaluehead. If they are equal, apply step C to the next byte, until it has been applied L-1 times. If they are not equal, the content is not UTF-8.
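Steps A to C can be sketched in Python as follows. This is a minimal sketch, not the author's implementation: the constants reuse the article's names in upper case, while the function name looks_like_utf8 and the handling of a character truncated at the end of the data are assumptions made here.

```python
BTHEAD = (0, 192, 224, 240)           # markers expected in the lead byte
BTBITANDVALUE = (128, 224, 240, 248)  # masks that isolate those marker bits
BTVALUEHEAD = 128                     # marker of a continuation byte (10xxxxxx)
BTFIXVALUEAND = 192                   # mask that isolates the top two bits


def looks_like_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data follows the UTF-8 patterns."""
    i = 0
    while i < len(data):
        # Step B: derive the character length L from the lead byte.
        length = 0
        for k in range(4):
            if (data[i] & BTBITANDVALUE[k]) == BTHEAD[k]:
                length = k + 1
                break
        if length == 0:
            return False  # byte matches no lead-byte pattern
        # Assumption: a character cut off at the end of the data is rejected.
        if i + length > len(data):
            return False
        # Step C: each of the next L-1 bytes must look like 10xxxxxx.
        for j in range(i + 1, i + length):
            if (data[j] & BTFIXVALUEAND) != BTVALUEHEAD:
                return False
        i += length  # skip the checked bytes and return to step B
    return True
```

This is a structural check only. Pure ASCII content passes, since every ASCII byte is a valid one-byte UTF-8 character, and overlong forms such as C0 80 are not rejected, so a stricter validator is needed where that matters.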

UTF-16 can be detected in a similar way once you know its encoding rules.
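The article does not spell out the UTF-16 rules, so the following is only a sketch under stated assumptions: the data must consist of whole 16-bit code units, and a high surrogate (D800 to DBFF) must be followed immediately by a low surrogate (DC00 to DFFF). The function name looks_like_utf16 and its little_endian parameter are illustrative.

```python
import struct


def looks_like_utf16(data: bytes, little_endian: bool = True) -> bool:
    """Structural UTF-16 check: even length and correctly paired surrogates."""
    if len(data) % 2 != 0:
        return False  # UTF-16 is made of 2-byte code units
    count = len(data) // 2
    byte_order = "<" if little_endian else ">"
    units = struct.unpack(f"{byte_order}{count}H", data)
    i = 0
    while i < count:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= count or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            return False  # low surrogate without a preceding high surrogate
        else:
            i += 1
    return True
```

This check is much weaker than the UTF-8 one: most even-length byte sequences pass it, often in both byte orders, so it can rule UTF-16 out but cannot confirm it.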

3. Converting between text encodings

APIs used for conversion:

Multi-byte to wide character: MultiByteToWideChar

Wide character to multi-byte: WideCharToMultiByte

MSDN provides detailed parameter descriptions.

Specific steps: first, determine the length of the buffer needed to hold the conversion result by calling the function with the output buffer size parameter set to 0; it then returns the required size.

Second, allocate a buffer of that length.

Finally, call the function again to perform the conversion.

Remember to free the allocated memory when it is no longer needed.
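The Win32 calls themselves are not shown here. As a rough counterpart, the same kind of conversion can be sketched in Python, where the built-in codecs replace the two API functions and the runtime takes care of sizing and freeing the buffers. The function name convert_encoding is hypothetical.

```python
def convert_encoding(data: bytes, src: str, dst: str) -> bytes:
    """Convert encoded text from the src encoding to the dst encoding."""
    text = data.decode(src)   # source bytes to a Unicode string
    return text.encode(dst)   # Unicode string to target bytes


utf8_bytes = "h\u00e9".encode("utf-8")
utf16_bytes = convert_encoding(utf8_bytes, "utf-8", "utf-16-le")
```

The src argument would typically be the encoding found by the detection steps in sections 1 and 2. A decoding failure raises UnicodeDecodeError, which signals that the detected encoding was wrong.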

This article is from the CSDN blog. Please cite the source when reposting: http://blog.csdn.net/yinzhiqing/archive/2009/11/18/4825789.aspx
