Conversion between UTF-8, UTF-16, UTF-32, UTF-16LE, UTF-16BE, and GBK

Source: Internet
Author: User

Unicode is a character encoding standard developed by unicode.org, and it is supported by most operating systems and programming languages today. The official unicode.org definition is: "Unicode provides a unique number for every character." In other words, Unicode assigns a numeric value to each character. For example, the Unicode value of "a" is 0x0061, and the Unicode value of "一" (the Chinese character for "one") is 0x4E00. This is the simplest case, in which each character is represented by 2 bytes.

Unicode.org defines room for more than a million characters, and representing all of them in one uniform format requires 4 bytes per character. The Unicode representation of "a" then becomes 0x00000061, and that of "一" becomes 0x00004E00. This scheme is UTF-32, the Unicode representation used on Linux operating systems.
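
To make these numbers concrete, here is a minimal C++11 sketch. The helper name `utf32_byte_size` and the variable `sample` are hypothetical and do not come from the original article; the sketch relies only on the fact that a `U"..."` literal stores one code point per element.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes needed to store a string as UTF-32,
// where every character occupies one fixed-width 4-byte unit.
std::size_t utf32_byte_size(const std::u32string& s)
{
    return s.size() * 4;
}

// One code point per element: 'a' is 0x61, and \u4e00 is the
// Chinese character for "one".
const std::u32string sample = U"a\u4e00";
```

With this definition `sample[0]` is 0x61, `sample[1]` is 0x4E00, and the two characters occupy 8 bytes in UTF-32.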

On closer analysis, however, the vast majority of characters can be expressed in just 2 bytes. English falls in the range 0x0000-0x007F and Chinese in the range 0x4E00-0x9F**, and the characters that really need 4 bytes are very few, so some systems use 2 bytes directly to represent Unicode. On Windows, for example, a Unicode character is two bytes. Characters that do not fit in 2 bytes are extended with surrogates: each two-byte unit carries a marker (a value in the range 0xD800-0xDFFF) indicating that it is half of a surrogate pair and must be combined with its neighboring two bytes to form one character. This saves a great deal of storage space and also speeds up processing. This representation of Unicode is UTF-16. On the Windows platform, "Unicode" generally means UTF-16.
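
The surrogate mechanism can be sketched in a few lines of C++. The function names `to_surrogates` and `from_surrogates` are hypothetical; the arithmetic is the standard UTF-16 rule for code points above 0xFFFF, and the sketch assumes the caller has already checked the input range.

```cpp
#include <cstdint>

// Split a code point above 0xFFFF into a UTF-16 surrogate pair.
// Assumes 0x10000 <= cp <= 0x10FFFF; the caller must check the range.
void to_surrogates(std::uint32_t cp, std::uint16_t& high, std::uint16_t& low)
{
    std::uint32_t v = cp - 0x10000;                           // 20 bits remain
    high = static_cast<std::uint16_t>(0xD800 + (v >> 10));    // top 10 bits
    low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));  // low 10 bits
}

// Recombine a valid surrogate pair into the original code point.
std::uint32_t from_surrogates(std::uint16_t high, std::uint16_t low)
{
    return 0x10000
         + ((static_cast<std::uint32_t>(high) - 0xD800) << 10)
         + (static_cast<std::uint32_t>(low) - 0xDC00);
}
```

For example, the code point 0x1F600 splits into the pair 0xD83D, 0xDE00, and recombining that pair gives 0x1F600 again.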

UTF-16LE and UTF-16BE relate to the CPU architecture of the computer. LE means little endian and BE means big endian; there are many posts on the web about the difference. The x86 systems in common use are little endian, so on them UTF-16 can be regarded as UTF-16LE.
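
As an illustration of the difference, the following sketch writes one 16-bit code unit in each byte order and probes the byte order of the host. All three function names are hypothetical and chosen only for this example.

```cpp
#include <cstdint>

// Write one 16-bit code unit in little-endian order (UTF-16LE): low byte first.
void put_utf16le(std::uint16_t unit, unsigned char out[2])
{
    out[0] = static_cast<unsigned char>(unit & 0xFF);
    out[1] = static_cast<unsigned char>(unit >> 8);
}

// Write one 16-bit code unit in big-endian order (UTF-16BE): high byte first.
void put_utf16be(std::uint16_t unit, unsigned char out[2])
{
    out[0] = static_cast<unsigned char>(unit >> 8);
    out[1] = static_cast<unsigned char>(unit & 0xFF);
}

// Returns true when the host stores the low byte of an integer first.
bool host_is_little_endian()
{
    const std::uint16_t probe = 1;
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}
```

For the code unit 0x4E00, `put_utf16le` produces the bytes 00 4E and `put_utf16be` produces 4E 00. Writing bytes explicitly like this, rather than copying a 16-bit integer from memory, keeps the output independent of the machine the code runs on.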

In Europe and North America, the characters actually used fall within 0x0000-0x00FF, so a single byte is enough to represent all of them. Even UTF-16 wastes a great deal of space for such text, and that is why the UTF-8 encoding exists. It is a very flexible encoding. Characters in the range 0x00-0x7F use a single byte. Chinese, Japanese, and Korean characters, which need two bytes in UTF-16, are converted by a UTF-16-to-UTF-8 algorithm and typically need 3 bytes. For characters that need 4 bytes in UTF-32, the original UTF-8 design can extend to as many as 6 bytes per character (current Unicode, per RFC 3629, stops at 4 bytes because code points end at 0x10FFFF). The algorithm used by UTF-8 is interesting, and the approximate mapping is as follows:
UTF-32                   UTF-8
0x00000000-0x0000007F    0xxxxxxx
0x00000080-0x000007FF    110xxxxx 10xxxxxx
0x00000800-0x0000FFFF    1110xxxx 10xxxxxx 10xxxxxx
0x00010000-0x001FFFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0x00200000-0x03FFFFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0x04000000-0x7FFFFFFF    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
This is very similar to the way classful IP addressing uses the leading bits of an address to identify its class.
UTF-8 converts easily to and from UTF-16 and UTF-32 (no code table is required, and C code for the conversion algorithm can be found on unicode.org). The implementation of UTF-8 is also the same on every operating system platform, so there is no cross-platform problem. This makes UTF-8 a good solution for cross-platform Unicode. For Chinese, of course, it is a bit wasteful, because each character needs 3 bytes.
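
The bit layout in the table above translates almost directly into code. The following is a minimal sketch, not the unicode.org reference code: the function name `encode_utf8` is hypothetical, and the sketch deliberately stops at 4 bytes, the limit in current Unicode (RFC 3629), instead of implementing the 5- and 6-byte rows.

```cpp
#include <cstdint>

// Encode one code point as UTF-8 following the bit layout in the table.
// Writes up to 4 bytes to out and returns the byte count, or 0 if cp is
// a surrogate (0xD800-0xDFFF) or lies above 0x10FFFF.
int encode_utf8(std::uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       // 0xxxxxxx
        out[0] = static_cast<unsigned char>(cp);
        return 1;
    }
    if (cp < 0x800) {                      // 110xxxxx 10xxxxxx
        out[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        out[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    // 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      // surrogates are not characters
        out[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        out[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        out[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        out[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        out[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 0x4E00 this yields the three bytes E4 B8 80, which matches the statement that a Chinese character typically needs 3 bytes in UTF-8.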

The UTF-8 text header (byte order mark) is EF BB BF.
The UTF-16 text header is FE FF when the byte stream is big-endian and FF FE when it is little-endian.
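
A reader of text files can use these headers to guess the encoding. The sketch below is one possible way to do that; the enum `Bom` and the function `detect_bom` are hypothetical names introduced for this example.

```cpp
#include <cstddef>

enum Bom { BOM_NONE, BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE };

// Inspect the first bytes of a buffer and report which text header,
// if any, it starts with.
Bom detect_bom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return BOM_UTF8;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return BOM_UTF16_BE;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return BOM_UTF16_LE;
    return BOM_NONE;
}
```

The header is optional, so `BOM_NONE` does not prove that the data is not Unicode. The sketch also does not distinguish the UTF-32 little-endian header (FF FE 00 00), which begins with the same two bytes as the UTF-16 little-endian one.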

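#include <stdio.h>

// Convert UTF-8 to UTF-16 in the machine's native byte order.
// Only 1- to 3-byte sequences (the Basic Multilingual Plane) are handled,
// and continuation bytes are not validated.
// On entry size8 is the input length in bytes. On return size8 is the number
// of bytes consumed and size16 the number of bytes written. The utf16 buffer
// must hold at least 2 * size8 bytes.
// Returns size16, or -1 on malformed or unsupported input.
int convertutf8utf16(unsigned char* utf8, int& size8, char* utf16, int& size16)
{
    int count = 0, i = 0;
    unsigned short int integer;
    unsigned short int *p;

    while (count < size8)
    {
        if (utf8[count] < 0x80)
        {
            // 0xxxxxxx: one byte
            integer = utf8[count];
            count++;
        }
        else if ((utf8[count] >= 0xC0) && (utf8[count] <= 0xDF) && (count + 1 < size8))
        {
            // 110xxxxx 10xxxxxx: two bytes
            integer = utf8[count] & 0x1F;
            integer = integer << 6;
            integer += utf8[count+1] & 0x3F;
            count += 2;
        }
        else if ((utf8[count] >= 0xE0) && (utf8[count] <= 0xEF) && (count + 2 < size8))
        {
            // 1110xxxx 10xxxxxx 10xxxxxx: three bytes
            integer = utf8[count] & 0x0F;
            integer = integer << 6;
            integer += utf8[count+1] & 0x3F;
            integer = integer << 6;
            integer += utf8[count+2] & 0x3F;
            count += 3;
        }
        else
        {
            // stray continuation byte, truncated input, or 4-byte sequence
            printf("error!\n");
            size8 = count;
            size16 = i;
            return -1;
        }

        // each decoded character occupies two bytes of output
        p = (unsigned short int*)&utf16[i];
        *p = integer;
        i += 2;
    }

    size8 = count;
    size16 = i;
    return size16;
}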

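// Convert UTF-16 in the machine's native byte order to UTF-8.
// Only the Basic Multilingual Plane is handled; surrogates are rejected.
// On entry size16 is the input length in bytes. On return size16 is the
// number of bytes consumed and size8 the number of bytes written. The utf8
// buffer must hold at least 3 bytes per input code unit.
// Returns size8, or -1 on unsupported input. Requires <stdio.h> for printf.
int convertutf16utf8(char* utf16, int& size16, char* utf8, int& size8)
{
    int i = 0, count = 0;
    unsigned short int integer;

    for (i = 0; i + 1 < size16; i += 2)
    {
        integer = *(unsigned short int*)&utf16[i];

        if (integer < 0x80)
        {
            // 0xxxxxxx
            utf8[count] = integer & 0x7F;
            count++;
        }
        else if (integer < 0x0800)
        {
            // 110xxxxx 10xxxxxx
            utf8[count] = 0xC0 | (0x1F & (integer >> 6));
            utf8[count+1] = 0x80 | (0x3F & integer);
            count += 2;
        }
        else if (integer < 0xD800 || integer > 0xDFFF)
        {
            // 1110xxxx 10xxxxxx 10xxxxxx
            utf8[count] = 0xE0 | (0x0F & (integer >> 12));
            utf8[count+1] = 0x80 | ((0x0FC0 & integer) >> 6);
            utf8[count+2] = 0x80 | (0x003F & integer);
            count += 3;
        }
        else
        {
            // surrogate: would need a 4-byte UTF-8 sequence
            printf("error\n");
            size16 = i;
            size8 = count;
            return -1;
        }
    }

    size16 = i;
    size8 = count;
    return count;
}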
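
The two functions above cover only characters that fit in a single 16-bit unit. As a hedged sketch of how surrogate pairs could be included, the following variant works on `std::u16string` and `std::string` instead of raw buffers. The name `utf16_to_utf8` and its error convention are assumptions made for this example, not part of the original code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Convert UTF-16 code units to UTF-8, including surrogate pairs.
// Returns false on an unpaired surrogate; out then holds the text
// converted up to that point.
bool utf16_to_utf8(const std::u16string& in, std::string& out)
{
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {          // high surrogate
            if (i + 1 >= in.size()) return false;
            std::uint32_t lo = in[i + 1];
            if (lo < 0xDC00 || lo > 0xDFFF) return false;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;                                     // consumed two units
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {   // stray low surrogate
            return false;
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return true;
}
```

Using string types avoids the manual size bookkeeping of the raw-buffer versions, at the cost of allocating the output inside the function.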

UTF-8 to Unicode, Unicode to GBK, UTF-8 to GBK

#include <windows.h>
#include <stdio.h>
#include <string.h>

int main() {

    // Three different encodings of "老徐" (Lao Xu)
    unsigned char utf8[] = "\xe8\x80\x81\xe5\xbe\x90";
    unsigned char unicode[] = "\x01\x80\x90\x5f";
    unsigned char ansi[] = "\xc0\xcf\xd0\xec";

    int len;

    // UTF-8 to Unicode (UTF-16): first ask for the length, then convert
    len = MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, NULL, 0);
    WCHAR *wszUtf8 = new WCHAR[len + 1];
    memset(wszUtf8, 0, len * 2 + 2);
    MultiByteToWideChar(CP_UTF8, 0, (LPCSTR)utf8, -1, wszUtf8, len);

    MessageBoxW(NULL, (const wchar_t*)wszUtf8, NULL, MB_OK);

    // Unicode to ANSI; after the two conversions the UTF-8 text is GBK encoded
    // (CP_ACP yields GBK only when the system ANSI code page is 936)
    len = WideCharToMultiByte(CP_ACP, 0, wszUtf8, -1, NULL, 0, NULL, NULL);
    char *szGBK = new char[len + 1];
    memset(szGBK, 0, len + 1);
    WideCharToMultiByte(CP_ACP, 0, wszUtf8, -1, szGBK, len, NULL, NULL);

    MessageBoxA(NULL, (const char*)szGBK, NULL, MB_OK);

    delete[] szGBK;
    delete[] wszUtf8;

    return 0;
}
