Character encoding: detecting Unicode, UTF-8, GB2312, and Shift-JIS encodings

Last Update:2018-12-03 Source: Internet

Author: User

String encoding detection; conversion between Unicode and UTF-8

The difference between Unicode and UTF-8: Unicode is a character set, while UTF-8 is one of the encodings of Unicode. In this article, "Unicode" as an encoding means the two-byte form that Windows stores in wchar_t (UCS-2/UTF-16), which is always double-byte for characters in the Basic Multilingual Plane, while UTF-8 is variable-length. A Chinese character therefore occupies two bytes in Unicode and three bytes in UTF-8, that is, one byte less in Unicode. A character in the Basic Multilingual Plane (BMP) is at most 3 bytes long in UTF-8. Look at the UTF-8 encoding table:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The positions marked x are filled with the bits of the character's code point in binary. The further right an x is, the less significant the bit it holds. Only the shortest multi-byte sequence that can represent the code point may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte is the number of bytes in the whole sequence. The first row starts with 0 to stay compatible with ASCII and is a single byte; the second row is a two-byte sequence; the third row is a three-byte sequence (a Chinese character, for example); and so on. In short, the count of leading 1 bits can be read directly as the byte count. (The current UTF-8 standard, RFC 3629, stops at U+10FFFF, so only the first four rows are used in practice.)
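The leading-1 rule can be sketched as a small helper. This is a minimal illustration, not part of the original article; the name utf8SequenceLength is hypothetical, and the function only inspects the first byte without validating the bytes that follow.

```cpp
// Returns the total length (1..6) of the UTF-8 sequence announced by a lead
// byte, following the encoding table, or 0 if the byte cannot start a
// sequence (a 10xxxxxx continuation byte, or 0xFE / 0xFF).
// utf8SequenceLength is a hypothetical helper name.
int utf8SequenceLength(unsigned char lead) {
    if ((lead & 0x80) == 0x00) {
        return 1;  // 0xxxxxxx: plain ASCII, one byte
    }
    // Count the consecutive 1 bits starting from the most significant bit.
    int ones = 0;
    for (unsigned char mask = 0x80; mask != 0 && (lead & mask) != 0; mask >>= 1) {
        ++ones;
    }
    if (ones == 1 || ones > 6) {
        return 0;  // continuation byte, or more leading ones than any row allows
    }
    return ones;
}
```

For the lead byte 0xE4 (11100100) the helper reports 3, which matches the three-byte row used for Chinese characters.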

To convert Unicode to UTF-8, you need to know how the two differ. Let's look at how a Unicode code point is converted to UTF-8. If the value of a character is smaller than 0x80 (128), it is an ASCII character; it occupies one byte and needs no conversion, because UTF-8 is compatible with ASCII. Take the Chinese character 你 ("you"), which is U+4F60 in Unicode, or 100111101100000 in binary. Following the UTF-8 rules, take bits from the Unicode binary value from the low-order end to the high-order end, 6 bits at a time, and fill them into the x positions of the three-byte format above; a group that has fewer bits than its slot is padded on the left with 0.

Unicode: 100111101100000 (0x4F60)

UTF-8: 11100100 10111101 10100000 (0xE4 0xBD 0xA0)

The example above shows the conversion from Unicode to UTF-8 directly. Once you know the UTF-8 format, you can also perform the inverse operation: extract the bits from the corresponding positions according to the format, and the result is the Unicode character. (This can be done with shift operations.)

For example, because the value of 你 is at least 0x800 and less than 0x10000, it is stored in three bytes. For the first (highest) byte, shift the value right by 12 bits and OR (|) it with the three-byte lead pattern 11100000 (0xE0). For the second byte, shift the value right by 6 bits; the result still contains the bits that belong to the first byte, so AND (&) it with 111111 (0x3F) to keep only the low six bits, then OR (|) it with 10000000 (0x80). The third byte needs no shift: take the last six bits with an AND (&) against 111111 (0x3F) and OR (|) the result with 10000000 (0x80). Conversion successful! The code in VC++ is as follows (Unicode to UTF-8 conversion):

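#include <cstring>  // memset

const wchar_t *punicode = L"\u4F60";  // the character 你 (U+4F60)
char utf8[3 + 1];
memset(utf8, 0, 4);
utf8[0] = (char)(0xE0 | (*punicode >> 12));          // lead byte: top 4 bits
utf8[1] = (char)(0x80 | ((*punicode >> 6) & 0x3F));  // middle 6 bits
utf8[2] = (char)(0x80 | (*punicode & 0x3F));         // low 6 bits
utf8[3] = '\0';
// utf8 now holds the UTF-8 bytes of the character: E4 BD A0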
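The three-byte case generalizes to the other rows of the encoding table. The following is a minimal sketch under stated assumptions rather than the article's own code: encodeUtf8 is a hypothetical name, the input is a full code point instead of a wchar_t, and the function follows the RFC 3629 limits (at most four bytes, no surrogates) instead of the older six-byte table.

```cpp
#include <cstdint>
#include <string>

// Encodes one Unicode code point as UTF-8 and returns the bytes.
// Returns an empty string for values that cannot be encoded under RFC 3629:
// surrogates (U+D800..U+DFFF) and anything above U+10FFFF.
// encodeUtf8 is a hypothetical helper name.
std::string encodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // One byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;  // surrogate halves are not characters
        }
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encodeUtf8(0x4F60) yields the three bytes E4 BD A0 from the worked example. Returning an empty string for invalid input is only one possible design; throwing or returning an error code would work equally well.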

The conversion from UTF-8 to Unicode is also done with shifts: pull the payload bits out of the corresponding positions of the UTF-8 format. In the example above, 你 takes three bytes, so each byte is processed in turn from high to low. In UTF-8, 你 is 11100100 10111101 10100000. From the first (highest) byte, 11100100, we need to take out the "0100". This is simple: AND (&) the byte with 00001111 (0x0F). (Masking with 11111 (0x1F) gives the same value here, because the bit just below the 1110 prefix of a three-byte lead byte is always 0.) In a three-byte sequence these bits must sit above 12 lower bits, because each of the two following bytes contributes six bits, so shift the result left by 12 bits. The intermediate value is 0100 000000 000000. For the second byte, AND (&) 10111101 with 111111 (0x3F), shift the result left by six bits, and OR (|) it with the value obtained from the first byte; the intermediate value is now 0100 111101 000000. The last byte is simply ANDed (&) with 111111 (0x3F) and ORed (|) with the previous result, giving 0100 111101 100000, which is 0x4F60. The conversion is successful! The code in VC++ is as follows (UTF-8 to Unicode conversion):

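// UTF-8 encoded string
const char *utf8 = "\xE4\xBD\xA0";  // the character 你
wchar_t unicode;
unicode = (utf8[0] & 0x0F) << 12;   // payload of the lead byte 1110xxxx
unicode |= (utf8[1] & 0x3F) << 6;   // payload of the first continuation byte
unicode |= (utf8[2] & 0x3F);        // payload of the last continuation byte
// unicode is now 0x4F60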
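The same mask-and-shift steps extend to one-, two-, and four-byte sequences. The sketch below is an illustration under assumptions, not the article's method: decodeUtf8 is a hypothetical name, and it adds the checks a fixed three-byte snippet skips (truncated input, bad continuation bytes, overlong forms, surrogates, values above U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes the UTF-8 sequence that starts at s[pos] and stores the code point
// in cp. Returns the number of bytes consumed, or 0 if the sequence is
// truncated, malformed, overlong, a surrogate, or above U+10FFFF.
// On failure the value left in cp is unspecified.
// decodeUtf8 is a hypothetical helper name.
std::size_t decodeUtf8(const std::string& s, std::size_t pos, std::uint32_t& cp) {
    if (pos >= s.size()) {
        return 0;
    }
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 0;
    std::uint32_t minValue = 0;  // smallest code point allowed for this length
    if (lead < 0x80) {
        cp = lead;
        return 1;
    } else if ((lead & 0xE0) == 0xC0) {
        cp = lead & 0x1F; len = 2; minValue = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
        cp = lead & 0x0F; len = 3; minValue = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
        cp = lead & 0x07; len = 4; minValue = 0x10000;
    } else {
        return 0;  // continuation byte or invalid lead byte
    }
    if (s.size() - pos < len) {
        return 0;  // sequence is cut off
    }
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) {
            return 0;  // not a 10xxxxxx continuation byte
        }
        cp = (cp << 6) | (c & 0x3F);  // append six payload bits
    }
    if (cp < minValue || cp > 0x10FFFF) {
        return 0;  // overlong encoding or out of range
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;  // surrogate halves are not characters
    }
    return len;
}
```

Decoding the bytes E4 BD A0 consumes three bytes and produces 0x4F60, the same result as the hand-written three-byte version.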

As for converting Unicode to GB2312, Windows has its own API for this, WideCharToMultiByte, which an MFC program can call directly. Because UTF-8 can be converted to Unicode first, UTF-8 text can be converted to GB2312 the same way. The C++ code follows.

// Declarations (static members of class cgtdbsvrlib)
#ifdef WIN32
/************************************************************************
 * Name: ascii2utf8()
 * Desc: Convert an ASCII (ANSI code page) string to UTF-8. Returns the converted length.
 ************************************************************************/
static int ascii2utf8(
    const char *szbuf       // original string buffer
    , std::string &szdest   // target string
);

/************************************************************************
 * Name: utf82ascii()
 * Desc: Convert UTF-8 to an ASCII (ANSI code page) string
 ************************************************************************/
static int utf82ascii(
    const char *szutf8data  // original string buffer
    , std::string &szdest   // target string
);

//-------------------------------------------------------
// Determine whether the string is UTF-8
//-------------------------------------------------------
static bool isutf8(const char *pzinfo);

//-------------------------------------------------------
// Determine whether the string is GB2312
//-------------------------------------------------------
static bool isgb2312(const char *pzinfo);
#endif

//-------------------------------------------------------
// GB2312 or not
//-------------------------------------------------------
static int isgb(char *ptext);

//-------------------------------------------------------
// Whether Chinese characters exist
//-------------------------------------------------------
static bool ischinese(const char *pzinfo);

// .cpp
#include <string>
#ifdef WIN32
#include <windows.h>  // MultiByteToWideChar, WideCharToMultiByte, GetLastError
/***************************************************
 * Function: ascii2utf8
 * Description: Convert ASCII (ANSI code page) to UTF-8
 * Input: szbuf: original string buffer
 *        szdest: UTF-8 target
 * Output:
 * Return:
 * Others:
 ***************************************************/
int cgtdbsvrlib::ascii2utf8(
    const char *szbuf      // original string buffer
    , std::string &szdest
)
{
    int nlen = 0;
#define w_len 1024
#define buf_max (w_len * 4)
    try
    {
        WCHAR wchar[w_len] = {0};
        char szdestdata[buf_max] = {0};
        // ANSI code page -> UTF-16 -> UTF-8
        MultiByteToWideChar(CP_ACP, 0, szbuf, -1, wchar, sizeof(wchar) / sizeof(wchar[0]));
        nlen = WideCharToMultiByte(CP_UTF8, 0, wchar, -1, szdestdata, sizeof(szdestdata), NULL, NULL);
        szdest = szdestdata;
#ifdef _debug_test
        // Round trip back to the ANSI code page to check the result
        char szcheck[buf_max] = {0};
        MultiByteToWideChar(CP_UTF8, 0, szdestdata, nlen, wchar, sizeof(wchar) / sizeof(wchar[0]));
        nlen = WideCharToMultiByte(CP_ACP, 0, wchar, -1, szcheck, sizeof(szcheck), NULL, NULL);
        gt_trace(e_debug, "\r\nsizeof(wchar) = %ld", (long)sizeof(wchar));
#endif
    }
    catch (long nline)
    {
        // gt_trace(e_debug, "\r\n code exception %s: line = %ld\n", __FILE__, nline);
    }
    catch (...)
    {
    }
    return nlen;
}
/***************************************************
 * Function: utf82ascii
 * Description: Convert UTF-8 to ASCII (ANSI code page)
 * Input: szutf8data: original string buffer
 *        szdest: ASCII target string
 * Output:
 * Return:
 * Others:
 ***************************************************/
int cgtdbsvrlib::utf82ascii(
    const char *szutf8data  // original string buffer
    , std::string &szdest
)
{
    int nlen = 0;
#define w_len 1024
#define buf_max (w_len * 4)
    WCHAR wchar[w_len] = {0};
    char szdestdata[buf_max] = {0};
    // UTF-8 -> UTF-16 -> ANSI code page
    MultiByteToWideChar(CP_UTF8, 0, szutf8data, -1, wchar, sizeof(wchar) / sizeof(wchar[0]));
    nlen = WideCharToMultiByte(CP_ACP, 0, wchar, -1, szdestdata, sizeof(szdestdata), NULL, NULL);
    szdest = szdestdata;
    return nlen;
}
/***************************************************
 * Function: isutf8
 * Description: Determine whether the string is UTF-8 encoded
 * Input: pzinfo: the string to be judged
 * Output:
 * Return:
 * Others:
 ***************************************************/
bool cgtdbsvrlib::isutf8(const char *pzinfo)
{
    // Only ask for the required size; MB_ERR_INVALID_CHARS makes the call fail on invalid input
    int nwsize = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, pzinfo, -1, NULL, 0);
    // GetLastError is only meaningful when the call failed (returned 0)
    if (nwsize == 0 && GetLastError() == ERROR_NO_UNICODE_TRANSLATION)
    {
        return false;
    }
    // To determine whether it is GB2312, you only need to replace CP_UTF8 with 936.
    return true;
}
/***************************************************
 * Function: isgb2312
 * Description: Determine whether the string is GB2312 encoded
 * Input: pzinfo: the string to be judged
 * Output:
 * Return:
 * Others:
 ***************************************************/
bool cgtdbsvrlib::isgb2312(const char *pzinfo)
{
    // 936 is the Windows code page for simplified Chinese (GB2312/GBK)
    int nwsize = MultiByteToWideChar(936, MB_ERR_INVALID_CHARS, pzinfo, -1, NULL, 0);
    // GetLastError is only meaningful when the call failed (returned 0)
    if (nwsize == 0 && GetLastError() == ERROR_NO_UNICODE_TRANSLATION)
    {
        return false;
    }
    // To determine whether it is UTF-8, you only need to replace 936 with CP_UTF8.
    return true;
}
#endif
/***************************************************
 * Function: isgb
 * Description: GB2312 or not
 * Input: ptext: the character to be judged
 * Output:
 * Return: 0 for a single-byte character, 1 for a full-width character, 2 for a Chinese character
 * Others:
 ***************************************************/
int cgtdbsvrlib::isgb(char *ptext)
{
    unsigned char *sqchar = (unsigned char *)ptext;
    if (sqchar[0] >= 0xa1)
    {
        if (sqchar[0] == 0xa3)
        {
            return 1; // full-width character
        }
        else
        {
            return 2; // Chinese characters
        }
    }
    else
    {
        return 0; // English letters, digits, and punctuation
    }
}
/***************************************************
 * Function: ischinese
 * Description: Whether Chinese characters exist
 * Input: pzinfo: the string to be judged
 * Output:
 * Return:
 * Others:
 ***************************************************/
bool cgtdbsvrlib::ischinese(const char *pzinfo)
{
    int i;
    char *ptext = (char *)pzinfo;
    while (*ptext != '\0')
    {
        i = isgb(ptext);
        switch (i)
        {
        case 0: // single-byte character
            ptext++;
            break;
        case 1: // full-width character: skip both bytes
            ptext++;
            if (*ptext != '\0') // do not step past a truncated pair
                ptext++;
            break;
        case 2: // Chinese character found
            return true;
        }
    }
    return false;
}
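The isgb2312 check above depends on the Windows code page tables. On other platforms, a structural check of the GB2312 byte ranges can serve as a rough substitute. The sketch below is an assumption-laden illustration, not the article's method: looksLikeGb2312 is a hypothetical name, it assumes the usual EUC-CN byte form of GB2312, and it does not verify that each two-byte pair is actually assigned in the GB2312 table.

```cpp
#include <cstddef>
#include <string>

// Returns true if s is structurally valid GB2312 in its EUC-CN byte form:
// bytes below 0x80 stand alone, and every other character is a lead byte in
// 0xA1..0xF7 followed by a trail byte in 0xA1..0xFE.
// looksLikeGb2312 is a hypothetical name; unassigned pairs are not rejected.
bool looksLikeGb2312(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if (lead < 0x80) {
            ++i;  // single-byte ASCII character
            continue;
        }
        if (lead < 0xA1 || lead > 0xF7) {
            return false;  // not a GB2312 lead byte
        }
        if (i + 1 >= s.size()) {
            return false;  // lead byte without its trail byte
        }
        const unsigned char trail = static_cast<unsigned char>(s[i + 1]);
        if (trail < 0xA1 || trail > 0xFE) {
            return false;  // not a GB2312 trail byte
        }
        i += 2;
    }
    return true;
}
```

Some byte strings pass both this check and a UTF-8 validity check (for example, the two bytes C3 A9), so a detector would typically test for UTF-8 first and treat a GB2312 match only as a fallback guess.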

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
