C# Transcoding

Last Update:2018-12-07 Source: Internet

Author: User


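using System.Text;

string unicodeString = "China";

// Create two different encodings.
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;

// Encode the string as UTF-16 (little-endian) bytes.
byte[] unicodeBytes = unicode.GetBytes(unicodeString);

// Convert the bytes from the source encoding (unicode) to the target encoding (ascii).
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);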

To obtain the GB2312 encoding instead:

Encoding ansi = Encoding.GetEncoding("gb2312");
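
For comparison, the same decode-then-re-encode step can be sketched in Python 3. This is only an illustration under stated assumptions: the helper name transcode is hypothetical, the sample text is chosen for the example, and "utf-16-le" is used because it has the same byte layout as C#'s Encoding.Unicode. Characters the target encoding cannot represent are replaced with '?', which is also what Encoding.ASCII does by default.

```python
# Sketch of the UTF-16 -> ASCII / GB2312 transcoding step in Python 3.
# `transcode` is a hypothetical helper, not a standard-library function.

def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode `data` from encoding `src`, then re-encode it as `dst`.

    Characters that `dst` cannot represent are replaced with '?'.
    """
    return data.decode(src).encode(dst, errors="replace")


text = "China 中国"
utf16_bytes = text.encode("utf-16-le")  # UTF-16 little-endian, no BOM

ascii_bytes = transcode(utf16_bytes, "utf-16-le", "ascii")
gb2312_bytes = transcode(utf16_bytes, "utf-16-le", "gb2312")

print(ascii_bytes)   # b'China ??'
print(gb2312_bytes)  # b'China \xd6\xd0\xb9\xfa'
```

The ASCII result loses the two Chinese characters, while the GB2312 result keeps them as two bytes each.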

Analysis of the conversion relationship between UTF-8, GB2312, and Unicode

1. UCS stands for Universal Character Set.
2. UTF stands for UCS Transformation Format.
3. GB2312 addresses characters by a location code (row number followed by cell number). Rows 1-9 hold symbols and other non-Chinese characters, rows 16-55 hold the level-1 Chinese characters, and rows 56-87 hold the level-2 Chinese characters. For example, the first Chinese character is the 1st cell of row 16, '啊' (a), so its location code is 1601, that is, 0x1001.
To turn a location code into the machine internal code, add 0xA0 to the row number and to the cell number separately. The machine internal code of '啊' is therefore 0xB0A1.
4. National standard code = location code + 0x2020
Machine internal code = national standard code + 0x8080, that is, location code + 0xA0A0
>>> print unicode('\xb0\xa1', 'gb2312')  # convert the machine internal code to Unicode (Python 2)
啊
>>> print unicode(chr(0xb0) + chr(0xa1), 'gb2312')
啊
>>> a = u'\u00a9'
>>> print a
©
>>> a.encode('utf-8')
'\xc2\xa9'
>>> print chr(65)
A
>>> print unichr(0x8089)
肉
5. Unicode assigns its own codes to Chinese characters, and both the method and the ordering are completely different from GB2312. The Chinese characters in Unicode run from 0x4E00 to 0x9FA5, so converting between Unicode and GB2312 requires a conversion table for fast lookup. [Luther.gliethttp]
6. The wchar_t type can be used to store Unicode characters.
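
The arithmetic in items 3 and 4 can be checked with a short Python 3 sketch. The function name location_to_internal is hypothetical and chosen only for this illustration; the decode call at the end relies on Python's built-in gb2312 codec.

```python
# Sketch: GB2312 location code -> national standard code -> machine internal code.
# `location_to_internal` is a hypothetical helper name used for illustration.

def location_to_internal(row: int, cell: int) -> bytes:
    """Return the two-byte machine internal code for a GB2312 row/cell pair."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    # National standard code: add 0x20 to each half (location code + 0x2020).
    national = (row + 0x20, cell + 0x20)
    # Machine internal code: add 0x80 to each half (national code + 0x8080).
    return bytes((national[0] + 0x80, national[1] + 0x80))


first = location_to_internal(16, 1)  # row 16, cell 1: the first Chinese character
print(first.hex())                   # b0a1
print(first.decode("gb2312"))        # 啊
```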

In ubuntu, use
Locale displays the default encoding of the system.
Luther @ gliethttp :~ $ Locale
Lang = en_US.UTF-8
Lc_ctype = "en_US.UTF-8"
Lc_numeric = "en_US.UTF-8"
Lc_time = "en_US.UTF-8"
Lc_collate = "en_US.UTF-8"
Lc_monetary = "en_US.UTF-8"
Lc_messages = "en_US.UTF-8"
Lc_paper = "en_US.UTF-8"
Lc_name = "en_US.UTF-8"
Lc_address = "en_US.UTF-8"
Lc_telephone = "en_US.UTF-8"
Lc_measurement = "en_US.UTF-8"
Lc_identification = "en_US.UTF-8"
Lc_all =
Luther @ gliethttp :~ $ Vim utf8.c
Then enter: Luther China
Save, view and save, utf8.c's HEX:
0000000: 6C 75 74 68 65 72 E4 B8 ad E5 9B BD 0a Luther .......
Use Python (Python 2) for further analysis:
>>> ord('\n')
10
>>> a = '中国'  # a plain byte string, stored exactly as the terminal sent it
>>> a
'\xe4\xb8\xad\xe5\x9b\xbd'  # this is UTF-8, so byte strings typed in my Ubuntu terminal are UTF-8 encoded
>>> a = u'中国'  # a Unicode string
>>> a
u'\u4e2d\u56fd'  # these are the Unicode code points
>>> a.encode('gb2312')  # convert to the GB2312 machine internal code
'\xd6\xd0\xb9\xfa'
>>> a.encode('utf-8')  # convert to UTF-8
'\xe4\xb8\xad\xe5\x9b\xbd'  # identical to the byte string above
>>> a = u'中国'
>>> a
u'\u4e2d\u56fd'
>>> b = a.encode('gb2312')  # convert Unicode to GB2312
>>> b
'\xd6\xd0\xb9\xfa'
>>> unicode(b, 'gb2312')  # convert GB2312 back to Unicode
u'\u4e2d\u56fd'
Show the Unicode code point of a character:
>>> u'葛'
u'\u845b'
Print a Unicode code point as a visible character:
>>> print u'\u845b'
葛

[Note: the following content is from http://www.ixpub.net/thread-865394-1-1.html]
UTF-8
Now that we understand Unicode, what is UTF-8, and why does it exist?
Converting ASCII to UCS-2 only requires inserting a 0x00 byte before each ASCII code. Text encoded this way therefore contains bytes such as '\0' or '/' inside wider characters, and these can cause serious errors in UNIX and in some C functions. UCS-2 is clearly unsuitable as an external encoding of Unicode.
UTF-8 was created for this reason. How does UTF-8 encode characters, and how does it solve the problem with UCS-2?
Example:
E4 BD A0    11100100 10111101 10100000
This is the UTF-8 encoding of the character '你' (you).
4F 60    01001111 01100000
This is the Unicode code point of '你'.
Splitting the three bytes according to the UTF-8 encoding rules gives: xxxx0100 xx111101 xx100000
Concatenating the bits that are not marked x (0100 111101 100000) yields the Unicode code point 0x4F60.
Note that the three leading 1 bits of the first byte indicate that this UTF-8 sequence is three bytes long.
After UTF-8 encoding, the sensitive bytes no longer appear inside multi-byte characters, because the highest bit of every byte in a multi-byte sequence is always 1.
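
The bit extraction in the example above can be written out as a small Python 3 sketch. The function name decode_utf8_3byte is hypothetical, and it handles only the three-byte form used in the example.

```python
# Sketch: recover the Unicode code point from a three-byte UTF-8 sequence.
# `decode_utf8_3byte` is a hypothetical helper covering only the 1110xxxx form.

def decode_utf8_3byte(data: bytes) -> int:
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = data
    # The lead byte must look like 1110xxxx, the others like 10xxxxxx.
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Keep the payload bits (4 + 6 + 6) and concatenate them.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


code_point = decode_utf8_3byte(bytes([0xE4, 0xBD, 0xA0]))
print(hex(code_point))  # 0x4f60
print(chr(code_point))  # 你
```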

The conversion relationships between Unicode and UTF-8 are as follows:
U-00000000 - U-0000007F: 0xxxxxxx  // no leading 1: a single byte
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx  // two leading 1s: 2 bytes
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx  // three leading 1s: 3 bytes
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  // and so on
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
To convert a Unicode code point to UTF-8, simply fill its bits into the x positions of the matching pattern. (The 5-byte and 6-byte forms come from the original UTF-8 definition; the current standard stops at 4 bytes and U+10FFFF.)
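
As a minimal sketch of filling code point bits into these patterns, the following Python 3 function covers the one- to four-byte forms. The name encode_utf8 is hypothetical, and the sketch deliberately skips checks a real encoder performs (for example rejecting surrogate code points); in practice str.encode("utf-8") does this work.

```python
# Sketch: encode one code point as UTF-8 by filling bits into the patterns above.
# `encode_utf8` is a hypothetical helper; surrogates are not rejected here.

def encode_utf8(cp: int) -> bytes:
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(encode_utf8(0xA9).hex())    # c2a9
print(encode_utf8(0x4F60).hex())  # e4bda0
```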

Unicode and UTF-8 are therefore related by a simple, mechanical conversion, whereas Unicode and GB2312 are not. Converting between Unicode and GB2312 requires a lookup table, much like converting between the Gregorian calendar and the lunar calendar, which cannot be done with a linear calculation. [Luther.gliethttp]
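
One way to see why a table is needed is to decode a few consecutive GB2312 codes and look at the resulting code points. The Python 3 sketch below uses the built-in gb2312 codec as the lookup table; the variable names are chosen only for this illustration.

```python
# Sketch: consecutive GB2312 codes do not map to evenly spaced Unicode code points.

# The first three cells of row 16 (machine internal codes B0A1, B0A2, B0A3).
gb_codes = [b"\xb0\xa1", b"\xb0\xa2", b"\xb0\xa3"]

# The codec performs the table lookup from GB2312 to Unicode.
code_points = [ord(code.decode("gb2312")) for code in gb_codes]

# Gaps between neighbouring code points; a linear mapping would give equal gaps.
gaps = [b - a for a, b in zip(code_points, code_points[1:])]

for code, cp in zip(gb_codes, code_points):
    print(code.hex(), "->", f"U+{cp:04X}", chr(cp))
print(gaps)
```

The gaps differ in size and even in sign, so no fixed offset or scale factor converts one encoding to the other.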

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
