ASCII, Unicode, UTF-8, UTF-16, GBK, GB2312, GB18030

Source: Internet
Author: User
Document directory
  • ASCII
  • GB2312
  • GBK
  • GB18030
  • UNICODE
  • UTF Encoding
  • Setlocale
  • Miserable programmer
ASCII

Character set for English; its 8-bit extensions add Western European languages.

ASCII uses 7 bits and can represent 128 characters. Its extensions use 8 bits to represent 256 characters.

ASCII ranges from 00 to 7F; extended ASCII ranges from 00 to FF.
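The 00 to 7F range gives a simple test for whether a buffer is pure ASCII. The following is a minimal sketch; the function name is_ascii is made up for illustration and does not come from any library.

```c
#include <stddef.h>

/* Returns 1 if every byte lies in the 7-bit ASCII range 0x00-0x7F,
 * and 0 as soon as a byte with the high bit set is found.
 * (is_ascii is a hypothetical helper name.) */
int is_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] > 0x7F) {
            return 0;
        }
    }
    return 1;
}
```

A byte above 7F means the data uses some 8-bit extension or a multi-byte encoding, and the byte values alone do not say which one.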

GB2312

Simplified Chinese character set, compatible with ASCII.

Each character is encoded in 2 bytes. The set holds 7445 characters, including 6763 Chinese characters, and covers almost all high-frequency Chinese characters.

The high byte ranges from A1 to F7, and the low byte ranges from A1 to FE. The high and low bytes are obtained by adding 0xA0 to the character's zone number and position number respectively.
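The 0xA0 offset can be sketched as follows, assuming the input is the standard's zone and position number pair (both counted from 1). The function name gb2312_from_quwei is hypothetical, and the check only covers the byte ranges quoted above, not whether a character is actually assigned to that cell.

```c
#include <stdint.h>

/* Converts a GB2312 zone/position pair to its two-byte code by adding
 * 0xA0 to each number. Zones 1-87 map to high bytes A1-F7 and
 * positions 1-94 map to low bytes A1-FE.
 * Returns 0 on success, -1 if either number is out of range.
 * (gb2312_from_quwei is a hypothetical helper name.) */
int gb2312_from_quwei(int zone, int pos, uint8_t out[2])
{
    if (zone < 1 || zone > 87 || pos < 1 || pos > 94) {
        return -1;
    }
    out[0] = (uint8_t)(zone + 0xA0); /* high byte */
    out[1] = (uint8_t)(pos + 0xA0);  /* low byte  */
    return 0;
}
```

For example, zone 16, position 1 yields the bytes B0 A1.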

GBK

An extension of GB2312 that adds support for traditional Chinese characters and remains compatible with GB2312.

Each character is encoded in 2 bytes; the set can hold 21886 characters.

The high byte ranges from 81 to FE, and the low byte ranges from 40 to FE (excluding 7F).

GB18030

Covers Chinese, Japanese, and Korean characters and is compatible with GBK.

It is a variable-length encoding: 1 byte (ASCII), 2 bytes, or 4 bytes. It can contain 27484 characters.

1-byte codes range from 00 to 7F. In 2-byte codes, the high byte ranges from 81 to FE and the low byte from 40 to 7E or 80 to FE. In 4-byte codes, the first and third bytes range from 81 to FE, and the second and fourth bytes range from 30 to 39.
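Because the second byte of a 2-byte code (40 to FE) never overlaps the second byte of a 4-byte code (30 to 39), the length of a sequence can be decided from its first two bytes. The sketch below illustrates this; gb18030_seq_len is a hypothetical name, and the function checks byte ranges only, not whether the code is assigned to a character.

```c
#include <stddef.h>

/* Returns the length (1, 2 or 4) of the GB18030 sequence that starts at p,
 * or 0 if the bytes fit none of the three forms or the buffer ends early.
 * n is the number of bytes available at p.
 * (gb18030_seq_len is a hypothetical helper name.) */
size_t gb18030_seq_len(const unsigned char *p, size_t n)
{
    if (n == 0) {
        return 0;
    }
    if (p[0] <= 0x7F) {
        return 1;                      /* single byte, same as ASCII */
    }
    if (p[0] < 0x81 || p[0] > 0xFE || n < 2) {
        return 0;                      /* bad lead byte or truncated */
    }
    if ((p[1] >= 0x40 && p[1] <= 0x7E) || (p[1] >= 0x80 && p[1] <= 0xFE)) {
        return 2;                      /* two-byte form, as in GBK */
    }
    if (p[1] >= 0x30 && p[1] <= 0x39 && n >= 4 &&
        p[2] >= 0x81 && p[2] <= 0xFE &&
        p[3] >= 0x30 && p[3] <= 0x39) {
        return 4;                      /* four-byte form */
    }
    return 0;
}
```

A decoder would call this repeatedly, advancing by the returned length and treating 0 as an error.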

UNICODE

Unicode, formally the "Universal Multiple-Octet Coded Character Set" (UCS for short), is a character encoding maintained by international organizations. UCS assigns a unique code point to each character, usually written U+xxxx, where xxxx is the hexadecimal code value.

UCS has two formats: UCS-2 and UCS-4. UCS-2 uses 2 bytes and covers U+0000 to U+FFFF; UCS-4 uses 4 bytes and covers U+00000000 to U+7FFFFFFF. In the current Unicode 4.0 standard, the range U+0000 to U+FFFF already contains the commonly used characters of all the world's languages and is called the BMP (Basic Multilingual Plane). With the extensions for other special symbols, the maximum code point is U+10FFFF.

At present, for characters in the BMP, the UCS-4 value is simply the UCS-2 value with 0x0000 prepended.

UTF Encoding

UTF-8, UTF-16, and UTF-32 are the encodings commonly used to represent UCS characters in a computer. UTF-32 represents each UCS character as a single 32-bit value, in one-to-one correspondence with the UCS-4 code value. UTF-16 uses 16-bit units: characters in the range U+0000 to U+FFFF are represented by a single 16-bit value, in one-to-one correspondence with UCS-2, while characters in the range U+10000 to U+10FFFF are represented by two consecutive 16-bit values. UTF-8 uses 8-bit units and, like UTF-16, is variable-length: it needs 1 to 6 consecutive 8-bit values to represent a single UCS character (in practice, since only code points up to U+10FFFF are in use, a UTF-8 encoded character is never longer than 4 bytes).
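The two variable-length forms can be sketched as below. The function names utf8_encode and utf16_encode are made up for illustration. The sketch stops at U+10FFFF, so it emits at most 4 bytes of UTF-8. It also rejects the surrogate range U+D800 to U+DFFF, which the text above does not mention: those 16-bit values are reserved for the two-unit form of UTF-16.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp as UTF-8. Returns the number of bytes written
 * (1-4), or 0 if cp is a surrogate or lies above U+10FFFF.
 * (Hypothetical helper name.) */
size_t utf8_encode(uint32_t cp, uint8_t out[4])
{
    if (cp <= 0x7F) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;
    }
    if (cp <= 0xFFFF) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (cp >> 18));
        out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Encodes code point cp as UTF-16. Returns the number of 16-bit units
 * written (1 or 2), or 0 if cp is a surrogate or lies above U+10FFFF.
 * (Hypothetical helper name.) */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;
    }
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;         /* same value as UCS-2 */
        return 1;
    }
    if (cp <= 0x10FFFF) {
        cp -= 0x10000;                 /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high 10 bits */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low 10 bits  */
        return 2;
    }
    return 0;
}
```

For U+4E2D this gives the UTF-8 bytes E4 B8 AD and the single UTF-16 unit 4E2D; for U+1F600 it gives F0 9F 98 80 and the pair D83D DE00.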

The characters in GB2312 are a subset of Unicode, but a character does not necessarily have the same code value in GB2312 and in Unicode. For example, 中 is D6D0 in GB2312 but U+4E2D in Unicode.

Setlocale

char *setlocale(int category, const char *locale);

category: The locale category, which selects one aspect of the locale. The usual predefined constants are LC_ALL, LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME; LC_ALL stands for the union of all the other categories.

locale: The name of the desired locale. On Linux/Unix, the name usually has the format language[_territory][.codeset][@modifier], where language is the language code specified in ISO 639, territory is the country code specified in ISO 3166, and codeset is the character set name.

In the Windows CRT version of setlocale(), the locale argument can take the following forms:

  1. lang[_country_region[.code_page]]: Although the format resembles that of glibc, Windows locale names do not follow the POSIX conventions. For example, where the POSIX name is zh_CN.GBK, the Windows CRT uses Chinese_People's Republic of China.936.

  2. .code_page: The locale can be set directly from a code page, including the two pseudo code pages .OCP and .ACP. .OCP denotes the active OEM code page obtained from the system; .ACP denotes the active ANSI code page obtained from the system.

  3. "": Sets the locale according to the active ANSI code page of the Windows environment. .OCP, .ACP, and the environment code page are all affected by the "Region and Language Options" settings in Control Panel. On a default installation of Simplified Chinese Windows, the active ANSI code page is 936 (GBK). The chcp console command shows the active code page.

  4. NULL: Retrieves the current locale without changing it.

When outputting wchar_t characters to a terminal or console, you need to call setlocale() first. Terminal and console environments generally do not support the UCS family of encodings, so when stream functions such as printf() are used, the C runtime library converts the UCS characters internally to the appropriate local ANSI encoding. The conversion is based on the active locale set by setlocale(), and the resulting character sequence is then passed to the terminal. For input from the terminal, the process runs in reverse.
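The locale-dependent conversion described here can be observed directly with the standard wcstombs() function, which converts a wide string to the multibyte encoding of the current LC_CTYPE locale. The wrapper below is a minimal sketch; its name wide_to_locale_bytes is hypothetical, and it only illustrates the kind of conversion a runtime library performs, not the internals of any particular one.

```c
#include <stdlib.h>
#include <wchar.h>

/* Converts the wide string ws into the multibyte encoding of the
 * current LC_CTYPE locale and stores it, null-terminated, in out.
 * Returns the number of bytes written (excluding the terminator), or
 * (size_t)-1 if a character has no representation in the current
 * locale or out (cap bytes) is too small.
 * (wide_to_locale_bytes is a hypothetical helper name.) */
size_t wide_to_locale_bytes(const wchar_t *ws, char *out, size_t cap)
{
    size_t n = wcstombs(out, ws, cap);
    if (n == (size_t)-1 || n >= cap) {
        return (size_t)-1;
    }
    return n;
}
```

The same wide string produces different bytes depending on the locale in effect when the function is called. A program that never calls setlocale() runs in the "C" locale, where characters outside ASCII typically cannot be converted at all.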

You can verify this mechanism by redirecting output to a file: whether with the Windows CRT, Linux glibc, or Cygwin glibc, when wprintf() prints wchar_t text, the content redirected to the file is always in a local ANSI encoding such as GBK or UTF-8, never in a UCS encoding.

On Linux, the locale -a command lists all locales configured on the system. The locale command with no options shows the locale active in the current shell. The locale -m command lists all character set encodings supported by the system.

When locale is "", the locale is set according to the environment. To let a program adapt its active locale to the environment, the following call is generally added during program initialization: setlocale(LC_ALL, "").

When locale is NULL, the function only queries the current locale and returns it through the return value, without changing it.
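The "" and NULL forms can be wrapped as two small helpers, sketched below. The names use_environment_locale and current_locale_name are hypothetical; only setlocale() itself is a standard function.

```c
#include <locale.h>
#include <stddef.h>

/* Adopts the locale described by the environment. Returns the name of
 * the resulting locale, or NULL if the request could not be honored
 * (in which case the locale is left unchanged).
 * (Hypothetical helper name.) */
const char *use_environment_locale(void)
{
    return setlocale(LC_ALL, "");
}

/* Queries the active locale without changing it.
 * (Hypothetical helper name.) */
const char *current_locale_name(void)
{
    return setlocale(LC_ALL, NULL);
}
```

A program would typically call use_environment_locale() once at startup, before any wide-character output. The returned string may be overwritten by a later setlocale() call, so copy it if it must be kept.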

Miserable programmer

I have always programmed on Windows and saved file contents directly as wchar_t, that is, two bytes per character. This causes two problems:

  • English content takes up too much space;
  • wchar_t strings saved on Windows are read incorrectly on Linux, where wchar_t defaults to UTF-32 and occupies 4 bytes in memory (I am stingy enough that this pains me).

There are two solutions:

  1. Keep saving in UTF-16, but on Linux convert to UTF-32 in memory after reading, paying attention to byte order (endianness);
  2. Store everything in UTF-8 or GBK, and have each platform convert to its own wchar_t representation after reading.
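The first solution can be sketched as follows, assuming the file was written on Windows as little-endian UTF-16 with no byte order mark. The function name utf16le_to_utf32 is hypothetical. Each 16-bit unit is assembled from two individual bytes, so the result does not depend on the byte order of the machine doing the reading.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes in_len bytes of little-endian UTF-16 into UTF-32 code points.
 * Returns the number of code points stored in out, or (size_t)-1 if the
 * input is malformed (odd length, unpaired surrogate) or out_cap is too
 * small. (utf16le_to_utf32 is a hypothetical helper name.) */
size_t utf16le_to_utf32(const uint8_t *in, size_t in_len,
                        uint32_t *out, size_t out_cap)
{
    size_t count = 0;

    if (in_len % 2 != 0) {
        return (size_t)-1;
    }
    for (size_t i = 0; i < in_len; i += 2) {
        /* Low byte first: this is the little-endian assumption. */
        uint32_t unit = (uint32_t)in[i] | ((uint32_t)in[i + 1] << 8);
        uint32_t cp = unit;

        if (unit >= 0xD800 && unit <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            if (i + 3 >= in_len) {
                return (size_t)-1;
            }
            uint32_t low = (uint32_t)in[i + 2] | ((uint32_t)in[i + 3] << 8);
            if (low < 0xDC00 || low > 0xDFFF) {
                return (size_t)-1;
            }
            cp = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            i += 2;                    /* consume the second unit too */
        } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
            return (size_t)-1;         /* low surrogate with no partner */
        }
        if (count >= out_cap) {
            return (size_t)-1;
        }
        out[count++] = cp;
    }
    return count;
}
```

On Linux the resulting 32-bit values can be copied into a wchar_t buffer, since wchar_t there holds one UTF-32 code point per element. For a big-endian file, the two bytes of each unit would be combined in the opposite order.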

Note: This write-up is not yet complete.

Objective: to sort out the definitions of the coding systems.

How to convert between the coding systems.

How each coding system is encoded (the storage format of the encoding?)

