Analysis of the character encodings used by operating system APIs and programming languages

Source: Internet
Author: User

1. What is the string encoding in the Java Runtime environment?

In Java, character encoding is governed by the JVM rather than directly by the operating system. A String is always held in memory as a sequence of UTF-16 code units, whatever the platform.

The default charset, which is used whenever text is converted to or from bytes without naming an encoding, is chosen by the JVM at startup, by default based on your system environment (since JDK 18 it is UTF-8 unless overridden).

We can get the default character encoding through the defaultCharset() method of the java.nio.charset.Charset class.

I installed JDK 1.7 on a 64-bit system, and the default character encoding I get is UTF-16, big-endian. (This puzzled me at first: my machine is little-endian, yet the virtual machine uses big-endian by default. The reason is that Java's UTF-16 charset always encodes in big-endian order, regardless of the hardware.)
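To check these values on your own machine, a small probe like the following may help. It is only a sketch: the class name DefaultCharsetProbe and the helper utf16Sample() are made up for illustration, and the printed default charset will differ between JDK versions and system settings.

```java
import java.nio.ByteOrder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetProbe {
    // Encodes "A" with Java's "UTF-16" charset, which writes a BOM followed by
    // big-endian code units on every platform.
    public static byte[] utf16Sample() {
        return "A".getBytes(StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // Default charset chosen at JVM startup (varies by JDK version and settings).
        System.out.println("default charset   = " + Charset.defaultCharset());
        // System property that can influence the default charset.
        System.out.println("file.encoding     = " + System.getProperty("file.encoding"));
        // Byte order of the underlying hardware, which is unrelated to the charset.
        System.out.println("native byte order = " + ByteOrder.nativeOrder());
        // Same result on little-endian and big-endian machines.
        System.out.println("UTF-16 of \"A\"     = " + Arrays.toString(utf16Sample()));
    }
}
```

On a little-endian machine the third line reports LITTLE_ENDIAN while the last line still starts with the big-endian mark, which is the situation described above.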

Look at the following code:

import java.nio.charset.Charset;

String name = "张三"; // "Zhang San"

Charset def = Charset.defaultCharset(); // system default encoding; on the author's machine def = UTF-16 (big-endian)
Charset utf8 = Charset.forName("UTF-8");
Charset gbk = Charset.forName("GBK");

byte[] bDefault = name.getBytes(); // bytes of name in the default encoding = [-2, -1, 95, 32, 78, 9]
byte[] bUtf16Big = name.getBytes(def); // bytes of name encoded as UTF-16 (big-endian) = [-2, -1, 95, 32, 78, 9]
byte[] bGbk = name.getBytes(gbk); // bytes of name encoded as GBK = [-43, -59, -56, -3]
byte[] bUtf8 = name.getBytes(utf8); // bytes of name encoded as UTF-8 = [-27, -68, -96, -28, -72, -119]

From this we can see that calling getBytes() on name without an argument produces the same bytes as encoding with the default charset, which here is UTF-16 (big-endian). The string itself is held in memory as UTF-16 code units.

An attentive reader may wonder why the two-character name produces [-2, -1, 95, 32, 78, 9], which is 6 bytes. Doesn't UTF-16 use two bytes per character?

It does. Read on:

The Unicode specification recommends the BOM as the way to mark byte order. BOM here does not mean "Bill of Materials" but "Byte Order Mark".

(Unicode is a character encoding standard designed by an international organization to accommodate all of the world's languages. The closely aligned ISO/IEC 10646 standard is formally named "Universal Multiple-Octet Coded Character Set", abbreviated UCS. For the purposes of this article, UCS can be read as the Unicode character set.)

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE" whose code point is U+FEFF. U+FFFE, on the other hand, is not a character in UCS, so it should never appear in real data. The UCS specification therefore recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.

This way, if the recipient receives the bytes FE FF first, the byte stream is big-endian; if it receives FF FE, the byte stream is little-endian.
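The receiving side of this rule can be sketched in a few lines of Java. The class BomSniffer and its methods are hypothetical names used only for this illustration, and the sketch handles UTF-16 marks only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Returns the charset implied by a leading UTF-16 BOM, or null if there is none.
    public static Charset detectUtf16Bom(byte[] data) {
        if (data.length < 2) {
            return null;
        }
        int b0 = data[0] & 0xFF; // mask to read the byte as unsigned
        int b1 = data[1] & 0xFF;
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    // Decodes the payload that follows the BOM; rejects input without a BOM.
    public static String decode(byte[] data) {
        Charset cs = detectUtf16Bom(data);
        if (cs == null) {
            throw new IllegalArgumentException("no UTF-16 BOM found");
        }
        return new String(data, 2, data.length - 2, cs);
    }

    public static void main(String[] args) {
        byte[] big = {(byte) 0xFE, (byte) 0xFF, 0x5F, 0x20, 0x4E, 0x09};
        byte[] little = {(byte) 0xFF, (byte) 0xFE, 0x20, 0x5F, 0x09, 0x4E};
        System.out.println(detectUtf16Bom(big));    // UTF-16BE
        System.out.println(detectUtf16Bom(little)); // UTF-16LE
        System.out.println(decode(big).equals(decode(little))); // true
    }
}
```

Both byte arrays decode to the same two characters; only the order of the bytes inside each 16-bit unit differs.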

We know that a single-byte character (char) has no byte-order issue, whereas multi-byte types such as ushort, int, long, and int64 do. GBK and UTF-8 byte streams are therefore read one byte at a time and need no byte-order indication, while a UTF-16 byte stream (in effect a sequence of ushort values) must state whether it is big-endian or little-endian.

Since Java's UTF-16 charset encodes in big-endian order (UTF-16BE) by default, the byte stream is preceded by (FE, FF) to signal that it is big-endian.

In fact, [-2, -1, 95, 32, 78, 9] is [FE, FF, 5F, 20, 4E, 09] in hexadecimal, so the first two bytes only indicate whether the stored byte stream is big-endian or little-endian, which makes it easy to recognize during transmission. They carry no other meaning.
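To see the mark and the byte order side by side, the same name can be encoded with each UTF-16 variant and printed in hexadecimal. This is a minimal sketch; Utf16Variants and hex() are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Variants {
    // Formats bytes as space-separated, two-digit uppercase hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two characters of the name, written as escapes so the result
        // does not depend on the encoding of the source file.
        String name = "\u5F20\u4E09";
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_16)));   // FE FF 5F 20 4E 09
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_16BE))); // 5F 20 4E 09
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_16LE))); // 20 5F 09 4E
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_8)));    // E5 BC A0 E4 B8 89
    }
}
```

Only the plain UTF-16 charset writes the mark when encoding; the explicit UTF-16BE and UTF-16LE charsets write the code units alone, because the byte order is already stated by the charset name.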

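To summarize:

(1) The Java default character encoding is determined by the virtual machine, and the byte order of its UTF-16 output does not depend on the system.

(2) In Unicode's UTF-16 encoding, a character takes 2 bytes regardless of language, except for supplementary characters outside the Basic Multilingual Plane, which take 4 bytes.

(3) A char in Java is a UTF-16 code unit, so a char in Java also occupies 2 bytes.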
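The difference between a char and a full character can be checked with a short sketch like the one below. CodeUnitCount and its helpers are made-up names, and U+1F600 is just one example of a character outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitCount {
    // Number of UTF-16 code units (Java chars) in the string.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes the string occupies when encoded as UTF-16BE (no BOM).
    public static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "\u5F20";                 // U+5F20, inside the Basic Multilingual Plane
        String supplementary = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(codeUnits(bmp) + " " + codePoints(bmp) + " " + utf16Bytes(bmp)); // 1 1 2
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)
                + " " + utf16Bytes(supplementary)); // 2 1 4
    }
}
```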

2. C and C++ languages

As we all know, C and C++ have no virtual machine the way Java does. Their libraries ultimately call the operating system's API, so the interfaces they end up invoking are necessarily system-dependent.

(1) Windows

The VC/VS C++ development environment implements the standard C++ interfaces, and to do so it must use the Windows API.

The most direct way to see which APIs the DLLs of the Win32 subsystem provide is to use WIN32DSM to view a DLL's export table. Doing so, we find that Win32 API functions that take strings generally come in two versions, such as CreateFileA() and CreateFileW(). There are exceptions, of course, such as the GetProcAddress function.

The A suffix stands for the ANSI code page, and the W suffix stands for wide characters, that is, Unicode characters. On Windows, "Unicode" generally means the UTF-16LE encoding (historically UCS-2). Let's look at the relationship between the A and W versions through an example.

Using WIN32DSM to view the assembly code of Gdi32.dll, we can see that CreateFileA() calls GdiGetCodePage() to get the current code page, then calls MultiByteToWideChar() to convert the input string, and then calls an internal function. CreateFileW() calls this internal function directly.

From this we can conclude that the Windows system API is implemented internally with UTF-16LE. So when using the Windows API, we should prefer the wide-character interfaces.
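The conversion step that an A version performs can be imitated in Java to make the data flow concrete. This is only an illustration of the idea, not the Windows implementation: AnsiToWideSketch and ansiToWide() are hypothetical names, and the Java charset "GBK" stands in for code page 936.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class AnsiToWideSketch {
    // Mimics the A-to-W step: decode the bytes using the given code page,
    // then re-encode the text as UTF-16LE, the form the W interfaces expect.
    public static byte[] ansiToWide(byte[] ansiBytes, String codePageCharset) {
        String decoded = new String(ansiBytes, Charset.forName(codePageCharset));
        return decoded.getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // The two-character name from section 1, encoded in GBK.
        byte[] gbkName = {-43, -59, -56, -3};
        byte[] wide = ansiToWide(gbkName, "GBK");
        System.out.println(Arrays.toString(wide)); // [32, 95, 9, 78]
    }
}
```

The output is the little-endian form of the code units 5F20 and 4E09, without a byte order mark, since the byte order is fixed by convention on the receiving side.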

Given that the Windows API is implemented with UTF-16LE, how is the text passed through single-byte (char*) interfaces interpreted?

The C runtime library provides the setlocale() function to set the locale, including the character encoding, that the current program uses. Through this function you tell the runtime which character encoding the char* strings in your program use.

To get the locale of the current program: char* pLocale = setlocale(LC_ALL, NULL); This shows that the default locale of a C/C++ program is the "C" locale.

To set the current program to the character encoding of the system environment: char* pLocale = setlocale(LC_ALL, ""); On a Simplified Chinese system, this is equivalent to setlocale(LC_ALL, ".936").
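Why the declared code page matters can be shown outside C as well. In the Java sketch below (CodePageMismatch and decodeAs() are made-up names for illustration), the same four bytes are decoded once as GBK and once as ISO-8859-1, which here plays the role of a wrongly assumed code page.

```java
import java.nio.charset.Charset;

public class CodePageMismatch {
    // Decodes the bytes under the assumption that they are in the named charset.
    public static String decodeAs(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Lists the code points of a string in U+XXXX form, so the result can be
    // inspected even on a console that cannot display the characters.
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] raw = {-43, -59, -56, -3};
        System.out.println(codePoints(decodeAs(raw, "GBK")));        // U+5F20 U+4E09
        System.out.println(codePoints(decodeAs(raw, "ISO-8859-1"))); // U+00D5 U+00C5 U+00C8 U+00FD
    }
}
```

With the right code page the bytes become the intended two characters; with the wrong one they become four unrelated Latin letters, which is the kind of garbling a correct locale setting is meant to prevent.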

(2) Linux

I am not very familiar with how the Linux system API is implemented, but according to information on the Internet, it uses UTF-8 character encoding.

Interested readers can study libstdc++, which implements the standard C++ library on Linux. Reading its source should reveal many places where it calls Linux system functions.

3. Summary:

(1) The Windows system API is implemented using UTF-16LE encoding, and the Linux system API is implemented using UTF-8 encoding.

(2) The Java default character encoding is determined by the JVM. Strings are UTF-16 in memory, and Java's UTF-16 charset writes big-endian (UTF-16BE) regardless of the system.

(3) The default for C/C++ is the "C" locale; a program can change its character encoding environment through setlocale().
