Java IO4: Character encoding

Source: Internet
Author: User


Character encoding is not strictly part of IO, but the byte streams covered earlier are followed by character streams, and character streams inevitably raise the question of character encoding. Since the concept shows up in many places in Java beyond IO, this article is devoted to it. Plenty of detailed material is available online; the goal here is only to make clear what each encoding scheme is for. In my opinion, unless your work demands it, there is no need to study character encoding in great depth.

Character Set and character encoding

The first concept is the difference between a character set and a character encoding:

1. Character set (charset)

A character set is the collection of all the abstract characters that a system supports. "Character" is the general term for every kind of letter and symbol, including national scripts, punctuation, graphic symbols, digits and so on. Common character sets include ASCII, GB2312, BIG5, GB18030 and Unicode.

2. Character encoding (encoding)

To handle the characters of a character set accurately, the computer has to encode them so that it can recognize and store text. A character encoding is therefore the rule that maps each character to a number (a code) that the computer can store and process.

ASCII code

The computer only knows the numbers 0 and 1 (strictly speaking not even 0 and 1, only on and off; 0 and 1 merely denote the state of a switch). Everything in software is identified by numbers, and a character shown on the screen is also a number underneath. The first computers, used in the United States, needed very few characters, so each character was represented by one number, and a single byte was more than enough to hold them all. In fact, the highest bit of the byte is 0 for every one of these characters, so the numbers lie between 0 and 127: for example, the character a corresponds to 97 and the character b to 98. The mapping between characters and numbers is fixed, and this set of rules is called ASCII (American Standard Code for Information Interchange).
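As a small illustration of the fixed character-to-number mapping, the sketch below prints the codes of two letters and checks that they fit in 7 bits. The class and method names (AsciiDemo, codeOf, isAscii) are made up for this example.

```java
class AsciiDemo {
    // Returns the numeric code of a character; for ASCII characters
    // this is the same number as the ASCII code.
    static int codeOf(char c) {
        return (int) c;
    }

    // True when the code fits in 7 bits, i.e. the highest bit of the byte is 0.
    static boolean isAscii(char c) {
        return c <= 127;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('a'));   // 97
        System.out.println(codeOf('b'));   // 98
        System.out.println(isAscii('a'));  // true
    }
}
```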

GB2312 and GBK

As computers spread to other countries, many countries brought their local character sets into the computer, greatly expanding the range of characters it had to handle. One byte cannot represent enough values to hold Chinese characters, so mainland China uses two bytes for each Chinese character, while the encoding of the original ASCII characters remains unchanged.

To distinguish one Chinese character from two ASCII characters, the highest bit of each of its two bytes is set to 1. Mainland China assigned a number to every Chinese character and in 1980 published an internal code specification for Chinese characters, GB2312. GB2312 covers the English characters of ASCII and adds 6,763 simplified Chinese characters plus some symbols outside ASCII.

GBK is a follow-up specification released in 1995 that extends GB2312 with many more Chinese characters (including traditional ones) and additional symbols. GBK is still the default system encoding of Chinese-language Windows today.
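The two-byte layout can be checked from Java. This is a minimal sketch that assumes the running JDK ships the GBK and GB2312 charsets (they are part of the extended charsets in typical JDK builds, but availability is not guaranteed); GbkDemo and hex are names invented for the example.

```java
import java.nio.charset.Charset;

class GbkDemo {
    // Encodes text with the named charset and returns the bytes as uppercase hex.
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the Chinese character "zhong" (middle), written as an escape
        System.out.println(hex(zhong, "GB2312")); // D6D0
        System.out.println(hex(zhong, "GBK"));    // D6D0
        System.out.println(hex("a", "GBK"));      // 61, ASCII is unchanged
    }
}
```

Both bytes of the Chinese character (0xD6 and 0xD0) have their highest bit set, while the ASCII letter keeps its single-byte code.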


Unicode

A character produced on one country's localized system and e-mailed to another country's localized system may not arrive as the original character; it may show up as one of the other country's characters or as garbage. The reason is that a computer holds no real characters, only numbers: sending a character by mail actually transmits the code that corresponds to it, and the same number may stand for different symbols in different countries and regions.

To remove the inconvenience of each country and region using its own localized encoding, all the symbols in the world were given one unified encoding, called Unicode. Characters no longer belong to a country or region; they are simply symbols shared by all of humanity. For example, the character 中 is no longer D6D0 as in GBK but 4E2D everywhere. If every computer system used this encoding, 4E2D would mean the Chinese character 中 anywhere. Unicode was originally designed so that every character occupies two bytes, which allows at most 65,536 characters.

Of course, 65,536 codes are not enough to hold every character in the world, so Unicode defines supplementary planes beyond the first 65,536 code points (the Basic Multilingual Plane). The Wikipedia article on Unicode planes explains the plane mapping.
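The sketch below looks at Unicode code points from Java: one character inside the first 65,536 code points and one outside. The class name CodePointDemo, the helper codePointHex and the choice of U+1F600 as the sample supplementary character are assumptions made for this illustration.

```java
class CodePointDemo {
    // Returns the first Unicode code point of s as uppercase hex.
    static String codePointHex(String s) {
        return String.format("%X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        // The Chinese character with GBK code D6D0 has Unicode code point 4E2D.
        System.out.println(codePointHex("\u4e2d")); // 4E2D

        // U+1F600 lies beyond the first 65,536 code points.
        int supplementary = 0x1F600;
        System.out.println(Character.isSupplementaryCodePoint(supplementary)); // true
        // Java needs two char values (a surrogate pair) to hold it.
        System.out.println(Character.charCount(supplementary)); // 2
    }
}
```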

UTF-8 and UTF-16

Unicode is a character set standard; how it is stored in a computer is a separate question. Two encodings of Unicode are in common use:

1. UTF-16. A Unicode transformation format built on 16-bit (two-byte) units, hence the name. Every character in the Basic Multilingual Plane takes exactly one unit; characters in the supplementary planes take two units (a surrogate pair, four bytes). Because most everyday characters are a single two-byte unit, UTF-16 is convenient and simplifies string operations considerably.

2. UTF-8. UTF-16 spends at least two bytes on every character. That is simple, but a large share of characters need only one byte, so storing them in two doubles the space used. UTF-8 is a variable-length encoding: different ranges of code points are encoded with different numbers of bytes. The original design allowed 1 to 6 bytes per character; the current standard (RFC 3629) limits it to 1 to 4.
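To compare the two encodings concretely, this sketch counts the bytes each one needs for an ASCII letter, a Chinese character and a supplementary-plane character. UtfLengthDemo and its helpers are hypothetical names; UTF_16BE is used rather than UTF_16 because encoding with the latter prepends a two-byte byte-order mark that would distort the count.

```java
import java.nio.charset.StandardCharsets;

class UtfLengthDemo {
    // Number of bytes needed to store s in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to store s in UTF-16 (big-endian, no byte-order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "a";
        String chinese = "\u4e2d";             // code point U+4E2D
        String supplementary = "\uD83D\uDE00"; // code point U+1F600 as a surrogate pair

        System.out.println(utf8Length(ascii) + " / " + utf16Length(ascii));                 // 1 / 2
        System.out.println(utf8Length(chinese) + " / " + utf16Length(chinese));             // 3 / 2
        System.out.println(utf8Length(supplementary) + " / " + utf16Length(supplementary)); // 4 / 4
    }
}
```

ASCII text is half the size in UTF-8, while common Chinese characters are smaller in UTF-16, which is one reason neither encoding is better in every case.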

Java and character encoding

Characters in Java are Unicode (a char is one UTF-16 code unit). Using Unicode internally is what lets Java support the full range of native platform character sets while remaining cross-platform, but display output and keyboard input still use the local platform encoding. Converting between the two is therefore unavoidable.
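The conversion between Java's internal Unicode strings and external bytes can be sketched as an encode step followed by a decode step. The names RoundTripDemo and roundTrip are invented for this example; the point is only that both steps must agree on the charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes text to bytes with one charset, then decodes the bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters

        // Same charset on both sides: the text survives.
        String same = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(text)); // true

        // Different charsets: each of the six UTF-8 bytes is read as one Latin-1 character.
        String mixed = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(mixed.equals(text)); // false
        System.out.println(mixed.length());     // 6

        // The charset Java falls back to when none is given depends on the platform.
        System.out.println(Charset.defaultCharset());
    }
}
```

Passing the charset explicitly, as done here, avoids depending on the platform default.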

Look at a very simple example:

 Public Static void throws exception{    //  Here the string is encoded into GB2312 byte b[] by the GetBytes () method ()     = "Everyone learns the Java language together ". GetBytes (" GB2312 ");     New File ("D:/files/encoding.txt");     New fileoutputstream (file);    Out.write (b);    Out.close ();}

Open the file and the text displays normally, with no encoding problems. Now change the encoding to ISO8859-1:

 Public Static void throws exception{    //  Here the string is encoded into GB2312 byte b[] by the GetBytes () method ()     = "Everyone learns the Java language together ". GetBytes (" iso8859-1 ");     New File ("D:/files/encoding.txt");     New fileoutputstream (file);    Out.write (b);    Out.close ();}

Look at the file again: this time the content is garbled. To see why, check the encoding settings that the JDK picks up from its environment. The following program prints the JDK's system properties:

    public static void main(String[] args) {
        System.getProperties().list(System.out);
    }

Here is the full output, which is a bit long:

    -- listing properties --
    java.runtime.name=Java(TM) SE Runtime Environment
    sun.boot.library.path=E:\MyEclipse10\Common\binary\com.sun ....
    java.vm.version=11.3-b02
    java.vm.vendor=Sun Microsystems Inc.
    java.vendor.url=
    path.separator=;
    java.vm.name=Java HotSpot(TM) 64-Bit Server VM
    sun.os.patch.level=
    java.vm.specification.name=Java Virtual Machine Specification
    user.dir=F:\Code\MyEclipse\TestIO
    java.runtime.version=1.6.0_13-b03
    java.awt.graphicsenv=sun.awt.Win32GraphicsEnvironment
    java.endorsed.dirs=E:\MyEclipse10\Common\binary\com.sun ....
    os.arch=amd64
    java.io.tmpdir=C:\Users\dell1\AppData\Local\Temp
    line.separator=
    java.vm.specification.vendor=Sun Microsystems Inc.
    user.variant=
    os.name=Windows Vista
    sun.jnu.encoding=GBK
    java.library.path=E:\MyEclipse10\Common\binary\com.sun ....
    java.specification.name=Java Platform API Specification
    java.class.version=50.0
    sun.management.compiler=HotSpot 64-Bit Server Compiler
    os.version=6.2
    user.home=C:\Users\dell1
    user.timezone=
    file.encoding=GBK
    java.specification.version=1.6
    java.class.path=F:\Code\MyEclipse\TestIO\bin
    java.vm.specification.version=1.0
    java.home=E:\MyEclipse10\Common\binary\com.sun ....
    java.specification.vendor=Sun Microsystems Inc.
    user.language=zh
    java.vm.info=mixed mode
    java.version=1.6.0_13
    java.ext.dirs=E:\MyEclipse10\Common\binary\com.sun ....
    sun.boot.class.path=E:\MyEclipse10\Common\binary\com.sun ....
    java.vendor=Sun Microsystems Inc.
    file.separator=
    java.vendor.url.bug= ...
    sun.cpu.endian=little
    sun.desktop=windows
    sun.cpu.isalist=amd64
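When only the encoding matters, the relevant values can be read directly instead of scanning the whole listing. This is a rough sketch: EncodingProps and defaultCharsetName are names made up here, and sun.jnu.encoding is an implementation-specific property that may be absent on some JVMs.

```java
import java.nio.charset.Charset;

class EncodingProps {
    // Name of the charset Java uses when no charset is specified explicitly.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The printed values depend on the platform and JVM settings.
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("sun.jnu.encoding = " + System.getProperty("sun.jnu.encoding"));
        System.out.println("default charset  = " + defaultCharsetName());
    }
}
```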

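Note the file.encoding=GBK entry (line 34 of the original numbered listing): on this machine the JDK's default encoding is GBK. GBK is an extension of GB2312, so a file written as GB2312 is read back without problems. ISO8859-1, however, is a single-byte Latin encoding with no codes for Chinese characters, so getBytes("ISO8859-1") cannot represent them (Java substitutes ? for each character it cannot map) and the file content is bound to be garbled.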
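The loss can be reproduced without writing a file. The sketch below encodes a short mixed string with ISO-8859-1 and decodes it again; Latin1Demo and its two helpers are hypothetical names chosen for the illustration.

```java
import java.nio.charset.StandardCharsets;

class Latin1Demo {
    // Encodes text as ISO-8859-1 and decodes it again with the same charset.
    static String encodeThenDecodeLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Reports whether every character of text has an ISO-8859-1 code.
    static boolean canEncode(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "\u5b66Java"; // one Chinese character followed by "Java"

        // The Chinese character has no ISO-8859-1 code, so getBytes() writes '?' for it.
        System.out.println(encodeThenDecodeLatin1(text)); // ?Java

        System.out.println(canEncode(text));   // false
        System.out.println(canEncode("Java")); // true
    }
}
```

Checking canEncode before writing is one way to catch this kind of data loss early, since the substituted ? cannot be converted back to the original character.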
