Encoding relationship between characters and bytes in Java

Last Update:2018-12-06 Source: Internet

Author: User

First, a piece of background knowledge: in Notepad and other text tools we always write characters. Nobody writes bytes directly (bytes can be written, but that requires a special editor such as a hex editor). So what we type is characters, but what the disk actually stores is bytes.

This means a conversion between characters and bytes is needed, and Notepad takes care of it for us. Open Notepad and save a file. You will find several storage formats to choose from:
ANSI format: the system's legacy code page (for example GBK on Simplified Chinese Windows). It is not plain ASCII, although ASCII characters are stored as one byte each in these code pages.
Unicode format: stores the text as UTF-16 in little-endian byte order.
Unicode big endian format: the same UTF-16 code units as Unicode, but the two bytes of each unit are written in the opposite order (high byte first).
UTF-8: stores the text as UTF-8. If you have read the two earlier articles, the encoding described here will be very familiar. UTF-8 is one encoding form of Unicode.
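The same idea can be tried from Java. The following is only a minimal sketch, not what Notepad does internally: the class name `SaveAsDemo`, the helper `saveAndReadBack`, and the sample text "A" are all made up for illustration. It writes characters through a `Writer` with a chosen charset and then reads back the raw bytes that ended up on disk. Note that these Java charsets do not add the leading marker that Notepad writes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class SaveAsDemo {
    // Hypothetical helper: writes characters to a temporary file using the
    // given charset, then returns the bytes that were actually stored.
    static byte[] saveAndReadBack(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("save-as-demo", ".txt");
            try {
                try (Writer writer = Files.newBufferedWriter(file, charset)) {
                    writer.write(text); // we hand over characters...
                }
                return Files.readAllBytes(file); // ...and get bytes back
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "A"; // sample text chosen for this sketch
        Charset[] formats = {
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_16LE, // byte order of Notepad's "Unicode"
            StandardCharsets.UTF_16BE, // byte order of "Unicode big endian"
            StandardCharsets.UTF_8
        };
        for (Charset format : formats) {
            byte[] onDisk = saveAndReadBack(text, format);
            System.out.println(format.name() + ": " + onDisk.length + " byte(s)");
        }
    }
}
```

The one character "A" takes one byte in US-ASCII and UTF-8 but two bytes in either UTF-16 byte order, which is the conversion that the choice of storage format controls.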

For example, enter the two characters "连通" (U+8FDE and U+901A) in Notepad.

1. When we save the file with Save As and choose Unicode, the characters we see are still "连通", but what is stored is the pair of 16-bit values
8FDE (连) 901A (通). Because Notepad's Unicode format is little endian, the low byte of each value is written first, so the bytes on disk are DE 8F 1A 90. Unicode is an international standard that assigns a unique code to every character in the world. To obtain the Unicode code of a character, you can search for it online. The simplest method is to open a Word document, enter the character, leave the cursor right after the character, and press Alt + X; Word will convert the character to its Unicode code. Here we can also see that when Unicode storage is used, each of these Chinese characters occupies two bytes (rarer characters outside the Basic Multilingual Plane occupy four).
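To check these values from Java, here is a small sketch. The class name `Utf16Demo` and the `hex` helper are invented for this example. It prints the 16-bit code units of the string and then the bytes produced in both byte orders.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Demo {
    // Formats bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding.
        String text = "\u8FDE\u901A";

        // A Java String is a sequence of UTF-16 code units.
        for (int i = 0; i < text.length(); i++) {
            System.out.println(Integer.toHexString(text.charAt(i))); // 8fde, then 901a
        }

        // Little endian: low byte of each unit first.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16LE))); // DE 8F 1A 90
        // Big endian: high byte of each unit first.
        System.out.println(hex(text.getBytes(StandardCharsets.UTF_16BE))); // 8F DE 90 1A
    }
}
```

Both byte orders use two bytes per character here; only the order inside each pair differs.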

2. When we save the file again and choose UTF-8, the characters we see are still "连通", but the bytes stored on the disk have changed to
E8 BF 9E (连) E9 80 9A (通). This is how UTF-8 stores the text. As for why UTF-8 produces these bytes, you can read the two earlier articles. We can see that UTF-8 uses three bytes to store each of these Chinese characters.
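As a rough illustration of where the three bytes come from, the sketch below applies the UTF-8 three-byte pattern (1110xxxx 10xxxxxx 10xxxxxx, used for code points U+0800 to U+FFFF) by hand and compares the result with what Java's own encoder produces. The class name `Utf8ThreeByteDemo` and the method `encodeThreeByte` are made up for this example, and the method deliberately covers only the three-byte range.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8ThreeByteDemo {
    // Hand-written encoder for the three-byte UTF-8 range only.
    static byte[] encodeThreeByte(int codePoint) {
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint < 0x0800 || codePoint > 0xFFFF || surrogate) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110xxxx: top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10xxxxxx: middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10xxxxxx: low 6 bits
        };
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeByte(0x8FDE);
        byte[] library = "\u8FDE".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

For 0x8FDE the three bytes come out as E8 BF 9E, matching the bytes listed above.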

In addition, we also need to know: how does a computer tell which storage format a Notepad file uses?
In other words, when 8FDE (连) 901A (通) is stored as Unicode, how does the computer know that it is Unicode, so that it can be decoded as Unicode and restored to "连通"? And how does the computer know that E8 BF 9E (连) E9 80 9A (通) is stored as UTF-8?

The answer is a marker: when storing the bytes, Notepad first writes a few bytes at the very beginning of the file (known as a byte order mark, or BOM) that indicate whether the content that follows is stored as UTF-8 or as Unicode.

For example,

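1. "连通" stored as Unicode. The bytes on disk are actually as follows:

FF FE DE 8F 1A 90

The first two bytes, FF FE, are the marker. They tell the computer that this document is stored as Unicode in little-endian order, which is why the values 8FDE and 901A appear with their low byte first.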

2. "连通" stored as UTF-8. The bytes on disk are actually as follows:

EF BB BF E8 BF 9E E9 80 9A

The first three bytes, EF BB BF, are the marker. They tell the computer that this document is stored as UTF-8.
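The marker check can be sketched in Java as follows. This is only an illustration under stated assumptions: the class `BomSniffer` and its methods are invented names, it recognizes only the marks FF FE, FE FF (the big-endian counterpart), and EF BB BF, and it does not try to tell UTF-16 little endian apart from UTF-32, whose mark also starts with FF FE. Files without any marker fall back to a charset supplied by the caller.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {
    // Returns true if data begins with the given unsigned byte values.
    static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the charset announced by a leading marker, or null if none is found.
    static Charset detect(byte[] data) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    // Decodes the bytes after the marker; uses the fallback when there is no marker.
    static String decode(byte[] data, Charset fallback) {
        Charset detected = detect(data);
        if (detected == null) {
            return new String(data, fallback);
        }
        int skip = detected.equals(StandardCharsets.UTF_8) ? 3 : 2;
        return new String(data, skip, data.length - skip, detected);
    }

    public static void main(String[] args) {
        byte[] unicodeFile = {(byte) 0xFF, (byte) 0xFE, (byte) 0xDE, (byte) 0x8F, 0x1A, (byte) 0x90};
        byte[] utf8File = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                           (byte) 0xE8, (byte) 0xBF, (byte) 0x9E,
                           (byte) 0xE9, (byte) 0x80, (byte) 0x9A};
        String fromUnicode = decode(unicodeFile, StandardCharsets.UTF_8);
        String fromUtf8 = decode(utf8File, StandardCharsets.UTF_8);
        System.out.println(fromUnicode.equals(fromUtf8)); // true
    }
}
```

Both byte sequences decode to the same two characters, which is the point of the marker: different bytes on disk, identical text once the right decoder is chosen.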

This article is an English version of an article that was originally published in Chinese on aliyun.com and is provided for information purposes only.