First, a bit of common sense: in Notepad and other text tools we type characters. Nobody writes bytes directly (it can be done, but it takes a special tool such as a hex editor). So what we write is characters, but what the disk actually stores is bytes.
That means a conversion has to happen somewhere, and Notepad takes care of it for us. Open Notepad and choose Save As. You will find several storage formats to choose from:
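ANSI format: the system's default legacy code page (for example GBK on Simplified Chinese Windows). It matches ASCII only for plain English characters.
Unicode format: stores the text as UTF-16 in little-endian byte order.
Unicode big endian format: the same as Unicode, except that the two bytes of each 16-bit unit are stored high byte first.
UTF-8: stores the text as UTF-8. If you have read the previous two articles, the encoding described here will already be familiar. UTF-8 is one implementation of Unicode.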
For example, type "连通" (Chinese for "connected") into Notepad.
1. Save the file from Notepad in the Unicode format. The characters we see are still "连通" ("connected"), but what is stored on the disk is the pair of code units
8FDE (连) 901A (通). Unicode is an international standard that assigns a unique code to every character in the world. To find the Unicode code of a character, you can search for it online. The simplest method is to open a Word document, type the character, put the cursor right after it, and press Alt + X; Word then converts the character to its Unicode code. Because this format is little endian, the low byte of each unit is written first, so the byte order on disk is DE 8F 1A 90. We can also see that when Unicode is used to store these Chinese characters, each one occupies two bytes.
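If Word is not at hand, the same lookup can be sketched in a few lines of Python. This is only an illustration, assuming the two example characters are U+8FDE and U+901A (the codes given above) and that Notepad's "Unicode" option corresponds to Python's `utf-16-le` codec; the variable names are made up for the example.

```python
# The two example characters, written as escapes so the file encoding does not matter.
text = "\u8fde\u901a"  # 连通

# ord() returns the Unicode code point of a single character.
code_points = [format(ord(ch), "04X") for ch in text]
print(code_points)  # ['8FDE', '901A']

# UTF-16 little endian stores the low byte of each 16-bit unit first.
utf16_le = text.encode("utf-16-le")
print(utf16_le.hex(" "))  # de 8f 1a 90
```

Note that the printed bytes are the code units with their two bytes swapped, which is what "little endian" means here.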
2. Save the file from Notepad again, this time as UTF-8. The characters we see are still "连通" ("connected"), but the bytes stored on the disk have changed to
E8 BF 9E (连) E9 80 9A (通). This is how UTF-8 stores them. As for why UTF-8 produces these bytes, see the previous two articles. We can see that UTF-8 uses three bytes to store each of these Chinese characters.
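The UTF-8 bytes can be checked the same way. A minimal sketch, again assuming the example characters are U+8FDE and U+901A:

```python
text = "\u8fde\u901a"  # 连通

# Encode the characters to UTF-8 and show the resulting bytes in hex.
utf8 = text.encode("utf-8")
print(utf8.hex(" "))  # e8 bf 9e e9 80 9a

# Count how many bytes each character needs in UTF-8.
bytes_per_char = [len(ch.encode("utf-8")) for ch in text]
print(bytes_per_char)  # [3, 3]

# Decoding the bytes as UTF-8 gives the original characters back.
roundtrip = utf8.decode("utf-8")
```

The characters on screen are identical in both cases; only the bytes produced by the chosen encoding differ.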
There is one more thing we need to know: how does the computer tell which storage format a Notepad file uses?
In other words, when the file holds 8FDE 901A stored as Unicode, how does the computer know that these are Unicode codes, so that it can decode them as Unicode and restore the original characters? And how does it know that E8 BF 9E E9 80 9A was stored as UTF-8?
The answer is a mark. When Notepad stores the bytes, it first writes a mark at the very beginning of the file that says whether what follows is UTF-8 or Unicode. This mark is known as the byte order mark (BOM).
For example,
1. "连通" ("connected") stored as Unicode. The bytes on disk are actually:
FF FE DE 8F 1A 90
The first two bytes, FF FE, are the mark. They tell the computer that this document is stored as Unicode (little endian). DE 8F and 1A 90 are the codes 8FDE and 901A with the low byte first.
2. "连通" ("connected") stored as UTF-8. The bytes on disk are actually:
EF BB BF E8 BF 9E E9 80 9A
The first three bytes, EF BB BF, are the mark. They tell the computer that this document is stored as UTF-8.
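The check a program makes on these leading bytes can be sketched as follows. This is not Notepad's actual code, just an illustration of the idea using the BOM constants from Python's standard `codecs` module; the function names `sniff_bom` and `decode_with_bom` and the UTF-8 fallback for files without a mark are assumptions made for the example.

```python
import codecs


def sniff_bom(data: bytes):
    """Return a Python codec name based on the leading mark, or None if there is none."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE ("Unicode")
        return "utf-16"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF ("Unicode big endian")
        return "utf-16"
    return None


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode file bytes, using the mark if present (the fallback is an assumption)."""
    # Both "utf-8-sig" and "utf-16" consume the mark while decoding.
    return data.decode(sniff_bom(data) or fallback)


# The two byte sequences from the examples above.
unicode_file = bytes.fromhex("FF FE DE 8F 1A 90")
utf8_file = bytes.fromhex("EF BB BF E8 BF 9E E9 80 9A")

print(decode_with_bom(unicode_file) == decode_with_bom(utf8_file))  # True
```

Both files decode to the same two characters, even though their bytes on disk are completely different. A file saved as ANSI has no mark at all, so a program has to guess or be told its encoding.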