Learning notes for ASCII, GB2312, GBK, UTF-8/utf8, ANSI, Unicode

Last Update:2014-08-23 Source: Internet

Author: User


Continuing from the previous study session, here are my own notes. I always feel that learning without taking notes does not stick. I admire people who can remember everything they have learned. Unfortunately, I am not one of them, so I have to write things down. Even if I forget something later, I can look back and find it. After all, these are words I typed myself.

The notes have already reached sixteen thousand words. Keep going!

Today the company had a network problem, so we were told to work from home over Skype. As a result, I was not disciplined enough and slacked off. So, back to learning MySQL! I have only just graduated from college, and what I have learned and written down here is very basic, so if experts happen to read this, please be lenient. I am working through a video course. Every time I watch it I get anxious and excited and want to rush through it. Here I want to share my thoughts on watching videos versus reading books:

Watching videos suits beginners and people with a relatively weak foundation. We can follow along and practice with the examples that the teachers give. However, watching videos takes a lot of time, and studying alone in a room with internet access requires self-control. Still, as long as you have a plan and do not rush, the material will stay with you.

Reading books suits people who already have some foundation. With most books, you only need to understand and master part of the content, treating it as a way to improve and to work through difficulties met on the job. Reading is fast. However, much of what you read is soon forgotten, because not every knowledge point in a book is used every day, and what goes unused for long is forgotten. You do not need to read every page; concentrate on the key points. No matter how much I have read, can I really remember everything I went through?

I still remember the garbled-text problem I ran into while working on a project in March. It took me three days to solve, but I never understood the cause, so studying this topic now fulfills a wish of mine.

What I am watching is Shiba's course "MySQL from getting started to mastering", and I would like to thank him for what I now understand about MySQL. The Internet is so great that we can learn a lot for free. If you want to learn something, you can find it online. If you cannot find it, you can leave a message. I have shared 1 TB of resources on my Baidu online storage; you probably could not get through it all in a lifetime. All right, on to the notes.

ASCII

The earliest character set, with code values from 0 to 127, stored in one byte.
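
As a quick check of the one-byte claim, here is a minimal Python sketch. The sample string is only an illustration and is not from the original notes.

```python
# Every ASCII character has a code value in 0-127 and fits in one byte.
text = "Hello, MySQL!"  # sample string chosen for illustration

codes = [ord(ch) for ch in text]   # the number assigned to each character
encoded = text.encode("ascii")     # the bytes actually stored

print(codes)
print(len(encoded) == len(text))   # one byte per character
```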

GB2312

In China, about 3000 Chinese characters are in common use, not to mention the rare ones!

One byte is clearly not enough.

Idea: use two bytes per character.

[] []

0000 0000 0000 0000

1111 1111 1111 1111

0 -> 65535: that is 65536 combinations, more than 60 thousand, which is enough.

GB2312 character set

Example: [202 197]

[69 197] What does this mean? Should 69 and 197 be read together as one character, or separately? There is an ambiguity here, because a single byte with a value of 127 or less is exactly an ASCII value.

If bytes are always read strictly in pairs, the sequence can be read as Chinese,

but then the encoding is not compatible with the English (ASCII) characters, whose values are 127 or less.

Q: How can we stay compatible with ASCII and still use two bytes for Chinese characters?

ASCII 0-127

0xxx xxxx

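GB2312 does not use 0-127 for its two-byte characters at all.

Both bytes of a GB2312 character (in the usual EUC-CN form) are above 160: the first byte is in [161-247] and the second in [161-254].

This reduces the number of available combinations to fewer than 10 thousand (a 94 x 94 grid, 8836 positions), of which GB2312 assigns 6763 Chinese characters and 682 other symbols.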
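
The byte ranges can be checked with Python's built-in "gb2312" codec. This is a small sketch; the sample text (one ASCII letter followed by the character U+4E2D) is an assumption chosen for illustration.

```python
# Encode a mixed string with the built-in "gb2312" codec and inspect the bytes.
text = "A\u4e2d"  # "A" followed by the Chinese character U+4E2D

data = text.encode("gb2312")
values = list(data)

# Bytes below 128 are plain ASCII; bytes of a Chinese character are all high.
ascii_bytes = [b for b in values if b < 128]
high_bytes = [b for b in values if b >= 128]

print(values)       # [65, 214, 208]
print(ascii_bytes)  # [65]
print(high_bytes)   # [214, 208]
```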

GBK

Fully compatible with GB2312, but larger: it can encode more characters.

GBK and GB2312 both use two bytes. How is the capacity expanded?

A: In GBK the first byte is in [129-254], and the second (low) byte is no longer limited to values above 127. It ranges over [64-254], excluding 127, so values below 128 can be used too.

Example: 148 35 65 179 82

Q: How many Chinese characters and English letters are there?

A: Chinese, English, Chinese: that is, two Chinese characters and one English letter. (Strictly speaking, a real GBK second byte is never below 64, so 35 would not occur there; the example only illustrates the reading rule.)

The rule in GBK: when a byte is greater than 127, it starts a Chinese character, and one more byte follows it that must be read as well. When a byte is 127 or less, it is an English (ASCII) character, and no byte follows as part of it.

Capacity: 21003 Chinese characters and 883 symbols, plus 1894 user-defined code positions.
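
The reading rule can be sketched in Python as follows. The function name `split_gbk_units` is hypothetical, and the sketch applies only the simplified "greater than 127 means a second byte follows" rule from the notes, without validating the second byte.

```python
def split_gbk_units(data: bytes) -> list:
    """Split bytes into 1-byte (ASCII) and 2-byte (Chinese) units.

    Simplified rule from the notes: a byte above 127 is a lead byte and
    takes the next byte with it; a byte of 127 or less stands alone.
    The range of the second byte is not validated.
    """
    units = []
    i = 0
    while i < len(data):
        if data[i] > 127:
            units.append(data[i:i + 2])  # lead byte plus the byte after it
            i += 2
        else:
            units.append(data[i:i + 1])  # a single ASCII byte
            i += 1
    return units


sample = bytes([148, 35, 65, 179, 82])  # the example sequence from the notes
units = split_gbk_units(sample)
print([list(u) for u in units])  # [[148, 35], [65], [179, 82]]
```

The same function splits real GBK data, for example the output of `"a\u4e2db".encode("gbk")`, into one ASCII byte, one two-byte unit, and one more ASCII byte.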

ANSI, Unicode, UTF-8

In China, two bytes are read as GBK:

[137] [134] -> might correspond to -> a Chinese character (for example, 中)

So what about Japan?

What do [137] [134] correspond to there?

Japan uses the JIS character set, so the same bytes mean something else.

ANSI stands for the local character set.

It depends on your local language settings. For example, on a Chinese system ANSI means GBK, and on a Japanese system ANSI means JIS (Shift_JIS).
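
To see how the same bytes read differently under two "ANSI" character sets, here is a small Python sketch. The byte pair 214 208 is an assumption chosen because it is the real GBK encoding of U+4E2D; it is not the hypothetical [137] [134] pair from the notes.

```python
# The same two bytes, decoded under two different local character sets.
raw = bytes([214, 208])  # GBK bytes for the Chinese character U+4E2D

as_gbk = raw.decode("gbk")         # read on a Chinese system
as_sjis = raw.decode("shift_jis")  # read on a Japanese system

print(repr(as_gbk), repr(as_sjis))
print(as_gbk == as_sjis)  # False: same bytes, different text
```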

With the multi-byte problem solved, another problem appears: how do we make the character sets of different countries compatible with each other?

Answer: Unicode (the universal code), the ultimate solution.

What is Unicode? It is a worldwide code table. Every character in the world is assigned one unique number, so nothing gets mixed up.

Capacity: the original four-byte design allowed up to 2^31 numbers, more than 2 billion, an astronomical figure. The standard is now limited to U+0000 through U+10FFFF, which is 1,114,112 numbers and still more than enough.

However, the characters we usually use are concentrated in the first 65536 numbers.

Therefore, two bytes are enough for most everyday text.

Note: Unicode only assigns numbers. If each number is stored directly as four bytes, the stored bytes are not compatible with ASCII, even though the first 128 Unicode numbers are the same as the ASCII values.

Because Unicode is only responsible for assigning numbers, other schemes can shorten the bytes without changing the Unicode numbers.

00 00 00 41 (hex, four bytes) -> A

Simplified

41 (hex, one byte) -> A

The wasted high-order zero bytes are discarded according to certain rules, and the result is a specific encoding format. (Conjecture: the simplified encoding saves bandwidth and makes network transmission faster, because a character that needed four bytes now needs only one.)

It is this simplified encoding that is stored and transmitted over the network. Unicode is only responsible for assigning numbers and does not define the actual bytes. Note the difference! That is to say, Unicode is the character set and UTF-8 is a character encoding. I found a blog post on this topic and recommend it; the following is taken from it.
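
The difference between the Unicode number and the bytes of an encoding can be seen in a short Python sketch. The two sample characters are chosen for illustration; UTF-32 (big-endian) stands in here for "store the number directly in four bytes".

```python
# One Unicode number per character, but different bytes per encoding.
table = {}
for ch in ("A", "\u4e2d"):
    code_point = ord(ch)           # the number Unicode assigns
    utf32 = ch.encode("utf-32-be") # the number stored directly in four bytes
    utf8 = ch.encode("utf-8")      # the shortened, variable-length form
    table[ch] = (code_point, utf32, utf8)
    print(f"U+{code_point:04X}", utf32.hex(), utf8.hex())
```

For "A" the four-byte form is 00000041 and the UTF-8 form is the single byte 41, which is exactly the ASCII byte.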

Basic knowledge about character sets and encoding:

1. Basic knowledge

Information stored in a computer is represented as binary numbers. The English letters, Chinese characters, and other characters we see on the screen are the result of converting those binary numbers. Put simply, storing a character in the computer according to some rule, for example deciding which number represents 'A', is called "encoding". Conversely, parsing the binary numbers stored in the computer and displaying them is called "decoding", much like encryption and decryption in cryptography. If the wrong decoding rule is used, 'A' may be parsed as 'B' or turn into garbled text.

Character Set: a collection of all abstract characters supported by the system. A character is a general term for all types of texts and symbols, including Chinese characters, punctuation marks, graphical symbols, and numbers.

Character encoding: a set of rules that pairs a set of natural-language characters (such as an alphabet or a syllabary) with a set of something else (such as numbers or electrical pulses). That is, it establishes a correspondence between a set of symbols and a number system, which is a basic technique of information processing. People usually express information with a collection of symbols (usually text). Computer-based information processing systems store and process information using combinations of different states of their components (hardware). A combination of component states can represent a number in a number system. Character encoding therefore means converting symbols into numbers that a computer can accept, called digital codes.

2. Common character sets and character encodings

Common character sets: the ASCII character set, the GB2312 character set, the Big5 character set, the GB18030 character set, and the Unicode character set. To process the characters of these sets accurately, a computer must encode them so that it can recognize and store all kinds of text.

Blog: http://www.cnblogs.com/skynet/archive/2011/05/03/2035105.html

Continue the above discussion

UTF-8/utf8

So how is the simplification for transmission done?

Of the simplified formats, UTF-8 (8-bit Unicode Transformation Format) is the best known.

The simplification is like the relationship between an original file and its compressed version.

Q: Given a Unicode character, what is its UTF-8 binary value?

The Unicode number can be derived from the UTF-8 bytes, and the UTF-8 bytes can be derived from the Unicode number.

Q: How many bytes does a UTF-8 character occupy?

First of all, the length cannot be fixed. Otherwise, what would be the point of compressing?

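Let's look at the correspondence table:

Bytes   Byte pattern (x = bits of the Unicode number)
1       0xxxxxxx
2       110xxxxx 10xxxxxx
3       1110xxxx 10xxxxxx 10xxxxxx
4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5       111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6       1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

UTF-8 is variable length: 1-6 bytes in the original design. The current standard (RFC 3629) stops at 4 bytes, which is enough for every Unicode number up to U+10FFFF.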

Variable length! Then how is the character boundary determined? That is, how does the computer know how many bytes to take so that each character is read correctly?

Look carefully at the byte patterns:

When the first byte starts with 0 (0xxxxxxx), take one byte; with 110 (110xxxxx 10xxxxxx), take two bytes; with 1110, take three bytes; with 11110, take four bytes; with 111110, take five bytes; and with 1111110, take six bytes. Every following byte of the same character starts with 10.
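
The boundary rule can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only the 1- to 4-byte forms, since current UTF-8 does not use the 5- and 6-byte forms. It does not validate the continuation bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a character occupies, judged from its first byte."""
    if first_byte >> 7 == 0b0:      # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx
        return 4
    raise ValueError(f"not a valid UTF-8 first byte: {first_byte:#04x}")


def split_utf8(data: bytes) -> list:
    """Cut a UTF-8 byte string into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


# Sample text with 1-, 2-, 3- and 4-byte characters (chosen for illustration).
sample_text = "a\u00e9\u4e2d\U0001F600"
pieces = split_utf8(sample_text.encode("utf-8"))
print([len(p) for p in pieces])  # [1, 2, 3, 4]
```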

In UTF-8, a commonly used Chinese character occupies three bytes.
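
To show where the three bytes come from, here is a hand-written encoder sketch in Python. The function `encode_utf8` is hypothetical and for illustration only (Python's own `str.encode("utf-8")` is the real tool); it does not reject invalid values such as surrogates.

```python
def encode_utf8(code_point: int) -> bytes:
    """Turn a Unicode number (up to U+10FFFF) into UTF-8 bytes by hand."""
    if code_point < 0x80:        # fits in 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # fits in 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:     # fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


zhong = encode_utf8(0x4E2D)  # U+4E2D, a common Chinese character
print(zhong.hex(), len(zhong))  # e4b8ad 3
```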

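Inference: when do garbled characters appear?

1: When the character set that the client declares does not match the encoding of the data it actually sends.

2: When the character set of the returned results does not match the encoding of the client page.

When is data lost?

When the connection or server character set is smaller than the client's, that is, when it cannot represent all of the characters the client sends.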
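
The difference between "garbled" and "lost" can be sketched in plain Python, without a MySQL server. In this sketch, latin-1 stands in for a "smaller" character set; that choice and the sample text are assumptions for illustration.

```python
original = "\u4e2d\u6587"  # two Chinese characters

# Garbled: UTF-8 bytes are read as if they were GBK. The bytes themselves
# are intact, so reading them with the right character set recovers the text.
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("gbk", errors="replace")
recovered = utf8_bytes.decode("utf-8")

# Lost: the text is converted to a character set with no Chinese characters.
# Each character becomes "?", and nothing can bring the original back.
lossy_bytes = original.encode("latin-1", errors="replace")
lost = lossy_bytes.decode("latin-1")

print(garbled != original)    # True: displayed wrongly
print(recovered == original)  # True: the data was still there
print(lost)                   # ??
```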
　　　　

Collation: the sorting rules of a character set

A character set can have one or more collations.

Take utf8 as an example. The default collation is utf8_general_ci; the character set can also be sorted in binary order with utf8_bin.

How do I declare a collation?

Create Table ()... charset utf8 collate utf8_general_ci;

Note: the collation must be a valid collation of the character set.
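
The effect of a case-insensitive collation versus a binary one can be tried without a MySQL server. The sketch below is only an analogy: it uses Python's built-in sqlite3 module, where SQLite's NOCASE and BINARY collations play roughly the roles of utf8_general_ci and utf8_bin. The table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NOCASE: comparison ignores ASCII case (similar in spirit to a "_ci" collation).
# BINARY: comparison is byte by byte (similar in spirit to a "_bin" collation).
equal_ci = conn.execute("SELECT 'a' = 'A' COLLATE NOCASE").fetchone()[0]
equal_bin = conn.execute("SELECT 'a' = 'A' COLLATE BINARY").fetchone()[0]

conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [("b",), ("A",), ("a",), ("B",)])
binary_order = [row[0] for row in
                conn.execute("SELECT w FROM words ORDER BY w COLLATE BINARY")]
conn.close()

print(equal_ci, equal_bin)  # 1 0
print(binary_order)         # ['A', 'B', 'a', 'b']
```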

Well, that is all for now. I will come back to the MySQL character set later. A post that is too long is tiring to read, so from now on I will write shorter ones, one problem per note.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
