[No000093] Hold down ALT and type a number to enter any Chinese character or symbol!

Source: Internet
Author: User
Tags control characters uppercase letter

1. In Notepad (on a Chinese-language system):

Hold down ALT, type 52946, then release ALT.

Hold down ALT, type 45230, then release ALT.

Hold down ALT, type 50403, then release ALT.

You will see the three characters "我爱你" ("I love you").

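2. Principle: holding ALT and typing the decimal value of a character's code enters that character.

For example, 52946 is the decimal code of 我 ("I"), 45230 is the decimal code of 爱 ("love"), and 50403 is the decimal code of 你 ("you"). Strictly speaking, these three values are the characters' codes in the system's ANSI code page (GB2312 on a Simplified Chinese system), not their Unicode codes: 52946 is hexadecimal CED2, whereas the Unicode code of 我 is 6211 (decimal 25105). Both kinds of code are listed side by side at the end of this article.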
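To look one of these decimal values up in a code table, it helps to see it as two hexadecimal bytes. The following is a minimal sketch; the helper name `splitAltCode` is made up for illustration and is not part of any Windows API.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: split a two-byte decimal ALT code into (high, low) bytes.
std::pair<std::uint8_t, std::uint8_t> splitAltCode(unsigned code) {
    assert(code <= 0xFFFF);  // this sketch only handles two-byte codes
    return {static_cast<std::uint8_t>(code >> 8),
            static_cast<std::uint8_t>(code & 0xFF)};
}
```

For instance, `splitAltCode(52946)` gives the bytes CE and D2, `splitAltCode(45230)` gives B0 and AE, and `splitAltCode(50403)` gives C4 and E3.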

So what is Unicode?

Unicode (also called the universal code or unified code) is a character encoding standard used on computers. It assigns a uniform, unique binary code to every character in every language, to meet the needs of cross-language, cross-platform text conversion and processing.

The way Unicode is implemented is distinct from the way it assigns codes. The Unicode code of a character is fixed, but in actual transmission the code is stored in different ways, because platforms are designed differently and because of the need to save space.

An implementation of Unicode is called a Unicode Transformation Format (UTF).

Here we should also mention another encoding: ASCII (pronounced "ask-ee").

ASCII is the American Standard Code for Information Interchange, a scheme that uses 7 or 8 bits per character and assigns values to up to 256 characters (letters, digits, punctuation, control characters, and other symbols). Standard ASCII uses 7 bits and defines 128 characters; the 8-bit extensions define up to 256. It is used on most minicomputers and all personal computers to standardize data exchange between different hardware and software systems.

3. How do I find the decimal code of a character in Unicode?

Method 1: open Character Map.

Select a character; its Unicode code is shown in the lower-left corner.

For example, the hexadecimal Unicode code of "!" is 0021h, which is 33 in decimal, so ALT+33 types it. (In Word, ALT is a menu shortcut key, so this does not work there.)

Method 2: save the text as a Unicode text file, open it in a binary (hex) editor, and read off the corresponding hexadecimal code.

Required knowledge: conversion between Unicode and UTF-8

Unicode is a character set, and UTF-8 is one of the encodings of Unicode. In this article, "Unicode" as a storage format means the fixed-length two-byte form, while UTF-8 is variable-length. A Chinese character takes two bytes in the two-byte form and three bytes in UTF-8, one byte more. Under the original design, a UTF-8 encoded character can in theory be up to 6 bytes long, while characters in the 16-bit BMP (Basic Multilingual Plane) take at most 3 bytes. (RFC 3629 later restricted UTF-8 to at most 4 bytes.) Here is the UTF-8 code table:

U-00000000 - U-0000007F: 0xxxxxxx

U-00000080 - U-000007FF: 110xxxxx 10xxxxxx

U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x positions are filled with the binary representation of the character's code number, with the rightmost x holding the least significant bit. Only the shortest multibyte sequence that can express a character's code number may be used. Note that in a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 for compatibility with ASCII and uses one byte; the second row is a two-byte sequence; the third row is three bytes, which is where Chinese characters fall; and so on. (Put simply, the count of leading 1s is the byte count.)
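The leading-1s rule can be sketched in a few lines. This is an illustrative helper, not a library function; the name `utf8SequenceLength` is an assumption, and it follows the six-row table above even though current UTF-8 stops at four bytes.

```cpp
#include <cassert>
#include <cstdint>

// Length of the sequence announced by a UTF-8 lead byte, per the table above.
// Returns 0 for a continuation byte (10xxxxxx) or a byte that cannot lead.
int utf8SequenceLength(std::uint8_t lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: plain ASCII
    int ones = 0;                         // count the leading 1 bits
    for (std::uint8_t mask = 0x80; mask != 0 && (lead & mask); mask >>= 1) {
        ++ones;
    }
    if (ones == 1 || ones > 6) return 0;  // continuation byte, or 0xFE/0xFF
    assert(ones >= 2 && ones <= 6);
    return ones;
}
```

A byte such as E4 (1110 0100) reports 3, which matches the three-byte row used for Chinese characters.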

To convert Unicode to UTF-8, you need to know how they differ. In UTF-8, a character whose code is less than 0x80 (128) is an ASCII character; it takes one byte and needs no conversion, because UTF-8 is compatible with ASCII. Take the character 你 ("you"), whose Unicode code is U+4F60: write it in binary as 0100 1111 0110 0000 (4F60) and then convert it the UTF-8 way.

Take the bits of the Unicode code from low to high, six at a time, and place each group after the prefix that the format requires; if the highest group has fewer bits than its slot, pad it with 0s. For this range the format is U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx.

Unicode: 0100 1111 0110 0000 = 4F60

UTF-8: 11100100 10111101 10100000 = E4BDA0

From the above you can see directly how Unicode maps to UTF-8. Knowing the UTF-8 format, you can also do the inverse: take the bits out of their positions in the format and combine them to get the Unicode character. Both directions can be done with shift operations. For 你 ("you"), the value is at least 0x800 and less than 0x10000, so it is stored in three bytes. For the first byte, shift the value right by 12 bits and OR (|) the result with the three-byte lead pattern 11100000 (0xE0). For the second byte, shift right by 6 bits, which leaves the bits of the first and second groups; AND (&) with 111111 (0x3F) to keep only the second group, then OR (|) with 10000000 (0x80). The third byte needs no shift: AND (&) with 111111 (0x3F) to take the last six bits, then OR (|) with 10000000 (0x80). The conversion is done! The code in VC++ is shown below (Unicode to UTF-8 conversion).

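    #include <cstring>  // memset

    const wchar_t unicode = L'\u4F60';  // the character 你 ("you"), U+4F60
    char utf8[3 + 1];                   // three UTF-8 bytes plus a terminator
    memset(utf8, 0, 4);
    utf8[0] = static_cast<char>(0xE0 | (unicode >> 12));          // 1110xxxx
    utf8[1] = static_cast<char>(0x80 | ((unicode >> 6) & 0x3F));  // 10xxxxxx
    utf8[2] = static_cast<char>(0x80 | (unicode & 0x3F));         // 10xxxxxx
    utf8[3] = '\0';
    // utf8 now holds E4 BD A0, the UTF-8 form of 你.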

Converting UTF-8 to Unicode is likewise done by shifting: take out the bits at the x positions of the format. In the example above, 你 ("you") is three bytes, so each byte is processed in turn from high to low. In UTF-8, 你 is 11100100 10111101 10100000. From the first byte 11100100, take out "0100" by ANDing (&) with 1111 (0x0F), the four x bits of a three-byte lead byte. Because the sequence has three bytes and each later byte contributes six bits, these top bits belong 12 bits up, so shift the result left by 12 bits; the top part is now 0100 000000 000000. Next take "111101" out of the second byte 10111101 by ANDing (&) with 111111 (0x3F), shift it left by 6 bits, and OR (|) it with the top part, giving 0100 111101 000000. Finally AND (&) the last byte with 111111 (0x3F) and OR (|) it with the previous result, giving 0100 111101 100000, which is 4F60. The conversion is done!

The code in VC++ is shown below (UTF-8 to Unicode conversion).

    // UTF-8 bytes of the character 你 ("you")
    const char* utf8 = "\xE4\xBD\xA0";
    wchar_t unicode;
    unicode = (utf8[0] & 0x0F) << 12;   // four payload bits of 1110xxxx
    unicode |= (utf8[1] & 0x3F) << 6;   // six payload bits of 10xxxxxx
    unicode |= (utf8[2] & 0x3F);        // six payload bits of 10xxxxxx
    // unicode is now 0x4F60

Of course, real programs rarely convert just one character. The important point is that you must know the byte length of each character, otherwise it will bring ... The above is the result of my research over the past few days. As for converting Unicode to GB2312 in MFC, Windows has its own API (WideCharToMultiByte) that can do it. The same route can also convert UTF-8 to GB2312, so I will not repeat it here. If you know a better way, please let me know.
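The point about knowing each character's length can be sketched as a loop that reads the lead byte, works out how many bytes follow, and only then combines them. This is a minimal illustration limited to sequences of 1 to 4 bytes; the function name `decodeUtf8` is an assumption, and it does not reject overlong forms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode UTF-8 bytes (1 to 4 bytes per character) into Unicode code points.
// Sketch only: decoding stops at the first malformed or truncated sequence.
std::vector<char32_t> decodeUtf8(const std::string& bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const auto lead = static_cast<std::uint8_t>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else break;                          // not a valid lead byte
        if (i + len > bytes.size()) break;   // sequence runs past the end
        bool ok = true;
        for (std::size_t k = 1; k < len && ok; ++k) {
            const auto c = static_cast<std::uint8_t>(bytes[i + k]);
            ok = (c & 0xC0) == 0x80;         // continuation bytes are 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);     // append six payload bits
        }
        if (!ok) break;
        out.push_back(cp);
        i += len;
    }
    assert(out.size() <= bytes.size());      // never more characters than bytes
    return out;
}
```

Called on the bytes 41 E4 BD A0, it yields the two code points 0x41 and 0x4F60.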

1. ASCII code

We know that inside a computer, all information is ultimately represented as a binary string. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; this is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in all, from 00000000 to 11111111. In the 1960s, the United States developed a character encoding that fixed the relationship between English characters and bits. This is ASCII, and it is still in use today. ASCII defines 128 characters in total; for example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) use only the lower 7 bits of a byte; the leading bit is always 0.
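A small sketch of the two facts in this paragraph: an ASCII byte has its leading bit clear, and a character's ASCII code is just its numeric value. The function names are made up for illustration, and the sketch assumes the compiler's execution character set is ASCII-compatible.

```cpp
#include <cassert>

// True when the byte fits in 7 bits, i.e. its leading bit is 0.
bool isAscii(unsigned char byte) {
    return (byte & 0x80) == 0;
}

// Numeric ASCII code of a character, e.g. 'A' -> 65 and ' ' -> 32.
int asciiCode(char c) {
    assert(isAscii(static_cast<unsigned char>(c)));  // only 0..127 are ASCII
    return static_cast<unsigned char>(c);
}
```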

2. Non-ASCII encodings

128 symbols are enough for English but not for other languages. For example, French has letters with accent marks that ASCII cannot represent. So some European countries decided to use the unused highest bit of the byte for new symbols. For example, é in French is coded as 130 (binary 10000010). In this way, the encodings used in these European countries can represent up to 256 symbols. But a new problem arises: different countries have different letters, so although they all use 256-symbol encodings, the letters they stand for differ. For example, 130 stands for é in the French encoding, for the letter Gimel (ג) in the Hebrew encoding, and for yet another symbol in the Russian encoding. In all of these encodings, however, 0-127 represent the same symbols; only the range 128-255 differs.
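To make the ambiguity of byte 130 concrete, here is a sketch that maps that single byte under three DOS code pages. The choice of code pages 437, 862, and 866 is an assumption made for illustration; the article itself does not name the specific French, Hebrew, and Russian encodings it has in mind.

```cpp
#include <cassert>

// Assumed code pages: 437 (Western), 862 (Hebrew), 866 (Russian).
enum class CodePage { Cp437, Cp862, Cp866 };

// Unicode code point that byte 0x82 (decimal 130) stands for in each page.
// Only this one byte is tabulated; a real converter needs a full table.
char32_t decodeByte130(CodePage page) {
    switch (page) {
        case CodePage::Cp437: return U'\u00E9';  // Latin small letter e with acute
        case CodePage::Cp862: return U'\u05D2';  // Hebrew letter Gimel
        case CodePage::Cp866: return U'\u0412';  // Cyrillic capital letter Ve
    }
    assert(false && "unknown code page");
    return 0;
}
```

The same byte therefore yields three unrelated characters, which is exactly why a file's encoding has to be known before it can be read.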

As for Asian languages, they use even more symbols; there are about 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so several bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols. Chinese encodings deserve an article of their own and are not covered in this note. It is enough to point out here that, although they also use multiple bytes per symbol, the GB family of Chinese encodings is unrelated to Unicode and UTF-8.

3. Unicode

As mentioned in the previous section, there are many encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if it is interpreted with the wrong encoding, the text will be garbled. Why are e-mails so often garbled? Because the sender and the recipient use different encodings.

Imagine an encoding that includes every symbol in the world, with each symbol given a unique code; the garbling problem would disappear. This is Unicode; as its name suggests, it is an encoding of all symbols.

Unicode is of course a very large set; it now has room for more than one million symbols. Each symbol has a different code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 (yan, "strict"). For the specific symbol tables, consult unicode.org or a dedicated table of Chinese characters.

4. Problems with Unicode

It is important to note that Unicode is just a symbol set: it specifies only the binary code of each symbol, not how that code should be stored.

For example, the Unicode code of the Chinese character 严 ("strict") is hexadecimal 4E25, which in binary is fully 15 bits long (100111000100101); that is, representing this symbol takes at least 2 bytes. Other, larger symbols may take 3 or 4 bytes, or even more.
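The bit count quoted here is easy to check. The sketch below counts the significant bits of a code and rounds up to whole bytes; both function names are illustrative, and the result is the raw size with no encoding overhead.

```cpp
#include <cassert>
#include <cstdint>

// Number of significant bits in a code (a value of 0 counts as 1 bit).
int bitLength(std::uint32_t code) {
    int bits = 1;
    while (code >>= 1) ++bits;
    return bits;
}

// Whole bytes needed to hold the code itself, with no encoding overhead.
int rawBytesNeeded(std::uint32_t code) {
    assert(code <= 0x10FFFF);  // Unicode codes do not go beyond U+10FFFF
    return (bitLength(code) + 7) / 8;
}
```

For 0x4E25 this gives 15 bits and therefore 2 bytes, while a code above 0xFFFF already needs 3.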

There are two serious problems here. The first is: how can Unicode be told apart from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? The second is that an English letter needs only one byte; if Unicode required every symbol to use three or four bytes, then two or three bytes of every English letter would be 0, a huge waste of storage. Text files would become two or three times larger, which is unacceptable.

The results were:

1) Several storage methods for Unicode appeared; that is, there are many different binary formats that can be used to represent Unicode.

2) Unicode could not be widely adopted for a long time, until the arrival of the Internet.

5. UTF-8

The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one of the implementations of Unicode.

One of the biggest features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the byte length varying by symbol.

The encoding rules of UTF-8 are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is 0 and the remaining 7 bits are the symbol's Unicode code. So for English letters, UTF-8 and ASCII are identical.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are 1, bit n+1 is 0, and the first two bits of each following byte are 10. All the remaining bits are filled with the symbol's Unicode code.

The following table summarizes the encoding rules. The letter x marks the bits available for encoding.

Unicode Symbol Range | UTF-8 Encoding method

(hex) | (binary)

--------------------+---------------------------------------------

0000 0000-0000 007F | 0xxxxxxx

0000 0080-0000 07FF | 110xxxxx 10xxxxxx

0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx

0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Below, the Chinese character 严 ("strict") is again used as an example to show how UTF-8 encoding is done.

The Unicode code of 严 is 4E25 (100 1110 0010 0101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of 严, fill in the x positions of the format from back to front, padding the leftover positions with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
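The fill-in procedure generalizes to all four rows of the table above. The following is a minimal sketch of such an encoder; the function name `encodeUtf8` is an assumption, and it does not reject the surrogate range, which a strict encoder would.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode code as UTF-8, one branch per row of the table above.
std::string encodeUtf8(char32_t cp) {
    assert(cp <= 0x10FFFF);  // highest code covered by the table
    std::string out;
    if (cp <= 0x7F) {                                   // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {                           // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {                          // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                            // 11110xxx and three more
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With the code 0x4E25 as input, the third branch runs and produces the bytes E4 B8 A5, matching the hand calculation.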

6. Conversion between Unicode and UTF-8

Using the example in the previous section, the Unicode code of 严 ("strict") is 4E25 and its UTF-8 encoding is E4B8A5; the two differ. Programs can convert between them.

On Windows, one of the simplest ways to convert is the built-in Notepad program, Notepad.exe. After opening a file, click "Save As" on the "File" menu; a dialog box appears with an "Encoding" drop-down list at the bottom.

There are four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, which stores the Unicode code of each character directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option, in big endian format. The next section explains the meaning of little endian and big endian.

4) UTF-8 is the encoding discussed in the previous section.

After selecting the encoding, click "Save" and the file is converted at once.

7. Little endian and big endian

As mentioned in the previous section, Unicode codes can be stored directly in the UCS-2 format. Take the Chinese character 严 ("strict"): its Unicode code is 4E25, which needs two bytes of storage, one byte 4E and the other 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.

The two odd names come from Gulliver's Travels by the British writer Swift. In the book, civil war breaks out in Lilliput over whether eggs should be cracked open at the big end (Big-endian) or the little end (Little-endian). Over this matter war broke out six times; one emperor lost his life and another lost his throne.

Therefore, storing the first (high) byte first is "big endian", and storing the second (low) byte first is "little endian".
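The two byte orders can be sketched with a pair of helpers that split a 16-bit code into bytes. The function names are made up for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Bytes of a UCS-2 code, high byte first (big endian).
std::array<std::uint8_t, 2> toBigEndian(char32_t code) {
    assert(code <= 0xFFFF);  // UCS-2 stores only 16-bit codes
    return {{static_cast<std::uint8_t>(code >> 8),
             static_cast<std::uint8_t>(code & 0xFF)}};
}

// The same two bytes, low byte first (little endian).
std::array<std::uint8_t, 2> toLittleEndian(char32_t code) {
    assert(code <= 0xFFFF);
    return {{static_cast<std::uint8_t>(code & 0xFF),
             static_cast<std::uint8_t>(code >> 8)}};
}
```

For the code 0x4E25, `toBigEndian` gives 4E 25 and `toLittleEndian` gives 25 4E.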

Then a question naturally arises: how does the computer know which byte order a file uses?

The Unicode specification defines a character that is placed at the very start of a file to indicate the byte order. It is named ZERO WIDTH NO-BREAK SPACE and is represented by FEFF. This is exactly two bytes, and FF is 1 greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if they are FF FE, the file is little endian.
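This check translates directly into code. The sketch below inspects the first two bytes of a buffer; the enum and function names are assumptions, and a file without the marker is simply reported as unknown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class ByteOrder { BigEndian, LittleEndian, Unknown };

// Look for the FEFF marker in the first two bytes of a buffer.
ByteOrder detectByteOrder(const std::uint8_t* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return ByteOrder::BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return ByteOrder::LittleEndian;
    }
    return ByteOrder::Unknown;  // no marker present
}
```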

8. Example

Here is an example.

Open Notepad (Notepad.exe), create a new text file whose content is the single character 严 ("strict"), and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex view in the text editor UltraEdit to see how each file is encoded internally.

1) ANSI: the file is two bytes, "D1 CF", which is the GB2312 code of 严. This also suggests that GB2312 is stored high byte first (big endian).

2) Unicode: the file is four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage; the actual code is 4E25.

3) Unicode big endian: the file is four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate that this is UTF-8, and the last three, "E4 B8 A5", are the code of 严, stored in the same order as the encoding.
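A program reading such a UTF-8 file usually wants to skip the three marker bytes before decoding. A minimal sketch, with the hypothetical helper name `utf8PayloadOffset`:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Offset of the first content byte: 3 if the data starts with EF BB BF, else 0.
std::size_t utf8PayloadOffset(const std::string& fileBytes) {
    const bool hasMarker = fileBytes.compare(0, 3, "\xEF\xBB\xBF") == 0;
    const std::size_t offset = hasMarker ? 3 : 0;
    assert(offset <= fileBytes.size());
    return offset;
}
```

For the six-byte file above the offset is 3, leaving E4 B8 A5 as the content.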

9. Further reading

* The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (the most basic knowledge about character sets)

* Talk about Unicode encoding

* RFC 3629: UTF-8, a transformation format of ISO 10646 (the specification to follow when implementing UTF-8)

(End)

Here FF FE is the file header of the Unicode (little endian) format, so 我 ("I") corresponds to the bytes 11 62. Windows stores the two bytes in reverse order, so 11 62 stands for the Unicode code 6211.

6211h = 25105d = 61021o = 0110 0010 0001 0001b
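The same value in several bases can be double-checked with a small conversion routine. This is an illustrative sketch; the function name `toBase` is an assumption.

```cpp
#include <cassert>
#include <string>

// Render a value in the given base (2 to 16), most significant digit first.
std::string toBase(unsigned value, unsigned base) {
    assert(base >= 2 && base <= 16);
    const char* digits = "0123456789ABCDEF";
    std::string out;
    do {
        out.insert(out.begin(), digits[value % base]);  // prepend next digit
        value /= base;
    } while (value != 0);
    return out;
}
```

Calling it on 0x6211 with bases 10, 8, and 2 gives 25105, 61021, and 110001000010001 (the binary form without its leading zero).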

The ANSI (GB2312) code of 我 ("I") is CED2.

你 ("you")

0100 1111 0110 0000 = 4F60 = Unicode

1100 0100 1110 0011 = C4E3 = ANSI (GB2312 encoded)

我 ("I")

0110 0010 0001 0001 = 6211 = Unicode

1100 1110 1101 0010 = CED2 = ANSI (GB2312 encoded)

严 ("strict")

0100 1110 0010 0101 = 4E25 = Unicode

1101 0001 1100 1111 = D1CF = ANSI (GB2312 encoded)
