Character encoding, garbled text, and the BOM explained

My computer runs Windows 7 while my development environment is on Linux, so I often run into garbled text (mojibake) on Linux, which is painful. I therefore decided to get to the bottom of character encoding and share what I learned, so that you won't be confused the next time garbled text appears.

Do files have an encoding?

Before we discuss character encoding, we need to be clear that a file itself has no encoding; only text has the concept of an encoding. When we say a file is "in" some encoding, we usually mean the encoding of the characters stored in that file.

Why Vim shows garbled text

I usually edit files with Vim on Linux and found that I often ran into garbled text. Why does that happen? First, some basics of how Vim handles encodings. Vim has three relevant options:

1. encoding: the character encoding Vim uses internally, for its buffers, menu text, message text, and so on. The default is chosen from your locale. The user manual suggests changing its value only in .vimrc, and indeed it only really makes sense to change it there. You can still edit and save files in other encodings: for example, if Vim's encoding is utf-8 and the file being edited is cp936-encoded, Vim converts the content to utf-8 when reading it (so Vim can work with it) and converts it back to cp936 when writing (the encoding the file is saved in).

2. fileencoding: the character encoding of the file currently being edited. Vim saves the file in this encoding (this holds for new files as well).

3. termencoding: the character encoding of the terminal (or Windows console window) Vim is running in. If it is the same as encoding, nothing needs to be set; otherwise, set termencoding so that Vim can convert its screen output to the terminal's encoding.

Garbled text appears when any of these three options is set incorrectly (a sample .vimrc is sketched after this list):

1. encoding is not utf-8: if encoding is not utf-8, some characters may be impossible to convert into it. For example, if encoding is gbk and the file content is big5-encoded, the content cannot be fully converted to gbk.

2. fileencoding is wrong: fileencoding is the encoding Vim uses to interpret the file's content when reading it; if it differs from the file's actual encoding, garbled text is inevitable. fileencoding is normally detected automatically by Vim; the fileencodings option sets the ordered list of encodings Vim tries during that detection. Detection sometimes guesses wrong, however, and then you get garbled text.

3. termencoding is wrong: we usually log in to servers remotely, which brings the terminal's encoding into play, and I have often hit garbled text caused by a mismatched terminal encoding. For example: SecureCRT is set to utf-8 while Vim's termencoding is gbk, and garbage appears.
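
As a minimal .vimrc sketch (the option values here are assumptions; match them to your own terminal and the files you actually edit), keeping the three options consistent looks like this:

" Vim works in utf-8 internally
set encoding=utf-8
" detection order tried when reading a file; the BOM check comes first
set fileencodings=ucs-bom,utf-8,cp936,gb18030,big5,latin1
" must match what the terminal emulator (e.g. SecureCRT) is set to
set termencoding=utf-8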

Introduction to character encoding

So, after all that, what exactly is a character encoding?

A character is the general term for letters and symbols, including text, graphic symbols, mathematical symbols, and so on. A set of abstract characters is a character set (charset). Character sets often correspond to a specific language: all, or most, of the characters used in a language's writing constitute that language's character set, for example the English character set or the Chinese character set. To handle all kinds of characters, a computer must map each character to a binary code; that mapping is the character encoding (encoding). Encoding first fixes the character set, orders the characters within it, and then assigns each one a binary number. The number of characters in the set determines how many bytes are needed per character.

The difference between a character set and an encoding:

A character set is just a collection of characters and is not necessarily suitable for computer storage, network transmission, or processing as-is; it must sometimes be encoded first. The Unicode character set, for example, can be encoded as UTF-8, UTF-16, or UTF-32, according to need.
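
A minimal Python sketch of the distinction: one abstract character from the Unicode character set, three different encodings of it:

ch = "中"                    # one character from the Unicode character set
print(ch.encode("utf-8"))    # b'\xe4\xb8\xad'               (3 bytes)
print(ch.encode("utf-16"))   # b'\xff\xfe-N'                 (BOM + 2 bytes, little-endian host)
print(ch.encode("utf-32"))   # b'\xff\xfe\x00\x00-N\x00\x00' (BOM + 4 bytes)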

Development of character encoding

Character encoding developed in roughly three stages.

ASCII encoding: ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet. Because computers originated in the United States, ASCII was created to represent English characters. It encodes each character in 7 bits, with the high bit of the byte reserved for parity. Byte values below 0x20 are called "control codes"; the values above them cover punctuation, digits, and letters, which at the time seemed sufficient.

Multiple encodings coexist: as computers spread, other countries began using them, but many of those countries do not use English, and many of their letters are not in ASCII. To store their own text, they decided to use the unused values above 127 for new letters and symbols, and also added the box-drawing shapes (horizontal and vertical lines, crosses, and so on), numbering the values all the way up to 255. The characters from 128 to 255 are called the "extended character set". The best-known extension is ISO 8859-1, commonly called Latin-1, which includes enough additional characters to write the basic Western European languages.

Later, yet more countries began using computers, and to represent their own languages different countries and regions developed different standards, producing GB2312, Big5, JIS, and other encodings. Typically two bytes in the range 0x80-0xFF represent one character; for example, on a Chinese operating system the character 中 is stored as the two bytes [0xD6, 0xD0]. These extended encodings that use two bytes per character are collectively called ANSI encodings.

Under a Simplified Chinese system, ANSI means GB2312; under a Japanese system, ANSI means JIS. Different ANSI encodings are mutually incompatible: when exchanging information internationally, text in two different languages cannot be stored in the same ANSI-encoded byte stream.
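
A small Python sketch of why mixing ANSI encodings garbles text, reusing the [0xD6, 0xD0] example above:

data = "中".encode("gb2312")   # b'\xd6\xd0', the two bytes mentioned above
print(data.decode("gb2312"))   # 中  -- decoded with the right encoding
print(data.decode("latin-1"))  # ÖÐ -- decoded with the wrong one: garbled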

Unicode: the various countries' ANSI encodings are incompatible with one another, so to make international information exchange easier, international standards bodies created the Unicode character set, which gives every character in every language a single, unique number, meeting the needs of cross-language, cross-platform text conversion and processing. By the time Unicode was being designed, computer memory had grown enormously and space was no longer a problem, so ISO simply required that two bytes, i.e. 16 bits, be used to unify all characters. For the "half-width" ASCII characters, Unicode keeps the original code values unchanged and merely widens them from 8 bits to 16 bits, while the characters of other scripts are all re-encoded from scratch. Because half-width English characters need only the low 8 bits, their high 8 bits are always 0, so this generous scheme doubles the storage needed for plain English text.

But Unicode arrived together with computer networks, so how to transmit Unicode over a network had to be considered as well; hence the UTF (UCS Transfer Format) standards. As the names suggest, UTF-8 transmits data 8 bits at a time and UTF-16 transmits 16 bits at a time. For reliable transmission, the mapping from Unicode to UTF is not a direct copy of the code value; it goes through certain algorithms and rules.

Anyone who has studied computer networks knows that byte order is an important issue when transmitting data: some machines send the low-order byte first (little endian, like the Intel architecture in our PCs), while others send the high-order byte first (big endian). To check that both sides agree on byte order, a simple method is to send a marker at the very start of the text stream: if the text that follows is big endian (high byte first), the bytes FE FF are sent; otherwise FF FE. (The TCP/IP protocols specify big-endian network byte order.)
Because UTF-8 is compatible with ASCII and convenient to transmit, it has come into very wide use.
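
The two byte orders, and the FEFF/FFFE marker, can be seen directly in Python:

ch = "中"                      # U+4E2D
print(ch.encode("utf-16-be"))  # b'N-'         -> 4E 2D, high byte first
print(ch.encode("utf-16-le"))  # b'-N'         -> 2D 4E, low byte first
print(ch.encode("utf-16"))     # b'\xff\xfe-N' -> BOM, then native order
                               #    (little endian on an Intel machine)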

Conversion rules from Unicode to UTF8:

Unicode code point   UTF-8 byte sequence
U+0000 - U+007F      0xxxxxxx
U+0080 - U+07FF      110xxxxx 10xxxxxx
U+0800 - U+FFFF      1110xxxx 10xxxxxx 10xxxxxx
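
As a sketch, the table can be applied by hand in Python for code points up to U+FFFF and checked against the built-in codec:

def utf8_encode(cp: int) -> bytes:
    # Encode one Unicode code point per the table above (BMP only).
    if cp <= 0x7F:   # 0000-007F: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:  # 0080-07FF: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    # 0800-FFFF: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

print(utf8_encode(ord("中")))   # b'\xe4\xb8\xad'
print("中".encode("utf-8"))     # the same bytes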

The development of the Chinese ANSI encodings

GB2312: to represent Chinese characters, China first created the GB2312 encoding. GB2312 keeps the meaning of byte values below 128 unchanged, but stipulates that two consecutive bytes above 127 together represent one Chinese character: the leading byte (the "high byte") runs from 0xA1 to 0xF7 and the trailing byte (the "low byte") from 0xA1 to 0xFE, which allows roughly 7,000 simplified Chinese characters to be composed. GB2312 also encodes mathematical symbols, Roman and Greek letters, and Japanese kana. Even the digits, punctuation, and letters already present in ASCII were given new two-byte-long codes; these are the "full-width" characters, while the original characters below 128 are called "half-width" characters.

GBK: it was later found that many rare characters, and the characters of minority languages, still could not be encoded. To accommodate them, the requirement that the low byte be above 127 was simply dropped: as long as the first byte is greater than 127, it always marks the start of a Chinese character, regardless of whether the byte that follows falls in the extended range. The extended scheme was named the GBK standard. GBK includes everything in GB2312 and adds nearly 20,000 new characters (including traditional characters) and symbols.

GB18030: later still, minority groups also needed computers, so the encoding was extended once more, adding thousands of minority-script characters, and GBK grew into GB18030. From then on, the cultures of the Chinese nation could be carried forward into the computer age.
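
A quick Python check of the difference in coverage; the character below (U+20000) is just one example, and any CJK Extension B ideograph behaves the same way:

ch = "\U00020000"            # a CJK Extension B ideograph, outside GBK
print(ch.encode("gb18030"))  # b'\x952\x826' -- GB18030's 4-byte form
try:
    ch.encode("gbk")
except UnicodeEncodeError:
    print("not representable in GBK")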

BOM Introduction

1. Origin of the BOM: to identify Unicode files, Microsoft recommends that all Unicode files begin with the character ZERO WIDTH NO-BREAK SPACE (U+FEFF). It serves as a "signature" or "byte order mark" (BOM) identifying the encoding and byte order used in the file.

2. BOM support differs between systems: because some systems or programs do not support the BOM, a Unicode file with a BOM can sometimes cause problems.

① The readers in JDK 1.5 and earlier cannot handle a UTF-8 file that has a BOM; parsing an XML file in that format throws the exception: Content is not allowed in prolog.
② Linux/Unix does not use the BOM, because it would break the syntax conventions of existing ASCII files.
③ Editing tools also differ in how they handle the BOM. When you save a file as UTF-8 with Windows Notepad, Notepad automatically inserts a BOM at the beginning of the file (even though a BOM is unnecessary for UTF-8). Many other editors let you choose whether to write one; the same applies to both UTF-8 and UTF-16.

3. BOM and XML: for reading an XML document, the W3C defines three rules (sketched in code after the list):

① If the document has a BOM, the BOM defines the file's encoding;
② if the document has no BOM, look at the encoding attribute in the XML declaration;
③ if neither is present, assume the XML document is UTF-8 encoded.
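
A minimal Python sketch of the three rules (a real parser handles more cases, such as encoding names in single quotes or UTF-16 declarations):

import re

def xml_encoding(raw: bytes) -> str:
    if raw.startswith(b"\xef\xbb\xbf"):        # rule 1: a BOM decides
        return "utf-8"
    if raw[:2] in (b"\xff\xfe", b"\xfe\xff"):  # rule 1: UTF-16 BOMs
        return "utf-16"
    m = re.match(rb'<\?xml[^>]*encoding="([^"]+)"', raw)
    if m:                                      # rule 2: the encoding attribute
        return m.group(1).decode("ascii")
    return "utf-8"                             # rule 3: the default

print(xml_encoding(b'<?xml version="1.0" encoding="gb2312"?><a/>'))   # gb2312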

Determining the character set and encoding of text

Software generally has three ways to determine the character set and encoding of a piece of text:

1. Check the BOM

For Unicode text, the most standard approach is to examine the first few bytes of the text (a sniffing sketch follows the table):

Opening bytes   Charset/encoding
EF BB BF        UTF-8
FF FE           UTF-16/UCS-2, little endian (UTF-16LE)
FE FF           UTF-16/UCS-2, big endian (UTF-16BE)
FF FE 00 00     UTF-32/UCS-4, little endian
00 00 FE FF     UTF-32/UCS-4, big endian
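
A sketch that applies this table using the BOM constants in Python's codecs module; the 4-byte UTF-32 marks must be tested before the 2-byte UTF-16 ones, because FF FE is a prefix of FF FE 00 00:

import codecs

BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8,     "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(raw: bytes):
    # Return the encoding implied by a leading BOM, or None.
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None

print(sniff_bom("中".encode("utf-8-sig")))   # utf-8-sig
print(sniff_bom(b"plain ascii"))             # None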

2. Ask the user: a safer way to determine the character set and encoding is to pop up a dialog box and ask the user.

3. Guess: if the software does not want to bother the user, or asking is impractical, it can only guess: it inspects the byte patterns of the text and infers which charset the text probably belongs to, and the inference can easily be wrong. This is exactly what happens when Notepad opens a file containing only the text 联通 ("Unicom"): a file that is actually ANSI (GBK) encoded is misdetected and treated as UTF-8.
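
The 联通 case can be reproduced in Python: the GBK bytes for these two characters happen to match UTF-8's two-byte bit pattern, so a guesser that relies on bit patterns misfires:

data = "联通".encode("gbk")   # b'\xc1\xaa\xcd\xa8'
print(" ".join(f"{b:08b}" for b in data))
# 11000001 10101010 11001101 10101000
# -- exactly the 110xxxxx 10xxxxxx shape of two-byte UTF-8 sequences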

Introduction to Notepad's encoding options

1. ANSI: Notepad's default save encoding is ANSI, i.e. the operating system's default code page, which on a Simplified Chinese system is generally GB2312. How can we verify this? Save a file in Notepad, then open it with a text editor such as EmEditor, EditPlus, or UltraEdit. In EmEditor (recommended), the status bar in the corner shows the encoding: GB2312.

2. Unicode: save the file in Notepad with the encoding set to "Unicode", then open it with EmEditor; the encoding shown is UTF-16LE with a BOM (signed). Viewed in hexadecimal, the first two bytes are FF FE. That is the BOM.

3. Unicode big endian: save the file with the encoding set to "Unicode big endian" and open it with EmEditor; the encoding shown is UTF-16BE with a BOM (signed). In hexadecimal, the first two bytes are FE FF. That is the BOM.

4. UTF-8: save the file with the encoding set to "UTF-8" and open it with EmEditor; the encoding shown is UTF-8 with a BOM (signed). In hexadecimal, the first three bytes are EF BB BF. That is the BOM.
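
The same four on-disk signatures can be produced from Python; a sketch (the utf-8-sig codec and the BOM-prefixed utf-16 variants mirror what Notepad writes):

text = "中"
print("ansi:     ", text.encode("gb2312"))                   # ANSI on a Simplified Chinese system
print("utf-16-le:", b"\xff\xfe" + text.encode("utf-16-le"))  # Notepad's "Unicode"
print("utf-16-be:", b"\xfe\xff" + text.encode("utf-16-be"))  # Notepad's "Unicode big endian"
print("utf-8-sig:", text.encode("utf-8-sig"))                # Notepad's "UTF-8" (EF BB BF first)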
