Garbled characters in Linux

Source: Internet
Author: User
Tags: i18n, locale, printable characters

Chinese text often comes out garbled under Linux: sometimes on web pages in a browser, sometimes in text mode. The following figure shows the problem I encountered. I installed CentOS, and Chinese displays correctly under X Window, but in text mode it comes out garbled.

Locale in Linux explained

Locale is a very important concept in internationalization and localization. For Chinese users, internationalization or localization usually involves three things: reading Chinese, writing Chinese, and compatibility and communication with Chinese Windows systems. In practice, the locale setting has little to do with reading Chinese, but it is closely tied to writing Chinese and to the way Windows partitions are mounted. Just as a purely English Windows can browse Chinese, Japanese or Italian pages, you do not need to set a locale to see Chinese. So why set the locale? When is the locale used?

I. Why set the locale

As I said earlier, setting the locale is not directly related to whether you can browse Chinese pages. Even with a standard English locale such as en_US.ISO-8859-1 you can still browse Chinese web pages, as long as your system has the corresponding character set (not strictly necessary) and a suitable font (such as SimSun); the browser can then render the page in Chinese for you. The process is this: the network transfers the page to your machine, the browser determines which coded character set the page uses, looks up a suitable font for that character set, and the text rendering tool draws the text on the screen.

In what follows I occasionally compare a character set to a codebook, which I personally find makes some things easier to understand. If you are not comfortable with the analogy, copy the whole text into any text editor and replace "codebook" with "character set".

So how does a web page sometimes come out garbled, or as boxes? In my view, garbled text means the character set being used is wrong (or the right one is missing). Say the page is encoded in UTF-8 but you insist on reading it as GB2312: the system looks up glyphs according to GB2312 and what lands on the screen is, of course, a pile of garbage. You are decoding a telegram with the wrong codebook, so the content is a mess. The other case is a page that shows some Chinese characters but has boxes in many places. The characters that do display show that the browser has correctly identified the page's encoding and found the corresponding glyphs in a font. But not every font contains every character of a character set, so sometimes the display is incomplete; switching to a font with fuller coverage of more character sets fixes it.
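The "wrong codebook" case can be reproduced in a few lines. The following is only a sketch in Python (not something the original article uses); the sample string and variable names are made up for illustration. It encodes a Chinese string as UTF-8 and then decodes the same bytes once as GB2312 and once as UTF-8.

```python
# A page written in Chinese and sent over the network as UTF-8 bytes.
page_text = "中文网页"
page_bytes = page_text.encode("utf-8")

# Reading the bytes with the wrong "codebook" (GB2312) gives garbage;
# errors="replace" substitutes U+FFFD for byte sequences GB2312 cannot map.
garbled = page_bytes.decode("gb2312", errors="replace")

# Reading them with the codebook they were written in restores the text.
restored = page_bytes.decode("utf-8")

print("wrong codebook:", garbled)
print("right codebook:", restored)
```

The bytes are identical in both cases; only the character set used to interpret them differs.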

Since I can browse Chinese web pages, why do I have to set locale?

Have you ever wondered about this: the Chinese board of the official Gentoo forum is encoded in UTF-8 (although we have strongly suggested GB2312 all along), Sina is encoded in GB2312, and the official Xorg website is encoded in ISO-8859-15, yet I can browse all of them without having set any of these locales. It is as if you owned every codebook: whatever character set a site is encoded in, you can use the codebooks in your hand to decode it. But although you can browse Chinese web pages, everything flowing through the operating system itself is still English characters. Being able to read Chinese, just like being able to read English, is not the issue. The fundamental problem is that you cannot write Chinese.

Before you write anything, you first have to decide which language to write in. For a computer, that means telling your Linux system which codebook to use for what you write. It is the same reason you need the GB2312 character set to browse Sina: Sina's pages are written in GB2312.

To be able to enter Chinese on Linux, you need to set the system locale (strictly speaking, the LC_CTYPE category of the locale) to a Chinese one, such as zh_CN.GB2312, zh_CN.GB18030 or zh_CN.UTF-8. Many people do not understand these odd-looking names. What do they specify? This is explained in detail later; for now it is enough to know that this is how a locale is written.

II. What exactly is a locale

The word "locale" is usually translated into Chinese as "region" or "area", but it actually means much more. A locale is the language environment in which software runs, defined by the language the computer user uses, the country or region the user is in, and the local cultural conventions.

This user environment can be divided into several broad categories according to the aspects of cultural convention involved. They usually include: language symbols and their classification (LC_CTYPE); numbers (LC_NUMERIC); comparison and sorting habits (LC_COLLATE); time display formats (LC_TIME); currency units (LC_MONETARY); messages, mainly prompts, error messages, status information, titles, labels, buttons and menus (LC_MESSAGES); name formats (LC_NAME); address formats (LC_ADDRESS); telephone number formats (LC_TELEPHONE); units of measurement (LC_MEASUREMENT); default paper size (LC_PAPER); and an overview of the information the locale itself contains (LC_IDENTIFICATION).

So a locale is the language habits, cultural conventions and everyday customs of the people in a given region. The locale of a region is defined through the conventions in these broad categories. The definition files are placed under the /usr/share/i18n/locales directory; en_US, zh_CN and de_DE@euro, for example, are locale definition files. They are plain text, so you can open them in any text editor and look at the content. Apart from a limited number of comments you may not understand most of it, because the characters are written as Unicode code point indexes.

As for de_DE@euro, the part after the @ is a modifier, which means you can see two German locales: /usr/share/i18n/locales/de_DE@euro and /usr/share/i18n/locales/de_DE. Open the two locale definitions and you will see the difference. As I read them, de_DE@euro uses European sorting, comparison and indentation habits, while de_DE uses the standard German habits.

So far we have covered the first half of zh_CN.GB18030. What is the second half? Most Linux users know that it is the character set the system uses.

III. What is a character set

A character set is the encoding of characters, especially non-English characters, inside the system, which is usually called the internal code. All character set definitions are placed in /usr/share/i18n/charmaps, and all of them are indexed by Unicode numbers. Unicode uses one uniform numbering to index every currently known symbol. A character set is the encoding of those symbols, that is, the form each character takes during network transmission or inside the computer. Unicode numbering is a static concept; a character set is a dynamic one, the concrete form each character takes when it is stored or transmitted. For example, the Unicode number U+59D0 is 姐, the character for "elder sister", but whether that character is expressed in two bytes, three bytes or four bytes is a question for the character set.

For example, UTF-8 is the currently popular encoding. It uses one byte for ASCII characters, including the basic Latin letters; two bytes for characters such as accented Latin letters, Greek and Cyrillic; three bytes for the rest of the Basic Multilingual Plane, which includes nearly all commonly used Chinese characters; and four bytes for characters beyond it, such as rare and historic characters. GB2312 represents every Chinese character in two bytes (in its usual EUC-CN form, ASCII characters still take one byte).

One thing worth mentioning: besides indexing all characters by number, Unicode can also be stored directly with a fixed four bytes per character. This form is known as UCS-4 and is for practical purposes the same as UTF-32. It is a very important concept when it comes to mounting Windows partitions. So you can also think of Unicode as a character set. The fixed four-byte notation wastes space, because most of the time a single byte is enough for the 26 letters, which is why UTF-8, UTF-16 and so on exist. Otherwise the world would be one big happy family and we would be spared a lot of trouble.
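The difference between a Unicode number and its encoded form can be checked directly. The sketch below uses Python's built-in codecs (an assumption of this example, not something the article relies on) to show how many bytes the same code point occupies in several encodings. The helper name encoded_length and the sample characters are illustrative only.

```python
# Characters chosen to need 1, 2, 3 and 4 bytes in UTF-8:
# "A", e-acute (U+00E9), U+59D0 (the "elder sister" character), and U+20000.
samples = ["A", "\u00e9", "\u59d0", "\U00020000"]
encodings = ["utf-8", "utf-16-be", "utf-32-be", "gb2312", "gb18030"]

def encoded_length(ch, encoding):
    """Return the byte count of ch in encoding, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

for ch in samples:
    sizes = {enc: encoded_length(ch, enc) for enc in encodings}
    print(f"U+{ord(ch):04X}", sizes)
```

The code point stays the same in every row; only the number of bytes changes with the encoding, and GB2312 simply has no representation for characters outside its repertoire.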

IV. What does zh_CN.GB2312 actually say

A locale is the language environment of software at run time. It includes the language, the territory and the character set (codeset). A locale is written in the form language[_territory[.codeset]]. So a locale is always associated with a particular character set. Here are a few examples:

1, I speak Chinese, I am in the People's Republic of China, and I use the GB2312 character set to express characters. zh_CN.GB2312 = Chinese_People's Republic of China.GB2312 character set

2, I speak Chinese, I am in the People's Republic of China, and I use the GB18030 character set to express characters. zh_CN.GB18030 = Chinese_People's Republic of China.GB18030 character set

3, I speak Chinese, I am in Taiwan, and I use the Big5 character set to express characters. zh_TW.Big5 = Chinese_Taiwan.Big5 character set

4, I speak English, I am in Great Britain, and I use the ISO-8859-1 character set to express characters. en_GB.ISO-8859-1 = English_Great Britain.ISO-8859-1 character set

5, I speak German, I am in Germany, I use the UTF-8 character set and I follow European conventions. de_DE.UTF-8@euro = German_Germany.UTF-8 character set@modified according to European conventions

Note that it is not de_DE@euro.UTF-8. So the complete form of a locale name is language[_territory][.codeset][@modifier].
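The naming scheme language[_territory][.codeset][@modifier] can be expressed as a small parser. This is a minimal sketch in Python; the regular expression and the function name split_locale_name are my own illustration and are deliberately looser than what any real C library accepts.

```python
import re

# language[_territory][.codeset][@modifier], as described above.
LOCALE_NAME = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<territory>[A-Za-z]+))?"
    r"(?:\.(?P<codeset>[^@]+))?"
    r"(?:@(?P<modifier>.+))?$"
)

def split_locale_name(name):
    """Split a locale name into its four parts; absent parts are None."""
    match = LOCALE_NAME.match(name)
    if match is None:
        raise ValueError(f"not a locale name: {name!r}")
    return match.groupdict()

for name in ["zh_CN.GB18030", "zh_TW.Big5", "de_DE.UTF-8@euro", "en_US"]:
    print(name, "->", split_locale_name(name))
```

With this pattern, the wrongly ordered de_DE@euro.UTF-8 still parses, but the codeset ends up swallowed into the modifier ("euro.UTF-8"), which illustrates why the order of the parts matters.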

The generated locales are placed in the /usr/lib/locale/ directory, and each locale corresponds to one folder. In other words, once the de_DE.UTF-8@euro locale has been created, the /usr/lib/locale/de_DE.UTF-8@euro/ directory is generated, and it holds the specific content of that locale.

V. How to customize locales

Generating locales in Gentoo is very easy. First add userlocales support to USE, then edit the locales.build file, which tells glibc which locale files to generate. Many people do not understand what each entry means, but after the explanation above it should now be clear.

File: /etc/locales.build

en_US/ISO-8859-1
en_US.UTF-8/UTF-8
zh_CN/GB18030
zh_CN.GBK/GBK
zh_CN.GB2312/GB2312
zh_CN.UTF-8/UTF-8

The above is my locales.build file, which is explained in turn:

en_US/ISO-8859-1: generate a locale named en_US that uses the ISO-8859-1 character set, and make it the default for the English_United States class of locales. It is in fact no different from en_US.ISO-8859-1/ISO-8859-1.

en_US.UTF-8/UTF-8: generate a locale named en_US.UTF-8 that uses the UTF-8 character set.

zh_CN/GB18030: generate a locale named zh_CN that uses the GB18030 character set, and make it the default for the Chinese_China class of locales. It is in fact no different from zh_CN.GB18030/GB18030.

zh_CN.GBK/GBK: generate a locale named zh_CN.GBK that uses the GBK character set.

zh_CN.GB2312/GB2312: generate a locale named zh_CN.GB2312 that uses the GB2312 character set.

zh_CN.UTF-8/UTF-8: generate a locale named zh_CN.UTF-8 that uses the UTF-8 character set.

As for the default locale: it can be abbreviated to the form en_US or zh_CN. This is only a shorthand and has no special significance.

What lies behind Gentoo's locale definitions is the locale build tool itself: localedef. After compiling glibc you can use localedef to add more locales, and doing so will help you understand locales better. See the localedef man page for details.

$ localedef -f <character set> -i <locale definition file> <name of the generated locale>

For example:

$ localedef -f UTF-8 -i zh_CN zh_CN.UTF-8

This has the same result as the zh_CN.UTF-8/UTF-8 entry in locales.build.
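The correspondence between a locales.build entry and a localedef call can be sketched as a small helper. This is a Python illustration only: the function name localedef_args is hypothetical, it merely builds the argument list and runs nothing, and it assumes that the definition file is the locale name without its ".codeset" part, as in the zh_CN example above.

```python
def localedef_args(entry):
    """Turn a 'name/charmap' entry into the matching localedef argument list."""
    name, charmap = entry.strip().split("/")
    # Assumption: the definition file is the name without its ".codeset" part.
    definition = name.split(".", 1)[0]
    return ["localedef", "-f", charmap, "-i", definition, name]

entries = ["en_US/ISO-8859-1", "zh_CN/GB18030", "zh_CN.UTF-8/UTF-8"]
for entry in entries:
    print(" ".join(localedef_args(entry)))
```

Names carrying an @modifier are not handled here; they would need the modifier kept on the definition file name.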

VI. The mechanism of locale

Several locales have now been generated, but for them to take effect the Linux system must be told which locale (or locales) to use. This requires a little understanding of the internal mechanism of locale. As mentioned earlier, locale divides the aspects of cultural convention involved into 12 broad categories:

1, language symbols and their classification (LC_CTYPE)
2, numbers (LC_NUMERIC)
3, comparison and sorting habits (LC_COLLATE)
4, time display formats (LC_TIME)
5, currency units (LC_MONETARY)
6, messages, mainly prompts, error messages, status information, titles, labels, buttons and menus (LC_MESSAGES)
7, name formats (LC_NAME)
8, address formats (LC_ADDRESS)
9, telephone number formats (LC_TELEPHONE)
10, units of measurement (LC_MEASUREMENT)
11, default paper size (LC_PAPER)
12, an overview of the information the locale itself contains (LC_IDENTIFICATION)

Of these, the one most closely related to Chinese input is LC_CTYPE. LC_CTYPE specifies which characters are valid in the system and how they are classified: which are uppercase letters, which are lowercase letters, case conversion, punctuation, printable characters and other character attributes. One of the most important items in the zh_CN locale definition is a large class of Chinese characters (class "hanzi"), which is also described with Unicode numbers. It makes Chinese characters valid in the Linux system, no matter which character set they are encoded in.

LC_CTYPE
% This is a copy of the "i18n" LC_CTYPE with the following modifications:
% - Additional classes: hanzi

copy "i18n"

class "hanzi"; /
% [the <Uxxxx> code point ranges that followed are missing from this copy]
END LC_CTYPE

The en_US locale definition does not define Chinese characters, so there they are not valid characters. If you want to enter Chinese, you must therefore use a locale that supports Chinese, that is, a zh_XX locale such as zh_CN, zh_TW or zh_HK.

Another very important point is that these categories are independent of each other: LC_CTYPE, LC_COLLATE, LC_MESSAGES and so on are separate and can be set to different values according to the user's needs. This is useful to many users, even necessary. For example, I need an English environment in which I can still enter Chinese, so I can set LC_CTYPE to zh_CN.GB18030 and all the other categories to en_US.UTF-8.

VII. How to set the locale

Setting the locale means setting the locale attribute of each of the 12 categories, that is, the 12 LC_* variables. Besides these 12 variables there are two more for convenience: LC_ALL and LANG. They have an order of precedence: LC_ALL > LC_* > LANG. You could say that LC_ALL is the overriding, mandatory setting and LANG is the default.

1, If you set LC_ALL=zh_CN.UTF-8, then whatever values LC_* and LANG have, they are forced to follow LC_ALL and become zh_CN.UTF-8.

2, If you set LANG=zh_CN.UTF-8 and all the LC_* to en_US.UTF-8, and LC_ALL is not set, then the system locale is LC_*=en_US.UTF-8.

3, If you set LANG=zh_CN.UTF-8, and neither the LC_* nor LC_ALL are set, the system sets the LC_* to the default value, which is the value of LANG: zh_CN.UTF-8.

4, If you set LANG=zh_CN.UTF-8 and LC_CTYPE=en_US.UTF-8, and the other LC_* and LC_ALL are not set, then the system locale is LC_CTYPE=en_US.UTF-8, and the rest (LC_COLLATE, LC_MESSAGES and so on) take the default value, the value of LANG. That is, LC_COLLATE=LC_MESSAGES=...=LC_PAPER=LANG=zh_CN.UTF-8.
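The precedence LC_ALL > LC_* > LANG amounts to a short lookup. The following Python sketch models it with a plain dictionary standing in for the environment; the function name effective_locale is hypothetical, and treating an empty value as unset and falling back to "C" are assumptions of this example.

```python
def effective_locale(category, env):
    """Return the locale that applies to one LC_* category.

    Precedence: LC_ALL, then the category's own variable, then LANG.
    Empty or missing values are skipped; the final fallback is "C".
    """
    for variable in ("LC_ALL", category, "LANG"):
        value = env.get(variable)
        if value:
            return value
    return "C"

# Case 4 from the list above: LANG plus a single LC_CTYPE override.
env = {"LANG": "zh_CN.UTF-8", "LC_CTYPE": "en_US.UTF-8"}
for category in ("LC_CTYPE", "LC_COLLATE", "LC_MESSAGES"):
    print(category, "=", effective_locale(category, env))
```

On a real system the locale command, run without arguments, reports the values that actually result from the current environment.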

So the locale is set like this:

1, If you need a purely Chinese system, set LC_ALL=zh_CN.XXXX or LANG=zh_CN.XXXX. You can of course set both, but as mentioned above, the value of LC_ALL overrides all the other locale settings, so there is no point in the extra effort.

2, If you only want an environment in which you can enter Chinese, while keeping menus, titles, system messages and so on in English, you only need to set LC_CTYPE=zh_CN.XXXX and LANG=en_US.XXXX. The result is LC_CTYPE=zh_CN.XXXX and LC_COLLATE=LC_MESSAGES=...=LC_PAPER=LANG=en_US.XXXX.

3, If you feel like it, you can set the 12 LC_* variables one by one to whatever values you need and build a truly eccentric system: LC_CTYPE=zh_CN.GBK (Chinese with the GBK character set); LC_NUMERIC=en_GB.ISO-8859-1 (British number conventions); LC_MEASUREMENT=de_DE.ISO-8859-15@euro (German units of measurement with the ISO-8859-15 character set); Roman address formats, American paper sizes, and so on. I doubt anyone would actually do that.

4, If you do nothing at all, that is, LC_ALL, LANG and the LC_* are given no value, the system uses POSIX as the locale, which is the C locale.

This article comes from the Linux Commune website (www.linuxidc.com). Original link: http://www.linuxidc.com/Linux/2011-09/43160.htm
