Character internal codes
Each country or region specifies a character set for computer information exchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. As the foundation of a country's (or region's) information processing, these sets play an important role in unified encoding. Because the code ranges of local character sets overlap, exchanging information between them is difficult, and independently maintaining localized versions of software is costly. It is therefore necessary to factor out what localization work has in common, handle it uniformly, and minimize the amount of locale-specific processing. This is so-called internationalization (I18N): language-specific information is normalized into locale data, while the underlying character set uses Unicode, which contains all characters.
A character internal code is the code used to represent a character inside the system; internal codes are needed when entering and storing documents. They fall into single-byte and double-byte forms. A single-byte character set (SBCS) can encode 256 characters. A double-byte character set (DBCS) can encode roughly 65,000 characters and is mainly used for the large character sets of East Asian scripts.
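The difference in width is easy to observe from Java, which lets a string be encoded under different character sets. A minimal sketch (it assumes the JDK ships the extended GBK charset, which full JDKs normally do):

```java
import java.nio.charset.Charset;

public class ByteWidth {
    public static void main(String[] args) throws Exception {
        // "A" occupies one byte in a single-byte set such as ASCII.
        System.out.println("A".getBytes("US-ASCII").length);      // 1
        // A CJK character (U+4E2D) occupies two bytes in the
        // double-byte GBK encoding.
        if (Charset.isSupported("GBK")) {
            System.out.println("\u4e2d".getBytes("GBK").length);  // 2
        }
    }
}
```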
A code page (CodePage) is a selected list of character internal codes in a specific order. For early single-byte languages, the order of internal codes in the code page lets the system look up an internal code from the keyboard input value in this list. For double-byte internal codes, the code page provides a mapping table from the multi-byte encoding to Unicode, so that characters stored in Unicode form can be converted to the corresponding character internal codes. Code page support is mainly used to access multilingual file names: file systems such as NTFS and FAT32/VFAT currently store file names in Unicode, which requires the system to dynamically convert these names to and from the corresponding language encoding.
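In Java this multi-byte-to-Unicode table is exposed through the `Charset` API: decoding legacy bytes through a charset yields the Unicode string, and encoding converts it back. A minimal sketch (the two-byte sequence 0xD6 0xD0 is the GB2312/GBK encoding of U+4E2D; it assumes the JDK's extended GBK charset is available):

```java
import java.nio.charset.Charset;

public class CodePageDemo {
    public static void main(String[] args) {
        // Two GBK bytes for one Chinese character (U+4E2D).
        byte[] gbkBytes = { (byte) 0xD6, (byte) 0xD0 };
        // Decode through the code page's mapping table to Unicode.
        String unicode = new String(gbkBytes, Charset.forName("GBK"));
        System.out.println(Integer.toHexString(unicode.charAt(0)));  // 4e2d
    }
}
```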
Readers of JSP code will certainly be familiar with ISO8859-1, one of the code pages we use most often; it covers the Western European languages. GB2312-80 was developed in the early stage of Chinese character information processing in China. It contains most of the commonly used first- and second-level Chinese characters plus nine areas of symbols. This character set is supported by almost all Chinese systems and internationalized software, and it is the most basic Chinese character set.
GBK is an upward-compatible extension of GB2312-80. It contains 20,902 Chinese characters, and its encoding range is 0x8140 to 0xFEFE, excluding the positions with high byte 0x80. All of its characters can be mapped one-to-one to Unicode 2.0; in other words, Java effectively provides support for the GBK character set.
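Because of that one-to-one mapping, encoding a Unicode string to GBK and decoding it back is lossless. A minimal sketch (again assuming the JDK's extended GBK charset is present):

```java
import java.nio.charset.Charset;

public class GbkRoundTrip {
    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u4e2d\u6587";           // "Chinese" in Chinese
        byte[] encoded  = original.getBytes(gbk);   // two bytes per character
        String decoded  = new String(encoded, gbk); // back to Unicode, losslessly
        System.out.println(original.equals(decoded));  // true
    }
}
```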
GB18030-2000 (GBK2K) further extends the Chinese characters on the basis of GBK and adds the scripts of ethnic minorities such as Tibetan and Mongolian. GBK2K fundamentally solves the problems of insufficient characters and insufficient glyphs.
Differences between development platforms
1. The Tomcat 4 development platform
Tomcat 4 and later versions on Windows 98/2000 run into Chinese-character problems (while on Linux, and under Tomcat 3.x, output is correct), the main symptom being garbled pages. In IE, you can switch the character set to GB2312 to make the page display normally.
To solve this problem, you can add `<%@ page language="java" contentType="text/html; charset=gb2312" %>` to the page. However, this alone is not enough: although literal Chinese characters now display, fields read from the database become garbled. Analysis shows that the Chinese characters stored in the database are intact. The database accesses data using the ISO8859-1 character set, and by default Java also processes characters as ISO8859-1 (a reflection of Java's internationalization design), so when data is written, Java and the database both work in ISO8859-1 and no error occurs. The problem appears on reading: the data is again read as ISO8859-1, while the JSP file header contains `<%@ page language="java" contentType="text/html; charset=gb2312" %>`, declaring that the page is displayed in the GB2312 character set. This differs from the encoding of the data just read, so the characters fetched from the database appear garbled on the page. The solution is to transcode these strings from ISO8859-1 to GB2312, after which they display normally. This solution is applicable to many platforms and can be used flexibly.
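The transcoding step is usually done by recovering the raw bytes under ISO8859-1 (which maps all 256 byte values one-to-one, so no information is lost) and re-decoding them as GB2312. A minimal sketch; the helper name `toGb2312` is illustrative, not from the original:

```java
import java.io.UnsupportedEncodingException;

public class Transcode {
    // Re-interpret a string that was wrongly decoded as ISO8859-1
    // as GB2312 text. getBytes("ISO-8859-1") recovers the original
    // byte sequence unchanged; new String(..., "GB2312") decodes it.
    static String toGb2312(String misread) throws UnsupportedEncodingException {
        return new String(misread.getBytes("ISO-8859-1"), "GB2312");
    }

    public static void main(String[] args) throws Exception {
        // Simulate the bug: GB2312 bytes read back as ISO8859-1.
        String correct = "\u4e2d\u6587";
        String garbled = new String(correct.getBytes("GB2312"), "ISO-8859-1");
        System.out.println(toGb2312(garbled).equals(correct));  // true
    }
}
```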
2. Tomcat 3.x, Resin, and the Linux platform
Under Tomcat 3.x and Resin, or on Linux, the `<%@ page language="java" contentType="text/html; charset=gb2312" %>` directive is not needed; the `<meta http-equiv="Content-Type" content="text/html; charset=gb2312">` tag on the page takes effect and the page displays normally. Conversely, if `<%@ page language="java" contentType="text/html; charset=gb2312" %>` is added, the system reports an error. This indicates that the Tomcat engines of version 4 and later process JSP pages differently.