Ubiquitous Character Sets
Lu_yi_ming (_ AT _) Sina.com 2004.12.
Version 0.2 (this article may contain many errors and is for reference only; corrections are welcome)
I. Character Set Application Example: Web browsing
We start with a user browsing an HTML page in IE. Assume that this is a "User Information Registration" page, where the user enters a name, age, and other information to register.
After starting IE, the user types the URL into the address bar (for the character set processing involved in keyboard input, see the step below in which the user enters a name). IE stores the URL in memory as UTF-16, then converts it to UTF-8, wraps it in an HTTP request (the character set handling of the HTTP protocol itself is omitted here), and sends it through a TCP/IP socket to the Web server (character set handling for DNS resolution and for Chinese domain names is omitted).
The Web server (for the character set handling of the server's file system, see the step below in which the JSP file is loaded) returns an HTTP response containing the HTML file (stored in UTF-8 to support internationalization) to IE through the TCP/IP socket. IE first looks for a character set in the Content-Type header of the HTTP response. If none is found, it looks in the HTML file for a declaration such as <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">. If neither is found, the default defined by HTTP/1.1 is ISO-8859-1 (in practice a browser may also auto-detect or fall back to a locale-dependent encoding). In short, IE settles on one character set for the current HTML file (we assume UTF-8) and then renders the HTML file in the IE window according to that character set (for how text is drawn, see the step below in which the user enters a name).
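The lookup order just described (HTTP header first, then the declaration inside the HTML, then ISO-8859-1) can be sketched in Java as follows. This is only an illustration of the precedence rule, not how IE is implemented; the class CharsetResolver, its methods, and the 1024-byte search window are assumptions made for this example, and the regular expression is a simplification that accepts any "charset=" occurrence rather than parsing the <meta> tag properly.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that illustrates the lookup order only.
final class CharsetResolver {
    // Matches "charset=NAME" (optionally quoted, spaces allowed around '=').
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9._:-]*)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset name found in the text, or null if there is none.
    static String findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Step 1: HTTP header. Step 2: declaration in the HTML. Step 3: ISO-8859-1.
    static Charset resolve(String contentTypeHeader, byte[] body) {
        String name = findCharset(contentTypeHeader);
        if (name == null) {
            // ISO-8859-1 maps each byte to one char, so the ASCII
            // declaration stays searchable whatever the real encoding is.
            // The 1024-byte window is an arbitrary choice for this sketch.
            int n = Math.min(body.length, 1024);
            String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
            name = findCharset(head);
        }
        if (name == null || !Charset.isSupported(name)) {
            return StandardCharsets.ISO_8859_1;
        }
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        byte[] html = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(resolve("text/html; charset=UTF-16", html)); // header wins: UTF-16
        System.out.println(resolve("text/html", html));                  // meta tag: UTF-8
        System.out.println(resolve(null, new byte[0]));                  // default: ISO-8859-1
    }
}
```

An unknown or unsupported charset name also falls through to ISO-8859-1 here, which is a design choice of the sketch rather than documented browser behavior.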
IE displays the HTML file using the correct character set. The user begins typing a name into an <input type="text"> field. For each key press, Windows places the virtual-key code in a WM_KEYDOWN message and posts it to IE's message queue. GetMessage() retrieves the message and passes it to TranslateMessage(), which hands the virtual-key code to the input method window. The Chinese character selected in the input method window is held in UTF-16. Then, depending on the character set type with which the IE window class was registered (UTF-16 or GBK), the character code is placed in a WM_CHAR message and posted to IE's message queue (for UTF-16, one message per Chinese character; for GBK, two messages). IE stores the UTF-16 character code in a memory buffer and then calls TextOut() with the encoded text, the font, and the display position as parameters. TextOut() looks up the vector or bitmap glyph that corresponds to each UTF-16 code in the font file (if the font is not indexed by UTF-16, a character set conversion is performed first) and draws it at the specified position in the IE window (the <input type="text"> field).
The user clicks Submit.
All text the user has typed into the page is held in IE's memory as UTF-16. When the user presses the "Submit" button, IE converts the buffered data (the form fields) to the character set it identified for the HTML page (UTF-8, as assumed above), wraps it in an HTTP request (the URL in the form's action attribute is still converted to UTF-8), and sends it through the TCP/IP socket (the related character set handling is the same as before).
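To make the conversion on submit concrete, here is a minimal Java sketch that encodes one form field in the page's character set, in the application/x-www-form-urlencoded style that browsers use for ordinary forms. The class FormEncodingDemo and the field name "name" are made up for illustration; the sample value is the two-character string "中文", written with \u escapes so that the example does not depend on the source file's own encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo: the same UTF-16 string yields different bytes on the
// wire depending on the character set the page was identified as.
final class FormEncodingDemo {
    // Encodes one form field using the given page character set.
    static String encodeField(String name, String value, String pageCharset) {
        try {
            return URLEncoder.encode(name, pageCharset)
                    + "=" + URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + pageCharset, e);
        }
    }

    public static void main(String[] args) {
        String value = "\u4E2D\u6587"; // "中文", held as UTF-16 inside the JVM
        System.out.println(encodeField("name", value, "UTF-8")); // name=%E4%B8%AD%E6%96%87
        System.out.println(encodeField("name", value, "GBK"));   // name=%D6%D0%CE%C4
    }
}
```

The server has to decode with the same character set the browser used, which is why the page's declared character set matters on both sides.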
Assume that the submitted form is handled by a JSP file on a J2EE server (Tomcat) running on Linux.
Tomcat/JVM receives the HTTP request on a TCP/IP socket. Because the JVM represents characters internally as UTF-16, Tomcat converts the HTTP request to UTF-16 in order to parse it. Parsing shows that a JSP must be loaded to handle the request, but this JSP has not been loaded yet, so Tomcat/JVM converts the file name (including the path) to UTF-8 (the encoding assumed here for the glibc/Linux environment) and calls open() in glibc/the Linux kernel to open the file (the character set conversion performed when the Linux kernel accesses the file system is omitted here). Tomcat/JVM then calls read() in glibc/the Linux kernel to read the file's binary data into memory. It first converts the beginning of the file to UTF-16 as if it were ISO-8859-1 and searches for <%@ page contentType="text/html; charset=UTF-8" %> to determine the file's character set (this file is also saved in UTF-8 to support internationalization). Tomcat/JVM then converts the entire JSP file from UTF-8 to UTF-16, translates it into Java source, converts the Java source back to UTF-8, and saves the Java file in the working directory (file system character set conversion as above). Tomcat calls javac to compile the Java file into a class file (character set conversion also takes place during compilation, since the C library or operating system would not support it otherwise, but the details are not known to the author; string literals in the code end up as UTF-16 in the JVM), which is then read into memory and loaded (referred to below as the JSP code).
Tomcat hands the body of the HTTP request to the JSP code. The JSP code first decodes the binary data from UTF-8 into the JVM's internal UTF-16, then reads the text of each form field, converts data types as needed, and begins processing.
Let's assume that the JSP code stores the user data in a database (MySQL).
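The decoding step on the server side can be sketched as below. This is not Tomcat's code: FormBodyParser is a hypothetical helper, and the body string is a made-up request in which the name field carries "中文" as percent-encoded UTF-8. It shows the request bytes being decoded into Java strings (UTF-16), after which a field can be converted to another data type.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: decodes an application/x-www-form-urlencoded body.
final class FormBodyParser {
    static Map<String, String> parse(String body, String charsetName) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return fields;
        }
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            String rawName = eq < 0 ? pair : pair.substring(0, eq);
            String rawValue = eq < 0 ? "" : pair.substring(eq + 1);
            try {
                // Percent-escapes become bytes, and the bytes are decoded
                // with the given charset into a UTF-16 Java String.
                fields.put(URLDecoder.decode(rawName, charsetName),
                           URLDecoder.decode(rawValue, charsetName));
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> form = parse("name=%E4%B8%AD%E6%96%87&age=30", "UTF-8");
        String name = form.get("name");               // two UTF-16 chars
        int age = Integer.parseInt(form.get("age"));  // data type conversion
        System.out.println(name.length() + " chars, age " + age); // 2 chars, age 30
    }
}
```

Decoding the same body with a different character set than the browser used would not fail, it would silently produce wrong characters, which is the usual source of garbled form data.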
The JSP code loads the MySQL JDBC driver and connects to a database on a MySQL server (the character set conversion related to TCP/IP when connecting to the server is omitted). After the JDBC driver establishes the connection to the database, it immediately retrieves the database's default character set from the server (set to UTF-8 to support internationalization). The JSP code assembles the data from the user's form into an SQL statement (in UTF-16, of course), INSERT INTO Table1 (name, company) VALUES ('name', 'unit'), and executes it through JDBC. JDBC first asks the server for the character sets of the name and company columns (name is UTF-8 and company is GBK), then converts the INSERT statement to UTF-8, except that the 'unit' value is converted to GBK. The resulting bytes are sent to the server over TCP/IP, and the server stores the data in a file with the character sets unchanged. The UTF-8 'name' value occupies 6 bytes, and the GBK 'unit' value occupies 4 bytes.
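The byte counts quoted above follow directly from the encodings. The sketch below assumes, for illustration only, that both values are two-character Chinese strings ("姓名" for the name column and "单位" for the company column); the class ColumnBytes is hypothetical and does not talk to MySQL, it only measures how many bytes each value needs in its column's character set.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of per-column storage size; no database involved.
final class ColumnBytes {
    // Number of bytes the value occupies once encoded in the column charset.
    static int storedLength(String value, Charset columnCharset) {
        return value.getBytes(columnCharset).length;
    }

    public static void main(String[] args) {
        String name = "\u59D3\u540D";    // "姓名" (assumed sample value)
        String company = "\u5355\u4F4D"; // "单位" (assumed sample value)
        // Inside the JVM both strings are two UTF-16 code units long.
        System.out.println(name.length());                                   // 2
        System.out.println(storedLength(name, StandardCharsets.UTF_8));      // 6
        System.out.println(storedLength(company, Charset.forName("GBK")));   // 4
    }
}
```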
Assume that the JSP code next retrieves some information from the database and then sends a confirmation email to the user's mailbox for the user to confirm.
The JSP code prepares an SQL statement SELECT ... and executes it through JDBC to obtain a result set. JDBC converts the SELECT statement to the appropriate character set and sends it to the MySQL server (the conversion is similar to the one described above). The server returns a result set, which JDBC receives. The JSP code calls getString() on the result set, and after any necessary character set conversion by JDBC, a UTF-16 string is returned. The JSP code builds the text of the email from the retrieved data and calls JavaMail's MimeMessage.setText() to set the mail body. The characters here are still UTF-16. Once the email is ready, the JSP code calls Transport.send() to send it. Before the mail data is written to the TCP/IP socket, Transport.send() converts it according to the character set defined by the Linux environment variables. The details of character set conversion when sending mail and the character set handling of the SMTP protocol are omitted.
After the email is sent successfully, the JSP code displays a message to the user, which completes the new user registration.
The JSP code prepares to return an HTML page to the user. It first specifies the character set to be declared in the HTTP response; here it again returns a UTF-8 page: <%@ page contentType="text/html; charset=UTF-8" %>. It then calls out.println() to output the HTML to be displayed. Tomcat converts everything in the output buffer from UTF-16 to UTF-8 and returns it to IE through the TCP/IP socket. IE decodes it according to the character set, as described earlier, and displays it to the user.
II. Character Set Classification
1. Basic Definition
◇ GBK is a superset of GB2312. In GBK an English letter occupies 1 byte and a Chinese character occupies 2 bytes. The commonly used Chinese characters (GB2312 level 1) are ordered by pinyin, and the character set includes traditional Chinese characters.
◇ "Unicode" here generally means 16-bit Unicode (UTF-16/UCS-2); 32-bit Unicode is not commonly used. A character occupies two bytes, whether Chinese or English, with the low byte first and the high byte second (little endian, as on Windows); characters outside the Basic Multilingual Plane take four bytes in UTF-16. Chinese characters are ordered by radical and stroke count, and the character set includes traditional Chinese characters.
◇ UTF-8 is a variable-length encoding of Unicode that corresponds one-to-one with 16-bit Unicode, so the ordering is the same. An ASCII letter occupies 1 byte, letters such as accented Latin, Greek, or Cyrillic occupy 2 bytes, and a Chinese character occupies 3 bytes. String comparison, sorting, and similar operations may be about 30% slower than with 16-bit Unicode (a rough figure, for reference only).
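The byte sizes in these definitions are easy to check by dumping the bytes of the same text in each encoding. The following Java sketch does that for the two-character string "A中"; the class EncodingHexDump and its hex() helper are names invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: shows the bytes of a string in a given encoding.
final class EncodingHexDump {
    // Returns the encoded bytes as space-separated uppercase hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "A\u4E2D"; // "A中"
        System.out.println(hex(s, StandardCharsets.UTF_8));    // 41 E4 B8 AD
        System.out.println(hex(s, StandardCharsets.UTF_16LE)); // 41 00 2D 4E
        System.out.println(hex(s, Charset.forName("GBK")));    // 41 D6 D0
    }
}
```

In the UTF-16LE line the Chinese character U+4E2D appears as 2D 4E, which is the "low byte first" order mentioned above.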
2. Windows (2000) System
◇ Almost every piece of text we see on Windows is drawn by the TextOut function (or ExtTextOut). Internally, Windows uses UTF-16/UCS-2 little endian for TextOut (according to MSDN). The glyphs that TextOut draws on the screen come from a font file, which provides an image (vector or bitmap) for each character. The font file also specifies the character set by which its characters are indexed. So if the font file's character set is not UTF-16/UCS-2 little endian, TextOut converts first in order to find the correct glyph to display on the screen.
◇ Windows ships with many character sets (conversion tables between each character set and UTF-16) to support applications.
◇ Windows also has a default local character set (the end user's character set) to support localization. On Simplified Chinese Windows the local character set is GBK.
◇ Windows provides two versions of each character-related API (...W and ...A): ...W for Unicode and ...A for the local character set (GBK).
◇ In a Windows application, when the user enters a Chinese character from the keyboard (through an input method), the TranslateMessage function translates it into a GBK or Unicode character (determined by how the window class was registered with RegisterClass), places it in the wParam of a WM_CHAR message, and sends it to the application.
◇ Notepad.exe on Windows: Chinese text entered from the keyboard is translated into Unicode and held in a memory buffer. When saving the file, you can choose the encoding: ANSI saves as GB2312, Unicode saves as Unicode, and UTF-8 saves as UTF-8. Text files in any of these character sets can be opened.
◇ File names on FAT/FAT32 partitions are stored in the ANSI/OEM character set (GBK/GB2312) for short 8.3 names (VFAT long file names are stored in Unicode), and file names on NTFS partitions are stored in Unicode.
◇ The Windows DOS window can display text files in GBK/GB2312 and UTF-16.
◇ The VC editor can be used to display the binary data of a file.
3. Linux system (RedHat FC2/kernel 2.6.6)
◇ The character set used by the Linux kernel is UTF-8.
◇ The character set of the Linux user interface (shell) can be set through environment variables (in /etc/profile). The default is en_US.UTF-8, and it can be changed, for example: export LC_CTYPE="zh_CN.GB2312".
◇ Use the iconv command to display the content of files in other character sets, for example: cat ... | iconv -f gb2312 -t UTF-8
◇ The KDE text editor KWrite is very good: you can choose the character set to convert from or to when opening and saving files.
◇ The command for displaying the binary data of a file is hexdump -C.
◇ Entries for FAT/FAT32 partitions in fstab:
/dev/hdb1 /mnt/winc vfat defaults,codepage=936,iocharset=cp936 0 0
/dev/hda5 /mnt/wind vfat defaults,codepage=936,iocharset=cp936 0 0
◇ Mounting a shared directory from another Windows machine on a Linux subdirectory:
mount -t smb -o charset=gb2312,codepage=cp936,username=xxx //pcname/share /mnt/dir1
4. Web browsing
◇ The protocol part of HTTP (the request line and headers) is ASCII-based, and non-ASCII characters in URLs are typically sent as UTF-8. The character set of the data part (the body) is specified in the protocol headers or in the HTML file.
◇ The character set of an HTML file is declared inside the file itself: <meta http-equiv="Content-Type" content="text/html; charset=Unicode">. The browser does not know the character set of the data it receives, yet it has to find this rather complicated string inside that data before it knows which character set to use.
A test: use FrontPage to write a few lines of Chinese, set the page character set to Unicode (Page Properties -> Text -> Encoding), and save the file. Then open the file in binary mode, delete the two bytes "FF FE" at the start of the UTF-16 file, and save it. After that, IE, FrontPage, and InterDev can no longer recognize the file when opening or browsing it; only Firefox displays it correctly. Firefox is really nice. It seems that IE and the others rely entirely on the file header (the byte order mark), and sometimes the character set declaration in the HTML has no effect at all.
◇ On Windows, both IE and Firefox convert the characters the user types into Unicode for storage, including text in the address bar and in forms. Address bar text is converted according to the HTTP character set before it is sent to the server. Form data is converted according to the character set specified in the HTML (or the character set the browser identified) before it is sent to the server (the server-side application).
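The "FF FE" bytes removed in the test above are the byte order mark (BOM) of a little-endian UTF-16 file. A detector that relies only on the file header, which is what the test suggests IE and the others did, might look roughly like this Java sketch. The class BomSniffer is hypothetical; returning null stands for "no BOM, fall back to some other method".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical BOM-only detector, for illustration.
final class BomSniffer {
    // Returns the charset indicated by a leading BOM, or null if there is none.
    static Charset detect(byte[] data) {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    public static void main(String[] args) {
        // "中" saved as little-endian UTF-16 with a BOM: FF FE 2D 4E
        byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x2D, 0x4E};
        // The same file after the first two bytes are deleted: 2D 4E
        byte[] withoutBom = {0x2D, 0x4E};
        System.out.println(detect(withBom));    // UTF-16LE
        System.out.println(detect(withoutBom)); // null
    }
}
```

Once the BOM is gone, nothing in the remaining bytes says "UTF-16", so a program has to guess from the content or trust a declaration it may not be able to read yet.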
5. Java/J2EE
◇ The character set inside the JVM is Unicode (UTF-16).
◇ A Java source file may be in any character set. It is usually determined when the editor saves the file (by default, the character set of the system user interface). When compiling with javac, you can specify the character set of the source file (the -encoding option). If it is not specified, the environment variable or the system default character set is used. The character set of the class file is Unicode (string constants are stored in a modified UTF-8 form).
◇ A Java String is always UTF-16 internally, so converting between character sets means converting bytes. To convert GB2312 bytes to UTF-8 bytes: byte[] utf8 = new String(gb2312Bytes, "GB2312").getBytes("UTF-8"); The frequently quoted form String str2 = new String(str1.getBytes("gb2312"), "UTF-8"); decodes GB2312 bytes as if they were UTF-8 and generally produces garbled text.
◇ A JSP file may be in any character set. During compilation, the character set declaration is looked up in the <%@ ... %> directive at the beginning of the file.
◇ On Linux, mail sent by JavaMail is by default converted to the character set defined in the environment variables of the local system.
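As a sketch of what character set conversion means in Java: because a String is always UTF-16, converting from GB2312 to UTF-8 is a matter of decoding bytes into a String and encoding that String back into bytes, much like iconv does on the command line. The class Transcoder below is a hypothetical helper written for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: converts bytes from one character set to another.
final class Transcoder {
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from); // bytes -> UTF-16 String
        return text.getBytes(to);              // UTF-16 String -> bytes
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312");
        byte[] gbBytes = {(byte) 0xD6, (byte) 0xD0}; // "中" in GB2312
        byte[] utf8 = transcode(gbBytes, gb2312, StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF); // prints E4 B8 AD
        }
        System.out.println();

        // Decoding with the wrong character set does not convert anything:
        String wrong = new String(gbBytes, StandardCharsets.UTF_8);
        System.out.println(wrong.equals("\u4E2D")); // false
    }
}
```

The second half of main shows the common mistake: decoding GB2312 bytes as UTF-8 does not throw an exception, it just yields replacement characters instead of the original text.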
6. VC/.NET
◇ Omitted
7. Database
7.1 MySQL (4.1.8)
◇ The character set can be specified when the MySQL server is started, or written into the configuration file that is read at startup. The simple approach is to set it in the graphical management tool MySQL Administrator.
◇ MySQL allows a different character set for each database, table, and column. If a name column must support internationalization (for example, personal names in Chinese, English, Japanese, Korean, Arabic, and other languages), the UTF-8 character set is the better choice.
◇ MySQL provides several collations for each character set, but for UTF-8 there is no collation that sorts Chinese characters by pinyin (UTF-8 Chinese characters are sorted by radical and stroke count). A workaround for sorting a UTF-8 column of Chinese text by pinyin is to add a second column in the GBK character set. On each insert, write the same data to both columns in their respective character sets, and sort by the GBK column.
◇ MySQL's JDBC driver is "smart": it converts automatically between the Unicode character set in Java and the character set in the database.
◇ MySQL's graphical client, MySQL Query Browser, also handles automatic character set conversion well.
◇ MySQL's command-line client, mysql, handles automatic character set conversion poorly.
8. Email
◇ Omitted in this version; to be covered in detail in the next version.
9. SSH
◇ Omitted
Every character belongs to some character set, and character sets are everywhere.