I have been studying UTF-8 encoding lately and it has left me dizzy, so I would like to share my understanding with you.
Thank you for reading. Everything below is my own understanding. If anything is wrong, please point it out.
========================================================== ============================
Some related background:
I. Operating system
Windows is Unicode internally. Folder names and file names are stored as Unicode and display correctly on a system of any language.
II. Input method
Microsoft Pinyin outputs Unicode, while Intelligent ABC outputs Simplified Chinese (so Intelligent ABC cannot enter Chinese on a non-Simplified-Chinese system; there it can only enter English).
III. textarea
A textarea on a web page works in Unicode, so it can display any character. Some input boxes built with Flash cannot.
IV. Access 2000
Data stored in Access is Unicode and can be displayed on a system of any language.
If some characters look wrong in the data view, it is because the display font is not a Unicode font.
Switch to the Arial Unicode MS font and they will all display. (See Access Help: search for "Unicode".)
V. Word
Word can convert between Simplified and Traditional Chinese. After converting Simplified to Traditional, the internal code still appears to be Simplified Chinese.
VI. ASP
ASP is Unicode internally: all text is held as Unicode and converted to the specified character set as needed.
========================================================== ====================
Conclusion:
<%@ CodePage=936 %> Simplified Chinese
<%@ CodePage=950 %> Traditional Chinese
<%@ CodePage=65001 %> UTF-8
CodePage specifies the encoding IIS uses to read incoming strings (form submissions, values passed in the address bar).
It also specifies the encoding to which all text variables are converted from Unicode on output,
and the encoding to which data retrieved from the database is converted from Unicode. (Note this last point.)
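To make the three CodePage values concrete, here is a small Python sketch (Python rather than ASP, only so it can be run anywhere). The mapping table and the two helper names are my own assumptions for illustration. They are not part of ASP or IIS.

```python
# Illustrative mapping from the three CodePage values to Python codec names.
# The dictionary and function names are assumptions made for this sketch.
CODEPAGE_TO_CODEC = {936: "cp936", 950: "cp950", 65001: "utf-8"}

def to_codepage(text: str, codepage: int) -> bytes:
    """Convert Unicode text to the bytes of the given code page."""
    return text.encode(CODEPAGE_TO_CODEC[codepage])

def from_codepage(data: bytes, codepage: int) -> str:
    """Read bytes as the given code page and return Unicode text."""
    return data.decode(CODEPAGE_TO_CODEC[codepage])

sample = "中文"  # both characters exist in all three code pages
for cp in (936, 950, 65001):
    raw = to_codepage(sample, cp)
    # Same Unicode text, different bytes per code page, same text when read back.
    print(cp, raw.hex(), from_codepage(raw, cp) == sample)
```

The point is only that one Unicode string has a different byte form under each code page, and that reading the bytes back with the same code page recovers the text.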
Keywords:
Reading: the bytes of a string are interpreted under a given code page. Read as Simplified Chinese they give one set of characters, read as Traditional Chinese they give another. The encoding of the string itself is not changed.
Conversion: the system actively converts the text. For example, converting Unicode to Big5 produces Big5. If Big5 has no corresponding character, the Unicode numeric form (&#xxxx;) is kept.
Simplified Chinese: Six conclusions
Unicode, hexadecimal form: Six conclusions
Unicode, decimal form: Six conclusions
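The "conversion" behaviour can be imitated in Python with the `xmlcharrefreplace` error handler, which keeps unconvertible characters in decimal numeric form. This is a sketch of the idea and not what IIS or the browser literally runs. The sample word "论文" and the helper name are my own choices.

```python
def convert_with_refs(text: str, codec: str) -> bytes:
    # errors="xmlcharrefreplace" swaps each unencodable character for &#NNNNN;
    return text.encode(codec, errors="xmlcharrefreplace")

# "论" (U+8BBA) is a Simplified-only character with no Big5 code; "文" exists in Big5.
converted = convert_with_refs("论文", "big5")
print(converted)

# The two numeric forms of the same character:
print(f"hex form: &#x{ord('论'):X};  decimal form: &#{ord('论')};")
```

The first character comes out as a numeric reference and the second as ordinary Big5 bytes, which is the mixed result described above.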
Here is the encoding conversion process as I understand it:
Client: Unicode input method -- Unicode input box -- the browser converts the Unicode text to the encoding named by charset -- the form is sent in that encoding
Server (receiving): IIS decodes the form data -- reads the bytes using the code page specified by CodePage -- converts them to the corresponding Unicode -- the script reads the value with Request("") -- some processing -- the value is saved to the database as Unicode.
Server (sending): reads Unicode data from the database -- converts it to the encoding specified by CodePage -- generates the page source -- IE reads and displays the page according to charset.
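The whole chain can be modelled in a few lines of Python. This is only my simplified model of the process above: the function name and parameters are hypothetical, and real IIS and browser behaviour has more details (for example how unconvertible characters are replaced).

```python
def round_trip(text, form_charset, add_codepage, read_codepage, display_charset):
    # Client: the browser converts the Unicode input to the form's charset;
    # characters the charset lacks are sent as &#NNNNN; references.
    sent = text.encode(form_charset, errors="xmlcharrefreplace")
    # Server (receiving): the bytes are read under CodePage and become
    # the Unicode that is stored in the database.
    stored = sent.decode(add_codepage, errors="replace")
    # Server (sending): the stored Unicode is converted to CodePage bytes.
    page = stored.encode(read_codepage, errors="replace")
    # Browser: the page bytes are interpreted according to charset.
    shown = page.decode(display_charset, errors="replace")
    return stored, shown

# Everything consistent: the text survives.
print(round_trip("中文", "gbk", "gbk", "gbk", "gbk"))
# Form sent as Big5 but read as code page 936: the stored Unicode is wrong.
print(round_trip("中文", "big5", "gbk", "gbk", "gbk"))
```

Changing the four encoding arguments lets you try the combinations in the examples that follow.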
Here are some examples:
Example 1:
Suppose there are three ASP pages, a typical guestbook:
1. write.asp: a simple input form that submits to add.asp.
<meta http-equiv="Content-Type" content="text/html; charset=big5">
2. add.asp: receives the message and saves it to the database.
<%@ CodePage=936 %>
3. read.asp: reads the message from the database and displays it.
<%@ CodePage=936 %> with charset=gb2312, or
<%@ CodePage=950 %> with charset=big5
Try to guess: if I use the Microsoft Pinyin input method to type "Six discussions" into write.asp, what will read.asp display?
Dizzy yet? Let's analyze it from the beginning.
Example 2:
In Example 1, change <%@ CodePage=936 %> in add.asp to <%@ CodePage=950 %>. What happens?
What do you notice here?
1. If the input text does not belong to the form's character set, the conversion may turn some characters into the Unicode numeric form. That is where such characters come from, and they are kept in that form through the rest of the process.
2. In add.asp, CodePage determines which Unicode characters are saved to the database: those corresponding to the chosen language. With CodePage=936,
the database stores the Unicode for Simplified Chinese (when the data is read back on a Simplified Chinese system, everything is normal);
with CodePage=950 it stores the Unicode for Traditional Chinese (reading that back on a Simplified Chinese system gives wrong results).
3. Pay attention to how the string changes at each step:
--------------------------------------------------------------------
1) Input method -- charset                       Unicode -- specified character set (mapping)
2) charset -- form encoding                      Simple string encoding
3) Form decoding                                 Inverse of the previous step; the two steps cancel out
4) Read string by CodePage                       String unchanged; this is the step where it may be "misread"
5) Convert to the corresponding Unicode          CodePage character set -- Unicode (mapping)
6) Intermediate processing, save to database     No change
7)
8) Read database by CodePage                     Unicode -- CodePage character set (mapping)
9) Display, read by charset                      String unchanged
-------------------------------------------------------------------------------
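Steps 4 and 5 are where things go wrong, so here is a closer look in Python. It is a sketch under my own assumptions: "中文" stands in for the submitted text, and Python's cp936/cp950 codecs stand in for the ASP code pages.

```python
# Steps 1-3: what reaches the server after form decoding (form charset was Big5).
form_bytes = "中文".encode("big5")

# Steps 4-5 with a CodePage that matches the form charset.
as_950 = form_bytes.decode("cp950")
# Steps 4-5 with a CodePage that does not match: the same bytes are "misread".
as_936 = form_bytes.decode("cp936", errors="replace")

print(form_bytes.hex())                 # the bytes are identical in both cases
print([hex(ord(c)) for c in as_950])    # Unicode code points stored with 950
print([hex(ord(c)) for c in as_936])    # different code points stored with 936
```

The bytes never change. Only the Unicode they are mapped to changes, and that is what ends up in the database.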
Example 1:
Example 2:
========================================================== =====
Dizzy yet? Now let's put this knowledge to use.
Case 1.
Code that runs fine on a Simplified Chinese system is moved to hosting space overseas, and the original data from the database comes out garbled.
Analysis: most of us work on a Simplified Chinese system, where the default code page is 936, so normally it does not matter if you leave it out.
On an overseas host, however, the problem surfaces. The Unicode in the database is converted to the host's English encoding, so the Simplified Chinese originally stored in the database, once converted to English encoding, shows up as garbage when displayed as GB.
Newly entered text displays normally, but what the database holds for it is the Unicode of the bytes read as English text, not the Chinese characters.
Solution: add <%@ CodePage=936 %> to every page.
Then only conversion between Simplified Chinese and Unicode takes place throughout the process.
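Case 1 can be reproduced in Python. I am assuming here that the overseas host defaults to the Western code page 1252. The actual default depends on the server, so treat that as an assumption.

```python
stored = "中文"  # Unicode already in the database

# No CodePage directive on an English host (assumed default: Windows-1252).
english = stored.encode("cp1252", errors="replace")
# With CodePage=936 the same Unicode converts cleanly.
chinese = stored.encode("cp936")

print(english)                # unencodable characters are replaced
print(chinese.decode("gbk"))  # a browser reading the page as GB gets the text back

# Newly entered text on the English host: GB bytes are read as cp1252,
# stored as "English" Unicode, and written back out as the same bytes.
new_stored = "中文".encode("gbk").decode("cp1252")
new_shown = new_stored.encode("cp1252").decode("gbk")
print(repr(new_stored), new_shown)
```

This matches the symptom: old data is destroyed on output, while new data looks fine on the page even though the database holds the wrong Unicode for it.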
Case 2:
What should I do if I want to convert Simplified Chinese code and data into a fully Traditional Chinese version?
Analysis: 1. Convert the code files to Big5 and save them with Traditional Chinese encoding.
2. <%@ CodePage=950 %>
3. charset=big5
4. The Access database needs no change, because the data in Access is Unicode.
5. Now the code can run on a purely Traditional Chinese system.
6. Remaining issue: when the original Simplified Chinese data is read, some characters come out as question marks. This is the same as reading with 950 and displaying as big5 in Example 1: the Unicode for Simplified Chinese is converted to Traditional Chinese (Big5), and characters that do not exist in Big5 become question marks.
7. Possible solution: use a temporary ASP page with CodePage=65001 to read the Simplified Chinese Unicode, run it through a Unicode -> Big5 function that converts it to Traditional Chinese, and write the result back to the database. Would that work?
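Here is a Python sketch of the question-mark problem and of the conversion idea in point 7. The two-entry table and the function name are hypothetical, for illustration only. A real conversion needs a complete Simplified-to-Traditional mapping.

```python
# Tiny illustrative Simplified -> Traditional table (assumption for this sketch).
S2T = {"论": "論", "陆": "陸"}

def to_traditional(text: str) -> str:
    # Replace each character that has a Traditional counterpart; keep the rest.
    return "".join(S2T.get(ch, ch) for ch in text)

old_data = "论文"  # Simplified Chinese Unicode as stored in the database

# Converting straight to Big5: the Simplified-only character is lost.
direct = old_data.encode("big5", errors="replace")
# Converting to Traditional characters first: everything fits in Big5.
fixed = to_traditional(old_data).encode("big5")

print(direct)
print(fixed.decode("big5"))
```

The mapping step has to happen on the Unicode text, before anything is converted to Big5 bytes.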
Case 3:
What should I do if I want to convert Simplified Chinese code and a database into a fully UTF-8 version?
Analysis: 1. Change the encoding of all code files to UTF-8, and save the files themselves as UTF-8.
2. <%@ CodePage=65001 %>
3. charset=UTF-8
4. The Access database needs no change, because the data in Access is Unicode.
5. Done, nothing left over. The original Simplified Chinese data also displays normally: the database holds Unicode and UTF-8 can represent every Unicode character, so the conversion loses nothing and nothing is garbled. Moving to UTF-8 turns out to be quite easy.
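A quick Python check of why UTF-8 leaves nothing behind: Simplified and Traditional characters can sit in the same string and survive the round trip, which a single legacy code page cannot promise. The sample string is my own.

```python
mixed = "简体 繁體 论 論"  # Simplified and Traditional in one string

utf8_bytes = mixed.encode("utf-8")
print(utf8_bytes.decode("utf-8") == mixed)  # lossless round trip

# Each character simply gets its own UTF-8 byte sequence.
for ch in "论論":
    print(ch, ch.encode("utf-8").hex())

# For contrast, Big5 cannot hold the Simplified-only character.
try:
    "论".encode("big5")
    big5_ok = True
except UnicodeEncodeError:
    big5_ok = False
print(big5_ok)
```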
===========================================================
I worked these cases out from theory. They are unverified.
If you have relevant experience, criticism and corrections are welcome.