Why CodePage matters: preventing garbled Chinese characters during website development

Source: Internet
Author: User

Related background:

I. Operating System
Windows works in Unicode internally. Folder names and file names are all Unicode, so they display correctly on a system of any language.

II. Input Method
Microsoft Pinyin outputs Unicode, while Intelligent ABC outputs Simplified Chinese encoded text (so on a non-Simplified Chinese system Intelligent ABC cannot be used for Chinese and can only enter English).

III. textarea
A textarea on a web page displays text as Unicode, so it can show characters from any language. Some input boxes built with Flash cannot.

IV. Access 2000
Data stored in Access is Unicode and can be displayed on a system of any language.
If some characters look wrong in the datasheet view, it is because the display font is not a Unicode font.
Switch to the Arial Unicode MS font and they will all display. (See Access Help: search for "Unicode" for instructions.)

V. Word
Word offers Simplified/Traditional conversion. After text is converted from Simplified to Traditional Chinese, its internal code is still Simplified Chinese; only the character forms are traditional.

VI. ASP
ASP works in Unicode internally: all text is held as Unicode and converted to the specified character set as needed.

Conclusion:
<%@ CodePage=936 %> Simplified Chinese
<%@ CodePage=950 %> Traditional Chinese
<%@ CodePage=65001 %> UTF-8

CodePage specifies the encoding IIS uses to read strings passed to the page (such as form submissions and values passed in the address bar).

It also specifies the encoding that all text variables are converted to from Unicode,
and the encoding that Unicode data retrieved from the database is converted to. (Note this.)
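
To make the three code pages concrete, here is a minimal Python sketch. Python is used only for illustration because its codecs cover the same Windows code pages; the `CODE_PAGES` table, the `to_code_page` helper and the sample text 中文 are my own assumptions, not part of ASP or of the original article. It shows that one piece of Unicode text becomes different bytes under each code page.

```python
# Illustration only: Python codec names stand in for the ASP CodePage numbers.
CODE_PAGES = {
    936: "cp936",    # Simplified Chinese (GBK)
    950: "cp950",    # Traditional Chinese (Big5)
    65001: "utf-8",  # UTF-8
}


def to_code_page(text: str, code_page: int) -> bytes:
    """Convert Unicode text to the bytes of the given code page."""
    return text.encode(CODE_PAGES[code_page])


sample = "中文"  # assumed sample text; both characters exist in GBK and Big5
for number in CODE_PAGES:
    # Print the bytes as hex so the differences are easy to compare.
    print(number, to_code_page(sample, number).hex())
```

The same two characters take four bytes under 936 and 950 (with different values) and six bytes under 65001, which is why the reader and the writer must agree on the code page.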

Keywords:
Reading: the same string reads as one set of characters when interpreted as Simplified Chinese and as a different set when interpreted as Traditional Chinese. The encoding of the string itself is not changed.

Conversion: the system actively converts the text, for example from Unicode to big5. If big5 has no corresponding character, the character is kept in Unicode reference form (&#xxx;).
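
The difference between the two keywords can be sketched in Python as follows. This is only an illustration under my own assumptions: the sample characters are chosen by me, and Python's `xmlcharrefreplace` error handler stands in for the fallback to a numeric reference when the target character set lacks a character.

```python
# Bytes as they would arrive from a Simplified Chinese (GBK) page.
raw = "中文".encode("gbk")

# Reading: the bytes stay the same, only their interpretation changes.
as_simplified = raw.decode("gbk")
as_traditional = raw.decode("big5", errors="replace")

# Conversion: Unicode text is actively mapped into another character set.
# The simplified character below does not exist in Big5, so in this sketch
# it falls back to a numeric character reference.
converted = "论".encode("big5", errors="xmlcharrefreplace")

print(raw.hex())   # unchanged by either reading
print(converted)
```

Reading never touches `raw`; it only yields different characters. Conversion produces new bytes, and that is the step where a reference form can appear.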

Simplified Chinese: Six conclusions
Unicode, hexadecimal form: Six conclusions
Unicode, decimal form: Six conclusions

The following is the encoding conversion process as I have worked it out:
Client: Unicode input method -> Unicode input box -> converted from Unicode to the encoding named by charset -> encoded for form submission

Server side: IIS decodes the form encoding -> reads the string using the encoding specified by CodePage -> converts it to the corresponding Unicode -> the script reads it with Request("") -> some processing -> saved to the database as Unicode.

Server: reads Unicode data from the database -> converts it to the encoding specified by CodePage -> generates the page source -> IE reads and displays the data according to charset.
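
The three stages above can be modelled roughly in Python. This is a sketch of my reading of the process, not how IIS is actually implemented: the function names are hypothetical, percent-encoding via `urllib.parse` stands in for the form encoding, and the error handlers are my own choice.

```python
from urllib.parse import quote, unquote_to_bytes


def client_submit(text, charset):
    """Browser: Unicode -> charset bytes -> percent-encoded form value."""
    data = text.encode(charset, errors="xmlcharrefreplace")
    return quote(data)


def server_receive(form_value, code_page):
    """Server: undo the form encoding, then read the bytes by CodePage."""
    return unquote_to_bytes(form_value).decode(code_page, errors="replace")


def server_render(stored_text, code_page):
    """Server: convert Unicode from the database to CodePage bytes."""
    return stored_text.encode(code_page, errors="replace")


def browser_display(page_bytes, charset):
    """Browser: read the page bytes according to charset."""
    return page_bytes.decode(charset, errors="replace")


def round_trip(text, form_charset, code_page, page_charset):
    """Run one message through submit, store, render and display."""
    stored = server_receive(client_submit(text, form_charset), code_page)
    return browser_display(server_render(stored, code_page), page_charset)
```

With consistent settings, for example `round_trip("中文", "gb2312", "cp936", "gb2312")`, the text comes back intact. Changing only the form charset to `"big5"` while the server still reads with `"cp936"` returns different characters.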

The following is an example:
Example 1:
Suppose there are three ASP pages, a typical message board:
1. write.asp: a simple input form that submits to add.asp.
<meta http-equiv="Content-Type" content="text/html; charset=big5">
2. add.asp: receives the message and saves it to the database.
<%@ CodePage=936 %>
3. read.asp: gets the message from the database and displays it.
<%@ CodePage=936 %> with charset=gb2312, or
<%@ CodePage=950 %> with charset=big5

Now guess: I use the Microsoft Pinyin input method in write.asp to type "Six discussions". What will read.asp display?
Dizzy yet? Let's analyze it from the beginning.

Example 2:
Change <%@ CodePage=936 %> in add.asp of Example 1 to <%@ CodePage=950 %>. What will happen?

What can we learn from this?
1. If the input text does not match the character set, a conversion may leave characters in Unicode reference form. This is the reason; the whole process is traced later.
2. In add.asp, CodePage determines which language's Unicode the text is saved to the database as. For example, with CodePage=936
the database stores the Unicode of the Simplified Chinese reading (read back on a Simplified Chinese system, everything is normal),
while with CodePage=950 it stores the Unicode of the Traditional Chinese reading (read back on a Simplified Chinese system, it is wrong).

3. Pay attention to how the string changes along the way:

1) Input method to charset: mapping from Unicode to the character set specified by charset
2) charset to form encoding: a simple encoding of the string
3) Form decoding: the inverse of step 2, so the two steps cancel out
4) Reading by CodePage: the string itself does not change. This step may cause a "misreading"
5) Conversion to the corresponding Unicode: mapping from the character set specified by CodePage to Unicode
6) Intermediate processing and saving to the database: no change, stored directly as Unicode
7) Reading the database Unicode by CodePage: mapping from Unicode to the character set specified by CodePage
8) Display: the string is read by the character set specified by charset; the string itself is unchanged
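
As a rough check of these eight steps, the settings of Example 1 and Example 2 can be traced in Python. This is a sketch under assumptions: the `trace` helper is hypothetical, the sample text 中文 is my own choice (both characters exist in big5 and gb2312), and Python codecs stand in for the IIS conversions.

```python
def trace(text, form_charset, add_code_page, read_code_page, page_charset):
    """Follow one message from write.asp through add.asp to read.asp."""
    # Steps 1-3: the browser encodes by charset; form encoding and
    # decoding cancel each other, so only the charset bytes remain.
    submitted = text.encode(form_charset, errors="xmlcharrefreplace")
    # Steps 4-6: add.asp reads the bytes by its CodePage and stores Unicode.
    stored = submitted.decode(add_code_page, errors="replace")
    # Step 7: read.asp converts the stored Unicode by its own CodePage.
    page_bytes = stored.encode(read_code_page, errors="replace")
    # Step 8: the browser reads the page bytes by charset.
    displayed = page_bytes.decode(page_charset, errors="replace")
    return {"stored": stored, "displayed": displayed}


SAMPLE = "中文"  # assumed sample text
# Example 1: form in big5, add.asp reads with 936, read.asp uses 936/gb2312.
example_1 = trace(SAMPLE, "big5", "cp936", "cp936", "gb2312")
# Example 2: same form, but add.asp and read.asp use 950/big5.
example_2 = trace(SAMPLE, "big5", "cp950", "cp950", "big5")

print(example_1["stored"] == SAMPLE)  # False: misread at step 4
print(example_2["stored"] == SAMPLE)  # True: CodePage matches the form
```

In this model the damage in Example 1 happens at step 4: the big5 bytes are read as if they were Simplified Chinese, so the wrong Unicode is what reaches the database.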

Example 1:

Example 2:

Dizzy yet? Now let's put this knowledge to use.

Case 1.
Code that runs well on a Simplified Chinese system is moved to hosting space abroad. The database content is garbled, and the original data displays as garbled text.
Analysis: most people use a Simplified Chinese system, where the default CodePage is 936, so it normally does not matter if you leave the directive out.
On overseas hosting, however, the problem surfaces: Unicode from the database is converted to the English encoding. The Simplified Chinese text already in the database is therefore converted to English, and it shows up as garbled characters when displayed as GB.
Newly entered text displays normally, but what the database stores is the Unicode of its English reading.
Solution: add <%@ CodePage=936 %> to every page.
That way the whole process only ever converts between Simplified Chinese and Unicode.
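
Case 1 can be reproduced in a small Python sketch. It assumes the overseas server's default "English" code page is Windows-1252, which the article does not state, and it uses Python's `?` replacement to stand in for whatever the server does with characters it cannot convert.

```python
ENGLISH = "cp1252"  # assumption: default code page of the overseas server
CHINESE = "cp936"

# Existing data: genuine Chinese Unicode has no form in the English code
# page, so each character is lost when the page is generated.
old_row = "中文"
old_page = old_row.encode(ENGLISH, errors="replace")

# New data: GBK bytes from the form are read as English and stored that way.
form_bytes = "中文".encode(CHINESE)
new_row = form_bytes.decode(ENGLISH)   # Unicode of the English reading
new_page = new_row.encode(ENGLISH)     # converts back to the same bytes
shown = new_page.decode(CHINESE)       # browser reads them as gb2312

print(old_page)            # the old data is lost on output
print(new_page.hex())      # identical to the submitted GBK bytes
```

The new text survives only because the same wrong reading is applied on the way in and on the way out; the row in the database holds Latin letters, not Chinese.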

Case 2:
What should I do if I want to convert Simplified Chinese code and data into a fully Traditional Chinese version?
Analysis: 1. Convert the code files to big5: set the file storage encoding to Traditional Chinese.
2. <%@ CodePage=950 %>
3. charset = big5
4. The Access version does not matter, because the data in Access is Unicode.
5. The code can now run on a Traditional Chinese system.
6. Remaining problem: some question marks appear when reading the existing Simplified Chinese data. As with read.asp in Example 1, the data is read with CodePage 950 and displayed as big5. Because the Unicode of Simplified Chinese text is being converted to Traditional Chinese, characters that do not exist in the Traditional Chinese set come out as question marks.
7. Solution: use a temporary ASP page with CodePage=65001 to read the Simplified Chinese Unicode, pass it through a Unicode-to-big5 function that converts it to Traditional Chinese, and write it back to the database. Would that work?
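
The idea in step 7 might look like the following Python sketch. Everything here is an assumption for illustration: the mapping table is deliberately tiny and hypothetical (a real conversion needs a complete Simplified-to-Traditional table, and some characters map to more than one traditional form), and the sample row is my own.

```python
# Hypothetical one-entry mapping; a real table covers thousands of characters.
SIMPLIFIED_TO_TRADITIONAL = str.maketrans({"论": "論"})


def to_traditional(text):
    """Replace simplified characters with their traditional forms."""
    return text.translate(SIMPLIFIED_TO_TRADITIONAL)


def to_big5(text):
    """Convert to traditional forms first, then encode as Big5."""
    return to_traditional(text).encode("big5")


row = "六论"  # assumed Simplified Chinese row read from the database

# Converting directly loses the simplified character (step 6's question mark).
direct = row.encode("big5", errors="replace")

# Mapping to traditional forms first lets every character survive.
converted = to_big5(row)
```

After this, the traditional text returned by `to_traditional` is what would be written back to the database, so later reads with CodePage 950 no longer hit characters missing from big5.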

I deduced these two cases from theory; they have not been verified.
Comments from anyone with similar experience are welcome.
