ASP.NET Chinese Encoding

Source: Internet
Author: User

A few days ago I was handling URL parameters in ASP.NET and kept getting garbled text, which was frustrating. I found a solution online and saved this article for reference.

ASP.NET Encoding
Garbled text: a pain we all share!
• "Characters and encoding" is a frequently discussed topic, yet garbled text still plagues everyone. We have many ways to get rid of garbled text, but we do not necessarily understand the principles behind them. Some garbling is in fact caused by problems in the underlying code itself.

• As a result, it is not only beginners who are hazy about character encoding; some developers of low-level code also lack an accurate understanding of it.

Development of characters and encoding

How strings are stored in memory
• ASCII stage: a single-byte character string (SBCS) uses one byte to store each character. For example, "Bob123" in memory is:

-42 6F 62 31 32 33 00

-B  o  b  1  2  3  \0

• ANSI encoding stage, which supports multiple languages: each character is represented by one or more bytes. For example, "中文123" is represented by seven bytes (plus a terminating zero byte) in memory on Chinese Windows 95:

-D6D0 CEC4 31 32 33 00

-中   文   1  2  3  \0

• Unicode stage: when storing a string, the computer stores the number of each character in the Unicode character set. Currently these numbers are generally stored in two bytes (16 bits) each, so characters stored this way are also called wide characters. For example, when the string "中文123" is stored on Windows 2000, memory actually holds 5 numbers occupying 10 bytes in total (followed by a two-byte terminator):

-2D4E 8765 3100 3200 3300 0000

-中   文   1    2    3    \0
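The byte layouts above can be checked with a short program. This is a minimal sketch, assuming modern .NET (where legacy code pages such as gb2312 have to be registered through CodePagesEncodingProvider; on .NET Framework the registration line is not needed). The class and method names (MemoryLayoutDemo, Gb2312, Hex) are illustrative only, and the sample string is "中文123", the two-character word for "Chinese" followed by 123.

```csharp
using System;
using System.Text;

public static class MemoryLayoutDemo
{
    // Assumption: modern .NET, where gb2312 is only available after registering the code page provider.
    public static Encoding Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    // Formats a byte array as space-separated uppercase hex, e.g. "D6 D0".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    public static void Main()
    {
        string text = "\u4E2D\u6587123"; // "中文123"

        // ANSI (gb2312) layout: two bytes per Chinese character, one per digit
        Console.WriteLine(Hex(Gb2312().GetBytes(text)));         // D6 D0 CE C4 31 32 33

        // Unicode (UTF-16, little-endian) layout: two bytes per character
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(text))); // 2D 4E 87 65 31 00 32 00 33 00
    }
}
```

Neither dump includes the terminating zero, because .NET strings are length-prefixed rather than zero-terminated.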

[Sample Code]

using System;
using System.IO;
using System.Text;

// Note: on .NET Core / .NET 5+, call Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)
// before requesting "gb2312".

string str = "中文123"; // held in memory as Unicode (UTF-16)
Console.WriteLine(str.Length); // the length is 5

// Converting between strings and byte arrays
Encoding gb = Encoding.GetEncoding("gb2312");
byte[] bytes = gb.GetBytes(str); // encode the string into bytes using gb2312
char[] chars = gb.GetChars(bytes); // decode the bytes back into Unicode characters using gb2312
str = gb.GetString(bytes); // decode the bytes back into a string using gb2312

// There are two ways to write a string to a text file

// Method 1: a StreamWriter created without an encoding argument writes UTF-8
StreamWriter sw = new StreamWriter("1.txt"); // 9 bytes
sw.Write(chars);
sw.Close();

// Method 2: construct the writer with the desired encoding
StreamWriter sw2 = new StreamWriter("2.txt", false, gb); // 7 bytes
sw2.Write(str);
sw2.Close();

Unicode Character Set
• Origin of the name

-The Unicode character set is aligned with the Universal Multiple-Octet Coded Character Set (UCS) of ISO/IEC 10646. It assigns a single, unique code to every character of every language, to meet the needs of cross-language and cross-platform text conversion and processing.

• UTF-8 encoding

-UTF-8 is one way of using Unicode. UTF stands for Unicode Transformation Format, that is, a way of converting Unicode code points into a particular byte format. UTF-8 stores Unicode characters in a variable number of bytes: ASCII letters still take 1 byte; accented letters, Greek letters, Cyrillic letters and the like take 2 bytes; commonly used Chinese characters take 3 bytes; and characters in the supplementary planes take 4 bytes.

• UTF-32, UTF-16 and UTF-8:

-These are the encoding schemes for the coded character set of the Unicode standard. UTF-16 encodes a Unicode code point as a sequence of one or two unsigned 16-bit code units; UTF-32 represents each Unicode code point as a 32-bit integer of the same value.
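The byte counts described for the three encoding schemes can be observed directly. The following is a small sketch; the class name UtfLengthDemo and the four sample characters are chosen here for illustration and do not come from the original article.

```csharp
using System;
using System.Text;

public static class UtfLengthDemo
{
    // Returns the encoded size in bytes of a string under UTF-8, UTF-16 and UTF-32, in that order.
    public static int[] ByteCounts(string s)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(s),
            Encoding.Unicode.GetByteCount(s), // Encoding.Unicode is UTF-16, little-endian
            Encoding.UTF32.GetByteCount(s)
        };
    }

    public static void Main()
    {
        string[] samples =
        {
            "A",                            // ASCII letter
            "\u00E9",                       // accented Latin letter (é)
            "\u4E2D",                       // common Chinese character (中)
            char.ConvertFromUtf32(0x20000)  // a supplementary-plane character
        };

        foreach (string s in samples)
        {
            int[] n = ByteCounts(s);
            Console.WriteLine("U+{0:X4}: UTF-8={1}, UTF-16={2}, UTF-32={3}",
                char.ConvertToUtf32(s, 0), n[0], n[1], n[2]);
        }
    }
}
```

The supplementary-plane sample takes 4 bytes in all three schemes: four UTF-8 bytes, two 16-bit code units (a surrogate pair) in UTF-16, and one 32-bit value in UTF-32.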

Where garbled text tends to appear
-Non-Unicode programs moved between different language environments

• Strings in a non-Unicode program are stored in some ANSI encoding. If the language environment the program runs in differs from the one it was developed in, ANSI strings are displayed incorrectly.

• Sometimes there is no way around running, say, non-Unicode Japanese software on a Chinese operating system. In that case, tools such as NJStar or AppLocale can temporarily simulate a different language environment.

-Strings submitted from a web page

• When a form on a page submits a string, the string is first converted into a byte sequence using the encoding of the current page. Each byte is then converted into the "%XX" format and sent to the web server. For example, when a page encoded as gb2312 submits the character "中", the content sent to the server is "%D6%D0".

• On the server side, the web server converts the received "%D6%D0" into two bytes, [0xD6, 0xD0], and then decodes them according to gb2312 to obtain the character "中".

• If the server instead decodes with a single-byte default such as ISO-8859-1, "%D6%D0" produces two Unicode characters, [0x00D6, 0x00D0], rather than the single character "中", and the result is garbled.
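The round trip described for form submission can be sketched as follows. This is an illustration only: PercentEncode and PercentDecode are hypothetical helpers written for this example, and they escape every byte, whereas real form encoding leaves unreserved ASCII characters as they are. In ASP.NET the library counterparts are HttpUtility.UrlEncode and HttpUtility.UrlDecode, which also accept an Encoding argument. The code assumes modern .NET, where gb2312 must be registered first.

```csharp
using System;
using System.Text;

public static class PercentEncodingDemo
{
    // Assumption: modern .NET, where gb2312 is only available after registering the code page provider.
    public static Encoding Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    // Browser side: convert the text to bytes with the page encoding, then write each byte as %XX.
    public static string PercentEncode(string text, Encoding pageEncoding)
    {
        var sb = new StringBuilder();
        foreach (byte b in pageEncoding.GetBytes(text))
        {
            sb.Append('%').Append(b.ToString("X2"));
        }
        return sb.ToString();
    }

    // Server side: turn each %XX triplet back into a byte, then decode the bytes.
    // Simplification: the input must consist of %XX triplets only.
    public static string PercentDecode(string encoded, Encoding serverEncoding)
    {
        byte[] bytes = new byte[encoded.Length / 3];
        for (int i = 0; i < bytes.Length; i++)
        {
            bytes[i] = Convert.ToByte(encoded.Substring(i * 3 + 1, 2), 16);
        }
        return serverEncoding.GetString(bytes);
    }

    public static void Main()
    {
        Encoding gb = Gb2312();
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        string submitted = PercentEncode("\u4E2D", gb); // "中" on a gb2312 page
        Console.WriteLine(submitted);                    // %D6%D0

        string right = PercentDecode(submitted, gb);     // decoded with the page's encoding
        string wrong = PercentDecode(submitted, latin1); // decoded with a single-byte default

        Console.WriteLine(right == "\u4E2D");            // True
        Console.WriteLine(wrong == "\u00D6\u00D0");      // True: two characters instead of one
    }
}
```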

-Strings read from a database

• When a database client (such as ODBC or JDBC) reads strings from the database server, the client needs to learn from the server which ANSI encoding is in use. When the server sends a byte stream to the client, the client is responsible for converting it into a Unicode string using the correct encoding.

-Strings in email

• When text or an HTML fragment is sent by email, the content is first converted into a byte sequence using a specified character encoding; that byte sequence is then converted into another byte sequence using a specified Content-Transfer-Encoding.

• The most common Content-Transfer-Encoding values are Base64 and Quoted-Printable.

Encoding of the email subject
• The subject line of an email uses a shorter format to indicate both the character encoding and the transfer encoding. For example, if the subject is "中", it is written as:

• // Correct subject format

Subject: =?GB2312?B?1tA=?=

-The part between the first "=?" and the following "?" specifies the character encoding, which is GB2312 in this example.

-The "B" between the next two "?" marks stands for Base64; a "Q" would indicate Quoted-Printable.

-The part between the last "?" and the closing "?=" is the subject content after it has been converted to bytes using GB2312 and then Base64-encoded.

• If the transfer encoding is changed to Quoted-Printable, the same subject "中" becomes:

-Subject: =?GB2312?Q?=D6=D0?=
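The two subject formats can be produced with a few lines of code. This is a rough sketch under stated assumptions: EncodeB and EncodeQ are hypothetical helpers, the Q variant escapes every byte as =XX (real Quoted-Printable leaves most printable ASCII unescaped), and the code targets modern .NET, where gb2312 must be registered first. The charset name is written in lowercase here; charset names in this format are not case-sensitive.

```csharp
using System;
using System.Text;

public static class EncodedWordDemo
{
    // Assumption: modern .NET, where gb2312 is only available after registering the code page provider.
    static Encoding Lookup(string charsetName)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(charsetName);
    }

    // Builds =?charset?B?...?= : bytes in the given charset, then Base64.
    public static string EncodeB(string text, string charsetName)
    {
        byte[] bytes = Lookup(charsetName).GetBytes(text);
        return "=?" + charsetName + "?B?" + Convert.ToBase64String(bytes) + "?=";
    }

    // Builds =?charset?Q?...?= : bytes in the given charset, each written as =XX (simplified).
    public static string EncodeQ(string text, string charsetName)
    {
        var sb = new StringBuilder("=?" + charsetName + "?Q?");
        foreach (byte b in Lookup(charsetName).GetBytes(text))
        {
            sb.Append('=').Append(b.ToString("X2"));
        }
        return sb.Append("?=").ToString();
    }

    public static void Main()
    {
        string subject = "\u4E2D"; // "中"
        Console.WriteLine("Subject: " + EncodeB(subject, "gb2312")); // Subject: =?gb2312?B?1tA=?=
        Console.WriteLine("Subject: " + EncodeQ(subject, "gb2312")); // Subject: =?gb2312?Q?=D6=D0?=
    }
}
```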

 

[Sample Code]

// How garbled text arises
string str = "中文123";

Encoding client = Encoding.GetEncoding("gb2312");
byte[] bytes = client.GetBytes(str); // the client encodes the string as gb2312 bytes

// The server decodes the bytes as gb2312
Encoding serverOk = Encoding.GetEncoding("gb2312");
string server = serverOk.GetString(bytes);
Console.WriteLine("Correct conversion: " + server);

// The server wrongly decodes the bytes as Unicode (UTF-16)
server = Encoding.Unicode.GetString(bytes);
Console.WriteLine("Wrong conversion: " + server); // garbled

 

 

The Encoding classes in .NET
System.Text
• The System.Text namespace contains classes that represent the ASCII, Unicode, UTF-7 and UTF-8 character encodings, as well as the abstract base class (Encoding) for converting blocks of characters into bytes and blocks of bytes into characters.

[Sample Code]

// Converting text from one encoding to another
string unicodeStr = "this is a room";

// Create the two encodings
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;

// Convert the bytes from one encoding to the other
byte[] unicodeBytes = unicode.GetBytes(unicodeStr);
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes); // Convert takes the source bytes, not the string

// Convert the ASCII bytes back into a string
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiStr = new string(asciiChars);

Console.WriteLine("Original: {0}", unicodeStr);
Console.WriteLine("ASCII converted string: {0}", asciiStr);

Encoding in ASP.NET
• The globalization element in the Web.config file:

-<globalization requestEncoding="UTF-8" responseEncoding="UTF-8"/>

-requestEncoding is the encoding the application assumes for incoming requests, and responseEncoding is the encoding used for the responses it sends.

The setting above applies to the whole application, whereas the following two properties affect only the current page:

• Response.ContentEncoding: gets or sets the HTTP character set of the output stream.

• HttpRequest.ContentEncoding: gets or sets the encoding that represents the character set of the data sent by the client.

If different pages use different character sets while passing data between them, garbled text may result.

Chinese character encoding
• In 1980, to give every Chinese character a unified national code, China issued its first national standard for Chinese character encoding: GB 2312-80, "Code of Chinese Graphic Character Set for Information Interchange, Primary Set", known as gb2312. This character set became the common standard for Chinese-language systems in China. It was later followed by the national standard GB 18030-2000, "Chinese Ideograms Coded Character Set for Information Interchange, Extension for the Basic Set", known as gb18030.

• On a Simplified Chinese Windows system, the default code page seen by .NET programs is the Simplified Chinese one (code page 936, which .NET reports under the name gb2312).

• A Chinese character can be identified by its zone-position code:

-For example, the gb2312 code of "好" is BA C3 in hexadecimal. The first byte gives the zone and the second byte gives the position within that zone. Subtracting 0xA0 from each byte gives 0xBA - 0xA0 = 26 and 0xC3 - 0xA0 = 35, so "好" is the 35th character of zone 26 and its zone-position code is 2635. This is how gb2312 locates Chinese characters.
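The zone and position arithmetic can be written out in code. This is a minimal sketch: ZonePosition is a hypothetical helper, it only handles characters that gb2312 encodes as two bytes, and it assumes modern .NET, where gb2312 must be registered first.

```csharp
using System;
using System.Text;

public static class ZonePositionDemo
{
    // Assumption: modern .NET, where gb2312 is only available after registering the code page provider.
    public static Encoding Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    // Returns { zone, position } for a character that gb2312 encodes as two bytes.
    public static int[] ZonePosition(char c)
    {
        byte[] b = Gb2312().GetBytes(new[] { c });
        if (b.Length != 2)
        {
            throw new ArgumentException("Not a two-byte gb2312 character.");
        }
        // Each byte is the zone or position number plus 0xA0.
        return new[] { b[0] - 0xA0, b[1] - 0xA0 };
    }

    public static void Main()
    {
        int[] zp = ZonePosition('\u597D'); // "好", encoded as BA C3
        Console.WriteLine("{0:D2}{1:D2}", zp[0], zp[1]); // 2635
    }
}
```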

 

[Sample code: generating a Chinese verification code]

// Get the gb2312 encoding
Encoding gb = Encoding.GetEncoding("gb2312");

// Call the function below to generate the codes of four random Chinese characters
object[] bytes = CreateRegionCode(4);

// Decode each two-byte array into a Chinese character
string str1 = gb.GetString((byte[])Convert.ChangeType(bytes[0], typeof(byte[])));
string str2 = gb.GetString((byte[])Convert.ChangeType(bytes[1], typeof(byte[])));
string str3 = gb.GetString((byte[])Convert.ChangeType(bytes[2], typeof(byte[])));
string str4 = gb.GetString((byte[])Convert.ChangeType(bytes[3], typeof(byte[])));

string strKey = str1 + str2 + str3 + str4;
Session["regcode"] = strKey; // keep the code in the session

byte[] data = gb.GetBytes(strKey);
Response.ContentEncoding = gb;
Response.OutputStream.Write(data, 0, data.Length);

 

 

public static object[] CreateRegionCode(int length)
{
    // The sixteen hexadecimal digits from which the character codes are built
    string[] rBase = new string[16] { "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F" };

    Random rnd = new Random();

    object[] bytes = new object[length];

    // Each pass of the loop builds a two-element byte array and stores it in the object array.
    // The code of a Chinese character is made up of four hexadecimal digits:
    // digits 1 and 2 form the first element of the byte array,
    // digits 3 and 4 form the second element.
    for (int i = 0; i < length; i++)
    {
        // First digit: B, C or D
        int r1 = rnd.Next(11, 14);
        string str_r1 = rBase[r1].Trim();

        // Second digit
        // Reseed the random number generator to avoid repeated values
        rnd = new Random(r1 * unchecked((int)DateTime.Now.Ticks) + i);
        int r2;
        if (r1 == 13)
        {
            r2 = rnd.Next(0, 7);
        }
        else
        {
            r2 = rnd.Next(0, 16);
        }
        string str_r2 = rBase[r2].Trim();

        // Third digit: A to F
        rnd = new Random(r2 * unchecked((int)DateTime.Now.Ticks) + i);
        int r3 = rnd.Next(10, 16);
        string str_r3 = rBase[r3].Trim();

        // Fourth digit (skips the invalid second bytes A0 and FF)
        rnd = new Random(r3 * unchecked((int)DateTime.Now.Ticks) + i);
        int r4;
        if (r3 == 10)
        {
            r4 = rnd.Next(1, 16);
        }
        else if (r3 == 15)
        {
            r4 = rnd.Next(0, 15);
        }
        else
        {
            r4 = rnd.Next(0, 16);
        }
        string str_r4 = rBase[r4].Trim();

        // Two byte variables hold the randomly generated character code
        byte byte1 = Convert.ToByte(str_r1 + str_r2, 16);
        byte byte2 = Convert.ToByte(str_r3 + str_r4, 16);

        // Store the two bytes in a byte array
        byte[] str_r = new byte[] { byte1, byte2 };

        // Put the byte array of the generated character into the object array
        bytes.SetValue(str_r, i);
    }

    return bytes;
}

Base64 encoding rules
• Base64 re-encodes data using 64 basic ASCII characters. The data to be encoded is treated as a byte array and processed in groups of 3 bytes. The 24 bits of each group are laid out in order and divided into four 6-bit units. Two zero bits are added in front of the highest bit of each unit to make up a full byte, so 3 bytes of data are re-encoded as 4 bytes. When the number of bytes to encode is not a multiple of 3, the last group is shorter than 3 bytes; it is padded with 1 or 2 zero bytes, and 1 or 2 "=" characters are placed at the end of the encoded output.

• Example: Base64-encoding "ABC"

First take the ASCII values of the characters: A (65), B (66), C (67).

Then take their binary values, A (01000001), B (01000010), C (01000011), and join the three bytes (010000010100001001000011). Split the result into four 6-bit blocks and prefix each with two zero bits to obtain the values (00010000) (00010100) (00001001) (00000011). Converted to decimal, these four bytes are (16) (20) (9) (3). Finally, look each value up in the Base64 table of 64 characters to get (Q) (U) (J) (D). Each value is simply an index into that table.

• Note: the Base64 table is:

ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
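The grouping rule can be followed step by step in code. The sketch below is a hand-written encoder for illustration only (Base64Demo.Encode is a hypothetical name); in practice Convert.ToBase64String does the same job.

```csharp
using System;
using System.Text;

public static class Base64Demo
{
    const string Table = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    public static string Encode(byte[] data)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < data.Length; i += 3)
        {
            int remaining = Math.Min(3, data.Length - i);

            // Pack up to three bytes into 24 bits; missing bytes count as zero.
            int bits = data[i] << 16;
            if (remaining > 1) bits |= data[i + 1] << 8;
            if (remaining > 2) bits |= data[i + 2];

            // Cut the 24 bits into four 6-bit indexes into the table.
            sb.Append(Table[(bits >> 18) & 0x3F]);
            sb.Append(Table[(bits >> 12) & 0x3F]);
            // Units that come only from padding are written as '='.
            sb.Append(remaining > 1 ? Table[(bits >> 6) & 0x3F] : '=');
            sb.Append(remaining > 2 ? Table[bits & 0x3F] : '=');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Encode(Encoding.ASCII.GetBytes("ABC"))); // QUJD
        Console.WriteLine(Encode(new byte[] { 0xD6, 0xD0 }));      // 1tA=
    }
}
```

The second call encodes the two gb2312 bytes D6 D0, a group that is one byte short of three, so the output ends in a single "=".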

 

[Sample code: using Base64 to store an image in XML]

// Convert the image into a string
// The uploaded image file is first read into a byte array
// (loFile is assumed to be a file-upload control such as HtmlInputFile)
byte[] fileByteArray = new byte[fileLength];
Stream so = loFile.PostedFile.InputStream;
so.Read(fileByteArray, 0, fileLength);

string img = Convert.ToBase64String(fileByteArray);

// Convert the string back into an image
byte[] imageBytes = Convert.FromBase64String(strData);
Response.ContentType = "image/bmp"; // MIME type of the output; BMP is assumed here to match the file saved below
Response.OutputStream.Write(imageBytes, 0, imageBytes.Length);
Response.End();

// The bytes can also be saved as an image file
FileStream fs = new FileStream(@"C:/1.bmp", FileMode.OpenOrCreate, FileAccess.Write);
fs.Write(imageBytes, 0, imageBytes.Length);
fs.Close();
