A few days ago I was handling URL parameters in ASP.NET and kept getting garbled text, which was frustrating. I found an explanation online and, after reading it, saved a copy here.
ASP.NET Encoding
Garbled text: the pain in our hearts!
• "Characters and encodings" is a frequently discussed topic, yet garbled text still plagues everyone. We know many ways to get rid of garbled text, but we do not necessarily understand the principles behind them. Some garbling is actually caused by bugs in the underlying code itself.
• So it is not only beginners who are hazy about character encoding; some low-level developers also lack an accurate understanding of it.
The development of characters and encodings
How strings are stored in memory
• In the ASCII stage, a single-byte character set (SBCS) string uses one byte per character. For example, "Bob123" is stored in memory as:
-42 6F 62 31 32 33 00
-B o b 1 2 3 \0
• In the ANSI stage, encodings support multiple languages and each character is represented by one or more bytes. For example, "中文123" takes seven bytes (plus the terminator) in memory on Chinese Windows 95:
-D6D0 CEC4 31 32 33 00
-中 文 1 2 3 \0
• In the Unicode stage, a computer stores a string as the sequence numbers (code points) of its characters in the Unicode character set. Currently each sequence number is generally stored in two bytes (16 bits), so characters stored this way are also called wide characters. For example, on Windows 2000 the string "中文123" is stored as 5 sequence numbers, 10 bytes in total (plus the terminator):
-2D4E 8765 3100 3200 3300 0000
-中 文 1 2 3 \0
[Sample Code]
string str = "中文123"; // stored in memory as Unicode (UTF-16)
Console.WriteLine(str.Length); // the length is 5
// Converting between strings and byte arrays
Encoding gb = Encoding.GetEncoding("gb2312");
byte[] bytes = gb.GetBytes(str); // encode the string into bytes using gb2312
char[] chars = gb.GetChars(bytes); // decode the bytes back into Unicode chars using gb2312
str = gb.GetString(bytes);
// There are two ways to write a string to a text file in a given encoding
// Method 1: use a StreamWriter with its default encoding (UTF-8)
StreamWriter sw = new StreamWriter("1.txt"); // 9 bytes: 3 per Chinese character, 1 per digit
sw.Write(chars);
sw.Close();
// Method 2: construct the writer with the desired encoding and write the string
StreamWriter sw2 = new StreamWriter("2.txt", false, gb); // 7 bytes
sw2.Write(str);
sw2.Close();
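To see the memory layouts described above for yourself, here is a minimal sketch that dumps the bytes of "中文123" under gb2312 and under UTF-16. The class and method names (MemoryLayoutDemo, HexDump, Gb2312) are made up for this illustration, and the RegisterProvider call is an assumption about the runtime: it is needed on .NET Core / .NET 5+ and is unnecessary on the classic .NET Framework.

```csharp
using System;
using System.Text;

public static class MemoryLayoutDemo
{
    // Returns the encoded bytes of `text` as space-separated upper-case hex.
    public static string HexDump(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    // Assumption: on .NET Core / .NET 5+ the legacy code pages must be
    // registered before "gb2312" can be resolved.
    public static Encoding Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    public static void Main()
    {
        // "中文123", written with escapes so the source file encoding does not matter.
        const string text = "\u4E2D\u6587123";
        Console.WriteLine(HexDump(text, Gb2312()));         // D6 D0 CE C4 31 32 33
        Console.WriteLine(HexDump(text, Encoding.Unicode)); // 2D 4E 87 65 31 00 32 00 33 00
    }
}
```

Neither dump includes the trailing zero terminator, because .NET strings carry their length instead of ending in \0.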
Unicode Character Set
• Origin of the name
-The Unicode character set is short for Universal Multiple-Octet Coded Character Set (UCS). It assigns a single, unique binary code to every character in every language, to meet the needs of cross-language, cross-platform text conversion and processing.
• UTF-8 encoding
-UTF-8 is one way of using Unicode. UTF stands for Unicode Transformation Format, that is, a way of converting Unicode into some concrete format. UTF-8 stores Unicode characters in a variable number of bytes: ASCII letters still take 1 byte; accented letters, Greek letters, Spanish letters such as ñ, and similar take 2 bytes; commonly used Chinese characters take 3 bytes; supplementary-plane characters take 4 bytes.
• UTF-32, UTF-16 and UTF-8
-These are the encoding schemes of the Unicode coded character set. UTF-16 encodes each Unicode code point as a sequence of one or two unsigned 16-bit code units; UTF-32 represents each code point as a 32-bit integer of the same value.
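The byte counts quoted above are easy to check. The following sketch (the UtfSizes class and its ByteCount helper are names invented for this example) prints how many bytes one character from each category needs in UTF-8, UTF-16 and UTF-32.

```csharp
using System;
using System.Text;

public static class UtfSizes
{
    // Number of bytes needed to encode `text` in the given encoding (no BOM).
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        string[] samples =
        {
            "A",          // ASCII letter, U+0041
            "\u00E9",     // é, an accented Latin letter
            "\u4E2D",     // 中, a common Chinese character
            "\U00020000"  // U+20000, a supplementary-plane character
        };

        foreach (string s in samples)
        {
            Console.WriteLine("U+{0:X4}: UTF-8={1} UTF-16={2} UTF-32={3}",
                char.ConvertToUtf32(s, 0),
                ByteCount(s, Encoding.UTF8),
                ByteCount(s, Encoding.Unicode),
                ByteCount(s, Encoding.UTF32));
        }
    }
}
```

The last sample is the interesting one: it is a single code point, but in UTF-16 it occupies two 16-bit code units (a surrogate pair), so its string Length in .NET is 2.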
Where garbled text commonly occurs
-Non-Unicode programs moved between different language environments
• Strings in a non-Unicode program are stored in some ANSI encoding. If the program runs in a language environment different from the one it was developed in, its ANSI strings are displayed incorrectly.
• Sometimes we have no choice but to run, say, non-Unicode Japanese software on a Chinese operating system. In that case tools such as NJStar (南极星) or AppLocale can temporarily simulate a different language environment.
-Strings submitted from web pages
• When a form on a page submits a string, the string is first converted into bytes using the encoding of the current page. Each byte is then written in "%XX" format and sent to the web server. For example, when a gb2312-encoded page submits "中", the content sent to the server is "%D6%D0".
• On the server side, the web server converts the received "%D6%D0" into the two bytes [0xD6, 0xD0] and then, following the gb2312 rules, obtains the character "中".
• If the server does not know the page encoding and falls back to a single-byte default (for example ISO-8859-1), "%D6%D0" becomes the two Unicode characters [0x00D6, 0x00D0] instead of the single character "中", and the text is garbled.
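The form-submission round trip can be sketched in a few lines. This is only an illustration under stated assumptions: FormSubmitDemo, PercentEncode and PercentDecode are hypothetical helpers written for this example (not ASP.NET APIs), they escape every byte rather than following the full URL-encoding rules, and ISO-8859-1 stands in for the server's wrong single-byte default. The RegisterProvider call is needed on .NET Core / .NET 5+ only.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class FormSubmitDemo
{
    // Encodes every byte as "%XX". Real browsers leave unreserved ASCII
    // characters unescaped; this sketch escapes everything.
    public static string PercentEncode(string text, Encoding pageEncoding)
    {
        return string.Concat(pageEncoding.GetBytes(text).Select(b => "%" + b.ToString("X2")));
    }

    // Turns "%D6%D0" back into bytes and decodes them with the given encoding.
    // Only handles input made up entirely of "%XX" triples.
    public static string PercentDecode(string encoded, Encoding serverEncoding)
    {
        byte[] bytes = encoded
            .Split(new[] { '%' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(hex => byte.Parse(hex, NumberStyles.HexNumber))
            .ToArray();
        return serverEncoding.GetString(bytes);
    }

    public static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb = Encoding.GetEncoding("gb2312");
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");

        string submitted = PercentEncode("\u4E2D", gb); // the character 中
        Console.WriteLine(submitted);                                          // %D6%D0
        Console.WriteLine(PercentDecode(submitted, gb) == "\u4E2D");           // True
        Console.WriteLine(PercentDecode(submitted, latin1) == "\u00D6\u00D0"); // True
    }
}
```

The last line shows the garbling: the same two bytes come back as two Latin-1 characters when the server guesses the wrong encoding.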
-Strings read from a database
• When a database client (such as ODBC or JDBC) reads strings from the database server, the client needs to learn from the server which ANSI encoding is in use. When the server sends a byte stream to the client, the client is responsible for converting it into a Unicode string using the correct encoding.
-Strings in email
• When a piece of text or HTML is sent by email, the content is first converted into a byte string using a specified character encoding, and that byte string is then converted into another byte string using a specified Content-Transfer-Encoding.
• The most common Content-Transfer-Encoding methods are Base64 and quoted-printable.
Encoding of the email subject
• The mail subject uses a shorter format to indicate both the character encoding and the transfer encoding. For example, if the subject is "中", it is written as:
• // Correct subject format
Subject: =?GB2312?B?1tA=?=
-The part between the first "=?" and the next "?" specifies the character encoding, gb2312 in this example.
-The "B" between the following two "?" stands for Base64; a "Q" there would mean quoted-printable.
-The part between the last "?" and "?=" is the subject content after conversion to a byte string with gb2312 and then Base64 encoding.
• If the transfer encoding is changed to quoted-printable, the same subject "中" becomes:
-Subject: =?GB2312?Q?=D6=D0?=
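As a rough sketch of how such a subject line is put together, the code below builds both the "B" and the "Q" form for the single character 中. MailHeaderDemo, EncodeB and EncodeQ are names made up for this example, and the "Q" helper is deliberately simplified: it writes every byte as "=XX", whereas a real mail library leaves most printable ASCII characters as they are. RegisterProvider is again only needed on .NET Core / .NET 5+.

```csharp
using System;
using System.Linq;
using System.Text;

public static class MailHeaderDemo
{
    // "B" form: convert to bytes with the charset, then Base64.
    public static string EncodeB(string text, Encoding charset, string charsetName)
    {
        return "=?" + charsetName + "?B?" + Convert.ToBase64String(charset.GetBytes(text)) + "?=";
    }

    // "Q" form: this sketch writes every byte as "=XX".
    public static string EncodeQ(string text, Encoding charset, string charsetName)
    {
        string body = string.Concat(charset.GetBytes(text).Select(b => "=" + b.ToString("X2")));
        return "=?" + charsetName + "?Q?" + body + "?=";
    }

    public static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb = Encoding.GetEncoding("gb2312");
        Console.WriteLine("Subject: " + EncodeB("\u4E2D", gb, "GB2312")); // Subject: =?GB2312?B?1tA=?=
        Console.WriteLine("Subject: " + EncodeQ("\u4E2D", gb, "GB2312")); // Subject: =?GB2312?Q?=D6=D0?=
    }
}
```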
[Sample Code]
// How garbled text is produced
Encoding client = Encoding.GetEncoding("gb2312");
byte[] bytes = client.GetBytes(str); // the client encodes the string into bytes using gb2312
// The server decodes the bytes correctly, using gb2312
Encoding serverOk = Encoding.GetEncoding("gb2312");
string server = serverOk.GetString(bytes);
Console.WriteLine("Correct conversion: " + server);
// The server decodes the bytes incorrectly, as Unicode (UTF-16)
server = Encoding.Unicode.GetString(bytes);
Console.WriteLine("Wrong conversion: " + server); // garbled
The Encoding class in .NET
System.Text
• The System.Text namespace contains classes that represent the ASCII, Unicode, UTF-7, and UTF-8 character encodings, as well as abstract base classes for converting blocks of characters to blocks of bytes and back.
[Sample Code]
// Transcoding between encodings
string unicodeStr = "this is a room";
// Create the two encodings
Encoding ascii = Encoding.ASCII;
Encoding unicode = Encoding.Unicode;
// Convert the bytes from one encoding to the other
byte[] unicodeBytes = unicode.GetBytes(unicodeStr);
byte[] asciiBytes = Encoding.Convert(unicode, ascii, unicodeBytes);
// Convert the bytes back into a string
char[] asciiChars = new char[ascii.GetCharCount(asciiBytes, 0, asciiBytes.Length)];
ascii.GetChars(asciiBytes, 0, asciiBytes.Length, asciiChars, 0);
string asciiStr = new string(asciiChars);
Console.WriteLine("Original: {0}", unicodeStr);
Console.WriteLine("ASCII converted string: {0}", asciiStr);
Encoding in ASP.NET
• The globalization element in Web.config:
-<globalization requestEncoding="utf-8" responseEncoding="utf-8" />
-requestEncoding is the encoding assumed for incoming requests, and responseEncoding is the encoding used for responses.
The following two properties affect only the current page, whereas the setting above applies to the whole application:
• Response.ContentEncoding: gets or sets the HTTP character set of the output stream.
• HttpRequest.ContentEncoding: the Encoding object representing the character set of the client request.
For example, if different pages use different character sets while passing data between them, garbled text may result.
Chinese character encoding
• In 1980, to give every Chinese character a unified national code, China issued its first national standard for Chinese character encoding: GB2312-80, "Chinese Coded Character Set for Information Interchange: Basic Set", known as gb2312. This character set is the unified standard for Chinese information processing systems in China. The later national standard GB18030-2000, "Chinese Coded Character Set for Information Interchange: Extension of the Basic Set", is known as gb18030.
• On Simplified Chinese Windows, the default code page in .NET programming is the Simplified Chinese one (code page 936, which .NET names gb2312; gb18030 is a separate code page).
• A Chinese character can be expressed as a location code (区位码):
-For example, the hexadecimal gb2312 code of "好" is BA C3. The first byte gives the zone and the second gives the position within it, each offset by 0xA0: 0xBA - 0xA0 = 26, so the character is in zone 26, and 0xC3 - 0xA0 = 35, so it is the 35th character in that zone. Its location code is therefore 2635. This is how gb2312 locates Chinese characters.
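The zone and position arithmetic can be written out as a small helper. This is a sketch only: LocationCodeDemo and LocationCode are hypothetical names, the helper assumes its input is a two-byte gb2312 character, and the RegisterProvider call is needed on .NET Core / .NET 5+ but not on the classic .NET Framework.

```csharp
using System;
using System.Text;

public static class LocationCodeDemo
{
    // Returns the four-digit location code of a single gb2312 Chinese
    // character: each byte minus 0xA0, written as two decimal digits.
    public static string LocationCode(char ch, Encoding gb2312)
    {
        byte[] bytes = gb2312.GetBytes(new[] { ch });
        if (bytes.Length != 2)
            throw new ArgumentException("Not a two-byte gb2312 character.", nameof(ch));

        int zone = bytes[0] - 0xA0;     // 0xBA - 0xA0 = 26 for the sample below
        int position = bytes[1] - 0xA0; // 0xC3 - 0xA0 = 35 for the sample below
        return zone.ToString("D2") + position.ToString("D2");
    }

    public static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb = Encoding.GetEncoding("gb2312");
        Console.WriteLine(LocationCode('\u597D', gb)); // the character 好 -> 2635
    }
}
```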
[Sample code: generating a Chinese verification code]
// Get the gb2312 encoding
Encoding gb = Encoding.GetEncoding("gb2312");
// Call the function to generate four random Chinese character codes
object[] bytes = CreateRegionCode(4);
// Decode each two-byte array into a Chinese character
string str1 = gb.GetString((byte[])Convert.ChangeType(bytes[0], typeof(byte[])));
string str2 = gb.GetString((byte[])Convert.ChangeType(bytes[1], typeof(byte[])));
string str3 = gb.GetString((byte[])Convert.ChangeType(bytes[2], typeof(byte[])));
string str4 = gb.GetString((byte[])Convert.ChangeType(bytes[3], typeof(byte[])));
string strKey = str1 + str2 + str3 + str4;
Session["regcode"] = strKey;
byte[] data = gb.GetBytes(strKey);
Response.ContentEncoding = gb;
Response.OutputStream.Write(data, 0, data.Length);
public static object[] CreateRegionCode(int length)
{
    // The hex digits from which the Chinese character codes are built
    string[] rBase = new string[16] {"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F"};
    Random rnd = new Random();
    object[] bytes = new object[length];
    // Each loop iteration generates one two-element byte array and puts it into the object array.
    // The code of each Chinese character consists of four hex digits:
    // the first and second digits form the first element of the byte array,
    // the third and fourth digits form the second element.
    for (int i = 0; i < length; i++)
    {
        // First digit of the code (B, C or D)
        int r1 = rnd.Next(11, 14);
        string str_r1 = rBase[r1].Trim();
        // Second digit of the code
        // Re-seed the random number generator to avoid repeated values
        rnd = new Random(r1 * unchecked((int)DateTime.Now.Ticks) + i);
        int r2;
        if (r1 == 13)
        {
            r2 = rnd.Next(0, 7);
        }
        else
        {
            r2 = rnd.Next(0, 16);
        }
        string str_r2 = rBase[r2].Trim();
        // Third digit of the code (A to F)
        rnd = new Random(r2 * unchecked((int)DateTime.Now.Ticks) + i);
        int r3 = rnd.Next(10, 16);
        string str_r3 = rBase[r3].Trim();
        // Fourth digit of the code
        rnd = new Random(r3 * unchecked((int)DateTime.Now.Ticks) + i);
        int r4;
        if (r3 == 10)
        {
            r4 = rnd.Next(1, 16); // avoid A0
        }
        else if (r3 == 15)
        {
            r4 = rnd.Next(0, 15); // avoid FF
        }
        else
        {
            r4 = rnd.Next(0, 16);
        }
        string str_r4 = rBase[r4].Trim();
        // Combine the four hex digits into the two bytes of the random Chinese character
        byte byte1 = Convert.ToByte(str_r1 + str_r2, 16);
        byte byte2 = Convert.ToByte(str_r3 + str_r4, 16);
        // Store the two bytes in a byte array
        byte[] str_r = new byte[] { byte1, byte2 };
        // Put the byte array of the generated character into the object array
        bytes.SetValue(str_r, i);
    }
    return bytes;
}
Base64 encoding rules
• Base64 re-encodes data using 64 basic ASCII characters. The data is treated as a byte array and processed in groups of 3 bytes. The 24 bits of each group are laid out in order and split into four 6-bit pieces. Two zero bits are added in front of the highest bit of each piece to make a full byte, so every 3 bytes of data are re-encoded as 4 bytes. When the length of the data is not a multiple of 3, the last group is shorter than three bytes; it is padded with 1 or 2 zero bytes, and 1 or 2 "=" characters are appended to the final output.
• Example: Base64 encoding of "ABC"
First take the ASCII values of A, B and C: A (65), B (66), C (67).
Then take their binary values, A (01000001), B (01000010), C (01000011), and concatenate the three bytes (010000010100001001000011). Split this into four 6-bit blocks and pad each with two leading zeros to form the values (00010000) (00010100) (00001001) (00000011). In decimal these four bytes are (16) (20) (9) (3). Finally, look up the corresponding characters in the Base64 table of 64 characters: (Q) (U) (J) (D). The decimal values are simply indexes into that table.
• The Base64 table:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
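The grouping rule above translates almost line for line into code. The following is a teaching sketch, not a replacement for the built-in Convert.ToBase64String, which it is compared against; Base64Demo and Encode are names chosen for this example.

```csharp
using System;
using System.Text;

public static class Base64Demo
{
    private const string Table =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    public static string Encode(byte[] data)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < data.Length; i += 3)
        {
            int remaining = Math.Min(3, data.Length - i);

            // Pack up to three bytes into 24 bits; missing bytes stay zero.
            int group = data[i] << 16;
            if (remaining > 1) group |= data[i + 1] << 8;
            if (remaining > 2) group |= data[i + 2];

            // Split the 24 bits into four 6-bit indexes into the table,
            // writing "=" where the input had no byte to contribute.
            sb.Append(Table[(group >> 18) & 0x3F]);
            sb.Append(Table[(group >> 12) & 0x3F]);
            sb.Append(remaining > 1 ? Table[(group >> 6) & 0x3F] : '=');
            sb.Append(remaining > 2 ? Table[group & 0x3F] : '=');
        }
        return sb.ToString();
    }

    public static void Main()
    {
        byte[] abc = Encoding.ASCII.GetBytes("ABC");
        Console.WriteLine(Encode(abc));                          // QUJD
        Console.WriteLine(Convert.ToBase64String(abc));          // QUJD
        Console.WriteLine(Encode(Encoding.ASCII.GetBytes("A"))); // QQ==
    }
}
```

The third line of output shows the padding case: one input byte yields two table characters followed by two "=" signs.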
[Sample code: using Base64 encoding to store an image in XML]
// Convert the image into a string
// The image file is held temporarily in a byte array
byte[] fileByteArray = new byte[fileLength];
Stream so = loFile.PostedFile.InputStream;
so.Read(fileByteArray, 0, fileLength);
string img = Convert.ToBase64String(fileByteArray);
// Convert the string back into an image
Response.ContentType = "image/bmp"; // output MIME type; assumes a BMP image, matching the file saved below
Response.OutputStream.Write(Convert.FromBase64String(strData), 0, nSize);
Response.End();
// It can also be saved as an image file
FileStream fs = new FileStream(@"C:/1.bmp", FileMode.OpenOrCreate, FileAccess.Write);
fs.Write(Convert.FromBase64String(strData), 0, nSize);
fs.Close();