// A programming question about Chinese character encoding
/* I came across this problem online a while ago. I modified the program and wrote a brief analysis of its output; the main benefit was learning something about GB2312 and about encoding problems in general. Corrections are welcome. */
/* Programming: compile a function to intercept strings. The input is a string and number of segments,
The output is a byte string. However, make sure that no half of Chinese characters are intercepted,
For example, "my ABC" 4 should be cut into "My AB", input "my ABC Han Def", 6, the output should be "I abc" instead of "I ABC + Han half ".
*/
Class splitstring
{
Private string STR;
Private int bytenum;
Public splitstring (){}
Public splitstring (string STR, int bytenum)
{
This. Str = STR;
This. bytenum = bytenum;
}
Public void splitit ()
{
Byte BT [] = Str. getbytes ();
System. Out. println ("length of this string =>" + bt. Length );
If (bytenum> 1)
{
If (BT [bytenum] <0)
{
Pbinint ("BT [bytenum]", BT [bytenum]);
System. Out. println ("BT [" + bytenum + "] =" + bt [bytenum]); // 1
System. Out. println ("BT [" + bytenum + "] =" + (INT) BT [bytenum]); // 2
System. Out. println ("BT [" + bytenum + "] =" + (BT [bytenum] & 0x000000ff); // 3
System. Out. println ("BT [" + bytenum + "] =" + (BT [bytenum + 1] & 0xff); // 4
String substrx = new string (BT, 0, -- bytenum );
System. Out. println (substrx );
}
Else
{
String substrex = new string (BT, 0, bytenum );
System. Out. println (substrex );
}
}
Else
{
If (bytenum = 1)
{
If (BT [bytenum] <0)
{
String substr1 = new string (BT, 0, ++ bytenum );
System. Out. println (substr1 );
}
Else
{
String substr2 = new string (BT, 0, bytenum );
System. Out. println (substr2 );
}
}
Else
{
System. Out. println ("input error !!! Enter an integer greater than zero :");
}
}
}
Static void pbinint (string S, int I ){
System. Out. println (
S + ", INT:" + I + ", binary :");
System. Out. Print ("");
For (Int J = 31; j> = 0; j --)
If (1 <j) & I )! = 0)
System. Out. Print ("1 ");
Else
System. Out. Print ("0 ");
System. Out. println ();
}
}
Class testsplitstring
{
Public static void main (string ARGs [])
{
String STR = "My abce Ah Defe ";
Int num = 6;
Splitstring sptstr = new splitstring (STR, num );
Sptstr. splitit ();
}
}
/* General idea: each Chinese character occupies two bytes, and the highest bit of each byte is 1, so each byte is a negative number in Java.
Of course, this is not fully rigorous; the reasons are not discussed in detail here.
*/
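The negative-byte rule depends on the byte layout of GB2312. A more conservative variant works on characters instead of bytes: encode one character at a time and stop before the byte budget is exceeded. The following is only a minimal sketch of that idea; the class `SafeTruncate` and the method `truncate` are hypothetical names that are not part of the original program. The literals use Unicode escapes (`\u6211` is "我", `\u6c49` is "汉") so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;

class SafeTruncate {
    // Hypothetical helper: keeps whole characters while their encoded size fits in maxBytes.
    static String truncate(String s, int maxBytes, Charset cs) {
        StringBuilder out = new StringBuilder();
        int used = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            int size = ch.getBytes(cs).length;   // bytes this character needs in cs
            if (used + size > maxBytes) {
                break;                           // the next character would be cut
            }
            out.append(ch);
            used += size;
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: the running JDK provides the GB2312 charset.
        Charset gb2312 = Charset.forName("GB2312");
        System.out.println(truncate("\u6211ABC", 4, gb2312));
        System.out.println(truncate("\u6211ABC\u6c49DEF", 6, gb2312));
    }
}
```

Because the byte length is measured per character, the same method should also work for other multi-byte charsets, at the cost of encoding every character separately.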
/*
Result analysis:
"啊" has area code 16 and position code 01 (area-position code 1601). The values are obtained as follows:
The output at point 3 is 176, which is 0x000000b0 as a Java int.
Only the last byte is meaningful: 10110000 gives the area code 176 - 128 - 32 = 16.
128 is subtracted because the highest bit of the byte is always 1 and only the low seven bits carry information (128 is 2 to the 7th power).
32 is subtracted because GB2312 specifies that offset.
Similarly, the output at point 4 is 161, which gives the position code: 161 - 128 - 32 = 1.
The values at points 1 and 2 are -80 (the byte is converted to an int).
This is because the byte 10110000 is sign-extended to 0xffffffb0, the two's complement form of a negative number; the corresponding sign-magnitude form is 0x80000050, which is -80.
It is worth mentioning that the integer representing "啊" is (16 + 128 + 32) * 256 + (1 + 128 + 32) = 45217, and the CHAR function in Excel can use it:
for example, CHAR((16 + 128 + 32) * 256 + (1 + 128 + 32)) yields "啊".
*/
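The arithmetic in the analysis can be checked with a few lines of Java. This is a sketch rather than part of the original program: `AreaPositionDemo` and `areaPosition` are hypothetical names, the helper assumes its argument is a single two-byte GB2312 character, and `\u554a` is the Unicode escape for "啊".

```java
import java.nio.charset.Charset;

class AreaPositionDemo {
    // Hypothetical helper: returns {area, position} for one two-byte GB2312 character.
    static int[] areaPosition(String ch) {
        byte[] b = ch.getBytes(Charset.forName("GB2312"));
        int high = b[0] & 0xff;   // masking undoes sign extension, e.g. -80 becomes 176
        int low = b[1] & 0xff;
        return new int[] { high - 128 - 32, low - 128 - 32 };
    }

    public static void main(String[] args) {
        String ah = "\u554a";
        byte[] b = ah.getBytes(Charset.forName("GB2312"));
        System.out.println(b[0]);                        // signed byte: -80
        System.out.println(b[0] & 0xff);                 // unsigned value: 176
        System.out.println(Integer.toHexString(b[0]));   // sign-extended int: ffffffb0
        int[] ap = areaPosition(ah);
        System.out.println(ap[0] + " " + ap[1]);         // 16 1
    }
}
```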
/*
References:
Chinese character encoding and representation
1) Chinese character exchange code (national standard code)
The Chinese character exchange code (national standard code) is mainly used for exchanging Chinese character information.
National standard code: the exchange code specified in the basic set of the "Chinese character encoding character set for Information Exchange" (GB 2312-80), promulgated by the National Bureau of Standards in 1980, serves as the national standard Chinese character encoding. GB 2312-80 contains 7445 characters in total: 6763 Chinese characters, of which 3755 are level-1 characters (sorted by pinyin) and 3008 are level-2 characters (sorted by radical and stroke count), plus 682 non-Chinese characters.
According to GB 2312-80, all Chinese characters and symbols of the national standard code form a 94 × 94 matrix. Each row of the matrix is called an "area" and each column a "position", so the matrix consists of 94 areas (numbered 01 to 94), each with 94 positions (numbered 01 to 94). Combining a character's area code and position code gives its "area-position code": the upper two digits are the area code and the lower two digits are the position code. An area-position code therefore identifies exactly one Chinese character or symbol, and conversely every Chinese character or symbol has exactly one area-position code, with no duplicates.
The areas are allocated as follows:
  Area(s)   Contents
  1         Symbols not available on the keyboard
  2         Serial numbers
  3         Full-width forms of the keyboard characters
  4-5       Japanese letters
  6         Greek letters
  7         Russian letters
  8         Vowels with pinyin tone marks, and phonetic (zhuyin) letters
  9         Table-drawing symbols
  10-15     Not used
  16-55     Level-1 Chinese characters (in pinyin order)
  56-87     Level-2 Chinese characters (in radical and stroke order)
  88-94     User-defined Chinese characters
The 94 areas can thus be divided into four groups:
① Areas 1-15: the graphic symbol areas. Areas 1-9 are the standard symbol areas, and areas 10-15 are the user-defined symbol areas.
② Areas 16-55: the level-1 Chinese character areas, containing 3755 Chinese characters. Characters in these areas are sorted by pinyin, and homophones are ordered by stroke.
③ Areas 56-87: the level-2 Chinese character areas, containing 3008 Chinese characters. Characters in these areas are sorted by radical and stroke count.
④ Areas 88-94: the user-defined Chinese character areas.
The national standard code specifies that each Chinese character (and each non-Chinese graphic character) is represented by a two-byte code. The highest bit of each byte is 0 and only the low 7 bits are used. Of the 128 values of those 7 bits, 34 are reserved for control purposes, so each byte offers only 2^7 - 34 = 94 code values, and two bytes offer 94^2 = 8836 codes. Of the two bytes that represent a character, the high byte corresponds to the row of the code table (the area number) and the low byte corresponds to the column (the position number).
The range of the national standard code, expressed in binary, is:
  00100001 00100001  to  01111110 01111110
  (1 + 32) (1 + 32)  to  (94 + 32) (94 + 32)  in decimal
7-bit ASCII is a 128-character set. The code values 0-31 (00000000-00011111) do not correspond to printable characters; they are usually called control characters and are used for communication control or for controlling functions of computer devices. The code value 32 (00100000) is the space character SP, and the code value 127 (1111111) is the delete character DEL.
00100001, that is (33)10, is chosen as the first value of each national standard code byte in order to skip the 32 control characters and the space character of ASCII. The high and low bytes of the national standard code are therefore greater than the corresponding area and position codes by (32)10 = (00100000)2 = (20)H, where H denotes hexadecimal:
  national standard code high byte = area code + 20H
  national standard code low byte = position code + 20H
2) Chinese character machine internal code (Chinese character storage code)
The machine internal code unifies how Chinese characters entered with different input codes are represented inside the computer. Whatever input code is used, it is converted on entry into the machine internal code, which is the form in which Chinese characters are stored and processed. Computers must handle both Chinese characters and English letters, so they must be able to tell the two apart. The internal code of an English character is its 7-bit ASCII code stored in a byte whose highest bit is 0. To avoid conflicts with 7-bit ASCII, the highest bit of each byte of the national standard code is changed from 0 to 1, with the remaining bits unchanged, and the result is used as the internal code of the Chinese character.
The range of the machine internal code, expressed in binary, is 10100001 10100001 to 11111110 11111110.
Each byte of the internal code is greater than the corresponding byte of the national standard code by (128)10 = (10000000)2 = (80)H:
  internal code high byte = national standard code high byte + 80H
  internal code low byte = national standard code low byte + 80H
Because
  national standard code high byte = area code + 20H
  national standard code low byte = position code + 20H
it follows that
  internal code high byte = area code + A0H
  internal code low byte = position code + A0H
That is, each byte of the internal code is greater than the corresponding area or position code by (160)10 = (10100000)2 = (A0)H.
For example, the area-position code of "啊" is 1601: the area code is (16)10 = (10)H and the position code is (01)10 = (01)H, so
  internal code high byte = 10H + A0H = B0H
  internal code low byte = 01H + A0H = A1H
and the internal code is B0A1H.
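The two offsets (20H, then 80H) can be sketched in Java as follows. The class `InternalCodeDemo` and its methods are hypothetical names introduced for illustration only, and the decoding step assumes that the JDK's `GB2312` charset is available. `\u554a` is the Unicode escape for "啊".

```java
import java.nio.charset.Charset;

class InternalCodeDemo {
    // Hypothetical helper: area and position (each 1..94) -> two-byte internal code.
    static int internalCode(int area, int position) {
        int gbHigh = area + 0x20;        // national standard code, high byte
        int gbLow = position + 0x20;     // national standard code, low byte
        int high = gbHigh + 0x80;        // set the highest bit: area + 0xA0
        int low = gbLow + 0x80;          // set the highest bit: position + 0xA0
        return (high << 8) | low;
    }

    // Hypothetical helper: decodes a two-byte internal code with the GB2312 charset.
    static String decode(int code) {
        byte[] b = { (byte) (code >> 8), (byte) code };
        return new String(b, Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        int code = internalCode(16, 1);
        System.out.println(Integer.toHexString(code));       // b0a1
        System.out.println(decode(code).equals("\u554a"));   // true
    }
}
```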
3) Chinese character input code (external code)
The Chinese character input code (external code) is a code designed for entering Chinese characters into a computer with the keys of a keyboard. When English letters are entered, the key pressed is the character itself, and the input code is the same as the internal code. Entering one Chinese character may take several keystrokes. There are hundreds of Chinese character input schemes, but however different these external codes are, they are all converted into the same internal code once entered. The input schemes fall roughly into four types:
(1) Phonetic codes: such as full pinyin, double pinyin, and Microsoft Pinyin
(2) Shape codes: such as Wubi, Zheng code, and table-shape code
(3) Phonetic-shape codes: such as Smart ABC and natural code
(4) Numeric codes: such as the area-position code and the telegraph code
4) Chinese character font code (output code)
The Chinese character font code (output code) is the digitized shape information of a character, used to display and print it. The internal code is only a numeric code that stands for a character; to show the character on an output device, its glyph must be output. Chinese character systems generally represent glyphs as dot matrices. A 16 × 16 dot-matrix glyph takes 32 bytes (16 * 16 / 8 = 32), and a 24 × 24 dot-matrix glyph takes 72 bytes (24 * 24 / 8 = 72).
In general, the larger the dot matrix, the better the displayed character looks, and the more storage each glyph requires.
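The storage figures follow from one bit per dot. A small sketch, with the hypothetical class `GlyphStorage`, reproduces them and also estimates the raw size of a 16 × 16 font holding the 6763 Chinese characters of GB 2312-80:

```java
class GlyphStorage {
    // Bytes needed for one size x size monochrome glyph, one bit per dot.
    static int bytesPerGlyph(int size) {
        return size * size / 8;
    }

    public static void main(String[] args) {
        System.out.println(bytesPerGlyph(16));   // 32
        System.out.println(bytesPerGlyph(24));   // 72
        // Raw size in bytes of 6763 glyphs at 16 x 16 dots.
        System.out.println(6763 * bytesPerGlyph(16));
    }
}
```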
5) Chinese character address code
The Chinese character address code is the logical address at which a character's glyph is stored in the Chinese character font library (mainly dot-matrix font libraries). In such a library the glyph data is stored contiguously on the storage medium in a fixed order, in most cases the order of the characters in the national standard exchange code. Most address codes are therefore sequential and have a simple correspondence with the internal code, which keeps the conversion from internal code to address code simple.
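As an illustration of that simple correspondence, the sketch below computes a byte offset into a font library. The layout is an assumption made for this example, not something stated above: 16 × 16 glyphs of 32 bytes each, stored area by area starting at area 1, position 1, with all 94 positions of every area present. The class `GlyphAddress` and its methods are hypothetical.

```java
class GlyphAddress {
    static final int BYTES_PER_GLYPH = 32;   // 16 x 16 dots, one bit each

    // Assumed layout: glyphs stored area by area, 94 per area, from area 1, position 1.
    static long offset(int area, int position) {
        return ((area - 1) * 94L + (position - 1)) * BYTES_PER_GLYPH;
    }

    // Same lookup starting from the two-byte internal code (area + A0H, position + A0H).
    static long offsetFromInternalCode(int code) {
        int area = ((code >> 8) & 0xff) - 0xA0;
        int position = (code & 0xff) - 0xA0;
        return offset(area, position);
    }

    public static void main(String[] args) {
        System.out.println(offset(16, 1));                    // 45120
        System.out.println(offsetFromInternalCode(0xB0A1));   // 45120
    }
}
```

A real font library may skip unused areas or use a different glyph size, in which case the constants change but the address is still a linear function of the area and position codes.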
*/