String, character, byte, and bit: summary and questions

Source: Internet
Author: User
Tags coding standards

A string is a sequence of characters. Each character is represented by one or more bytes, and each byte consists of eight bits.

In C#, strings are declared with string, characters with char, and bytes with byte. C# has no separate bit type; individual bits are read with bitwise operators or shown as binary text, as the test code does. For a detailed analysis, see the following test code:

Complete test code:

using System;
using System.IO;
using System.Text;

namespace CSharpRumenJD
{
    class Program
    {
        static void Main(string[] args)
        {
            // On .NET Core / .NET 5+, GB2312 is available only after calling
            // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
            Encoding gb2312 = Encoding.GetEncoding("gb2312");

            string unicodestr = "啊?/123";
            Console.WriteLine("string: " + unicodestr);
            Console.WriteLine("Length: " + unicodestr.Length);
            Console.WriteLine("Unicode byte length: " + Encoding.Unicode.GetByteCount(unicodestr));
            var unicodebytes = Encoding.Unicode.GetBytes(unicodestr);
            Console.WriteLine("gb2312 byte length: " + gb2312.GetByteCount(unicodestr));
            var gb2312bytes = gb2312.GetBytes(unicodestr);

            #region garbled test
            // Decode the GB2312 bytes with the wrong encoding, then the right one.
            var gb2312tounicodestr = Encoding.Unicode.GetString(gb2312bytes);
            Console.WriteLine("gb2312bytes decoded as Unicode: " + gb2312tounicodestr);
            var gb2312str = gb2312.GetString(gb2312bytes);
            Console.WriteLine("gb2312bytes decoded as gb2312: " + gb2312str);
            #endregion

            #region print binary data
            // Print each byte as "decimal:binary", separated by '|'.
            StringBuilder sb = new StringBuilder(gb2312bytes.Length * 8);
            for (int i = 0; i < gb2312bytes.Length; i++)
            {
                sb.Append(gb2312bytes[i] + ":" + Convert.ToString(gb2312bytes[i], 2).PadLeft(8, '0') + "|");
            }
            Console.WriteLine(sb.ToString().TrimEnd('|'));
            #endregion

            // Write the same string to two files with different encodings.
            StreamWriter sw = new StreamWriter("1.txt", false, Encoding.Unicode);
            sw.Write(unicodestr);
            sw.Close();
            StreamWriter sw1 = new StreamWriter("2.txt", false, gb2312);
            sw1.Write(unicodestr);
            sw1.Close();
            Console.ReadKey();
        }
    }
}

Test results:

 

The test results show that, for the same string, Unicode encoding yields 12 bytes and GB2312 encoding yields 7 bytes.

In addition, garbled characters appear when the GB2312-encoded byte array is decoded as Unicode, whereas decoding the same byte array as GB2312 works without problems.
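The garbling can be reproduced in isolation. The following is a minimal sketch, assuming the test string is "啊?/123" (啊 followed by five ASCII characters, which matches the byte counts reported above). The class and method names are illustrative, and the RegisterProvider call is needed only on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Returns the GB2312 encoding. On .NET Core / .NET 5+ the code-page
    // encodings must be registered first; on .NET Framework they are built in
    // and the RegisterProvider line can be dropped.
    public static Encoding Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    // Encodes text with one encoding and decodes the resulting bytes with another.
    public static string Transcode(string text, Encoding encodeWith, Encoding decodeWith)
    {
        byte[] bytes = encodeWith.GetBytes(text);
        return decodeWith.GetString(bytes);
    }

    public static void Main()
    {
        string original = "啊?/123";
        Encoding gb2312 = Gb2312();

        string sameEncoding = Transcode(original, gb2312, gb2312);
        string wrongEncoding = Transcode(original, gb2312, Encoding.Unicode);

        Console.WriteLine(sameEncoding == original);   // True: the round trip is preserved
        Console.WriteLine(wrongEncoding == original);  // False: GB2312 bytes misread as UTF-16
    }
}
```

The bytes themselves never change; only the rule used to interpret them does, which is why the decoder must use the same encoding as the encoder.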

Question 1: Why do the two encodings produce different byte counts?

 

Unicode: Unicode is an international standard character set. A widely used encoding form was UCS-2, which uses two bytes to encode each character (thanks to ohmygirl for the pointer). In C#, Encoding.Unicode is UTF-16, which uses two bytes for each character in the Basic Multilingual Plane and four bytes (a surrogate pair) for characters outside it. All six characters of the test string are in the Basic Multilingual Plane, so its byte length is 6 × 2 = 12.

GB2312 is one of the ANSI encodings, which belong to the second phase of multi-language support described below. Each character is represented by one or more bytes (MBCS), so characters stored this way are also called multi-byte characters. For example, "啊?/123" is 7 bytes long: each Chinese character occupies 2 bytes, and each English letter, digit, or ASCII symbol occupies 1 byte.
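The arithmetic behind the two totals can be checked character by character. This is a sketch under the assumption that the test string is "啊?/123"; ByteCountDemo and its helpers are illustrative names, not part of the original program.

```csharp
using System;
using System.Text;

public static class ByteCountDemo
{
    // GB2312 needs the code-page provider on .NET Core / .NET 5+;
    // on .NET Framework the RegisterProvider line can be dropped.
    public static Encoding Gb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    public static int CountBytes(string text, Encoding encoding)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        string text = "啊?/123";
        Encoding gb2312 = Gb2312();

        // Per-character breakdown: code point, UTF-16 size, GB2312 size.
        foreach (char c in text)
        {
            string s = c.ToString();
            Console.WriteLine("U+{0:X4}: UTF-16 {1} byte(s), GB2312 {2} byte(s)",
                (int)c, CountBytes(s, Encoding.Unicode), CountBytes(s, gb2312));
        }

        Console.WriteLine(CountBytes(text, Encoding.Unicode)); // 12
        Console.WriteLine(CountBytes(text, gb2312));           // 7
    }
}
```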

Development of characters and encodings

From the perspective of computer support for multiple languages, there are roughly three phases:

Phase 1
System internal code: ASCII
Description: At first, computers supported only English; other languages could not be stored or displayed on the computer.
Systems: English DOS

Phase 2
System internal code: ANSI encoding (localization)
Description: To let computers support more languages, one character is usually represented by 2 bytes in the range 0x80 to 0xFF. For example, in a Chinese operating system the character '中' is stored as the two bytes [0xD6, 0xD0].
Different countries and regions developed different standards, resulting in their own coding standards such as GB2312, BIG5, and JIS. These extended encodings, which use 2 bytes to represent a character, are called ANSI encodings. In a Simplified Chinese system, ANSI encoding means GB2312; in a Japanese operating system, ANSI encoding means JIS.
Different ANSI encodings are incompatible with each other. When information is exchanged internationally, text in two languages cannot be stored in the same ANSI-encoded text.
Systems: Chinese DOS, Chinese Windows 95/98, Japanese Windows 95/98

Phase 3
System internal code: UNICODE (internationalization)
Description: To facilitate international information exchange, international organizations developed the UNICODE character set, which assigns a uniform and unique number to each character in every language to meet the requirements of cross-language and cross-platform text conversion and processing.
Systems: Windows NT/2000/XP, Linux, Java
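The difference between a character's Unicode number and its stored bytes can be seen with the character 中 (U+4E2D), which GB2312 stores as 0xD6 0xD0. The sketch below is illustrative only; CodePointDemo and Hex are made-up names, and the RegisterProvider call is an assumption that applies to .NET Core / .NET 5+.

```csharp
using System;
using System.Text;

public static class CodePointDemo
{
    // Formats bytes as hyphen-separated hexadecimal, for example "D6-D0".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    public static Encoding Gb2312()
    {
        // Needed on .NET Core / .NET 5+; can be dropped on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    public static void Main()
    {
        string zhong = "中";

        // The Unicode number is the same on every system.
        int codePoint = char.ConvertToUtf32(zhong, 0);
        Console.WriteLine("U+{0:X4}", codePoint);                     // U+4E2D

        // The stored bytes depend on the encoding.
        Console.WriteLine(Hex(Gb2312().GetBytes(zhong)));             // D6-D0
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes(zhong)));     // 2D-4E (UTF-16, little-endian)
    }
}
```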

 

Question 2: What do the decimal numbers in the last row represent?

 

Because the byte array is encoded in GB2312, we first need to understand how GB2312 works. In programs that use GB2312, each Chinese character and symbol is represented by two bytes. The first byte is called the "high byte" (also known as the "zone byte") and the second byte is called the "low byte" (also known as the "position byte"). The high byte uses 0xA1-0xF7 (zone numbers 01-87 plus 0xA0) and the low byte uses 0xA1-0xFE (position numbers 01-94 plus 0xA0). 0xA0 in decimal is 160. "啊" is the first Chinese character in the GB2312 character set: its zone number is 16 and its position number is 01, so its zone-position code is 1601.

Therefore, the high byte is 0xA0 + 16, that is, 160 + 16 = 176, and the low byte is 0xA0 + 01, that is, 160 + 1 = 161, which matches the output exactly. The remaining five decimal numbers are the codes of the five characters that follow "啊": 63, 47, 49, 50, and 51 for ?, /, 1, 2, and 3.
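The zone and position arithmetic can be written as a small helper and compared with what the encoder produces. This is a sketch only: QuweiDemo and ToBytes are hypothetical names, and the zone range 1-87 follows the range quoted above.

```csharp
using System;
using System.Text;

public static class QuweiDemo
{
    // Converts a GB2312 zone number (1-87) and position number (1-94)
    // into the two stored bytes by adding 0xA0 (160) to each.
    public static byte[] ToBytes(int zone, int position)
    {
        if (zone < 1 || zone > 87 || position < 1 || position > 94)
        {
            throw new ArgumentOutOfRangeException();
        }
        return new[] { (byte)(0xA0 + zone), (byte)(0xA0 + position) };
    }

    public static Encoding Gb2312()
    {
        // Needed on .NET Core / .NET 5+; can be dropped on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    public static void Main()
    {
        // Zone 16, position 01 is the character 啊.
        byte[] computed = ToBytes(16, 1);
        Console.WriteLine("{0} {1}", computed[0], computed[1]); // 176 161

        // The encoder yields the same two bytes.
        byte[] encoded = Gb2312().GetBytes("啊");
        Console.WriteLine("{0} {1}", encoded[0], encoded[1]);   // 176 161
    }
}
```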

 

Question 3: Why does the size of the generated text file differ from the byte count?

The file generated with GB2312 encoding is 7 bytes, which matches the count printed on the console, whereas the file generated with Unicode encoding is 14 bytes, two bytes more than the count printed on the console. I do not know how to explain this.
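One explanation that fits these numbers is the byte order mark (BOM). Encoding.Unicode has a two-byte preamble (0xFF 0xFE), and StreamWriter writes that preamble at the start of a new stream, while GetByteCount counts only the encoded characters. The sketch below writes to a MemoryStream instead of a file so that it is self-contained; BomDemo and WriteWithStreamWriter are illustrative names, and the test string is assumed to be "啊?/123".

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDemo
{
    // Writes text through a StreamWriter into memory and returns every byte
    // the writer produced, including any preamble it emitted.
    public static byte[] WriteWithStreamWriter(string text, Encoding encoding)
    {
        using (var stream = new MemoryStream())
        {
            using (var writer = new StreamWriter(stream, encoding))
            {
                writer.Write(text);
            }
            return stream.ToArray(); // ToArray still works after the stream is closed
        }
    }

    public static void Main()
    {
        string text = "啊?/123";

        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));  // FF-FE
        Console.WriteLine(Encoding.Unicode.GetByteCount(text));                    // 12
        Console.WriteLine(WriteWithStreamWriter(text, Encoding.Unicode).Length);   // 14

        // A UnicodeEncoding created with byteOrderMark: false has no preamble.
        var utf16NoBom = new UnicodeEncoding(false, false);
        Console.WriteLine(WriteWithStreamWriter(text, utf16NoBom).Length);         // 12
    }
}
```

GB2312 has no preamble, which would explain why the 7-byte file matches the console output exactly.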

 


Talk about characters and bytes

For example, the ASCII code of the character A is 65.

字节 (zì jié)
Byte: the byte is the unit in which information is transmitted over a network (or stored on a hard disk or in memory).

The byte is a unit of measurement used in computing to measure storage capacity and transmission capacity. One byte equals eight binary bits.

In an ANSI encoding such as GB2312, an English letter (upper or lower case) occupies one byte, and a Chinese character occupies two bytes.
Symbols: English punctuation occupies one byte, and Chinese punctuation occupies two bytes.

A byte is a sequence of binary digits, generally 8 bits, treated as one numerical unit in a computer. For example, one ASCII code is one byte. The conversions between such units are:
1 TB (terabyte) = 1024 GB (2 to the power of 40 bytes)
1 GB (gigabyte) = 1024 MB (2 to the power of 30 bytes)
1 MB (megabyte) = 1024 KB (2 to the power of 20 bytes)
1 KB (kilobyte) = 1024 bytes (2 to the power of 10 bytes)
1 byte = 8 bits

1.2 Characters, bytes, and strings
The key to understanding encoding is to understand the concepts of character and byte accurately. These two concepts are easy to confuse, so we distinguish them here:

Concept: Character
Description: A mark that people use; a symbol in the abstract sense.
Example: '1', '中', 'a', '$', '¥', ......

Concept: Byte
Description: A unit of data storage in a computer; an 8-bit binary number; a very concrete piece of storage space.
Example: 0x01, 0x45, 0xFA, ......

Concept: ANSI string
Description: In memory, if the characters exist in ANSI-encoded form, one character may be represented by one or more bytes. We call such a string an ANSI string or multi-byte string.
Example: "中文123" (7 bytes)

Concept: UNICODE string
Description: In memory, if the characters exist as their UNICODE numbers, we call the string a UNICODE string or wide-character string.
Example: L"中文123" (10 bytes)

Because the various ANSI encoding standards differ, we must know which encoding rule a given multi-byte string uses in order to know which characters it contains. For a UNICODE string, the characters it represents are the same in any environment.
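The 7-byte and 10-byte figures, and the need to know the code page of a multi-byte string, can be illustrated as follows. This is a sketch assuming the example string is "中文123"; AnsiVsUnicodeDemo and CodePage are hypothetical names, and Big5 is used only as an arbitrary second ANSI code page.

```csharp
using System;
using System.Text;

public static class AnsiVsUnicodeDemo
{
    // Looks up a code-page encoding by name. The RegisterProvider call is
    // needed on .NET Core / .NET 5+ and can be dropped on .NET Framework.
    public static Encoding CodePage(string name)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(name);
    }

    public static void Main()
    {
        string text = "中文123";
        Encoding gb2312 = CodePage("gb2312");
        Encoding big5 = CodePage("big5");

        byte[] ansiBytes = gb2312.GetBytes(text);
        Console.WriteLine(ansiBytes.Length);                      // 7  (2 + 2 + 1 + 1 + 1)
        Console.WriteLine(Encoding.Unicode.GetByteCount(text));   // 10 (5 characters x 2 bytes)

        // The same bytes give the original text only under the matching code page.
        Console.WriteLine(gb2312.GetString(ansiBytes) == text);   // True
        Console.WriteLine(big5.GetString(ansiBytes) == text);     // False
    }
}
```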

Character
Open classification: Programming

字符 (zì fú)
A character is an abstract entity that can be represented using many different character schemes or code pages. For example, Unicode UTF-16 encoding represents characters as sequences of 16-bit integers, while Unicode UTF-8 encoding represents the same characters as sequences of 8-bit bytes. The common language runtime uses Unicode UTF-16 (Unicode Transformation Format, 16-bit encoding form) to represent characters.

Applications that target the common language runtime use encoding to map character representations from the native character scheme to other schemes. Applications use decoding to map characters from non-native schemes to the native scheme.
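The difference between 16-bit and 8-bit code units can be observed from a C# string. This is a minimal sketch with illustrative names (CodeUnitDemo and its helpers); the three sample characters are arbitrary choices.

```csharp
using System;
using System.Text;

public static class CodeUnitDemo
{
    // A .NET string stores UTF-16, so Length is the number of 16-bit code units.
    public static int Utf16CodeUnits(string text)
    {
        return text.Length;
    }

    // Number of 8-bit code units (bytes) after encoding the text as UTF-8.
    public static int Utf8Bytes(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        // A Latin letter, a Chinese character, and U+1F600, which lies
        // outside the Basic Multilingual Plane and needs a surrogate pair.
        string[] samples = { "A", "中", "\U0001F600" };

        foreach (string s in samples)
        {
            Console.WriteLine("UTF-16 code units: {0}, UTF-8 bytes: {1}",
                Utf16CodeUnits(s), Utf8Bytes(s));
        }
        // UTF-16 code units: 1, UTF-8 bytes: 1
        // UTF-16 code units: 1, UTF-8 bytes: 3
        // UTF-16 code units: 2, UTF-8 bytes: 4
    }
}
```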

Ascii code
Open classification: computer technology, standards, code

ASCII: the American Standard Code for Information Interchange, a scheme that encodes characters with 7 or 8 binary digits, so that at most 256 characters (letters, digits, punctuation marks, control characters, and other symbols) can be assigned.

ASCII was proposed in 1968 to standardize data transmission between different computer hardware and software systems. It is used in most computers and all personal computers. ASCII is divided into two sets: the standard ASCII code of 128 characters and the additional 1 ......

Question about the byte length of a string

The length property gives the number of characters (UTF-16 code units). To calculate the number of bytes, do not simply multiply by 2; the byte size of a character depends on where its code point falls in Unicode. In UTF-8, a Chinese character occupies 3 or 4 bytes. For the relationship between UTF-8 and Unicode, see Baidu Baike: baike.baidu.com/view/40801.htm. The following computes the number of bytes a string occupies in UTF-8:

String.prototype.getBytesLength = function () {
    var totalLength = 0;
    for (var i = 0; i < this.length; i++) {
        // codePointAt reads a whole code point, so characters outside
        // the Basic Multilingual Plane are counted as 4 bytes.
        var codePoint = this.codePointAt(i);
        if (codePoint <= 0x007f) {
            totalLength += 1;
        } else if (codePoint <= 0x07ff) {
            totalLength += 2;
        } else if (codePoint <= 0xffff) {
            totalLength += 3;
        } else {
            totalLength += 4;
            i++; // skip the second half of the surrogate pair
        }
    }
    return totalLength;
};
var str = "test";
console.log(str.getBytesLength());
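For comparison, the same range-based calculation can be sketched in C#, the language of the test program above, and checked against the built-in Encoding.UTF8.GetByteCount. Utf8LengthDemo and Utf8Length are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

public static class Utf8LengthDemo
{
    // Counts UTF-8 bytes from code point ranges: 1 byte up to U+007F,
    // 2 bytes up to U+07FF, 3 bytes up to U+FFFF, and 4 bytes beyond that.
    public static int Utf8Length(string text)
    {
        int total = 0;
        for (int i = 0; i < text.Length; i++)
        {
            int codePoint;
            if (char.IsSurrogatePair(text, i))
            {
                codePoint = char.ConvertToUtf32(text, i);
                i++; // skip the low surrogate
            }
            else
            {
                codePoint = text[i];
            }

            if (codePoint <= 0x7F) total += 1;
            else if (codePoint <= 0x7FF) total += 2;
            else if (codePoint <= 0xFFFF) total += 3;
            else total += 4;
        }
        return total;
    }

    public static void Main()
    {
        string[] samples = { "test", "中", "\U0001F600" };
        foreach (string s in samples)
        {
            // The manual count and the built-in count should agree.
            Console.WriteLine("{0} {1}", Utf8Length(s), Encoding.UTF8.GetByteCount(s));
        }
        // 4 4
        // 3 3
        // 4 4
    }
}
```

In practice the built-in GetByteCount is the simpler choice; the manual loop is shown only to make the ranges visible.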
