Understanding bit, byte, and character concepts

Source: Internet
Author: User

The difference between a byte and a bit:
In computer science, the bit is the smallest unit of information, also called a binary digit. Its value is either 0 or 1. A byte consists of eight bits and is typically used to represent one character in the computer. Bits and bytes can be converted into each other: 1 byte = 8 bits (or 1 B = 8 b). Note that bit is abbreviated with a lowercase b, while byte is abbreviated with a capital B.
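As a minimal sketch of the bit/byte relationship in C: the standard macro CHAR_BIT from limits.h gives the number of bits in a byte on the current platform (8 on virtually all current systems). The function name bytes_to_bits is a hypothetical helper introduced only for illustration.

```c
#include <limits.h>   /* CHAR_BIT: number of bits in one byte */
#include <stddef.h>   /* size_t */

/* Number of bits occupied by nbytes bytes on this platform.
   Where CHAR_BIT is 8, this is the 1 byte = 8 bits rule. */
size_t bytes_to_bits(size_t nbytes)
{
    return nbytes * CHAR_BIT;
}
```

On a platform where CHAR_BIT is 8, bytes_to_bits(4) gives 32.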

Hard disk capacities are quoted as 40 GB, 80 GB, 100 GB, and so on. Here, B refers to byte.
1 KB = 1024 bytes = 2^10 bytes
1 MB = 1024 KB = 2^20 bytes
1 GB = 1024 MB = 2^30 bytes
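The unit relationships above can be sketched in C with bit shifts, since each unit is a power of two. The macro and function names here are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>

/* Binary size units: each one is 1024 times the previous one. */
#define BYTES_PER_KB (UINT64_C(1) << 10)   /* 2^10 = 1024          */
#define BYTES_PER_MB (UINT64_C(1) << 20)   /* 2^20 = 1024 * 1024   */
#define BYTES_PER_GB (UINT64_C(1) << 30)   /* 2^30 = 1024^3        */

/* Convert a capacity given in GB to a number of bytes. */
uint64_t gb_to_bytes(uint64_t gb)
{
    return gb * BYTES_PER_GB;
}
```

Under these definitions, a 40 GB disk holds gb_to_bytes(40) = 42949672960 bytes.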

 

What is a character?
A character is an abstract entity that can be represented using many different character schemes or code pages. For example, Unicode UTF-16 encoding represents characters as sequences of 16-bit integers, while Unicode UTF-8 encoding represents the same characters as sequences of 8-bit bytes. The Common Language Runtime uses Unicode UTF-16 (Unicode Transformation Format, 16-bit encoding form) to represent characters.
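To make the idea of one abstract character with several representations concrete, here is a small C sketch. It stores the character U+4E2D (中) once as a UTF-16 code unit and once as UTF-8 bytes, and decodes the UTF-8 form back to the same number. The array and function names are hypothetical, and the decoder handles only three-byte UTF-8 sequences.

```c
#include <stdint.h>

/* The same character, U+4E2D, in two Unicode encoding forms. */
static const uint16_t zhong_utf16[] = { 0x4E2D };            /* one 16-bit unit   */
static const uint8_t  zhong_utf8[]  = { 0xE4, 0xB8, 0xAD };  /* three 8-bit units */

/* Decode a three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx)
   into its code point. No validation is done in this sketch. */
uint32_t utf8_decode3(const uint8_t b[3])
{
    return ((uint32_t)(b[0] & 0x0F) << 12)
         | ((uint32_t)(b[1] & 0x3F) << 6)
         |  (uint32_t)(b[2] & 0x3F);
}

/* Returns 1 if both encodings describe the same character. */
int same_character(void)
{
    return utf8_decode3(zhong_utf8) == zhong_utf16[0];
}
```

The byte sequences differ (2 bytes versus 3 bytes), but both decode to the same abstract character.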

Applications that target the Common Language Runtime use encoding to map character representations from the native character scheme to other schemes. Applications use decoding to map characters from non-native schemes to the native scheme.

Byte: the byte is the unit in which information is transmitted over the network (or stored on hard disk or in memory).

An English letter (uppercase or lowercase) occupies one byte, and in a double-byte ANSI encoding such as GBK a Chinese character occupies two bytes.
Symbols: an English punctuation mark occupies one byte, and a Chinese punctuation mark occupies two bytes.

A byte is a sequence of binary digits, generally eight bits, that serves as a numerical unit in a computer. For example, one ASCII code fits in one byte.


Key to understanding encoding:

It is necessary to understand the concept of characters and the concept of bytes accurately. These two concepts are easy to confuse. Here we will make a distinction:
Character: a mark used by people; an abstract symbol. Examples: '1', '中', 'A', '$', '¥', ...
Byte: a unit of data storage in a computer; an 8-bit binary number, which is a very concrete piece of storage space. Examples: 0x01, 0x45, 0xFA, ...
ANSI string (multi-byte characters): if the characters in memory are ANSI encoded, so that one character may be represented by one or more bytes, the string is called an ANSI string or multi-byte string. Example: "中文123" (7 bytes)
Unicode string (wide characters): if the characters in memory are stored as their serial numbers in Unicode, the string is called a Unicode string or wide-character string. Example: L"中文123" (10 bytes)
Because the various ANSI encoding standards differ from one another, we must know which encoding rule a given multi-byte string uses in order to know which characters it contains. For a Unicode string, the characters it represents stay the same in any environment.
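The byte counts for ANSI and Unicode strings can be checked with a short C sketch. It assumes the ANSI encoding is GBK (where 中 is the bytes D6 D0 and 文 is CE C4) and that the Unicode string uses 16-bit units, as in UTF-16. The bytes are written out explicitly so the result does not depend on the compiler's source encoding. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The text 中文123 as a GBK (ANSI, multi-byte) string:
   2 + 2 bytes for the Chinese characters, 1 byte per digit. */
static const char ansi_str[] = "\xD6\xD0\xCE\xC4" "123";

/* The same text as 16-bit Unicode units: five characters, 2 bytes each,
   followed by a terminating zero unit. */
static const uint16_t wide_str[] = { 0x4E2D, 0x6587, 0x0031, 0x0032, 0x0033, 0 };

/* Bytes used by the ANSI string, not counting the terminator. */
size_t ansi_byte_count(void)
{
    return strlen(ansi_str);
}

/* Bytes used by the wide string, not counting the terminator. */
size_t wide_byte_count(void)
{
    return sizeof wide_str - sizeof wide_str[0];
}
```

With these assumptions, ansi_byte_count() is 7 and wide_byte_count() is 10. The same bytes D6 D0 CE C4 would mean different characters under a different ANSI code page, whereas the 16-bit values identify the same characters everywhere.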


What are wide and multibyte characters in C?
The C language was originally designed in an English-speaking environment, where the main character set was 7-bit ASCII and the 8-bit byte was the most common unit of character encoding. Internationalized software, however, must be able to represent many different characters, which are too numerous to be encoded in one byte each.

C95 standardizes two ways to represent large character sets: wide characters (every character in the set has the same bit width) and multibyte characters (a character may occupy one or more bytes, and the character value of a byte sequence depends on its context in the string or stream).

Since the 1994 amendment, the C language has provided not only the char type but also the wchar_t type (wide character), which is defined in the stddef.h header file. The wchar_t type is wide enough to represent any element of the implementation's extended character sets.

In a multibyte character set, the encoded width of a character varies from one byte to several bytes. Both the source and execution character sets may contain multibyte characters. Multibyte characters can be used in character constants, string literals, identifiers, comments, and header file names.

The C language itself does not define or specify any encoding or any character set (other than the basic source character set and the basic execution character set). It is left to each implementation to specify how wide characters are encoded and which multibyte character encodings are supported.

Although the C standard does not require support for the Unicode character set, many implementations use the Unicode transformation formats UTF-16 and UTF-32 for wide characters. If the Unicode standard is followed, the wchar_t type is at least 16 or 32 bits wide, and a value of type wchar_t represents one Unicode character.
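Because the width of wchar_t is chosen by the implementation, it can be useful to query it rather than assume it. The following is a minimal sketch with hypothetical function names. It shows that every element of a wide string occupies the same number of bytes.

```c
#include <limits.h>   /* CHAR_BIT */
#include <stddef.h>   /* wchar_t, size_t */
#include <wchar.h>    /* wcslen */

/* Width of wchar_t in bits on this implementation
   (commonly 16 on Windows and 32 on Linux). */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Bytes occupied by a wide string, not counting the terminator.
   Every wide character has the same width, so this is simply
   the character count times sizeof(wchar_t). */
size_t wide_string_bytes(const wchar_t *s)
{
    return wcslen(s) * sizeof(wchar_t);
}
```

For example, wide_string_bytes(L"abc") is 3 * sizeof(wchar_t), whatever the platform's choice of width.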

UTF-8 is a multibyte character encoding defined by the Unicode Consortium that can represent every character in the Unicode character set. A UTF-8 character occupies from one to four bytes.
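The one-to-four-byte range follows directly from the UTF-8 bit layout. As an illustration, here is a minimal encoder sketch; utf8_encode is a hypothetical helper written for this article, not a standard library function.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the Unicode code point cp as UTF-8 into out.
   Returns the number of bytes written (1 to 4), or 0 if cp is
   not a valid Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```

With this sketch, 'A' (U+0041) takes 1 byte, α (U+03B1) takes 2 bytes (CE B1), 中 (U+4E2D) takes 3 bytes (E4 B8 AD), and U+1F600 takes 4 bytes (F0 9F 98 80).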

The main difference between multibyte characters and wide characters (that is, wchar_t) is that all wide characters occupy the same number of bytes, while the number of bytes in a multibyte character varies. This makes multibyte strings harder to process than wide strings. For example, even though the character 'a' can be represented in one byte, you cannot find it in a multibyte string with a simple byte comparison: a matching byte at some position is not necessarily a character by itself, because it may be part of a different, longer character. Multibyte characters are nevertheless well suited to storing text in files.
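The search problem can be demonstrated with a small sketch. It assumes a GBK-style double-byte layout, in which a byte in the range 0x81 to 0xFE starts a two-byte character whose second byte may fall in the ASCII range. The helper names is_lead_byte and dbcs_find are hypothetical; portable code would instead step through the string with the standard mblen() or mbrtowc() functions under the appropriate locale.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: GBK-style layout. A byte in 0x81..0xFE begins a
   two-byte character; any other byte is a one-byte character. */
static int is_lead_byte(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

/* Find the single-byte character ch by stepping through the string
   one character (not one byte) at a time. Returns a pointer to the
   match, or NULL if ch does not occur as a character of its own. */
const char *dbcs_find(const char *s, char ch)
{
    while (*s != '\0') {
        if (is_lead_byte((unsigned char)*s)) {
            if (s[1] == '\0')
                break;            /* truncated two-byte character */
            s += 2;               /* skip lead byte and trail byte */
        } else {
            if (*s == ch)
                return s;
            s++;
        }
    }
    return NULL;
}
```

Given the bytes 0x81 0x61 followed by 'b', a plain strchr() search for 'a' (0x61) reports a match at index 1, which is really the second half of a two-byte character, while dbcs_find() correctly reports that 'a' is not present.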

C provides standard functions to convert multibyte characters to wchar_t and wide characters to multibyte characters. For example, if the C compiler uses the Unicode encodings UTF-16 and UTF-8, a call to the wctomb() function yields the multibyte representation of a wide character (note: wctomb = wide character to multibyte).
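The conversion described above can be sketched as follows. The wctomb() and setlocale() calls are standard C, but the locale names "C.UTF-8" and "en_US.UTF-8" are platform-specific assumptions and may not exist on every system. The wrapper names use_utf8_locale and wide_to_multibyte are hypothetical.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <locale.h>   /* setlocale, LC_CTYPE */
#include <stdlib.h>   /* wctomb */
#include <wchar.h>    /* wchar_t */

/* Try to select a UTF-8 locale for character conversion.
   The locale names are assumptions; returns 1 on success and
   0 if neither name is available on this system. */
int use_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Convert one wide character to its multibyte representation in the
   current locale. Returns the number of bytes stored in out, or -1
   if wc has no representation in that locale. */
int wide_to_multibyte(wchar_t wc, char out[MB_LEN_MAX])
{
    wctomb(NULL, 0);          /* reset any conversion shift state */
    return wctomb(out, wc);
}
```

On an implementation where wchar_t holds Unicode code points and a UTF-8 locale is active, converting the Greek letter α (L'\x3B1') stores the two bytes 0xCE 0xB1 and returns 2. In the default "C" locale the same call may fail and return -1.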
