An in-depth look at byte and char in the C language

Source: Internet
Author: User

For example, in the following source program, "你", "好", ",", "C", "!", and "\n" are characters that the program itself will process.

#include <stdio.h>

int main(void)
{
    printf("你好,C!\n");
    return 0;
}

The other characters in the source program are used to write the source itself. They may include characters that have no visible glyph, such as the space character, horizontal tab, vertical tab, and form feed.

In a sense, an editor/compiler is a piece of software that takes characters as input and produces an executable file. That executable is later loaded into memory and run, and the running program almost inevitably has to process characters as well.

The editor/compiler does not necessarily run in the same environment as the application it produces, which means that the two may have to handle different character sets.

The characters that the editor/compiler has to process are the characters used to write the C source program; they form the source character set. The characters that the application has to process form the execution character set.

Most people learning C never notice the difference between the source character set and the execution character set, because their editing/compiling environment is the same as the environment in which their programs run.

• Source character set
The characters in the source character set are the characters used to write a C source program; they are what the C language requires of the environment in which the editor/compiler runs. The set consists of the basic character set, an indicator for the end of each line (the new-line character), and the extended characters.

The basic character set includes:

A B C D E F G H I J K L M

N O P Q R S T U V W X Y Z

a b c d e f g h i j k l m

n o p q r s t u v w x y z

0 1 2 3 4 5 6 7 8 9

! " # % & ' ( ) * + , - . / :

; < = > ? [ \ ] ^ _ { | } ~

The space character

Control characters: horizontal tab, vertical tab, form feed

That makes 95 characters in total (52 letters, 10 digits, 29 punctuation characters, the space, and 3 control characters). This is the minimum the C language demands of the environment in which the editor/compiler runs: as long as that environment provides these 95 characters, a C program can be written in it. In practice, C source programs do consist mainly of these 95 characters.

In addition, the C language requires that the codes of the ten digit characters 0 through 9 be consecutive and in increasing order in the editing/compiling environment.

Unfortunately, some environments cannot provide all 95 characters. For example, keyboards in some countries reportedly have no "[" key.

Because of this, the C language also allows so-called three-character sequences (trigraphs) to stand for characters that the environment does not provide. For example, "??<" stands for "{" and "??>" stands for "}". The following code looks odd, but it is still a legitimate C program. (Trigraphs were removed in C23, and some compilers, GCC among them, only replace them in a strict standard mode or when given an option such as -trigraphs.)

#include <stdio.h>

int main(void)
??<
    printf("Hello, C!\n");
    return 0;
??>

A compiler may also add characters of its own to the basic character set; these are called extended characters. Characters such as "你" and "好" in a string literal like "你好,C!\n" are extended characters. Extended characters may appear only in identifiers, character constants, string literals, header names, comments, and preprocessing tokens that are never converted to tokens. An extended character anywhere else in the code results in undefined behavior.

The values of the extended characters are defined by the particular compiler. The set of all characters that a source program can use, basic and extended together, is called the extended character set.

• Execution character set
The character set of the environment in which the application runs, the execution character set, is likewise an extended character set.

It must also contain the 95 characters of the basic character set described above for the source character set, and the codes of the ten digit characters 0 through 9 must again be consecutive.

It is important to note that the C language does not require the same encoding for the basic character set in the execution environment and for the basic character set in the editing/compiling environment, although the "characters" of the two basic character sets are the same.

The execution environment must additionally provide control characters for alert, backspace, carriage return, and new line, as well as a null character, all of whose bits are 0.

Any other characters that the program can process in the execution environment are also called extended characters. Together with the basic character set and the alert, backspace, carriage return, new line, and null characters, they form the extended character set of the execution environment, that is, the execution character set.

For the execution environment, too, the extended characters are defined by the compiler.

Byte in the C language
Like a type such as int, a byte in C is not a group of bits with a fixed length. The C language only requires that a byte be large enough to hold the code of any character in the basic character set, both in the execution environment and in the editing/compiling environment. This makes it easy to see why a byte can be 9 bits wide on some compilers without violating the definition in the C language.

Similarly, if the basic character set is encoded in 8 bits in the editing/compiling environment but in 16 bits in the execution environment, then a byte must be at least 16 bits wide.

Thus, a byte in the C sense is not necessarily an octet (a group of 8 bits), which is how the word is usually understood.

Char data types in the C language
The char data type in the C language is an integer type whose size is defined to be 1 byte. That is,

sizeof(char) ≡ 1

To find out how many bits a byte has with a particular compiler, look at the limits.h header that the compiler provides. The symbolic constant CHAR_BIT defined there is the number of bits in the char type, which is also the number of bits in a byte.

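Addendum:

A contradiction in the C standard?

The definition of byte reads: "addressable unit of data storage large enough to hold any member of the basic character set of the execution environment" (clause 3.6 of the C standard).

But 5.2.1, paragraph 3, says:

"The representation of each member of the source and execution basic character sets shall fit in a byte."

The definition mentions only the execution environment, while 5.2.1 also covers the source character set.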
