C# Programming Summary (IX): Character Encoding

Source: Internet
Author: User
You have probably run into garbled text at some point. Why does text become garbled? Why does the output differ from the input?

I ran into the same trouble recently while summarizing encryption topics, so today let's focus on this problem.

What is a character?

Characters are the letters, digits, words, and symbols used in a computer, for example: 1, 2, 3, A, B, C, ~ ! # ¥ % ... - * ( ) + and so on.

Character Set (Charset)

A character set (Charset) is a collection of all abstract characters supported by a system.

"Character" is the general term for all kinds of letters and symbols, including the scripts of every country, punctuation marks, graphic symbols, digits, and so on.

What is character encoding?

Character encoding (Character Encoding): put simply, it establishes a correspondence between natural language and machine language. It is a set of rules that pairs a set of natural-language characters (such as an alphabet or a syllabary) with a set of something else, such as numbers or electrical pulses. Establishing this correspondence between a symbol set and a number system is a basic technique of information processing. People usually express information with sets of symbols (usually text), whereas a computer-based information processing system stores and processes information using combinations of the different states of its components (hardware). Those combinations of states can represent numbers in a number system, so character encoding converts symbols into numbers that the computer can accept, known as digital codes.

Information in a computer includes data information and control information, and data information can be further divided into numeric and non-numeric information. Non-numeric information and control information include letters, various control symbols, graphic symbols, and so on. They are stored and processed in the computer as binary codes, and the binary codes that represent letters and symbols are called character codes. Common character codes used in computers include ASCII (American Standard Code for Information Interchange) and EBCDIC (Extended Binary Coded Decimal Interchange Code).
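As a small illustration of "a character code is a number assigned to a character", the following Python sketch encodes the same two characters under ASCII and under one EBCDIC code page. The helper name `code_values` is made up for this example, and `cp037` is just one of several EBCDIC variants (it ships with Python's standard codecs).

```python
# Encode the same characters under two different character codes.
# "ascii" and "cp037" (one EBCDIC code page) are standard Python codecs.

def code_values(text, encoding):
    """Return the byte value (as hex) that `encoding` assigns to each character."""
    return [hex(b) for b in text.encode(encoding)]

ascii_codes = code_values("A1", "ascii")
ebcdic_codes = code_values("A1", "cp037")

print(ascii_codes)   # ['0x41', '0x31']
print(ebcdic_codes)  # ['0xc1', '0xf1']
```

The characters are the same; only the numbers chosen to represent them differ.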

In ASCII, an English letter takes 1 byte. In GB2312 or GBK, a Chinese character takes 2 bytes. In UTF-8, an English letter takes 1 byte and a Chinese character takes 3 to 4 bytes. In UTF-16, an English letter or a Chinese character takes 2 bytes (some Chinese characters in the Unicode extension areas take 4 bytes). In UTF-32, every character takes 4 bytes.
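These sizes can be checked with a short Python sketch. The sample characters and the helper `byte_length` are choices made for this example; the `-le` codec variants are used so that Python does not prepend a byte order mark, which would inflate the counts.

```python
LETTER = "A"
HANZI = "中"                # U+4E2D, a common Chinese character
RARE_HANZI = "\U00020000"   # a Chinese character from a Unicode extension area

def byte_length(char, encoding):
    """Number of bytes needed to store `char` in `encoding`."""
    return len(char.encode(encoding))

all_three = (LETTER, HANZI, RARE_HANZI)
sizes = {
    "ascii": [byte_length(LETTER, "ascii")],
    "gb2312": [byte_length(HANZI, "gb2312")],
    "gbk": [byte_length(HANZI, "gbk")],
    "utf-8": [byte_length(c, "utf-8") for c in all_three],
    "utf-16-le": [byte_length(c, "utf-16-le") for c in all_three],
    "utf-32-le": [byte_length(c, "utf-32-le") for c in all_three],
}

for name, lengths in sizes.items():
    print(name, lengths)
# ascii [1]
# gb2312 [2]
# gbk [2]
# utf-8 [1, 3, 4]
# utf-16-le [2, 2, 4]
# utf-32-le [4, 4, 4]
```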

Still puzzled?

1. Why do we need character encoding?

As explained in the introduction above, character encoding exists so that a computer can represent natural language.

2. Why are there so many character sets?

Computers developed in stages. At first they were used only in the United States, which established ASCII. Some European countries could not make do with ASCII, so it was extended into extended ASCII. Later, to represent Chinese on computers, GB2312, GBK, BIG5, and others were defined. There are many other character sets as well.

3. Is there a single unified character set?

Yes: Unicode.

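4. What is the relationship between UTF-8 and Unicode?

UTF-8 is one way of implementing Unicode: Unicode assigns every character a number, and UTF-8 defines how that number is stored as bytes.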
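The questions above can be tied together in a short Python sketch: a character has one Unicode number, UTF-8 is one way of turning that number into bytes, and reading those bytes back with a different encoding produces garbled text. The sample character and the choice of Latin-1 as the "wrong" decoder are assumptions made purely for illustration.

```python
char = "中"

# Unicode: the abstract number (code point) assigned to the character.
code_point = ord(char)
print(hex(code_point))          # 0x4e2d

# UTF-8: one concrete way to store that number as bytes.
utf8_bytes = char.encode("utf-8")
print(utf8_bytes)               # b'\xe4\xb8\xad'

# Decoding with the same encoding restores the original character.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different encoding does not fail here, but each byte
# is read as a separate, unrelated character: the text is garbled.
garbled = utf8_bytes.decode("latin-1")
print(repr(round_trip), repr(garbled))
```

The bytes never changed between the last two steps; only the rule used to interpret them did.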

Several common encodings

1. ASCII code

ASCII (American Standard Code for Information Interchange) is a computer coding system based on the Latin alphabet.

ASCII character set: mainly consists of control characters (carriage return, backspace, line feed, etc.) and printable characters (English uppercase and lowercase letters, Arabic numerals, and Western punctuation and symbols).

Features: single-byte encoding; contains only English uppercase and lowercase letters, digits, punctuation marks, and a few other symbols.

ASCII encoding: the rule that converts the ASCII character set into numbers a computer can accept. It uses 7 bits to represent a character, for a total of 128 characters. Because a 7-bit coded character set can support only 128 characters, ASCII was extended to cover more characters common in European languages: the extended ASCII character set uses 8 bits per character, for a total of 256 characters.

Computers were first developed in the United States, and at the beginning ASCII only had to serve English, so this code is very limited. It cannot be used for other countries' languages.

Even with the 8-bit extension, ASCII can identify only 256 characters. The world has many languages, and Chinese alone has on the order of 100,000 characters by some counts, so 256 code values are obviously far from enough.
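The arithmetic behind these limits, and what happens when a character falls outside them, can be sketched in Python. The helper `encode_ascii_or_none` is a name invented for this example; the split of ASCII into codes 0 to 31 plus 127 (control) and 32 to 126 (printable) is the standard one.

```python
ASCII_SIZE = 2 ** 7      # 7 bits -> 128 code values (0..127)
EXTENDED_SIZE = 2 ** 8   # 8 bits -> 256 code values (0..255)

# Codes 0..31 and 127 are control characters; 32..126 are printable.
control_codes = [c for c in range(ASCII_SIZE) if c < 32 or c == 127]
printable_chars = [chr(c) for c in range(32, 127)]

def encode_ascii_or_none(text):
    """Encode `text` as ASCII, or return None if any character does not fit."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None

print(ASCII_SIZE, EXTENDED_SIZE)                 # 128 256
print(len(control_codes), len(printable_chars))  # 33 95
print(ord("\b"), ord("\n"), ord("\r"))           # 8 10 13
print(encode_ascii_or_none("Hello"))             # b'Hello'
print(encode_ascii_or_none("中"))                # None
```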
