Character set and encoding for Web pages

Source: Internet
Author: User

What is a character set? What is encoding?

A character is the general term for letters, glyphs, and symbols, including text, graphic symbols, mathematical symbols, and so on.

A set of abstract characters is a character set (charset).

A character set often corresponds to a specific language: all of that language's characters, or its most commonly used ones, form its character set, for example the English character set.

A group of characters with common characteristics can also form a character set, such as the Traditional Chinese character set or the Japanese kanji character set.

A subset of the character set is also a character set.

To handle characters, a computer must map each character to a binary code; this mapping is the character encoding.

An encoding first fixes the character set, orders the characters within it, and then maps each one to a binary number. The number of characters in the set determines how many bytes each code needs.

Each encoding defines a precise set of characters, called the coded character set, which is another meaning of "character set". In everyday usage, "character set" mostly has this meaning.
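
As a minimal sketch of the difference between a character, its number in a coded character set, and its encoded bytes, the Python snippet below looks at a single character. Python and the variable names are assumptions made for illustration; the article itself names no language.

```python
# One abstract character, its code number, and its bytes in two encodings.
ch = "é"

code_point = ord(ch)                   # number assigned by the coded character set
latin1_bytes = ch.encode("latin-1")    # one byte in ISO 8859-1
utf8_bytes = ch.encode("utf-8")        # two bytes in UTF-8

print(hex(code_point), latin1_bytes, utf8_bytes)
```

The character and its number stay the same; only the byte sequence changes with the encoding.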

What are the character sets?

ASCII:

American Standard Code for Information Interchange.

It is the most widely used character set and encoding in computing, developed by the American National Standards Institute (ANSI).

The International Organization for Standardization (ISO) has adopted it as a standard, known as ISO 646.

The ASCII character set consists of control characters and graphic characters.

ASCII is a 7-bit code. In a computer's storage unit an ASCII value occupies one byte (8 bits), so its highest bit (b7) is free to be used as a parity bit.

Parity checking is a method of detecting errors introduced into a code during transmission. It comes in two forms, odd parity and even parity.

Odd parity rule: the number of 1 bits in a correct byte must be odd; if it is not, the highest bit b7 is set to 1.

Even parity rule: the number of 1 bits in a correct byte must be even; if it is not, the highest bit b7 is set to 1.
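
The two rules above can be sketched in Python as follows. The function name `add_parity` is hypothetical, and the sketch assumes the input is a 7-bit ASCII code whose bit b7 starts out as 0.

```python
def add_parity(code7: int, odd: bool = False) -> int:
    """Return the 8-bit byte for a 7-bit ASCII code with parity in bit b7."""
    if not 0 <= code7 <= 0x7F:
        raise ValueError("expected a 7-bit value")
    ones = bin(code7).count("1")
    # Even parity wants an even count of 1 bits, odd parity an odd count.
    wanted_remainder = 1 if odd else 0
    if ones % 2 != wanted_remainder:
        return code7 | 0x80  # set b7 so the total count of 1 bits is correct
    return code7


# 'A' is 0x41 (1000001, two 1 bits); 'C' is 0x43 (1000011, three 1 bits).
print(hex(add_parity(ord("A"))), hex(add_parity(ord("C"))))
```

A receiver would count the 1 bits of each incoming byte and reject any byte whose count does not match the agreed rule.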

ISO 8859-1:

ISO 8859, in full ISO/IEC 8859, is a series of 8-bit character set standards developed jointly by the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). It currently defines 15 character sets.

ASCII contains the space and 94 "printable characters", which is sufficient for English.

However, other languages that use the Latin alphabet (mainly European languages) have a number of accented letters. ISO 8859 stores and represents these in the byte values outside the ASCII and control character ranges.

Besides the Latin-alphabet languages, Eastern European languages written in Cyrillic, as well as Greek, Thai, modern Arabic, Hebrew, and so on, can be stored and represented in the same way.

* ISO 8859-1 (Latin-1)-Western European languages

* ISO 8859-2 (Latin-2)-Central European languages

* ISO 8859-3 (Latin-3)-Southern European languages. Esperanto can also be displayed in this character set.

* ISO 8859-4 (Latin-4)-Nordic languages

* ISO 8859-5 (Cyrillic)-Slavic languages

* ISO 8859-6 (Arabic)-Arabic

* ISO 8859-7 (Greek)-Greek

* ISO 8859-8 (Hebrew)-Hebrew (visual order)

* ISO 8859-8-I-Hebrew (logical order)

* ISO 8859-9 (Latin-5 or Turkish)-replaces the Icelandic letters of Latin-1 with Turkish letters.

* ISO 8859-10 (Latin-6 or Nordic)-Nordic languages, intended to replace Latin-4.

* ISO 8859-11 (Thai)-Thai, derived from Thailand's TIS-620 standard character set.

* ISO 8859-13 (Latin-7 or Baltic Rim)-Baltic languages

* ISO 8859-14 (Latin-8 or Celtic)-Celtic languages

* ISO 8859-15 (Latin-9)-Western European languages; adds the French and Finnish letters missing from Latin-1, and the euro sign.

* ISO 8859-16 (Latin-10)-South-Eastern European languages. It is mainly used for Romanian and includes the euro sign.
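
As a rough illustration of how the parts of ISO 8859 share the ASCII range but differ above it, the Python sketch below decodes the same byte with two of the character sets listed above. The codec names `iso8859_1` and `iso8859_15` are the ones Python's standard library uses; the variable names are assumptions.

```python
raw = b"\xa4"  # one byte above the ASCII range

as_latin1 = raw.decode("iso8859_1")    # currency sign in Latin-1
as_latin9 = raw.decode("iso8859_15")   # euro sign in Latin-9

# Bytes in the ASCII range decode identically in both parts.
ascii_same = b"Hello".decode("iso8859_1") == b"Hello".decode("iso8859_15")

print(as_latin1, as_latin9, ascii_same)
```

This is why a reader must know which part of ISO 8859 a file uses: the bytes alone do not say.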

Clearly, the ISO 8859-1 encoding covers a narrow range of characters and cannot represent Chinese characters.

However, because it is a single-byte encoding, and the byte is the computer's most basic unit of representation, ISO 8859-1 is still used in many situations.

Many protocols also use it as their default encoding.
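
A small Python sketch of both properties, with illustrative variable names: every possible byte value decodes to some ISO 8859-1 character, so arbitrary bytes survive a round trip, while Chinese text cannot be encoded at all.

```python
# Every byte value 0..255 maps to exactly one ISO 8859-1 character.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso8859_1")
round_trip = text.encode("iso8859_1")

# Chinese characters have no ISO 8859-1 code, so encoding fails.
try:
    "中文".encode("iso8859_1")
    chinese_ok = True
except UnicodeEncodeError:
    chinese_ok = False

print(round_trip == all_bytes, chinese_ok)
```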

UCS:

The Universal Character Set (UCS) is a character encoding defined by ISO 10646 (ISO/IEC 10646), using 4-byte codes.

UCS aims to contain the characters of all known languages.

In addition to Latin, Greek, Slavic, Hebrew, Arabic, Armenian, and Georgian, and the Chinese, Japanese, and Korean ideographs, UCS also includes a large number of graphic, typographic, mathematical, and scientific symbols.

* UCS-2: 2-byte encoding, basically the same as the 2-byte encoding of Unicode.

* UCS-4: 4-byte encoding; at present it is the UCS-2 code with 2 zero bytes added in front.
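
The relationship between the two forms can be sketched in Python. Python has no codec named UCS-2 or UCS-4, so this sketch assumes the big-endian `utf-16-be` and `utf-32-be` codecs as stand-ins, which produce the same bytes for a character inside the 16-bit range.

```python
ch = "中"  # U+4E2D, inside the 16-bit range

two_byte = ch.encode("utf-16-be")    # stand-in for the UCS-2 form
four_byte = ch.encode("utf-32-be")   # stand-in for the UCS-4 form

print(two_byte.hex(), four_byte.hex())
```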

Unicode:

Unicode (also translated as unified code, universal code, or single code) is a character encoding used on computers.

It is an encoding scheme intended to cover the writing systems of the world.

It assigns a uniform and unique binary code to each character in each language, to meet the requirements of text conversion and processing across languages and platforms.

Development began in 1990, and it was officially published in 1994. As computing power has grown, Unicode has become widespread in the more than 10 years since its publication.

Since Unicode 2.0, Unicode has used the same character repertoire and code values as ISO 10646-1, and ISO has pledged that ISO 10646 will not assign UCS-4 values beyond 0x10FFFF, so that the two stay consistent.

Unicode's encoding corresponds to the Universal Character Set (UCS) concept of ISO 10646, and the form of Unicode in common practical use corresponds to UCS-2, with a 16-bit code space.

That is, each character occupies 2 bytes, which is basically enough for the various languages. In fact, these 16-bit codes are not all assigned; a lot of space is reserved for special use or future expansion. Code values above 0xFFFF, up to the 0x10FFFF limit, do not fit in 2 bytes.
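
To make the code space concrete, the Python sketch below checks that a common Chinese character fits in 16 bits and that 0x10FFFF is the last value accepted. It relies on Python 3's built-in `ord` and `chr`; the variable names are illustrative.

```python
bmp_char = "中"
fits_16_bits = ord(bmp_char) <= 0xFFFF   # inside the 16-bit code space

last = chr(0x10FFFF)                     # highest code value allowed

# One past the limit is rejected.
try:
    chr(0x110000)
    beyond_ok = True
except ValueError:
    beyond_ok = False

print(hex(ord(bmp_char)), fits_16_bits, beyond_ok)
```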

UTF:

How Unicode is implemented is a separate matter from how it assigns codes.

The Unicode code of a character is fixed, but in actual transmission, because different platforms are not designed the same way, and in order to save space, Unicode is implemented in different ways.

These implementations are called Unicode Transformation Formats (UTF).

* UTF-8: a variable-length encoding with 8-bit units. For the most common characters (the ASCII characters 0 to 127) it uses a single byte, and for other commonly used characters (notably Korean and Chinese) it uses 3 bytes.

* UTF-16: a variable-length encoding with 16-bit units, roughly equivalent to a 20-bit code, covering values from 0 to 0x10FFFF. It is basically a direct implementation of the Unicode code, and it depends on the CPU byte order.
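
The byte counts and the "roughly 20-bit" remark can be checked with a short Python sketch. The helper `utf16_units` is a hypothetical name; it implements the standard surrogate-pair arithmetic for values above 0xFFFF and does not validate its input.

```python
def utf16_units(code_point: int) -> list:
    """Split a code value into one or two 16-bit UTF-16 units."""
    if code_point <= 0xFFFF:
        return [code_point]
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


# UTF-8 length in bytes for an ASCII, a Chinese, and a Korean character.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in ("A", "中", "한")}

print(utf8_lengths, [hex(u) for u in utf16_units(0x1F600)])
```

Values up to 0xFFFF take one 16-bit unit; larger values take two, which is what makes UTF-16 variable length.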

Chinese character encodings:

* The GB2312 character set is a Simplified Chinese character set, in full GB2312-80, containing 6,763 Simplified Chinese characters.

* The BIG5 character set is a Traditional Chinese character set, containing 13,053 Traditional Chinese characters.

* GBK is a Simplified Chinese character set that covers the GB2312 character set, the BIG5 character set, and some symbols, 21,003 characters in total.

* GB18030 is a mandatory national standard for a large character set, in full GB18030-2000. Its introduction gave Chinese character sets a unified standard.
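
The coverage differences in the list above can be probed with Python's built-in codecs `gb2312`, `big5`, `gbk`, and `gb18030`. The helper `can_encode` and the sample characters are assumptions chosen for illustration.

```python
def can_encode(text: str, codec: str) -> bool:
    """Report whether every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified = "中"
traditional = "體"   # a Traditional Chinese character

gb2312_bytes = simplified.encode("gb2312")
big5_bytes = simplified.encode("big5")

coverage = {
    codec: can_encode(traditional, codec)
    for codec in ("gb2312", "big5", "gbk", "gb18030")
}
print(gb2312_bytes.hex(), big5_bytes.hex(), coverage)
```

The same character gets different bytes in GB2312 and BIG5, and the Traditional character is missing from GB2312 but present in the larger sets.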

ANSI and Unicode big endian:

When we save a text file on a Windows system, we can usually choose among the encodings ANSI, Unicode, Unicode big endian, and UTF-8. What do ANSI and Unicode big endian mean here?

ANSI:

The various extended encodings for Chinese and similar scripts, which use 2 bytes to represent one character, are called ANSI encodings.

On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese system, ANSI encoding means JIS.

Unicode big endian:

UTF-8 uses the byte as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as its code unit, so before interpreting UTF-16 text you must first determine the byte order of each code unit.

The method the Unicode specification recommends for marking byte order is the BOM (byte order mark).

UCS contains a character called ZERO WIDTH NO-BREAK SPACE, whose code is FEFF. FFFE is not a character in UCS, so it should not appear in actual transmission.

The UCS specification recommends transmitting the character ZERO WIDTH NO-BREAK SPACE before transmitting the byte stream.

This way, if the recipient receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian.

Therefore the character zero WIDTH no-break space is also called the BOM.

Windows uses a BOM to mark the encoding of a text file.
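
A minimal sketch of BOM detection in Python, using the constants from the standard `codecs` module. The function name `sniff_bom` is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs


def sniff_bom(data: bytes):
    """Guess an encoding from a leading byte order mark, or return None."""
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    return None


sample = "\ufeffhi".encode("utf-16-le")  # BOM character followed by text
print(sample[:2].hex(), sniff_bom(sample))
```

Without a BOM the function returns None, and the reader has to learn the encoding some other way.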

Programming languages and encoding

The internal strings of C, C++, and Python 2 all use the current system default encoding.

The internal strings of Python 3 and Java are stored as Unicode.
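
In Python 3 this shows up as the split between `str` (Unicode characters) and `bytes` (encoded data). The sketch below uses illustrative variable names.

```python
text = "中文"                      # a str holds Unicode characters
encoded = text.encode("utf-8")     # bytes exist only after choosing an encoding

char_count = len(text)             # counts characters
byte_count = len(encoded)          # counts bytes

print(type(text).__name__, char_count, type(encoded).__name__, byte_count)
```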

Ruby has an internal variable, $KCODE, that indicates the encoding of the multibyte strings it recognizes. Its value is one of EUC, SJIS, UTF8, or NONE.

When $KCODE is EUC, strings and regular expressions are assumed to be encoded in EUC-JP.

Similarly, when it is SJIS they are treated as Shift JIS, and when it is UTF8 they are treated as UTF-8.

If it is NONE, multibyte strings are not recognized.

When you assign a value to this variable, only the first byte matters, and it is not case-sensitive.

E or e means EUC, S or s means SJIS, U or u means UTF8, and N or n means NONE.

The default value is NONE.

That is, Ruby treats a string as a single-byte sequence by default.

Why does text become garbled?

Garbled text is an old problem. From the above we know that if the encoding in which characters were saved differs from the encoding used to display them, the text comes out garbled.

In a web system, from the underlying database encoding, through the web application encoding, to the HTML page encoding, any inconsistency produces garbled text.

So the garbled text problem is as hard or as simple as you make it: the key is to make the encodings of the interacting systems agree.
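
The mechanism can be reproduced in a few lines of Python. This sketch assumes text saved as UTF-8 and then read as ISO 8859-1; the variable names are illustrative.

```python
original = "中文"

stored = original.encode("utf-8")       # saved with one encoding
garbled = stored.decode("iso8859_1")    # displayed with another: garbled text

# The bytes are intact, so reversing the wrong step recovers the text.
repaired = garbled.encode("iso8859_1").decode("utf-8")

print(repr(garbled), repaired)
```

Recovery works here only because ISO 8859-1 maps every byte to a character; with other mismatched pairs the wrong decoding can lose information.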

Is there a cure-all?

With so many encodings and character sets to dazzle us, we only need to choose the one with the best compatibility and make it the agreed encoding for interaction between our program subsystems. Then the annoying garbled text problem goes away. The encoding with the best compatibility is UTF-8!

After all, GBK and GB2312 are domestic standards. When we rely heavily on foreign open-source software, UTF-8 is the lingua franca of the encoding world.
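
One way to apply such a contract in Python is to name UTF-8 explicitly at every boundary instead of relying on system defaults. The sketch below writes and reads a small HTML page; the file name `page.html` and the page content are hypothetical.

```python
import os
import tempfile

# The page declares UTF-8, and the file is written and read as UTF-8.
page = '<meta charset="utf-8">\n<p>中文 café</p>\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")   # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(page)
    with open(path, "r", encoding="utf-8") as f:
        read_back = f.read()

print(read_back == page)
```

The same idea extends to the database connection and the HTTP response headers, which would also need to be set to UTF-8.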

This article originates from: Jinan website Construction http://www.jinanwangzhanjianshe.com
