A detailed look at the character encodings used with PHP and when to use each (PHP tips)

A character set is a collection of characters. There are many character sets, each containing a different number of characters; common ones include ASCII, GB2312, BIG5, GB 18030, and Unicode. To process the characters of a character set accurately, a computer needs an encoding for them, so that it can recognize and store all kinds of text.

Chinese has a very large number of characters, written in two different systems, Simplified and Traditional, whereas computers were originally designed around single-byte English characters. Chinese character encoding is therefore the technical foundation of Chinese information interchange. This article discusses several typical character sets in chronological order, including several representative Chinese ones, and looks at the origin, features, and technical characteristics of each.

ASCII Character Set

1. The origin of the name

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin alphabet.

2. Features

It is mainly used to display modern English and other Western European languages. It is the most widely used single-byte encoding system today and is equivalent to the international standard ISO/IEC 646.

3. Contents

Control characters: carriage return, backspace, line feed, and so on.

Printable characters: uppercase and lowercase letters, Arabic numerals, and Western punctuation and symbols.

4. Technical characteristics

Each character is represented by 7 bits, giving 128 characters in total.

5. Extended ASCII character set

A 7-bit code can represent only 128 characters. To represent more of the characters commonly used in European languages, ASCII was extended: the extended ASCII character set uses 8 bits per character, for a total of 256 characters.

Compared with ASCII, the extended set adds box-drawing symbols, mathematical symbols, Greek letters, and special Latin letters and symbols.
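The 7-bit and 8-bit ranges described above can be sketched in PHP as a small byte classifier. This is only an illustration: the function name classifyByte() and the three labels it returns are made up for this example.

```php
<?php
// Classify a single byte value (0-255) using the ranges described above.
// classifyByte() and its labels are illustrative names, not PHP built-ins.
function classifyByte(int $byte): string
{
    if ($byte < 0 || $byte > 0xFF) {
        throw new InvalidArgumentException('expected a value from 0 to 255');
    }
    if ($byte > 0x7F) {
        return 'extended';   // needs the 8th bit, so it is outside 7-bit ASCII
    }
    if ($byte < 0x20 || $byte === 0x7F) {
        return 'control';    // e.g. backspace (8), line feed (10), carriage return (13)
    }
    return 'printable';      // letters, digits, punctuation, and the space character
}

echo classifyByte(ord("\n")), "\n";  // control
echo classifyByte(ord('A')), "\n";   // printable
echo classifyByte(0xE9), "\n";       // extended
```

Which character a value above 0x7F stands for depends on the particular 8-bit extension in use, so the sketch only reports that the byte lies outside 7-bit ASCII.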

GB2312 Character Set

1. The origin of the name

GB2312, also known as GB2312-80, is formally titled "Code of Chinese Graphic Character Set for Information Interchange, Primary Set". It was issued by the former National Bureau of Standards of China and took effect on May 1, 1981.

2. Features

GB2312 is the Chinese national standard character set for Simplified Chinese. The Chinese characters it contains cover 99.75% of usage by frequency, which basically meets the needs of computer processing of Chinese. It is widely used in mainland China and Singapore.

3. Contents

GB2312 contains simplified Chinese characters together with general symbols, serial numbers, numerals, Latin letters, Japanese kana, Greek letters, Russian (Cyrillic) letters, Hanyu Pinyin symbols, and Zhuyin (Bopomofo) letters, for a total of 7,445 graphic characters. These comprise 6,763 Chinese characters (3,755 level-1 characters and 3,008 level-2 characters) and 682 full-width non-Chinese characters, including Latin letters, Greek letters, Japanese hiragana and katakana, and Cyrillic letters.

4. Technical characteristics

(1) Area/position representation

In GB2312 the characters are divided into numbered areas, and each area contains 94 characters or symbols. This representation is also called the location code.

The areas are allocated as follows: areas 01-09 hold special symbols; areas 16-55 hold level-1 Chinese characters, sorted by pinyin; areas 56-87 hold level-2 Chinese characters, sorted by radical and stroke count; areas 10-15 and 88-94 are unassigned.

(2) Double byte representation

Of the two bytes, the first is customarily called the "high byte" and the second the "low byte".

The high byte uses 0xA1-0xF7 (area numbers 01-87 plus 0xA0), and the low byte uses 0xA1-0xFE (position numbers 01-94 plus 0xA0).

5. Encoding example

Take the first Chinese character in GB2312, "啊", as an example. Its area number is 16 and its position number is 01, so its location code is 1601. In most computer programs, 0xA0 is added to both the high byte and the low byte, which gives the internal code 0xB0A1 that programs use when processing Chinese text. The calculation is: 0xB0 = 0xA0 + 16, 0xA1 = 0xA0 + 1.
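The calculation above can be sketched in PHP as follows. The helper name gb2312FromLocation() is made up for this example, and it only performs the "add 0xA0" arithmetic; it does not check whether a character is actually assigned to the given location.

```php
<?php
// Convert a GB2312 area/position pair into its two-byte internal code.
// gb2312FromLocation() is a hypothetical helper name, not a PHP built-in.
function gb2312FromLocation(int $area, int $position): string
{
    if ($area < 1 || $area > 94 || $position < 1 || $position > 94) {
        throw new InvalidArgumentException('area and position must be in 1-94');
    }
    // Add 0xA0 to each number: 0xB0 = 0xA0 + 16 and 0xA1 = 0xA0 + 1.
    return chr(0xA0 + $area) . chr(0xA0 + $position);
}

// Location code 1601 (area 16, position 01).
echo strtoupper(bin2hex(gb2312FromLocation(16, 1))), "\n"; // B0A1
```

bin2hex() is used only to display the two raw bytes; the returned string itself is the byte sequence a GB2312-encoded file would contain.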

BIG5 Character Set

1. The origin of the name

Big5, also called "Big Five code" (大五码), was created in 1984 by Taiwan's Institute for Information Industry together with five software companies: Acer, MiTAC, JiaJia (佳佳), Zero One, and FIC. The name comes from these five companies.

Big5 came about because vendors in Taiwan at the time had each introduced their own encodings, such as the Eten code, IBM PS55, and the Wang code, which were incompatible with one another. In addition, the Taiwan authorities had not yet issued an official encoding, and the GB2312 code used in mainland China did not include Traditional Chinese characters.

2. Features

The Big5 character set contains 13,053 Chinese characters and is used in Taiwan. Curiously, two characters appear in it twice: "兀" (0xA461 and 0xC94A) and "嗀" (0xDCD1 and 0xDDFC).

3. Character encoding method

Big5 is a double-byte encoding: each character is stored in two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte ranges over 0xA1-0xF9, and the low byte over 0x40-0x7E and 0xA1-0xFE.
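As a rough illustration of these ranges, the PHP sketch below checks whether two byte values could form a Big5 character. The function name isBig5Pair() is an assumption made for this example, and the check covers only the byte ranges given above, not whether a character is actually assigned to the code.

```php
<?php
// Check a high/low byte pair against the Big5 ranges given above.
// isBig5Pair() is an illustrative name; it tests ranges only.
function isBig5Pair(int $high, int $low): bool
{
    $highOk = $high >= 0xA1 && $high <= 0xF9;
    $lowOk  = ($low >= 0x40 && $low <= 0x7E)
           || ($low >= 0xA1 && $low <= 0xFE);
    return $highOk && $lowOk;
}

var_dump(isBig5Pair(0xA4, 0x61)); // bool(true)
var_dump(isBig5Pair(0xA4, 0x7F)); // bool(false): 0x7F is between the two low-byte ranges
var_dump(isBig5Pair(0x41, 0x61)); // bool(false): 0x41 is a single-byte ASCII letter
```

Note that the first low-byte range, 0x40-0x7E, overlaps printable ASCII, so a program scanning Big5 text has to track whether it is looking at a high byte or a low byte.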

The character types assigned to each code range are as follows: 0xA140-0xA3BF holds punctuation, Greek letters, and special symbols, and within it 0xA259-0xA261 holds the characters for two-syllable units of weights and measures: 兙 兛 兞 兝 兡 兣 嗧 瓩 糎; 0xA440-0xC67E holds the commonly used Chinese characters, sorted first by stroke count and then by radical; 0xC940-0xF9D5 holds the less commonly used Chinese characters, also sorted first by stroke count and then by radical.

4. Limitations of Big5

Although Big5 contains more than 10,000 characters, it does not take into account characters in everyday circulation for personal names, place names, dialects, chemistry, biology, and similar fields, and it does not contain Japanese hiragana or katakana.

For example, Taiwan treats "着" as a variant form of "著", so "着" is not included. Some radical characters from the Kangxi Dictionary (such as "亠", "疒", "辵", and "癶") and some characters common in personal names (such as "堃", "煊", and "喆") are also missing from Big5.

GB18030 Character Set

1. The origin of the name

The full name of GB 18030 is GB 18030-2000, "Chinese character coded character set for information interchange: extension of the basic set". It is a national encoding standard issued by the Chinese government on March 17, 2000, and software products released in China after August 31, 2001 must conform to it.

2. Features

The GB 18030 standard was issued after extensive participation and review by well-known information technology companies from China and abroad, and it was put into effect jointly by the Ministry of Information Industry and the former State Bureau of Quality and Technical Supervision.

The GB 18030 standard solves the problem of encoding a large character set made up of Chinese characters, Japanese kana, Korean script, and the scripts of China's ethnic minorities. Its total encoding space exceeds 1.5 million code positions, and it contains 27,484 Chinese characters, covering Chinese, Japanese, Korean, and Chinese minority languages. It meets the needs of mainland China, Hong Kong, Taiwan, Japan, Korea, and other parts of East Asia for a multilingual, large-repertoire, multi-purpose, unified encoding format. It is compatible with Unicode 3.0 and incorporates the "CJK Unified Ideographs Extension A" block from Unicode's extended repertoire, and it is also compatible with the earlier national character encoding standards (GB2312, GB13000.1).

3. Coding method

The GB 18030 standard encodes characters in three ways: single-byte, double-byte, and four-byte. The single-byte part uses the codes 0x00 to 0x7F (corresponding to ASCII). In the double-byte part, the first byte ranges from 0x81 to 0xFE, and the second byte ranges from 0x40 to 0x7E and from 0x80 to 0xFE. The four-byte part takes the values 0x30 to 0x39, which GB/T 11383 does not use, as a suffix that extends the double-byte encoding; this yields four-byte codes in the range 0x81308130 to 0xFE39FE39. The first and third bytes range from 0x81 to 0xFE, and the second and fourth bytes range from 0x30 to 0x39.
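The three forms can be told apart by looking at the first two bytes of a sequence. The PHP sketch below shows one way to do that; the function name gb18030SequenceLength() is invented for this example, and it checks only the byte ranges listed above, not whether the code is assigned to a character.

```php
<?php
// Return the length in bytes (1, 2, or 4) of the GB 18030 sequence starting
// at $offset, or 0 if the bytes do not fit any of the ranges described above.
// gb18030SequenceLength() is a hypothetical helper; it tests ranges only.
function gb18030SequenceLength(string $bytes, int $offset = 0): int
{
    $n = strlen($bytes);
    if ($offset < 0 || $offset >= $n) {
        return 0;
    }
    $b1 = ord($bytes[$offset]);
    if ($b1 <= 0x7F) {
        return 1;                                   // single byte (ASCII)
    }
    if ($b1 < 0x81 || $b1 > 0xFE || $offset + 1 >= $n) {
        return 0;                                   // invalid lead byte or truncated
    }
    $b2 = ord($bytes[$offset + 1]);
    if (($b2 >= 0x40 && $b2 <= 0x7E) || ($b2 >= 0x80 && $b2 <= 0xFE)) {
        return 2;                                   // double-byte form
    }
    if ($b2 >= 0x30 && $b2 <= 0x39 && $offset + 3 < $n) {
        $b3 = ord($bytes[$offset + 2]);
        $b4 = ord($bytes[$offset + 3]);
        if ($b3 >= 0x81 && $b3 <= 0xFE && $b4 >= 0x30 && $b4 <= 0x39) {
            return 4;                               // four-byte form
        }
    }
    return 0;
}

echo gb18030SequenceLength("A"), "\n";                 // 1
echo gb18030SequenceLength("\xB0\xA1"), "\n";          // 2
echo gb18030SequenceLength("\x81\x30\x81\x30"), "\n";  // 4
```

Because the second byte of a four-byte code is always an ASCII digit (0x30-0x39), which can never be the second byte of a double-byte code, the two forms cannot be confused.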

4. Contents

The double-byte part mainly contains all 20,902 CJK ideographs of GB 13000.1, the related punctuation marks, 13 ideographic description characters, 80 supplementary Chinese characters and radicals/components, the double-byte encoded euro sign, and so on. The four-byte part contains all characters of GB 13000.1 other than the double-byte characters above, including CJK Unified Ideographs Extension A.

Unicode Character Set

1. The origin of the name

The Unicode character set is described here as short for Universal Multiple-Octet Coded Character Set. It is a character encoding system developed by an organization called the Unicode Consortium, and it supports the interchange, processing, and display of written text in the languages of the world today. Development of the encoding began in 1990 and it was officially published in 1994; the latest version when this article was written was Unicode 4.1.0, dated March 31, 2005.

2. Features

Unicode is a character encoding used on computers. It assigns a uniform and unique binary code to every character in every language, to meet the requirements of cross-language, cross-platform text conversion and processing.

3. Coding method

Unicode codes are always written in hexadecimal, preceded by the prefix "U+". For example, the code of the letter "A" is 0041 (hexadecimal) and the code of the character "€" is 20AC (hexadecimal), so the code for "A" is written "U+0041".
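As a small sketch of this notation, the PHP functions below convert between an integer code and its "U+" form. The names formatCodePoint() and parseCodePoint() are made up for this example.

```php
<?php
// Format an integer code as "U+" followed by at least four hex digits.
// formatCodePoint() and parseCodePoint() are illustrative names only.
function formatCodePoint(int $code): string
{
    return sprintf('U+%04X', $code);
}

// Parse a string such as "U+20AC" back into an integer.
function parseCodePoint(string $notation): int
{
    if (preg_match('/^U\+([0-9A-Fa-f]{4,6})$/', $notation, $m) !== 1) {
        throw new InvalidArgumentException('expected a form like U+0041');
    }
    return (int) hexdec($m[1]);
}

echo formatCodePoint(0x41), "\n";       // U+0041
echo formatCodePoint(0x20AC), "\n";     // U+20AC
echo parseCodePoint('U+20AC'), "\n";    // 8364
```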

4. UTF-8 encoding

UTF-8 is one of the encoding forms of Unicode. UTF stands for Unicode Transformation Format, that is, a format into which Unicode codes are transformed.

UTF-8 makes it easier to exchange text in different languages and encodings between computers over a network, and it allows Unicode text to be transmitted correctly through existing systems that were designed for single-byte characters.

UTF-8 stores Unicode characters in a variable number of bytes: ASCII letters continue to use 1 byte; accented letters, Greek letters, and Cyrillic letters use 2 bytes; commonly used Chinese characters use 3 bytes; and supplementary-plane characters use 4 bytes.
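To make the 1-, 2-, 3-, and 4-byte cases concrete, here is a minimal PHP sketch that encodes a single Unicode code into UTF-8 by hand. The function name utf8EncodeCodePoint() is an assumption for this example; in real code the mbstring or iconv extensions would normally do this work.

```php
<?php
// Encode one Unicode code (an integer) as a UTF-8 byte string.
// utf8EncodeCodePoint() is a hypothetical helper written for illustration.
function utf8EncodeCodePoint(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('not a valid Unicode character code');
    }
    if ($cp <= 0x7F) {            // 1 byte: ASCII
        return chr($cp);
    }
    if ($cp <= 0x7FF) {           // 2 bytes: e.g. Greek and Cyrillic letters
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp <= 0xFFFF) {          // 3 bytes: e.g. common Chinese characters
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))  // 4 bytes: supplementary planes
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8EncodeCodePoint(0x41)), "\n";     // 41
echo bin2hex(utf8EncodeCodePoint(0x3B1)), "\n";    // ceb1
echo bin2hex(utf8EncodeCodePoint(0x554A)), "\n";   // e5958a
echo bin2hex(utf8EncodeCodePoint(0x1F600)), "\n";  // f09f9880
```

The four sample values are the letter "A", the Greek letter alpha, the Chinese character "啊", and one supplementary-plane character.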

5. UTF-16 and UTF-32 encoding

UTF-32, UTF-16, and UTF-8 are all character encoding schemes for the Unicode coded character set. UTF-16 encodes each Unicode code point as a sequence of one or two unsigned 16-bit code units; UTF-32 represents each Unicode code point as a 32-bit integer with the same value.
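The "one or two 16-bit code units" rule can be sketched in PHP as below. The function name utf16CodeUnits() is made up for this example, and the two-unit case uses the standard surrogate-pair arithmetic, which the text above does not spell out. The UTF-32 line uses pack('N', ...), which writes a 32-bit big-endian integer.

```php
<?php
// Return the UTF-16 code units (as integers) for one Unicode code point.
// utf16CodeUnits() is a hypothetical helper written for illustration.
function utf16CodeUnits(int $cp): array
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('not a valid Unicode character code');
    }
    if ($cp <= 0xFFFF) {
        return [$cp];                       // one 16-bit code unit
    }
    $v = $cp - 0x10000;                     // 20 bits remain
    return [
        0xD800 | ($v >> 10),                // high surrogate: top 10 bits
        0xDC00 | ($v & 0x3FF),              // low surrogate: bottom 10 bits
    ];
}

printf("%04X\n", utf16CodeUnits(0x20AC)[0]);           // 20AC
vprintf("%04X %04X\n", utf16CodeUnits(0x1F600));       // D83D DE00

// UTF-32 stores the code point value itself in 32 bits (big-endian here).
echo bin2hex(pack('N', 0x20AC)), "\n";                 // 000020ac
```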

How to fix garbled text in PHP pages

1) Use the <meta> tag to set the page encoding

This tag, which has the form <meta http-equiv="Content-Type" content="text/html; charset=xxx">, tells the client's browser which character set to use when displaying the page. xxx can be GB2312, GBK, UTF-8, and so on (note that MySQL differs here: it spells the name utf8). Most pages can therefore use this method to tell the browser which encoding to use, so that an encoding mismatch does not produce garbled text. Sometimes, however, it seems to have no effect: whatever xxx is, the browser always uses the same encoding. The reason is discussed below.

Note that the <meta> tag is part of the HTML content. It is only a declaration, and it takes effect only after the server has delivered the HTML to the browser.
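A minimal PHP sketch of emitting such a declaration follows. It assumes the tag in question is the usual Content-Type <meta> declaration; the helper name metaCharsetTag() is made up for this example.

```php
<?php
// Build the <meta> declaration for a given charset name.
// metaCharsetTag() is a hypothetical helper, not part of PHP.
function metaCharsetTag(string $charset): string
{
    // Escape the value so an unexpected charset string cannot break the tag.
    return '<meta http-equiv="Content-Type" content="text/html; charset='
         . htmlspecialchars($charset, ENT_QUOTES)
         . '">';
}

echo metaCharsetTag('UTF-8'), "\n";
// <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
```

The tag should appear early in the page's head section, and the file itself has to be saved in the encoding it declares; the declaration does not convert anything.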

2) header("Content-Type: text/html; charset=xxx");

The header() function sends the string in its parentheses as an HTTP header. With the content shown above, it does essentially the same job as the <meta> tag; compare the two and you will see that the text is almost the same. The difference is that with this function the browser will always use the xxx encoding you asked for, which makes it very useful. Why is that? It comes down to the difference between HTTP headers and HTML content:

HTTP headers are strings that the server sends to the browser over the HTTP protocol before it sends the HTML content. The <meta> tag, by contrast, is part of the HTML content, so what header() sends reaches the browser first; put simply, header() has the higher priority. If a PHP page contains both header("Content-Type: text/html; charset=xxx") and a <meta> tag, the browser honors only the HTTP header and ignores the <meta> tag. Of course, this function can only be used in PHP pages.
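A minimal PHP sketch of sending this header is shown below. The helper name contentTypeHeader() and the character check inside it are assumptions added for this example. Because HTTP headers go out before the HTML, header() has to be called before the script produces any output, which is why the call is guarded with headers_sent().

```php
<?php
// Build the Content-Type header line for a given charset name.
// contentTypeHeader() is a hypothetical helper, not part of PHP.
function contentTypeHeader(string $charset): string
{
    // Assumption for this sketch: accept only characters that normally
    // occur in charset names, so nothing unexpected ends up in the header.
    if (preg_match('/^[A-Za-z0-9._:-]+$/', $charset) !== 1) {
        throw new InvalidArgumentException('unexpected charset name');
    }
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any page output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader('UTF-8'));
}
```

After this call, the page body can be output as usual; the bytes sent should actually be in the declared encoding.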

One question remains: why does the former always work while the latter sometimes does not? The reason is Apache, which is the subject of the next point.

3) AddDefaultCharset

The conf folder under the Apache root directory contains Apache's main configuration file, httpd.conf.

Open httpd.conf with a text editor. Around line 708 (this varies between versions) there is a line AddDefaultCharset xxx, where xxx is the name of an encoding. This line sets the character set in the HTTP header of every page file on the server to xxx by default. With this line in place, it is as if header("Content-Type: text/html; charset=xxx") had been added to every file. This explains why a page that is clearly set to UTF-8 may still always be displayed by the browser as gb2312.

If the page calls header("Content-Type: text/html; charset=xxx"), the default character set is replaced by the one you specify, so this function always works. If you put a "#" in front of AddDefaultCharset xxx to comment the line out, and the page does not call header("Content-Type ..."), then the <meta> tag takes effect.

The order of precedence, from highest to lowest, is:

.. header("Content-Type: text/html; charset=xxx")

.. AddDefaultCharset xxx

.. <meta http-equiv="Content-Type" content="text/html; charset=xxx">

If you are a web programmer, it is recommended that you call header("Content-Type: text/html; charset=xxx") on every page, so that the page displays correctly on any server and is more portable.

4) The default_charset setting in php.ini

The line default_charset = "gb2312" in php.ini defines PHP's default character set. It is generally recommended to comment this line out, so that the browser selects the encoding automatically from the charset declared in the page's header rather than having one imposed on it. That way a single server can provide web services in multiple languages.
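The current value of this setting can be inspected from a script with ini_get(). The sketch below shows one possible way to let a page-level choice take priority over the php.ini default; the function name pageCharset() and the fallback rule are assumptions for this example, not something PHP does on its own.

```php
<?php
// Pick the charset for a page: an explicit per-page choice wins,
// otherwise fall back to whatever default_charset is configured.
// pageCharset() is a hypothetical helper written for illustration.
function pageCharset(?string $explicit = null): string
{
    if ($explicit !== null && $explicit !== '') {
        return $explicit;
    }
    return (string) ini_get('default_charset');
}

// Show the configured default (the value depends on the server's php.ini).
echo 'default_charset: ', var_export(ini_get('default_charset'), true), "\n";

// A page that needs a specific encoding states it explicitly.
echo pageCharset('UTF-8'), "\n"; // UTF-8
```

If pageCharset() returns an empty string, neither a page-level choice nor a php.ini default is present, and the page can rely on its own header() call or <meta> tag instead.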
