Techniques for Resolving Chinese Encoding Problems in PHP Development

Source: Internet
Author: User
Tags php programming

Chinese encoding problems in PHP programming have troubled many people, yet the cause is quite simple. Each country (or region) has defined its own character encoding set for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. As the basis of information processing in its country or region, each character encoding set plays an important role in unifying encoding. By length, character encoding sets fall into two broad categories: SBCS (single-byte character set) and DBCS (double-byte character set). Early software (especially operating systems) handled local character data by shipping various localized versions (L10N), and concepts such as LANG and Codepage were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining each localized version independently is costly. It therefore became necessary to extract what the localization work has in common and handle it consistently, keeping locale-specific processing to a minimum. This is what is called internationalization (I18N). The various kinds of language information were further standardized as Locale information, and the underlying character set became Unicode, which covers almost all characters.

Most software with internationalization features now bases its core character processing on Unicode. At run time, the local character encoding is determined from the Locale/LANG/Codepage settings, and local characters are processed accordingly. During processing, the software must convert between Unicode and the local character set, or even between two different local character sets with Unicode as the intermediate form. The same approach extends to networked environments: the character data at each end of the connection must be converted into an acceptable form according to the character set configured there.
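As a minimal sketch of such a conversion, the example below uses PHP's iconv() to convert a UTF-8 string to GBK and back. It assumes the iconv extension is available and that the underlying iconv library supports GBK. The sample bytes are written as escapes so that the result does not depend on how the source file itself is saved.

```php
<?php
// "中文" as UTF-8 bytes (3 bytes per character).
$utf8 = "\xE4\xB8\xAD\xE6\x96\x87";

// UTF-8 -> GBK: each of these characters becomes 2 bytes.
$gbk = iconv('UTF-8', 'GBK', $utf8);

// GBK -> UTF-8: converting back should restore the original bytes.
$roundTrip = iconv('GBK', 'UTF-8', $gbk);

echo bin2hex($utf8), "\n";   // e4b8ade69687
echo bin2hex($gbk), "\n";    // d6d0cec4
var_dump($roundTrip === $utf8);
```

Where the mbstring extension is installed, mb_convert_encoding() can be used for the same purpose.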

The problem of character set encoding in the database
Popular relational database systems support database character set encoding: when you create a database, you can specify its character set, and the data is stored in that encoding. When an application accesses the data, character set conversion takes place at both the entry and the exit. For Chinese data, the database character encoding should be set so that the integrity of the data is preserved. GB2312, GBK, and UTF-8 are all viable database character sets. ISO8859-1 (8-bit) is also possible, but then the program must split each 16-bit Chinese character or Unicode value into two 8-bit characters before writing the data, merge the two bytes again after reading it, and also tell SBCS characters apart. We therefore do not recommend ISO8859-1 as the database character set: it fails to make use of the database's own character set support and also makes programming more complex. While programming, you can use the management tools provided by the database management system to check whether the Chinese data is stored correctly.

Before a PHP program queries the database, it should first execute mysql_query("SET NAMES xxxx"), where xxxx corresponds to the encoding of your page (charset=xxxx). If the page uses charset=utf-8, then xxxx=utf8 (MySQL spells the name without the hyphen); if the page uses charset=gb2312, then xxxx=gb2312. Almost every web program keeps its common database connection code in a single file, so adding mysql_query("SET NAMES xxxx") to that file is enough. (The mysql_* functions were removed in PHP 7; the same statement can be sent through mysqli or PDO.)

SET NAMES indicates which character set the client uses in the SQL statements it sends. The statement SET NAMES 'utf8' therefore tells the server that "future information from this client will be in the character set utf8". It also specifies the character set for the results that the server sends back to the client (for example, the character set of the column values returned by a SELECT statement).
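The mapping from a page charset to a MySQL character set name can be sketched as follows. The helper names page_charset_to_mysql() and use_page_charset() are hypothetical and the list of supported charsets is only an illustration. The second function assumes a mysqli connection and relies on mysqli::set_charset(), which sets the connection character set in a way comparable to SET NAMES.

```php
<?php
/**
 * Hypothetical helper: map a page charset label (as written in HTML or
 * HTTP headers) to the character set name that MySQL expects.
 */
function page_charset_to_mysql(string $pageCharset): string
{
    $map = [
        'utf-8'  => 'utf8',   // MySQL spells it without the hyphen;
                              // newer servers also offer utf8mb4
        'gb2312' => 'gb2312',
        'gbk'    => 'gbk',
    ];
    $key = strtolower(trim($pageCharset));
    if (!isset($map[$key])) {
        throw new InvalidArgumentException("Unsupported charset: $pageCharset");
    }
    return $map[$key];
}

/**
 * Hypothetical helper: apply the page charset to an open connection.
 * Intended to be called once, right after connecting.
 */
function use_page_charset(mysqli $db, string $pageCharset): bool
{
    return $db->set_charset(page_charset_to_mysql($pageCharset));
}

echo page_charset_to_mysql('UTF-8'), "\n";   // utf8
echo page_charset_to_mysql('gb2312'), "\n";  // gb2312
```

Placing the call to use_page_charset() in the shared connection file keeps the page encoding and the connection encoding in one place.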

Techniques used to locate problems
The crudest yet most effective way to locate a Chinese encoding problem is to print the internal code (the byte values) of the string right after the code you suspect. By printing the internal code of the string, you can find out when Chinese characters are converted to Unicode, when Unicode is converted back into Chinese characters, when one Chinese character turns into two Unicode characters, when a Chinese string is converted into a string of question marks, and when the high-order bits of a Chinese string are truncated.
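In PHP, the internal code of a string can be printed with bin2hex(). The sketch below wraps it in a hypothetical helper, dump_bytes(), and applies it to the same character in UTF-8 and in GBK, and to the question marks left behind by a lossy conversion.

```php
<?php
/** Hypothetical debugging helper: show a string's bytes as spaced hex. */
function dump_bytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

$utf8 = "\xE4\xB8\xAD";  // "中" encoded as UTF-8
$gbk  = "\xD6\xD0";      // "中" encoded as GBK
$lost = "??";            // what remains after a lossy conversion

echo dump_bytes($utf8), "\n";  // e4 b8 ad
echo dump_bytes($gbk), "\n";   // d6 d0
echo dump_bytes($lost), "\n";  // 3f 3f
```

Calling such a helper before and after each suspected step shows exactly where the bytes change.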

Taking the appropriate sample string also helps to distinguish between the types of problems. such as: "AA ah AA" @aa "and other medium and English, GB, GBK character characters are all strings. In general, no matter how the English character is converted or processed, it does not distort (if you do, you can try to increase the length of consecutive letters).

Solving garbled text in various applications

1) Use the <meta http-equiv="Content-Type" content="text/html; charset=xxx"> tag to set the page encoding
This tag tells the client browser which character set to use when displaying the page. xxx can be GB2312, GBK, UTF-8 (note the difference from MySQL, which spells it utf8), and so on. Most pages can therefore use this tag to tell the browser which encoding to use, so that no encoding error produces garbled text. Sometimes, however, this has no effect: whatever xxx is, the browser always uses the same encoding. I will come back to this later.

Note that <meta> belongs to the HTML content and is only a declaration; by the time it takes effect, the server has already delivered the HTML to the browser.

2) header("Content-Type: text/html; charset=xxx");
The header() function sends the string inside the parentheses as an HTTP header. With the content shown above, it does essentially the same job as the <meta> tag, and the two even look alike. The difference is that with this function the browser always uses the xxx encoding you requested and never ignores it, which makes the function very useful. Why is that? It comes down to the difference between HTTP headers and HTML content:

HTTP headers are strings that the server sends, over the HTTP protocol, before it sends the HTML content to the browser. The <meta> tag, by contrast, is part of the HTML content, so what header() sends reaches the browser first. Put simply, header() has higher priority than <meta> (I am not sure whether that is the right way to put it). If a PHP page contains both header("Content-Type: text/html; charset=xxx") and a <meta> tag, the browser honors only the HTTP header and ignores the <meta> tag. Of course, this function can only be used in PHP pages.
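A minimal sketch of sending the header from PHP follows. The helper content_type_header() is a hypothetical name used only for illustration, and UTF-8 is just an example value. Because HTTP headers must be sent before any page output, the call is guarded with headers_sent().

```php
<?php
/** Hypothetical helper: build the Content-Type header line for a charset. */
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before the first byte of output has been sent.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}
```

Putting this at the very top of a page, before any HTML or whitespace, avoids the "headers already sent" warning.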

One question remains: why is the former always effective, while the latter sometimes is not? The reason is Apache, which is the next point.

3) AddDefaultCharset
The conf folder under the Apache root directory contains httpd.conf, the main Apache configuration file.

Open httpd.conf with a text editor. Around line 708 (this varies between versions) there is a line AddDefaultCharset xxx, where xxx is the name of an encoding. This line sets the character set in the HTTP header of every page file served by the server to xxx. Having this line is equivalent to adding header("Content-Type: text/html; charset=xxx") to every file. This explains why the browser may always use gb2312 even though the <meta> tag clearly says utf-8.

If a page calls header("Content-Type: text/html; charset=xxx"), the default character set is replaced by the one you set, which is why this function always works. If you put a "#" in front of AddDefaultCharset xxx to comment the line out, and the page does not call header("Content-Type: ..."), then the <meta> tag takes effect.

The order of precedence, from highest to lowest, is:
.. header("Content-Type: text/html; charset=xxx")
.. AddDefaultCharset xxx
.. <meta http-equiv="Content-Type" content="text/html; charset=xxx">
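The precedence above can be expressed as a small lookup. This is only a simplified model of the rule as stated, not a description of how browsers or Apache are implemented; effective_charset() and its parameters are hypothetical names, with null meaning "not set".

```php
<?php
/**
 * Simplified model: return the charset the browser ends up using,
 * given what each of the three mechanisms declares (null = not set).
 */
function effective_charset(?string $phpHeader, ?string $apacheDefault, ?string $metaTag): ?string
{
    // The first non-null value wins, in order of precedence.
    return $phpHeader ?? $apacheDefault ?? $metaTag;
}

// header() present: it wins over everything else.
echo effective_charset('utf-8', 'gb2312', 'gbk'), "\n";  // utf-8
// No header(), AddDefaultCharset active: <meta> is ignored.
echo effective_charset(null, 'gb2312', 'utf-8'), "\n";   // gb2312
// Neither header() nor AddDefaultCharset: <meta> takes effect.
echo effective_charset(null, null, 'utf-8'), "\n";       // utf-8
```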

If you are a web programmer, it is recommended that you add header("Content-Type: text/html; charset=xxx") to every page, so that it displays correctly on any server and is more portable.

4) The default_charset setting in php.ini:
The line default_charset = "gb2312" in php.ini defines PHP's default character set. It is generally recommended to comment this line out, so that the browser selects the encoding from the charset declared in the page header rather than having one forced on it. This allows a single server to provide web services in multiple languages.
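The value PHP is actually using can be checked at run time. The sketch below reads default_charset with ini_get() and overrides it for the current script only with ini_set(). The choice of GBK is purely an illustration, and the value read first depends on the local php.ini.

```php
<?php
// Value configured in php.ini (or PHP's built-in default if unset).
$configured = ini_get('default_charset');

// Override for this script only; php.ini itself is not modified.
ini_set('default_charset', 'GBK');
$current = ini_get('default_charset');

echo "configured: ", $configured, "\n";
echo "current: ", $current, "\n";
```

A per-script override like this lets pages in different encodings coexist on one server without editing php.ini.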

Conclusion
In fact, Chinese encoding in PHP development is not as complicated as it seems. Although there is no fixed recipe for locating and solving a problem, and operating environments differ, the underlying principle is the same. Understanding character sets is the foundation for solving character problems. However, as Chinese character sets continue to change, problems in Chinese text processing, and not only in PHP programming, will persist for some time.
