Chinese encoding problems in PHP programming have troubled many people, but the root cause is simple. Each country or region has defined its own character encoding set for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. As the basis of information processing in its country or region, each character encoding set provides a unified encoding. By length, character encoding sets are divided into SBCS (single-byte character sets) and DBCS (double-byte character sets). Early software, operating systems in particular, shipped in various localized versions (L10N) to handle local text, and concepts such as LANG and codepage were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining each localized version separately is costly.
It therefore became necessary to extract what the localization work has in common and handle it uniformly, so that locale-specific processing is kept to a minimum. This is called internationalization (I18N). Language-specific information is further standardized as locale information, and the underlying character set used for processing became Unicode, which covers almost all characters in use.
Most software with internationalization features now processes characters internally in Unicode. At run time it determines the local character encoding from the locale/LANG/codepage settings and handles local characters accordingly. During processing, text must be converted between Unicode and the local character set, or even between two different local character sets with Unicode as the intermediate form. The same approach extends to network environments: character data at each end of a connection must be converted into a form the other end accepts, according to its character set settings.
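As a rough illustration of converting between Unicode and a local character set, the sketch below uses PHP's mbstring extension (assumed to be installed) to convert the same text between UTF-8 and GBK. The helper name convertCharset is hypothetical and only wraps mb_convert_encoding().

```php
<?php
// Hypothetical helper: convert text from one encoding to another.
// Assumes the mbstring extension is available.
function convertCharset(string $text, string $from, string $to): string
{
    return mb_convert_encoding($text, $to, $from);
}

// "中文" written as explicit UTF-8 bytes, so the example does not
// depend on the encoding this source file is saved in.
$utf8 = "\xE4\xB8\xAD\xE6\x96\x87";
$gbk  = convertCharset($utf8, 'UTF-8', 'GBK');

echo bin2hex($utf8), "\n"; // e4b8ade69687 (three bytes per character)
echo bin2hex($gbk), "\n";  // d6d0cec4     (two bytes per character)
```

The same characters end up as different byte sequences, which is why both sides of any exchange have to agree on the encoding in use.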
Character set encoding problems in the database
Popular relational database systems support a database character set encoding: the character set can be specified when the database is created, and the data is stored in that encoding. When an application accesses the data, a character set conversion takes place on both the way in and the way out. For Chinese data, the database character set must be chosen so that data integrity is preserved. GB2312, GBK, and UTF-8 are all possible choices. ISO-8859-1 (8-bit) could also be chosen, but then the program has to split each 16-bit Chinese character or Unicode value into two 8-bit characters before writing it, recombine the two bytes after reading it, and also tell the SBCS characters apart. For that reason ISO-8859-1 is not recommended as the database character set: it makes no use of the database's own character set support, and it increases the complexity of the program. While developing, you can use the management tools provided by the database management system to check whether the Chinese data is stored correctly.
Before a PHP program queries the database, it should first execute mysql_query("SET NAMES xxxx"); where xxxx is the encoding of your page (charset=xxxx). If the page uses charset=utf-8, then xxxx is utf8; if the page uses charset=gb2312, then xxxx is gb2312. Almost every web program keeps its database connection code in one shared file, so it is enough to add mysql_query("SET NAMES xxxx") to that file.
SET NAMES indicates which character set the client uses for the SQL statements it sends. The statement SET NAMES 'utf8' therefore tells the server that "the information coming from this client will be in character set utf8". It also specifies the character set the server uses for the results it sends back to the client (for example, the character set of the column values returned by a SELECT statement).
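The shared connection file described above might look roughly like the following sketch. It makes two assumptions that are not in the original text: it uses the mysqli extension (the mysql_* functions were removed in PHP 7.0), and it calls mysqli::set_charset(), which sets the connection character set in the same way SET NAMES does. The function names mysqlCharsetFor and openConnection, and all connection parameters, are hypothetical.

```php
<?php
// Hypothetical helper: map a page charset label to the name MySQL expects.
function mysqlCharsetFor(string $pageCharset): string
{
    $map = [
        'utf-8'  => 'utf8',   // MySQL spells it without the hyphen
        'gb2312' => 'gb2312',
        'gbk'    => 'gbk',
    ];
    $key = strtolower(trim($pageCharset));
    if (!isset($map[$key])) {
        throw new InvalidArgumentException("Unsupported page charset: $pageCharset");
    }
    return $map[$key];
}

// Hypothetical shared connection function. Host, user, password and
// database name are placeholders supplied by the caller.
function openConnection(
    string $host,
    string $user,
    string $pass,
    string $db,
    string $pageCharset
): mysqli {
    $conn = new mysqli($host, $user, $pass, $db);
    // Set the connection character set, as SET NAMES would.
    $conn->set_charset(mysqlCharsetFor($pageCharset));
    return $conn;
}
```

Keeping the mapping in one place means the page charset and the connection charset cannot drift apart silently: an unknown label fails loudly instead of producing garbled data.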
Common tips for locating problems
The crudest but usually most effective way to locate a Chinese encoding problem is to print the internal code (the raw byte values) of a string at each point where you suspect the program. By printing the internal code you can find out when the Chinese characters are converted to Unicode, when Unicode is converted back to the Chinese internal code, when one Chinese character turns into two Unicode characters, when a Chinese string is turned into a string of question marks, and when the high bit of each byte in a Chinese string is stripped.
Choosing a suitable sample string also helps to tell different types of problems apart, for example a string that mixes English letters with Chinese characters, including characters that exist in GB2312 and characters that exist only in GBK. In general, English characters are not corrupted however they are converted or processed (if you do run into that, try making the runs of consecutive English letters longer).
Solving garbled text in various applications
1) Use the <meta http-equiv="Content-Type" content="text/html; charset=xxx"> tag to set the page encoding
The purpose of this tag is to declare which character set encoding the client's browser should use to display the page. xxx can be GB2312, GBK, UTF-8 (note that MySQL spells it utf8), and so on. Most pages can use this method to tell the browser which encoding to use, which avoids encoding errors and garbled text. Sometimes, however, the browser always uses the same encoding whether or not this tag is present and whatever xxx is. That situation is discussed later.
Note that <meta> is part of the HTML content: it is only a declaration inside the HTML that the server sends to the browser.
2) header("Content-Type: text/html; charset=xxx");
The header() function sends the string in parentheses as an HTTP header. If that string carries the same content as the <meta> tag, its effect is basically the same as method 1. The difference is that with header() the browser always uses the xxx encoding you asked for, which makes this function very useful. Why? The answer lies in the difference between HTTP headers and HTML content:
An HTTP header is a string that the server sends, as part of the HTTP protocol, before it sends the HTML content to the browser. The <meta> tag is part of the HTML content, so what header() sends reaches the browser first. Put loosely, header() has higher priority than <meta>. If a PHP page contains both header("Content-Type: text/html; charset=xxx") and a <meta> tag, the browser honors only the HTTP header and ignores the <meta> declaration. Of course, this function can only be used in PHP pages.
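A page that uses both mechanisms consistently could be sketched as follows. The helper name contentTypeHeader and the choice of UTF-8 are assumptions made for the example.

```php
<?php
// Hypothetical helper: build the Content-Type header line for a charset.
function contentTypeHeader(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

$charset = 'UTF-8'; // assumption: the page itself is saved as UTF-8

// header() only works before any output has been sent.
if (!headers_sent()) {
    header(contentTypeHeader($charset));
}

// The <meta> tag repeats the same value inside the HTML content.
echo '<meta http-equiv="Content-Type" content="text/html; charset='
    . $charset . '">' . "\n";
```

Deriving both declarations from one variable keeps the HTTP header and the <meta> tag from contradicting each other.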
This still leaves the question of why the HTTP header always takes effect while the <meta> tag sometimes does not. That brings us to Apache.
3) AddDefaultCharset
The conf folder under the Apache root directory contains httpd.conf, the main Apache configuration file.
Open httpd.conf in a text editor. Around line 708 (the line number varies between versions) there is a line AddDefaultCharset xxx, where xxx is the name of an encoding. This line sets the character set in the HTTP header of every page file served by the server to xxx by default. Having this line is equivalent to adding header("Content-Type: text/html; charset=xxx") to every file. This explains why the browser may always use gb2312 even though <meta> clearly specifies utf-8.
If a page calls header("Content-Type: text/html; charset=xxx"), the default character set is replaced by the one you set, which is why that function always works. If you comment out AddDefaultCharset xxx by putting a "#" in front of it, and the page does not call header("Content-Type..."), then the <meta> tag takes effect.
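For reference, the relevant part of httpd.conf might look like the fragment below. The charset value is only an example; the exact location of the line depends on the Apache version and distribution.

```apache
# Add "charset=UTF-8" to the Content-Type header of text/html and
# text/plain responses that do not already specify a charset.
AddDefaultCharset UTF-8

# Alternatively, send no default charset, so that the page's own
# header() call or <meta> tag decides.
# AddDefaultCharset Off
```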
In order of precedence, from highest to lowest, the three methods are:
.. header("Content-Type: text/html; charset=xxx")
.. AddDefaultCharset xxx
.. <meta http-equiv="Content-Type" content="text/html; charset=xxx">
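The precedence order can be modeled as a small function. This is only a sketch of the rule as this article describes it, not of how a browser or Apache actually decides. The function name effectiveCharset and its parameters are hypothetical; null means "not set".

```php
<?php
// Hypothetical model of the precedence order: the first source that is
// set wins; null or an empty string means the source is not set.
function effectiveCharset(
    ?string $phpHeader,      // charset passed to header() in the page
    ?string $apacheDefault,  // charset from AddDefaultCharset
    ?string $metaTag         // charset declared in the <meta> tag
): ?string {
    foreach ([$phpHeader, $apacheDefault, $metaTag] as $candidate) {
        if ($candidate !== null && $candidate !== '') {
            return $candidate;
        }
    }
    return null; // nothing declared: the browser has to guess
}

echo effectiveCharset(null, 'gb2312', 'utf-8'), "\n";    // gb2312
echo effectiveCharset('utf-8', 'gb2312', 'gbk'), "\n";   // utf-8
echo effectiveCharset(null, null, 'utf-8'), "\n";        // utf-8
```

The first call reproduces the situation described earlier: the <meta> tag says utf-8, but the server default gb2312 wins because no header() call overrides it.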
If you are a web programmer, it is advisable to add header("Content-Type: text/html; charset=xxx") to every page, so that the page displays correctly on any server and is more portable.
4) The default_charset setting in php.ini:
The line default_charset = "gb2312" in php.ini defines PHP's default character set. It is generally recommended to comment this line out, so that the browser picks the encoding from the charset declared in each page's header instead of having one forced on it. This way the same server can host web services in multiple languages.
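To see which value is currently in effect, a page can read the setting at run time. The sketch below uses ini_get(); the helper name describeDefaultCharset and the wording of its messages are hypothetical.

```php
<?php
// Hypothetical helper: explain what a default_charset value implies.
function describeDefaultCharset(string $value): string
{
    if ($value === '') {
        return 'default_charset is empty: each page must declare its own charset';
    }
    return 'PHP adds charset=' . $value . ' to its default Content-Type header';
}

// ini_get() returns the current value of the php.ini setting.
echo describeDefaultCharset((string) ini_get('default_charset')), "\n";
```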
Conclusion
In fact, handling Chinese encoding in PHP development is not as complicated as it seems. Although there is no single fixed way to locate and solve these problems, and operating environments vary widely, the underlying principle is always the same. Understanding character sets is the basis for solving character problems. However, as Chinese character sets continue to change, these problems will persist for some time, and not only in PHP programming.