The Chinese encoding problem in PHP programming has plagued many people, yet its cause is actually simple. Every country or region specifies a character set for computer information exchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. As the basis for information processing in that country or region, the character set plays an important role in unifying encodings. By length, character sets are divided into SBCS (single-byte character sets) and DBCS (double-byte character sets).
To let computers process local character information, early software (operating systems in particular) was released in various localized versions (L10N), and concepts such as LANG and Codepage were introduced to tell them apart. However, the code ranges of local character sets overlap, which makes it difficult to exchange information between them, and maintaining each localized version separately is expensive. It therefore became necessary to extract what the localization work has in common and handle it uniformly, so that locale-specific processing is kept to a minimum. This is what is called internationalization (I18N). The language information was further standardized as Locale information, and the underlying character set became Unicode, which contains almost all characters.
Currently, most software with internationalization features has a Unicode-based character processing core. At run time, the software determines the local character encoding from the current Locale/LANG/Codepage settings and handles local characters accordingly. In the process, text must be converted between Unicode and the local character set, or between two different local character sets with Unicode as the hub. In a network environment the same approach is extended further: the character data at each end of a connection must be converted into an acceptable form according to the character set settings at that end.
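The conversion between two local character sets with Unicode in the middle can be sketched in PHP as follows. This is only an illustration: convertViaUnicode() is a hypothetical helper name, and it assumes the iconv extension is available and supports the GBK and BIG5 encodings.

```php
<?php
// Sketch: convert between two local character sets with Unicode (UTF-8) as the hub.
// convertViaUnicode() is a hypothetical helper name, not a PHP built-in.
function convertViaUnicode(string $bytes, string $from, string $to): string
{
    // Step 1: local character set -> Unicode (here represented as UTF-8).
    $unicode = iconv($from, 'UTF-8', $bytes);
    if ($unicode === false) {
        throw new RuntimeException("Input is not valid $from");
    }
    // Step 2: Unicode (UTF-8) -> the other local character set.
    $result = iconv('UTF-8', $to, $unicode);
    if ($result === false) {
        throw new RuntimeException("Text cannot be represented in $to");
    }
    return $result;
}

$gbk  = "\xD6\xD0";                              // the character U+4E2D encoded in GBK
$big5 = convertViaUnicode($gbk, 'GBK', 'BIG5');  // the same character as Big5 bytes
echo bin2hex($gbk), ' -> ', bin2hex($big5), "\n";
```

The same character ends up as different bytes in each local character set, which is why the bytes cannot simply be passed from one system to another without conversion.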
Character Set encoding in the database
Popular relational database systems support database character set encoding. That is, when creating a database you can specify its character set, and the data is stored in that encoding. When an application accesses the data, the encoding is converted on the way in and on the way out. For Chinese data, the database character set must be chosen so that data integrity is preserved. GB2312, GBK, and UTF-8 are all possible choices. We could also choose ISO8859-1 (8-bit), but then the program has to split each 16-bit Chinese or Unicode character into two 8-bit characters before writing the data and combine the two bytes again after reading it, while also telling SBCS characters apart. We therefore do not recommend ISO8859-1 as the database character set: it fails to make full use of the database's own character set support and also makes programming more complex. During development, you can use the management tools provided by the database management system to check whether the Chinese data is correct.
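The splitting described above can be observed directly. The following sketch assumes the iconv extension; it takes the two GBK bytes of one Chinese character, interprets them as ISO8859-1, and shows that the single character has become two, which the program must later join again.

```php
<?php
$gbk = "\xD6\xD0";  // one Chinese character (U+4E2D) stored as two GBK bytes

// Interpreting the bytes as ISO8859-1 turns the single character into two
// unrelated 8-bit characters.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $gbk);
echo iconv_strlen($asLatin1, 'UTF-8'), "\n";   // prints 2

// After reading the data back, the program has to recombine the two bytes
// itself before it can treat them as one GBK character again.
$restored = iconv('UTF-8', 'ISO-8859-1', $asLatin1);
var_dump($restored === $gbk);                  // bool(true)
```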
Before the PHP program queries the database, it should first execute mysql_query("set names xxxx"), where xxxx is the encoding of your web page (charset=xxxx). If the page declares charset=utf-8, then xxxx is utf8 (MySQL spells the name without the hyphen); if it declares charset=gb2312, then xxxx is gb2312. Almost every web application keeps its shared database connection code in a single file; add mysql_query("set names xxxx") to that file.
SET NAMES indicates the character set that the client uses in the SQL statements it sends. The statement SET NAMES utf8 therefore tells the server that "information sent from this client will be in the utf8 character set". It also specifies the character set for the results that the server sends back to the client (for example, for a SELECT statement it determines the character set of the column values).
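A minimal sketch of how the shared connection file might build the statement from the page charset. The helper names mysqlCharsetFor() and setNamesStatement() are hypothetical, and the commented-out call assumes an open mysqli connection in $db, because the mysql_* functions were removed in PHP 7.

```php
<?php
// Hypothetical helper: map the charset written in the page (charset=xxx)
// to the name MySQL expects after SET NAMES.
function mysqlCharsetFor(string $pageCharset): string
{
    $map = [
        'utf-8'  => 'utf8',   // MySQL writes the name without the hyphen
        'utf8'   => 'utf8',
        'gb2312' => 'gb2312',
        'gbk'    => 'gbk',
    ];
    $key = strtolower(trim($pageCharset));
    if (!isset($map[$key])) {
        throw new InvalidArgumentException("Unsupported page charset: $pageCharset");
    }
    return $map[$key];
}

// Hypothetical helper: build the statement to run right after connecting.
function setNamesStatement(string $pageCharset): string
{
    return 'SET NAMES ' . mysqlCharsetFor($pageCharset);
}

// In the shared connection file, assuming $db is an open mysqli connection:
// $db->query(setNamesStatement('UTF-8'));
echo setNamesStatement('UTF-8'), "\n";   // prints: SET NAMES utf8
```

Restricting the names to a fixed list also keeps arbitrary text out of the SQL statement.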
Common troubleshooting techniques
The crudest but most effective way to locate a Chinese encoding problem is to print the string's internal code (its byte values) right after each step you suspect of altering it. By printing the internal code, you can find out when Chinese characters are converted to Unicode, when Unicode is converted back to Chinese characters, when one Chinese character turns into two Unicode characters, when a Chinese string turns into a string of question marks, and when the high bits of a Chinese string are truncated.
Choosing an appropriate sample string also helps to identify the type of problem. For example, "aa, aa? @ Aa" is a string that mixes Chinese and English characters and includes both GB and GBK characters. In general, English characters are not distorted no matter how they are converted or processed (if they are, try increasing the number of consecutive English letters).
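Printing the internal code can be as simple as a hex dump of the string's bytes. The sketch below uses a hypothetical helper named dumpBytes() and assumes the iconv extension for the conversion step; calling it after each suspect step shows where the bytes change.

```php
<?php
// Hypothetical helper: render a string's bytes as space-separated hex pairs.
function dumpBytes(string $label, string $s): string
{
    $hex = implode(' ', str_split(strtoupper(bin2hex($s)), 2));
    return sprintf('%s (%d bytes): %s', $label, strlen($s), $hex);
}

$utf8 = "\xE4\xB8\xAD";                  // U+4E2D encoded in UTF-8
echo dumpBytes('utf-8 input', $utf8), "\n";
// utf-8 input (3 bytes): E4 B8 AD

$gbk = iconv('UTF-8', 'GBK', $utf8);     // a step suspected of changing the data
echo dumpBytes('after conversion', $gbk), "\n";
// after conversion (2 bytes): D6 D0

// A Chinese character that was replaced by a question mark shows up as 3F.
echo dumpBytes('lossy result', '?'), "\n";
// lossy result (1 bytes): 3F
```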
Solving garbled text in various applications
1) Use the <meta> tag to set the page encoding
<meta http-equiv="content-type" content="text/html; charset=xxx">
This tag declares to the client browser which character set to use when displaying the page. xxx can be GB2312, GBK, UTF-8 (written UTF8 in MySQL), and so on. Most pages can therefore use this method to tell the browser which encoding to use, so that the page is not decoded incorrectly and displayed as garbage. Sometimes, however, the tag seems to have no effect: whatever xxx is, the browser always uses the same encoding. The reason is explained later.
Please note that the <meta> tag is HTML information. It is only a declaration that the server passes to the browser as part of the HTML.
2) header("content-type: text/html; charset=xxx");
The header() function sends the string in the parentheses as an HTTP header. With the content shown here, it does basically the same job as the <meta> tag; compare it with the first method and you will see that the text is almost identical. The difference is that with this function the browser will always use the xxx encoding you requested and never disobey, which makes the function very useful. Why? Consider the difference between HTTP headers and HTML information:
The HTTP header is the string that the server sends before it sends the HTML information to the browser over HTTP, whereas the <meta> tag is part of the HTML information, so the content sent by header() reaches the browser first. Put simply, header() has the higher priority (I am not sure whether that is the right way to put it). If a PHP page contains both header("content-type: text/html; charset=xxx") and the <meta> tag, the browser honors the former, the HTTP header, rather than the meta tag. Of course, this function can only be used in PHP pages.
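A minimal sketch of a page that sends the header first and then repeats the same declaration in the HTML. The helper name contentTypeHeader() is hypothetical, and utf-8 is only an example value for xxx.

```php
<?php
// Hypothetical helper: build the header line for a given charset.
function contentTypeHeader(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works before any output has been sent to the browser.
if (!headers_sent()) {
    header(contentTypeHeader('utf-8'));
}

// The <meta> tag travels inside the HTML body, so it arrives after the header.
echo '<meta http-equiv="content-type" content="text/html; charset=utf-8">', "\n";
```

Keeping the two declarations identical avoids the case where the header and the meta tag disagree.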
One question remains: why does the former always work while the latter sometimes does not? The reason lies in Apache.
The conf folder in the Apache root directory contains Apache's main configuration file, httpd.conf.
Open httpd.conf in a text editor. One of its lines (the line number varies between versions) reads AddDefaultCharset xxx, where xxx is the encoding name. This line sets the character set in the HTTP header of the web pages on the server to the default xxx; in effect it adds header("content-type: text/html; charset=xxx") to every file. Now you can see why the browser may always use gb2312 even though the <meta> tag specifies UTF-8.
If the page itself calls header("content-type: text/html; charset=xxx"), the default character set is replaced by the one you set, which is why this function always works. If you put "#" in front of AddDefaultCharset xxx to comment the line out, and the page does not call header("content-type..."), then the meta tag takes effect.
The priorities described above, from highest to lowest:
1. header("content-type: text/html; charset=xxx")
2. AddDefaultCharset xxx
3. <meta http-equiv="content-type" content="text/html; charset=xxx">
If you are a web programmer, we recommend adding header("content-type: text/html; charset=xxx") to each page. The page will then display correctly on any server, which also makes it more portable.
4) The default_charset setting in php.ini:
default_charset = "gb2312" in php.ini defines PHP's default character set. It is generally recommended to comment this line out so that the browser selects the encoding from the charset in the page header instead of having one imposed, which lets a single server provide web services in multiple languages.
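To check what a given server is configured with, the setting can be read at run time, and it can also be overridden for a single request. The sketch below uses ini_get() and ini_set(); overriding per request is an assumption about how one might serve several languages from one server, not something the text above prescribes, and GBK is only an example value.

```php
<?php
// Read the value that php.ini (or PHP's built-in default) provides.
$configured = ini_get('default_charset');
echo 'configured default_charset: ', var_export($configured, true), "\n";

// Override it for the current request only; php.ini itself is not changed.
ini_set('default_charset', 'GBK');
echo 'effective default_charset: ', ini_get('default_charset'), "\n";
```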
In fact, Chinese encoding in PHP development is not as complicated as it seems. Although there is no fixed rule for locating and solving problems, and runtime environments differ, the underlying principles are the same. Understanding character sets is the basis for solving character problems. However, as Chinese character sets continue to change, Chinese information processing problems will persist for some time, and not only in PHP programming.