Solving Garbled Text in Browsers (HTML and PHP Pages)

Source: Internet
Author: User

When writing HTML or PHP code on Windows, I set my editor to save files as UTF-8, yet the browser often showed garbled characters when it opened the page, and it auto-detected the page encoding as GBK. I began to wonder why.

Why is this? I opened the file in Notepad, chose Save As, selected UTF-8 as the encoding, saved it, and opened it in the browser again. This time the page displayed correctly. So why did setting UTF-8 in my editor fail when setting UTF-8 in Notepad worked? After some analysis I found that when these editors save as UTF-8 without BOM, the browser shows garbled text and detects the page as GBK; when they save as UTF-8 with BOM, the page displays correctly.

UTF-8 does not require a BOM, although the Unicode standard permits one. UTF-8 without a BOM is therefore the standard form, and putting a BOM in UTF-8 files is mainly a Microsoft habit (by the way, Microsoft also habitually calls little-endian UTF-16 with BOM simply "Unicode", without further explanation). The BOM (byte order mark) was designed for UTF-16 and UTF-32, to mark their byte order. Microsoft uses a BOM in UTF-8 because it clearly distinguishes UTF-8 from ASCII and other encodings, but such files cause problems on operating systems other than Windows.
The only difference between UTF-8 without BOM and UTF-8 with BOM is whether the file starts with U+FEFF (in UTF-8, the bytes EF BB BF).
UTF-8 web code that will run on non-Windows operating systems should not use a BOM; otherwise errors are likely.
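To make the difference concrete, here is a minimal Python sketch (Python is my choice for illustration; nothing above depends on it). It encodes the same arbitrary sample string with Python's standard utf-8 and utf-8-sig codecs and compares the bytes.

```python
import codecs

text = "Hello, 世界"  # arbitrary sample string

without_bom = text.encode("utf-8")      # plain UTF-8
with_bom = text.encode("utf-8-sig")     # UTF-8 preceded by the BOM

# The BOM is U+FEFF encoded as UTF-8: the three bytes EF BB BF.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# The two encodings differ only by that three-byte prefix.
bom_prefix = with_bom[:3]
payload_matches = with_bom[3:] == without_bom

# Decoding with "utf-8-sig" drops the BOM if present; plain "utf-8" keeps
# it as an invisible U+FEFF character at the start of the string.
decoded_sig = with_bom.decode("utf-8-sig")
decoded_plain = with_bom.decode("utf-8")
```

A program that reads the file as plain UTF-8 and does not expect the leading U+FEFF will pass it through as data, which is where the trouble described in this article starts.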

Analysis of UTF-8 and BOM

What a BOM is will not be explained here; Wikipedia covers it in detail: http://en.wikipedia.org/wiki/byte_order_mark
Using a BOM on a web page is a mistake. The BOM was not designed for HTML or XML. To identify the text encoding, HTML has the charset attribute and XML has the encoding attribute, so there is no need to drag in a BOM. In theory a BOM could identify a UTF-16 HTML page, but few people do this in practice; after all, UTF-16 uses two bytes even for ASCII characters, which makes it a poor fit for web pages.
That said, the BOM is not simply a bad habit. It is part of the Unicode standard and has its proper uses. A BOM is typically used to mark a plain Unicode character stream, so that a text-processing program can tell which Unicode encoding (UTF-8, UTF-16BE, UTF-16LE) a .txt file uses. Windows handles the BOM well because Unicode detection is built into its text-handling APIs (this account names CreateFile(); in practice the detection sits in the text-handling layers above it), so the BOM is recognized and stripped automatically when a text file is opened. There is a historical reason for this: Windows grew up in a multi-code-page environment, and when Unicode was introduced its designers wanted Unicode and non-Unicode (multi-byte) text files to coexist without users having to care, so this small trick was the only option.
By contrast, Unix-like systems such as Linux spent far less time deployed in multi-locale environments, and the community had enough drive to move ahead without baggage (an aside: Microsoft is almost paranoid about compatibility and forbids anything that might break it, so it often ends up tying its own hands), so they simply moved to UTF-8 in one step. There was still a transition period: from the release of GTK+ 2.0, the first fully UTF-8 version, until essentially no GTK developers were still using the multi-locale GTK+ 1.2 took at least three to four years, as I recall.
The BOM is unpopular in UNIX environments because many UNIX programs simply ignore it. The main problem is the #! marker on the first line of every UNIX script, which is parsed when the script is launched and must be the very first bytes of the file. For compatibility reasons many shells do not check for a BOM, so an added BOM is treated as ordinary characters and breaks the #! marker, which is troublesome. Many modern scripting languages, Python among them, can handle a BOM in the interpreter itself, but with the launch step stuck here there is nothing they can do, and they end up taking the blame. The shell is not really at fault either, because the BOM violates a common UNIX design principle: the data in a document must be visible. A BOM cannot be edited as visible characters in a text editor, which many UNIX developers find unacceptable.
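The #! breakage can be seen at the byte level. The following is a small Python sketch, not a model of any particular shell: the script contents and the helper name starts_with_shebang are made up for illustration.

```python
import codecs

# A tiny shell script, once as plain bytes and once with a UTF-8 BOM in front.
script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script


def starts_with_shebang(data: bytes) -> bool:
    """Return True only if the very first two bytes of the file are '#!'."""
    return data[:2] == b"#!"


plain_ok = starts_with_shebang(script)            # the marker is where it must be
bom_ok = starts_with_shebang(script_with_bom)     # EF BB BF now precedes '#!'
```

With the BOM in place the file begins with EF BB BF rather than #!, so a launcher that looks only at the leading bytes no longer sees the marker.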
By the way, even where a scripting language can handle a BOM, using one everywhere is not recommended. Each scripting language has its own Unicode conventions: Python's # -*- coding: utf-8 -*- and Perl's use utf8 are simpler and more reliable than a BOM. The other good news is that people who have to switch between Windows and UNIX need not suffer, because the UNIX side has Vim. Even when a BOM gets in the way, running :set nobomb, :set fileencoding=utf8 and :w removes it.
In the end, it seems that only Windows insists on the BOM.

Character encoding is probably a nightmare for every programmer. As soon as Chinese text is involved, encoding problems of every kind appear, and they are stubborn ones, especially on Linux, where much of the software was developed for English-speaking countries and never considered other languages' encodings. After running into countless encoding problems I decided to study the subject properly, because it is like a hurdle that always stands in front of you and trips you every time you reach it. Someone who falls each time, gets up, and carries on as if nothing happened may be called a warrior, a real warrior, but unfortunately only a reckless one. An intellectual warrior of the new age cannot keep falling in the same places.
File storage method:
Every file has a storage format, such as the common txt, cpp, h, c, xml, png, and rmvb formats, as well as custom ones. Whatever the format, the file is stored on the hard disk as binary data; different formats are simply parsed by different software. This article is not about how files are stored but about how they are parsed.
Text File Parsing:
A text file is a file that humans can read. How is it converted from binary data into text? Since the computer was invented in the United States, the first concern was naturally how to represent English. English has 26 letters; together with digits, punctuation, and control characters that makes 128 characters, which fit in 7 bits, so one byte is enough. This is the well-known ASCII code. The mapping is simple: one character corresponds to one byte.
It soon turned out, however, that the scripts of non-English-speaking countries need far more characters than ASCII can hold. Each country devised its own encoding (China's GB2312, for example, was developed domestically), and switching back and forth between them was a nuisance. A new scheme, Unicode, then appeared to unify them all by assigning a single code to every character.
Stored directly, however, Unicode codes raise two problems:
1. Many files contain only ASCII text, for which fixed-width Unicode codes waste space.
2. Without some marker, there is no way to tell how many bytes should be parsed as one symbol.
This is where the world-saving UTF came in. UTF is an implementation of Unicode, only smarter. UTF-16 uses two or four bytes per character and UTF-32 always uses four. UTF-8 uses a clever variable-length representation:
1. For a single-byte symbol, the first bit of the byte is 0 and the remaining seven bits hold the character's code.
2. For an n-byte symbol (n > 1), the first n bits of the first byte are 1 and bit n + 1 is 0; every following byte starts with 10; all remaining bits hold the character's code.
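The two rules above can be sketched as a hand-written encoder. This is an illustrative Python sketch only; the function name utf8_encode_code_point is made up here, and in real code str.encode("utf-8") does the same job.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point by hand, following the two rules above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp < 0x80:
        # Rule 1: a single byte, leading bit 0, seven bits of payload.
        return bytes([cp])
    # Rule 2: choose the byte count n from the size of the code point.
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    else:
        n = 4
    # Continuation bytes have the form 10xxxxxx; fill them from the low bits up.
    tail = []
    for _ in range(n - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # First byte: n one-bits, then a zero, then the remaining high bits.
    lead_marker = (0xFF << (8 - n)) & 0xFF
    first = lead_marker | cp
    return bytes([first] + tail[::-1])
```

For example, utf8_encode_code_point(ord("中")) yields the three bytes E4 B8 AD, the same as "中".encode("utf-8"), while any ASCII character stays a single unchanged byte.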
Different encodings put different signatures at the start of the text. UTF-16 (which Windows often just calls "Unicode") starts with two bytes, FF FE or FE FF: FF FE indicates little-endian and FE FF indicates big-endian. UTF-8 starts with EF BB BF. Because UTF-8 is self-describing, most programs can recognize it without this signature. Some programs, however, do not handle the signature at all: PHP, for example, treats it as ordinary text and outputs it instead of ignoring it. I suspect many people who have seen garbled output or parse errors from PHP ran into exactly this.
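As a rough illustration of how a program can recognize these signatures, here is a Python sketch that uses the BOM constants from the standard codecs module. The helper name sniff_bom is hypothetical, and the UTF-32 entries go beyond the signatures listed above.

```python
import codecs

# Signatures are checked longest first, because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

A caller would then decode data[bom_length:] with the returned encoding, and fall back to some other rule (a declared charset, or a default) when no signature is found.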
Finally, how to remove or add a BOM. If Vim is available, removing it is easiest with these commands:
:set encoding=utf-8
:set nobomb
To add one:
:set encoding=utf-8
:set bomb
In both cases, save the file afterwards with :w.
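Where Vim is not at hand, the same two operations can be sketched in Python. This is a minimal sketch that rewrites the file in place and assumes the file is already UTF-8; the function names remove_bom and add_bom are made up for illustration.

```python
import codecs
from pathlib import Path


def remove_bom(path: str) -> bool:
    """Strip a leading UTF-8 BOM from the file in place. Return True if one was removed."""
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        p.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False


def add_bom(path: str) -> bool:
    """Prepend a UTF-8 BOM if the file lacks one. Return True if one was added."""
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Both functions work on raw bytes, so the rest of the file is left untouched; only the three signature bytes are added or dropped.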

Windows tools tend to save with a BOM: a UTF-8 .txt file saved from Notepad actually contains one, so watch out for this. Text editors also name the BOM variants slightly differently. EditPlus calls the version with BOM "UTF-8+" and the version without "UTF-8"; Notepad++ calls the version with BOM standard UTF-8 and the version without "UTF-8 without BOM".

  
[HTML example, markup lost in extraction: page title "HTML5 title", body text "HTML5 content! Hello"]


I wrote this in Notepad. After saving, the web page showed garbled characters. Changing the declared charset to GB2312 made the Chinese characters display correctly.

  
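[HTML example, markup lost in extraction: the same page, title "HTML5 title" and body text "HTML5 content! Hello", with the declared charset changed to GB2312]
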

But that is not the standard I wanted, so I went back to UTF-8. In the end the code turned out to be fine and the problem was in Notepad. The meta tag only tells the browser to interpret the page as UTF-8; the document's actual encoding is determined by what you choose when saving. If you save as ANSI and have it interpreted as UTF-8, the result is bound to be garbled. Notepad's default format is ANSI, so you need to change it to UTF-8 when saving. Keep this in mind when writing pages in Notepad.
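The mismatch can be reproduced without a browser. The Python sketch below is an illustration only; it assumes that ANSI on a Simplified Chinese system corresponds to Python's gbk codec, and the sample text is arbitrary.

```python
text = "你好"  # arbitrary sample text

saved_as_ansi = text.encode("gbk")      # what an ANSI save writes (assumed: code page 936)
saved_as_utf8 = text.encode("utf-8")

# Matching save encoding and declared encoding round-trips cleanly.
ok = saved_as_utf8.decode("utf-8")

# ANSI bytes interpreted as UTF-8: the bytes are not valid UTF-8, so with
# errors="replace" the undecodable sequences become U+FFFD.
ansi_read_as_utf8 = saved_as_ansi.decode("utf-8", errors="replace")

# UTF-8 bytes interpreted as GBK: decoding may go through without an error,
# but the characters that come out are not the ones that were written.
utf8_read_as_gbk = saved_as_utf8.decode("gbk", errors="replace")
```

The bytes on disk never change in either failing case; only the interpretation does, which is why fixing the save encoding (or the declaration) is enough to make the text readable again.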


When I used EditPlus to write an HTML page, I found that garbled characters still appeared even though the page was set to UTF-8.

Remember the encoding used when saving the page: by default EditPlus saves pages as ANSI, not UTF-8. So when saving, be sure to select UTF-8 as the encoding. (Note that the browser's encoding must be set to auto-select.)
To change EditPlus's default encoding, set Default encoding to UTF-8 under tools --- configure user tools. (Note: in my case, pages saved from EditPlus as plain UTF-8 were still garbled and EditPlus had to be set to UTF-8 + BOM, whereas a file saved as UTF-8 from Notepad displayed fine.)


Cause:
When a file is saved to the hard disk, its content is stored in some particular encoding, which depends on the encoding the local machine uses.
On Simplified Chinese systems ANSI means GB2312 (strictly, code page 936, that is, GBK); on Traditional Chinese systems it means Big5; on Japanese systems it means Shift_JIS.
These encodings, which use up to two bytes per character to represent one language's characters, are what Windows calls ANSI encodings.
Therefore, if the file is saved as ANSI, the meta tag in the HTML page should declare charset=GB2312 or charset=GBK.
Because GBK is an extension of GB2312, the browser interprets the saved file correctly with either declaration.
If the file is saved as ANSI but the meta tag declares charset=UTF-8, the page will be garbled, because ANSI-encoded bytes cannot be decoded correctly as UTF-8.

Solution: make the encoding used to save the file and the encoding declared for parsing it the same.
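One way to keep the two in step is to drive both from a single value. The following Python sketch is only an illustration of that idea: the function name save_html and the page skeleton are made up, and no HTML escaping is done.

```python
from pathlib import Path


def save_html(path: str, title: str, body: str, charset: str = "utf-8") -> bytes:
    """Write a small HTML page whose bytes and meta charset use the same encoding."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f'<meta charset="{charset}">\n'
        f"<title>{title}</title>\n"
        "</head>\n<body>\n"
        f"<p>{body}</p>\n"
        "</body>\n</html>\n"
    )
    # The same name drives both the declaration above and the bytes on disk.
    data = page.encode(charset)
    Path(path).write_bytes(data)
    return data
```

Called with charset="utf-8" it writes UTF-8 without a BOM and declares UTF-8; called with charset="gbk" it writes GBK bytes and declares GBK, so the declaration cannot drift away from the save encoding.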


Note:

The four encodings Notepad offers in its Save As dialog:

1. ANSI (the default): in a Chinese environment, ANSI is the familiar GB2312.

2. Unicode: UTF-16 (little-endian, with BOM).

3. Unicode big endian: UTF-16 with the opposite byte order. I spent a long time on this one and still do not fully understand it; I only know that its byte order differs from that of the Unicode option.

4. UTF-8: the famous, universal UTF-8. I think it is the way things are heading. On the web, styles can be as individual and varied as you like, but the rules should be unified.
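To see what the four options put on disk, here is a Python sketch that builds the equivalent bytes for one sample character. It assumes a Simplified Chinese system (ANSI taken as Python's gbk codec) and the Notepad behavior described in this article (UTF-8 saved with a BOM); the dictionary name notepad_style is made up.

```python
import codecs

text = "中"  # U+4E2D, an arbitrary sample character

notepad_style = {
    # ANSI on a Simplified Chinese system: code page 936 (GBK), no BOM.
    "ANSI": text.encode("gbk"),
    # "Unicode": UTF-16 little-endian preceded by the BOM FF FE.
    "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    # "Unicode big endian": UTF-16 big-endian preceded by the BOM FE FF.
    "Unicode big endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    # "UTF-8" as described above: UTF-8 preceded by the BOM EF BB BF.
    "UTF-8": text.encode("utf-8-sig"),
}

for name, data in notepad_style.items():
    print(f"{name:20} {data.hex(' ')}")
```

The two Unicode options carry the same two payload bytes in opposite order, which is the whole difference between them.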

Test environment: Firefox, IE, Chrome


First I created six new HTML pages, all in Notepad, saved as either GB2312 (ANSI) or UTF-8 for the test.

1. Saved as GB2312 (ANSI), declared charset=GB2312.

Result: displayed normally.

2. Saved as GB2312 (ANSI), declared charset=UTF-8.

Result: displayed normally, but the browser's encoding is still GB2312.

3. Saved as UTF-8, declared charset=UTF-8.

Result: displayed normally.

4. Saved as UTF-8, declared charset=GB2312.

Result: the browser's encoding is UTF-8.

5. Saved as GB2312 (ANSI), no charset declared.

Result: displayed normally.

6. Saved as UTF-8, no charset declared.

Result: displayed normally.

Summary: I had assumed that the charset attribute is there for the browser to obey: the browser displays the page using the encoding named in charset, and detects the encoding itself only when none is declared. What stands out is that a page saved as UTF-8 but declared charset=GB2312 was automatically switched by the browser to UTF-8, and a page saved as GB2312 but declared charset=UTF-8 was automatically switched to GB2312; in both cases the charset value was not honored. For the UTF-8 case, a likely explanation is the BOM that Notepad writes, which browsers give priority over the declared charset. So what is going on? Is setting the document's save encoding all that is needed? Obviously not.
