Encoding problems front-end engineers encounter


Lead: For historical reasons, Taobao pages have always been encoded in GBK, and the F2E (front-end engineer) manual specifies this clearly. For a while the front-end engineers ran into no serious garbled-text problems and everyone was at peace. But as Taobao gained more and more partners, whose API interfaces use all kinds of encodings, connecting Taobao's systems to third-party data exposed a variety of garbled-text problems. It is worth sorting this issue out.

I suspect that when the first Taobao page was written, the engineer simply wrote the code and forgot to check the editor's default encoding setting, and the mistake has persisted to this day; a little attention might have avoided such a low-level error. The "encoding convention" carries real weight in a site-wide specification, yet this extremely important issue is easily overlooked. After all, it is not just a "unified page encoding convention"; it also bears on site-wide security policy. As front-end engineers facing all kinds of garbled-text problems, we need to understand where the problem comes from and how to solve it, rather than relying on some adaptive behavior of a browser to avoid garbled text temporarily, which treats the symptom and not the cause.

From binary stream to displayed characters

As we all know, there are two approaches to encoding characters. The older one extends ASCII with a language-specific code table, such as Big5 for Traditional Chinese and GB2312 for Simplified Chinese; these two are incompatible with each other. The ISO series is similar: its Latin encoding sets are not compatible with one another. The advantage of this approach is that the code table is small; the disadvantage is that multiple languages cannot be used at the same time.

Hence the second approach, a "universal code": put all the world's languages into one code table, the Unicode code table. Its obvious disadvantage is that the table is very large; its advantage is that multiple languages can be used together. UTF-7, UTF-8, and so on are more efficient byte-level implementations of Unicode. They all encode the same Unicode code table, regardless of how many bytes a given UTF uses for a character.

The Chinese encodings we meet most often are GB2312, GBK, GB18030, and UTF-8. Loosely speaking, the first three are roughly compatible with one another, but all of them are incompatible with UTF-8. If a piece of text is encoded with the GBK code table, the reading software can only read it correctly by decoding with the GBK code table.

Turning the encoded bytes into the glyphs finally shown on screen takes one more step. After decoding (converting the byte stream into characters by looking them up in the code table), the reading software (browser, text editor, and so on) must map each character to its glyph in a font. Obviously, if the font used to display the text contains no glyph for a character, that character cannot be displayed. More background information is available in the PPT linked from the original article.
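As a minimal sketch of the point that bytes must be decoded with the same code table that produced them, the snippet below uses the standard TextEncoder and TextDecoder APIs. The GBK bytes for "淘宝" are hard-coded as an assumption (they correspond to the percent-encoded GBK value quoted later in this article), because TextEncoder can only produce UTF-8.

```javascript
// UTF-8 bytes for "淘宝", produced by the built-in encoder (TextEncoder is UTF-8 only).
const utf8Bytes = new TextEncoder().encode('淘宝');

// GBK bytes for the same two characters, hard-coded (assumption: they match
// the "%CC%D4%B1%A6" value quoted in this article).
const gbkBytes = new Uint8Array([0xcc, 0xd4, 0xb1, 0xa6]);

// Format a byte array as space-separated upper-case hex.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');
}

const utf8Decoder = new TextDecoder('utf-8');

// Right code table: the UTF-8 bytes decode back to the original text.
const roundTrip = utf8Decoder.decode(utf8Bytes);

// Wrong code table: the GBK bytes are not valid UTF-8, so the decoder
// substitutes U+FFFD replacement characters and the text is garbled.
const garbled = utf8Decoder.decode(gbkBytes);

console.log(toHex(utf8Bytes)); // E6 B7 98 E5 AE 9D
console.log(toHex(gbkBytes)); // CC D4 B1 A6
console.log(roundTrip === '淘宝'); // true
console.log(garbled.includes('\uFFFD')); // true
```

The same two characters occupy six bytes in UTF-8 and four in GBK, which is why neither byte stream can be read with the other's table.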

Because browsers are quite tolerant, garbled text caused by missing fonts rarely appears. Engineers do occasionally hit a related problem when writing code, however. For example, when a file is edited in Vim with GB2312 encoding and contains the character "镕" (rong), the file cannot be saved, because that character does not exist in the GB2312 code table. Yet a page declared as GB2312 can display the character, because browsers typically parse GB pages with the Windows system encoding, which is usually GBK.

If you use EditPlus or Notepad, just save the file as ANSI. These editors work out from the byte stream whether the content is GB2312, GBK, or GB18030, which is why many text editors on Windows do not force a specific encoding and simply use the system encoding. On a Chinese Windows system you can usually treat ANSI as GBK. If you use Linux, refer to the guide linked from the original article to configure your editor's encoding correctly.

How the browser sends a URL containing Chinese

So how is encoding handled when the browser opens a web page, from typing the URL to finally rendering the UI? The process has two phases: (1) sending the URL request, and (2) receiving the data and rendering it. The URL standard (RFC 1738) restricts which characters a URL may contain:

"... Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

URL encoding has to be mentioned here. According to the RFC, a URL containing Chinese is not a legal URL, but the RFC does not specify how such characters should be transcoded. Nanyi's article describes in detail how Firefox and IE encode a Chinese URL for the HTTP request. Specifically, URL encoding is only an encoding "method"; the result depends on the "code table" used, that is, the internal byte representation of the Chinese characters. The same Chinese characters can therefore have several URL-encoded forms: "淘宝" (Taobao) is "%E6%B7%98%E5%AE%9D" in UTF-8 and "%CC%D4%B1%A6" in GBK.
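To make the "method versus code table" distinction concrete, here is a small sketch: the percent-encoding step is identical, and only the input bytes differ. percentEncodeBytes is a hypothetical helper, not a browser API, and the GBK bytes are hard-coded because JavaScript has no built-in GBK encoder.

```javascript
// Hypothetical helper: percent-encode every byte as %XX (upper-case hex).
function percentEncodeBytes(bytes) {
  return Array.from(bytes, (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')).join('');
}

// "淘宝" under two different code tables.
const taobaoUtf8 = new TextEncoder().encode('淘宝'); // UTF-8 bytes
const taobaoGbk = new Uint8Array([0xcc, 0xd4, 0xb1, 0xa6]); // GBK bytes (hard-coded)

console.log(percentEncodeBytes(taobaoUtf8)); // %E6%B7%98%E5%AE%9D
console.log(percentEncodeBytes(taobaoGbk)); // %CC%D4%B1%A6
```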

Note that when an HTML page encoded in GB submits a form, the Chinese in the form is URL-encoded using GBK, while a form submitted from a UTF-8 page is URL-encoded using UTF-8. Compare these two examples:

Form submission from a GBK page; form submission from a UTF-8 page

This has to be agreed with the backend program: however the page encodes the data, the backend logic must decode it correspondingly. Taobao's search page, for example, like Baidu's, accepts only GBK URL encoding. There is plenty of material on this online, so it is not repeated here.
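A rough sketch of what "the backend must decode correspondingly" means, written in JavaScript for consistency with the rest of this article. percentDecodeToBytes and decodeQueryValue are hypothetical helpers, and the sketch assumes a runtime whose TextDecoder supports the "gbk" label (browsers, and Node.js builds with full ICU).

```javascript
// Hypothetical helper: turn a percent-encoded ASCII string into raw bytes.
// "+" is treated as a space, as in form submissions.
function percentDecodeToBytes(encoded) {
  const bytes = [];
  for (let i = 0; i < encoded.length; i++) {
    const hex = encoded.slice(i + 1, i + 3);
    if (encoded[i] === '%' && /^[0-9a-fA-F]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16));
      i += 2;
    } else if (encoded[i] === '+') {
      bytes.push(0x20);
    } else {
      bytes.push(encoded.charCodeAt(i) & 0xff); // input is assumed to be ASCII
    }
  }
  return Uint8Array.from(bytes);
}

// Hypothetical helper: decode a query value with the charset the page used.
function decodeQueryValue(encoded, charset) {
  return new TextDecoder(charset).decode(percentDecodeToBytes(encoded));
}

const fromGbkPage = decodeQueryValue('%CC%D4%B1%A6', 'gbk');
const fromUtf8Page = decodeQueryValue('%E6%B7%98%E5%AE%9D', 'utf-8');

// A UTF-8-only decoder rejects the GBK form outright.
let utf8Attempt;
try {
  utf8Attempt = decodeURIComponent('%CC%D4%B1%A6');
} catch (err) {
  utf8Attempt = err.name;
}

console.log(fromGbkPage, fromUtf8Page, utf8Attempt);
```

Both values decode to the same text only when each is paired with the charset of the page that produced it; the built-in decodeURIComponent, which assumes UTF-8, throws a URIError on the GBK form.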

But what happens if the URL encoding is done with JavaScript's encodeURI (or encodeURIComponent)? JS URL-encodes only as UTF-8. Taking the two examples above again, the JS URL encoding is always the same whether the page is GBK or UTF-8. Pay particular attention to this when a JS library simulates form submission: if submitting the form directly gives correct results but the simulated submission provided by a JS library or framework, which stitches the query string together itself, goes wrong, this is the reason. So be careful when stitching query URLs in JS, and check which encoding the backend program expects.
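A short illustration of this behavior, using only standard built-ins; the parameter name q is a hypothetical example. Neither encodeURIComponent nor URLSearchParams takes a charset argument, so the output is UTF-8 regardless of the page encoding.

```javascript
// encodeURIComponent always percent-encodes the UTF-8 bytes of the string.
const encodedValue = encodeURIComponent('淘宝');

// URLSearchParams, a common way to stitch query strings, behaves the same way.
const stitchedQuery = new URLSearchParams({ q: '淘宝' }).toString();

console.log(encodedValue); // %E6%B7%98%E5%AE%9D
console.log(stitchedQuery); // q=%E6%B7%98%E5%AE%9D
```

A backend that only accepts GBK would receive these UTF-8 bytes and misread them, which is the mismatch described above.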

How the browser renders the page with the correct encoding

An HTTP response can carry encoding information in at least three places:

1. The Content-Type field in the HTTP header
2. The charset specified in a meta tag in the HTML page
3. The page body data (the browser can analyse the bytes of the text to determine the encoding)

The browser can obtain the encoding of the HTTP response from these three places. Two further factors also play a part: the browser's default encoding and the operating system's language.

Taobao's home page is GBK encoded. The HTTP response header declares the document encoding as GB2312, while the meta charset and the page body are GBK, and the browser renders the page correctly.

Encoding settings in Content-type

Meta tags in the source code

If the three disagree, the browser first reads the Content-Type in the HTTP header. If no encoding is set there, it looks for the charset in the page's meta tag. If that is missing too, the browser's default encoding is used, and if no default is specified, the browser determines the encoding by analysing the body content. So for a GBK-encoded page, even if the meta tag says charset=utf-8, the page displays normally as long as the Content-Type is set to GBK (or GB2312, GB18030). If the Content-Type carries no encoding, the browser follows the meta charset and the page appears garbled.
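The lookup order described above can be modelled roughly as follows. This is a simplified sketch of the order as this article states it, not a real browser algorithm (actual browsers also consider things such as a byte order mark); pickCharset and its parameter names are hypothetical.

```javascript
// Simplified model of the lookup order described above; not a real browser algorithm.
function pickCharset({ contentType = '', html = '', browserDefault = '' }) {
  // 1. charset parameter of the HTTP Content-Type header.
  const fromHeader = /charset\s*=\s*"?([\w-]+)/i.exec(contentType);
  if (fromHeader) return { charset: fromHeader[1].toLowerCase(), source: 'http-header' };

  // 2. <meta charset="..."> or <meta http-equiv="Content-Type" content="...; charset=...">.
  const fromMeta = /<meta[^>]*charset\s*=\s*["']?([\w-]+)/i.exec(html);
  if (fromMeta) return { charset: fromMeta[1].toLowerCase(), source: 'meta' };

  // 3. The browser's configured default encoding, if any.
  if (browserDefault) return { charset: browserDefault.toLowerCase(), source: 'default' };

  // 4. Otherwise the browser has to guess from the body bytes (not modelled here).
  return { charset: null, source: 'sniff-body' };
}

// A GBK page whose meta tag wrongly says utf-8.
const page = '<meta charset="utf-8"><title>GBK page</title>';

const withHeader = pickCharset({ contentType: 'text/html; charset=GBK', html: page });
const withoutHeader = pickCharset({ contentType: 'text/html', html: page });

console.log(withHeader); // header wins, so the page is decoded as GBK
console.log(withoutHeader); // meta wins, so the GBK page is decoded as utf-8 and garbled
```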

In PHP, the charset in the Content-Type header can be set like this:

header("Content-Type: text/html; charset=GBK");

Proper loading of JS files

When an HTML page loads a JS file, the file's encoding needs to be specified so that it is interpreted correctly, for example:

<script src="Gbk.js" charset="GBK"></script>

You can refer to this demo

Handling Chinese in JSONP

From the example above we know that an external JS file loads correctly as long as its charset is specified, and the same is true of JSONP. There is, however, a more thorough way to eliminate garbled text: write the non-ASCII characters in the JS file as Unicode escapes. The JS engine's internal encoding is Unicode, so it recognises Unicode-escaped text whatever the encoding of the file. For example, "淘宝" can be written as "\u6dd8\u5b9d".

PHP's json_encode function converts the strings in an array to Unicode escapes directly. The same escaping can also be done in JS; see the demo linked from the original article.
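A minimal sketch of doing the same escaping in JS, assuming the goal is a pure-ASCII payload that survives any page or file encoding. toAsciiJson and the callback name handleData are hypothetical.

```javascript
// Serialize a value to JSON, then replace every non-ASCII UTF-16 code unit
// with a \uXXXX escape so that the output contains ASCII only.
function toAsciiJson(value) {
  return JSON.stringify(value).replace(/[\u0080-\uffff]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0'));
}

// Hypothetical JSONP response body; "handleData" is an assumed callback name.
const payload = toAsciiJson({ name: '淘宝' });
const jsonpBody = 'handleData(' + payload + ');';

console.log(payload); // {"name":"\u6dd8\u5b9d"}
console.log(JSON.parse(payload).name); // 淘宝
```

Because the response body contains only ASCII bytes, it reads the same whether the consuming page is GBK or UTF-8.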

Getting the URL through JS

Note that when an address containing Chinese is opened in Firefox and then copied from the address bar and pasted elsewhere, what you get is not the Chinese but the encoded URL. This is because the browser helpfully decodes the URL in order to display the Chinese. Be especially careful when a script needs to read the URL: when this demo is opened in Firefox and in IE, the URL obtained by JS differs.

Firefox:

IE:

If the page does not exchange data with the backend, getting the URL directly from document.location.href is fine. Once the backend is involved, you need to be very careful. The most common problem scenario is a callback address passed to the login page. For example, enter the Taobao home page through this address:

http://www.taobao.com/?淘宝

The page can be accessed normally in both Firefox and IE. Then click "login" in the top bar to go to the login page. In Firefox the callback address is:

In IE the callback address is:

Now log in to Taobao, and the page jumps back to the Taobao home page. The URL in Firefox's address bar is correct, while garbled text appears in IE's address bar.

In Firefox:

In IE:

The solution is not to use JS to read the URL when writing the callback address; obtain it instead from the login page's referrer or by other means.
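Where a script has to build the callback anyway, one defensive option (an assumption of this sketch, not the article's recommendation) is to normalise the address through the standard URL class first, so that raw and already-encoded forms converge on the same percent-encoded UTF-8 string. normalizeHref, buildLoginUrl, the redirect parameter, and login.example.com are all hypothetical, and the sketch assumes the backend expects UTF-8.

```javascript
// Normalise an address so that non-ASCII characters are always percent-encoded
// as UTF-8, whether or not the browser handed them over already encoded.
function normalizeHref(href) {
  return new URL(href).href;
}

// Hypothetical helper: build a login URL that carries the current page as a callback.
function buildLoginUrl(loginBase, currentHref) {
  const url = new URL(loginBase);
  url.searchParams.set('redirect', normalizeHref(currentHref));
  return url.href;
}

const rawHref = 'http://www.taobao.com/?淘宝'; // as one browser might report it
const encodedHref = 'http://www.taobao.com/?%E6%B7%98%E5%AE%9D'; // as another might

console.log(normalizeHref(rawHref) === normalizeHref(encodedHref)); // true
console.log(buildLoginUrl('https://login.example.com/', rawHref));
```

This only removes the browser-to-browser difference; it does not help if the backend decodes the callback as GBK.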

How a GBK page can get GBK-format URL encoding via JS

We know that a form submitted from a GBK page is URL-encoded in GBK. Building on this, we can wrap up a function that lets JS on a GBK page obtain GBK URL encoding. See this demo: it simulates submitting a form, then captures the result of the submission to obtain the URL encoding in GBK format.

In this way JS can control which URL encoding it gets. Note in particular, however, that this does not work on UTF-8 pages.
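The form-submission trick depends on running inside a GBK page. As an alternative sketch, not the demo's method, a GBK percent-encoder can be built by inverting the standard TextDecoder: decode every candidate two-byte GBK sequence once and record which bytes produce which character. buildGbkTable and encodeGbkComponent are hypothetical names, the sketch assumes a runtime whose TextDecoder accepts the "gbk" label (browsers, and Node.js builds with full ICU), and it covers two-byte GBK characters only.

```javascript
// Build a "character -> [lead, trail]" table by decoding every candidate
// two-byte GBK sequence once. Assumes TextDecoder supports the "gbk" label.
function buildGbkTable() {
  const decoder = new TextDecoder('gbk');
  const table = new Map();
  for (let lead = 0x81; lead <= 0xfe; lead++) {
    for (let trail = 0x40; trail <= 0xfe; trail++) {
      if (trail === 0x7f) continue; // 0x7F is never a GBK trail byte
      const ch = decoder.decode(new Uint8Array([lead, trail]));
      // Skip unassigned pairs (decoded as U+FFFD or as more than one code unit)
      // and keep the first byte pair seen for each character.
      if (ch.length === 1 && ch !== '\uFFFD' && !table.has(ch)) {
        table.set(ch, [lead, trail]);
      }
    }
  }
  return table;
}

// Percent-encode text using GBK bytes for non-ASCII characters.
// ASCII characters are delegated to encodeURIComponent.
function encodeGbkComponent(text, table) {
  let out = '';
  for (const ch of text) {
    if (ch.charCodeAt(0) < 0x80) {
      out += encodeURIComponent(ch);
      continue;
    }
    const bytes = table.get(ch);
    if (!bytes) throw new RangeError('No two-byte GBK mapping for ' + JSON.stringify(ch));
    out += bytes.map((b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')).join('');
  }
  return out;
}

const gbkTable = buildGbkTable(); // build once and reuse
console.log(encodeGbkComponent('淘宝', gbkTable)); // %CC%D4%B1%A6
```

Unlike the form-based approach, this does not depend on the encoding of the current page, at the cost of building the lookup table once up front.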

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
