The internet has made the world smaller and smaller, and most websites have to deal with a variety of encoding problems. Using Google's approach as a reference, this article briefly discusses how to make a site support multiple languages effectively, leaving aside the background of the various encodings for now.
A website that provides search services typically runs into encoding issues in the following areas:
First, the interface language
A Chinese user who visits google.com is usually shown the Chinese interface directly. How does Google do that? Look at the request sent by the browser:
GET / HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/msword, */*
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Host: www.google.com
Connection: Keep-Alive
HTTP/1.1 defines the Accept-Language header, which lets a multilingual server return the right content based on its value. What happens if Accept-Language is not specified? Try it with telnet:
[fancao@free:~]$ telnet www.google.com 80
Trying 64.233.189.104...
Connected to www-china.l.google.com.
Escape character is '^]'.
GET / HTTP/1.1

HTTP/1.1 302 Found
Location: http://www.google.com/intl/zh-CN/
Cache-Control: private
Set-Cookie: pref=id=01710c01d6f2632d:nw=1:tm=1144067151:lm=1144067151:s=r3d24oiqqia9u6i_; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
Content-Type: text/html
Server: GWS/2.1
Content-Length: 230
Date: Mon, Apr 2006 12:25:51 GMT

<HTML><HEAD><TITLE>302 Moved</TITLE></HEAD><BODY>
The document has moved
<A HREF="http://www.google.com/intl/zh-CN/">here</A>.
</BODY></HTML>
Google still redirects to the Chinese site http://www.google.com/intl/zh-CN/. Note that the request was actually served by www-china.l.google.com.
Test again in a browser, this time clearing the browser's language settings so that no Accept-Language header is sent, and visit www.google.com.cn, www.google.com.tw, www.google.co.jp and www.google.co.kr in turn. Each returns a page in a different language, which means the default interface language differs from server to server.
If you set Accept-Language to zh and then visit http://www.google.co.kr/ again, the result is no longer Korean but Simplified Chinese. So Accept-Language takes priority over the server's default language.
Google also lets users choose the interface language. Where is that setting stored? Look at the browser request (cookie part) sent after choosing the Simplified Chinese interface:
Cookie: pref=id=6f389883f4bc8b9b:lr=lang_ja|lang_zh-cn|lang_zh-tw:ld=zh-cn:nr=10:nw=1:tm=1144071286:lm=1144071338:s=91yuwz0pwgg8eb0w
So the setting is stored in a cookie on the client. No matter what the IP address is and no matter what Accept-Language says, the page is displayed according to the user's own setting.
Google's approach can be summed up in three points:
• Read the cookie and follow the user's setting
• If there is no cookie, use the Accept-Language header of the HTTP request
• Deploy servers with different default languages, and resolve users to the server for their local language based on IP address
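The three rules above can be sketched as a small selection function. This is only an illustration in Python, not Google's implementation: the function names, the `supported` list, and the idea that the cookie language has already been extracted from the cookie are all assumptions made for the example.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language value, most preferred first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        tag, _, param = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag or tag == "*":
            continue
        quality = 1.0  # a tag without an explicit q-value counts as q=1
        param = param.strip().lower()
        if param.startswith("q="):
            try:
                quality = float(param[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            ranked.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(ranked)]


def pick_interface_language(cookie_language, accept_language, server_default, supported):
    """Apply the three rules in order: cookie, Accept-Language, server default."""
    supported = [tag.lower() for tag in supported]
    # 1. An explicit user setting stored in the cookie wins.
    if cookie_language and cookie_language.lower() in supported:
        return cookie_language.lower()
    # 2. Otherwise honour Accept-Language: exact tag first, then primary language.
    for tag in parse_accept_language(accept_language or ""):
        if tag in supported:
            return tag
        primary = tag.split("-")[0]
        for candidate in supported:
            if candidate.split("-")[0] == primary:
                return candidate
    # 3. Fall back to the default language of the server that got the request.
    return server_default
```

With a hypothetical supported list of `["en", "ko", "ja", "zh-cn", "zh-tw"]` and a Korean default server, a request carrying `Accept-Language: zh` and no cookie would get `zh-cn`, which matches the google.co.kr observation above.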
Second, recognizing the encoding of input content
Identifying an encoding is hard. A browser determines the encoding of a web page from two things:
• The charset in the HTTP response header
• The charset declared in the HTML page
The charset in the HTTP header takes priority over the charset in the HTML file. A small test shows this:
<?php
header("Content-Type: text/html; charset=iso-8859-1");
?>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<body>
Chinese text content
</body>
The PHP page above does not display the Chinese characters correctly, because charset=iso-8859-1 is specified in the HTTP header.
If neither is specified, the browser has a hard time too. The usual approach is to guess from the bytes themselves, checking which encoding's code range they fit most plausibly.
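The priority described above (HTTP header first, then the page's own declaration) can be sketched as follows. This is a simplified illustration in Python with hypothetical function names; real browsers apply further rules, such as byte order marks and byte-level guessing, which are not modelled here.

```python
import re

CHARSET_PATTERN = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value, if any."""
    match = CHARSET_PATTERN.search(content_type or "")
    return match.group(1).lower() if match else None


def charset_from_meta(html_bytes):
    """Look for a charset declaration in <meta> tags near the start of the page."""
    head = html_bytes[:1024].decode("ascii", errors="ignore")
    for meta in re.findall(r"<meta[^>]*>", head, re.IGNORECASE):
        match = CHARSET_PATTERN.search(meta)
        if match:
            return match.group(1).lower()
    return None


def effective_charset(content_type, html_bytes):
    """The HTTP header wins; the meta declaration is used only when the header is silent."""
    return charset_from_header(content_type) or charset_from_meta(html_bytes)
```

For the PHP test page, `effective_charset("text/html; charset=iso-8859-1", page)` yields `iso-8859-1` even though the page itself declares gb2312, which is why the Chinese text is garbled.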
So how does a search site determine the encoding of the characters a user enters? Another test:
Submitting "中国" ("China") through a form in a UTF-8 page produces:
http://localhost/test.php?test=%E4%B8%AD%E5%9B%BD
Submitting "中国" through a form in a GB2312 page produces:
http://localhost/test.php?test=%D6%D0%B9%FA
%E4%B8%AD%E5%9B%BD is the URL-encoded form of the UTF-8 bytes of "中国", while %D6%D0%B9%FA is the URL-encoded form of its GB2312 bytes. This shows that the browser encodes GET or POST data according to the encoding of the current page.
So as long as the page is encoded in UTF-8 and the query data is then parsed as UTF-8, the problem is solved.
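The two encoded forms can be reproduced with Python's standard `urllib.parse` module. This is just a quick check of the byte values quoted above, not part of the original test setup.

```python
from urllib.parse import quote, unquote

word = "中国"  # "China", the test word used above

utf8_query = quote(word, encoding="utf-8")      # what a UTF-8 page submits
gb2312_query = quote(word, encoding="gb2312")   # what a GB2312 page submits

print(utf8_query)    # %E4%B8%AD%E5%9B%BD
print(gb2312_query)  # %D6%D0%B9%FA

# Decoding only works when the server assumes the same encoding the page used.
assert unquote(utf8_query, encoding="utf-8") == word
assert unquote(gb2312_query, encoding="gb2312") == word
```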
Let's see how Google handles this. Google's pages are encoded in UTF-8. Searching Google for "中国" gives the URL:
http://www.google.com/search?sourceid=navclient&hl=zh-CN&ie=UTF-8&rls=AMSA,AMSA:2006-11,AMSA:zh-CN&q=%E4%B8%AD%E5%9B%BD
There is the familiar %E4%B8%AD%E5%9B%BD again.
The URL above contains ie=UTF-8; analysis shows that ie stands for the input encoding. Now query Google without specifying the input encoding, once with the UTF-8 bytes and once with the GB2312 bytes:
http://www.google.com/search?q=%E4%B8%AD%E5%9B%BD
http://www.google.com/search?q=%D6%D0%B9%FA
Google recognizes both the default-language encoding and UTF-8 correctly. How? UTF-8 byte sequences follow strict rules, so the input can be checked against those rules, and that is how Google decides. But this check sometimes gets it wrong, for example:
http://www.google.com/search?q=%D1%A7%CF%B0
%D1%A7%CF%B0 is the URL-encoded GB2312 form of the two-character word "学习" ("learning"). The mistake happens because these GB2312 bytes also happen to form a valid UTF-8 sequence, so they are taken for UTF-8.
Based on Google's approach, the following schemes can be proposed:
• Uniform UTF-8: encode all pages in UTF-8 and parse query data as UTF-8
• More complex: try to parse the query data according to UTF-8 rules; if it does not conform, treat it as the current language's encoding
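The second scheme can be sketched in a few lines of Python. The function name and the choice of GB2312 as the fallback are assumptions for illustration; the last assertion reproduces the misdetection described above, where GB2312 bytes pass the UTF-8 check.

```python
from urllib.parse import unquote_to_bytes


def decode_query(raw_query, fallback_encoding="gb2312"):
    """Decode a percent-encoded query value, preferring UTF-8."""
    data = unquote_to_bytes(raw_query)
    try:
        return data.decode("utf-8")  # strict mode: raises on malformed UTF-8
    except UnicodeDecodeError:
        return data.decode(fallback_encoding, errors="replace")


# Well-formed UTF-8 is accepted as UTF-8.
assert decode_query("%E4%B8%AD%E5%9B%BD") == "中国"
# These bytes break the UTF-8 rules, so the fallback encoding is used.
assert decode_query("%D6%D0%B9%FA") == "中国"
# The GB2312 bytes of "学习" are also well-formed UTF-8, so the guess is wrong.
assert decode_query("%D1%A7%CF%B0") != "学习"
```

An explicit parameter such as Google's ie avoids the ambiguity, since the server no longer has to guess.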
Third, more effective search results
Users always want more complete search results. Chinese web pages come in GBK, Big5, UTF-8 and other encodings, and the various Japanese and Korean encodings also contain some Chinese characters. How can a search cover all of these pages for the user? This time I did not run any tests against Google. Based on background knowledge of the encodings, I propose the following:
• Simplified/Traditional Chinese conversion
The point of the conversion is to run the user's query in both forms. Whether the user enters the Simplified or the Traditional form of a word such as "the public", both forms are searched at the same time.
• Convert the original web pages to UTF-8 before indexing
Converting the original pages to UTF-8 before indexing is very convenient, because all the different encodings can then be handled in a uniform way.
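Both proposals can be sketched briefly in Python. The three-entry character table is a toy stand-in invented for this example (a real Simplified/Traditional table is far larger, and some characters do not map one-to-one), and the function names are hypothetical.

```python
# Toy Simplified-to-Traditional table; a real one needs thousands of entries.
SIMPLIFIED_TO_TRADITIONAL = {"国": "國", "学": "學", "习": "習"}
TRADITIONAL_TO_SIMPLIFIED = {t: s for s, t in SIMPLIFIED_TO_TRADITIONAL.items()}


def query_variants(query):
    """Return the query in both Simplified and Traditional form, without duplicates."""
    simplified = "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in query)
    traditional = "".join(SIMPLIFIED_TO_TRADITIONAL.get(ch, ch) for ch in query)
    return sorted({simplified, traditional})


def to_utf8(raw_page, source_encoding):
    """Re-encode a fetched page as UTF-8 so the index only ever sees one encoding."""
    return raw_page.decode(source_encoding, errors="replace").encode("utf-8")
```

A search front end could then look up every string returned by `query_variants`, while the crawler passes each page through `to_utf8` with the encoding detected for that page before indexing.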