Analyzing the encoding problems of a search site, using Google as an example

Source: Internet
Author: User
Tags: intl, urlencode

The Internet has made the world smaller and smaller, and most websites have to deal with a variety of encoding problems. Using Google's approach as an example, this article briefly discusses how a site can effectively support multiple languages. The background of the individual encodings is not covered here.

A website that provides search services usually runs into encoding issues in the following areas:

First, the interface language

A Chinese user browsing google.com usually sees the Chinese interface directly. How does Google do that? Look at the request sent by the browser:

GET / HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, application/msword, */*
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)
Host: www.google.com
Connection: Keep-Alive

The HTTP/1.1 protocol defines the Accept-Language request header, which lets a multilingual server return the correct result based on its value. What happens if Accept-Language is not specified? Try it with telnet:

[fancao@free:~] $ telnet www.google.com 80
Trying 64.233.189.104 ...
Connected to www-china.l.google.com.
Escape character is '^]'.
GET / HTTP/1.1

HTTP/1.1 302 Found
Location: http://www.google.com/intl/zh-CN/
Cache-Control: private
Set-Cookie: pref=id=01710c01d6f2632d:nw=1:tm=1144067151:lm=1144067151:s=r3d24oiqqia9u6i_; expires=Sun, 17-Jan-2038 19:14:07 GMT; path=/; domain=.google.com
Content-Type: text/html
Server: GWS/2.1
Content-Length: 230
Date: Mon, Apr 2006 12:25:51 GMT

<title>302 moved</title>
The document has moved
<a href="http://www.google.com/intl/zh-CN/">HERE</A>.
</BODY></HTML>

Google still redirects to the Chinese site http://www.google.com/intl/zh-CN/. Note that the request was actually connected to www-china.l.google.com.

Now test again through the browser. This time remove the browser's language setting, so that no Accept-Language is sent, and browse www.google.com.cn, www.google.com.tw, www.google.co.jp, and www.google.co.kr in turn. Each returns results in a different language, which means that different servers provide different default interface languages.

If you specify Accept-Language: zh and then browse http://www.google.co.kr/ again, the result is no longer Korean but Simplified Chinese. So Accept-Language takes priority over the server's default language.

Google also lets the user set the interface language. Where is that setting stored? Look at the browser request after choosing the Simplified Chinese interface (cookie part only):

Cookie: pref=id=6f389883f4bc8b9b:lr=lang_ja|lang_zh-cn|lang_zh-tw:ld=zh-cn:nr=10:nw=1:tm=1144071286:lm=1144071338:s=91yuwz0pwgg8eb0w

It turns out the setting is stored in a cookie on the client, so the interface is displayed according to the user's setting regardless of the IP address or the Accept-Language value.

Google's approach can be summed up in the following 3 points:

• If a cookie is present, use the language from the user's settings

• If there is no cookie, use the Accept-Language header of the HTTP request

• Deploy servers with different default languages, and resolve each client to a server in its local language based on its IP address
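The three points above can be sketched as a small selection routine. This is only an illustration in Python under stated assumptions, not Google's implementation: the function names, the `SUPPORTED` list, and the rule that a bare `zh` matches the first supported `zh-*` entry are all hypothetical.

```python
# Hypothetical list of interface languages a site offers.
SUPPORTED = ["en", "zh-CN", "zh-TW", "ja", "ko"]


def parse_accept_language(header):
    """Return the language tags of an Accept-Language value, best first."""
    weighted = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        params = params.strip()
        quality = 1.0  # a tag without q= has the default weight 1
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        weighted.append((quality, tag.strip().lower()))
    # The sort is stable, so equal weights keep their order in the header.
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    return [tag for quality, tag in weighted if quality > 0]


def choose_language(cookie_lang, accept_language, server_default, supported=SUPPORTED):
    """Cookie first, then Accept-Language, then the server's default language."""
    by_lower = {name.lower(): name for name in supported}
    if cookie_lang and cookie_lang.lower() in by_lower:
        return by_lower[cookie_lang.lower()]
    for tag in parse_accept_language(accept_language or ""):
        if tag in by_lower:
            return by_lower[tag]
        # Assumed fallback: match on the primary subtag ("zh" -> "zh-CN").
        primary = tag.split("-")[0]
        for key, name in by_lower.items():
            if key.split("-")[0] == primary:
                return name
    return server_default


print(choose_language("zh-CN", "ko", "ko"))  # cookie wins
print(choose_language(None, "zh", "ko"))     # header beats the server default
print(choose_language(None, "", "ko"))       # nothing sent: server default
```

Keeping the server default as the last fallback mirrors the test described earlier, where the Korean site answered in Simplified Chinese once Accept-Language was set to zh.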

Second, recognizing the encoding of input

Identifying an encoding is difficult. A browser determines the encoding of a web page from two things:

· The charset in the HTTP response header

· The charset declared in the HTML page

The charset in the HTTP header takes priority over the charset in the HTML file. You can verify this with a small test:

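<?php
// The HTTP header declares ISO-8859-1 ...
header("Content-Type: text/html; charset=iso-8859-1");
?>
<!-- ... while the page itself declares GB2312 (save this file as GB2312) -->
<meta http-equiv="Content-Type" content="text/html; charset=gb2312">
<body>
中文内容 <!-- any Chinese text -->
</body>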

The PHP page above does not display the Chinese characters correctly, because the HTTP header specifies charset=iso-8859-1 and overrides the charset in the meta tag.
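The effect can be reproduced outside the browser. The sketch below is only an illustration in Python, not how any browser is implemented: `effective_charset` is a hypothetical helper that applies the header-before-meta priority, and the GB2312 bytes of "中文" are then decoded with whichever charset wins.

```python
def effective_charset(http_charset, meta_charset, fallback="utf-8"):
    """The HTTP header wins over <meta>; the fallback value is an assumption."""
    return http_charset or meta_charset or fallback


# The bytes a GB2312 page would actually send for the text "中文".
page_bytes = "中文".encode("gb2312")

charset = effective_charset("iso-8859-1", "gb2312")
garbled = page_bytes.decode(charset)    # what the reader sees
intended = page_bytes.decode("gb2312")  # what the author meant

print(charset)
print(garbled)
print(intended)
```

Decoding with ISO-8859-1 never fails, because every byte maps to some character, which is why the page shows garbage instead of an error.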

If neither is specified, the browser has a harder job. The usual approach is to guess from the byte values, checking which encoding's code range they fit most plausibly.

So how does a search site determine the encoding of the characters a user enters? Another test:

Submitting "中国" ("China") through a form on a UTF-8 page produces:

http://localhost/test.php?test=%E4%B8%AD%E5%9B%BD

Submitting "中国" through a form on a GB2312 page produces:

http://localhost/test.php?test=%D6%D0%B9%FA

%E4%B8%AD%E5%9B%BD is the URL-encoded form of the UTF-8 bytes of "中国", while %D6%D0%B9%FA is the URL-encoded form of its GB2312 bytes. This shows that the browser sends GET or POST data in the encoding of the current page.

So as long as the page is encoded in UTF-8 and the query data is parsed as UTF-8, the problem is solved.
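The two query strings from the form test can be reproduced with Python's standard `urllib.parse.quote`, which accepts the encoding to apply before percent-encoding. The wrapper name `form_value` is hypothetical and exists only for this illustration.

```python
from urllib.parse import quote


def form_value(text, page_charset):
    """Percent-encode text the way a form on a page in page_charset would."""
    return quote(text, safe="", encoding=page_charset)


utf8_value = form_value("中国", "utf-8")
gb_value = form_value("中国", "gb2312")

print(utf8_value)
print(gb_value)
```

The same two characters yield six bytes in UTF-8 and four in GB2312, so the server cannot interpret the query correctly without knowing which encoding the page used.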

Let's see how Google handles it. Google's pages are encoded in UTF-8. Search Google for "中国" and the URL is:

http://www.google.com/search?sourceid=navclient&hl=zh-CN&ie=UTF-8&rls=AMSA,AMSA:2006-11,AMSA:zh-CN&q=%e4%b8%ad%e5%9b%bd

There is the familiar %E4%B8%AD%E5%9B%BD again.

The URL above contains the item ie=UTF-8; some analysis shows that ie stands for input encoding. Let's trouble Google once more: without specifying the input encoding, query it first with the UTF-8 bytes and then with the GB2312 bytes:

http://www.google.com/search?q=%E4%B8%AD%E5%9B%BD

http://www.google.com/search?q=%D6%D0%B9%FA

Google recognizes both directly, the default-language encoding as well as UTF-8. How? UTF-8 byte sequences follow definite rules, so a judgment can be made from those rules, and that is how Google decides. But this judgment sometimes goes wrong, for example:

http://www.google.com/search?q=%D1%A7%CF%B0

%D1%A7%CF%B0 is the URL-encoded GB2312 form of the two-character word "学习" ("learning"). The error occurs because these GB2312 bytes also happen to form a valid UTF-8 sequence, so they are mistaken for UTF-8.

Based on Google's approach, two schemes can be proposed:

• Unified UTF-8 encoding: encode all pages in UTF-8 and parse query data as UTF-8

• A more complex scheme: try to parse the query data according to the UTF-8 rules, and if it does not conform, treat it as the encoding of the current language
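The second scheme can be sketched in a few lines. This is a minimal illustration in Python, assuming GB2312 as the "current language" encoding; the function name `decode_query` is hypothetical, and nothing here claims to be Google's actual detection logic.

```python
from urllib.parse import unquote_to_bytes


def decode_query(raw_value, local_charset="gb2312"):
    """Try strict UTF-8 first, then fall back to the local encoding."""
    data = unquote_to_bytes(raw_value)
    try:
        return data.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return data.decode(local_charset, errors="replace")


print(decode_query("%E4%B8%AD%E5%9B%BD"))  # valid UTF-8
print(decode_query("%D6%D0%B9%FA"))        # invalid UTF-8, GB2312 fallback
print(decode_query("%D1%A7%CF%B0"))        # GB2312 bytes that pass as UTF-8
```

The last call shows the same failure as the "学习" query: D1 A7 and CF B0 are each well-formed two-byte UTF-8 sequences, so the check passes and the text comes out as two unrelated characters. A rule-based check alone cannot catch this case, which is one argument for the unified UTF-8 scheme.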

Third, more effective search results

Users always want more search results. Chinese web pages come in GBK, Big5, UTF-8 and other encodings, and the various Japanese and Korean encodings also contain some Chinese characters. How can a search cover all of these pages for the user? I did not run any tests against Google this time. Based on background knowledge of the encodings, the following solutions are proposed:

• Simplified/Traditional conversion

The main idea is to search the user's request in both forms. Whether the user enters the Simplified or the Traditional form of a word such as "the public", both forms are searched at the same time.

• Convert the original web pages to UTF-8 for indexing

Converting the original pages to UTF-8 before indexing is very convenient, because the various encodings can then be handled uniformly.
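Both proposals can be illustrated with a short Python sketch. Everything in it is an assumption made for the example: the four-entry `S2T` table is a toy stand-in for a real Simplified/Traditional dictionary (the real mapping is not one-to-one), and `query_variants` and `to_utf8` are hypothetical names.

```python
# Toy Simplified -> Traditional table; a real one has thousands of entries.
S2T = {"众": "眾", "国": "國", "学": "學", "习": "習"}
T2S = {trad: simp for simp, trad in S2T.items()}


def query_variants(query):
    """Return the query as typed plus its Simplified and Traditional forms."""
    simplified = "".join(T2S.get(ch, ch) for ch in query)
    traditional = "".join(S2T.get(ch, ch) for ch in query)
    # dict.fromkeys removes duplicates while keeping the original first.
    return list(dict.fromkeys([query, simplified, traditional]))


def to_utf8(page_bytes, source_charset):
    """Re-encode a fetched page to UTF-8 before it is indexed."""
    return page_bytes.decode(source_charset).encode("utf-8")


print(query_variants("大众"))
print(query_variants("大眾"))

# Pages in different encodings become identical bytes once converted.
gbk_page = "中国".encode("gbk")
big5_page = "中國".encode("big5")
print(to_utf8(gbk_page, "gbk") == "中国".encode("utf-8"))
print(to_utf8(big5_page, "big5") == "中國".encode("utf-8"))
```

With every page stored as UTF-8, a query variant only needs to be matched against one byte representation per form, instead of one per source encoding.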
