Obtain keywords from search engines

Source: Internet
Author: User

Pages that users reach through a search keyword usually contain what they were looking for. Some pages, however (list pages, for example), match the searched keyword only loosely. For those we can use the keyword the visitor searched for to guide them to relevant content, which improves the user experience and also raises page views (PV).

The approach in this article is to obtain the referring page, analyze the structure of its URL, and extract the keyword. That part is simple. The article mainly describes how to tell several common URL encodings apart so that each can be decoded correctly. The application was implemented in ASP, so any code examples here were debugged under ASP. The idea is the same in other languages, and the implementation should be relatively simple.

Extracting keywords from various search engines

The major search engines all currently use GET requests, so the search parameters appear in the query string of the URL. For example, searching for "秋寒博客" (Qiu Han blog) produces:

Google: http://www.google.cn/search?sourceid=navclient&hl=zh-CN&ie=UTF-8&rlz=1t4ggll_zh-cn___cn352&q=%E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2

Baidu: http://www.baidu.com/s?wd=%C7%EF%BA%AE%B2%A9%BF%CD

Bing: http://cn.bing.com/search?q=%E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2&form=QBLH&filt=all

The keyword is the q=%E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2 part of the Google URL, the wd=%C7%EF%BA%AE%B2%A9%BF%CD part of the Baidu URL, and the q=%E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2 part of the Bing URL. Other search engines work in much the same way: the keyword field is visible in the URL, so a regular expression can extract it.

The regular expressions used to extract mainstream search keywords are as follows:

(?:yahoo.+?[\?|&]p=|openfind.+?query=|google.+?q=|lycos.+?query=|onseek.+?keyword=|search\.tom.+?word=|search\.qq\.com.+?word=|zhongsou\.com.+?word=|search\.msn\.com.+?q=|yisou\.com.+?p=|sina.+?word=|sina.+?query=|sina.+?_searchkey=|sohu.+?word=|sohu.+?key_word=|sohu.+?query=|163.+?q=|baidu.+?wd=|baidu.+?kw=|baidu.+?word=|3721\.com.+?p=|alltheweb.+?q=|soso.+?w=|115.+?q=|youdao.+?q=|sogou.+?query=|bing.+?q=|114.+?kw=)([^&]*)

The regular expression above is adapted from one circulating online, extended to support Soso, 115, Youdao, Sogou, Bing, and 114 (or 118114) searches. Thanks to whoever wrote the original. I cannot credit the author because the expression has been reposted many times without any indication of its source.
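As a rough illustration of how such a pattern is applied, here is a minimal sketch in Python rather than ASP. The pattern is trimmed to three engines for brevity, and the names `KEYWORD_RE` and `extract_encoded_keyword` are made up for this example. It assumes the referrer URL has already been obtained from the request.

```python
import re

# A shortened version of the article's pattern (Google, Baidu, Bing only).
# Matching is case-insensitive because host and parameter case can vary.
KEYWORD_RE = re.compile(
    r"(?:google.+?q=|baidu.+?wd=|bing.+?q=)([^&]*)",
    re.IGNORECASE,
)

def extract_encoded_keyword(referrer):
    """Return the still-encoded keyword from a referrer URL, or None."""
    match = KEYWORD_RE.search(referrer)
    return match.group(1) if match else None

baidu = "http://www.baidu.com/s?wd=%C7%EF%BA%AE%B2%A9%BF%CD"
bing = ("http://cn.bing.com/search?"
        "q=%E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2&form=QBLH&filt=all")

print(extract_encoded_keyword(baidu))  # %C7%EF%BA%AE%B2%A9%BF%CD
print(extract_encoded_keyword(bing))   # %E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2
```

The captured group is still URL-encoded at this point, so decoding it is a separate step.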

Encoding Type Recognition

GB2312 and UTF-8

The example above shows that the same search for "秋寒博客" (Qiu Han blog) yields different URL-encoded strings on different engines: "%E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2" on Google and Bing, but "%C7%EF%BA%AE%B2%A9%BF%CD" on Baidu. Anyone familiar with web page encodings will know why: different character sets encode the same character as different bytes. A UTF-8 page produces UTF-8 encoded strings by default and decodes as UTF-8 by default, and the same holds for GB2312. Google and Bing use UTF-8 while Baidu uses GB2312, so the same keyword produces different strings on different engines.
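The difference is easy to reproduce. The sketch below uses Python rather than ASP, purely for illustration, and encodes the example keyword "秋寒博客" (the "Qiu Han blog" search) under both character sets with the standard library's `urllib.parse.quote`.

```python
from urllib.parse import quote

keyword = "秋寒博客"

# Same characters, two character sets, two different byte sequences.
utf8_form = quote(keyword, encoding="utf-8")   # three bytes per character
gb_form = quote(keyword, encoding="gb2312")    # two bytes per character

print(utf8_form)  # %E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2
print(gb_form)    # %C7%EF%BA%AE%B2%A9%BF%CD
```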

Both UTF-8 and GB2312 can be decoded, but only if you know which encoding the string uses, and the two strings above carry no indication of that. The obvious first idea is to infer the encoding from the search engine. That is feasible and it works, but look at the regular expression above: this approach ends up as an If statement with a long chain of Or conditions. It is the approach commonly found online, but I don't think it is the best one. Here is what we know: 1. strings in %XX form are URL-encoded (as UTF-8 or GB2312); 2. in GB2312 a Chinese character takes two %XX groups, while in UTF-8 it takes three; 3. decoding GB2312-encoded text as UTF-8 produces garbled output.

Neither the length nor the byte range alone identifies the encoding, so we need another approach. Based on the three points above, consider what happens when GB2312-encoded text is decoded as UTF-8. Because UTF-8 uses three %XX groups per Chinese character and GB2312 uses two, a GB2312 string that does decode successfully as UTF-8 must come out shorter than expected. For example, the GB2312 URL encoding of "秋寒博客" (%C7%EF%BA%AE%B2%A9%BF%CD) has eight groups. Decoded as UTF-8, a successful result would be only about two and a half Chinese characters long. If decoding fails outright, the string is evidently GB2312.
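The hypothesis can be checked directly. This is a small sketch in Python rather than ASP, and `decodes_as_utf8` is a hypothetical helper name. It tries a strict UTF-8 decode of the percent-encoded bytes and reports whether the decode succeeded.

```python
from urllib.parse import unquote_to_bytes

def decodes_as_utf8(encoded):
    """Return True if the percent-encoded bytes form valid UTF-8."""
    try:
        # bytes.decode is strict by default, so invalid sequences raise.
        unquote_to_bytes(encoded).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(decodes_as_utf8("%E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2"))  # True
print(decodes_as_utf8("%C7%EF%BA%AE%B2%A9%BF%CD"))              # False
```

A failed decode settles the question, but a successful one does not: some GB2312 byte sequences can also be valid UTF-8, which is why a length comparison is still needed.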

Now we know how to identify the URL encoding type. The steps are as follows:

1. Get the encoded keyword string;

2. Count the %XX groups in it (X). For example, "%C7%EF%BA%AE%B2%A9%BF%CD" has 8;

3. Decode the encoded string as UTF-8;

4. If decoding fails (that is, the program raises an error), skip to step 8;

5. If decoding succeeds, take the length of the decoded string (Y) and compare it with the number of groups divided by 3;

6. If X / 3 != Y, skip to step 8;

7. If X / 3 = Y, the UTF-8 decoded string is the keyword;

8. Decode using GB2312. The resulting string is the keyword.

Note the following points about the steps above:

1. In step 1, strip out the URL escapes for non-Chinese characters and any plain English characters;

2. The encoded string used in steps 2, 3, and 5 is this stripped string;

3. The decoding in steps 7 and 8 must be applied to the original string.
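The procedure can be sketched as follows, in Python rather than the ASP used for the original application. The names `HIGH_BYTE` and `decode_keyword` are hypothetical, and the sketch assumes that "non-Chinese" escapes means escapes below %80. Like the heuristic itself, it assumes every non-ASCII character is a three-byte UTF-8 character, so it will misjudge UTF-8 text that contains two-byte or four-byte characters.

```python
import re
from urllib.parse import unquote_plus

# Escapes for bytes 0x80-0xFF, i.e. the ones that can belong to Chinese text.
HIGH_BYTE = re.compile(r"%([89A-Fa-f][0-9A-Fa-f])")

def decode_keyword(encoded):
    """Guess UTF-8 vs GB2312 for a URL-encoded keyword and decode it."""
    # Steps 1-2: drop ASCII escapes and plain characters, then count groups (X).
    groups = HIGH_BYTE.findall(encoded)
    x = len(groups)
    charset = "gb2312"  # used unless the UTF-8 checks below pass
    try:
        # Step 3: trial decode of the stripped bytes as strict UTF-8.
        trial = bytes.fromhex("".join(groups)).decode("utf-8")
        # Step 5 onward: accept UTF-8 only at three bytes per character.
        if x == 3 * len(trial):
            charset = "utf-8"
    except UnicodeDecodeError:
        pass  # decoding failed, so keep GB2312
    # Final decode works on the original string, with '+' treated as a space.
    return unquote_plus(encoded, encoding=charset, errors="replace")

print(decode_keyword("%E7%A7%8B%E5%AF%92%E5%8D%9A%E5%AE%A2"))  # 秋寒博客
print(decode_keyword("%C7%EF%BA%AE%B2%A9%BF%CD"))              # 秋寒博客
```

Comparing `x` with `3 * len(trial)` avoids the fractional result that dividing by 3 would give for the eight-group GB2312 string.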

Unicode

During testing I found that when Sogou switches from web search to image search, the URL encoding changes to Unicode escapes. This form is relatively easy to recognize: just check whether the string has the form \uXXXX or %uXXXX. Since that is simple, only the ASP decoding functions are shared here.

The following code is from the CSDN forum:

Method 1:

Response.Write vbsUnescape("\u5c0f\u867e\u7c73")

' Decode \uXXXX and %XX escapes in str
Function vbsUnescape(str)
    Dim i, s, c
    s = ""
    For i = 1 To Len(str)
        c = Mid(str, i, 1)
        If Mid(str, i, 2) = "\u" And i <= Len(str) - 5 Then
            ' \uXXXX: four hex digits follow the "\u" marker
            If IsNumeric("&H" & Mid(str, i + 2, 4)) Then
                s = s & ChrW(CInt("&H" & Mid(str, i + 2, 4)))
                i = i + 5
            Else
                s = s & c
            End If
        ElseIf c = "%" And i <= Len(str) - 2 Then
            ' %XX: two hex digits follow the "%" marker
            If IsNumeric("&H" & Mid(str, i + 1, 2)) Then
                s = s & ChrW(CInt("&H" & Mid(str, i + 1, 2)))
                i = i + 2
            Else
                s = s & c
            End If
        Else
            s = s & c
        End If
    Next
    vbsUnescape = s
End Function

Method 2:

s = "\u5c0f\u867e\u7c73"
s = Replace(s, "\u", "%u")
Response.Write Unescape(s)
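For readers working outside ASP, the same idea can be sketched in Python. This is an illustrative equivalent, not part of the original forum code, and `decode_unicode_escapes` is a hypothetical name. It handles both the \uXXXX and %uXXXX forms with one regular expression and leaves everything else untouched. Characters outside the Basic Multilingual Plane, which would arrive as surrogate pairs, are not handled.

```python
import re

# Either "\u" or "%u" followed by exactly four hex digits.
UNICODE_ESCAPE = re.compile(r"(?:\\u|%u)([0-9A-Fa-f]{4})")

def decode_unicode_escapes(text):
    """Replace each \\uXXXX or %uXXXX escape with the character it names."""
    return UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(decode_unicode_escapes(r"\u5c0f\u867e\u7c73"))  # 小虾米
print(decode_unicode_escapes("%u5c0f%u867e%u7c73"))   # 小虾米
```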

I could not find any function or method in C# that identifies the URL encoding format. Are there better approaches in PHP or Java? Discussion is welcome.

Source: submitted by reader Shen Li.
