Scripted keyword searches reveal the differences among the three major search engines

Source: Internet
Author: User
Search Engine Detection
Search engines now offer smart suggestion features: when you search for a keyword, the results page also includes links to related searches. For example, searching for "Movies" on Google or Baidu returns pages that contain search links such as "free movies", "online movies", "Movie Downloads", and many others. Presumably the engine uses the keyword to infer related content the user may be interested in and presents it.
This feature is useful for website promotion. One promotion method described on the Internet, intended to get a website's keywords associated with popular searches, is to submit combinations of those keywords to the engine in bulk so that the results the engine returns are influenced.
Suppose we have a few dozen keyword combinations on hand. The first approach was to have company staff open the search page and type in these keywords by hand. That depends too heavily on people, and a script is far better suited to such repetitive work. Of course, the request frequency must not be too high, or the engine may blacklist us. Whether the practice actually works, though, only the search engines know.

Our solution is to use Python, a scripting language that is very simple to use. A few statements are enough. For example (in Python 2):
import urllib
f = urllib.urlopen('http://www.google.com')
print f.read()
These statements fetch the Google homepage. To search for a keyword, you only need to modify the address passed to urlopen:
Baidu: f = urllib.urlopen('http://www.baidu.com/s?wd=%CA%D3%CE%C0%CD%F8')
Google China: f = urllib.urlopen('http://www.google.cn/search?hl=zh-CN&q=%E8%A7%86%E5%8D%AB%E7%BD%91&btnG=Google+%E6%90%9C%E7%B4%A2&meta=&aq=f')
Yahoo China: f = urllib.urlopen('http://search.cn.yahoo.com/search?ei=gbk&fr=fp-tab-web-ycn&pid=ysearch&source=yahoo_yhp_0706_search_button&p=%CA%D3%CE%C0%CD%F8')
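
The keyword in each address above is percent-encoded: the Baidu and Yahoo addresses carry GBK bytes, while the Google address carries UTF-8 bytes. As a minimal sketch in Python 3 (the article's own snippets are Python 2), the standard `urllib.parse` functions can decode both forms and rebuild them. The helper name `build_query` is only illustrative and does not come from the original script.

```python
from urllib.parse import quote, unquote

# Percent-encoded keywords copied from the addresses above.
BAIDU_WD = "%CA%D3%CE%C0%CD%F8"           # GBK bytes (wd= parameter)
GOOGLE_Q = "%E8%A7%86%E5%8D%AB%E7%BD%91"  # UTF-8 bytes (q= parameter)


def build_query(keyword, encoding):
    """Percent-encode a keyword using the byte encoding an engine expects."""
    return quote(keyword, safe="", encoding=encoding)


# Decode each form back to text using its own encoding.
keyword_from_baidu = unquote(BAIDU_WD, encoding="gbk")
keyword_from_google = unquote(GOOGLE_Q, encoding="utf-8")

# Both addresses appear to carry the same keyword in different encodings.
print(keyword_from_baidu == keyword_from_google)
print(build_query(keyword_from_google, "gbk"))
print(build_query(keyword_from_google, "utf-8"))
```

Keeping the encoding as an explicit argument makes it harder to send UTF-8 bytes to an engine that expects GBK, which would silently search for a different string.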
Run the scripts and the results differ. Baidu and Yahoo both respond normally, but Google returns an error page.

The error page says, roughly, that you do not have permission to access the resource. Strangely, copying the same address and entering it directly in a browser works without any problem. What secret weapon does Google have?
So I started a packet capture tool and captured the HTTP requests sent by the browser and by the Python script.

[Image: the HTTP request sent by the browser]

[Image: the HTTP request sent by the Python script]

Besides the Python request carrying fewer headers, the "User-Agent" value is different. Google is a heavy user of Python, so they have surely anticipated that someone would use Python scripts to hammer the engine. Presumably this kind of scripted access is judged illegitimate when the results are returned.
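
To see the "User-Agent" value that a plain urllib request announces, without a packet capture tool, the default headers of an opener can be inspected. This is a small sketch in Python 3, where `urllib.urlopen` lives in `urllib.request`; the function name `default_urllib_headers` is made up for this example.

```python
import urllib.request


def default_urllib_headers():
    """Return the headers a fresh urllib opener adds to every request."""
    opener = urllib.request.build_opener()
    return list(opener.addheaders)


# Prints a User-agent entry whose value starts with "Python-urllib/".
for name, value in default_urllib_headers():
    print(name, value)
```

A server only has to look for that prefix to tell this script apart from a browser.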

All three engines received the same kind of request, yet only Google protected itself. It seems the leader in search really is a cut above.

Google
The packet captures above show that the only reason Google fails to respond normally is the different "User-Agent" in the HTTP request. So what happens if we construct the request the way a real browser does?
Rewrite the script, this time using a different approach:

import httplib, urllib
params = urllib.urlencode({})  # empty request body
# Headers copied from a real Firefox request
headers = {
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "zh-cn,en;q=0.7,en-us;q=0.3",
    "Accept": "text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13",
}
conn = httplib.HTTPConnection("www.google.cn")
conn.request("GET", "/search?hl=zh-CN&q=%E8%A7%86%E5%8D%AB%E7%BD%91&btnG=Google+%E6%90%9C%E7%B4%A2&meta=&aq=f", params, headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()  # may be gzip-compressed because of Accept-Encoding
# print data
conn.close()
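
For readers on Python 3, where `httplib` became `http.client` and `urllib.urlopen` moved to `urllib.request`, the same idea can be sketched with `urllib.request.Request`. This is an illustration under assumptions, not the author's script: the function names and the sample query `q=test` are hypothetical, and only the User-Agent string comes from the script above.

```python
import urllib.request

# User-Agent string taken from the script above.
FIREFOX_UA = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) "
              "Gecko/20080311 Firefox/2.0.0.13")


def make_browser_like_request(url, user_agent=FIREFOX_UA):
    """Build (but do not send) a GET request carrying a browser User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Send the request and return (status, body bytes). Needs network access."""
    request = make_browser_like_request(url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


# Inspect the request locally; nothing is sent here.
req = make_browser_like_request("http://www.google.cn/search?hl=zh-CN&q=test")
print(req.get_method(), req.get_header("User-agent"))
```

Building the request separately from sending it makes it possible to check the headers before any traffic reaches the search engine.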

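Run it again while capturing packets: this time the script builds a request that imitates Firefox, and the access succeeds! After further experimentation, we found that the extra headers are not actually necessary and can be removed. A likely reason is that httplib, unlike urllib, does not add a User-Agent header of its own, so the request no longer announces itself as Python-urllib. The code can be simplified: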

import httplib
conn = httplib.HTTPConnection("www.google.cn")
conn.request("GET", "/search?hl=zh-CN&q=%E8%A7%86%E5%8D%AB%E7%BD%91&btnG=Google+%E6%90%9C%E7%B4%A2&meta=&aq=f")
response = conn.getresponse()
print response.status, response.reason
data = response.read()
conn.close()
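
The difference between the two libraries can be checked without touching any search engine. The sketch below, in Python 3, starts a throwaway HTTP server on localhost that records the "User-Agent" header of each request, then sends one request through urllib and one through `http.client` (the Python 3 name for `httplib`). The names `observe_user_agents` and `RecordingHandler` are invented for this example.

```python
import http.client
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def observe_user_agents():
    """Return the User-Agent values sent by urllib and by http.client."""
    seen = []

    class RecordingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Record the header (None if the client did not send one).
            seen.append(self.headers.get("User-Agent"))
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # silence per-request logging

    server = HTTPServer(("127.0.0.1", 0), RecordingHandler)  # port 0: any free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        # urllib, with proxies disabled so the request reaches the local server.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        opener.open("http://127.0.0.1:%d/" % port, timeout=5).read()

        # http.client with no explicit headers, as in the simplified script.
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return seen


urllib_ua, http_client_ua = observe_user_agents()
print("urllib:", urllib_ua)
print("http.client:", http_client_ua)
```

The urllib request identifies itself as Python-urllib, while the bare `http.client` request carries no User-Agent at all, which is consistent with the simplified script getting through.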

The Rest Is Grunt Work
All that remains is to plug in the keywords to search for and add the timing and loop logic. It is all just grunt work.
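
The timing and loop logic might look like the following Python 3 sketch. Everything here is an assumption made for illustration: the function names, the 30-second default delay, and the choice of the Baidu address pattern from earlier in the article. The `fetch` and `sleep` arguments are injected so the loop can be dry-run without sending any requests.

```python
import time
from urllib.parse import urlencode


def build_search_url(keyword, base="http://www.baidu.com/s", param="wd",
                     encoding="gbk"):
    """Build a search address with the keyword percent-encoded for the engine."""
    return base + "?" + urlencode({param: keyword}, encoding=encoding)


def run_searches(keywords, fetch, rounds=1, delay_seconds=30.0, sleep=time.sleep):
    """Call fetch(url) once per keyword per round, pausing after each request."""
    urls = []
    for _ in range(rounds):
        for keyword in keywords:
            url = build_search_url(keyword)
            fetch(url)
            urls.append(url)
            sleep(delay_seconds)  # keep the request rate low
    return urls


# Dry run: print the addresses instead of requesting them, with no delay.
run_searches(["free movies", "online movies"], fetch=print, delay_seconds=0)
```

Passing a real fetch function in place of `print` would turn the dry run into actual requests; the delay is the knob that keeps the frequency low enough to avoid being blacklisted.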
