Example of writing a Python script to get Google search results

This article introduces an example of writing a Python script to fetch Google search results; it is also a simple exercise in programming a crawler with Python. I have been studying how to capture search engine results with Python for a while and ran into many problems along the way. I have recorded everything I encountered, in the hope that readers facing the same issues will not have to take the same detours.

1. Search engine selection

Choosing a good search engine means you can get more accurate search results. I have used four search engines: Google, Bing, Baidu, and Yahoo!. As a programmer, I prefer Google. But when I saw that my beloved Google returned nothing but a pile of JS code, with none of the search results I wanted, I switched to the Bing camp. After a while I found that Bing's results were not ideal for my problem. Just as I was getting desperate, Google saved me: Google offers another search entry point to take care of users whose browsers have JS disabled. See the following search URL:

https://www.google.com.hk/search?hl=en&q=hello

hl specifies the language of the search and q is the keyword to search for. OK, thanks to Google, the search result page now contains the content I want to capture.
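If you prefer not to assemble the query string by hand, urllib can build it for you. This is just a small illustration; the keyword 'hello world' is only an example, not something from the original script:

import urllib

# Build the hl/q query string described above
params = urllib.urlencode({'hl': 'en', 'q': 'hello world'})
url = 'https://www.google.com.hk/search?' + params
print url  # e.g. https://www.google.com.hk/search?hl=en&q=hello+world (parameter order may vary)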

PS: Many Python crawlers for Google search results use the https://ajax.googleapis.com/ajax/services/search/web... approach. Note that Google no longer recommends this method; see https://developers.google.com/web-search/docs. Google now provides the Custom Search API instead, but that API is limited to 100 requests per day; if you need more, you have to pay.
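For completeness, here is a rough sketch of what a call to the Custom Search API mentioned above could look like. YOUR_API_KEY and YOUR_CX are placeholders you would have to obtain from Google yourself, and the response fields used below are my assumption about the JSON API, not part of the original article:

import json
import urllib
import urllib2

# Placeholders: get a real API key and custom search engine id (cx) from Google
params = urllib.urlencode({'key': 'YOUR_API_KEY', 'cx': 'YOUR_CX', 'q': 'hello world'})
response = urllib2.urlopen('https://www.googleapis.com/customsearch/v1?' + params)
data = json.loads(response.read())
for item in data.get('items', []):
    print item['title'], item['link']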

2. Capture and analyze web pages using Python

It is very convenient to fetch web pages with Python. See the following code:

import urllib2

def search(self, queryStr):
    # URL-encode the keyword and build the Google search URL
    queryStr = urllib2.quote(queryStr)
    url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    # html now holds the source of the search result page
    html = response.read()
    results = self.extractSearchResults(html)
    return results


The html variable holds the source of the search result page we crawled. Python provides both the urllib and urllib2 modules for URL requests, but they offer different functionality: urllib can only accept a URL, whereas urllib2 can also accept an instance of the Request class, which lets you set the headers of the URL request. In other words, you can disguise your user agent (we will use this below).
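To make the difference concrete, here is a small side-by-side sketch; the user agent string is just an example:

import urllib
import urllib2

url = 'https://www.google.com.hk/search?hl=en&q=hello'

# urllib: takes only a URL string, no way to set request headers here
page1 = urllib.urlopen(url).read()

# urllib2: takes a Request object, so headers such as User-agent can be set
req = urllib2.Request(url)
req.add_header('User-agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64)')
page2 = urllib2.urlopen(req).read()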

Now that we can fetch and save a web page with Python, the next step is to extract the desired search results from the page source. Python provides the HTMLParser module, but it is relatively cumbersome to use. Here I recommend a very good web parsing package, BeautifulSoup; its official website documents its usage in detail, so I will not repeat it here.
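The article does not show extractSearchResults, so here is only a rough sketch of how it might be written with BeautifulSoup. The assumption that each result title sits in an <h3> tag containing a link is mine; Google's markup changes often, so inspect the page and adjust the selectors:

from bs4 import BeautifulSoup

def extractSearchResults(html):
    results = []
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: each organic result title is an <h3> wrapping an <a> tag
    for h3 in soup.find_all('h3'):
        a = h3.find('a')
        if a is not None and a.get('href'):
            results.append({'title': a.get_text(), 'url': a['href']})
    return results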

With the code above, a small number of queries works reasonably well, but if you want to run thousands of queries, this method no longer works: Google detects the source of your requests, and if a machine crawls its search results too frequently, Google will block your IP address in a short time and return a 503 error page instead. That is not the result we want, so we keep exploring.
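One practical detail: the 503 block shows up as an HTTP error, so it can be caught and used as a signal to back off. This is only a sketch; the 60-second wait is an arbitrary choice, not a value from the original code:

import time
import urllib2

def fetch(url):
    try:
        return urllib2.urlopen(urllib2.Request(url)).read()
    except urllib2.HTTPError as e:
        if e.code == 503:
            # Google is rate-limiting this IP; wait a while before trying again
            time.sleep(60)
        return None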

We mentioned earlier that urllib2 lets us set the headers of a URL request to disguise our user agent. In short, the user agent is an identifying string that browsers and other client applications (mail clients, search engine spiders) send to the server with every HTTP request, so the server knows which browser, mail client, or spider the user is accessing it with. Sometimes, for a particular purpose, we have to innocently fool the server and tell it that we are not a machine accessing it.

So our code looks like this:

import random
import urllib2

user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
               'IBM WebExplorer /v0.94',
               'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
               'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
               'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']

def search(self, queryStr):
    queryStr = urllib2.quote(queryStr)
    url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
    request = urllib2.Request(url)
    # Pick one of the ten user agent strings at random and attach it to the
    # request so each query looks like it comes from a different browser
    index = random.randint(0, 9)
    user_agent = user_agents[index]
    request.add_header('User-agent', user_agent)
    response = urllib2.urlopen(request)
    html = response.read()
    results = self.extractSearchResults(html)
    return results


Do not be scared by the user_agents list; it is just 10 user agent strings, which lets us disguise ourselves better. If you need more user agents, see UserAgentString.

The three lines starting at index = random.randint(0, 9) pick a user agent string at random and attach it to the request with the add_header method of the Request object, disguising the user agent.

By disguising the user agent, we can keep capturing search engine results continuously. If that is still not enough, I recommend sleeping for a period between every two queries; this slows down crawling, but it lets you capture results more steadily. If you have multiple IP addresses, the crawling speed can be increased further.
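The pause between queries can be as simple as the following sketch; the 10-30 second range and the crawler variable are my own examples, not from the original code:

import random
import time

# crawler is assumed to be an instance of the crawler class shown above
for query in ['python', 'web crawler', 'user agent']:
    results = crawler.search(query)
    # Rest for a random interval between two consecutive queries
    time.sleep(random.uniform(10, 30))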

All of the source code for this article is on GitHub; you can download it from the following URL:

https://github.com/meibenjin/GoogleSearchCrawler
