For a while now I have been studying how to crawl search engine results with Python. I ran into quite a few problems along the way, so I am recording them here in the hope that anyone who hits the same issues can avoid the detours I took.
1. Selection of search engines
Choosing a good search engine means getting more accurate search results. I have used four: Google, Bing, Baidu, and Yahoo!. As a programmer, my first choice was Google, but all my beloved Google returned was a pile of JS code, with none of the search results I wanted. So I switched to Bing, and after a while I found that Bing's results were not ideal for my problem. Just as I was getting desperate, Google saved me: it turns out that, to accommodate users whose browsers have JS disabled, Google offers another way to search. See the following search URL:
https://www.google.com.hk/search?hl=en&q=hello
hl specifies the language of the search, and q is the keyword you want to search for. Thanks to Google, this search results page contains exactly what I want to crawl.
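For example, a minimal sketch of building such a URL in Python (using urllib2.quote, as in the code later in this post; the "hello world" keyword is just an illustration):

import urllib2

# URL-encode the keyword and plug it into the search URL described above.
query = urllib2.quote('hello world')   # -> 'hello%20world'
url = 'https://www.google.com.hk/search?hl=en&q=%s' % query
print url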
PS: Many articles on the web crawl Google search results through https://ajax.googleapis.com/ajax/services/search/web ... Note that Google no longer recommends this method; see https://developers.google.com/web-search/docs/. Google now offers the Custom Search API instead, but that API is limited to 100 requests per day for free; if you need more, you have to pay.
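For completeness, here is a minimal sketch of what a Custom Search API call could look like, assuming you have already created an API key and a custom search engine ID (the api_key and cx values below are placeholders, not part of the original code):

import json
import urllib2

def custom_search(query, api_key, cx):
    # The Custom Search JSON API returns its results as JSON under the 'items' key.
    url = ('https://www.googleapis.com/customsearch/v1?key=%s&cx=%s&q=%s'
           % (api_key, cx, urllib2.quote(query)))
    response = urllib2.urlopen(url)
    data = json.loads(response.read())
    return [(item['title'], item['link']) for item in data.get('items', [])]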
2. Crawling and parsing web pages with Python
Crawling web pages with Python is convenient; without further ado, here is the code:
def search(self, queryStr):
    queryStr = urllib2.quote(queryStr)
    url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    html = response.read()
    results = self.extractSearchResults(html)
The html on line 6 is the source of the search results page we crawled. Anyone who has used Python will notice that it provides two URL-request modules, urllib and urllib2, with different capabilities: urllib only accepts a URL, while urllib2 also accepts an instance of the Request class, which lets you set the headers of the request. That means you can disguise your user agent and so on (which we will use below).
Now that we can crawl pages with Python and save them, we can extract the search results we want from the page source. Python ships with the HTMLParser module, but it is relatively cumbersome to use. Here I recommend a very handy page-parsing package, BeautifulSoup; there are detailed introductions to BeautifulSoup elsewhere on the web, so I will not repeat them here.
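For reference, here is a minimal sketch of what extractSearchResults could look like with BeautifulSoup. The selectors are an assumption based on Google's markup at the time (organic results inside h3 elements with class "r") and will need adjusting whenever Google changes its page structure:

from bs4 import BeautifulSoup   # assumes the bs4 package is installed

def extractSearchResults(self, html):
    results = []
    soup = BeautifulSoup(html)
    # Each organic result is assumed to be an <h3 class="r"> wrapping a link.
    for h3 in soup.findAll('h3', {'class': 'r'}):
        link = h3.find('a')
        if link is None:
            continue
        results.append((link.getText(), link.get('href')))
    return results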
The code above works fine for a small number of queries, but if you want to run thousands of them, it no longer does. Google detects the source of your requests, and if it sees a machine frequently crawling its search results, it will soon block your IP and return a 503 error page instead. That is not the result we want, so we keep exploring.
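As an aside, here is a minimal sketch of how that block shows up in code: urllib2 raises HTTPError for non-2xx responses, so the 503 can be caught explicitly (the back-off below is my own addition, not part of the original code):

import time
import urllib2

def fetch(request):
    try:
        return urllib2.urlopen(request).read()
    except urllib2.HTTPError as e:
        if e.code == 503:
            # Google is throttling us; back off for a while before retrying.
            time.sleep(60)
        raise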
As mentioned earlier, with urllib2 we can set the headers of a URL request and disguise our user agent. Simply put, the user agent is a string that client applications such as browsers send to the server with every HTTP request, so the server knows which browser (or mail client, or search engine spider) is accessing it. Sometimes, to achieve our goal, we have to tell the server a well-intentioned lie: I am not a machine accessing you.
So our code now looks like this:
user_agents = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20130406 Firefox/23.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:18.0) Gecko/20100101 Firefox/18.0',
               'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533+ (KHTML, like Gecko) Element Browser 5.0',
               'IBM WebExplorer /v0.94',
               'Galaxy/1.0 [en] (Mac OS X 10.5.6; U; en)',
               'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)',
               'Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14',
               'Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25',
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1468.0 Safari/537.36',
               'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0; TheWorld)']

def search(self, queryStr):
    queryStr = urllib2.quote(queryStr)
    url = 'https://www.google.com.hk/search?hl=en&q=%s' % queryStr
    request = urllib2.Request(url)
    index = random.randint(0, 9)
    user_agent = user_agents[index]
    request.add_header('User-Agent', user_agent)
    response = urllib2.urlopen(request)
    html = response.read()
    results = self.extractSearchResults(html)
Don't be scared off by the user_agents list: it is just 10 user agent strings, there to make our disguise more convincing. If you need more user agent strings, see useragentstring.com.
The lines index = random.randint(0, 9) and user_agent = user_agents[index] randomly pick one of the user agent strings, and request.add_header('User-Agent', user_agent) then attaches it to the request, disguising our user agent.
Disguising the user agent lets us keep crawling search engine results. If that is still not enough, I suggest sleeping for a random interval between every two queries. This slows the crawl down, but lets you keep crawling results for much longer; and if you have multiple IPs, the crawl speeds up as well. A small sketch of the random-sleep idea follows.
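This sketch assumes the search() method shown above; the searcher object, the sample queries, and the delay bounds are placeholders to adapt to your own code:

import random
import time

queries = ['hello', 'world', 'python']
for queryStr in queries:
    results = searcher.search(queryStr)   # `searcher` is whatever object defines search()
    time.sleep(random.uniform(10, 30))    # pause 10-30 seconds between queries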