(1) urllib2 + BeautifulSoup to crawl Google search links
Recently, a project I am working on needs to process Google search results, which is how I came to learn about Python's tools for working with web pages. In practice I used urllib2 and BeautifulSoup to crawl the pages, but when crawling Google search results I found that if you process the source code of the results page directly, you end up with a lot of "dirty" links.
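For reference, here is a minimal sketch of that direct-scraping attempt (Python 2; the search URL, the User-Agent header, and the BeautifulSoup version are illustrative, not exactly what I used):

# -*- coding: utf-8 -*-
import urllib, urllib2
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3; for bs4 use: from bs4 import BeautifulSoup

query = 'Titanic James'
url = 'http://www.google.com/search?q=' + urllib.quote(query)
request = urllib2.Request(url, None, {'User-agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()

soup = BeautifulSoup(html)
# Grabbing every <a href> from the raw results page is exactly where the
# "dirty" links (cache links, related searches, Google's own navigation)
# show up mixed in with the real result links.
for a in soup.findAll('a', href=True):
    print a['href']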
Take the results of a search for "Titanic James" as an example:
In the figure, the links marked in red are not needed; the links marked in blue are the ones we want to crawl.
These "dirty" links can of course be filtered out with hand-written rules, but that pushes up the complexity of the program. Just as I was writing the filtering rules, a classmate reminded me that Google should provide a relevant API, and it suddenly dawned on me.
(2) Google Web Search API + multithreading
An example of a search using Python is given in the documentation:
import urllib2
import simplejson

# The request also includes the userip parameter which provides the end
# user's IP address. Doing so will help distinguish this legitimate
# server-side traffic from traffic which doesn't come from an end-user.
url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=paris%20hilton&userip=USERS-IP-ADDRESS')

request = urllib2.Request(
    url, None, {'Referer': 'http://www.example.com'})  # enter the URL of your site here
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
In real applications you may need to crawl many of the Google result pages, so multithreading is used to share the crawling work. For more information on using the Google Web Search API, see the official documentation (the standard URL arguments are described there). Also pay special attention to the URL parameter rsz: it must be a value of 8 or less; if it is greater than 8, the request returns an error!
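To make the paging parameters concrete, here is a small sketch (the query string is illustrative) of how rsz and start combine when building the request URLs; it mirrors the URL construction used in the full implementation below:

# -*- coding: utf-8 -*-
# rsz is the number of results per request and must be 8 or less;
# start is the zero-based index of the first result on that page.
import urllib

keywords = 'Titanic James'
rnum_perpage = 8            # the maximum the API accepts
pages = 8

for x in range(pages):
    start = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, start)
    print url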
(3) Code implementation
The code still has problems: it runs, but its robustness is poor and it needs further improvement. I hope more experienced readers will point out my mistakes (I am a Python beginner); much appreciated.
# -*- coding: utf-8 -*-
import urllib2, urllib
import simplejson
import os, time, threading
import common, html_filter

# input the keywords
keywords = raw_input('Enter the keywords: ')

# define rnum_perpage, pages
rnum_perpage = 8
pages = 8

# thread function: fetch one page of API results, then download and save each linked page
def thread_scratch(url, rnum_perpage, page):
    url_set = []
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        response = urllib2.urlopen(request)
        # Process the JSON string.
        results = simplejson.load(response)
        info = results['responseData']['results']
    except Exception, e:
        print 'error occurred'
        print e
    else:
        for minfo in info:
            url_set.append(minfo['url'])
            print minfo['url']
    # process the links
    i = 0
    for u in url_set:
        try:
            request_url = urllib2.Request(u, None, {'Referer': 'http://www.sina.com'})
            request_url.add_header('User-agent', 'CSC')
            response_data = urllib2.urlopen(request_url).read()
            # filter the page
            #content_data = html_filter.filter_tags(response_data)
            # write to file
            filenum = i + page
            filename = dir_name + '/related_html_' + str(filenum)
            print '  write start: related_html_' + str(filenum)
            f = open(filename, 'w+', -1)
            f.write(response_data)
            #print content_data
            f.close()
            print '  write down: related_html_' + str(filenum)
        except Exception, e:
            print 'error occurred 2'
            print e
        i = i + 1
    return

# create the output directory
dir_name = 'related_html_' + urllib.quote(keywords)
if os.path.exists(dir_name):
    print 'exists file'
    common.delete_dir_or_file(dir_name)
os.makedirs(dir_name)

# crawl the web pages
print 'Start to scratch web pages:'
for x in range(pages):
    print "page: %s" % (x + 1)
    page = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, page)
    print url
    t = threading.Thread(target=thread_scratch, args=(url, rnum_perpage, page))
    t.start()

# the main thread waits for the child threads to finish
main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    t.join()
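For completeness, a hypothetical stand-in for the common.delete_dir_or_file helper imported above; the common module is not shown in this post, so this only guesses at its behaviour from how it is called (removing an existing file or directory tree before the directory is re-created):

import os, shutil

def delete_dir_or_file(path):
    # Hypothetical helper: remove a directory tree or a single file if it exists.
    if os.path.isdir(path):
        shutil.rmtree(path)
    elif os.path.exists(path):
        os.remove(path)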