Crawling Google search result pages with Python and multithreading

(1) urllib2 + BeautifulSoup to crawl Google search links

A project I recently joined needs to process Google search results, which led me to learn about Python's tools for working with web pages. In practice I used urllib2 and BeautifulSoup to crawl web pages, but when crawling Google search results I found that processing the source of the results page directly yields a lot of "dirty" links.

Take, for example, the results of a search for "Titanic James":

In the figure, the links marked in red are not needed, while those marked in blue are the ones to crawl.

This "dirty link" can of course be filtered out by the method of rule filtering, but the complexity of the program is high. Just when I was writing the filter rules. Students reminded that Google should provide the relevant API, only to suddenly understand.

(2) Google Web Search API + multithreading

An example of a search using Python is given in the documentation:

import urllib2
import simplejson

# The request also includes the userip parameter, which provides the end
# user's IP address. Doing so helps distinguish this legitimate
# server-side traffic from traffic that doesn't come from an end user.
url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=paris%20hilton&userip=USERS-IP-ADDRESS')

request = urllib2.Request(
    url, None, {'Referer': 'http://www.example.com'})  # enter the URL of your site here
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
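
The useful part of the parsed response is the list of hits, which lives under responseData -> results; the code later in this post reads results['responseData']['results'] and the 'url' field of each hit. A minimal sketch of walking the hits might look like this (the 'titleNoFormatting' field is assumed from the same API response format):

import urllib2
import simplejson

url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=paris%20hilton&userip=USERS-IP-ADDRESS')
request = urllib2.Request(url, None, {'Referer': 'http://www.example.com'})
results = simplejson.load(urllib2.urlopen(request))

# each hit carries fields such as 'url' and 'titleNoFormatting'
for item in results['responseData']['results']:
    print item['url'], item['titleNoFormatting']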

In real applications you may need to crawl many pages of Google results, so multithreading is used to divide the crawl tasks. For more information on using the Google Web Search API, see the documentation (the standard URL arguments are described there). Pay special attention to the URL parameter rsz: its value must be 8 or less; anything greater than 8 causes an error!
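
To make the paging concrete, here is a minimal sketch of how rsz and start fit together, using the same rnum_perpage=8 and pages=8 values as the code below: each request asks for 8 results and advances start by 8.

import urllib

keywords = 'Titanic James'
rnum_perpage = 8   # rsz: must be 8 or less
pages = 8

for x in range(pages):
    start = x * rnum_perpage   # 0, 8, 16, ..., 56
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s'
           % (urllib.quote(keywords), rnum_perpage, start))
    print url   # each of these URLs is handed to one worker thread in the code below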

(3) Code implementation

The implementation still has problems: it runs, but its robustness is poor and it needs improvement. I'm new to Python, so I'd greatly appreciate having mistakes pointed out.

#-*- coding: utf-8 -*-
import urllib2, urllib
import simplejson
import os, time, threading

import common, html_filter

# input the keywords
keywords = raw_input('Enter the keywords: ')

# define rnum_perpage, pages
rnum_perpage = 8
pages = 8

# thread function
def thread_scratch(url, rnum_perpage, page):
    url_set = []
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        response = urllib2.urlopen(request)
        # Process the JSON string.
        results = simplejson.load(response)
        info = results['responseData']['results']
    except Exception, e:
        print 'error occured'
        print e
    else:
        for minfo in info:
            url_set.append(minfo['url'])
            print minfo['url']
    # process the links
    i = 0
    for u in url_set:
        try:
            request_url = urllib2.Request(u, None, {'Referer': 'http://www.sina.com'})
            request_url.add_header('User-agent', 'CSC')
            response_data = urllib2.urlopen(request_url).read()
            # filter the file
            #content_data = html_filter.filter_tags(response_data)
            # write to file
            filenum = i + page
            filename = dir_name + '/related_html_' + str(filenum)
            print '  write start: related_html_' + str(filenum)
            f = open(filename, 'w+', -1)
            f.write(response_data)
            #print content_data
            f.close()
            print '  write down: related_html_' + str(filenum)
        except Exception, e:
            print 'error occured 2'
            print e
        i = i + 1
    return

# create the folder
dir_name = 'related_html_' + urllib.quote(keywords)
if os.path.exists(dir_name):
    print 'exists file'
    common.delete_dir_or_file(dir_name)
os.makedirs(dir_name)

# crawl the web pages
print 'Start to scratch web pages:'
for x in range(pages):
    print "page: %s" % (x + 1)
    page = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, page)
    print url
    t = threading.Thread(target=thread_scratch, args=(url, rnum_perpage, page))
    t.start()

# the main thread waits for the worker threads to finish crawling
main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    t.join()
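
One possible direction for the robustness improvements mentioned above (a sketch, not part of the original code): replace the one-thread-per-page pattern with a small pool of workers draining a Queue, and give each fetch a timeout. The pool size of 4 and the 10-second timeout are arbitrary choices here, and search_urls stands in for the URLs built by the paging loop above.

import Queue
import threading
import urllib2

task_queue = Queue.Queue()

def worker():
    while True:
        try:
            url = task_queue.get_nowait()
        except Queue.Empty:
            return
        try:
            response = urllib2.urlopen(url, timeout=10)
            data = response.read()
            # ... write data to a file as in thread_scratch() ...
        except Exception, e:
            print 'error occured', e
        finally:
            task_queue.task_done()

# fill the queue with the search URLs built above, then start a few workers
# for url in search_urls: task_queue.put(url)
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()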