(1) urllib2 + BeautifulSoup to crawl Google search links
Recently, a project I am working on needs to process Google search results, which is how I came to learn about Python's tools for working with web pages. In practice I used urllib2 and BeautifulSoup to crawl the pages, but when crawling Google search results I found that if you process the source code of the results page directly, you end up with a lot of "dirty" links.
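For reference, here is a minimal sketch of that direct-scraping attempt (Python 2; the search URL, the User-Agent header, and the BeautifulSoup version are illustrative, not exactly what I used):

# -*- coding: utf-8 -*-
import urllib, urllib2
from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3; for bs4 use: from bs4 import BeautifulSoup

query = 'Titanic James'
url = 'http://www.google.com/search?q=' + urllib.quote(query)
request = urllib2.Request(url, None, {'User-agent': 'Mozilla/5.0'})
html = urllib2.urlopen(request).read()

soup = BeautifulSoup(html)
# Grabbing every <a href> from the raw results page is exactly where the
# "dirty" links (cache links, related searches, Google's own navigation)
# show up mixed in with the real result links.
for a in soup.findAll('a', href=True):
    print a['href']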
Take the results of a search for "Titanic James" as an example:
In the figure, the links marked in red are not needed; the links marked in blue are the ones we want to crawl.
These "dirty" links can of course be filtered out with hand-written rules, but that pushes up the complexity of the program. Just as I was writing the filtering rules, a classmate reminded me that Google should provide a relevant API, and it suddenly dawned on me.
(2) Google Web Search API + multithreading
An example of a search using Python is given in the documentation:
import urllib2
import simplejson

# The request also includes the userip parameter which provides the end
# user's IP address. Doing so will help distinguish this legitimate
# server-side traffic from traffic which doesn't come from an end-user.
url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=paris%20hilton&userip=USERS-IP-ADDRESS')

request = urllib2.Request(
    url, None, {'Referer': 'http://www.example.com'})  # enter the URL of your site here
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
In real applications you may need to crawl many of the Google result pages, so multithreading is used to share the crawling work. For more information on using the Google Web Search API, see the official documentation (the standard URL arguments are described there). Also pay special attention to the URL parameter rsz: it must be a value of 8 or less; if it is greater than 8, the request returns an error!
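To make the paging parameters concrete, here is a small sketch (the query string is illustrative) of how rsz and start combine when building the request URLs; it mirrors the URL construction used in the full implementation below:

# -*- coding: utf-8 -*-
# rsz is the number of results per request and must be 8 or less;
# start is the zero-based index of the first result on that page.
import urllib

keywords = 'Titanic James'
rnum_perpage = 8            # the maximum the API accepts
pages = 8

for x in range(pages):
    start = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, start)
    print url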
(3) Code implementation
The code still has problems: it runs, but its robustness is poor and it needs further improvement. I hope more experienced readers will point out my mistakes (I am a Python beginner); much appreciated.
# -*- coding: utf-8 -*-
import urllib2, urllib
import simplejson
import os, time, threading
import common, html_filter

# input the keywords
keywords = raw_input('Enter the keywords: ')

# define rnum_perpage, pages
rnum_perpage = 8
pages = 8

# thread function: fetch one page of API results, then download and save each linked page
def thread_scratch(url, rnum_perpage, page):
    url_set = []
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        response = urllib2.urlopen(request)
        # Process the JSON string.
        results = simplejson.load(response)
        info = results['responseData']['results']
    except Exception, e:
        print 'error occurred'
        print e
    else:
        for minfo in info:
            url_set.append(minfo['url'])
            print minfo['url']
    # process the links
    i = 0
    for u in url_set:
        try:
            request_url = urllib2.Request(u, None, {'Referer': 'http://www.sina.com'})
            request_url.add_header('User-agent', 'CSC')
            response_data = urllib2.urlopen(request_url).read()
            # filter the page
            #content_data = html_filter.filter_tags(response_data)
            # write to file
            filenum = i + page
            filename = dir_name + '/related_html_' + str(filenum)
            print '  write start: related_html_' + str(filenum)
            f = open(filename, 'w+', -1)
            f.write(response_data)
            #print content_data
            f.close()
            print '  write down: related_html_' + str(filenum)
        except Exception, e:
            print 'error occurred 2'
            print e
        i = i + 1
    return

# create the output directory
dir_name = 'related_html_' + urllib.quote(keywords)
if os.path.exists(dir_name):
    print 'exists file'
    common.delete_dir_or_file(dir_name)
os.makedirs(dir_name)

# crawl the web pages
print 'Start to scratch web pages:'
for x in range(pages):
    print "page: %s" % (x + 1)
    page = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, page)
    print url
    t = threading.Thread(target=thread_scratch, args=(url, rnum_perpage, page))
    t.start()

# the main thread waits for the child threads to finish
main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    t.join()
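For completeness, a hypothetical stand-in for the common.delete_dir_or_file helper imported above; the common module is not shown in this post, so this only guesses at its behaviour from how it is called (removing an existing file or directory tree before the directory is re-created):

import os, shutil

def delete_dir_or_file(path):
    # Hypothetical helper: remove a directory tree or a single file if it exists.
    if os.path.isdir(path):
        shutil.rmtree(path)
    elif os.path.exists(path):
        os.remove(path)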