Multi-threaded crawling of Google search result pages in Python
(1) urllib2 + BeautifulSoup to capture Google search links
Recently, a project I am working on needs to process Google search results, so I have been learning the Python tools for handling web pages. In practice, urllib2 and BeautifulSoup are used to fetch pages; however, when capturing Google search results, I found that processing the HTML source of the results page directly yields a lot of "dirty" links.
The search keyword is "Titanic James":
In the figure, the links marked in red are not needed; the links marked in blue are the ones we want to capture.
These "dirty" links could be filtered out with hand-written rules, but that adds considerable complexity to the program. Just as I was puzzling over the filtering rules, I was reminded that Google provides a related API that should make this much easier.
(2) Google Web Search API + Multithreading
The API documentation provides an example of searching with Python:
import urllib2
import simplejson

# The request also includes the userip parameter which provides the end
# user's IP address. Doing so will help distinguish this legitimate
# server-side traffic from traffic which doesn't come from an end-user.
url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=Paris%20Hilton&userip=USERS-IP-ADDRESS')

request = urllib2.Request(
    url, None, {'Referer': 'http://www.example.com'})  # enter the URL of your site here
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
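Each entry of results['responseData']['results'] in the response is a dictionary; the crawler in section (3) reads its 'url' field to fetch the actual result page.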
In practice, many Google result pages may need to be crawled, so multithreading is used to split the capture work among threads. For more details on the Google Web Search API, see its documentation (the standard URL arguments are described there). Also note that the rsz parameter in the URL must be 8 or less; values greater than 8 cause the API to return an error.
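To make the paging arithmetic concrete, here is a minimal sketch that only builds and prints the request URLs, assuming the same endpoint as the example above (the keyword is illustrative): rsz is the number of results per request (at most 8) and start is the zero-based offset of the first result, so page x begins at x * rnum_perpage.

import urllib

keywords = 'Titanic James'   # illustrative query
rnum_perpage = 8             # rsz: at most 8 results per request
for x in range(4):           # first four pages
    start = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords),
                                             rnum_perpage, start)
    print url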
(3) Code implementation
The code below runs, but it still has problems: robustness is poor and it needs improvement (one possible hardening of the error handling is sketched after the code). I am a Python beginner, so I would be very grateful if experienced readers could point out any mistakes.
#-*-coding:utf-8-*-
import urllib2, urllib
import simplejson
import os, time, threading
import common, html_filter

# input the keywords
keywords = raw_input('Enter the keywords: ')

# define rnum_perpage, pages
rnum_perpage = 8
pages = 8

# define the thread function
def thread_scratch(url, rnum_perpage, page):
    url_set = []
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        response = urllib2.urlopen(request)
        # Process the JSON string.
        results = simplejson.load(response)
        info = results['responseData']['results']
    except Exception, e:
        print 'error occured'
        print e
    else:
        for minfo in info:
            url_set.append(minfo['url'])
            print minfo['url']
    # Process the links
    i = 0
    for u in url_set:
        try:
            request_url = urllib2.Request(u, None, {'Referer': 'http://www.sina.com'})
            request_url.add_header('User-agent', 'CSC')
            response_data = urllib2.urlopen(request_url).read()
            # Filter the file
            #content_data = html_filter.filter_tags(response_data)
            # Write the file
            filenum = i + page
            filename = dir_name + '/related_html_' + str(filenum)
            print '  write start: related_html_' + str(filenum)
            f = open(filename, 'w+', -1)
            f.write(response_data)
            #print content_data
            f.close()
            print '  write down: related_html_' + str(filenum)
        except Exception, e:
            print 'error occured 2'
            print e
        i = i + 1
    return

# Create the output folder
dir_name = 'related_html_' + urllib.quote(keywords)
if os.path.exists(dir_name):
    print 'exists file'
    common.delete_dir_or_file(dir_name)
os.makedirs(dir_name)

# Capture the web pages
print 'start to scratch web pages:'
for x in range(pages):
    print "page:%s" % (x + 1)
    page = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, page)
    print url
    t = threading.Thread(target=thread_scratch, args=(url, rnum_perpage, page))
    t.start()

# The main thread waits for the sub-threads to finish
main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    t.join()
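Finally, since the error handling above is admittedly thin, here is a hedged sketch of one way the JSON-handling step could be hardened. The helper name fetch_result_urls is illustrative and not part of the original script; it assumes, per the Web Search API documentation, that failed requests come back with responseData set to null plus a responseStatus/responseDetails pair, so the code checks for that before indexing.

import urllib2, simplejson

def fetch_result_urls(url):
    # Return the list of result URLs for one API request, or [] on any failure.
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        results = simplejson.load(urllib2.urlopen(request, timeout=10))
    except Exception, e:
        print 'request failed:', e
        return []
    # On API-level errors responseData is null, so guard before indexing it.
    if results.get('responseData') is None:
        print 'API error:', results.get('responseStatus'), results.get('responseDetails')
        return []
    return [r['url'] for r in results['responseData']['results']]

thread_scratch could call a helper like this and simply skip any page for which it returns an empty list.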