Multi-threaded crawling of Google search result pages in Python
(1) urllib2 + BeautifulSoup to capture Google search links
Recently, a project I am working on needs to process Google search results, so I have been learning the Python tools for handling web pages. In practice, urllib2 and BeautifulSoup are used to fetch pages; however, when capturing Google search results, I found that processing the HTML source of the results page directly yields a lot of "dirty" links.
The search keyword is "Titanic James":
In the figure, the links marked in red are not needed; the links marked in blue are the ones we want to capture.
These "dirty" links could be filtered out with hand-written rules, but that adds considerable complexity to the program. Just as I was puzzling over the filtering rules, I was reminded that Google provides a related API that should make this much easier.
(2) Google Web Search API + Multithreading
The API documentation provides an example of searching with Python:
import urllib2
import simplejson

# The request also includes the userip parameter which provides the end
# user's IP address. Doing so will help distinguish this legitimate
# server-side traffic from traffic which doesn't come from an end-user.
url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=Paris%20Hilton&userip=USERS-IP-ADDRESS')

request = urllib2.Request(
    url, None, {'Referer': 'http://www.example.com'})  # enter the URL of your site here
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
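Each entry of results['responseData']['results'] in the response is a dictionary; the crawler in section (3) reads its 'url' field to fetch the actual result page.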
In practice, many Google result pages may need to be crawled, so multithreading is used to split the capture work among threads. For more details on the Google Web Search API, see its documentation (the standard URL arguments are described there). Also note that the rsz parameter in the URL must be 8 or less; values greater than 8 cause the API to return an error.
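To make the paging arithmetic concrete, here is a minimal sketch that only builds and prints the request URLs, assuming the same endpoint as the example above (the keyword is illustrative): rsz is the number of results per request (at most 8) and start is the zero-based offset of the first result, so page x begins at x * rnum_perpage.

import urllib

keywords = 'Titanic James'   # illustrative query
rnum_perpage = 8             # rsz: at most 8 results per request
for x in range(4):           # first four pages
    start = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords),
                                             rnum_perpage, start)
    print url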
(3) Code implementation
The code below runs, but it still has problems: robustness is poor and it needs improvement (one possible hardening of the error handling is sketched after the code). I am a Python beginner, so I would be very grateful if experienced readers could point out any mistakes.
#-*-coding:utf-8-*-
import urllib2, urllib
import simplejson
import os, time, threading
import common, html_filter

# input the keywords
keywords = raw_input('Enter the keywords: ')

# define rnum_perpage, pages
rnum_perpage = 8
pages = 8

# define the thread function
def thread_scratch(url, rnum_perpage, page):
    url_set = []
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        response = urllib2.urlopen(request)
        # Process the JSON string.
        results = simplejson.load(response)
        info = results['responseData']['results']
    except Exception, e:
        print 'error occured'
        print e
    else:
        for minfo in info:
            url_set.append(minfo['url'])
            print minfo['url']
    # Process the links
    i = 0
    for u in url_set:
        try:
            request_url = urllib2.Request(u, None, {'Referer': 'http://www.sina.com'})
            request_url.add_header('User-agent', 'CSC')
            response_data = urllib2.urlopen(request_url).read()
            # Filter the file
            #content_data = html_filter.filter_tags(response_data)
            # Write the file
            filenum = i + page
            filename = dir_name + '/related_html_' + str(filenum)
            print '  write start: related_html_' + str(filenum)
            f = open(filename, 'w+', -1)
            f.write(response_data)
            #print content_data
            f.close()
            print '  write down: related_html_' + str(filenum)
        except Exception, e:
            print 'error occured 2'
            print e
        i = i + 1
    return

# Create the output folder
dir_name = 'related_html_' + urllib.quote(keywords)
if os.path.exists(dir_name):
    print 'exists file'
    common.delete_dir_or_file(dir_name)
os.makedirs(dir_name)

# Capture the web pages
print 'start to scratch web pages:'
for x in range(pages):
    print "page:%s" % (x + 1)
    page = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, page)
    print url
    t = threading.Thread(target=thread_scratch, args=(url, rnum_perpage, page))
    t.start()

# The main thread waits for the sub-threads to finish
main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    t.join()
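Finally, since the error handling above is admittedly thin, here is a hedged sketch of one way the JSON-handling step could be hardened. The helper name fetch_result_urls is illustrative and not part of the original script; it assumes, per the Web Search API documentation, that failed requests come back with responseData set to null plus a responseStatus/responseDetails pair, so the code checks for that before indexing.

import urllib2, simplejson

def fetch_result_urls(url):
    # Return the list of result URLs for one API request, or [] on any failure.
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        results = simplejson.load(urllib2.urlopen(request, timeout=10))
    except Exception, e:
        print 'request failed:', e
        return []
    # On API-level errors responseData is null, so guard before indexing it.
    if results.get('responseData') is None:
        print 'API error:', results.get('responseStatus'), results.get('responseDetails')
        return []
    return [r['url'] for r in results['responseData']['results']]

thread_scratch could call a helper like this and simply skip any page for which it returns an empty list.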