Capturing Google Search Results with Python


(1) urllib2 + BeautifulSoup to capture Google search links

Recently, a project I am participating in needed to process Google search results, so I looked into the Python tools for handling web pages. In practice, urllib2 and BeautifulSoup are used to fetch and parse pages. However, when capturing Google search results, I found that parsing the source code of the results page directly yields a lot of "dirty" links.

For example, the search results for "Titanic James" look like the figure below: the parts marked in red are not needed, while the links marked in blue are the ones to capture.

These "Dirty links" can be filtered out by rule filtering, but the complexity of the program is high. Just as I was frowning about filtering rules. I was reminded that Google should provide related APIs to make it easy.

(2) Google Web Search API + Multithreading

The API documentation provides the following Python search example:

import urllib2
import simplejson

# The request also includes the userip parameter which provides the end
# user's IP address. Doing so will help distinguish this legitimate
# server-side traffic from traffic which doesn't come from an end-user.
url = ('https://ajax.googleapis.com/ajax/services/search/web'
       '?v=1.0&q=Paris%20Hilton&userip=USERS-IP-ADDRESS')

request = urllib2.Request(
    url, None, {'Referer': 'http://www.example.com/'})  # enter the URL of your site here
response = urllib2.urlopen(request)

# Process the JSON string.
results = simplejson.load(response)
# now have some fun with the results...
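The example stops at "have some fun with the results"; to make that concrete, the parsed object can be walked directly. A minimal sketch, assuming the responseData/results layout this API returns, where each result carries fields such as url and titleNoFormatting:

# Continuing from the example above: walk the parsed JSON response.
if results['responseStatus'] == 200:
    for item in results['responseData']['results']:
        print item['url'], '-', item['titleNoFormatting']
else:
    print 'search failed:', results.get('responseDetails')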

In practice, many pages of Google results may need to be crawled, so the fetching is split across multiple threads. For more information about the Google Web Search API, see here (the standard URL arguments are introduced there). Also note that the rsz parameter in the URL must be 8 or less; any value greater than 8 returns an error!
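To make the split concrete: each thread is given a start offset that advances by rnum_perpage, so with rsz=8 eight threads cover results 0 through 63. A quick preview of the URL construction used in the full code below:

import urllib

keywords = 'Titanic James'
rnum_perpage = 8          # rsz: results per request, must not exceed 8
pages = 8

for x in range(pages):
    start = x * rnum_perpage   # 0, 8, 16, ..., 56
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, start)
    print url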

(3) Code implementation

The implementation below still has problems: it runs, but its robustness is poor and it needs improvement. I am a Python beginner, so I would be very grateful if experienced readers could point out the errors.

# -*- coding: utf-8 -*-
import urllib2, urllib
import simplejson
import os, time, threading

import common, html_filter

# input the keywords
keywords = raw_input('Enter the keywords: ')

# define rnum_perpage, pages
rnum_perpage = 8
pages = 8

# define the thread function
def thread_scratch(url, rnum_perpage, page):
    url_set = []
    try:
        request = urllib2.Request(url, None, {'Referer': 'http://www.sina.com'})
        response = urllib2.urlopen(request)
        # Process the JSON string.
        results = simplejson.load(response)
        info = results['responseData']['results']
    except Exception, e:
        print 'error occured'
        print e
    else:
        for minfo in info:
            url_set.append(minfo['url'])
            print minfo['url']
    # Process the links
    i = 0
    for u in url_set:
        try:
            request_url = urllib2.Request(u, None, {'Referer': 'http://www.sina.com'})
            request_url.add_header(
                'User-agent',
                'CSC'
            )
            response_data = urllib2.urlopen(request_url).read()
            # Filter the file
            #content_data = html_filter.filter_tags(response_data)
            # Write the file
            filenum = i + page
            filename = dir_name + '/related_html_' + str(filenum)
            print '  write start: related_html_' + str(filenum)
            f = open(filename, 'w+', -1)
            f.write(response_data)
            #print content_data
            f.close()
            print '  write down: related_html_' + str(filenum)
        except Exception, e:
            print 'error occured 2'
            print e
        i = i + 1
    return

# Create the folder
dir_name = 'related_html_' + urllib.quote(keywords)
if os.path.exists(dir_name):
    print 'exists  file'
    common.delete_dir_or_file(dir_name)
os.makedirs(dir_name)

# Capture the web pages
print 'start to scratch web pages:'
for x in range(pages):
    print "page:%s" % (x + 1)
    page = x * rnum_perpage
    url = ('https://ajax.googleapis.com/ajax/services/search/web'
           '?v=1.0&q=%s&rsz=%s&start=%s') % (urllib.quote(keywords), rnum_perpage, page)
    print url
    t = threading.Thread(target=thread_scratch, args=(url, rnum_perpage, page))
    t.start()

# The main thread waits for the sub-threads to finish
main_thread = threading.currentThread()
for t in threading.enumerate():
    if t is main_thread:
        continue
    t.join()
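Note that common and html_filter are the author's own helper modules and are not listed here; only common.delete_dir_or_file is actually called above (the html_filter.filter_tags call is commented out). A minimal stand-in, based purely on the name and the call site and not on the author's code, might look like this:

# common.py -- assumed minimal stand-in for the author's helper module:
# delete a path whether it is a single file or a whole directory tree.
import os
import shutil

def delete_dir_or_file(path):
    if os.path.isdir(path):
        shutil.rmtree(path)    # remove the directory and all of its contents
    elif os.path.exists(path):
        os.remove(path)        # remove a single file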