Ruby and HttpWatch site spider script

Source: Internet
Author: User

HttpWatch official page: http://www.httpwatch.com/rubywatir/

HttpWatch Ruby example: http://www.httpwatch.com/rubywatir/site_spider.zip (the example may have been updated on the official website)

After downloading the example, I added some comments and deleted some code. The main changes are as follows:

1. Above the line url = gets.chomp!, add ($*[0].nil?) ? (url = url) : (url = $*[0]). With this change the URL can either be passed on the command line or left hard-coded in the script. Command-line usage: ruby <script name> <site name>; see the comments in the script for details. Do not prefix the URL with http://.
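The argument-handling change above can be sketched in isolation like this (resolve_url is an illustrative helper name, not part of the original script; $* is the same array Ruby exposes as ARGV):

```ruby
# Sketch of change 1: prefer the first command-line argument and fall
# back to a URL hard-coded in the script when no argument is given.
def resolve_url(default_url, argv = ARGV)
  argv[0].nil? ? default_url : argv[0]
end

url = resolve_url("www.gaopeng.com/?ADTAG=beijing_from_beijing")
puts "Spidering #{url}"
```

Running `ruby site_spider.rb www.example.com` would make resolve_url return the argument; running it with no arguments keeps the hard-coded default.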

2. Comment out the two break statements. They are tolerated by Ruby 1.8.6 but raise errors in later versions such as Ruby 1.9.2.
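A portable alternative to the commented-out top-level breaks is to put the version check in a method and call exit, which works in every Ruby version. A minimal sketch (version_ok? is an illustrative name, not from the original script; the major-version digits checked are the ones the script rejects):

```ruby
# Sketch of a version gate without a top-level `break`: the method
# returns a boolean and the caller decides whether to exit.
def version_ok?(ver)
  major = ver[0...1]            # same slice the original script uses
  major != "4" && major != "5"  # HttpWatch 4.x / 5.x are too old
end

unless version_ok?("6.1")
  puts "ERROR: this sample requires HttpWatch 6.0 or later."
  exit  # exit is legal at the top level in all Ruby versions
end
```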

3. Comment out plugin.Container.Quit(); so that IE is not closed. After the run, the tester can inspect the results in the browser.

Known runtime problem: on a test machine with a slow network connection, a page load may time out and the script exits with an error like the following:

```
C:/Ruby192/lib/ruby/gems/1.9.1/gems/watir-classic-3.0.0/lib/watir-classic/ie-class.rb:374:in `method_missing': (in OLE method `Navigate': ) (WIN32OLERuntimeError)
    OLE error code:800C000E in <Unknown>
      <No Description>
    HRESULT error code:0x80020009
  from C:/Ruby192/lib/ruby/gems/1.9.1/gems/watir-classic-3.0.0/lib/watir-classic/ie-class.rb:374:in `goto'
  from C:/Documents and Settings/Administrator/desktop/site_spider/site_spider.rb:55:in `<main>'
```
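One way to soften such timeouts is to retry the page load a few times before giving up. This is a defensive sketch, not part of the original script; with_retries is an illustrative name. WIN32OLERuntimeError descends from StandardError, so a plain rescue catches it:

```ruby
# Retry a block up to max_attempts times before re-raising the last
# error, so one slow page load does not kill the whole spider run.
def with_retries(max_attempts = 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StandardError => e
    retry if attempts < max_attempts
    puts "Giving up after #{attempts} attempts: #{e.message}"
    raise
  end
end

# Usage inside the spider loop (ie is the Watir IE instance):
#   with_retries { ie.goto(nextUrl) }
```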
site_spider.rb

```ruby
# A site spider that uses HttpWatch, Ruby and Watir
#
# For more information about this example please refer to
# http://www.httpwatch.com/rubywatir/

MAX_NO_PAGES = 200 # maximum number of pages visited in one run

require 'win32ole'     # win32ole drives HttpWatch; versions earlier than HttpWatch 6.0 cannot be driven this way
require 'rubygems'
require 'watir'
require './url_ops.rb' # url_ops.rb must be in the same directory as this script

url = "www.gaopeng.com/?ADTAG=beijing_from_beijing" # URL to be tested; the http:// prefix is optional

# Create the HttpWatch controller
control = WIN32OLE.new('HttpWatch.Controller')
httpWatchVer = control.Version
if httpWatchVer[0...1] == "4" or httpWatchVer[0...1] == "5"
  puts "\nERROR: You are running HttpWatch #{httpWatchVer}. This sample requires HttpWatch 6.0 or later. Press Enter to exit..."; $stdout.flush
  gets
  # break # fine under Ruby 1.8.6, raises an error under later versions such as 1.9.2
end

# Get the domain name to spider
puts "Enter the domain name of the site to check (press enter for url):\n"; $stdout.flush
($*[0].nil?) ? (url = url) : (url = $*[0]) # take the URL from the command line if one was given
# url = gets.chomp! # re-enable this line if you remove the command-line handling above
if url.empty?
  url = url # kept from the original sample; effectively a no-op after the changes above
end
hostName = url.HostName
if hostName.empty?
  puts "\nPlease enter a valid domain name. Press Enter to exit..."; $stdout.flush
  gets
  # break # fine under Ruby 1.8.6, raises an error under later versions such as 1.9.2
end

# Start IE
ie = Watir::IE.new
ie.logger.level = Logger::ERROR

# Attach HttpWatch to the IE window
plugin = control.IE.Attach(ie.ie)

# Start recording HTTP traffic
plugin.Clear()
plugin.Log.EnableFilter(false)
plugin.Record()

url = url.CanonicalUrl
urlsVisited = Array.new
urlsToVisit = Array.new(1, url)

# Visit pages
while urlsToVisit.length > 0 && urlsVisited.length < MAX_NO_PAGES

  nextUrl = urlsToVisit.pop
  puts "Loading " + nextUrl + "..."; $stdout.flush

  ie.goto(nextUrl)          # get Watir to load the URL
  urlsVisited.push(nextUrl) # remember that this URL has been visited

  begin
    # Look at each link on the page and decide whether it needs to be visited
    ie.links().each() do |link|

      linkUrl = link.href.CanonicalUrl
      # Skip the URL if it has already been queued or visited,
      # if it is a download, or if it is from a different domain
      if !url.IsSubDomain(linkUrl.HostName) ||
         linkUrl.Path.include?(".exe") || linkUrl.Path.include?(".zip") || linkUrl.Path.include?(".csv") ||
         linkUrl.Path.include?(".pdf") || linkUrl.Path.include?(".png") ||
         urlsToVisit.find { |aUrl| aUrl == linkUrl } != nil ||
         urlsVisited.find { |aUrl| aUrl == linkUrl } != nil
        # Don't add this URL to the list
        next
      end
      # Add this URL to the list
      urlsToVisit.push(linkUrl)
    end
  rescue
    puts "Failed to find links in " + nextUrl + " " + $!.to_s; $stdout.flush
  end

end

if urlsVisited.length == MAX_NO_PAGES
  puts "\nThe spider has stopped because #{MAX_NO_PAGES} pages have been visited. (Change MAX_NO_PAGES if you want to increase this limit)"; $stdout.flush
end

# Stop recording HTTP data in HttpWatch
plugin.Stop()

puts "\nAnalyzing HTTP data.."; $stdout.flush

# Look at each HTTP request in the log to compile a list of URLs for each error
errorUrls = Hash.new
plugin.Log.Entries.each do |entry|
  if (!entry.Error.empty? && entry.Error != "Aborted") || entry.StatusCode >= 400
    if !errorUrls.has_key?(entry.Result)
      errorUrls[entry.Result] = Array.new(1, entry.Url)
    else
      if errorUrls[entry.Result].find { |aUrl| aUrl == entry.Url } == nil
        errorUrls[entry.Result].push(entry.Url)
      end
    end
  end
end

# Display summary statistics for the whole log
summary = plugin.Log.Entries.Summary

printf "Total time to load page (secs):      %.3f\n", summary.Time
printf "Number of bytes received on network: %d\n", summary.BytesReceived
printf "HTTP compression saving (bytes):     %d\n", summary.CompressionSavedBytes
printf "Number of round trips:               %d\n", summary.RoundTrips
printf "Number of errors:                    %d\n", summary.Errors.Count

# Print out the errors
summary.Errors.each do |error|
  numErrors = error.Occurrences
  description = error.Description
  puts "#{numErrors} URL(s) caused a #{description} error:"
  errorUrls[error.Result].each do |aUrl|
    puts "->#{aUrl}"
  end
end

# Do not quit IE, so the tester can check the results after the run
# plugin.Container.Quit();

puts "\r\nPress Enter to exit"; $stdout.flush
# gets
```
url_ops.rb

```ruby
# Helper functions used to parse URLs
class String
  def HostName
    matches = scan(/^(?:https?:\/\/)?([^\/]*)/)
    if matches.length > 0 && matches[0].length > 0
      return matches[0][0].downcase
    else
      return ""
    end
  end

  def IsSubDomain(hostName)
    thisHostName = self.HostName
    if thisHostName.slice(0..3) == "www."
      thisHostName = thisHostName.slice(4..-1)
    end
    if thisHostName == hostName ||
       (hostName.length > thisHostName.length &&
        hostName.slice(-thisHostName.length..-1) == thisHostName)
      return true
    end
    return false
  end

  def Protocol
    matches = scan(/^(https?:\/\/)/)
    if matches.length > 0 && matches[0].length > 0
      return matches[0][0].downcase
    else
      return "http://"
    end
  end

  def Path
    if scan(/^(https?:\/\/)/).length > 0
      matches = scan(/^https?:\/\/[^\/]+\/([^#]+)$/)
    else
      matches = scan(/^[^\/]+\/([^#]+)$/)
    end
    if matches != nil && matches.length == 1 && matches[0].length == 1
      return matches[0][0].downcase
    else
      return ""
    end
  end

  def CanonicalUrl
    return self.Protocol + self.HostName + "/" + self.Path
  end
end
```
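These String helpers are pure Ruby and can be exercised without IE or HttpWatch. Here is a self-contained excerpt reproducing just HostName and Protocol from url_ops.rb, to show what they return:

```ruby
# Self-contained copy of two url_ops.rb helpers, for quick checking on
# any platform; this is plain string parsing, no OLE involved.
class String
  def HostName
    matches = scan(/^(?:https?:\/\/)?([^\/]*)/)
    (matches.length > 0 && matches[0].length > 0) ? matches[0][0].downcase : ""
  end

  def Protocol
    matches = scan(/^(https?:\/\/)/)
    (matches.length > 0 && matches[0].length > 0) ? matches[0][0].downcase : "http://"
  end
end

puts "https://www.Example.com/page".HostName  # host part, lower-cased
puts "www.example.com/page".Protocol          # missing scheme defaults to http://
```

Note that the scheme regexes are case-sensitive, so a URL written as HTTPS:// would not be recognized; the spider only feeds the helpers lower-case hrefs in practice.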

Place the two scripts in the same directory (url_ops.rb is unchanged from the official example) and run site_spider.rb from cmd.

 
