Tutorial: Extracting web information with Ruby
Websites no longer cater only to human readers. Many now offer APIs that allow computer programs to obtain information. Screen scraping, the time-saving technique of parsing HTML pages into forms that are easier to work with, is still convenient. However, the opportunities to use APIs to simplify web data extraction are growing rapidly. According to ProgrammableWeb, more than 10,000 website APIs existed at the time of this writing, an increase of 3,000 over the past 15 months. (ProgrammableWeb itself provides an API for searching and retrieving APIs, mashups, member profiles, and other data from its directory.)
This article first introduces modern web scraping and compares it with the API approach. Ruby examples then show how to use APIs to extract structured information from some popular web properties. You need a basic understanding of the Ruby language, Representational State Transfer (REST), and the concepts behind JavaScript Object Notation (JSON) and XML.
Scraping versus APIs
Several scraping solutions are available today. Some convert HTML to other formats, such as JSON, which makes it easier to extract the desired content. Others read the HTML and let you define the content to extract as a function of the HTML hierarchy, in which the data is tagged. One such solution is Nokogiri, which supports parsing HTML and XML documents in Ruby. Other open source scraping tools include pjscrape for JavaScript and Beautiful Soup for Python. pjscrape implements a command-line tool to scrape fully rendered pages, including JavaScript content. Beautiful Soup is fully integrated into the Python 2 and 3 environments.
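To illustrate the hierarchy-driven style of extraction that these tools support, here is a minimal sketch using Ruby's built-in REXML library (so it runs without any gems). The markup snippet and its `num_employees` ID are invented for illustration, not taken from any real page:

```ruby
require 'rexml/document'

# A hand-made snippet imitating a page that tags its data with an ID
html = <<-XML
<table>
  <tr>
    <td><span id="num_employees">388,000</span></td>
    <td><span id="founded">1911</span></td>
  </tr>
</table>
XML

# Address the data by its place in the markup hierarchy:
# a <span> with a known ID inside a table row
doc = REXML::Document.new(html)
employees = REXML::XPath.first(doc, "//tr/td/span[@id='num_employees']").text
puts employees
```

Nokogiri offers the same XPath (and CSS selector) addressing with a faster parser that also tolerates malformed real-world HTML, which REXML does not.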
Assume that you want to use scraping with Nokogiri to identify the number of IBM employees as reported by CrunchBase. The first step is to understand the markup of the specific CrunchBase HTML page that lists the IBM employee count. Figure 1 shows the page opened in the Firebug tool in Mozilla Firefox. The upper part of the image shows the rendered page, and the lower part shows the HTML source for the portion of interest.
The Ruby script in Listing 1 uses Nokogiri to scrape the employee count from the web page in Figure 1.
Listing 1. Using Nokogiri to parse HTML (parse.rb)
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Define the URL with the argument passed by the user
uri = "http://www.crunchbase.com/company/#{ARGV[0]}"

# Use Nokogiri to get the document
doc = Nokogiri::HTML(open(uri))

# Find the link of interest
link = doc.search('tr span[1]')

# Emit the content associated with that link
puts link[0].content
In the HTML source displayed by Firebug (as shown in Figure 1), you can see that the data of interest (the employee count) is embedded inside a <span> tag with a unique ID. The <span id="num_employees"> tag is the first of two <span> IDs. Therefore, the last two statements in Listing 1 request the first <span> tag with link = doc.search('tr span[1]') and then emit the parsed content of that link with puts link[0].content.
CrunchBase also exposes a REST API, which provides access to much more data than you can reach through scraping. Listing 2 shows how to use this API to extract the employee count from the CrunchBase site.
Listing 2. Using the CrunchBase REST API with JSON parsing (api.rb)
#!/usr/bin/env ruby
require 'rubygems'
require 'json'
require 'net/http'

# Define the URL with the argument passed by the user
uri = "http://api.crunchbase.com/v/1/company/#{ARGV[0]}.js"

# Perform the HTTP GET request, and return the response
resp = Net::HTTP.get_response(URI.parse(uri))

# Parse the JSON from the response body
jresp = JSON.parse(resp.body)

# Emit the content of interest
puts jresp['number_of_employees']
In Listing 2, you again define a URL (with the company name passed in as a script argument). Then you use the HTTP class to send a GET request and return the response. The response body is parsed as a JSON object, and you can reference the data item of interest through a Ruby data structure.
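The JSON-to-Ruby mapping is direct: JSON objects become hashes and JSON arrays become arrays. A minimal sketch of that last step, using an invented response body (the field name follows Listing 2; the values are made up):

```ruby
require 'json'

# An invented response body in the shape the script expects
body = '{"name": "IBM", "number_of_employees": 388000}'

# JSON objects parse into Ruby hashes, so fields are plain hash lookups
jresp = JSON.parse(body)
puts jresp['number_of_employees']
```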
The console session in Listing 3 shows the results of running both the scraping script from Listing 1 and the API-based script from Listing 2.
Listing 3. Demonstrating the scraping and API approaches
$ ./parse.rb ibm
388,000
$ ./api.rb ibm
388000
$ ./parse.rb cisco
63,000
$ ./api.rb cisco
63000
$ ./parse.rb paypal
300,000
$ ./api.rb paypal
300000
$
The scraping script returns a formatted count, whereas the API script produces a raw integer. As Listing 3 shows, you can generalize each script to obtain employee counts for other companies that CrunchBase tracks. The general structure of the URLs that each approach uses makes this versatility possible.
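Because the scraped value is a string formatted for human readers and the API returns a raw integer, comparing the two outputs takes one normalization step. A small sketch, using the sample values from Listing 3:

```ruby
# Value scraped from the HTML page (formatted for human readers)
scraped = "388,000"

# Value returned by the API (raw integer)
from_api = 388000

# Strip the thousands separators, then convert, to compare the two
normalized = scraped.delete(',').to_i
puts normalized == from_api
```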
So what does the API approach buy you? For scraping, you need to dig through the HTML to understand its structure and identify the data to extract. Parsing the HTML with Nokogiri and getting at the data of interest is then easy. However, if the structure of the HTML document changes, you may need to modify the script to parse the new structure correctly. The API approach has no such problem, as long as the API contract holds. Another important advantage of the API approach is that you can access all of the data exposed through the interface (via the returned JSON object). Far less CrunchBase data is exposed through the HTML and available to users.
Now let's look at how to use some other APIs to extract various kinds of information from the internet, again with Ruby scripts. First, see how to gather personal data from a social networking site. Then see how to find less personal data through other API sources.
Use LinkedIn to extract personal data
LinkedIn is a social networking website for professionals. It is useful for connecting with other developers, finding a job, researching a company, or joining a group to collaborate on interesting topics. LinkedIn also incorporates a recommendation engine that suggests jobs and companies based on your profile.
LinkedIn users can access the site's REST and JavaScript APIs to obtain the information that is otherwise available through its human-readable website: connections, social sharing streams, content groups, communications (messages and connection invitations), and company and job information.
To use the LinkedIn API, you must register your application. After registration, you obtain an API key and secret key, as well as a user token and secret. LinkedIn uses the OAuth protocol for authentication.
After authentication, you make REST requests through the access token object. A response is a typical HTTP response, so you can parse its body into a JSON object and then iterate over that object to extract the data of interest.
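The oauth gem hides the signing details. As a rough sketch of what happens under the hood when the access token object issues a request, the following standard-library-only code builds an OAuth 1.0a signature base string and signs it with HMAC-SHA1. All keys, the nonce, and the timestamp here are invented placeholders, not working credentials:

```ruby
require 'openssl'
require 'base64'
require 'erb'

# Placeholder secrets (the oauth gem receives these at construction time)
consumer_secret = 'consumer-secret'
token_secret    = 'token-secret'

# OAuth protocol parameters plus the request's own query parameters
params = {
  'oauth_consumer_key'     => 'api-key',
  'oauth_token'            => 'user-token',
  'oauth_nonce'            => 'abc123',
  'oauth_timestamp'        => '1000000000',
  'oauth_signature_method' => 'HMAC-SHA1',
  'oauth_version'          => '1.0',
  'format'                 => 'json'
}

# Parameters are percent-encoded, sorted, and joined into one string
encoded = params.sort.map { |k, v|
  "#{ERB::Util.url_encode(k)}=#{ERB::Util.url_encode(v)}"
}.join('&')

# Signature base string: METHOD & encoded-URL & encoded-parameter-string
base = ['GET',
        ERB::Util.url_encode('http://api.linkedin.com/v1/people/~'),
        ERB::Util.url_encode(encoded)].join('&')

# The signing key is the consumer secret and token secret joined by '&'
key = "#{consumer_secret}&#{token_secret}"
signature = Base64.strict_encode64(OpenSSL::HMAC.digest('SHA1', key, base))
puts signature
```

In practice, OAuth::AccessToken#get performs these steps for you on every request, generating a fresh nonce and timestamp each time and placing the result in the Authorization header.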
The Ruby script in Listing 4 authenticates and then retrieves the company recommendations and job suggestions for the LinkedIn user.
Listing 4. Viewing company and job suggestions with the LinkedIn API (lkdin.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'oauth'
require 'json'

pquery = "http://api.linkedin.com/v1/people/~?format=json"
cquery = 'http://api.linkedin.com/v1/people/~/suggestions/to-follow/companies?format=json'
jquery = 'http://api.linkedin.com/v1/people/~/suggestions/job-suggestions?format=json'

# Fill the keys and secrets you retrieved after registering your app
api_key = 'api key'
api_secret = 'api secret'
user_token = 'user token'
user_secret = 'user secret'

# Specify LinkedIn API endpoint
configuration = { :site => 'https://api.linkedin.com' }

# Use the API key and secret to instantiate consumer object
consumer = OAuth::Consumer.new(api_key, api_secret, configuration)

# Use the developer token and secret to instantiate access token object
access_token = OAuth::AccessToken.new(consumer, user_token, user_secret)

# Get the username for this profile
response = access_token.get(pquery)
jresp = JSON.parse(response.body)
myName = "#{jresp['firstName']} #{jresp['lastName']}"
puts "\nSuggested companies to follow for #{myName}"

# Get the suggested companies to follow
response = access_token.get(cquery)
jresp = JSON.parse(response.body)

# Iterate through each and display the company name
jresp['values'].each do | company |
  puts "  #{company['name']}"
end

# Get the job suggestions
response = access_token.get(jquery)
jresp = JSON.parse(response.body)
puts "\nSuggested jobs for #{myName}"

# Iterate through each suggested job and print the company name
jresp['jobs']['values'].each do | job |
  puts "  #{job['company']['name']} in #{job['locationDescription']}"
end

puts "\n"
The console session in Listing 5 shows the output of running the Ruby script in Listing 4. The script's three separate calls to the LinkedIn API each produce distinct output: one retrieves the profile of the authenticated user, and the other two retrieve the company suggestions and job suggestions.
Listing 5. Demonstrating the LinkedIn Ruby script
$ ./lkdin.rb

Suggested companies to follow for M. Tim Jones
  Open Kernel Labs, Inc.
  Linaro
  Wind River
  DDC-I
  Linsyssoft Technologies
  Kalray
  American Megatrends
  JetHead Development
  Evidence Srl
  Aizyc Technology

Suggested jobs for M. Tim Jones
  Kozio in Greater Denver Area
  Samsung Semiconductor Inc in San Jose, CA
  Terran Systems in Sunnyvale, CA
  Magnum Semiconductor in San Francisco Bay Area
  RGB Spectrum in Alameda, CA
  Aptina in San Francisco Bay Area
  CyberCoders in San Francisco, CA
  CyberCoders in Alameda, CA
  SanDisk in Longmont, CO
  SanDisk in Longmont, CO
$
The LinkedIn APIs can be used from any language that provides OAuth support.
Using the Yelp API to retrieve business data
Yelp exposes a rich REST API for business searches, including ratings, reviews, and geographical searches (by location, city, or geocode). With the Yelp API, you can search for a given type of business (such as "hotel") and restrict the search to a geographical bound, the area near a geographical coordinate, or a neighborhood, address, or city. The JSON response contains considerable information about the businesses that match the criteria, including address information, distance, ratings, deals, and various kinds of URLs (such as images of the business and mobile-formatted information).
Like LinkedIn, Yelp uses OAuth for authentication, so you must register with Yelp to receive a set of credentials for authenticating through the API. After the script authenticates, you can construct a REST-style URL request. In Listing 6, I hard-coded a restaurant request for Boulder, Colorado. The response body is parsed into a JSON object and iterated to emit the desired information. Note that the script excludes businesses that have closed.
Listing 6. Retrieving business data with the Yelp API (yelp.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'oauth'
require 'json'

consumer_key = 'your consumer key'
consumer_secret = 'your consumer secret'
token = 'your token'
token_secret = 'your token secret'
api_host = 'http://api.yelp.com'

consumer = OAuth::Consumer.new(consumer_key, consumer_secret, {:site => api_host})
access_token = OAuth::AccessToken.new(consumer, token, token_secret)

path = "/v2/search?term=restaurants&location=Boulder,CO"

jresp = JSON.parse(access_token.get(path).body)

jresp['businesses'].each do | business |
  if business['is_closed'] == false
    printf("%-32s %10s %3d %1.1f\n",
           business['name'], business['phone'],
           business['review_count'], business['rating'])
  end
end
The console session in Listing 7 shows sample output from running the script in Listing 6. For simplicity, I show only the first set of businesses returned, rather than using the API's limit/offset feature (which retrieves the entire list through multiple calls). The example shows each business's name, phone number, number of reviews received, and average rating.
Listing 7. Demonstrating the Yelp API Ruby script
$ ./yelp.rb
Frasca Food and Wine             3034426966 189 4.5
John's Restaurant                3034445232  51 4.5
Leaf Vegetarian Restaurant       3034421485 144 4.0
Nepal Cuisine                    3035545828  65 4.5
Black Cat Bistro                 3034445500  72 4.0
The Mediterranean Restaurant     3034445335 306 4.0
Arugula Bar E Ristorante         3034435100  48 4.0
Ras Kassa's Ethiopia Restaurant  3034472919 101 4.0
L'Atelier                        3034427233  58 4.0
Bombay Bistro                    3034444721  87 4.0
Brasserie Ten Ten                3039981010 200 4.0
Flagstaff House                  3034424640  86 4.5
Pearl Street Mall                3034493774  77 4.0
Gurkhas on the Hill              3034431355  19 4.0
The Kitchen                      3035445973 274 4.0
Chez Thuy Restaurant             3034421700  99 3.5
Il Pastaio                       3034479572 113 4.5
3 Margaritas                     3039981234  11 3.5
Q's Restaurant                   3034424880  65 4.0
Julia's Kitchen                               8 5.0
$
Yelp provides an API with excellent documentation, including data descriptions, examples, and error handling. Although the Yelp API is very useful, its use is rate-limited. As an initial developer, you can make up to 100 API calls per day, plus 1,000 API calls for test purposes. If your application meets Yelp's display requirements, you can make 10,000 calls per day (or more).
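The response-iteration logic of Listing 6 can be tried offline against a hand-made sample. The field names (`businesses`, `is_closed`, `name`, `review_count`, `rating`) follow Listing 6; the sample businesses themselves are invented:

```ruby
require 'json'

# Invented sample in the shape of a Yelp search response
body = <<-JSON
{ "businesses": [
    { "name": "Sample Bistro",  "rating": 4.5, "review_count": 120, "is_closed": false },
    { "name": "Shuttered Cafe", "rating": 3.0, "review_count": 12,  "is_closed": true  },
    { "name": "Corner Grill",   "rating": 4.0, "review_count": 87,  "is_closed": false }
] }
JSON

# Keep only businesses that are still open, as Listing 6 does,
# and print name, review count, and rating in aligned columns
open_businesses = JSON.parse(body)['businesses'].reject { |b| b['is_closed'] }
open_businesses.each do |b|
  printf("%-16s %3d %1.1f\n", b['name'], b['review_count'], b['rating'])
end
```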
Domain location with a simple mashup
The next example ties together two sources to produce information. In this case, you want to translate a web domain name into its general geographical location. The Ruby script in Listing 8 uses the Linux host command together with the OpenCrypt IP Location API Service to retrieve the location information.
Listing 8. Retrieving the location information for a web domain
#!/usr/bin/env ruby
require 'net/http'

aggr = ""
key = 'your api key here'

# Get the IP address for the domain using the 'host' command
IO.popen("host #{ARGV[0]}") { | line |
  until line.eof?
    aggr += line.gets
  end
}

# Find the IP address in the response from the 'host' command
pattern = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
if m = pattern.match(aggr)
  uri = "http://api.opencrypt.com/ip/?IP=#{m[0]}&key=#{key}"
  resp = Net::HTTP.get_response(URI.parse(uri))
  puts resp.body
end
In Listing 8, you first use the local host command to translate the domain name into an IP address. (The host command itself uses an internal API and DNS resolution to resolve the domain name to an IP address.) You use a simple regular expression (and the match method) to parse the IP address out of the host command's output. Given the IP address, you use the IP location service on OpenCrypt to retrieve the general geographical location information. The OpenCrypt API permits up to 50,000 free API calls.
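The regular-expression step can be exercised on its own against a typical line of host output. The sample line below is hand-written for illustration (93.184.216.34 is the well-known example.org documentation address):

```ruby
# Hand-written sample of what `host` prints for a domain
aggr = "www.example.org has address 93.184.216.34\n"

# Same pattern as Listing 8: four dot-separated digit groups at end of line
pattern = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
if m = pattern.match(aggr)
  puts m[0]
end
```

The trailing `$` anchor matters: it keeps the match at the end of the line, so the pattern picks up the resolved address rather than any dotted sequence appearing earlier in the output.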
The OpenCrypt API call is simple: the URL you construct contains the IP address to locate and the key provided to you during the OpenCrypt registration process. The HTTP response body contains the IP address, country code, and country name.
The console session in Listing 9 shows the output for two sample domain names.
Listing 9. Using a simple domain location script
$ ./where.rb www.baynet.ne.jp
IP=111.68.239.125
CC=JP
CN=Japan
$ ./where.rb www.pravda.ru
IP=212.76.137.2
CC=RU
CN=Russian Federation
$
Querying the Google APIs
Google is the undisputed heavyweight of web APIs. Google has so many APIs that it provides another API for querying them: with the Google API Discovery Service, you can list the available Google APIs and retrieve their metadata. Although interacting with most Google APIs requires authentication, you can access the Discovery Service over a secure socket connection without authenticating. For this reason, Listing 10 uses Ruby's net/https support to construct a connection to the secure port. The defined URL specifies the REST request, and the response is JSON-encoded. The script iterates through the response and emits a small portion of the data for the preferred APIs.
Listing 10. Using the Google API Discovery Service to list Google APIs (gdir.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'net/https'
require 'json'

url = 'https://www.googleapis.com/discovery/v1/apis'
uri = URI.parse(url)

# Set up a connection to the Google API Service
http = Net::HTTP.new( uri.host, 443 )
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE

# Connect to the service
req = Net::HTTP::Get.new(uri.request_uri)
resp = http.request(req)

# Get the JSON representation
jresp = JSON.parse(resp.body)

# Iterate through the API List
jresp['items'].each do | item |
  if item['preferred'] == true
    name = item['name']
    title = item['title']
    link = item['discoveryLink']
    printf("%-17s %-34s %-20s\n", name, title, link)
  end
end
The console session in Listing 11 shows sample output from running the script in Listing 10.
Listing 11. Using a simple Google Directory Service Ruby script
$ ./gdir.rb
adexchangebuyer   Ad Exchange Buyer API              ./apis/adexchangebuyer/v1.1/rest
adsense           AdSense Management API             ./apis/adsense/v1.1/rest
adsensehost       AdSense Host API                   ./apis/adsensehost/v4.1/rest
analytics         Google Analytics API               ./apis/analytics/v3/rest
androidpublisher  Google Play Android Developer API  ./apis/androidpublisher/v1/rest
audit             Enterprise Audit API               ./apis/audit/v1/rest
bigquery          BigQuery API                       ./apis/bigquery/v2/rest
blogger           Blogger API                        ./apis/blogger/v3/rest
books             Books API                          ./apis/books/v1/rest
calendar          Calendar API                       ./apis/calendar/v3/rest
compute           Compute Engine API                 ./apis/compute/v1beta12/rest
coordinate        Google Maps Coordinate API         ./apis/coordinate/v1/rest
customsearch      CustomSearch API                   ./apis/customsearch/v1/rest
dfareporting      DFA Reporting API                  ./apis/dfareporting/v1/rest
discovery         APIs Discovery Service             ./apis/discovery/v1/rest
drive             Drive API                          ./apis/drive/v2/rest
...
storage           Cloud Storage API                  ./apis/storage/v1beta1/rest
taskqueue         TaskQueue API                      ./apis/taskqueue/v1beta2/rest
tasks             Tasks API                          ./apis/tasks/v1/rest
translate         Translate API                      ./apis/translate/v2/rest
urlshortener      URL Shortener API                  ./apis/urlshortener/v1/rest
webfonts          Google Web Fonts Developer API     ./apis/webfonts/v1/rest
youtube           YouTube API                        ./apis/youtube/v3alpha/rest
youtubeAnalytics  YouTube Analytics API              ./apis/youtubeAnalytics/v1/rest
$
The output in Listing 11 shows the API names and titles, along with the URL path for exploring each API further.
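The preferred-version filter in Listing 10 can be exercised against a hand-made fragment of a Discovery Service response. The field names follow Listing 10, and the two entries imitate a current and a superseded version of one API (the v1 entry is invented for illustration):

```ruby
require 'json'

# Hand-made fragment in the shape of the Discovery Service response
body = <<-JSON
{ "items": [
    { "name": "translate", "title": "Translate API", "preferred": true,
      "discoveryLink": "./apis/translate/v2/rest" },
    { "name": "translate", "title": "Translate API", "preferred": false,
      "discoveryLink": "./apis/translate/v1/rest" }
] }
JSON

# Keep only the preferred version of each API, as Listing 10 does
preferred = JSON.parse(body)['items'].select { |item| item['preferred'] }
preferred.each do |item|
  printf("%-12s %-16s %s\n", item['name'], item['title'], item['discoveryLink'])
end
```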
Conclusion
The examples in this article demonstrate the power of public APIs for extracting information from the internet. Compared with scraping and spidering, web APIs provide access to specific, targeted information. New value is constantly being created on the internet, not only through the use of these APIs but also by combining them in novel ways to put new data in front of a growing number of web users.
However, using APIs comes at a price. Rate limits are a frequent complaint. Likewise, the fact that API rules can change without notice must be taken into account when you build an application. Twitter recently changed its API to provide "a more consistent experience"; the change was a disaster for many third-party applications that could be viewed as competitors to Twitter's own web clients.