A tutorial on extracting web information with Ruby


Websites no longer cater to human readers alone. Many sites now support APIs that enable computer programs to access information. Screen scraping, the time-saving technique of parsing HTML pages into forms that are easier to consume, is still convenient. But the opportunity to simplify web data extraction with APIs is increasing rapidly. According to ProgrammableWeb, more than 10,000 website APIs exist at the time of this writing, an increase of 3,000 in the past 15 months. (ProgrammableWeb itself provides an API to search and retrieve APIs, mashups, member profiles, and other data from its directory.)

This article first looks at modern web scraping and compares it with the API approach. Ruby examples then show how to use APIs to extract structured information from popular web properties. You need a basic understanding of the Ruby language, Representational State Transfer (REST), JavaScript Object Notation (JSON), and XML concepts.

Scraping versus APIs

There are now a variety of scraping solutions. Some convert HTML to other formats, such as JSON, which makes it easier to extract the content you want. Others read the HTML and let you define content as a function of the HTML hierarchy in which the data is tagged. One such solution is Nokogiri, which supports parsing HTML and XML documents in Ruby. Other open source scrapers include pjscrape for JavaScript and Beautiful Soup for Python. pjscrape implements a command-line tool to scrape fully rendered pages, including JavaScript content. Beautiful Soup works in both Python 2 and Python 3 environments.

Suppose you want to use scraping and Nokogiri to find the number of IBM employees reported by Crunchbase. The first step is to understand the markup of the specific HTML page on Crunchbase that lists the number of IBM employees. Figure 1 shows this page opened in the Firebug tool in Mozilla Firefox. The top half of the figure shows the rendered HTML, and the bottom half shows the HTML source code for the section of interest.

The Ruby script in Listing 1 uses Nokogiri to scrape the number of employees from the page in Figure 1.
Listing 1. Parsing HTML with Nokogiri (parse.rb)

#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Define the URL with the argument passed by the user
uri = "http://www.crunchbase.com/company/#{ARGV[0]}"

# Use Nokogiri to get the document
# (URI.open is the open-uri entry point on current Ruby versions)
doc = Nokogiri::HTML(URI.open(uri))

# Find the link of interest
link = doc.search('tr span[1]')

# Emit the content associated with that link
puts link[0].content

In the HTML source shown in Firebug (Figure 1), you can see that the data of interest (the number of employees) is embedded within a <span> tag that has a unique HTML ID. You can also see that the <span id="num_employees"> tag is the first of two <span> tags with IDs. So the last two instructions in Listing 1 request the first <span> tag with link = doc.search('tr span[1]') and then emit the content of the parsed link with puts link[0].content.

Crunchbase also exposes a REST API that provides access to more data than is available through scraping. Listing 2 shows how to use this API to extract the number of employees from the Crunchbase site.
Listing 2. Combining the Crunchbase REST API with JSON parsing (api.rb)

#!/usr/bin/env ruby
require 'rubygems'
require 'json'
require 'net/http'

# Define the URL with the argument passed by the user
uri = "http://api.crunchbase.com/v/1/company/#{ARGV[0]}.js"

# Perform the HTTP GET request, and return the response
resp = Net::HTTP.get_response(URI.parse(uri))

# Parse the JSON from the response body
jresp = JSON.parse(resp.body)

# Emit the content of interest
puts jresp['number_of_employees']

In Listing 2, you define a URL (the company name is passed in as a script argument). The Net::HTTP class is then used to issue a GET request and return the response. The response body is parsed as JSON, and you can reference the data item of interest through a Ruby data structure.
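
The parse-and-lookup step can be tried without a network connection. The following is a minimal sketch, not one of the original listings: the helper name employee_count and the sample JSON body are assumptions made for illustration, and the only field it relies on is number_of_employees from Listing 2.

```ruby
require 'json'

# Hypothetical helper: pull the employee count out of a JSON response body.
# Returns nil when the body is not valid JSON or the field is missing.
def employee_count(body)
  data = JSON.parse(body)
  data.is_a?(Hash) ? data['number_of_employees'] : nil
rescue JSON::ParserError
  nil
end

# Made-up sample body, shaped like the response Listing 2 expects.
sample = '{"name": "Example Corp", "number_of_employees": 1200}'

puts employee_count(sample).inspect
puts employee_count('<html>oops</html>').inspect
```

Returning nil instead of raising lets a calling script decide how to report a company that is missing or a response that is not JSON.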

The console session in Listing 3 shows the results of running the scraping script in Listing 1 and the API-based script in Listing 2.
Listing 3. Demonstrating the scraping and API approaches

$ ./parse.rb IBM
$ ./api.rb IBM
$ ./parse.rb Cisco
$ ./api.rb Cisco
63000
$ ./parse.rb PayPal
$ ./api.rb PayPal

When the scraping script runs, you receive a formatted count, whereas the API script returns a raw integer. As Listing 3 shows, you can generalize each script to obtain the number of employees for other companies that Crunchbase tracks. The common URL structure used by each method makes this generality possible.
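
To compare the two results programmatically, the formatted count has to be normalized. The sketch below shows one way to convert between the two forms. The helper names parse_count and format_count are hypothetical, and the code assumes non-negative counts with comma thousands separators.

```ruby
# Convert a formatted count such as "63,000" to an Integer.
def parse_count(text)
  Integer(text.delete(','), 10)
end

# Format a non-negative Integer with comma thousands separators.
def format_count(number)
  number.to_s.reverse.scan(/\d{1,3}/).join(',').reverse
end

puts parse_count('63,000')
puts format_count(63000)
```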

So what do you gain with the API approach? For scraping, you need to study the HTML to understand its structure and identify the data to extract. Nokogiri then makes it easy to parse the HTML and get the data you're interested in. However, if the structure of the HTML document changes, you may need to modify the script to parse the new structure correctly. The API approach does not have this problem, as long as the API contract holds. Another important advantage of the API approach is that you have access to all the data exposed through the interface (through the returned JSON object). Much less Crunchbase data is exposed through the HTML intended for human readers.

Now look at how you can use some other APIs, again with Ruby scripts, to extract various kinds of information from the Internet. First, look at how to collect personal data from a social networking site. You will then see how to find less personal data through other API sources.

Extracting personal data from LinkedIn

LinkedIn is a social networking site for professionals. It is useful for connecting with other developers, looking for work, researching a company, or joining a group to collaborate on interesting topics. LinkedIn also incorporates a recommendation engine that can suggest jobs and companies based on your profile.

LinkedIn users can use the site's REST and JavaScript APIs to obtain the information that is available through its human-readable website: contact information, social sharing streams, content groups, communications (messages and connection invitations), and company and job information.

To use the LinkedIn API, you must register your application. After registration, you receive an API key and secret, as well as a user token and secret. LinkedIn uses the OAuth protocol for authentication.

After authenticating, you can make REST requests through the access token object. The response is a typical HTTP response, so you can parse the body into a JSON object. You can then iterate over the JSON object to extract the data of interest.
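
The parse-then-iterate pattern can be sketched offline as follows. This is an illustration only: the helper name names_in and the sample body are made up, and the only structure assumed is a "values" array of objects with a "name" key, which is what Listing 4 iterates over.

```ruby
require 'json'

# Hypothetical helper: collect the "name" of every entry under "values".
# A missing "values" key yields an empty list instead of an error.
def names_in(body)
  data = JSON.parse(body)
  Array(data['values']).map { |entry| entry['name'] }.compact
end

# Made-up response body; only the "values"/"name" keys mirror Listing 4.
sample = <<~JSON_BODY
  {"values": [{"id": 1, "name": "Example One"},
              {"id": 2, "name": "Example Two"}]}
JSON_BODY

names_in(sample).each { |name| puts " #{name}" }
```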

The Ruby script in Listing 4 retrieves the companies suggested to follow and the job suggestions for the authenticated LinkedIn user.
Listing 4. Using the LinkedIn API (lkdin.rb) to view company and job suggestions

#!/usr/bin/ruby
require 'rubygems'
require 'oauth'
require 'json'

pquery = 'http://api.linkedin.com/v1/people/~?format=json'
cquery = 'http://api.linkedin.com/v1/people/~/suggestions/to-follow/companies?format=json'
jquery = 'http://api.linkedin.com/v1/people/~/suggestions/job-suggestions?format=json'

# Fill in the keys and secrets you retrieved after registering your app
api_key = 'api key'
api_secret = 'api secret'
user_token = 'user token'
user_secret = 'user secret'

# Specify the LinkedIn API endpoint
configuration = { :site => 'https://api.linkedin.com' }

# Use the API key and secret to instantiate the consumer object
consumer = OAuth::Consumer.new(api_key, api_secret, configuration)

# Use the developer token and secret to instantiate the access token object
access_token = OAuth::AccessToken.new(consumer, user_token, user_secret)

# Get the username for this profile
response = access_token.get(pquery)
jresp = JSON.parse(response.body)
myname = "#{jresp['firstName']} #{jresp['lastName']}"
puts "\nSuggested companies to follow for #{myname}"

# Get the suggested companies to follow
response = access_token.get(cquery)
jresp = JSON.parse(response.body)

# Iterate through each and display the company name
jresp['values'].each do |company|
  puts " #{company['name']}"
end

# Get the job suggestions
response = access_token.get(jquery)
jresp = JSON.parse(response.body)
puts "\nSuggested jobs for #{myname}"

# Iterate through each suggested job and print the company name
jresp['jobs']['values'].each do |job|
  puts " #{job['company']['name']} in #{job['locationDescription']}"
end

puts "\n"

The console session in Listing 5 shows the output of running the Ruby script in Listing 4. The script's three separate calls to the LinkedIn API each produce different output (one for the authenticated user's profile and two for the company and job suggestions).
Listing 5. Demonstrating the LinkedIn Ruby script


Suggested companies to follow for M. Tim Jones
 Open Kernel Labs, Inc.
 Linaro Wind
 Linsyssoft Technologies
 American Megatrends Jethead Development
 Evidence Srl
 Aizyc Technology

Suggested jobs for M. Tim Jones
 Kozio in Greater Denver Area
 Samsung Semiconductor Inc in San Jose, CA
 Terran Systems in Sunnyvale, CA
 Magnum Semiconductor in San Francisco Bay Area
 RGB Spectrum in Alameda, CA
 Aptina in San Francisco Bay Area
 Cybercoders in San Francisco, CA
 Cybercoders in Alameda, CA
 SanDisk in Longmont, CO
 SanDisk in Longmont, CO


The LinkedIn API can be used in conjunction with any language that provides OAuth support.

Retrieving business data using the Yelp API

Yelp exposes a rich REST API for business searches, including ratings, reviews, and geographic searches (by neighborhood, city, or geocode). Using the Yelp API, you can search for a given type of business (such as "restaurants") and limit the search to a geographic bounding box, the vicinity of a geographic coordinate, or a neighborhood, address, or city. The JSON response contains a large amount of information about the businesses that match the criteria, including address information, distance, ratings, deals, and URLs for other types of information (such as pictures of the business, mobile-formatted information, and so on).

Like LinkedIn, Yelp uses OAuth for authentication, so you must register with Yelp to get a set of credentials for authenticating through the API. After the script authenticates, you can construct a REST-based URL request. In Listing 6, I hard-coded a request for restaurants in Boulder, Colorado. The response body is parsed into a JSON object and iterated to emit the desired information. Note that I exclude businesses that are closed.
Listing 6. Retrieving business data using the Yelp API (yelp.rb)

require 'rubygems'
require 'oauth'
require 'json'

consumer_key = 'your consumer key'
consumer_secret = 'your consumer secret'
token = 'your token'
token_secret = 'your token secret'
api_host = 'http://api.yelp.com'

consumer = OAuth::Consumer.new(consumer_key, consumer_secret, { :site => api_host })
access_token = OAuth::AccessToken.new(consumer, token, token_secret)

path = "/v2/search?term=restaurants&location=boulder,co"

jresp = JSON.parse(access_token.get(path).body)

# Print name, phone, review count, and rating for each open business
jresp['businesses'].each do |business|
  if business['is_closed'] == false
    printf("%-32s  %10s  %3d  %1.1f\n",
           business['name'], business['phone'],
           business['review_count'], business['rating'])
  end
end

The console session in Listing 7 shows sample output from running the Listing 6 script. For simplicity, I show only the first set of businesses returned, rather than using the limit/offset feature that the API supports (to make multiple calls that retrieve the entire list). The sample output shows the business name, phone number, number of reviews received, and average rating.
Listing 7. Demonstrating the Yelp API Ruby script

$ ./yelp.rb
Frasca Food and Wine              3034426966  189  4.5
John's Restaurant                 3034445232       4.5
Leaf Vegetarian Restaurant        3034421485  144  4.0
Nepal Cuisine                     3035545828       4.5
Black Cat Bistro                  3034445500       4.0
The Mediterranean Restaurant      3034445335  306  4.0
Arugula Bar E Ristorante          3034435100   48  4.0
Ras Kassa's Ethiopia Restaurant   3034472919       4.0
L'Atelier                         3034427233       4.0
Bombay                            3034444721       4.0
Brasserie Ten Ten                 3039981010       4.0
Flagstaff House                   3034424640       4.5
Pearl Street Mall                 3034493774       4.0
Gurkhas on the Hill               3034431355       4.0
The Kitchen                       3035445973  274
Chez Thuy Restaurant              3034421700       3.5
Il Pastaio                        3034479572  113  4.5
3 Margaritas                      3039981234   11  3.5
Q's Restaurant                    3034424880       4.0
Julia's Kitchen                                 8  5.0
$
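
The limit/offset paging mentioned above could be sketched as follows. This is a minimal illustration under stated assumptions: the helper name search_paths and the page size of 20 are hypothetical, and the parameter names limit and offset are taken from the feature name in the text rather than verified against the Yelp documentation. Each generated path would be passed to access_token.get as in Listing 6.

```ruby
require 'uri'

# Build one search path per page of results.
# page_size is an assumed value; check the API documentation for the real cap.
def search_paths(term, location, total, page_size = 20)
  (0...total).step(page_size).map do |offset|
    query = URI.encode_www_form(term: term, location: location,
                                limit: page_size, offset: offset)
    "/v2/search?#{query}"
  end
end

puts search_paths('restaurants', 'boulder,co', 60)
```

Using URI.encode_www_form also escapes the query values, so the comma in the location is sent as %2C.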

Yelp provides a well-documented API, complete with data descriptions, samples, error handling, and so on. Although the Yelp API is useful, its use is subject to limits. As a developer, you can make up to 100 API calls per day, plus 1,000 calls for testing purposes. If your application meets Yelp's display requirements, you can make 10,000 calls per day (or more).
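
One way to stay inside a daily limit is to count calls on the client side before making them. The class below is a hypothetical sketch, not part of any Yelp library. It only tracks a per-day counter in memory, and the limit of 100 is the figure quoted above.

```ruby
require 'date'

# Hypothetical in-memory budget for a per-day call limit.
class CallBudget
  def initialize(limit)
    @limit = limit
    @day = nil
    @used = 0
  end

  # Returns true and records the call if budget remains for the given day.
  def spend(today = Date.today)
    if today != @day
      @day = today
      @used = 0
    end
    return false if @used >= @limit

    @used += 1
    true
  end

  # Calls still available on the most recently seen day.
  def remaining
    @limit - @used
  end
end

budget = CallBudget.new(100)
puts budget.spend ? 'call allowed' : 'daily limit reached'
puts budget.remaining
```

A script would call spend before each API request and skip the request when it returns false. A real application would also need to persist the counter between runs.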

Domain location with a simple mashup

The next example combines two sources to produce information. In this case, you want to convert a web domain name to its general geographic location. The Ruby script in Listing 8 uses the Linux host command and the OpenCrypt IP Location API service to retrieve location information.
Listing 8. Retrieving location information for a web domain

#!/usr/bin/env ruby
require 'net/http'

aggr = ""
key = 'your api key here'

# Run the 'host' command and collect its output
IO.popen("host #{ARGV[0]}") { |io|
  until io.eof?
    aggr += io.gets
  end
}

# Find the IP address in the response from the 'host' command
pattern = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
if m = pattern.match(aggr)
  uri = "http://api.opencrypt.com/ip/?ip=#{m[0]}&key=#{key}"
  resp = Net::HTTP.get_response(URI.parse(uri))
  puts resp.body
end

In Listing 8, you first convert a domain name to an IP address using the local host command. (The host command itself resolves a domain name to an IP address using an internal API and DNS resolution.) You use a simple regular expression (and the match method) to extract the IP address from the host command output. With an IP address, you can use the IP location service at OpenCrypt to retrieve general geographic information. The OpenCrypt API allows up to 50,000 free API calls.
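
The regular-expression step can be tried on its own. In this sketch the sample text is a made-up approximation of host output that uses a documentation-only address, and the helper name first_ip is hypothetical. Because the pattern also matches out-of-range values such as 999.1.1.1, the sketch adds a check with Ruby's standard IPAddr class, which Listing 8 does not do.

```ruby
require 'ipaddr'

# Same pattern as Listing 8: four dotted groups of digits at the end of a line.
IP_PATTERN = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/

# Return the first line-ending IPv4 address in the text, or nil.
def first_ip(host_output)
  match = IP_PATTERN.match(host_output)
  return nil unless match

  candidate = match[0]
  IPAddr.new(candidate).ipv4? ? candidate : nil
rescue IPAddr::InvalidAddressError
  nil
end

# Made-up sample output; 192.0.2.10 is a reserved documentation address.
sample = "www.example.com has address 192.0.2.10\n" \
         "www.example.com mail is handled by 10 mail.example.com.\n"
puts first_ip(sample)
```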

The OpenCrypt API call is simple: the URL you construct contains the IP address you want to locate and the key that the OpenCrypt registration process provides to you. The HTTP response body contains the IP address, country code, and country name.
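
Building the request URL and reading the response can be sketched as below. The helper names location_url and parse_location are hypothetical. The code assumes the body is a series of KEY=value lines, as the sample output in Listing 9 suggests, and the field name CC for the country code and all sample values are placeholders rather than documented behavior.

```ruby
require 'uri'

# Build the lookup URL with the query values escaped.
def location_url(ip, key)
  "http://api.opencrypt.com/ip/?#{URI.encode_www_form(ip: ip, key: key)}"
end

# Parse a body made of KEY=value lines into a Hash, skipping other lines.
def parse_location(body)
  body.each_line.each_with_object({}) do |line, fields|
    key, value = line.strip.split('=', 2)
    fields[key] = value if key && value
  end
end

# Made-up body; the values are placeholders, not real service output.
sample = "IP=192.0.2.10\nCC=ZZ\nCN=Example Country\n"
puts location_url('192.0.2.10', 'your api key here')
p parse_location(sample)
```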

The console session in Listing 9 shows the output for two sample domain names.
Listing 9. Using the simple domain location script

$ ./where.rb www.baynet.ne.jp
$ ./where.rb www.pravda.ru
IP=
CN=Russian Federation

Querying Google APIs

The undisputed leader in web APIs is Google. Google has so many APIs that it provides another API to query them. With the Google API Discovery Service, you can list the available APIs that Google provides and extract their metadata. Although interaction with most Google APIs requires authentication, you can access the discovery API over a secure socket connection. For this reason, Listing 10 uses Ruby's HTTPS support to construct a connection to the secure port. The defined URL specifies the REST request, and the response is JSON-encoded. The script iterates over the response and emits a small amount of data for each preferred API.
Listing 10. Using the Google API Discovery Service (gdir.rb) to list Google APIs

require 'rubygems'
require 'net/https'
require 'json'

url = 'https://www.googleapis.com/discovery/v1/apis'

uri = URI.parse(url)

# Set up a connection to the Google API Service
http = Net::HTTP.new(uri.host, 443)
http.use_ssl = true
# VERIFY_NONE skips certificate verification; use VERIFY_PEER outside of demos
http.verify_mode = OpenSSL::SSL::VERIFY_NONE

# Connect to the service
req = Net::HTTP::Get.new(uri.request_uri)
resp = http.request(req)

# Get the JSON representation
jresp = JSON.parse(resp.body)

# Iterate through the API list
jresp['items'].each do |item|
  if item['preferred'] == true
    name = item['name']
    title = item['title']
    link = item['discoveryLink']
    printf("%-17s %-34s %-20s\n", name, title, link)
  end
end

The console session in Listing 11 shows sample output from running the script in Listing 10.
Listing 11. Using the simple Google directory service Ruby script

$ ./gdir.rb
adexchangebuyer   Ad Exchange Buyer API              ./apis/adexchangebuyer/v1.1/rest
adsense           AdSense Management API             ./apis/adsense/v1.1/rest
adsensehost       AdSense Host API                   ./apis/adsensehost/v4.1/rest
analytics         Google Analytics API               ./apis/analytics/v3/rest
androidpublisher  Google Play Android Developer API  ./apis/androidpublisher/v1/rest
audit             Enterprise Audit API               ./apis/audit/v1/rest
bigquery          BigQuery API                       ./apis/bigquery/v2/rest
blogger           Blogger API                        ./apis/blogger/v3/rest
books             Books API                          ./apis/books/v1/rest
calendar          Calendar API                       ./apis/calendar/v3/rest
compute           Compute Engine API                 ./apis/compute/v1beta12/rest
coordinate        Google Maps Coordinate API         ./apis/coordinate/v1/rest
customsearch      CustomSearch API                   ./apis/customsearch/v1/rest
dfareporting      DFA Reporting API                  ./apis/dfareporting/v1/rest
discovery         APIs Discovery Service             ./apis/discovery/v1/rest
drive             Drive API                          ./apis/drive/v2/rest
storage           Cloud Storage API                  ./apis/storage/v1beta1/rest
taskqueue         TaskQueue API                      ./apis/taskqueue/v1beta2/rest
tasks             Tasks API                          ./apis/tasks/v1/rest
translate         Translate API                      ./apis/translate/v2/rest
urlshortener      URL Shortener API                  ./apis/urlshortener/v1/rest
webfonts          Google Web Fonts Developer API     ./apis/webfonts/v1/rest
youtube           YouTube API                        ./apis/youtube/v3alpha/rest
youtubeanalytics  YouTube Analytics API              ./apis/youtubeanalytics/v1/rest
$

The output in Listing 11 shows the API names, their titles, and the URL path for further analysis of each API.
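
The relative paths in the last column can be resolved against the discovery URL to fetch each API's description. The sketch below filters a small hand-written sample the same way Listing 10 does and resolves each link with URI.join. The helper name preferred_apis and the two sample entries are made up for illustration, and only the items, name, preferred, and discoveryLink keys are taken from Listing 10.

```ruby
require 'json'
require 'uri'

BASE = 'https://www.googleapis.com/discovery/v1/apis'

# Return [name, absolute discovery URL] for each preferred API.
def preferred_apis(body)
  JSON.parse(body).fetch('items', [])
      .select { |item| item['preferred'] == true }
      .map { |item| [item['name'], URI.join(BASE, item['discoveryLink']).to_s] }
end

# Made-up sample with one preferred and one non-preferred version.
sample = <<~JSON_BODY
  {"items": [
    {"name": "drive", "preferred": true,  "discoveryLink": "./apis/drive/v2/rest"},
    {"name": "drive", "preferred": false, "discoveryLink": "./apis/drive/v1/rest"}
  ]}
JSON_BODY

preferred_apis(sample).each { |name, url| puts "#{name} #{url}" }
```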


The examples in this article demonstrate the power of public APIs for extracting information from the Internet. Compared with web scraping and spidering, web APIs provide access to targeted, specific information. New value is being created on the Internet, not only through the use of these APIs but also by combining them in novel ways to deliver new data to a growing number of web users.

But keep in mind that using APIs comes at a price. Usage limits are a common complaint. Also, the fact that API rules can change without notice is something you must consider when building your application. Recently, Twitter changed its API to provide a "more consistent experience." That change was a disaster for many third-party applications that might be viewed as competitors to the typical Twitter web client.
