Tutorial: Extracting web information with Ruby
Websites no longer cater only to human readers. Many now offer APIs that allow computer programs to obtain information. Screen scraping, the time-saving technique of parsing HTML pages into forms that are easier to work with, is still convenient. However, the opportunities to use APIs to simplify web data extraction are growing rapidly. According to ProgrammableWeb, more than 10,000 website APIs existed at the time of this writing, an increase of 3,000 over the past 15 months. (ProgrammableWeb itself provides an API for searching and retrieving APIs, mashups, member profiles, and other data from its directory.)
This article first introduces modern web scraping and compares it with the API approach. Ruby examples then show how to use APIs to extract structured information from some popular web properties. You need a basic understanding of the Ruby language, Representational State Transfer (REST), and the concepts behind JavaScript Object Notation (JSON) and XML.
Scraping versus APIs
Several scraping solutions are available today. Some convert HTML to other formats, such as JSON, which makes it easier to extract the desired content. Others read the HTML and let you define the content to extract as a function of the HTML hierarchy, in which the data is tagged. One such solution is Nokogiri, which supports parsing HTML and XML documents in Ruby. Other open source scraping tools include pjscrape for JavaScript and Beautiful Soup for Python. pjscrape implements a command-line tool to scrape fully rendered pages, including JavaScript content. Beautiful Soup is fully integrated into the Python 2 and 3 environments.
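To illustrate the hierarchy-driven style of extraction that these tools support, here is a minimal sketch using Ruby's built-in REXML library (so it runs without any gems). The markup snippet and its `num_employees` ID are invented for illustration, not taken from any real page:

```ruby
require 'rexml/document'

# A hand-made snippet imitating a page that tags its data with an ID
html = <<-XML
<table>
  <tr>
    <td><span id="num_employees">388,000</span></td>
    <td><span id="founded">1911</span></td>
  </tr>
</table>
XML

# Address the data by its place in the markup hierarchy:
# a <span> with a known ID inside a table row
doc = REXML::Document.new(html)
employees = REXML::XPath.first(doc, "//tr/td/span[@id='num_employees']").text
puts employees
```

Nokogiri offers the same XPath (and CSS selector) addressing with a faster parser that also tolerates malformed real-world HTML, which REXML does not.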
Assume that you want to use scraping with Nokogiri to identify the number of IBM employees as reported by CrunchBase. The first step is to understand the markup of the specific CrunchBase HTML page that lists the IBM employee count. Figure 1 shows the page opened in the Firebug tool in Mozilla Firefox. The upper part of the image shows the rendered page, and the lower part shows the HTML source for the portion of interest.
The Ruby script in Listing 1 uses Nokogiri to scrape the employee count from the web page in Figure 1.
Listing 1. Using Nokogiri to parse HTML (parse.rb)
#!/usr/bin/env ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Define the URL with the argument passed by the user
uri = "http://www.crunchbase.com/company/#{ARGV[0]}"

# Use Nokogiri to get the document
doc = Nokogiri::HTML(open(uri))

# Find the link of interest
link = doc.search('tr span[1]')

# Emit the content associated with that link
puts link[0].content
In the HTML source displayed by Firebug (as shown in Figure 1), you can see that the data of interest (the employee count) is embedded inside a <span> tag with a unique ID. The <span id="num_employees"> tag is the first of two <span> IDs. Therefore, the last two statements in Listing 1 request the first <span> tag with link = doc.search('tr span[1]') and then emit the parsed content of that link with puts link[0].content.
CrunchBase also exposes a REST API, which provides access to much more data than you can reach through scraping. Listing 2 shows how to use this API to extract the employee count from the CrunchBase site.
Listing 2. Using the CrunchBase REST API with JSON parsing (api.rb)
#!/usr/bin/env ruby
require 'rubygems'
require 'json'
require 'net/http'

# Define the URL with the argument passed by the user
uri = "http://api.crunchbase.com/v/1/company/#{ARGV[0]}.js"

# Perform the HTTP GET request, and return the response
resp = Net::HTTP.get_response(URI.parse(uri))

# Parse the JSON from the response body
jresp = JSON.parse(resp.body)

# Emit the content of interest
puts jresp['number_of_employees']
In Listing 2, you again define a URL (with the company name passed in as a script argument). Then you use the HTTP class to send a GET request and return the response. The response body is parsed as a JSON object, and you can reference the data item of interest through a Ruby data structure.
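The JSON-to-Ruby mapping is direct: JSON objects become hashes and JSON arrays become arrays. A minimal sketch of that last step, using an invented response body (the field name follows Listing 2; the values are made up):

```ruby
require 'json'

# An invented response body in the shape the script expects
body = '{"name": "IBM", "number_of_employees": 388000}'

# JSON objects parse into Ruby hashes, so fields are plain hash lookups
jresp = JSON.parse(body)
puts jresp['number_of_employees']
```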
The console session in Listing 3 shows the results of running both the scraping script from Listing 1 and the API-based script from Listing 2.
Listing 3. Demonstrating the scraping and API approaches
$ ./parse.rb ibm
388,000
$ ./api.rb ibm
388000
$ ./parse.rb cisco
63,000
$ ./api.rb cisco
63000
$ ./parse.rb paypal
300,000
$ ./api.rb paypal
300000
$
The scraping script returns a formatted count, whereas the API script produces a raw integer. As Listing 3 shows, you can generalize each script to obtain employee counts for other companies that CrunchBase tracks. The general structure of the URLs that each approach uses makes this versatility possible.
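Because the scraped value is a string formatted for human readers and the API returns a raw integer, comparing the two outputs takes one normalization step. A small sketch, using the sample values from Listing 3:

```ruby
# Value scraped from the HTML page (formatted for human readers)
scraped = "388,000"

# Value returned by the API (raw integer)
from_api = 388000

# Strip the thousands separators, then convert, to compare the two
normalized = scraped.delete(',').to_i
puts normalized == from_api
```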
So what does the API approach buy you? For scraping, you need to dig through the HTML to understand its structure and identify the data to extract. Parsing the HTML with Nokogiri and getting at the data of interest is then easy. However, if the structure of the HTML document changes, you may need to modify the script to parse the new structure correctly. The API approach has no such problem, as long as the API contract holds. Another important advantage of the API approach is that you can access all of the data exposed through the interface (via the returned JSON object). Far less CrunchBase data is exposed through the HTML and available to users.
Now let's look at how to use some other APIs to extract various kinds of information from the internet, again with Ruby scripts. First, see how to gather personal data from a social networking site. Then see how to find less personal data through other API sources.
Use LinkedIn to extract personal data
LinkedIn is a social networking website for professionals. It is useful for connecting with other developers, finding a job, researching a company, or joining a group to collaborate on interesting topics. LinkedIn also incorporates a recommendation engine that suggests jobs and companies based on your profile.
LinkedIn users can access the site's REST and JavaScript APIs to obtain the information that is otherwise available through its human-readable website: connections, social sharing streams, content groups, communications (messages and connection invitations), and company and job information.
To use the LinkedIn API, you must register your application. After registration, you obtain an API key and secret key, as well as a user token and secret. LinkedIn uses the OAuth protocol for authentication.
After authentication, you make REST requests through the access token object. A response is a typical HTTP response, so you can parse its body into a JSON object and then iterate over that object to extract the data of interest.
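The oauth gem hides the signing details. As a rough sketch of what happens under the hood when the access token object issues a request, the following standard-library-only code builds an OAuth 1.0a signature base string and signs it with HMAC-SHA1. All keys, the nonce, and the timestamp here are invented placeholders, not working credentials:

```ruby
require 'openssl'
require 'base64'
require 'erb'

# Placeholder secrets (the oauth gem receives these at construction time)
consumer_secret = 'consumer-secret'
token_secret    = 'token-secret'

# OAuth protocol parameters plus the request's own query parameters
params = {
  'oauth_consumer_key'     => 'api-key',
  'oauth_token'            => 'user-token',
  'oauth_nonce'            => 'abc123',
  'oauth_timestamp'        => '1000000000',
  'oauth_signature_method' => 'HMAC-SHA1',
  'oauth_version'          => '1.0',
  'format'                 => 'json'
}

# Parameters are percent-encoded, sorted, and joined into one string
encoded = params.sort.map { |k, v|
  "#{ERB::Util.url_encode(k)}=#{ERB::Util.url_encode(v)}"
}.join('&')

# Signature base string: METHOD & encoded-URL & encoded-parameter-string
base = ['GET',
        ERB::Util.url_encode('http://api.linkedin.com/v1/people/~'),
        ERB::Util.url_encode(encoded)].join('&')

# The signing key is the consumer secret and token secret joined by '&'
key = "#{consumer_secret}&#{token_secret}"
signature = Base64.strict_encode64(OpenSSL::HMAC.digest('SHA1', key, base))
puts signature
```

In practice, OAuth::AccessToken#get performs these steps for you on every request, generating a fresh nonce and timestamp each time and placing the result in the Authorization header.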
The Ruby script in Listing 4 authenticates and then retrieves the company recommendations and job suggestions for the LinkedIn user.
Listing 4. Viewing company and job suggestions with the LinkedIn API (lkdin.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'oauth'
require 'json'

pquery = "http://api.linkedin.com/v1/people/~?format=json"
cquery = 'http://api.linkedin.com/v1/people/~/suggestions/to-follow/companies?format=json'
jquery = 'http://api.linkedin.com/v1/people/~/suggestions/job-suggestions?format=json'

# Fill the keys and secrets you retrieved after registering your app
api_key = 'api key'
api_secret = 'api secret'
user_token = 'user token'
user_secret = 'user secret'

# Specify LinkedIn API endpoint
configuration = { :site => 'https://api.linkedin.com' }

# Use the API key and secret to instantiate consumer object
consumer = OAuth::Consumer.new(api_key, api_secret, configuration)

# Use the developer token and secret to instantiate access token object
access_token = OAuth::AccessToken.new(consumer, user_token, user_secret)

# Get the username for this profile
response = access_token.get(pquery)
jresp = JSON.parse(response.body)
myName = "#{jresp['firstName']} #{jresp['lastName']}"
puts "\nSuggested companies to follow for #{myName}"

# Get the suggested companies to follow
response = access_token.get(cquery)
jresp = JSON.parse(response.body)

# Iterate through each and display the company name
jresp['values'].each do | company |
  puts "  #{company['name']}"
end

# Get the job suggestions
response = access_token.get(jquery)
jresp = JSON.parse(response.body)
puts "\nSuggested jobs for #{myName}"

# Iterate through each suggested job and print the company name
jresp['jobs']['values'].each do | job |
  puts "  #{job['company']['name']} in #{job['locationDescription']}"
end

puts "\n"
The console session in Listing 5 shows the output of running the Ruby script in Listing 4. The script's three separate calls to the LinkedIn API each produce distinct output: one retrieves the profile of the authenticated user, and the other two retrieve the company suggestions and job suggestions.
Listing 5. Demonstrating the LinkedIn Ruby script
$ ./lkdin.rb

Suggested companies to follow for M. Tim Jones
  Open Kernel Labs, Inc.
  Linaro
  Wind River
  DDC-I
  Linsyssoft Technologies
  Kalray
  American Megatrends
  JetHead Development
  Evidence Srl
  Aizyc Technology

Suggested jobs for M. Tim Jones
  Kozio in Greater Denver Area
  Samsung Semiconductor Inc in San Jose, CA
  Terran Systems in Sunnyvale, CA
  Magnum Semiconductor in San Francisco Bay Area
  RGB Spectrum in Alameda, CA
  Aptina in San Francisco Bay Area
  CyberCoders in San Francisco, CA
  CyberCoders in Alameda, CA
  SanDisk in Longmont, CO
  SanDisk in Longmont, CO
$
The LinkedIn APIs can be used from any language that provides OAuth support.
Using the Yelp API to retrieve business data
Yelp exposes a rich REST API for business searches, including ratings, reviews, and geographical searches (by location, city, or geocode). With the Yelp API, you can search for a given type of business (such as "hotel") and restrict the search to a geographical bound, the area near a geographical coordinate, or a neighborhood, address, or city. The JSON response contains considerable information about the businesses that match the criteria, including address information, distance, ratings, deals, and various kinds of URLs (such as images of the business and mobile-formatted information).
Like LinkedIn, Yelp uses OAuth for authentication, so you must register with Yelp to receive a set of credentials for authenticating through the API. After the script authenticates, you can construct a REST-style URL request. In Listing 6, I hard-coded a restaurant request for Boulder, Colorado. The response body is parsed into a JSON object and iterated to emit the desired information. Note that the script excludes businesses that have closed.
Listing 6. Retrieving business data with the Yelp API (yelp.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'oauth'
require 'json'

consumer_key = 'your consumer key'
consumer_secret = 'your consumer secret'
token = 'your token'
token_secret = 'your token secret'
api_host = 'http://api.yelp.com'

consumer = OAuth::Consumer.new(consumer_key, consumer_secret, {:site => api_host})
access_token = OAuth::AccessToken.new(consumer, token, token_secret)

path = "/v2/search?term=restaurants&location=Boulder,CO"

jresp = JSON.parse(access_token.get(path).body)

jresp['businesses'].each do | business |
  if business['is_closed'] == false
    printf("%-32s %10s %3d %1.1f\n",
           business['name'], business['phone'],
           business['review_count'], business['rating'])
  end
end
The console session in Listing 7 shows sample output from running the script in Listing 6. For simplicity, I show only the first set of businesses returned, rather than using the API's limit/offset feature (which retrieves the entire list through multiple calls). The example shows each business's name, phone number, number of reviews received, and average rating.
Listing 7. Demonstrating the Yelp API Ruby script
$ ./yelp.rb
Frasca Food and Wine             3034426966 189 4.5
John's Restaurant                3034445232  51 4.5
Leaf Vegetarian Restaurant       3034421485 144 4.0
Nepal Cuisine                    3035545828  65 4.5
Black Cat Bistro                 3034445500  72 4.0
The Mediterranean Restaurant     3034445335 306 4.0
Arugula Bar E Ristorante         3034435100  48 4.0
Ras Kassa's Ethiopia Restaurant  3034472919 101 4.0
L'Atelier                        3034427233  58 4.0
Bombay Bistro                    3034444721  87 4.0
Brasserie Ten Ten                3039981010 200 4.0
Flagstaff House                  3034424640  86 4.5
Pearl Street Mall                3034493774  77 4.0
Gurkhas on the Hill              3034431355  19 4.0
The Kitchen                      3035445973 274 4.0
Chez Thuy Restaurant             3034421700  99 3.5
Il Pastaio                       3034479572 113 4.5
3 Margaritas                     3039981234  11 3.5
Q's Restaurant                   3034424880  65 4.0
Julia's Kitchen                               8 5.0
$
Yelp provides an API with excellent documentation, including data descriptions, examples, and error handling. Although the Yelp API is very useful, its use is rate-limited. As an initial developer, you can make up to 100 API calls per day, plus 1,000 API calls for test purposes. If your application meets Yelp's display requirements, you can make 10,000 calls per day (or more).
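The response-iteration logic of Listing 6 can be tried offline against a hand-made sample. The field names (`businesses`, `is_closed`, `name`, `review_count`, `rating`) follow Listing 6; the sample businesses themselves are invented:

```ruby
require 'json'

# Invented sample in the shape of a Yelp search response
body = <<-JSON
{ "businesses": [
    { "name": "Sample Bistro",  "rating": 4.5, "review_count": 120, "is_closed": false },
    { "name": "Shuttered Cafe", "rating": 3.0, "review_count": 12,  "is_closed": true  },
    { "name": "Corner Grill",   "rating": 4.0, "review_count": 87,  "is_closed": false }
] }
JSON

# Keep only businesses that are still open, as Listing 6 does,
# and print name, review count, and rating in aligned columns
open_businesses = JSON.parse(body)['businesses'].reject { |b| b['is_closed'] }
open_businesses.each do |b|
  printf("%-16s %3d %1.1f\n", b['name'], b['review_count'], b['rating'])
end
```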
Domain location with a simple mashup
The next example ties together two sources to produce information. In this case, you want to translate a web domain name into its general geographical location. The Ruby script in Listing 8 uses the Linux host command together with the OpenCrypt IP Location API Service to retrieve the location information.
Listing 8. Retrieving the location information for a web domain
#!/usr/bin/env ruby
require 'net/http'

aggr = ""
key = 'your api key here'

# Get the IP address for the domain using the 'host' command
IO.popen("host #{ARGV[0]}") { | line |
  until line.eof?
    aggr += line.gets
  end
}

# Find the IP address in the response from the 'host' command
pattern = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
if m = pattern.match(aggr)
  uri = "http://api.opencrypt.com/ip/?IP=#{m[0]}&key=#{key}"
  resp = Net::HTTP.get_response(URI.parse(uri))
  puts resp.body
end
In Listing 8, you first use the local host command to translate the domain name into an IP address. (The host command itself uses an internal API and DNS resolution to resolve the domain name to an IP address.) You use a simple regular expression (and the match method) to parse the IP address out of the host command's output. Given the IP address, you use the IP location service on OpenCrypt to retrieve the general geographical location information. The OpenCrypt API permits up to 50,000 free API calls.
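The regular-expression step can be exercised on its own against a typical line of host output. The sample line below is hand-written for illustration (93.184.216.34 is the well-known example.org documentation address):

```ruby
# Hand-written sample of what `host` prints for a domain
aggr = "www.example.org has address 93.184.216.34\n"

# Same pattern as Listing 8: four dot-separated digit groups at end of line
pattern = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
if m = pattern.match(aggr)
  puts m[0]
end
```

The trailing `$` anchor matters: it keeps the match at the end of the line, so the pattern picks up the resolved address rather than any dotted sequence appearing earlier in the output.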
The OpenCrypt API call is simple: the URL you construct contains the IP address to locate and the key provided to you during the OpenCrypt registration process. The HTTP response body contains the IP address, country code, and country name.
The console session in Listing 9 shows the output for two sample domain names.
Listing 9. Using a simple domain location script
$ ./where.rb www.baynet.ne.jp
IP=111.68.239.125
CC=JP
CN=Japan
$ ./where.rb www.pravda.ru
IP=212.76.137.2
CC=RU
CN=Russian Federation
$
Querying the Google APIs
Google is the undisputed heavyweight of web APIs. Google has so many APIs that it provides another API for querying them: with the Google API Discovery Service, you can list the available Google APIs and retrieve their metadata. Although interacting with most Google APIs requires authentication, you can access the Discovery Service over a secure socket connection without authenticating. For this reason, Listing 10 uses Ruby's net/https support to construct a connection to the secure port. The defined URL specifies the REST request, and the response is JSON-encoded. The script iterates through the response and emits a small portion of the data for the preferred APIs.
Listing 10. Using the Google API Discovery Service to list Google APIs (gdir.rb)
#!/usr/bin/ruby
require 'rubygems'
require 'net/https'
require 'json'

url = 'https://www.googleapis.com/discovery/v1/apis'
uri = URI.parse(url)

# Set up a connection to the Google API Service
http = Net::HTTP.new( uri.host, 443 )
http.use_ssl = true
http.verify_mode = OpenSSL::SSL::VERIFY_NONE

# Connect to the service
req = Net::HTTP::Get.new(uri.request_uri)
resp = http.request(req)

# Get the JSON representation
jresp = JSON.parse(resp.body)

# Iterate through the API List
jresp['items'].each do | item |
  if item['preferred'] == true
    name = item['name']
    title = item['title']
    link = item['discoveryLink']
    printf("%-17s %-34s %-20s\n", name, title, link)
  end
end
The console session in Listing 11 shows sample output from running the script in Listing 10.
Listing 11. Using a simple Google Directory Service Ruby script
$ ./gdir.rb
adexchangebuyer   Ad Exchange Buyer API              ./apis/adexchangebuyer/v1.1/rest
adsense           AdSense Management API             ./apis/adsense/v1.1/rest
adsensehost       AdSense Host API                   ./apis/adsensehost/v4.1/rest
analytics         Google Analytics API               ./apis/analytics/v3/rest
androidpublisher  Google Play Android Developer API  ./apis/androidpublisher/v1/rest
audit             Enterprise Audit API               ./apis/audit/v1/rest
bigquery          BigQuery API                       ./apis/bigquery/v2/rest
blogger           Blogger API                        ./apis/blogger/v3/rest
books             Books API                          ./apis/books/v1/rest
calendar          Calendar API                       ./apis/calendar/v3/rest
compute           Compute Engine API                 ./apis/compute/v1beta12/rest
coordinate        Google Maps Coordinate API         ./apis/coordinate/v1/rest
customsearch      CustomSearch API                   ./apis/customsearch/v1/rest
dfareporting      DFA Reporting API                  ./apis/dfareporting/v1/rest
discovery         APIs Discovery Service             ./apis/discovery/v1/rest
drive             Drive API                          ./apis/drive/v2/rest
...
storage           Cloud Storage API                  ./apis/storage/v1beta1/rest
taskqueue         TaskQueue API                      ./apis/taskqueue/v1beta2/rest
tasks             Tasks API                          ./apis/tasks/v1/rest
translate         Translate API                      ./apis/translate/v2/rest
urlshortener      URL Shortener API                  ./apis/urlshortener/v1/rest
webfonts          Google Web Fonts Developer API     ./apis/webfonts/v1/rest
youtube           YouTube API                        ./apis/youtube/v3alpha/rest
youtubeAnalytics  YouTube Analytics API              ./apis/youtubeAnalytics/v1/rest
$
The output in Listing 11 shows the API names and titles, along with the URL path for exploring each API further.
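The preferred-version filter in Listing 10 can be exercised against a hand-made fragment of a Discovery Service response. The field names follow Listing 10, and the two entries imitate a current and a superseded version of one API (the v1 entry is invented for illustration):

```ruby
require 'json'

# Hand-made fragment in the shape of the Discovery Service response
body = <<-JSON
{ "items": [
    { "name": "translate", "title": "Translate API", "preferred": true,
      "discoveryLink": "./apis/translate/v2/rest" },
    { "name": "translate", "title": "Translate API", "preferred": false,
      "discoveryLink": "./apis/translate/v1/rest" }
] }
JSON

# Keep only the preferred version of each API, as Listing 10 does
preferred = JSON.parse(body)['items'].select { |item| item['preferred'] }
preferred.each do |item|
  printf("%-12s %-16s %s\n", item['name'], item['title'], item['discoveryLink'])
end
```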
Conclusion
The examples in this article demonstrate the power of public APIs for extracting information from the internet. Compared with scraping and spidering, web APIs provide access to specific, targeted information. New value is constantly being created on the internet, not only through the use of these APIs but also by combining them in novel ways to put new data in front of a growing number of web users.
However, using APIs comes at a price. Rate limits are a frequent complaint. Likewise, the fact that API rules can change without notice must be taken into account when you build an application. Twitter recently changed its API to provide "a more consistent experience"; the change was a disaster for many third-party applications that could be viewed as competitors to Twitter's own web clients.