Apache Hadoop is an industry-standard and widely adopted MapReduce implementation, and the Savanna project is designed to let users run and manage Hadoop clusters on top of OpenStack. Amazon has provided a similar capability for years through its Elastic MapReduce (EMR) service.
To build a cluster, Savanna asks the user for a small amount of information: the Hadoop version, the cluster topology, node hardware details, and a few other parameters. Savanna then deploys the cluster within a matter of minutes, and it can also scale an existing cluster by adding or removing worker nodes on demand.
The solution addresses the following use cases:
Quickly configure Hadoop clusters for Dev and QA
Provide "analytics as a service" for dedicated or unexpected analytic workloads (similar to EMR in AWS)
Take advantage of unused computing power in a generic OpenStack IaaS cloud .
The main features are as follows:
Designed as an OpenStack component
Managed through a REST API, with a user interface available as part of the OpenStack Dashboard.
Support for multiple Hadoop distributions:
A pluggable system of Hadoop installation engines (a sketch of the idea follows this list).
Integration with vendor-specific management tools, such as Apache Ambari or the Cloudera Management Console.
Predefined templates of Hadoop configurations, with adjustable configuration parameters.
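To make the pluggable engine idea concrete, here is a minimal sketch of what such an interface could look like. The class and method names are hypothetical, not Savanna's actual API:

    from abc import ABC, abstractmethod

    class ProvisioningEngine(ABC):
        # Hypothetical base class for a pluggable Hadoop installation
        # engine; names are illustrative only.

        @abstractmethod
        def install(self, vms, hadoop_version):
            """Install the requested Hadoop version on freshly booted VMs."""

        @abstractmethod
        def configure(self, vms, params):
            """Apply cluster-level Hadoop parameters (heap sizes, etc.)."""

    class AmbariEngine(ProvisioningEngine):
        # A vendor-specific engine would wrap its management tool here,
        # e.g. by driving the Apache Ambari API.
        def install(self, vms, hadoop_version):
            pass
        def configure(self, vms, params):
            pass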
Details
Savanna primarily communicates with the following OpenStack components:
Horizon - Provides a GUI to use all of Savanna's features.
Keystone - Authenticates users and provides the security tokens used to communicate with OpenStack, limiting each user to a specific set of OpenStack privileges (a token-exchange sketch follows this list).
Nova - Provisions the virtual machines for Hadoop clusters.
Glance - Stores Hadoop virtual machine images; each image contains an installed OS plus Hadoop, and pre-installing Hadoop speeds up node provisioning.
Swift - Can be used as storage for the data processed by Hadoop jobs.
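As an illustration of the Keystone step, the interaction amounts to exchanging credentials for a token that accompanies every later request. This is a minimal sketch against the Keystone v2.0 API; the endpoint URL and credentials are placeholders:

    import requests

    # Exchange username/password for a Keystone token (v2.0 API).
    resp = requests.post(
        "http://keystone:5000/v2.0/tokens",
        json={"auth": {
            "tenantName": "demo",
            "passwordCredentials": {"username": "demo-user",
                                    "password": "secret"},
        }})
    token = resp.json()["access"]["token"]["id"]
    # Calls to Nova, Glance, Swift, or Savanna then carry this token
    # in the X-Auth-Token header.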
General workflow
Savanna provides two levels of abstraction for its API and UI, corresponding to the use cases: cluster provisioning and analytics as a service.
The workflow for fast cluster provisioning is as follows (a request sketch follows this list):
Choose Hadoop version
Select a base image, with or without pre-installed Hadoop:
For base images without pre-installed Hadoop, Savanna uses the pluggable deployment engine, which integrates with vendor tooling.
Define the cluster configuration, including the cluster size and topology, and set various Hadoop parameters (such as heap size).
To simplify this step, configurable templates are provided.
Provision the cluster: Savanna creates the virtual machines, then installs and configures Hadoop.
Operate the cluster: add and remove nodes.
Stop the cluster when it is not needed.
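Because Savanna is driven through a REST API, the steps above map naturally onto HTTP calls. The sketch below is illustrative: the endpoint path and payload fields are assumptions, not the documented Savanna API:

    import requests

    BASE = "http://savanna:8080/v1.0/<tenant-id>"   # hypothetical endpoint
    HEADERS = {"X-Auth-Token": "<keystone-token>"}  # from the Keystone exchange

    # Create a cluster from a base image plus a topology: a mapping of
    # node templates to the number of nodes started from each.
    cluster = {
        "name": "dev-cluster",
        "base_image_id": "<glance-image-id>",
        "node_templates": {
            "jt_nn.medium": 1,   # one combined JobTracker + NameNode
            "tt_dn.small": 4,    # four TaskTracker + DataNode workers
        },
    }
    requests.post(BASE + "/clusters", json={"cluster": cluster},
                  headers=HEADERS)

    # Termination frees the VMs once the cluster is no longer needed.
    requests.delete(BASE + "/clusters/dev-cluster", headers=HEADERS)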
The workflow for analytics as a service is as follows (a job-submission sketch follows this list):
Select one of the predefined Hadoop versions
Configure the job:
Choose the job type: Pig, Hive, JAR file, and so on
Provide the job script source or JAR location
Choose the input and output data locations (initially, only Swift is supported)
Choose a location for the logs
Set a limit on the cluster size
Execute the job:
All cluster provisioning and job execution happen transparently to the user
After the job is completed, the cluster is automatically removed
Retrieve the results of the computation (for example, from Swift)
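At this level of abstraction the user never touches the cluster directly. A job submission might look roughly like the following; the endpoint and field names are assumptions for illustration only:

    import requests

    # Hypothetical job description for "analytics as a service".
    job = {
        "type": "pig",                                  # pig / hive / jar-file
        "script": "swift://scripts.savanna/wordcount.pig",
        "input": "swift://data.savanna/input/",         # Swift only, initially
        "output": "swift://data.savanna/output/",
        "logs": "swift://data.savanna/logs/",
        "cluster_size_limit": 10,
    }
    requests.post("http://savanna:8080/v1.0/<tenant-id>/jobs",
                  json={"job": job},
                  headers={"X-Auth-Token": "<keystone-token>"})
    # Savanna provisions a transient cluster, runs the job, tears the
    # cluster down, and leaves the results at the output location.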
User perspective
When provisioning clusters through Savanna, the user works with two types of entities: Node Templates and Clusters.
A Node Template describes a node within a cluster and holds several parameters. The node type is one of its properties: it determines which Hadoop processes will run on the node, and therefore the role the node plays in the cluster, which may be JobTracker, NameNode, TaskTracker, DataNode, or a logical combination of these. The Node Template also holds hardware parameters for the node's virtual machine and configuration for the Hadoop processes running on the node.
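A node template can be pictured as a small record. The field names below are illustrative, not Savanna's exact schema:

    # Illustrative shape of a Node Template (field names are assumptions).
    node_template = {
        "name": "tt_dn.small",
        "node_type": "TT+DN",        # runs both TaskTracker and DataNode
        "flavor": "m1.small",        # hardware parameters for the node's VM
        "task_tracker": {"heap_size": 896},   # per-process Hadoop settings
        "data_node": {"heap_size": 896},
    }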
The Cluster entity describes a Hadoop cluster. It is characterized chiefly by the virtual machine image with pre-installed Hadoop used for cluster deployment, and by the cluster topology. The topology is a list of node templates together with the number of nodes deployed from each. With respect to the topology, Savanna verifies only that the NameNode and the JobTracker are each unique within the cluster.
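That uniqueness check is simple to state in code. A minimal sketch, with hypothetical names:

    def validate_topology(topology):
        # `topology` is a list of (node_template, count) pairs; node_type
        # names the Hadoop processes a template runs ("JT+NN", "TT+DN",
        # "NN", ...). A sketch of the check, not Savanna's actual code.
        name_nodes = sum(n for t, n in topology if "NN" in t["node_type"])
        job_trackers = sum(n for t, n in topology if "JT" in t["node_type"])
        if name_nodes != 1 or job_trackers != 1:
            raise ValueError("a cluster needs exactly one NameNode and "
                             "exactly one JobTracker")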
Each node template and cluster belongs to the tenant determined by the user, and users can access only the objects located in tenants they have access to. Users can edit or delete only the objects they have created, while administrators have full access to every object. In this way, Savanna complies with the general OpenStack access policy.
Savanna offers several Hadoop cluster topologies. The JobTracker and NameNode processes can run either on a single virtual machine or on two separate ones. A cluster can also contain worker nodes of different types: a worker can run both TaskTracker and DataNode, or just one of these processes. Savanna allows users to set up a cluster with any combination of these options.
Integration with Swift
Swift is the standard object storage service in OpenStack, analogous to Amazon S3, and it is typically deployed on bare-metal machines. A number of enhancements are needed for Swift to serve as an "HDFS for OpenStack."
The first is a Hadoop file system implementation for Swift (HADOOP-8545), which allows Hadoop jobs to run over Swift. The second is a change to Swift itself (Change I6b1ba25b) that exposes the endpoints of Object, Account, and Container entities; this lets Swift integrate with software that relies on data-locality information to avoid network overhead.
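As a rough illustration, once the file system implementation is in place, a Hadoop job references Swift-backed data through swift:// URLs. The property names below are modeled on the hadoop-openstack module from HADOOP-8545 and should be treated as indicative rather than exact:

    # Hadoop properties for using Swift as a Hadoop file system
    # (names modeled on hadoop-openstack; assumed here for illustration).
    swift_conf = {
        "fs.swift.impl":
            "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem",
        "fs.swift.service.savanna.auth.url":
            "http://keystone:5000/v2.0/tokens",   # Keystone endpoint
        "fs.swift.service.savanna.tenant": "demo",
        "fs.swift.service.savanna.username": "demo-user",
        "fs.swift.service.savanna.password": "secret",
    }
    # Job inputs and outputs can then use swift:// URLs, for example:
    #   swift://<container>.savanna/input/logs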
Pluggable deployment and monitoring
In addition to the monitoring provided by vendor-specific Hadoop management tools, Savanna offers pluggable integration with external monitoring systems such as Nagios and Zabbix.
Both deployment and monitoring tools will be installed on stand-alone virtual machines, allowing a single instance to manage or monitor several clusters at once.