Serengeti has two core functions: virtual machine management, and cluster software installation and configuration management. Virtual machine management creates and manages, in vCenter, the virtual machines that a Hadoop cluster requires. Cluster software installation and configuration management installs the Hadoop-related components (Zookeeper, Hadoop, Hive, Pig, and so on) on virtual machines that already have an operating system installed, updates the configuration files with information such as the IP addresses of the Namenode, Jobtracker, and Zookeeper nodes, and then starts the Hadoop services. Ironfan is the Serengeti component responsible for cluster software installation and configuration management.
Ironfan is a cluster software deployment and configuration management tool built on Chef. Chef is an open source system configuration management tool, similar to Puppet and CFEngine. It defines an easy-to-use DSL (domain-specific language) for installing and configuring any software, and for configuring the system itself, on a machine that already has a basic operating system installed. Building on Chef's framework and APIs, Ironfan provides an easy-to-use command-line tool for automated deployment and management of clusters. Ironfan supports deploying Zookeeper, Hadoop, and HBase clusters, and you can write new cookbooks to deploy any other, non-Hadoop cluster.
Ironfan was originally developed in Ruby by Infochimps, a Big Data start-up in the United States, and is open source on github.com under the Apache License v2. It initially supported deploying Hadoop clusters on Ubuntu virtual machines in Amazon EC2. The VMware Project Serengeti team chose to build its Big Data cluster tooling on Ironfan and made a number of significant improvements that allow Ironfan to create and deploy Hadoop clusters on CentOS 5.x virtual machines in VMware vCenter. Project Serengeti's improved Ironfan is also open source on github.com under the Apache License v2 and is free to download and modify.
Ironfan architecture
The figure below depicts Ironfan's architecture. Ironfan consists of the Cluster Orchestration Engine, the VM Provision Engine, and the Software Provision Engine, together with the Chef Server and the Package Server, which store data.
Cluster Orchestration Engine: Ironfan's main controller. It loads and parses the cluster definition file, creates the virtual machines, saves the cluster's configuration information in the Chef Server, and calls the Chef REST API to create a Chef Node and a Chef Client for each virtual machine and to set the Chef Roles for each virtual machine.
VM Provision Engine: Creates all the virtual machines in the cluster and waits for them to obtain IP addresses. It provides interfaces for creating virtual machines in different cloud environments; Amazon EC2 and VMware vCenter are currently supported. In Serengeti, all virtual machines are created in VMware vCenter by Ironfan's caller, which records them in the cluster spec file that is passed to Ironfan's VM Provision Engine.
Software Provision Engine: Logs in to every virtual machine over SSH, using a default username and password created in the virtual machine in advance, and starts chef-client to install the software. chef-client is the agent in the Chef framework; on the node where it runs, it executes the installation and configuration scripts specified in advance by the node's Chef Roles. chef-client also saves its execution progress in the Chef Server.
Chef Server: Stores the Chef Nodes, Chef Clients, Chef Roles, and Chef Cookbooks and provides the Chef REST API. It is a core part of the Chef framework.
Package Server: Stores the installation packages for Hadoop and for the other software that Hadoop requires.
Ironfan exposes the Knife CLI as its external command-line interface. The caller (that is, the Serengeti Web Service component) starts a separate process to run the Knife CLI and determines success or failure from that process's exit status. The caller can fetch detailed cluster node data and execution progress from the Chef Server at any time.
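The calling pattern described above (run the CLI in a child process, then judge success by its exit status) can be sketched as follows. This is a minimal illustration, not Serengeti's actual caller: the helper name run_cli is hypothetical, and a tiny Ruby child process stands in for a real knife invocation so that the sketch runs on any machine.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper: run a command in a separate process and report
# success or failure from the child's exit status.
def run_cli(*argv)
  stdout, stderr, status = Open3.capture3(*argv)
  {
    success: status.success?,      # true only when the exit status is 0
    exit_code: status.exitstatus,
    stdout: stdout,
    stderr: stderr
  }
end

# Stand-ins for `knife cluster ...` calls: child processes that exit with
# a chosen status.
ok = run_cli(RbConfig.ruby, '-e', 'exit 0')
failed = run_cli(RbConfig.ruby, '-e', 'exit 3')

puts "first call succeeded: #{ok[:success]}"
puts "second call exit code: #{failed[:exit_code]}"
```

Only the exit status decides success here; progress and node details would still have to be read from the Chef Server, as the text describes.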
Ironfan Knife CLI
Each Serengeti CLI cluster command corresponds to an Ironfan Knife CLI command: create, list, config, stop, start, and delete.
cluster create => knife cluster create -f /opt/serengeti/logs/task///.json --yes --bootstrap
cluster list => knife cluster show -f /opt/serengeti/logs/task///.json --yes
cluster config => knife cluster bootstrap -f /opt/serengeti/logs/task///.json --yes
cluster stop => knife cluster stop -f /opt/serengeti/logs/task///.json --yes
cluster start => knife cluster start -f /opt/serengeti/logs/task///.json --yes --bootstrap
cluster delete => knife cluster kill -f /opt/serengeti/logs/task///.json --yes
The parameter /opt/serengeti/logs/task///.json is the cluster spec file that the Serengeti Web Service passes to Ironfan. It is a JSON file that contains the cluster's node groups, the number of nodes in each group, the software definitions for the nodes, the cluster configuration, the Package Server, the names and IP addresses of all virtual machines, and other information. Ironfan parses the cluster spec file, generates the cluster definition file that it needs, and saves it at /opt/serengeti/tmp/.ironfan-clusters/.rb.
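The command mapping above can be expressed as a small lookup table. The sketch below is illustrative only: the names KNIFE_SUBCOMMANDS and knife_argv are hypothetical, the spec file path /tmp/demo.json is a placeholder, and the confirmation flag is assumed to be knife's --yes. It only builds the argument list and does not run knife.

```ruby
# Serengeti CLI cluster command => [knife cluster subcommand, add --bootstrap?]
KNIFE_SUBCOMMANDS = {
  'create' => ['create', true],
  'list'   => ['show', false],
  'config' => ['bootstrap', false],
  'stop'   => ['stop', false],
  'start'  => ['start', true],
  'delete' => ['kill', false]
}.freeze

# Hypothetical helper: build the knife argument list for a cluster command.
def knife_argv(cluster_command, spec_file)
  subcommand, bootstrap = KNIFE_SUBCOMMANDS.fetch(cluster_command) do
    raise ArgumentError, "unknown cluster command: #{cluster_command}"
  end
  argv = ['knife', 'cluster', subcommand, '-f', spec_file, '--yes']
  argv << '--bootstrap' if bootstrap
  argv
end

# /tmp/demo.json is a placeholder path, not a real Serengeti task file.
puts knife_argv('create', '/tmp/demo.json').join(' ')
puts knife_argv('delete', '/tmp/demo.json').join(' ')
```

Returning an argument array rather than a single string avoids shell quoting problems when the command is later handed to a process-spawning call.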
Ironfan cluster definition file (DSL, roles)
Next, let's look at how Ironfan defines a cluster. The figure below shows demo.rb, the definition file for a cluster named demo. It is a Ruby file that describes the cluster's composition in the DSL defined by Ironfan, and it defines three virtual machine groups. Each facet defines one group of virtual machines that have the same software installed. The number of nodes in a group is specified by instances, and the software to install on its virtual machines is specified by role, which refers to a role defined in Chef.
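To show the shape of such a definition without requiring Ironfan itself, the sketch below implements a tiny stand-in DSL and uses it to describe a three-facet cluster. Everything here is an assumption for illustration: the Cluster and Facet classes and the top-level cluster method are not Ironfan's real classes or entry point, and apart from hadoop_namenode and hadoop the role names are invented.

```ruby
# Minimal stand-in for a facet: a group of identical virtual machines.
class Facet
  attr_reader :name, :roles

  def initialize(name)
    @name = name
    @instances = 1
    @roles = []
  end

  # With an argument, sets the node count; without one, returns it.
  def instances(count = nil)
    @instances = count unless count.nil?
    @instances
  end

  # Records a Chef role to apply to every node in this facet.
  def role(role_name)
    @roles << role_name.to_s
  end
end

# Minimal stand-in for a cluster: a name plus a list of facets.
class Cluster
  attr_reader :name, :facets

  def initialize(name)
    @name = name
    @facets = []
  end

  def facet(facet_name, &block)
    new_facet = Facet.new(facet_name)
    new_facet.instance_eval(&block)
    @facets << new_facet
    new_facet
  end
end

# Hypothetical entry point; evaluates the block in the cluster's context.
def cluster(name, &block)
  new_cluster = Cluster.new(name)
  new_cluster.instance_eval(&block)
  new_cluster
end

demo = cluster 'demo' do
  facet :master do
    instances 1
    role :hadoop_namenode
    role :hadoop_jobtracker   # assumed role name
  end
  facet :worker do
    instances 3
    role :hadoop_datanode     # assumed role name
    role :hadoop_tasktracker  # assumed role name
  end
  facet :client do
    instances 1
    role :hadoop
  end
end

demo.facets.each do |f|
  puts "#{demo.name}-#{f.name}: #{f.instances} node(s), roles: #{f.roles.join(', ')}"
end
```

The point of the sketch is the structure: one cluster, several facets, and per facet a node count plus a list of Chef roles.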
Chef Roles and Cookbooks
All Chef Role files in Serengeti are stored in /opt/serengeti/cookbooks/roles/*.rb, and all Chef Cookbook files are stored in /opt/serengeti/cookbooks/cookbooks/.
For example, the hadoop_namenode role file, /opt/serengeti/cookbooks/roles/hadoop_namenode.rb, reads as follows:
name 'hadoop_namenode'
description 'runs a namenode in fully-distributed mode. There should be exactly one of these per cluster.'
# role[hadoop]: a role can contain references to another role.
# hadoop_cluster::namenode: hadoop_cluster is a cookbook and namenode is a recipe in that cookbook.
# (Comments are kept outside %w[ ] because Ruby would treat them as list items inside it.)
run_list %w[
  role[hadoop]
  hadoop_cluster::namenode
]
If a developer needs to modify or debug a role or cookbook, then after changing the role and cookbook files, run the following commands to upload the role and the cookbook:
knife role from file /opt/serengeti/cookbooks/roles/.rb -V
knife cookbook upload -V
Cluster Service Discovery
During cluster deployment, some components depend on the order in which other components are installed and their services started. For example, the Datanode service must start after the Namenode service has started, and the Tasktracker service must start after the Jobtracker service has started; these services usually do not run on the same virtual machine. Ironfan therefore needs to control the order of installation and service startup across nodes during deployment and to synchronize nodes that depend on one another. It does this with a cookbook named cluster_service_discovery.
The cluster_service_discovery cookbook defines methods such as provide_service, provider_fqdn, provider_fqdn_for_role, and all_providers_for_service to implement node synchronization. Taking the datanode service waiting for the namenode service to start as an example, synchronization works as follows:
In the namenode recipe, after the namenode service has started, provide_service(node[:hadoop][:namenode_service_name]) is called to register this node with the Chef Server as the provider of the namenode service;
In the datanode recipe, before the datanode service is started, provider_fqdn(node[:hadoop][:namenode_service_name]) is called to query the Chef Server for the FQDN (or IP) of the namenode service provider. The provider_fqdn method queries the Chef Server every 5 seconds until it gets a result, or fails with a timeout error after 30 minutes.
Other dependent nodes are synchronized by the same mechanism, for example Zookeeper nodes waiting for one another and HBase nodes waiting for the Zookeeper nodes. For the specific method calls, see the source code of the cluster_service_discovery, zookeeper, hadoop, and hbase cookbooks.
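The register-then-poll pattern in these two steps can be sketched in plain Ruby. This is not the cookbook's implementation: the ServiceRegistry class is a hypothetical in-memory stand-in for the Chef Server, the method signatures differ from the cookbook's (they take the registry and an explicit FQDN), and the service name and host name are invented. The 5-second interval and 30-minute timeout from the text appear as defaults and are shortened in the usage so that the sketch finishes quickly.

```ruby
# Hypothetical in-memory stand-in for the Chef Server's provider records.
class ServiceRegistry
  def initialize
    @providers = {}
    @lock = Mutex.new
  end

  # Register a node (by FQDN) as the provider of a service.
  def provide_service(service_name, fqdn)
    @lock.synchronize { @providers[service_name] = fqdn }
  end

  # Return the provider's FQDN, or nil if nobody has registered yet.
  def lookup(service_name)
    @lock.synchronize { @providers[service_name] }
  end
end

# Poll the registry until a provider appears or the timeout expires.
def provider_fqdn(registry, service_name, interval: 5, timeout: 30 * 60)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    fqdn = registry.lookup(service_name)
    return fqdn if fqdn
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise "timed out waiting for a provider of #{service_name}"
    end
    sleep interval
  end
end

registry = ServiceRegistry.new

# Simulated namenode: registers itself shortly after "starting".
namenode = Thread.new do
  sleep 0.2
  registry.provide_service('demo-namenode', 'namenode.example.com')
end

# Simulated datanode: blocks until the namenode provider is known.
found = provider_fqdn(registry, 'demo-namenode', interval: 0.05, timeout: 5)
namenode.join
puts "datanode can start; namenode is at #{found}"
```

Polling a shared store keeps the nodes decoupled: the datanode never contacts the namenode directly to learn whether it is ready, it only waits for the record to appear.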
About vSphere Big Data Extensions:
VMware vSphere Big Data Extensions (BDE) supports big data and Apache Hadoop workloads on the vSphere platform. Built on the open source Serengeti project, BDE gives enterprise users an integrated set of management tools. By virtualizing Apache Hadoop on vSphere, it helps users deploy, run, and manage big data infrastructure in an agile, elastic, secure, and fast way. To learn more about VMware vSphere Big Data Extensions, see http://www.vmware.com/hadoop.
About the author
Jesse Hu
VMware Senior Development Engineer
He is one of the technical leads for VMware's big data product vSphere BDE and for the Serengeti open source project. He was the earliest developer of the Serengeti project and of its first prototype system, and he is the architect of Ironfan, Serengeti's cluster software configuration and management module. Before joining VMware he worked at several IT companies, including Yahoo, IBM, and Oracle, and he has studied open source communities, cloud computing, mobile, SNS, Web 2.0, and Ruby.
Original link: http://vbigdata.blog.51cto.com/7526470/1338356