Application of Ironfan in Big Data Cluster Deployment and Configuration Management


Ironfan Introduction

In Serengeti, there are two critical functions. One is virtual machine management: creating and managing the virtual machines a Hadoop cluster needs in vCenter. The other is cluster software installation and configuration management: installing Hadoop-related components (including Zookeeper, Hadoop, Hive, and Pig) on virtual machines that already have an operating system, updating the configuration files (for example, with the IP addresses of the Namenode/Jobtracker/Zookeeper nodes), and then starting the Hadoop services. Ironfan is the Serengeti component responsible for cluster software installation and configuration management.

Ironfan is a cluster software deployment and configuration management tool built on Chef. Chef is an open-source system configuration management tool similar to Puppet and CFEngine; it defines an easy-to-use DSL (Domain Specific Language) for installing and configuring arbitrary software, and for configuring the system itself, on a machine with a base operating system installed. On top of the Chef framework and APIs, Ironfan provides simple command line tools for automated deployment and management of clusters. Out of the box, Ironfan supports deploying Zookeeper, Hadoop, and HBase clusters, and you can write new cookbooks to deploy other, non-Hadoop clusters.
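To give a flavor of the Chef DSL, here is a minimal, hypothetical recipe fragment in Ruby (illustrative only, not taken from the Serengeti cookbooks); the package name, template path, and attribute names are assumptions:

# Install a package with the operating system's package manager.
package 'hadoop' do
  action :install
end

# Render a Hadoop config file from an ERB template, filling in the Namenode address.
template '/etc/hadoop/conf/core-site.xml' do
  source 'core-site.xml.erb'
  variables(namenode_host: node[:hadoop][:namenode_host])
end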

Ironfan was initially developed by Infochimps, a U.S. big data startup, in Ruby, and open-sourced on github.com under Apache License v2. At first, Ironfan only supported deploying Hadoop clusters on Amazon EC2 Ubuntu virtual machines. The VMware Project Serengeti team chose Ironfan as the basis for its big data cluster tool and made a series of major improvements so that Ironfan can create and deploy Hadoop clusters on CentOS 5.x virtual machines in VMware vCenter. Project Serengeti's improved Ironfan is also open-sourced on github.com under Apache License v2 and is free to download and modify.

Ironfan Architecture

The following figure depicts the Ironfan architecture. Ironfan mainly comprises the Cluster Orchestration Engine, the VM Provision Engine, the Software Provision Engine, and the Chef Server and Package Server used to store data.


· Cluster Orchestration Engine: the overall controller of Ironfan. It is responsible for loading and parsing cluster definition files, creating the virtual machines, saving cluster configuration information in the Chef Server, calling the Chef REST API to create the corresponding Chef Node and Chef Client for each virtual machine, and setting each virtual machine's Chef Roles (see the sketch after this list).

· VM Provision Engine: creates all virtual machines in the cluster and waits for each virtual machine to obtain an IP address. The VM Provision Engine provides interfaces for creating virtual machines in various cloud environments; currently, Amazon EC2 and VMware vCenter are supported. In Serengeti, all virtual machines are created by Ironfan's caller in VMware vCenter, and their IP addresses are saved in the cluster spec file and passed to Ironfan's VM Provision Engine.

· Software Provision Engine: uses the default username and password pre-created in each virtual machine to log on to all virtual machines remotely through SSH and start chef-client to install the software. chef-client is the agent program of the Chef framework; on the node where it runs, it executes the installation and configuration scripts specified in advance by the node's Chef Roles. chef-client also saves its execution progress data in the Chef Server.

· Chef Server: stores Chef Nodes, Chef Clients, Chef Roles, and Chef Cookbooks. It provides the Chef REST API and is a core component of the Chef framework.

· Package Server: stores the Hadoop installation packages and the other packages that Hadoop depends on.
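As a rough illustration of the orchestration step above, the following Ruby sketch uses the standard Chef client library to create a Chef Node for one virtual machine and assign it a role. The node name, role name, and configuration path are assumptions; this is not the actual Serengeti source code:

require 'chef'

# Load the Chef Server URL and API credentials from a knife configuration file.
Chef::Config.from_file('/etc/chef/knife.rb')   # hypothetical path

node = Chef::Node.new
node.name('demo-worker-0')                     # one Chef Node per virtual machine
node.run_list << 'role[hadoop_datanode]'       # the Chef Role decides what software is installed
node.save                                      # persists the node through the Chef REST API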


Ironfan provides the Knife CLI command line interface. The caller (the Serengeti Web Service component) creates a separate process to run a Knife CLI command, and the process exit status indicates whether the command succeeded or failed, as sketched below. The caller can also obtain detailed cluster node data and execution progress information from the Chef Server at any time.
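A minimal sketch of this calling convention in Ruby, assuming a hypothetical cluster name and task directory:

# Spawn the Knife CLI in a child process and check its exit status.
ok = system('knife', 'cluster', 'create', 'demo',
            '-f', '/opt/serengeti/logs/task/1/1/demo.json',
            '--yes', '--bootstrap')

# system returns true when the child process exits with status 0.
puts(ok ? 'cluster create succeeded' : "cluster create failed (exit #{$?.exitstatus})")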

(Figure: Ironfan architecture)

Ironfan Knife CLI

Each Serengeti CLI cluster command corresponds to an Ironfan Knife CLI command, including create (create a cluster), list (view a cluster), config (configure a cluster), stop (stop a cluster), start (start a cluster), and delete (delete a cluster):

cluster create => knife cluster create <cluster_name> -f /opt/serengeti/logs/task/<task_id>/<step_id>/<cluster_name>.json --yes --bootstrap

cluster list   => knife cluster show <cluster_name> -f /opt/serengeti/logs/task/<task_id>/<step_id>/<cluster_name>.json --yes

cluster config => knife cluster bootstrap <cluster_name> -f /opt/serengeti/logs/task/<task_id>/<step_id>/<cluster_name>.json --yes

cluster stop   => knife cluster stop <cluster_name> -f /opt/serengeti/logs/task/<task_id>/<step_id>/<cluster_name>.json --yes

cluster start  => knife cluster start <cluster_name> -f /opt/serengeti/logs/task/<task_id>/<step_id>/<cluster_name>.json --yes --bootstrap

cluster delete => knife cluster kill <cluster_name> -f /opt/serengeti/logs/task/<task_id>/<step_id>/<cluster_name>.json --yes


The parameter /opt/serengeti/logs/task/<task_id>/<step_id>/<cluster_name>.json is the cluster spec file passed by the Serengeti Web Service to Ironfan. This JSON file contains the cluster's node groups, the number of nodes, the software definition for each node, the cluster configuration, the Package Server, and the names and IP addresses of all virtual machines. Ironfan parses the cluster spec file, generates the cluster definition file it needs, and stores it in /opt/serengeti/tmp/.ironfan-clusters/<cluster_name>.rb.
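The exact schema of the spec file is defined by Serengeti; purely as an illustration (the field names below are assumptions, not the real schema), such a file might look like:

{
  "name": "demo",
  "groups": [
    {
      "name": "master",
      "instance_num": 1,
      "roles": ["hadoop_namenode", "hadoop_jobtracker"],
      "instances": [{ "name": "demo-master-0", "ip": "192.168.0.10" }]
    },
    {
      "name": "worker",
      "instance_num": 3,
      "roles": ["hadoop_datanode", "hadoop_tasktracker"]
    }
  ],
  "package_server": "http://<package_server_ip>/packages"
}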

Ironfan Cluster Definition File (DSL, roles)

Next, let's look at how Ironfan defines a cluster. The following figure shows the demo.rb definition file for a cluster named demo. It is a Ruby file that describes the structure of the cluster in the DSL defined by Ironfan, and it defines three facets. Each facet defines a virtual group of several virtual machines that have the same software installed; the number of nodes in each group is specified by instances, and the software to be installed on the virtual machines is specified by role, where the role is a role defined in Chef. A sketch of such a file follows the figure.

(Figure: demo.rb cluster definition file)
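In case the figure does not render, here is a hedged reconstruction of what such a demo.rb file looks like, based on the public Ironfan DSL; the facet names, instance counts, roles, and the cloud provider line are illustrative assumptions:

Ironfan.cluster :demo do
  cloud :vsphere                  # target virtual machine cloud (assumption)

  facet :master do
    instances 1                   # number of virtual machines in this facet
    role :hadoop_namenode         # Chef Roles decide the software to install
    role :hadoop_jobtracker
  end

  facet :worker do
    instances 3
    role :hadoop_datanode
    role :hadoop_tasktracker
  end

  facet :client do
    instances 1
    role :hadoop_client
  end
end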

Chef Roles and Cookbooks

All Chef role files in Serengeti are stored in /opt/serengeti/cookbooks/roles/*.rb, and all Chef cookbook files are stored in /opt/serengeti/cookbooks/.


Take the hadoop_namenode role as an example. The content of /opt/serengeti/cookbooks/roles/hadoop_namenode.rb is as follows:

name 'hadoop_namenode'

description 'Runs a namenode in fully-distributed mode. There should be exactly one of these per cluster.'

# role[hadoop]: one role can contain references to another role.
# hadoop_cluster::namenode: hadoop_cluster is a cookbook, and namenode is a recipe in this cookbook.
run_list %w[
  role[hadoop]
  hadoop_cluster::namenode
]


If a developer needs to modify and debug role or cookbook files, they should run the following commands after editing to upload the changed roles and cookbooks to the Chef Server:

knife role from file /opt/serengeti/cookbooks/roles/<role_name>.rb -V

knife cookbook upload <cookbook_name> -V


Cluster Service Discovery

During cluster deployment, some components must be installed and their services started in a specific order. For example, the Datanode service must be started after the Namenode service, and the Tasktracker service must be started after the Jobtracker service, yet these services usually do not run on the same virtual machine. Ironfan therefore needs to control the installation and startup order of services on different nodes during deployment, synchronizing the nodes that depend on each other. Ironfan uses a cookbook named cluster_service_discovery to synchronize related nodes.

The cluster_service_discovery cookbook defines provide_service, provider_fqdn, provider_fqdn_for_role, all_providers_for_service, and other methods for node synchronization. Taking the case where the datanode service must wait for the namenode service to start, synchronization is implemented as follows (see the sketch after this list):

· In the namenode recipe, after the namenode service is started, call provide_service(node[:hadoop][:namenode_service_name]) to register this node with the Chef Server as the namenode service provider;

· In the datanode recipe, before starting the datanode service, call provider_fqdn(node[:hadoop][:namenode_service_name]) to query the Chef Server for the FQDN (or IP address) of the namenode service provider; the provider_fqdn method queries the Chef Server every five seconds until a result is found, or returns an error when the request times out after 30 minutes.
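A minimal sketch of how these two recipe fragments might look in Ruby, assuming the method and attribute names above; this is an illustration, not the actual cookbook source:

# In the namenode recipe: start the service, then announce it.
service 'hadoop-namenode' do
  action [:enable, :start]
end
provide_service(node[:hadoop][:namenode_service_name])   # registers this node in the Chef Server

# In the datanode recipe: block until the namenode provider is known, then start.
namenode_host = provider_fqdn(node[:hadoop][:namenode_service_name])  # polls every 5 s, 30-minute timeout
node.default[:hadoop][:namenode_host] = namenode_host    # hypothetical attribute consumed by config templates

service 'hadoop-datanode' do
  action [:enable, :start]
end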


The synchronization of other related nodes uses a similar mechanism; for example, Zookeeper nodes wait for each other, and HBase nodes wait for the Zookeeper nodes. For the details of these calls, see the cluster_service_discovery, zookeeper, hadoop, and hbase cookbook source code.


About vSphere Big Data Extensions:

VMware vSphere Big Data Extensions (BDE) supports big data and Apache Hadoop workloads on the vSphere platform. Based on the open-source Serengeti project, BDE provides enterprise users with an integrated set of management tools. By virtualizing Apache Hadoop on vSphere, it helps users deploy, run, and manage big data workloads on their own infrastructure flexibly, elastically, securely, and quickly.
