From previous reports we already know the outline of the architecture of Netflix's large-scale Hadoop job scheduling tool: its storage is based primarily on Amazon S3 (Simple Storage Service), and it uses the elasticity of the cloud to run multiple dynamically sized Hadoop clusters, allowing it to respond well to different types of workloads. This scalable Hadoop Platform-as-a-Service is called Genie, and Netflix has recently open-sourced it on GitHub.
Genie provides job and resource scheduling for Hadoop environments in the cloud. From the end user's perspective, Genie abstracts away the physical details of the various Hadoop resources, providing a RESTful Execution Service for submitting and monitoring Hadoop, Hive, and Pig jobs without having to install any Hadoop clients, and it manages the configurations of the clusters and their associated Hive and Pig installations.
Why Build Genie
There are two main reasons Netflix built Genie. The first is the need to run Hadoop clusters of different sizes in the cloud to cope with Netflix's different workloads. Some clusters are transient and started on demand; for example, at night Netflix starts "bonus" Hadoop clusters to add resources for ETL (extract, transform, load) processing. There are also long-running clusters, such as the regular SLA and ad hoc clusters, which nevertheless go down occasionally, because Netflix runs on cloud services and is subject to their stability. Users therefore need a way to find the current incarnation of these clusters by cluster name or by the type of workload they support. This is generally not a problem in a data center, where a Hadoop cluster does not go down from time to time, but in the cloud it is an unavoidable challenge.
Second, some end users want to run their own Hadoop, Hive, or Pig jobs, but few want to run their own clusters, let alone install client software and download all the jobs that need to be run. In general, whether in a data center or in the cloud, there is a need for a RESTful API to run jobs, on top of which many things can be built, such as a web UI, workflow templates, and visualization tools that encapsulate day-to-day needs.
How Genie Differs from Similar Tools
First, Genie is not a workflow scheduler such as Oozie. Genie's unit of execution is a single Hadoop, Pig, or Hive job. Genie does not schedule or run workflows; in fact, Netflix uses an enterprise scheduler (UC4) to run its ETL.
Second, Genie is not a task scheduler such as the fair-share or capacity schedulers within Hadoop itself. Genie is essentially a resource "matchmaker" that assigns a suitable cluster to a job based on the job's parameters and the clusters' attributes. If multiple clusters are candidates to run a job, Genie currently assigns one of them at random. A custom load balancer could be plugged in to better optimize job-to-cluster matching; however, no such balancer has been implemented yet.
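The matchmaking described above can be sketched in a few lines. This is an illustrative toy, not Genie's actual code or API; the cluster attributes and field names are hypothetical:

```python
import random

# Toy sketch of Genie-style cluster matchmaking (field names are hypothetical).
# A job names the schedule type and metastore config it needs; the matchmaker
# filters registered clusters on those attributes and picks one at random.

def match_cluster(job, clusters):
    """Return a registered cluster satisfying the job's criteria, chosen at random."""
    candidates = [
        c for c in clusters
        if job["schedule"] in c["schedules"] and job["metastore"] == c["metastore"]
    ]
    if not candidates:
        raise RuntimeError("no registered cluster matches this job")
    return random.choice(candidates)

clusters = [
    {"id": "sla-1",   "schedules": {"sla"},   "metastore": "prod"},
    {"id": "adhoc-1", "schedules": {"adhoc"}, "metastore": "prod"},
    {"id": "adhoc-2", "schedules": {"adhoc"}, "metastore": "prod"},
]
job = {"schedule": "adhoc", "metastore": "prod"}
print(match_cluster(job, clusters)["id"])  # one of adhoc-1 / adhoc-2
```

A smarter load balancer would replace `random.choice` with a policy that weighs cluster load or queue depth, which is exactly the pluggable point the article mentions.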
Finally, Genie is not an end-to-end resource management tool: it does not provision or launch clusters, nor does it scale clusters up or down based on their utilization. However, Genie can work together with such tools to better effect, acting as a repository of cluster metadata and an API for job management.
How Genie Works
The figure below details the core components of Genie, as well as its two classes of users: Hadoop administrators and end users.
Genie itself is built on the following Netflix OSS components:
Karyon--provides bootstrapping, runtime insights, diagnostics, and hooks for various cloud integrations.
Eureka--provides service registration and discovery (for example, finding active Genie instances).
Archaius--provides dynamic property management in the cloud.
Ribbon--provides Eureka integration, plus client-side load balancing and RESTful interprocess communication.
Servo--provides metrics that are exported via JMX (Java Management Extensions) and sent to external monitoring systems such as Amazon CloudWatch.
Genie can now be downloaded from GitHub and deployed in a servlet container such as Tomcat. But a deployment by itself is of little use until a Hadoop cluster has been registered with it. Registering a Hadoop cluster with Genie takes several steps:
A Hadoop administrator first launches a Hadoop cluster, for example using the EMR client API.
The cluster's Hadoop and Hive configurations are uploaded to a location on S3.
The administrator then uses the Genie client to discover a Genie instance via Eureka and invokes the RESTful API to register the cluster's configuration, using properties such as a unique ID, the cluster name, and other attributes, for example that it supports SLA jobs and the "prod" metastore. If a new metastore configuration is created, a new Hive or Pig configuration must also be registered with Genie.
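To make the registration step concrete, the following sketch builds the kind of JSON payload an administrator might send. The field names, IDs, and S3 paths are illustrative assumptions, not Genie's actual schema:

```python
import json

# Hypothetical cluster-registration payload (illustrative field names,
# NOT the real Genie schema). It carries the properties the text lists:
# a unique ID, a cluster name, and attributes such as SLA support and
# the metastore it points at.

cluster_registration = {
    "id": "bonus-etl-20130621",        # unique ID (hypothetical)
    "name": "bonus-etl",               # cluster name
    "status": "UP",
    "supportsSlaJobs": True,           # e.g. supports SLA jobs
    "metastore": "prod",               # the "prod" metastore
    # Hypothetical S3 prefix where the uploaded Hadoop/Hive configs live:
    "s3ConfigPrefix": "s3://example-bucket/genie/configs/bonus-etl/",
}

payload = json.dumps(cluster_registration, indent=2)
print(payload)
# In a real deployment this JSON would be POSTed to a Genie instance
# discovered via Eureka.
```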
Once a cluster is registered, Genie is ready to do what end users want: accept Hadoop, Hive, and Pig jobs. End users use the Genie client to launch and monitor their jobs. Internally, the client uses Eureka to find an active Genie instance and Ribbon to perform client-side load balancing and to communicate with the service's RESTful API. The job parameters that users specify include:
The type of job: Hadoop, Hive, or Pig
Command-line arguments for the job
A set of file dependencies on S3, including scripts or UDFs (user-defined functions)
The user must also tell Genie what kind of cluster to select. Here there are several choices: use a cluster name or cluster ID to pin a specific cluster, or specify a schedule type (such as SLA) and a metastore configuration (such as prod) and let Genie choose a suitable cluster to run the job based on those parameters.
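Putting the parameters above together, a job submission might look like the following sketch. Again, the field names and S3 paths are illustrative assumptions rather than Genie's actual request schema:

```python
import json

# Hypothetical job-submission payload combining the parameters described in
# the text (illustrative field names, NOT the real Genie schema).

job_request = {
    "jobType": "hive",  # hadoop, hive, or pig
    # Command-line arguments for the job (hypothetical script path):
    "commandArgs": "-f s3://example-bucket/scripts/daily_report.q",
    # File dependencies on S3, e.g. scripts and UDF jars:
    "fileDependencies": [
        "s3://example-bucket/scripts/daily_report.q",
        "s3://example-bucket/udfs/parse_logs.jar",
    ],
    # Cluster selection, option 1: pin a specific cluster...
    # "clusterId": "bonus-etl-20130621",
    # ...or option 2: give criteria and let Genie choose:
    "schedule": "adhoc",
    "metastore": "prod",
}

print(json.dumps(job_request, indent=2))
```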
Genie creates a new working directory for each job, stages all dependencies into it (including the Hadoop, Hive, and Pig configurations for the selected cluster), and then forks a Hadoop client process from that working directory. It returns a Genie job ID, which the client can use to query the job's status and to obtain an output URI that is browsable both during and after job execution (see the figure below). Users can use it to monitor the Hadoop client's standard output and error, and to inspect the Hive and Pig client logs when something goes wrong.
Genie's execution model is thus very simple: it forks a new process for each job from a new working directory. This simplicity has important benefits: each job is isolated from the others and from Genie itself, and the standard output, standard error, and job logs are easy to expose to end users via the output URI. Netflix chose not to implement a job queue inside Genie, since an internal queue would require implementing fair-share or capacity scheduling, which is already done at the Hadoop layer. Because each job is handled by a forked JVM process, the number of jobs that can run in parallel on a Genie instance is limited by the available memory.
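The process-per-job model can be sketched as below. The `echo` command stands in for a real Hadoop/Hive/Pig client invocation, and the helper is an illustration of the pattern, not Genie's implementation:

```python
import os
import subprocess
import tempfile

# Minimal sketch of a process-per-job execution model in the style the text
# describes: each job gets a fresh working directory, the client process is
# forked from there, and stdout/stderr are captured as files that a user
# could later browse via an output URI. Illustrative only; "echo" stands in
# for a real Hadoop/Hive/Pig client command.

def launch_job(job_id, command):
    """Run one job in its own working directory; return (workdir, exit code)."""
    workdir = tempfile.mkdtemp(prefix=f"genie-job-{job_id}-")
    with open(os.path.join(workdir, "stdout.log"), "wb") as out, \
         open(os.path.join(workdir, "stderr.log"), "wb") as err:
        proc = subprocess.Popen(command, cwd=workdir, stdout=out, stderr=err)
        proc.wait()  # a real service would poll asynchronously
    return workdir, proc.returncode

workdir, rc = launch_job("1234", ["echo", "job finished"])
print(rc)                            # 0 on success
print(sorted(os.listdir(workdir)))   # ['stderr.log', 'stdout.log']
```

Because every job is a separate OS process writing into its own directory, one misbehaving job cannot corrupt another's logs or crash the service, which is exactly the isolation property the article highlights.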
Genie's deployment at Netflix
Genie scales horizontally using ASGs (Auto Scaling Groups), managed and deployed through Asgard, which lets Netflix run thousands of Hadoop jobs in parallel. Asgard is used to configure the minimum, desired, and maximum number of instances across multiple Availability Zones for fault tolerance. For pushing new Genie builds, Asgard provides the concept of "sequential ASGs": as soon as a new Genie release is deployed, traffic is routed to the instances of the new ASG, and communication with the old instances is cut off by shutting down the old ASG.
Asgard can also be used to set up scaling policies for dynamic load. The screenshot below shows a simple policy that automatically brings up another Genie instance whenever the average number of running jobs per instance exceeds 25.
Genie's practice at Netflix
Netflix has been using Genie to process nearly a million Hadoop jobs, handling thousands of terabytes of data. The figure below summarizes some of Netflix's clusters over recent months:
The blue line represents one of the SLA clusters, and the orange line represents one of the main ad hoc clusters. The red line represents another ad hoc cluster running an experimental version of a fair-share scheduler, with Genie randomly assigning jobs between the two ad hoc clusters. Once satisfied with the new scheduler's performance, Netflix spun up another, larger ad hoc cluster (also shown in orange), routed all new ad hoc Genie jobs to it, and shut down the two old clusters as their running jobs completed.
Conclusion
While Genie is already powerful, Netflix believes there is plenty of room for it to keep improving, for example in designing a more generic data model, which currently has strong Netflix- and cloud-specific overtones. Netflix hopes to get more feedback on the project and make it better still.