Notes: Hadoop based open source projects


Here are my notes introducing some Hadoop-based open source projects, with hints for each. I hope they are useful to you.

Management Tool

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (such as heatmaps) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

Ambari enables System Administrators to:

- Provision a Hadoop Cluster: Ambari handles configuration of Hadoop services for the cluster and provides a step-by-step wizard for installing Hadoop services across any number of hosts.
- Manage a Hadoop Cluster: Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
- Monitor a Hadoop Cluster: Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster. Ambari leverages Ganglia for metrics collection and Nagios for system alerting, sending emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).

Ambari also enables Application Developers and System Integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
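A minimal sketch of that REST integration, assuming a hypothetical Ambari host, cluster name, and credentials; the `/api/v1` base path, HTTP Basic auth, and the `X-Requested-By` header are standard Ambari conventions:

```python
import base64
import urllib.request

AMBARI = "http://ambari.example.com:8080"  # hypothetical Ambari host

def ambari_request(path, user="admin", password="admin"):
    """Build an authenticated GET request against the Ambari REST API."""
    req = urllib.request.Request(AMBARI + "/api/v1" + path)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    req.add_header("X-Requested-By", "ambari")  # Ambari requires this on write calls
    return req

# e.g. fetch HDFS service state for a cluster named "MyCluster":
# urllib.request.urlopen(ambari_request("/clusters/MyCluster/services/HDFS"))
req = ambari_request("/clusters/MyCluster/services/HDFS")
print(req.full_url)
```

The same pattern covers start/stop operations, which are PUT requests against the same resource paths.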

Chukwa: Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.

Data Storage

Avro: A data serialization system.

Avro provides:

- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
- Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

- Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
- Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
- No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
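The "no manually-assigned field IDs" point can be illustrated with a toy resolver. This is not the Avro library, just a stdlib sketch of name-based resolution between a writer's schema and a newer reader's schema (the schemas themselves use real Avro record syntax; the `User` record is made up):

```python
import json

writer_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "int"}]}
""")
reader_schema = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name",  "type": "string"},
            {"name": "email", "type": "string", "default": ""}]}
""")

def resolve(record, writer, reader):
    """Match fields by name (not by position or numeric ID), filling
    fields the writer never wrote from the reader schema's defaults."""
    writer_names = {f["name"] for f in writer["fields"]}
    out = {}
    for f in reader["fields"]:
        if f["name"] in writer_names:
            out[f["name"]] = record[f["name"]]
        else:
            out[f["name"]] = f.get("default")
    return out

print(resolve({"name": "ada", "age": 36}, writer_schema, reader_schema))
```

The old `age` field is dropped and the new `email` field gets its default, with no field numbering to coordinate between the two versions.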

HBase: Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features

- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables.
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy-to-use Java API for client access.
- Block cache and Bloom filters for real-time queries.
- Query predicate push down via server-side Filters.
- Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options.
- Extensible JRuby-based (JIRB) shell.
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
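As an illustrative sketch of the JRuby shell mentioned above (the table and column family names are made up), a minimal session exercising random reads and writes looks like:

```
create 'users', 'info'                     # table with one column family
put 'users', 'row1', 'info:name', 'ada'    # write one cell
get 'users', 'row1'                        # random, realtime read of one row
scan 'users'                               # iterate over all rows
```

Columns are addressed as `family:qualifier`, and new qualifiers can be written at any time without schema changes, which is the column-oriented, Bigtable-style model described above.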

Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
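As a hedged illustration of HiveQL (the table and columns are hypothetical), projecting a tabular structure onto tab-delimited files and running a SQL-like summarization might look like:

```sql
-- hypothetical log table projected over tab-delimited files
CREATE TABLE logs (ts STRING, level STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- ad-hoc summarization; Hive compiles this to map/reduce jobs
SELECT level, COUNT(*) AS n
FROM logs
GROUP BY level;
```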

Accumulo: The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Gora: The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence framework for big data, with data store specific mappings and built-in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

- Data Persistence: Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
- Data Access: An easy-to-use, Java-friendly common API for accessing the data regardless of its location.
- Indexing: Persisting objects to Lucene and Solr indexes, accessing/querying the data with the Gora API.
- Analysis: Accessing the data and running analysis through adapters for Apache Pig, Apache Hive and Cascading.
- MapReduce support: Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.

ORM stands for Object-Relational Mapping. It is a technology which abstracts the persistence layer (mostly relational databases) so that plain domain-level objects can be used, without the cumbersome effort to save/load the data to and from the database. Gora differs from current solutions in that:

- Gora is specially focused on NoSQL data stores, but also has limited support for SQL databases.
- The main use case for Gora is to access/analyze big data using Hadoop.
- Gora uses Avro for bean definition, not byte code enhancement or annotations.
- Object-to-data store mappings are backend specific, so that the full data model can be utilized.
- Gora is simple since it ignores complex SQL mappings.
- Gora will support persistence, indexing and analysis of data, using Pig, Lucene, Hive, etc.

HCatalog: Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

This includes:

- Providing a shared schema and data type mechanism.
- Providing a table abstraction so that users need not be concerned with where or how their data is stored.
- Providing interoperability across data processing tools such as Pig, MapReduce, and Hive.

Development Platform

Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
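To give a feel for how Pig Latin encodes a task as a data flow sequence, here is a hedged word-count sketch (the input and output paths are hypothetical):

```
-- each statement names a relation; Pig compiles the whole flow to Map-Reduce
lines   = LOAD '/data/input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO '/data/wordcount';
```

Because the flow is explicit, the compiler is free to reorder and combine steps, which is the "optimization opportunities" point above.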

Bigtop: Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.

RHIPE: RHIPE (pronounced "hree-pay") is the R and Hadoop Integrated Programming Environment; the name means "in a moment" in Greek. RHIPE is a merger of R and Hadoop. R is the widely used, highly acclaimed interactive language and environment for data analysis. Hadoop consists of the Hadoop Distributed File System (HDFS) and the MapReduce distributed compute engine. RHIPE allows an analyst to carry out D&R (divide and recombine) analysis of complex big data wholly from within R. RHIPE communicates with Hadoop to carry out the big, parallel computations.

R/Hadoop: The aim of this project is to provide easy-to-use R interfaces to the open source distributed computing environment Hadoop, including Hadoop Streaming and the Hadoop Distributed File System.

Data Transferring Tool

Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into the Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
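As a hedged example of what an import invocation looks like (the JDBC URL, credentials, table, and target directory below are hypothetical):

```
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username etl \
  --table orders \
  --target-dir /user/etl/orders \
  -m 4
```

Sqoop turns this into a MapReduce job; `-m 4` asks for four parallel map tasks, each importing a slice of the table.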

Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
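Flume agents are wired together in a properties file as source → channel → sink flows. A minimal sketch, assuming a hypothetical agent `a1` that tails an application log into HDFS (all names and paths are made up):

```
# agent a1: one source, one channel, one sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

a1.sources.r1.type     = exec
a1.sources.r1.command  = tail -F /var/log/app.log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type      = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel   = c1
```

Swapping the memory channel for a file channel is one of the "tunable reliability mechanisms": it trades throughput for durability across agent restarts.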

Workflow & Pipeline

Oozie:

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

- Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
- Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
- Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).
- Oozie is a scalable, reliable and extensible system.
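Workflows are defined as XML DAGs of actions. A hedged sketch of a minimal one (the workflow name, action, script, and schema versions are illustrative):

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="cleanup"/>
  <action name="cleanup">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>cleanup.sh</exec>
      <file>cleanup.sh</file>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Cleanup action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action declares both an `ok` and an `error` transition, which is how the DAG encodes failure handling explicitly.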

Crunch: The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Running on top of Hadoop MapReduce, the Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
