Notes: Hadoop-based Open Source Projects

Source: Internet
Author: User
Keywords: Hadoop, Java

Here are my notes introducing, and giving some hints on, Hadoop-based open source projects. I hope they are useful to you.

Management Tool

Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Hive, and Pig applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.

Ambari enables System Administrators to:

  • Provision a Hadoop Cluster: Ambari handles configuration of Hadoop services for the cluster and provides a step-by-step wizard for installing Hadoop services across any number of hosts.

  • Manage a Hadoop Cluster: Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.

  • Monitor a Hadoop Cluster: Ambari provides a dashboard for monitoring the health and status of the Hadoop cluster. Ambari leverages Ganglia for metrics collection and Nagios for system alerting, and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).

Ambari also enables application developers and system integrators to easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs; a minimal example appears below.
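For illustration, here is a small sketch of calling the Ambari REST API from plain Java. The endpoint /api/v1/clusters lists the clusters an Ambari server manages; the host name "ambari-host:8080" and the "admin:admin" credentials are placeholder assumptions.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Base64;

    public class AmbariClusterList {
        public static void main(String[] args) throws Exception {
            // Placeholder Ambari server address and admin credentials.
            URL url = new URL("http://ambari-host:8080/api/v1/clusters");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes("UTF-8"));
            conn.setRequestProperty("Authorization", "Basic " + auth);
            conn.setRequestProperty("X-Requested-By", "ambari");

            // Print the JSON listing of clusters managed by this Ambari instance.
            try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }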

Chukwa: Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and the MapReduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results in order to make the best use of the collected data.

Data Storage

Avro: A data serialization system.

Avro provides:

  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Remote procedure call (RPC).
  • Simple integration with dynamic languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

  • Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.

  • Untagged data: Since the schema is present when the data is read, considerably less type information needs to be encoded with the data, resulting in smaller serialization sizes.

  • No manually-assigned field IDs: When a schema changes, both the old and new schemas are always present when processing data, so differences may be resolved symbolically, using field names.
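A minimal sketch of this dynamic-typing style in Java, using Avro's GenericRecord so that no classes are generated. The "User" schema and the "users.avro" file name are invented for the example; note how the reader recovers the schema from the container file itself.

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroDynamicTypingDemo {
        public static void main(String[] args) throws Exception {
            // A schema defined at runtime; no generated classes are involved.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);

            // The container file stores the schema alongside the data.
            File file = new File("users.avro");
            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
                writer.create(schema, file);
                writer.append(user);
            }

            // Readers recover the schema from the file; no code generation needed.
            try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<>())) {
                for (GenericRecord r : reader) {
                    System.out.println(r);
                }
            }
        }
    }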

HBase: Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables, billions of rows by millions of columns, atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable ("Bigtable: A Distributed Storage System for Structured Data" by Chang et al.). Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Features

  • Linear and modular scalability.
  • Strictly consistent reads and writes.
  • Automatic and configurable sharding of tables.
  • Automatic failover support between RegionServers.
  • Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
  • Easy-to-use Java APIs for client access.
  • Block cache and Bloom filters for real-time queries.
  • Query predicate push down via server-side filters.
  • Thrift gateway and a RESTful web service that supports XML, Protobuf, and binary data encoding options.
  • Extensible JRuby-based (JIRB) shell.
  • Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.
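To give a feel for the Java client API mentioned above, here is a small sketch of a random write followed by a realtime read. It assumes an existing table named "users" with a column family "info"; both names are made up for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseReadWriteDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 // Assumes a table "users" with column family "info" already exists.
                 Table table = connection.getTable(TableName.valueOf("users"))) {

                // Random write: one row keyed by a user id.
                Put put = new Put(Bytes.toBytes("user-1001"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                              Bytes.toBytes("alice"));
                table.put(put);

                // Random realtime read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("user-1001")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println("name = " + Bytes.toString(name));
            }
        }
    }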

Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
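One common way to issue HiveQL from Java is over JDBC against a HiveServer2 instance. The sketch below assumes a server at "hive-host:10000" and a table named "access_logs"; both are placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQlDemo {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver and connect to a (placeholder) HiveServer2.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "hive", "");
            try (Statement stmt = conn.createStatement()) {
                // A SQL-like HiveQL query; Hive compiles it into MapReduce jobs.
                ResultSet rs = stmt.executeQuery(
                    "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            } finally {
                conn.close();
            }
        }
    }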

Accumulo: The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's Bigtable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift. Apache Accumulo features a few novel improvements on the Bigtable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Gora: The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores, and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from their relational cousins. Moreover, data-model-agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence framework for big data, with data-store-specific mappings and built-in Apache Hadoop support.

The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.

  • Data persistence: Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL and HSQLDB; and flat files in the local file system or Hadoop HDFS.
  • Data access: An easy-to-use, Java-friendly common API for accessing the data regardless of its location.
  • Indexing: Persisting objects to Lucene and Solr indexes, and accessing/querying the data with the Gora API.
  • Analysis: Accessing the data and performing analysis through adapters for Apache Pig, Apache Hive, and Cascading.
  • MapReduce support: Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.

ORM stands for Object Relational Mapping. It is a technology which abstracts the persistence layer (mostly relational databases) so that plain domain-level objects can be used, without the cumbersome effort of saving and loading data to and from the database. Gora differs from current solutions in the following ways:

  • Gora is specially focused on NoSQL data stores, but also has limited support for SQL databases.
  • The main use case for Gora is to access/analyze big data using Hadoop.
  • Gora uses Avro for bean definition, not byte code enhancement or annotations.
  • Object-to-data-store mappings are backend specific, so the full data model can be utilized.
  • Gora is simple since it ignores complex SQL mappings.
  • Gora supports persistence, indexing, and analysis of data, using Pig, Lucene, Hive, etc.
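A rough sketch of the Gora persistence API is shown below. It assumes a hypothetical Gora bean WebPage (with url and content fields) generated from an Avro schema by the Gora compiler, and that the backing data store is configured in gora.properties; treat it as an outline rather than a complete program.

    import org.apache.gora.store.DataStore;
    import org.apache.gora.store.DataStoreFactory;
    import org.apache.hadoop.conf.Configuration;

    public class GoraPersistDemo {
        public static void main(String[] args) throws Exception {
            // WebPage is a hypothetical Gora bean generated from an Avro schema;
            // the concrete backend (HBase, Cassandra, ...) is picked via gora.properties.
            DataStore<String, WebPage> store =
                DataStoreFactory.getDataStore(String.class, WebPage.class, new Configuration());

            WebPage page = new WebPage();
            page.setUrl("http://example.org/");
            page.setContent("hello gora");

            // Persist the object against its key, then flush to the underlying store.
            store.put("http://example.org/", page);
            store.flush();

            WebPage fetched = store.get("http://example.org/");
            System.out.println(fetched.getUrl());
            store.close();
        }
    }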

HCatalog: Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

This includes:

  • Providing a shared schema and data type mechanism.
  • Providing a table abstraction so that users need not be concerned with where or how their data is stored.
  • Providing interoperability across data processing tools such as Pig, MapReduce, and Hive.

Development Platform

Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

  • Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
  • Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
  • Extensibility. Users can create their own functions to do special-purpose processing.
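Pig Latin can also be driven from Java through the embedded PigServer API. The word-count sketch below runs in local mode; the input file "input.txt" (one word per line) and output directory are assumptions for the example.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigWordCountDemo {
        public static void main(String[] args) throws Exception {
            // Run Pig Latin locally; ExecType.MAPREDUCE would target a Hadoop cluster instead.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Hypothetical input file with one word per line.
            pig.registerQuery("words = LOAD 'input.txt' AS (word:chararray);");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

            // Materialize the result; Pig compiles the data flow into map-reduce jobs.
            pig.store("counts", "wordcount-out");
            pig.shutdown();
        }
    }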

Bigtop: Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.

RHIPE: RHIPE (pronounced "hree-pay") is the R and Hadoop Integrated Programming Environment; the name means "in a moment" in Greek. RHIPE is a merger of R and Hadoop. R is the widely used, highly acclaimed interactive language and environment for data analysis, and Hadoop consists of the Hadoop Distributed File System (HDFS) and the MapReduce distributed compute engine. RHIPE allows an analyst to carry out divide-and-recombine (D&R) analysis of complex big data wholly from within R, using Hadoop to carry out the big, parallel computations.

R/Hadoop: The aim of this project is to provide easy-to-use R interfaces to the open source distributed computing environment Hadoop, including Hadoop Streaming and the Hadoop Distributed File System.

Data Transfer Tools

Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into the Hadoop Distributed File System or related systems such as Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.

Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple, extensible data model that allows for online analytic applications.

Workflow & Pipeline

Oozie:

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). Oozie is a scalable, reliable, and extensible system.
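Workflows themselves are defined in XML and deployed to HDFS, but they can be submitted programmatically with the Oozie Java client. In the sketch below, the Oozie URL, the HDFS application path, and the nameNode/jobTracker addresses are placeholder assumptions.

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;

    public class OozieSubmitDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder Oozie server URL and workflow application path on HDFS.
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/alice/workflow");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "jobtracker-host:8032");

            // Submit and start the workflow job, then report its status.
            String jobId = oozie.run(conf);
            System.out.println("Submitted workflow " + jobId + ", status: "
                + oozie.getJobInfo(jobId).getStatus());
        }
    }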

Crunch: The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Running on top of Hadoop MapReduce, the Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The API is especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
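The classic illustration of this style is a word count expressed as a Crunch pipeline; the HDFS input and output paths below are placeholders.

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class CrunchWordCount {
        public static void main(String[] args) throws Exception {
            // A pipeline backed by Hadoop MapReduce; the paths are placeholders.
            Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
            PCollection<String> lines = pipeline.readTextFile("hdfs:///input/lines.txt");

            // Split each line into words with a user-defined function.
            PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
                @Override
                public void process(String line, Emitter<String> emitter) {
                    for (String word : line.split("\\s+")) {
                        emitter.emit(word);
                    }
                }
            }, Writables.strings());

            // count() aggregates occurrences of each word across the data set.
            PTable<String, Long> counts = words.count();
            pipeline.writeTextFile(counts, "hdfs:///output/wordcounts");
            pipeline.done();
        }
    }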
