Big Data, Part 3: Key Terms

Source: Internet
Author: User
Tags: hadoop, mapreduce

Hadoop = HDFS + Hive + Pig + ...

HDFS: the distributed storage system
MapReduce: the distributed computing framework
Hive: MapReduce for SQL developers (via HiveQL); a Hadoop-based data warehouse framework
Pig: a high-level data-flow language and execution framework on Hadoop
HBase: a NoSQL database built on Hadoop
Flume: a framework for collecting and moving data into Hadoop
Oozie: a workflow processing system that lets users define a series of jobs written in multiple languages, such as MapReduce, Pig, and Hive
Ambari: a web-based toolset for deploying, managing, and monitoring Hadoop clusters
Avro: a data serialization system that allows schema encoding of Hadoop files
Mahout: a data mining library that contains popular data mining algorithms implemented with the MapReduce model
Sqoop: a connectivity tool for moving data between non-Hadoop data stores, such as relational databases and data warehouses, and Hadoop
HCatalog: a centralized metadata management and sharing service for Apache Hadoop; it provides a unified view of all data in a Hadoop cluster and allows different tools, including Pig and Hive, to process any data element without needing to know where in the cluster it is physically stored.
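To make the MapReduce model above concrete, here is a minimal pure-Python word-count sketch of the three phases (map, shuffle/sort, reduce). It runs without Hadoop; the function names are illustrative, not part of any Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit an intermediate (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate pairs by key, as the framework does."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["hadoop stores data", "spark processes data"]
result = dict(reduce_phase(shuffle(map_phase(lines))))
print(result)  # {'data': 2, 'hadoop': 1, ...}
```

In a real Hadoop job the map and reduce phases run as parallel tasks across the cluster, and the framework performs the shuffle over the network.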

Bigtop: an effort to create a more formal process and framework for packaging and interoperability testing of Hadoop's sub-projects and related components, with the goal of improving the Hadoop platform as a whole.

Apache Storm: a distributed real-time computing system; Storm is a task-parallel, continuous computation engine. Storm is not typically run on a Hadoop cluster: it uses Apache ZooKeeper and its own master/worker processes to coordinate topologies and host and worker state, and to guarantee message-processing semantics. Even so, Storm can still consume from and write to files on HDFS.
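Storm's model of an unbounded source (a "spout") feeding stateful processing stages ("bolts") one tuple at a time can be sketched in pure Python. This is only an illustration of the topology shape; a real topology is written against Storm's own APIs and runs the bolts as parallel tasks:

```python
def sentence_spout():
    # Spout: an unbounded source of tuples; finite here for illustration.
    for sentence in ["storm processes streams", "storm runs topologies"]:
        yield sentence

def split_bolt(stream):
    # Bolt: transforms each incoming tuple as it arrives, one at a time.
    for sentence in stream:
        for word in sentence.split():
            yield word

def count_bolt(stream):
    # Bolt: keeps running state, emitting an updated count per word continuously.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
        yield word, counts[word]

# Wire spout -> bolt -> bolt: the shape of a Storm topology.
updates = list(count_bolt(split_bolt(sentence_spout())))
print(updates[-1])
```

Note that results are emitted per tuple as the stream flows, rather than once at the end of a batch, which is the essential contrast with MapReduce.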

Apache Spark: a fast, general-purpose engine for large-scale data processing; Spark is a data-parallel batch processing engine. Workflows are defined in a style reminiscent of MapReduce, but Spark is more capable than traditional Hadoop MapReduce. Spark also has a streaming API, which enables continuous processing through short-interval batches. Apache Spark itself does not require Hadoop to operate; however, its data-parallel paradigm requires a shared file system for stable data, which can be S3, NFS, or, more typically, HDFS. Executing a Spark application does not require Hadoop YARN, as Spark has its own standalone master/worker processes; however, it is common to run Spark applications in YARN containers. In addition, Spark can also run on Mesos clusters.
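The "continuous processing through short-interval batches" idea can be sketched in pure Python: chop an incoming stream into small batches and apply ordinary batch transformations to each. This is only a model of the micro-batch concept, not Spark's API; Spark's streaming divides the stream by time interval, while this sketch divides by count for simplicity:

```python
from itertools import islice

def stream_source():
    # Stand-in for an unbounded stream; finite here for illustration.
    yield from range(10)

def micro_batches(source, batch_size):
    # Chop the stream into short batches, each processed as an ordinary batch job.
    iterator = iter(source)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Each micro-batch gets a normal batch transformation (here, a sum).
sums = [sum(batch) for batch in micro_batches(stream_source(), 4)]
print(sums)
```

The trade-off of this design is a small latency floor (one batch interval) in exchange for reusing the batch engine's fault tolerance and scheduling for streaming workloads.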
