Big Data Technology: The Hadoop Family, the Cloudera Series, Spark, and Storm


When we talk about big data, we all think of Hadoop, but a whole range of other technologies has come into view: Spark, Storm, Impala, and more keep arriving. To architect big data projects well, this article organizes the landscape so that engineers, project managers, and architects can pick the right technology, understand how the various big data technologies relate to one another, and choose the right tools.

We can read this article with the following questions in mind:
1. What technologies does the Hadoop family include?
2. What is the relationship between Cloudera and Hadoop, what products does Cloudera offer, and what are their characteristics?
3. How is Spark related to Hadoop?
4. How is Storm related to Hadoop?

Hadoop Family:

Founder: Doug Cutting

The entire Hadoop family consists of several sub-projects:

Hadoop Common:

A module at the base of the Hadoop system that provides common utilities for the other Hadoop sub-projects, such as configuration file handling and logging operations.

HDFS:

HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode (the master node), which manages the metadata of the entire file system, and DataNodes (there can be many), which store the actual data. HDFS is designed for massive amounts of data: where traditional file systems are optimized for large numbers of small files, HDFS is optimized for storing and accessing small numbers of very large files.
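
To make the NameNode/DataNode split concrete, here is a minimal sketch of writing and reading a file through the HDFS Java API; the NameNode URI and file path are placeholders:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsDemo {
        public static void main(String[] args) throws Exception {
            // Connect via the NameNode, which maps paths to the DataNodes holding the blocks.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

            Path file = new Path("/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file)) {  // data streams to DataNodes
                out.writeUTF("hello, hdfs");
            }
            try (FSDataInputStream in = fs.open(file)) {      // metadata lookup via the NameNode
                System.out.println(in.readUTF());
            }
        }
    }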

MapReduce:

MapReduce is a software framework that makes it easy to write applications that process massive amounts of data (terabytes and up) in parallel, running reliably and fault-tolerantly on large clusters of thousands of commodity machines.
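
The canonical example is word count. The following is a minimal sketch against the org.apache.hadoop.mapreduce API; the input and output paths are placeholders:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // The map phase emits (word, 1) for every word in the input split.
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // The reduce phase sums the counts collected for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/in"));
            FileOutputFormat.setOutputPath(job, new Path("/out"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }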

Hive:

Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems. Hive provides full SQL-style query functionality through its HiveQL language; where expressing some piece of logic in HiveQL would be inefficient or cumbersome, it also allows traditional map/reduce programmers to plug in their own custom mappers and reducers. Hive is similar to Cloudbase: both provide data warehouse SQL capabilities on top of the Hadoop distributed computing platform, and both simplify the summarization and ad hoc querying of the massive amounts of data stored in Hadoop.
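
For a flavor of HiveQL, here is a sketch that runs a query through the HiveServer2 JDBC driver; the host, port, credentials, and the logs table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 // Hive compiles this SQL into MapReduce jobs behind the scenes.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page")) {
                while (rs.next()) {
                    System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }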

Pig:

Apache Pig is a platform for analyzing large data sets. It includes a high-level language for expressing data analysis programs, together with an infrastructure for evaluating them. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which is what lets them handle very large data sets. Pig's infrastructure layer contains a compiler that produces MapReduce jobs; its language layer currently consists of a textual language, Pig Latin, designed for ease of programming and scalability.

Pig is an SQL-like, high-level query language built on top of MapReduce: it compiles its operations into the map and reduce phases of the MapReduce model, and users can define their own functions. It was developed by Yahoo's grid computing department as a clone of Google's Sawzall project.
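
As an illustration, here is a sketch that embeds Pig Latin in Java through the PigServer API; the paths and relation names are placeholders:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigDemo {
        public static void main(String[] args) throws Exception {
            // Run Pig Latin statements programmatically against the MapReduce engine.
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            pig.registerQuery("logs = LOAD '/data/logs' AS (user:chararray, page:chararray);");
            pig.registerQuery("by_user = GROUP logs BY user;");
            pig.registerQuery("counts = FOREACH by_user GENERATE group AS user, COUNT(logs) AS hits;");
            // store() triggers compilation into MapReduce jobs and writes the result.
            pig.store("counts", "/data/user_hits");
        }
    }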

HBase:

Apache HBase is the Hadoop database: a distributed, scalable big data store. It provides random, real-time read/write access to large data sets, and it is optimized for hosting very large tables on clusters of commodity servers: billions of rows by millions of columns. At its core it is an open source implementation of Google's BigTable paper: a distributed, column-oriented store. Just as BigTable builds on the distributed data storage provided by GFS (Google File System), HBase provides BigTable-like capabilities on top of Apache Hadoop's HDFS.
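
Here is a minimal sketch of HBase's random read/write path using the Java client API; it assumes a pre-created table named webtable with a column family named contents:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("webtable"))) {
                // Random real-time write: one cell in the "contents" column family.
                Put put = new Put(Bytes.toBytes("com.example/index"));
                put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                              Bytes.toBytes("<html>...</html>"));
                table.put(put);
                // Random real-time read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("com.example/index")));
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
            }
        }
    }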

ZooKeeper:

ZooKeeper is an open source implementation of Google's Chubby. It is a reliable coordination service for large distributed systems, providing configuration maintenance, naming, distributed synchronization, group services, and more. The goal of ZooKeeper is to encapsulate these complex and error-prone services and expose them to users through a simple, easy-to-use interface in an efficient, robust system.
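
As a sketch of the configuration-maintenance use case, the following stores a small piece of shared configuration in a znode using the ZooKeeper Java client; the ensemble addresses and znode path are placeholders:

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigDemo {
        public static void main(String[] args) throws Exception {
            CountDownLatch connected = new CountDownLatch(1);
            // Connect to the ensemble; the watcher fires on session and znode events.
            ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000,
                    event -> connected.countDown());
            connected.await();

            // A persistent znode used as a tiny piece of shared configuration.
            zk.create("/app-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            byte[] data = zk.getData("/app-config", true, null); // true = watch for changes
            System.out.println(new String(data));
            zk.close();
        }
    }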

Avro:

Avro is an RPC and data serialization project led by Doug Cutting, somewhat like Google's Protocol Buffers and Facebook's Thrift. Avro is intended to serve as the basis of Hadoop's RPC layer, making Hadoop's RPC communication faster and its data structures more compact.
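
A minimal sketch of Avro's compact binary serialization with the generic API; the User schema here is invented for illustration:

    import java.io.ByteArrayOutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DatumWriter;
    import org.apache.avro.io.EncoderFactory;

    public class AvroDemo {
        public static void main(String[] args) throws Exception {
            // Schemas are defined in JSON; records serialize to a compact binary form.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
                "{\"name\":\"name\",\"type\":\"string\"}," +
                "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "alice");
            user.put("age", 30);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            writer.write(user, encoder);
            encoder.flush();
            System.out.println("serialized to " + out.size() + " bytes"); // no field names on the wire
        }
    }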

Sqoop:

Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database into HDFS, and export data from HDFS back into a relational database.
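
Sqoop is usually driven from the command line; as an assumed sketch, the same import can be launched programmatically via Sqoop.runTool (the JDBC URL, credentials, table name, and target directory are all placeholders):

    import org.apache.sqoop.Sqoop;

    public class SqoopImportDemo {
        public static void main(String[] args) {
            // Equivalent to running `sqoop import ...` on the command line.
            int exitCode = Sqoop.runTool(new String[] {
                "import",
                "--connect", "jdbc:mysql://db.example.com/sales",
                "--username", "etl",
                "--password", "secret",
                "--table", "orders",
                "--target-dir", "/data/orders"
            });
            System.exit(exitCode);
        }
    }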

Mahout:

Apache Mahout is a scalable machine learning and data mining library. It currently supports four main use cases (a sketch of the first follows the list):

    • Recommendation mining: collects user actions and uses them to recommend things the user might like.
    • Clustering: takes documents and groups related ones together.
    • Classification: learns from existing categorized documents what documents of a given category look like, and assigns unlabeled documents to the correct category.
    • Frequent itemset mining: takes groups of items and identifies which individual items usually appear together.
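
Here is a sketch of recommendation mining using Mahout's Taste API; ratings.csv (lines of userID,itemID,preference) and user 42 are placeholders:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderDemo {
        public static void main(String[] args) throws Exception {
            // Each line of ratings.csv: userID,itemID,preference
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
            // Top 3 recommendations for user 42, based on similar users' preferences.
            List<RecommendedItem> recs = recommender.recommend(42L, 3);
            for (RecommendedItem item : recs) {
                System.out.println(item.getItemID() + " -> " + item.getValue());
            }
        }
    }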

Cassandra:

Apache Cassandra is a high-performance, linearly scalable, highly available database that can run on commodity hardware or cloud infrastructure, making it a strong platform for mission-critical data. Cassandra's replication across multiple data centers is best-in-class, giving users lower latency and reliable disaster recovery. Its data model offers strong support for log-structured updates, denormalization and materialized views, powerful built-in caching, and convenient secondary indexes (column indexes).
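
For illustration, here is a sketch using the DataStax Java driver (the 3.x API is assumed); the keyspace, table, and the secondary index on name are invented for the example:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class CassandraDemo {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = " +
                                "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                session.execute("CREATE TABLE IF NOT EXISTS demo.users (id uuid PRIMARY KEY, name text)");
                // A secondary index ("column index") allows queries on a non-key column.
                session.execute("CREATE INDEX IF NOT EXISTS ON demo.users (name)");
                session.execute("INSERT INTO demo.users (id, name) VALUES (uuid(), 'alice')");
                ResultSet rs = session.execute("SELECT id, name FROM demo.users WHERE name = 'alice'");
                for (Row row : rs) {
                    System.out.println(row.getUUID("id") + " " + row.getString("name"));
                }
            }
        }
    }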

Chukwa:

Apache Chukwa is an open source data collection system for monitoring large distributed systems, originally contributed by Yahoo. Built on the HDFS and MapReduce frameworks, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing the collected results, to ensure the data is put to the best possible use.

Ambari:

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides cluster health dashboards, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications and diagnose their performance characteristics through a user-friendly interface.

HCatalog:

Apache HCatalog is a table and storage management service for Hadoop data, which provides:

    • A mechanism for sharing schemas and data types.
    • A table abstraction, so that users need not be concerned with how or where their data is stored.
    • Interoperability across data processing tools such as Pig, MapReduce, and Hive.

Cloudera Series Products:

Founding organization: Cloudera, Inc.

1. Cloudera Manager:

Cloudera Manager provides four functions:

    • (1) Management
    • (2) Monitoring
    • (3) Diagnostics
    • (4) Integration

2. Cloudera CDH (Cloudera's Distribution, including Apache Hadoop)

Cloudera makes its own corresponding changes to Apache Hadoop and packages them as its own release, which we call CDH.

3. Cloudera Flume

Flume is a log collection system first provided by Cloudera and currently an Apache Incubator project. It is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data. Flume supports customizing data senders of all kinds within the logging system for data collection; at the same time, it provides the ability to do simple processing on the data in flight and to write it out to a variety of (customizable) data receivers. Out of the box, Flume can collect data from sources such as the console, RPC (Thrift-RPC), text files, tail (UNIX tail), syslog (the syslog logging system, in both TCP and UDP modes), and exec (command execution).

Flume uses a multi-master architecture. To keep configuration data consistent, Flume introduces ZooKeeper to store the configuration; ZooKeeper itself guarantees the consistency and high availability of that data and can notify the Flume master nodes when the configuration changes. The Flume masters synchronize data among themselves using a gossip protocol.

4. Cloudera Impala

Cloudera Impala provides direct, interactive SQL queries over data stored in Apache Hadoop, in HDFS or HBase. Besides using the same unified storage platform as Hive, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax). This gives Impala users a single, familiar platform for both batch and real-time queries.

5. Cloudera Hue

Hue is a set of web managers dedicated to CDH, consisting of three parts: the Hue UI, the Hue server, and the Hue database. Hue provides a web interface, including a shell interface, for all CDH components. In Hue you can write MapReduce jobs, view and modify HDFS files, manage Hive metadata, run Sqoop, compose Oozie workflows, and much more.

Spark:

Founding organization: the AMP Lab (Algorithms, Machines, and People Lab) at the University of California, Berkeley

Spark is an open source cluster computing environment similar to Hadoop, but with differences that give Spark the advantage on certain workloads. In short, Spark keeps distributed datasets in memory, so in addition to providing interactive queries it can also optimize iterative workloads.

Spark is implemented in the Scala language, which also serves as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which can manipulate distributed datasets as easily as local collection objects.

Although Spark was created to support iterative jobs on distributed datasets, it is really a complement to Hadoop: it can run alongside it on the Hadoop file system, a combination supported by a third-party cluster framework called Mesos. Developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab), Spark can be used to build large-scale, low-latency data analytics applications.
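
To show why the in-memory dataset matters, here is a sketch using the Spark 2.x Java API (the HDFS path is a placeholder): the dataset is cached once and then reused by two separate computations.

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkIterativeDemo {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("demo").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // cache() pins the dataset in cluster memory, which is what makes
                // repeated passes (interactive queries, iterative algorithms) cheap.
                JavaRDD<String> lines = sc.textFile("hdfs://namenode:9000/data/logs").cache();

                // First pass over the cached data.
                long errors = lines.filter(l -> l.contains("ERROR")).count();

                // Second pass reuses the in-memory copy instead of rereading HDFS.
                JavaPairRDD<String, Integer> counts = lines
                    .flatMap(l -> Arrays.asList(l.split("\\s+")).iterator())
                    .mapToPair(w -> new Tuple2<>(w, 1))
                    .reduceByKey(Integer::sum);

                System.out.println(errors + " error lines, " + counts.count() + " distinct words");
            }
        }
    }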

Storm

Founder: Twitter

Twitter has officially open-sourced Storm, a distributed, fault-tolerant real-time computation system hosted on GitHub under the Eclipse Public License 1.0. Storm was developed by BackType, a company since acquired by Twitter. The latest version on GitHub is Storm 0.5.2, which is written largely in Clojure.
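
A Storm application is a topology of spouts (stream sources) and bolts (processing steps). Below is a hedged sketch using the modern org.apache.storm package names (in the era this article describes, the package was backtype.storm); the spout and bolt are invented for illustration:

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    public class WordCountTopology {
        // Emits an endless stream of sentences (placeholder data source).
        public static class SentenceSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
                this.collector = collector;
            }
            public void nextTuple() {
                collector.emit(new Values("the quick brown fox"));
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("sentence"));
            }
        }

        // Splits each sentence into words, one tuple per word.
        public static class SplitBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                for (String w : input.getStringByField("sentence").split(" ")) {
                    collector.emit(new Values(w));
                }
            }
            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("word"));
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 1);
            // shuffleGrouping distributes tuples randomly; a fieldsGrouping on "word"
            // would instead route equal words to the same downstream task.
            builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");
            new LocalCluster().submitTopology("demo", new Config(), builder.createTopology());
        }
    }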
