Big Data
We all know about Hadoop, but a whole range of other technologies has come into view: Spark, Storm, Impala, and more than we can keep up with. To help architect big data projects better, this article organizes these technologies so that developers, project managers, and architects can choose the right technology, understand how the various big data technologies relate to one another, and pick the right language.
We can read this article with the following questions in mind:
1. What technologies does Hadoop include?
2. What is the relationship between Cloudera and Hadoop, what products does Cloudera offer, and what are their characteristics?
3. How is Spark related to Hadoop?
4. How is Storm related to Hadoop?

Hadoop Family
Founder: Doug Cutting
The entire Hadoop family consists of several sub-projects:
Hadoop Common:
The lowest-level module of the Hadoop system, which provides common utilities for the other Hadoop sub-projects, such as configuration file handling and logging.
HDFS:
HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode (master node), which manages the metadata of the entire file system, and DataNodes (of which there can be many), which store the actual data. HDFS is designed for massive amounts of data: whereas traditional file systems are optimized for large numbers of small files, HDFS is optimized for storing and accessing a small number of very large files.
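As a minimal sketch of how an application talks to HDFS, the Java snippet below writes a small file and reads it back through the Hadoop FileSystem API. The NameNode address and file path are hypothetical; in a real deployment fs.defaultFS normally comes from core-site.xml.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; usually read from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/hello.txt");

        // Write a small file to HDFS (overwrite if it already exists).
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back and copy it to stdout.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }

        fs.close();
    }
}
```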
MapReduce:
A software framework for easily writing applications that process massive amounts of data (terabyte-scale datasets) in parallel, reliably and fault-tolerantly, on large clusters of thousands of nodes built from commodity hardware.
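To make the map and reduce primitives concrete, here is a sketch of the canonical word-count job written against the Hadoop MapReduce Java API; the input and output paths are taken from the command line and the job name is illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```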
Hive:
Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a complete SQL-style query language, HiveQL; where expressing some logic in HiveQL would be inefficient or cumbersome, it also lets traditional MapReduce programmers plug in their own custom mappers and reducers. Hive is similar to Cloudbase: a piece of software that provides data warehouse SQL capabilities on top of the Hadoop distributed computing platform, summarizing the massive amounts of data stored in Hadoop and simplifying ad hoc queries.
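As a sketch of how HiveQL is issued from a program, the snippet below connects to a HiveServer2 instance over JDBC and runs a simple aggregation. The host name, credentials, and the page_views table are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; hive-jdbc must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 host and database.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-server:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like ordinary SQL; count page views per day
            // in a hypothetical "page_views" table.
            ResultSet rs = stmt.executeQuery(
                "SELECT view_date, COUNT(*) AS views "
              + "FROM page_views GROUP BY view_date");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```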
Pig:
Apache Pig is a platform for analyzing large datasets; it includes a high-level language for writing data analysis programs and the infrastructure to evaluate those programs. The salient property of Pig programs is that their structure lends itself to substantial parallelization, which lets them handle very large datasets. Pig's infrastructure layer contains a compiler that generates MapReduce jobs. Its language layer currently consists of a native language, Pig Latin, originally designed to be easy to program in while remaining scalable.
Pig is a SQL-like, high-level query language built on top of MapReduce: it compiles operations down to the map and reduce stages of the MapReduce model, and users can define their own functions. It was developed by Yahoo's grid computing department as a clone of Google's Sawzall.
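Pig Latin statements can also be embedded in a Java program through Pig's PigServer class. The sketch below runs in local mode over a hypothetical tab-separated log file; the file name, field layout, and output directory are all illustrative assumptions.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbedExample {
    public static void main(String[] args) throws Exception {
        // Run Pig in local mode; ExecType.MAPREDUCE would submit to a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery() call adds one Pig Latin statement.
        // "access_log.txt" and its layout are hypothetical.
        pig.registerQuery("logs = LOAD 'access_log.txt' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");

        // Materialize the result; under the covers Pig compiles this
        // pipeline into one or more MapReduce jobs.
        pig.store("counts", "url_counts_per_user");
        pig.shutdown();
    }
}
```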
HBase:
Apache HBase is the Hadoop database: a distributed, scalable big data store. It provides random, real-time read/write access to large datasets and is optimized for very large tables hosted on clusters of commodity servers, on the order of billions of rows by millions of columns. At its core it is an open-source implementation of the Google BigTable paper, i.e. a distributed column-oriented store. Just as BigTable sits on top of the distributed storage provided by GFS (the Google File System), HBase is the BigTable-like store that Apache Hadoop provides on top of HDFS.
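A minimal sketch of random read/write against an HBase table using the HBase 1.x+ Java client; it assumes a table named "users" with a column family "info" already exists, and that hbase-site.xml on the classpath points at the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Cluster location is normally read from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Random write: one row keyed by user id.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random, real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```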
ZooKeeper:
ZooKeeper is an open-source implementation of Google's Chubby. It is a reliable coordination system for large distributed systems, providing functions such as configuration maintenance, naming services, distributed synchronization, and group services. ZooKeeper's goal is to encapsulate these complex and error-prone services and expose them to users as a simple, easy-to-use interface backed by an efficient, robust system.
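To give a feel for the coordination primitives, here is a sketch that connects to a ZooKeeper ensemble, stores a piece of configuration in a znode, and reads it back. The connection string, znode path, and stored value are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Hypothetical three-node ensemble; 15 second session timeout.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a piece of configuration under a persistent znode.
        String path = "/demo-config";
        if (zk.exists(path, false) == null) {
            zk.create(path, "jdbc:mysql://db:3306/app".getBytes(StandardCharsets.UTF_8),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read it back (a real client would set a watch to be notified of changes).
        byte[] data = zk.getData(path, false, null);
        System.out.println(new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```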
Avro:
Avro is an RPC project led by Doug Cutting, somewhat like Google's Protocol Buffers and Facebook's Thrift. Avro is intended to serve as the basis of Hadoop's RPC in the future, making Hadoop's RPC communication faster and its data structures more compact.
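The compactness comes from Avro's schema-driven binary encoding. Below is a sketch that serializes and deserializes a generic record; the two-field "User" schema is a made-up example.

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // A hypothetical schema for a "User" record.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);

        // Serialize to Avro's compact binary encoding.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();
        byte[] bytes = out.toByteArray();
        System.out.println("serialized size: " + bytes.length + " bytes");

        // Deserialize again.
        Decoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("name") + " / " + decoded.get("age"));
    }
}
```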
Sqoop:
Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database into HDFS, or export data from HDFS into a relational database.
Mahout:
Apache Mahout is a scalable machine learning and data mining library. It currently supports four main use cases: recommendation mining (collecting user actions and using them to recommend things a user might like), clustering (collecting documents and grouping related ones), classification (learning from existing categorized documents which features characterize each category, and correctly categorizing unlabeled documents), and frequent itemset mining (taking groups of items and identifying which individual items frequently appear together).
Cassandra:
Apache Cassandra is a high-performance, linearly scalable, highly available database that can run on commodity hardware or cloud infrastructure, making it an ideal platform for mission-critical data. Cassandra's replication across data centers is best in class, giving users lower latency and more reliable disaster recovery. With strong support for log-structured updates, denormalization, materialized views, and powerful built-in caching, Cassandra's data model also offers convenient secondary (column) indexes.
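A sketch of writing and reading a row through a Cassandra Java client; it assumes the DataStax Java driver (3.x-style API), and the contact point, keyspace, and table are hypothetical.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Hypothetical single contact point; the driver discovers the rest of the ring.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                + "WITH replication = {'class':'SimpleStrategy','replication_factor':1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                + "(id int PRIMARY KEY, name text)");

            // Denormalized, write-optimized model: simply insert the row.
            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'Ada')");

            ResultSet rs = session.execute("SELECT name FROM demo.users WHERE id = 1");
            Row row = rs.one();
            System.out.println(row.getString("name"));
        }
    }
}
```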
Chukwa:
Apache Chukwa is an open-source data collection system, originally contributed by Yahoo, for monitoring large distributed systems. Built on the HDFS and MapReduce frameworks, it inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results so that the collected data can be put to the best possible use.
Ambari:
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides cluster health dashboards, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications and diagnose their performance characteristics through a user-friendly interface.
HCatalog:
Apache HCatalog is a table and storage management service for Hadoop data. It provides: a shared schema and data type mechanism; a table abstraction, so that users need not care about how or where their data is stored; and interoperability across data processing tools such as Pig, MapReduce, and Hive.
Cloudera Series Products:
Founding organization: Cloudera Company
1. Cloudera Manager:
Cloudera Manager has four functions: (1) management, (2) monitoring, (3) diagnostics, and (4) integration.
2. Cloudera CDH (Cloudera's Distribution including Apache Hadoop):
Cloudera has made its own modifications to Hadoop, and Cloudera's release of it is known as CDH (Cloudera's Distribution including Apache Hadoop).
3. Cloudera Flume
Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive amounts of log data; it was originally provided by Cloudera and is now an Apache Incubator project. Flume lets you customize various kinds of data senders within the logging system to collect data; it can also perform simple processing on the data in flight and write it to a variety of (customizable) data receivers. Out of the box it can collect data from sources such as the console, RPC (Thrift RPC), text files, tail (Unix tail), syslog (the syslog logging system, supporting both TCP and UDP modes), and exec (command execution).
Flume uses a multi-master architecture. To keep configuration data consistent, Flume introduces ZooKeeper to store the configuration; ZooKeeper itself guarantees the consistency and high availability of that data and can notify the Flume master nodes when the configuration changes. The Flume masters synchronize data among themselves using a gossip protocol.
4. Cloudera Impala
Cloudera Impala provides direct, interactive SQL queries over data stored in Apache Hadoop, whether in HDFS or HBase. In addition to using the same unified storage platform as Hive, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax). This gives Impala users a single, familiar platform for both batch-oriented and real-time queries.
5. Cloudera Hue
Hue is a web console dedicated to CDH, made up of three parts: the Hue UI, the Hue Server, and the Hue DB. Hue provides a browser-based shell interface to all the CDH components. From Hue you can write MapReduce jobs, browse and modify files in HDFS, manage Hive metadata, run Sqoop jobs, author Oozie workflows, and much more.

Spark
Founding organization: University of California, Berkeley AMP Lab (Algorithms, Machines, and People Lab)
Spark is an open-source cluster computing environment similar to Hadoop, but with some differences that make Spark superior for certain workloads: Spark uses in-memory distributed datasets and, in addition to supporting interactive queries, can optimize iterative workloads.
Spark is implemented in the Scala language, which it also uses as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs over distributed datasets, it is in fact complementary to Hadoop and can run alongside it on the Hadoop file system, a combination supported by a third-party cluster framework named Mesos. Developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab), Spark can be used to build large-scale, low-latency data analytics applications.

Storm
Founder: Twitter
Twitter has officially open-sourced Storm, a distributed, fault-tolerant real-time computation system hosted on GitHub under the Eclipse Public License 1.0. Storm was developed by BackType, a company that is now part of Twitter. The latest version on GitHub is Storm 0.5.2, written largely in Clojure.
2. Storm, Spark, and Hadoop: which of these three big data processing tools will become mainstream?
First, the comparison
Storm: distributed real-time computation; it emphasizes real-time processing and is often used where low latency is required.
Hadoop: distributed batch computation; it emphasizes batch processing and is often used for data mining and analysis.
Spark: an open-source cluster computing system based on in-memory computation, designed to make data analysis faster. Spark is an open-source cluster computing environment similar to Hadoop, but with some differences that make Spark perform better for certain workloads: Spark uses in-memory distributed datasets and, in addition to supporting interactive queries, can optimize iterative workloads.
Spark is implemented in the Scala language, which it also uses as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, which can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs over distributed datasets, it is in fact complementary to Hadoop and can run alongside it on the Hadoop file system, a combination supported by a third-party cluster framework named Mesos. Developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab), Spark can be used to build large-scale, low-latency data analytics applications.
While Spark has similarities to Hadoop, it provides a new cluster computing framework with useful differences. First, Spark is designed for a specific type of workload in cluster computing: workloads that reuse a working dataset across parallel operations, such as machine learning algorithms. To optimize for these workloads, Spark introduces the concept of in-memory cluster computing, in which datasets are cached in memory to reduce access latency, as sketched below.
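The caching idea can be illustrated with Spark's Java API: an RDD that will be reused by several jobs is marked with cache(), so later passes read it from cluster memory instead of recomputing it from HDFS. The file path and filter terms below are hypothetical.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A hypothetical log file in HDFS.
        JavaRDD<String> lines = sc.textFile("hdfs:///logs/app.log");

        // Keep the error lines in cluster memory; this is the working set
        // that several subsequent operations will reuse.
        JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR")).cache();

        // Both actions below reuse the cached dataset instead of rereading HDFS.
        long total = errors.count();
        long timeouts = errors.filter(l -> l.contains("timeout")).count();

        System.out.println("errors=" + total + ", timeouts=" + timeouts);
        sc.stop();
    }
}
```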
Second, the advantages
1. Simple programming
When it comes to big data processing, everyone is familiar with Hadoop. Built on Google's MapReduce, Hadoop gives developers the map and reduce primitives, which make writing parallel batch programs very simple and elegant. Similarly, Storm provides a set of simple, elegant primitives for real-time computation over big data, greatly reducing the complexity of developing parallel real-time processing tasks and helping you build applications quickly and efficiently.
Spark offers many more types of dataset operations than Hadoop, which provides only map and reduce. Spark has map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, partitionBy, and many other operation types, which it calls transformations, as well as actions such as count, collect, reduce, lookup, and save. These varied dataset operations are convenient for the applications built on top of them. The communication model between processing nodes is also no longer limited to Hadoop's single data-shuffle pattern: users can name and materialize intermediate results, control their partitioning, and so on. In short, the programming model is more flexible than Hadoop's.
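For instance, the same word count from the MapReduce section shrinks to a few chained transformations plus one action in Spark's Java API. The flatMap signature below assumes the Spark 2.x Java API, and the input/output paths are hypothetical.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark-word-count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///input/text");       // hypothetical path

        // Transformations: flatMap -> mapToPair -> reduceByKey (lazy; nothing runs yet).
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator())  // Spark 2.x: returns an Iterator
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        // Action: triggers the actual computation and writes the result.
        counts.saveAsTextFile("hdfs:///output/word-counts");             // hypothetical path

        sc.stop();
    }
}
```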
2. Multi-lingual support
Besides implementing spouts and bolts in Java, you can write them in any programming language you are familiar with, thanks to Storm's so-called multi-language protocol. The multi-language protocol is a special protocol inside Storm that lets a spout or bolt exchange messages over standard input and standard output, with messages encoded as JSON text.
Storm implements multi-language support mainly through the ShellBolt, ShellSpout, and ShellProcess classes, which implement the IBolt and ISpout interfaces and use Java's ProcessBuilder class to run the external scripts or programs that speak this protocol.
As you can see, with this approach every tuple has to be encoded to and decoded from JSON as it is processed, which has a noticeable impact on throughput.
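As a sketch, declaring a bolt written in another language boils down to subclassing ShellBolt and telling Storm how to launch the external process. The script name splitsentence.py is hypothetical, and the package names assume a pre-2.x, backtype.storm-style Storm release (newer releases use org.apache.storm instead).

```java
import java.util.Map;

import backtype.storm.task.ShellBolt;
import backtype.storm.topology.IRichBolt;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.tuple.Fields;

// A bolt implemented in Python; Storm talks to the script over
// stdin/stdout using JSON messages, as described above.
public class SplitSentenceBolt extends ShellBolt implements IRichBolt {

    public SplitSentenceBolt() {
        // Launch the external process; splitsentence.py is a hypothetical
        // script shipped in the topology jar's resources/ directory.
        super("python", "splitsentence.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // The Java side still declares the output schema.
        declarer.declare(new Fields("word"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```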
3. Support for horizontal scaling
Three kinds of entities actually run a topology in a Storm cluster: worker processes, threads, and tasks. Each machine in a Storm cluster can run multiple worker processes, each worker process can create multiple threads, and each thread can execute multiple tasks. Tasks are the entities that actually process the data; the spouts and bolts we write are executed as one or more tasks.
As a result, computation runs in parallel across multiple threads, processes, and servers, which supports flexible horizontal scaling, as the sketch below shows.
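A sketch of how this parallelism is expressed when wiring a topology: the parallelism hint sets the number of executors (threads) for a component and setNumTasks sets its number of tasks. The spout and bolt classes (RandomSentenceSpout, SplitSentenceBolt, WordCountBolt) are illustrative, and the package names again assume a pre-2.x Storm.

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class ScalingTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // 2 executors (threads) for the spout.
        builder.setSpout("sentences", new RandomSentenceSpout(), 2);

        // 4 executors running 8 tasks for the splitter bolt,
        // shuffle-grouped on the spout's output.
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .setNumTasks(8)
               .shuffleGrouping("sentences");

        // 3 executors for the counter, grouped by word so the same word
        // always goes to the same task.
        builder.setBolt("count", new WordCountBolt(), 3)
               .fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(4); // 4 worker processes across the cluster

        StormSubmitter.submitTopology("scaling-demo", conf, builder.createTopology());
    }
}
```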
4. Strong fault Tolerance
If exceptions occur while messages are being processed, Storm reschedules the problematic processing unit. Storm ensures that a processing unit runs forever (unless you explicitly kill it).
5. Reliable Message Guarantee
Storm can guarantee that every message emitted by a spout is "fully processed," which directly distinguishes it from other real-time systems such as S4.
6. Fast Message Processing
Storm uses ZeroMQ as its underlying message queue, which ensures that messages are processed quickly.
7. Local mode for fast development and testing
Storm has a "local model" that simulates all the functions of a storm cluster in a process, running topology in local mode is similar to running topology on a cluster, which is useful for our development and testing.
Third, integration
On the convergence of Spark and Hadoop: from the MapReduce repository as of Hadoop 0.23, you can see that Hadoop's goal is to support more parallel computing models beyond MapReduce, such as MPI and Spark. After all, Hadoop's per-node CPU utilization is currently not high, so this kind of iteration-intensive computation is complementary to the existing platform. At the same time, it places higher demands on the resource scheduling system. On the resource scheduling side, UC Berkeley appears to be working on Mesos, which also uses Linux containers to schedule Hadoop and other application models in a unified way.
From the above, it is hard to say which will become mainstream. On the contrary, the winning technology will be the one that integrates strongly with the others, each learning from and complementing the rest. Judging from the development of big data in recent years, the one we are most familiar with is still Hadoop.
Http://www.open-open.com/lib/view/open1416646884102.html