Big data architect basics: the Hadoop family, the Cloudera product series, and other technologies

Last Update:2014-07-19 Source: Internet

Author: User

Tags cassandra hadoop mapreduce sqoop


Most of us equate big data with Hadoop, but other technologies such as Spark, Storm, and Impala keep entering our field of view faster than we can take them in. To help build better big data projects, this article sorts out the relevant technologies so that technicians, project managers, and architects can understand how the various big data technologies relate to one another and choose the appropriate ones.
Keep the following questions in mind while reading this article:
1. What technologies does Hadoop contain?
2. What is the relationship between Cloudera and Hadoop? What products does Cloudera offer, and what are their features?
3. How is Spark related to Hadoop?
4. How is Storm related to Hadoop?
Hadoop family
Founder: Doug Cutting
The Hadoop family consists of the following sub-projects:
Hadoop Common:
The underlying module of the Hadoop system. It provides common utilities for the other Hadoop sub-projects, such as configuration file handling and log operations. For details, see
Hadoop Technical Insider: In-Depth Analysis of the Architecture Design and Implementation Principles of Hadoop Common and HDFS, chapters 1-9
HDFS:
HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster contains a NameNode (the master node), which manages all file system metadata, and DataNodes, which store the actual data (there can be many DataNodes). HDFS is designed for massive data volumes: whereas traditional file systems are optimized for large numbers of small files, HDFS is optimized for storing and accessing a smaller number of very large files. For details, see:
What is HDFS, and HDFS architecture design
HDFS + MapReduce + Hive quick start
Why HDFS in Hadoop 2.2.0 is highly available
Example: creating an HDFS file in Java
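The division of labor between the NameNode (metadata) and the DataNodes (actual data) can be illustrated with a small in-memory model. This is only a sketch and does not use the real HDFS API: the class names, the tiny block size, the replication factor, and the round-robin placement are all assumptions made for illustration.

```python
# Toy model of the HDFS metadata/data split (not the real HDFS API).
BLOCK_SIZE = 4     # bytes per block; tiny on purpose, real blocks are far larger
REPLICATION = 2    # copies kept of each block (assumed value)


class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanode_ids):
        self.datanode_ids = list(datanode_ids)
        self.file_blocks = {}       # path -> list of block ids
        self.block_locations = {}   # block id -> list of DataNode ids
        self._next_block = 0

    def allocate(self, path, n_blocks):
        blocks = []
        n = len(self.datanode_ids)
        for _ in range(n_blocks):
            block_id = self._next_block
            self._next_block += 1
            # Round-robin placement; real placement policies are more involved.
            self.block_locations[block_id] = [
                self.datanode_ids[(block_id + r) % n] for r in range(REPLICATION)
            ]
            blocks.append(block_id)
        self.file_blocks[path] = blocks
        return blocks


class DataNode:
    """Holds the actual bytes, keyed by block id."""

    def __init__(self):
        self.blocks = {}


def write_file(namenode, datanodes, path, data):
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, chunk in zip(namenode.allocate(path, len(chunks)), chunks):
        for dn_id in namenode.block_locations[block_id]:
            datanodes[dn_id].blocks[block_id] = chunk


def read_file(namenode, datanodes, path):
    parts = []
    for block_id in namenode.file_blocks[path]:
        first_replica = namenode.block_locations[block_id][0]
        parts.append(datanodes[first_replica].blocks[block_id])
    return b"".join(parts)


datanodes = {"dn1": DataNode(), "dn2": DataNode(), "dn3": DataNode()}
namenode = NameNode(datanodes)
write_file(namenode, datanodes, "/logs/a.txt", b"hello hdfs")
```

Note that the NameNode object never touches file contents. A reader asks it where the blocks are and then fetches the bytes from the DataNodes.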
MapReduce:
MapReduce is a software framework for easily writing parallel applications that process massive (terabyte-scale) datasets in a reliable, fault-tolerant manner on large clusters of thousands of nodes built from commodity hardware.
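The programming model itself is small enough to sketch in plain Python. The example below runs the map, shuffle, and reduce steps of a word count in a single process. It is not Hadoop code, the function names are made up for illustration, and a real framework would distribute each phase across the cluster and handle failures.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data on hadoop", "hadoop stores big files"])
```

Because each call to `map_phase` looks at one line and each call to `reduce_phase` looks at one key, both phases can be spread over many machines without coordination, which is the point of the model.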
For details, see:
Hadoop introduction (1): What is Map/Reduce
Hadoop MapReduce basics
How MapReduce works
Step by step: deploying and running a MapReduce application example on Hadoop 2.2.0
Hive:
Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files to database tables), ad-hoc queries, and analysis of large datasets stored in Hadoop-compatible systems. Hive provides a full SQL-style query capability through the HiveQL language. When expressing some piece of logic in this language becomes inefficient or cumbersome, HiveQL also lets traditional map/reduce programmers plug in their own custom mappers and reducers. Hive is similar to CloudBase: both are software layers that provide data warehouse SQL functionality on top of the Hadoop distributed computing platform, which simplifies summarizing and ad-hoc querying of the massive data stored in Hadoop.
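The idea of mapping a structured data file to a table and then running an ad-hoc SQL query over it can be sketched with Python's built-in `sqlite3` module. SQLite is used here purely as a stand-in: this is not Hive or HiveQL, and the file contents, table name, and column names are invented for the example.

```python
import csv
import io
import sqlite3

# Stand-in for a structured data file sitting in storage (invented sample data).
RAW_FILE = """visitor,page,ms
alice,/home,120
bob,/home,95
alice,/cart,300
"""

conn = sqlite3.connect(":memory:")
# Declare a table whose columns mirror the fields of the file.
conn.execute("CREATE TABLE page_views (visitor TEXT, page TEXT, ms INTEGER)")
rows = list(csv.DictReader(io.StringIO(RAW_FILE)))
conn.executemany("INSERT INTO page_views VALUES (:visitor, :page, :ms)", rows)

# Ad-hoc summarization query: views and average latency per page.
result = conn.execute(
    "SELECT page, COUNT(*), AVG(ms) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
```

In Hive the query would be compiled into jobs that run across the cluster instead of being executed by a local engine, but the user-facing idea is the same: declare a table over files, then query it with SQL.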
For details, see:
Hive: origin and detailed introduction
Hive: detailed explanation video
Pig:
Apache Pig is a platform for analyzing large datasets. It consists of a high-level language for expressing data analysis programs and the infrastructure for evaluating those programs. The salient property of Pig programs is that their structure lends itself to substantial parallelization, which is what allows them to handle very large datasets. Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce jobs. Its language layer currently consists of a textual language, Pig Latin, which was designed for ease of programming and extensibility.
Pig is an SQL-like, high-level query language built on MapReduce. It compiles operations into the map and reduce stages of the MapReduce model, and users can define their own functions. It was developed by Yahoo's grid computing team as another clone of a Google project, Sawzall.
For details, see:
Simple Pig operations and syntax, including data types, functions, keywords, and operators
What is the difference between Pig and Hive in the Hadoop family?
HBase:
Apache HBase is the Hadoop database: a distributed, scalable big data store. It provides random, real-time read/write access to large datasets and is optimized for hosting very large tables, with billions of rows and millions of columns, on clusters of commodity servers. At its core it is an open-source implementation of Google Bigtable, a distributed column-oriented store. Just as Bigtable builds on the distributed data storage provided by GFS (Google File System), HBase provides Bigtable-like capabilities on top of Apache Hadoop and HDFS.
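A rough way to picture this kind of store is as a sparse map that is sorted by row key, where each row holds only the columns that were actually written. The sketch below models that in memory. It is not the HBase client API, and the class name, row keys, and `family:qualifier` column names are assumptions chosen for illustration.

```python
from bisect import bisect_left, insort


class ToyTable:
    """Sparse, sorted map: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._row_keys = []   # kept sorted so range scans stay cheap
        self._rows = {}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            insort(self._row_keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random read of one row; missing rows come back empty."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Yield (row_key, columns) for start <= row_key < stop, in key order."""
        i = bisect_left(self._row_keys, start)
        while i < len(self._row_keys) and self._row_keys[i] < stop:
            key = self._row_keys[i]
            yield key, self._rows[key]
            i += 1


table = ToyTable()
table.put("user#0002", "info:name", "bob")
table.put("user#0001", "info:name", "alice")
table.put("user#0001", "stats:logins", 7)
table.put("video#0001", "info:title", "intro")
user_rows = [key for key, _ in table.scan("user#", "user$")]
```

Rows that never received a given column simply do not store it, which is why very wide tables remain practical in this model.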
For details, see:
HBase: differences from traditional data
HBase distributed installation: video download and sharing
ZooKeeper:
ZooKeeper is an open-source implementation of Google's Chubby. It is a reliable coordination system for large-scale distributed systems, providing configuration maintenance, naming, distributed synchronization, and group services. ZooKeeper's goal is to encapsulate these complex and error-prone key services and offer users an easy-to-use interface on top of a high-performance, functionally stable system.
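Configuration maintenance, one of the services listed above, usually follows a "read the value and ask to be told when it changes" pattern. The sketch below imitates that pattern in a single process. It is not the ZooKeeper client API: the class, its methods, and the path are hypothetical, and the one-shot behavior of the watch is modeled on how classic ZooKeeper watches fire only once.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._nodes = {}     # path -> value
        self._watches = {}   # path -> callbacks to fire on the next change

    def set(self, path, value):
        self._nodes[path] = value
        # Watches are one-shot here: fire each callback once, then forget it.
        for callback in self._watches.pop(path, []):
            callback(path, value)

    def get(self, path, watch=None):
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._nodes.get(path)


seen = []
zk = ToyCoordinator()
zk.set("/config/batch_size", 100)
current = zk.get("/config/batch_size", watch=lambda p, v: seen.append((p, v)))
zk.set("/config/batch_size", 200)   # fires the watch
zk.set("/config/batch_size", 300)   # watch already consumed, nothing fires
```

A client that wants to keep following changes would re-register its watch inside the callback. A real service also has to replicate this state across servers, which is the hard part that ZooKeeper encapsulates.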
For details, see:
What is ZooKeeper, and what specific role does ZooKeeper play in Hadoop and HBase
Avro:
Avro is an RPC project led by Doug Cutting, somewhat similar to Google's protobuf and Facebook's Thrift. Avro is intended for use in later versions of Hadoop RPC, making the Hadoop RPC module communicate faster and use more compact data structures.
Sqoop:
Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database into Hadoop's HDFS, or export data from HDFS into a relational database.
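The import direction can be pictured as "read a table, write delimited text files". The sketch below does that with Python's built-in `sqlite3` as the relational side and an in-memory dictionary standing in for files in HDFS. It is not Sqoop and says nothing about Sqoop's actual command line or file layout: the function name, the `part-NNNNN` naming, and the sample table are all assumptions for illustration.

```python
import sqlite3


def import_table(conn, table, columns, rows_per_file=2):
    """Dump a table into comma-delimited 'part' files, returned as {name: text}.

    Table and column names are interpolated directly into the SQL, so this
    sketch must only be used with trusted, hard-coded identifiers.
    """
    query = "SELECT {cols} FROM {table} ORDER BY {first}".format(
        cols=", ".join(columns), table=table, first=columns[0]
    )
    parts, buffer = {}, []

    def flush():
        parts["part-{:05d}".format(len(parts))] = "\n".join(buffer) + "\n"
        buffer.clear()

    for row in conn.execute(query):
        buffer.append(",".join(str(value) for value in row))
        if len(buffer) == rows_per_file:
            flush()
    if buffer:
        flush()
    return parts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(1, "disk"), (2, "cable"), (3, "rack")]
)
files = import_table(conn, "orders", ["id", "item"])
```

An export would run the same idea in reverse: parse the delimited lines and insert them into a table.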
For details, see:
Sqoop in detail, including Sqoop commands, principles, and process

Mahout:
Apache Mahout is a scalable machine learning and data mining library. Currently, Mahout supports four main use cases:
Recommendation mining: takes users' behavior and tries to find items they might like.
Clustering: takes documents and groups related ones together.
Classification: learns from existing categorized documents what documents of a category look like, and assigns unlabeled documents to the correct category.
Frequent item set mining: takes sets of items and identifies which individual items usually appear together.
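As a concrete picture of the last use case, the sketch below counts how often each pair of items appears together and keeps the pairs that meet a minimum support. This is a brute-force illustration in plain Python, not Mahout code and not the algorithm Mahout uses. The function name, the sample baskets, and the support threshold are invented for the example.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {(item_a, item_b): count} for pairs seen in >= min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs", "bread"],
]
pairs = frequent_pairs(baskets, min_support=3)
```

Enumerating every pair grows quadratically with basket size, which is one reason scalable libraries rely on smarter algorithms and distributed execution for this task.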

Cassandra:
Apache Cassandra is a high-performance, linearly scalable, highly available database that runs on commodity hardware or cloud infrastructure, making it a solid platform for mission-critical data. Its support for replication across data centers is regarded as best-in-class, giving users lower latency and more dependable disaster recovery. Cassandra's data model offers convenient secondary indexes (column indexes), along with strong support for log-structured updates, denormalization, materialized views, and powerful built-in caching.
Chukwa:
Apache Chukwa is an open-source data collection system for monitoring large distributed systems. It is built on top of HDFS and the Map/Reduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, so that the collected data can be put to the best possible use.
Ambari:
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. It supports Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications and diagnose their performance characteristics in a user-friendly interface.
HCatalog:
Apache HCatalog is a table and storage management service for Hadoop data. It:
Provides a shared schema and data type mechanism.
Provides a table abstraction so that users do not need to be concerned with where or how their data is stored.
Provides interoperability across data processing tools such as Pig, MapReduce, and Hive.

------------------------------------------------------------
Chukwa:
Chukwa is a Hadoop-based monitoring system for large clusters, contributed by Yahoo.
------------------------------------------------------------
Cloudera product series:
Founding organization: Cloudera
1. Cloudera Manager:
Cloudera Manager has four functions:
(1) Management
(2) Monitoring
(3) Diagnosis
(4) Integration
For details, see
The four functions of Cloudera Manager
2. Cloudera CDH: full English name: CDH (Cloudera's Distribution, including Apache Hadoop)
CDH is Cloudera's release of Hadoop, to which Cloudera has made its own modifications.
For details, see
Cloudera Hadoop: what is CDH, and an introduction to CDH versions
Related materials
CDH3 in practice: Hadoop (HDFS), HBase, ZooKeeper, Flume, Hive
CDH4 installation in practice: HDFS, HBase, ZooKeeper, Hive, Oozie, Sqoop
Four ways to install Hadoop CDH, with example guidance
Download and sharing of a series of documents for Hadoop CDH4 and CDH5
3. Cloudera Flume
Flume is a distributed, highly available, and highly reliable system provided by Cloudera for collecting, aggregating, and transporting massive volumes of log data. It was the first log collection system provided by Cloudera and is currently an incubator project under Apache. Flume lets you customize the data senders in a logging system in order to collect data, and it can apply simple processing to the data and write it to various (customizable) data receivers. Flume can collect data from sources such as console, RPC (Thrift-RPC), text (file), tail (UNIX tail), syslog (the syslog logging system, in TCP and UDP modes), and exec (command execution).
Flume uses a multi-master design. To keep configuration data consistent, Flume stores it in ZooKeeper, which ensures the consistency and high availability of that data and can also notify the Flume master nodes when it changes. The Flume masters synchronize data among themselves using a gossip protocol.
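The flow described above (a data sender produces events, simple processing is applied, and the result is written to a receiver) can be sketched as a small pipeline. This is not Flume code or Flume configuration. The function and class names, the event layout, and the in-memory buffer are assumptions made for illustration.

```python
from collections import deque


def tail_source(lines):
    """Stand-in for a 'tail'-style source: yields one event per log line."""
    for line in lines:
        yield {"body": line.rstrip("\n")}


def add_severity(event):
    """Simple in-flight processing: tag each event with a severity."""
    event["severity"] = "ERROR" if "ERROR" in event["body"] else "INFO"
    return event


class ListSink:
    """Stand-in for a data receiver; collects events in memory."""

    def __init__(self):
        self.received = []

    def write(self, event):
        self.received.append(event)


def run_pipeline(source, processors, sink):
    buffer = deque()                  # decouples collecting from delivering
    for event in source:
        buffer.append(event)
    while buffer:
        event = buffer.popleft()
        for process in processors:
            event = process(event)
        sink.write(event)


sink = ListSink()
run_pipeline(tail_source(["start ok\n", "ERROR disk full\n"]), [add_severity], sink)
```

Swapping `tail_source` or `ListSink` for another implementation changes where data comes from or goes to without touching the rest of the pipeline, which mirrors the customizable senders and receivers described above.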
For details, see:
What is Flume log collection, and Flume's features
What is Flume log collection, how does Flume work, and what problems does Flume run into?
4. Cloudera Impala

Cloudera Impala provides interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform as Hive, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax). This provides a familiar and unified platform for both batch-oriented and real-time queries.

For details, see:

What is Impala, and how to install and use Impala
5. Cloudera Hue
Hue is a web-based management tool dedicated to CDH. It consists of three parts: the Hue UI, the Hue server, and the Hue DB. Hue provides shell interfaces to all CDH components. In Hue you can write MapReduce jobs, view and modify HDFS files, manage Hive metadata, run Sqoop, and write Oozie workflows.
For details, see:
Cloudera Hue installation and Oozie installation
What is Oozie? An introduction to Oozie
Cloudera Hue: experience sharing, problems, and solutions
------------------------------------------------------------
Spark
Founding organization: the AMPLab (Algorithms, Machines, and People Lab) at the University of California, Berkeley
Spark is an open-source cluster computing environment similar to Hadoop, but with some useful differences that make Spark perform better on certain workloads. In particular, Spark supports in-memory distributed datasets, which lets it optimize iterative workloads in addition to serving interactive queries.
Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala code can manipulate distributed datasets as easily as local collection objects.
Although Spark was created to support iterative jobs on distributed datasets, it is in fact a complement to Hadoop and can run in parallel on the Hadoop file system, which is supported through a third-party cluster framework named Mesos. Spark was developed by UC Berkeley's AMPLab to build large-scale, low-latency data analysis applications.
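Why keeping a dataset in memory helps iterative workloads can be shown with a toy model. The sketch below is plain single-process Python, not Spark and not its API: the `ToyDataset` class, its `cache` method, and the averaging loop are all invented for illustration. It simply counts how many times the source data has to be reloaded across iterations.

```python
class ToyDataset:
    """Lazy dataset that can optionally keep its computed records in memory."""

    def __init__(self, load):
        self._load = load         # function that (re)computes the records
        self._cached = None
        self._use_cache = False
        self.loads = 0            # how many times the source was recomputed

    def cache(self):
        self._use_cache = True
        return self

    def records(self):
        if self._cached is not None:
            return self._cached
        self.loads += 1
        data = list(self._load())
        if self._use_cache:
            self._cached = data
        return data


def iterate(dataset, rounds):
    """Toy iterative job: repeatedly move an estimate halfway toward the mean."""
    estimate = 0.0
    for _ in range(rounds):
        values = dataset.records()
        estimate += 0.5 * (sum(values) / len(values) - estimate)
    return estimate


uncached = ToyDataset(lambda: [1.0, 2.0, 3.0])
cached = ToyDataset(lambda: [1.0, 2.0, 3.0]).cache()
result_uncached = iterate(uncached, rounds=5)
result_cached = iterate(cached, rounds=5)
```

Both runs compute the same answer, but the uncached dataset is reloaded on every round while the cached one is loaded once. When loading means reading from disk across a cluster, that difference dominates the cost of iterative algorithms.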
For more information, see
Spark popular science: what Spark is and how to use Spark (1)
Spark popular science: what the core of Spark is and how to use Spark (2)
Youku Tudou uses Spark to improve big data analysis
A new member of Hadoop: Cloudera adds Spark to Hadoop
------------------------------------------------------------
Storm

Founder: Twitter
Twitter has officially open-sourced Storm, a distributed, fault-tolerant real-time computation system hosted on GitHub under the Eclipse Public License 1.0. Storm is a real-time processing system developed by BackType, which is now owned by Twitter. The latest version on GitHub is Storm 0.5.2, which is written mostly in Clojure.
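What "real-time computation" means in practice is that results are updated as each piece of data arrives, instead of after a whole batch has been collected. The sketch below imitates that with Python generators. It is not Storm code: "spout" and "bolt" are Storm's own terms for stream sources and processing steps, but the function names, the wiring, and the sample sentences here are assumptions for illustration.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream; a real spout would keep emitting indefinitely."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream):
    """Processing step: break each sentence into words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Processing step: emit an updated (word, count) as soon as a word arrives."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


topology = count_bolt(split_bolt(sentence_spout(["storm is fast", "storm is live"])))
updates = list(topology)
```

Each update is available the moment its word has passed through the chain, which is the contrast with a batch job that only reports totals once all input has been processed.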
For details, see:
Getting started with Storm
Storm 0.9.0.1 installation and deployment guide
An overall understanding of Storm, including concepts, scenarios, and components
Big data architect: Hadoop or Storm, which one to choose?
Big data architecture: a real-time system combining Flume-NG + Kafka + Storm + HDFS

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
