"Popular Science" #001 Big Data related technology

Source: Internet
Author: User
Tags: cassandra, syslog, hadoop, mapreduce, sqoop

Even if you are not a big data developer, it is worth having a basic understanding of big data: what technologies exist and what each is used for. That level of knowledge is enough.

When we talk about big data, everyone knows Hadoop, but a whole range of other technologies has come into view: Spark, Storm, Impala, and more, almost too many to keep up with. To help architect big data projects better, this article organizes these technologies so that engineers, project managers, and architects can choose the right technology, understand how the various big data technologies relate to each other, and pick the right language.

You can read this article with the following questions in mind:

1. What technologies does Hadoop include?
2. What is the relationship between Cloudera and Hadoop? What products does Cloudera offer, and what are their characteristics?
3. How is Spark related to Hadoop?
4. How is Storm related to Hadoop?

    1. Hadoop Common: "Bottom Module"
    2. HDFS: "Distributed Storage System"
    3. MapReduce: "Software Framework"
    4. Hive: "Data Warehouse System" (HiveQL)
    5. Pig: "Platform for large data set analysis"
    6. HBase: "Hadoop database"
    7. ZooKeeper: "Reliable Coordination System" (wraps complex, error-prone key services behind simple, high-performance interfaces)
    8. Sqoop: "Tool" for moving data between relational databases and Hadoop
    9. Mahout: "Machine learning, data mining"
    10. Ambari: "Configuring, Managing, and monitoring Apache Hadoop clusters"
    11. Spark: "Similar to Hadoop, but built on in-memory distributed datasets"

Hadoop family
Founder: Doug Cutting
The entire Hadoop family consists of several sub-projects:

Hadoop Common: "Bottom Module"
A module at the base of the Hadoop system that provides utilities for the other Hadoop sub-projects, such as configuration file handling and logging. For more detail, see:
Hadoop Technology Insider: In-depth Analysis of Hadoop Common and HDFS Architecture Design and Implementation Principles, chapters 1-9

HDFS: " Distributed Storage System "
HDFS is the primary distributed storage system used by Hadoop applications. An HDFS cluster consists of a NameNode (the master node, which manages the metadata of the whole file system) and DataNodes (there can be many, and they store the actual data). HDFS is designed for massive volumes of data: where traditional file systems are optimized for large numbers of small files, HDFS is optimized for storing and accessing small numbers of large files.
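
As an illustration, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API; the NameNode address and the /user/demo path are hypothetical placeholders.

```java
// Minimal HDFS write/read sketch with the standard Hadoop FileSystem API.
// The NameNode address and file path below are hypothetical placeholders.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // NameNode address (assumed)
        FileSystem fs = FileSystem.get(conf);

        // Write: the client gets block placement from the NameNode,
        // then streams the actual bytes to DataNodes.
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the same file back.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }
        fs.close();
    }
}
```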

Detailed View:

What is HDFS, and HDFS architecture design

HDFS + MapReduce + Hive quick start

Why HDFS in Hadoop 2.2.0 is highly available

Creating an HDFS file from Java: an example

MapReduce: "Software Framework"
MapReduce is a software framework that makes it easy to write applications that process massive amounts of data (terabyte-scale datasets) in parallel, running reliably and fault-tolerantly across large clusters of thousands of nodes built from commodity hardware.
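
The classic illustration is word counting: the map phase emits (word, 1) pairs and the reduce phase sums them per word. Below is a condensed sketch using the standard org.apache.hadoop.mapreduce API; input and output paths are supplied on the command line.

```java
// Condensed word-count sketch with the classic Hadoop MapReduce API.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in every input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) { word.set(token); ctx.write(word, ONE); }
            }
        }
    }

    // Reduce phase: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```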

Detailed View:
Introduction to Hadoop (1): what is Map/Reduce

Hadoop MapReduce basics

How MapReduce works, explained

Hands-on: write a MapReduce program and deploy it to run on Hadoop 2.2.0

Hive: "Data Warehouse System" HSQL
Apache Hive is a data warehouse system for Hadoop that facilitates data summarization (mapping structured data files onto database tables), ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides full SQL-style query functionality through its HiveQL language; where expressing a piece of logic in HiveQL would be inefficient or cumbersome, HiveQL also lets traditional MapReduce programmers plug in their own custom mappers and reducers. Hive is similar to Cloudbase: both are software packages that provide data warehouse SQL capabilities on top of the Hadoop distributed computing platform, making it simple to summarize the huge volumes of data stored in Hadoop.
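
Because Hive exposes HiveQL through standard interfaces, one common way to use it from Java is JDBC against HiveServer2. A minimal sketch, assuming a HiveServer2 endpoint at hive-host:10000 and a hypothetical tab-delimited page_views table:

```java
// Minimal sketch of running HiveQL from Java over JDBC (HiveServer2).
// The connection URL, table, and columns are hypothetical placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement()) {
            // Map a structured file in HDFS onto a table, then query it with SQL.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views (url STRING, hits INT) "
                       + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
            try (ResultSet rs = stmt.executeQuery(
                     "SELECT url, SUM(hits) FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}
```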

Detailed View:

Hive's origins and detailed introduction

Hive Detailed Video

Pig: "Platform for large data set analysis"
Apache Pig is a platform for analyzing large datasets. It includes a high-level language for writing data analysis programs and an infrastructure for evaluating those programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which is what lets them handle very large datasets. Pig's infrastructure layer consists of a compiler that generates MapReduce jobs. Its language layer currently consists of a textual language called Pig Latin, designed for ease of programming and with scalability in mind.

Pig is a SQL-like language: a high-level query language built on top of MapReduce that compiles its operations into the map and reduce phases of the MapReduce model, and that lets users define their own functions. It was developed by Yahoo's grid computing department as its counterpart to Google's Sawzall.
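
Pig Latin is usually run from the Grunt shell or as scripts, but it can also be embedded in Java through the PigServer class. A minimal sketch, assuming hypothetical HDFS paths and a tab-separated access log:

```java
// Minimal sketch of embedding Pig Latin in Java via PigServer.
// The HDFS paths and field layout are hypothetical placeholders.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE); // compile to MapReduce jobs

        // Each statement below is Pig Latin; Pig's compiler turns the
        // whole dataflow into a sequence of map and reduce phases.
        pig.registerQuery("logs = LOAD '/data/access_log' USING PigStorage('\\t') "
                        + "AS (user:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");
        pig.store("counts", "/data/url_counts"); // storing the alias triggers execution
    }
}
```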

Detailed View:

Getting started with Pig: basic operations and syntax, including supported data types, functions, keywords, and operators

What is the difference between Pig and Hive in the Hadoop family?

HBase: "Hadoop database"
Apache HBase is the Hadoop database: a distributed, scalable big data store. It provides random, real-time read/write access to large datasets and is optimized for hosting very large tables (tens of billions of rows by millions of columns) on clusters of commodity servers. At its core it is an open-source implementation of Google's BigTable paper: a distributed, column-oriented store. Just as BigTable sits on the distributed storage provided by GFS (Google File System), HBase is a BigTable-like system that Apache Hadoop provides on top of HDFS.
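
Here is a minimal sketch of that random, real-time read/write access using the HBase Java client (the 1.x-style Connection/Table API); the users table, info column family, and cell values are hypothetical.

```java
// Minimal HBase read/write sketch; table and column names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Random real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```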

Detailed View:

The differences between HBase and traditional databases

HBase distributed installation video: download and share

ZooKeeper: " reliable coordination System " complex error-prone key services-"simple, high-performance interface
ZooKeeper is an open-source implementation of Google's Chubby. It is a reliable coordination system for large-scale distributed systems, providing functions such as configuration maintenance, naming services, distributed synchronization, and group services. The goal of ZooKeeper is to encapsulate these complex and error-prone key services, offering users easy-to-use interfaces and an efficient, robust system.
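
To illustrate the "simple, high-performance interface" over a hard problem, here is a minimal sketch with the ZooKeeper Java client: connect, store a piece of configuration in a znode, and read it back. The ensemble address and znode path are hypothetical.

```java
// Minimal ZooKeeper client sketch; the server address and znode path
// are hypothetical placeholders.
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established

        // Store a small piece of configuration data in a znode.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "db=prod".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any client in the cluster can now read (and watch) the same value.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```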

Detailed View:

What is ZooKeeper, what role does it play, and what specifically does it do in Hadoop and HBase

Avro:
Avro is an RPC project hosted by Doug Cutting, somewhat like Google's Protobuf and Facebook's Thrift. Avro was built to serve as Hadoop's future RPC layer, making Hadoop's RPC module communicate faster and with more compact data structures.
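
A small sketch of that compact encoding, using Avro's generic (schema-at-runtime) Java API; the User record layout is an invented example.

```java
// Minimal Avro binary serialization sketch using the generic API.
// The record schema below is a hypothetical example.
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"age\",\"type\":\"int\"}]}");

        // Serialize: the encoding carries no field names, just compact values.
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");
        user.put("age", 30);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Deserialize with the same schema.
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord back = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(back.get("name") + ", " + back.get("age"));
    }
}
```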

Sqoop: "Tool" for moving data between relational databases and Hadoop
Sqoop is a tool for transferring data between Hadoop and relational databases: it can import data from a relational database into HDFS, or export data from HDFS back into a relational database.

Detailed View:
A detailed description of Sqoop, covering its commands, principles, and workflow

Mahout: " machine learning, data mining "
Apache Mahout is a scalable machine learning and data mining library. It currently supports four main use cases (a small recommender sketch follows the list):

Recommendation mining: collect user behavior and use it to recommend things the user might like.

Clustering: collect documents and group related documents together.

Classification: learn from existing categorized documents what documents of a given category look like, in order to assign unlabeled documents to the correct category.

Frequent itemset mining: take groups of items and identify which individual items usually appear together.
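
As a taste of the first use case, here is a minimal user-based recommender sketch built on Mahout's Taste API; the ratings.csv file (lines of the form userID,itemID,preference) and user ID 42 are hypothetical.

```java
// Minimal user-based recommender sketch with Mahout's Taste API.
// "ratings.csv" and the user ID are hypothetical placeholders.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 42, based on similar users' tastes.
        List<RecommendedItem> items = recommender.recommend(42L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " scored " + item.getValue());
        }
    }
}
```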

Cassandra: "Database"--"high performance, linear expansion, high efficiency
Apache Cassandra is a high-performance, linearly scalable, highly available database that can run on commodity hardware or cloud infrastructure, making it an ideal platform for mission-critical data. Cassandra is best in class at replication across data centers, giving users lower latency and reliable disaster recovery. With strong support for log-structured updates, denormalization and materialized views, and powerful built-in caching, the Cassandra data model also offers convenient secondary indexes (column indexes).

Chukwa: "Data collection System"
Apache Chukwa is an open-source data collection system, originally contributed by Yahoo, for monitoring large distributed systems. Built on the HDFS and MapReduce frameworks, it inherits Hadoop's scalability and stability. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results, so that the collected data is put to the best possible use.

Ambari: "Configuring, Managing, and monitoring Apache Hadoop clusters"
Apache Ambari is a web-based tool for configuring, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides cluster health dashboards, such as heatmaps, and the ability to view MapReduce, Pig, and Hive applications and diagnose their performance characteristics through a friendly user interface.

HCatalog
Apache HCatalog is a table and storage management service for data built on Hadoop. It provides:

A mechanism for sharing schemas and data types.

A table abstraction, so that users need not care how or where their data is stored.

Interoperability across data processing tools such as Pig, MapReduce, and Hive.

------------------------------------------------------------------------------------------------------------------

Cloudera product series:

Founding organization: Cloudera Company

1. Cloudera Manager:

Cloudera Manager has four functions:

(1) Management

(2) Monitoring

(3) Diagnosis

(4) Integration

Cloudera Manager's four functions

2. Cloudera CDH: short for CDH (Cloudera's Distribution, including Apache Hadoop)

Cloudera has made its own modifications to Hadoop.

Cloudera's release of Hadoop is the version we call CDH (Cloudera's Distribution including Apache Hadoop).

Details can be viewed:

Cloudera Hadoop: what is CDH, and an introduction to CDH releases

Related information

CDH3 in practice: Hadoop (HDFS), HBase, ZooKeeper, Flume, Hive

CDH4 installation practice: HDFS, HBase, ZooKeeper, Hive, Oozie, Sqoop

A summary of four installation methods for Hadoop CDH, with worked examples

CDH4 and CDH5 documentation downloads for Hadoop

3.Cloudera Flume" Log collection System "
Flume is a highly available, highly reliable, distributed system, provided by Cloudera and currently an Apache incubator project, for collecting, aggregating, and moving massive volumes of log data. Flume lets you customize all kinds of data senders inside a logging system in order to collect data; at the same time, it can perform simple processing on the data and write it out to a variety of (customizable) data receivers. Out of the box it can collect data from sources such as the console, RPC (Thrift-RPC), text files, tail (UNIX tail), syslog (the syslog logging system, supporting both TCP and UDP modes), and exec (command execution).

Flume uses a multi-master design. To keep configuration data consistent, Flume introduces ZooKeeper to store it; ZooKeeper itself guarantees the consistency and high availability of the configuration data and can notify the Flume master nodes when that data changes. The Flume masters synchronize state among themselves using a gossip protocol.

Detailed View:

What is Flume log collection, and what are Flume's features

What is Flume log collection, how Flume works, and what problems Flume can encounter

4. Cloudera Impala: "Direct SQL queries over data in HDFS and HBase" (how does it differ from Hive?)

Cloudera Impala provides direct, interactive SQL queries over the data you store in Apache Hadoop, whether in HDFS or HBase. Beyond using the same unified storage platform as Hive, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax). This gives users a single, familiar platform for both batch-oriented and real-time queries.

Detailed View:

What is Impala, and how to install and use Impala

5. Cloudera Hue: "CDH Web Manager"
Hue is a set of web managers dedicated to CDH, consisting of three parts: the Hue UI, Hue Server, and Hue DB. Hue provides an interface to the shells of all the CDH components. In Hue you can write MapReduce jobs, view and modify HDFS files, manage Hive metadata, run Sqoop, write Oozie workflows, and much more.

Detailed View:

Cloudera hue Installation and Oozie installation

What is Oozie? Oozie Introduction

Cloudera Hue use experience sharing, problems and solutions

Spark "similar to Hadoop, but enables a memory distribution dataset"
Founding organization: UC Berkeley AMP Lab (Algorithms, Machines, and People Lab)

Spark is an open-source cluster computing environment similar to Hadoop, but with some differences that make it superior for certain workloads. In other words, Spark works on in-memory distributed datasets, which lets it serve interactive queries as well as optimize iterative workloads.

Spark is implemented in the Scala language, which also serves as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, and Scala can manipulate distributed datasets as easily as local collection objects.

Although Spark was created to support iterative jobs over distributed datasets, it is actually complementary to Hadoop and can run in parallel on top of the Hadoop file system; this is supported through a third-party cluster framework named Mesos. Developed by the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab), Spark can be used to build large-scale, low-latency data analytics applications.
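
As an illustration of operating on distributed datasets as easily as local collections, here is a minimal word-count sketch in Java; it assumes the Spark 2.x Java API, and the HDFS paths are hypothetical placeholders. The cache() call is what marks a dataset to be kept in memory, which is what makes repeated (iterative or interactive) access cheap.

```java
// Minimal Spark word-count sketch (assuming the Spark 2.x Java API).
// The HDFS input and output paths are hypothetical placeholders.
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey((a, b) -> a + b);

        counts.cache(); // keep the dataset in memory for iterative/interactive reuse
        counts.saveAsTextFile("hdfs:///data/output");
        sc.close();
    }
}
```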

You can learn more from:

Popular science: what is Spark, and how to use Spark (1)

Popular science: what is Spark, what is Spark's core, and how to use Spark (2)

Youku Tudou improves big data analytics with Spark

A new member of the Hadoop family: Cloudera adds Spark to Hadoop

Storm "distributed, fault-tolerant real-time computing system"
Founder: Twitter
Twitter has officially open-sourced Storm, a distributed, fault-tolerant real-time computing system hosted on GitHub under the Eclipse Public License 1.0. Storm is a real-time processing system developed by BackType, a company now owned by Twitter. The latest version on GitHub is Storm 0.5.2, written essentially in Clojure.
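
For flavor, here is a minimal topology sketch; it assumes the classic backtype.storm Java API of that era (newer releases use the org.apache.storm package instead). TestWordSpout ships with Storm's test utilities, while ExclaimBolt is an invented example.

```java
// Minimal Storm topology sketch (assuming the classic backtype.storm API).
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class ExclaimTopology {
    // A bolt that appends "!!!" to every word it receives.
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0) + "!!!"));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);   // 2 spout tasks
        builder.setBolt("exclaim", new ExclaimBolt(), 4)     // 4 bolt tasks
               .shuffleGrouping("words");                    // randomly distribute tuples

        LocalCluster cluster = new LocalCluster();           // in-process cluster for testing
        cluster.submitTopology("exclaim-topology", new Config(), builder.createTopology());
        Utils.sleep(10000);                                  // let it run for 10 seconds
        cluster.shutdown();
    }
}
```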

You can learn more about:

An introduction to Storm

Storm 0.9.0.1 installation and deployment guide

General knowledge of Storm: concepts, scenarios, and components

Big data architecture: Hadoop or Storm, which to choose?

Big data architecture: a Flume-NG + Kafka + Storm + HDFS real-time system

"Popular Science" #001 Big Data related technology

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.