An Introduction to Big Data Hadoop: The Hadoop Family

Source: Internet
Author: User


The term "big data" may have sounded strange a few years ago, but I am sure the word Hadoop feels familiar by now! More and more people are working with Hadoop or learning about it. As a newcomer to Hadoop, what do you find hardest? Setting up the runtime environment is probably enough to give any beginner a headache. If every Hadoop release could, like DKHadoop, integrate the various components so that a single installation sets up everything, that would be a great help to beginners!
Digressions aside, back to the topic. This article shares some Hadoop basics, the Hadoop family of products, with everyone who is new to Hadoop. Understanding the family of products will further help you learn Hadoop itself. Your valuable suggestions are also welcome!
1. The definition of Hadoop
Hadoop is a large family: an open-source ecosystem, a distributed computing platform, and a framework written in Java. Its core technologies are HDFS and MapReduce, which make distributed storage and processing of massive data sets possible.
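The MapReduce model can be illustrated without a cluster. The sketch below is plain Python, not the Hadoop API; it mimics the three phases of the canonical word-count example: map, shuffle, reduce.

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: sum the counts for one word.
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

print(word_count(["big data", "big hadoop"]))  # {'big': 2, 'data': 1, 'hadoop': 1}
```

In real Hadoop the map and reduce functions run on different machines and the shuffle moves data over the network; the logic, however, has exactly this shape.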

2. The Hadoop family of products

HDFS (distributed file system):
What sets HDFS apart from ordinary file systems forms its foundation: high fault tolerance (it keeps running even when errors occur mid-job), support for multimedia data and streaming data access, efficient access to large data sets, strict data consistency, and low-cost, efficient deployment.
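The fault tolerance comes from splitting files into fixed-size blocks and replicating each block across several machines. The toy sketch below is plain Python, not HDFS itself, and uses a tiny block size for illustration (the real default is 128 MB, with a replication factor of 3):

```python
def split_into_blocks(data, block_size):
    # HDFS splits a file into fixed-size blocks (128 MB by default; tiny here).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, datanodes, replication=3):
    # Assign each block to `replication` distinct datanodes, round-robin.
    placement = {}
    for i, _ in enumerate(blocks):
        placement[i] = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"0123456789abcdef", block_size=4)
print(place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"]))
```

If one datanode fails, every block it held still has two other copies, which is why a job can keep running through mid-job errors.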

MapReduce/Spark/Storm (parallel computing frameworks):
1. By processing mode, offline versus online computation:
MapReduce is typically used for offline, complex big data computation.
Storm is used for online, real-time big data computation; its real-time processing handles one piece of data (one event) at a time.
Spark can be used for both offline and online real-time big data computation; its real-time processing mostly works on small time windows (micro-batches), so Spark is more flexible.
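The three processing styles above can be contrasted in a few lines. This is a hedged sketch in plain Python, not the real MapReduce/Storm/Spark APIs; it computes the same sum three ways: as one offline batch, one event at a time, and in small windows.

```python
events = [3, 1, 4, 1, 5, 9, 2, 6]

# Batch (MapReduce-style): process the complete data set in one offline job.
batch_total = sum(events)

# Per-event streaming (Storm-style): update state as each event arrives.
stream_total = 0
for e in events:
    stream_total += e

# Micro-batch (Spark Streaming-style): group events into small time windows,
# then run a batch computation on each window.
window = 3
micro_batches = [sum(events[i:i + window]) for i in range(0, len(events), window)]
micro_total = sum(micro_batches)

print(batch_total, stream_total, micro_total)  # 31 31 31
```

All three arrive at the same answer; they differ in latency (when results become available) and in how much data must exist before processing can start.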

2. By where the data lives during computation, disk-based versus in-memory:
MapReduce keeps its data on disk.
Spark and Storm keep their data in memory.

Pig/Hive (Hadoop programming):
Pig is a high-level programming language that performs very well on semi-structured data and can help shorten the development cycle.
Hive is a data analysis and query tool that performs especially well with SQL-like query analysis: an ETL job that would take a whole night to hand-code can be finished in minutes. That is its advantage, and where it seizes the opportunity!
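Hive's strength is that the analysis is expressed in an SQL-like language (HiveQL) rather than hand-written MapReduce code. As a stand-in, the example below uses Python's built-in sqlite3 rather than Hive itself, but the GROUP BY aggregation has the same shape as the HiveQL you would write:

```python
import sqlite3

# In-memory table standing in for a Hive table stored on HDFS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, visitor TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", "alice"), ("home", "bob"), ("about", "alice")])

# The kind of aggregation Hive compiles into MapReduce jobs behind the scenes:
rows = conn.execute(
    "SELECT page, COUNT(*) AS views FROM page_views "
    "GROUP BY page ORDER BY views DESC").fetchall()
print(rows)  # [('home', 2), ('about', 1)]
```

On Hive the identical query would be translated into distributed jobs over HDFS data, which is exactly the hand-coding it saves you.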

HBase/Sqoop/Flume (data import and export):
HBase is a column-oriented database that runs on top of HDFS and integrates well with Pig/Hive; it can be used nearly seamlessly through its Java API.
Sqoop is designed to make it easy to import data from traditional relational databases into Hadoop storage (HDFS/Hive).
Flume is designed to make it easy to import data from log file systems directly into Hadoop storage (HDFS).
These data-transfer tools greatly ease day-to-day work, improve productivity, and let people focus on business analysis.
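Flume's model is a pipeline of source, channel, and sink: the source reads log events, the channel buffers them, and the sink writes them into HDFS. The sketch below simulates that flow in plain Python (real Flume is configured declaratively, not coded like this; the list standing in for HDFS is an assumption for illustration):

```python
from collections import deque

def source(log_lines):
    # Source: read raw log lines and turn them into events.
    for line in log_lines:
        yield {"body": line.strip()}

def run_pipeline(log_lines, sink_store):
    channel = deque()  # Channel: buffers events between source and sink.
    for event in source(log_lines):
        channel.append(event)
    while channel:     # Sink: drain the channel into the store (stand-in for HDFS).
        sink_store.append(channel.popleft()["body"])
    return sink_store

store = run_pipeline(["GET /index 200\n", "GET /about 404\n"], [])
print(store)  # ['GET /index 200', 'GET /about 404']
```

The buffering channel is what lets the log producers and the HDFS writer run at different speeds without losing events.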

ZooKeeper/Oozie (system management and coordination):
ZooKeeper is a coordination service for managing the basic configuration of a distributed architecture. It provides a number of interfaces that make configuration-management tasks simple.
Oozie is a service for managing workflows: it schedules the different workflows so that every job gets done. Together, these tools help us manage a distributed big data computing architecture in a lightweight way.
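ZooKeeper stores configuration as a small tree of "znodes" and notifies watchers when a node changes. The sketch below simulates that model in plain Python; the real client API (e.g. through a library such as kazoo) looks different, but the path-plus-one-shot-watch idea is the same:

```python
class ZNodeTree:
    """Toy model of ZooKeeper's znode store: path -> data, with one-shot watches."""
    def __init__(self):
        self.nodes = {}
        self.watches = {}

    def set(self, path, data):
        self.nodes[path] = data
        # Fire and clear the watches for this path, as ZooKeeper's one-shot watches do.
        for callback in self.watches.pop(path, []):
            callback(path, data)

    def get(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.nodes.get(path)

zk = ZNodeTree()
zk.set("/app/config/workers", "4")
seen = []
zk.get("/app/config/workers", watch=lambda p, d: seen.append((p, d)))
zk.set("/app/config/workers", "8")  # triggers the watch
print(seen)  # [('/app/config/workers', '8')]
```

This watch mechanism is what makes configuration management simple: every node in the cluster can react to a configuration change without polling.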

Ambari/Whirr (system deployment and management):
Ambari helps people quickly deploy an entire big data analytics stack and monitor the system's health in real time.
Whirr's primary role is to help people quickly deploy such clusters on cloud infrastructure.

Mahout (machine learning):
Mahout is designed to help us quickly build intelligent systems. Some of the common machine-learning algorithms are already implemented in it, so this framework lets us quickly integrate more machine-learning intelligence.
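Among the algorithms Mahout implements are collaborative-filtering recommenders. As a toy illustration of the underlying idea only (plain Python, not Mahout's API, with made-up data): recommend the items liked by the most similar other user.

```python
def jaccard(a, b):
    # Similarity of two users' item sets: overlap divided by union.
    return len(a & b) / len(a | b)

def recommend(target, others):
    # Pick the most similar other user; suggest their items the target lacks.
    best = max(others, key=lambda u: jaccard(target, others[u]))
    return sorted(others[best] - target)

ratings = {
    "alice": {"hadoop", "spark", "hive"},
    "bob":   {"hadoop", "spark", "storm"},
    "carol": {"excel"},
}
others = {u: s for u, s in ratings.items() if u != "alice"}
print(recommend(ratings["alice"], others))  # ['storm']
```

Mahout's value is running this kind of logic, at far larger scale and with better algorithms, on top of the distributed machinery described above.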

