Introduction to Hadoop for Big Data

Source: Internet
Author: User
Tags: sqoop
A few years ago the term "Big Data" may have sounded strange, but the name "Hadoop" is probably familiar to you by now. More and more people are developing on Hadoop or learning it. For a beginner, what is the hardest part? Setting up the runtime environment alone can be enough of a headache. It would be a great help to new users if every Hadoop release came with its components already integrated and could be installed in one step, as dkhadoop does.
Enough digression; back to the topic. This article shares some basic Hadoop knowledge: the products in the Hadoop family. Understanding them should make learning Hadoop easier. Comments and corrections are welcome.
I. Hadoop Definition
Hadoop is a large family of products: an open-source ecosystem for distributed storage and computing, built mainly on the Java programming language. It is sometimes loosely described as a distributed operating system. Its core technologies are HDFS and MapReduce, which allow it to process massive data sets in a distributed manner.
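To make the MapReduce idea concrete, here is a minimal single-process sketch in Python. It only imitates the map, shuffle, and reduce phases on a list of strings; it is not Hadoop's Java API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data with hadoop", "hadoop stores big files"]))
```

In a real cluster the map and reduce calls would run on many machines, each working on its own part of the data; the sketch only shows how the phases fit together.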
II. Hadoop Products


HDFS (Distributed File System):
HDFS differs from conventional file systems in several ways: high fault tolerance (it keeps running even if a failure occurs partway through), support for multimedia and streaming media data, efficient access to large data sets, data consistency, lower deployment cost, and more efficient deployment. These features are the foundation of the HDFS architecture.
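The fault tolerance described above comes from splitting files into blocks and keeping several copies of each block on different machines. The following Python sketch illustrates that idea only. The block size, node names, and round-robin placement are assumptions made for this example; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin (illustrative only)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def read_file(blocks, placement, failed_nodes):
    """Reassemble the file, requiring at least one live replica for every block."""
    parts = []
    for block_id, block in enumerate(blocks):
        live = [node for node in placement[block_id] if node not in failed_nodes]
        if not live:
            raise IOError(f"block {block_id} has no live replica")
        parts.append(block)
    return b"".join(parts)


if __name__ == "__main__":
    data = b"hello hadoop distributed file system"
    blocks = split_into_blocks(data, 8)
    placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], 3)
    # One failed node does not prevent the file from being read.
    print(read_file(blocks, placement, {"n2"}) == data)
```

With three copies of every block, the file stays readable until all three nodes holding the same block fail at once.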


MapReduce/Spark/Storm (Parallel Computing Architecture):
1. By processing mode, offline computing versus online computing:
Role: Description
MapReduce: Often used for complex offline big data computing.
Storm: Used for online, real-time big data computing. Storm's real-time processing mainly handles data one record at a time.
Spark: Can be used for offline or online real-time big data computing. For real-time work, Spark mainly processes data in time windows, which makes it flexible.
2. By where the data is held, disk computing versus in-memory computing:
Role: Description
MapReduce: Data on disk
Spark and Storm: Data in memory
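The difference between the three processing styles can be sketched in plain Python. This is only a conceptual illustration and uses none of the MapReduce, Storm, or Spark APIs; the event list, the function names, and the five-second window are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical event stream: (timestamp_in_seconds, value)
EVENTS = [(0, 5), (1, 3), (4, 7), (5, 1), (9, 2)]


def batch_total(events):
    """Offline style: wait for the complete data set, then compute once."""
    return sum(value for _, value in events)


def per_record_totals(events):
    """Record-at-a-time style: update and emit a result as each record arrives."""
    running = 0
    for _, value in events:
        running += value
        yield running


def windowed_totals(events, window_seconds):
    """Time-window style: group records into fixed windows, one result per window."""
    windows = defaultdict(int)
    for timestamp, value in events:
        windows[timestamp // window_seconds] += value
    return dict(sorted(windows.items()))


if __name__ == "__main__":
    print(batch_total(EVENTS))
    print(list(per_record_totals(EVENTS)))
    print(windowed_totals(EVENTS, 5))
```

The offline version produces one answer at the end, the record-at-a-time version produces an answer after every event, and the windowed version sits in between.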
Pig/Hive (Hadoop programming):
Role: Description
Pig: A high-level programming language that performs well on semi-structured data and can help shorten the development cycle.
Hive: A data analysis and query tool that performs especially well for SQL-like query and analysis. ETL work that used to take a whole night can be completed in minutes, which is its main advantage.
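To show what "SQL-like query and analysis" means in practice, the sketch below runs a typical aggregation query. It uses Python's built-in sqlite3 module purely as a local stand-in, since Hive itself needs a running cluster; the page_views table and its columns are hypothetical. In Hive, a query of this shape would be written in HiveQL and executed as distributed jobs rather than in a single process.

```python
import sqlite3


def page_view_counts(rows):
    """Load (user_id, page) rows into an in-memory table and count views per page."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    # A typical analysis query: group, count, and sort.
    result = conn.execute(
        "SELECT page, COUNT(*) AS views "
        "FROM page_views GROUP BY page ORDER BY views DESC, page"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")]
    print(page_view_counts(sample))
```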
HBase/Sqoop/Flume (data import and export):
Role: Description
HBase: A column-oriented database that runs on top of HDFS and integrates well with Pig and Hive. It can be used almost seamlessly through its Java API.
Sqoop: Designed to make it easy to import data from traditional databases into Hadoop data stores (HDFS/Hive).
Flume: Designed to make it easy to import data from log files into Hadoop data stores (HDFS).
These data transfer tools make everyday work much easier and more efficient, so that people can focus on business analysis.
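One way to picture a parallel import from a relational database is to divide the table's key range into contiguous slices and let each task read one slice. The Python sketch below shows only that splitting step. It is an illustration of the idea, not Sqoop's actual implementation, and the function names, table name, and key column are hypothetical.

```python
def split_ranges(min_id, max_id, num_tasks):
    """Divide the inclusive range [min_id, max_id] into contiguous inclusive slices."""
    if num_tasks < 1 or max_id < min_id:
        raise ValueError("need at least one task and a non-empty key range")
    total = max_id - min_id + 1
    num_tasks = min(num_tasks, total)  # never create empty slices
    base, extra = divmod(total, num_tasks)
    ranges = []
    start = min_id
    for i in range(num_tasks):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def import_queries(table, key, ranges):
    """Build one bounded query per slice; each import task would run one of them."""
    return [
        f"SELECT * FROM {table} WHERE {key} BETWEEN {low} AND {high}"
        for low, high in ranges
    ]


if __name__ == "__main__":
    for query in import_queries("orders", "id", split_ranges(1, 10, 4)):
        print(query)
```

Each slice can then be read independently and written to its own file in HDFS.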
ZooKeeper/Oozie (System Management Architecture):
Role: Description
ZooKeeper: A coordination service used to manage the basic configuration of a distributed architecture. It provides many interfaces that simplify configuration management tasks.
Oozie: A service for managing workflows. It schedules the jobs in each workflow so that every job starts and finishes as planned.
These tools help us manage a big data distributed computing architecture in a lightweight manner.
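Workflow scheduling of the kind Oozie provides can be pictured as ordering jobs so that each one runs only after the jobs it depends on. The sketch below uses Python's standard graphlib module (Python 3.9 or later) to compute such an order. It does not use Oozie or its workflow definition format, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
WORKFLOW = {
    "import_logs": set(),
    "clean_data": {"import_logs"},
    "build_report": {"clean_data"},
    "export_results": {"build_report"},
}


def execution_order(workflow):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(workflow).static_order())


if __name__ == "__main__":
    for job in execution_order(WORKFLOW):
        print("run", job)
```

A workflow with a circular dependency has no valid order, and TopologicalSorter raises CycleError in that case.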
Ambari/Whirr (system deployment management):
Role: Description
Ambari: Helps administrators quickly deploy and build an entire big data analysis architecture and monitor the running status of the system in real time.
Whirr: Mainly used to help get services running on cloud computing platforms quickly.
Mahout (Machine Learning):
Mahout is designed to help us build intelligent systems quickly. It already implements a number of machine learning algorithms, so more machine learning capability can be integrated with little effort.
If you like this article, please follow me. Your attention is my biggest motivation.
If you need help with big data, feel free to contact me.
