Basic Hadoop Concepts


Hadoop is a distributed storage and computing platform for massive amounts of data.

Data can be roughly divided into three categories:

Structured data (which can be handled by an RDBMS; fast queries are achieved by building indexes)

Semi-structured data (typically processed with markup languages such as XML)

Unstructured data

In fact, unstructured data accounts for a significant proportion of all data, and the storage and computation of unstructured data is the more difficult problem.

Hadoop was inspired by two Google papers (on the Google File System and MapReduce); it can be understood as an open-source implementation of MapReduce, and it is written in Java.

First, the massive amount of data needs to be stored, and then it can be analyzed.

HDFS: the Hadoop Distributed File System

MapReduce: the Hadoop framework for parallel data processing

Hadoop can be understood as Hadoop = HDFS + MapReduce, i.e., a Hadoop cluster is an HDFS cluster plus a MapReduce cluster.

How HDFS accomplishes distributed storage:

An HDFS cluster typically has one master node, called the NameNode (NN for short); newer versions of Hadoop already support multiple master nodes.

There are also N slave nodes in HDFS, called DataNodes (DN).

The DataNodes are where the actual data is stored. The NameNode is mainly responsible for splitting the data into blocks and assigning them to DataNodes for storage. In addition, the NameNode receives user requests, manages the slave nodes, maintains the file system's directory structure, and manages the mapping between files and blocks and between blocks and DataNodes. Together, this accomplishes the distributed storage of massive data.
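As a concrete illustration, a small Java client can ask the NameNode for a file's block layout through the Hadoop FileSystem API. This is only a sketch: the file path is hypothetical, and the cluster settings are assumed to come from the usual config files.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/example.log"); // hypothetical file

        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers this entirely from its metadata; no blocks are read.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```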


HDFS features: HDFS is designed for storing large files and is not suitable for storing masses of small files.

HDFS is a user-space file system (the data is ultimately stored on a file system such as ext3; HDFS just abstracts the data once more on top of it).

HDFS does not support modifying data in place (newer versions support append).

Mounting is not supported, and HDFS is not accessed through ordinary system calls, only through its own interfaces, such as the dedicated command-line tools and the API.
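For instance, here is a minimal Java sketch of talking to HDFS through its API. The NameNode address and paths are hypothetical, and the append call additionally assumes a Hadoop version and configuration that permit it.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(p, true)) { // create, overwriting if present
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        // In-place modification is not supported, but newer versions allow append:
        try (FSDataOutputStream out = fs.append(p)) {
            out.write("appended line\n".getBytes(StandardCharsets.UTF_8));
        }
        try (FSDataInputStream in = fs.open(p);
             BufferedReader r = new BufferedReader(
                     new InputStreamReader(in, StandardCharsets.UTF_8))) {
            r.lines().forEach(System.out::println);
        }
        fs.close();
    }
}
```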


The term MapReduce generally has three meanings:

Programming model

Computational framework

Concrete tools that implement the MapReduce programming idea

MapReduce processing is roughly divided into two phases: map and reduce.

Map divides the processing of a large file into blocks that are computed separately, which is what makes the computation distributed.

Reduce summarizes the results from the individual blocks.

The computation actually works by extracting key-value pairs: when map output is sent to reduce, all pairs with the same extracted key must be sent to the same reduce process so that they can be merged in the final step.

To compute over the data, developers must write MapReduce programs that follow the MapReduce programming model, so the HDFS + MapReduce combination alone is quite restrictive for computation over massive data.
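As an illustration of that programming model, here is the canonical word-count job in Java: the map phase emits a (word, 1) pair for every word in its input split, and the reduce phase receives all pairs with the same word and sums them. The input and output HDFS paths are supplied as arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: each input split is processed independently; emit (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: all values for the same key arrive at the same reducer and are summed.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output on HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```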

Hadoop also has a number of components that make up the ecosystem of Hadoop:

[Figure: the Hadoop ecosystem]

HDFS + MapReduce constitutes the core of Hadoop:

Hive: Hive was developed by Facebook. It abstracts the whole framework provided by MapReduce into a system: when a user wants to perform something like a row query, they can submit an SQL statement to Hive, and Hive transforms this user-friendly SQL statement into a MapReduce program, executes it, and finally returns the result to the user. (Hive can be understood as an SQL interface on top of Hadoop, though it is not fully SQL-compatible.)
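As a sketch of how this looks from the client side, Hive can be queried through its JDBC driver; the HiveServer2 address, credentials, and table here are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // Requires the Hive JDBC driver on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://hiveserver:10000/default"; // hypothetical HiveServer2
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL-like statement into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM access_log GROUP BY status")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```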

HBase: Because HDFS cannot be mounted and does not allow data to be modified, HBase runs on top of HDFS: a process is started on each HBase node, forming an HBase cluster; data lives in HBase first and is then persisted to HDFS by HBase. HBase also attaches a version number to each data record, which is what makes modifying data possible.
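A minimal sketch with the HBase Java client shows the versioning idea: writing the same cell twice keeps two timestamped versions instead of overwriting in place. The table and column names are hypothetical, and the column family is assumed to have been created to retain more than one version.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersionDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("weblog"))) { // hypothetical
            byte[] row = Bytes.toBytes("row1");
            byte[] cf  = Bytes.toBytes("f");
            byte[] col = Bytes.toBytes("status");

            // Two writes to the same cell: HBase keeps both as timestamped versions,
            // which is how "modification" works on top of write-once HDFS.
            table.put(new Put(row).addColumn(cf, col, Bytes.toBytes("200")));
            table.put(new Put(row).addColumn(cf, col, Bytes.toBytes("404")));

            Get get = new Get(row);
            get.setMaxVersions(3); // readVersions(3) in newer client versions
            Result result = table.get(get);
            for (Cell c : result.getColumnCells(cf, col)) {
                System.out.println(c.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(c)));
            }
        }
    }
}
```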

In many cases we need to analyze the logs generated by a web server cluster. So how do we store the logs produced by the web servers on HDFS? HDFS cannot be mounted, so the servers cannot write to it as if it were an ordinary file system; log-collection tools such as Flume and Scribe exist precisely to ship logs to HDFS.
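A minimal Flume agent configuration sketch for that scenario (agent, source, and path names are all hypothetical): tail the web server's access log and deliver the events to HDFS.

```properties
# One agent with one source, one channel, one sink.
agent.sources  = tail-src
agent.channels = mem-ch
agent.sinks    = hdfs-sink

# Source: follow the web server's access log (hypothetical path).
agent.sources.tail-src.type = exec
agent.sources.tail-src.command = tail -F /var/log/httpd/access_log
agent.sources.tail-src.channels = mem-ch

# Channel: buffer events in memory between source and sink.
agent.channels.mem-ch.type = memory
agent.channels.mem-ch.capacity = 10000

# Sink: write the events into date-partitioned directories on HDFS.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
agent.sinks.hdfs-sink.channel = mem-ch
```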

In many cases we may also want to use the power of the cluster to analyze, compute on, and mine data stored in an RDBMS. So how do we import data from an RDBMS into HDFS? This is what the Sqoop tool implements: Sqoop can export data from the RDBMS and store it in HBase first, HBase then persists it to HDFS, and afterwards the data can be processed by a MapReduce program you write.
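A sketch of the corresponding Sqoop invocations (connection details, table, and row-key column are hypothetical): the first lands the rows as files on HDFS, the second lands them in an HBase table.

```sh
# Import a table from MySQL into HDFS; -P prompts for the password.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl -P \
  --table orders \
  --target-dir /user/etl/orders \
  --num-mappers 4

# Or land the rows in an HBase table instead of plain HDFS files.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl -P \
  --table orders \
  --hbase-table orders \
  --column-family f \
  --hbase-row-key order_id
```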

Mahout: a tool for data mining and machine learning.

ZooKeeper: it can be understood as a coordinator that monitors whether the nodes in the cluster meet the cluster's requirements.
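A minimal Java sketch of that coordination idea: a cluster node announces itself with an ephemeral znode, which ZooKeeper removes automatically if the node's session dies, so a watcher can detect the failure. The ensemble address and paths are hypothetical, and the /workers parent node is assumed to exist.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LivenessDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181", 5000, event -> {}); // hypothetical ensemble

        // A worker announces itself with an ephemeral node; ZooKeeper deletes it
        // automatically if the worker's session dies.
        zk.create("/workers/worker-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // A monitoring process (here the same client, for brevity) sets a watch;
        // the callback fires when the node changes or disappears.
        zk.exists("/workers/worker-1",
                event -> System.out.println("event: " + event.getType()
                        + " on " + event.getPath()));

        Thread.sleep(10_000);
        zk.close();
    }
}
```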


Hadoop is still a good choice for data storage via HDFS, but MapReduce is comparatively weak as a computing engine. Hadoop can therefore be combined with the second-generation big data solution Spark: HDFS handles the distributed storage of massive data, while Spark delivers the computation over it.
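A minimal sketch of that division of labor using Spark's Java API, reading from and writing back to HDFS (paths and the NameNode address are hypothetical; the master URL is supplied by spark-submit):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark word count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // HDFS does the distributed storage; Spark does the distributed computation.
            JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/input");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);
            counts.saveAsTextFile("hdfs://namenode:8020/data/output");
        }
    }
}
```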

This article is from the "Zxcvbnm Xuan ye" blog; please be sure to keep this source: http://10764546.blog.51cto.com/10754546/1712511
