Part of Hadoop
Hadoop is a Java implementation of Google's MapReduce. MapReduce is a simplified distributed programming model that allows programs to be distributed automatically to a large cluster of ordinary machines.
Hadoop is mainly composed of HDFs, MapReduce and HBase. The specific composition is as follows:
The composition diagram of Hadoop
1. Hadoop HDFs is an open source implementation of the Google GFs storage System, the primary scenario being the underlying component of a parallel computing environment (MAPREDUCE) and the underlying distributed file system of BigTable (such as HBase, hypertable). HDFs adopts Master/slave architecture. A HDFs cluster is composed of a namenode and a certain number of datanode. Namenode is a central server responsible for managing file System namespace and client access to files. Datanode is typically a node in the cluster, responsible for managing the storage that comes with them on the node. Internally, a file is actually partitioned into one or more blocks, which are stored in the Datanode collection. Namenode performs namespace operations on file systems, such as opening, closing, renaming files and directories, and determining the mappings of blocks to specific datanode nodes. Datanode the creation, deletion and replication of blocks under the command of Namenode. Namenode and Datanode are designed to run on ordinary, inexpensive Linux-running machines. HDFs is developed in the Java language, so it can be deployed on a wide range of machines. A typical deployment scenario is when a machine runs a separate Namenode node, and the other machines in the cluster run a Datanode instance. This architecture does not preclude the running of multiple datanode on a single machine, but this is relatively rare.
HDFs Architecture Diagram
2. The Hadoop MapReduce is a simple software framework that can be written on applications that run on large clusters of thousands of commercial machines and process terabytes of data in a reliable and fault-tolerant manner.
A MapReduce job (job) typically cuts the input dataset into separate blocks of data that are handled in a completely parallel manner by the map task. The framework sorts the output of the map first, and then enters the result into the reduce task. Usually the input and output of the job are stored in the file system. The entire framework is responsible for scheduling and monitoring tasks, as well as for performing tasks that have failed.
Hadoop MapReduce Process Flowchart
3. Hive is a data warehousing tool based on Hadoop, with a high processing capacity and low cost.
Main Features:
The method of storage is to map a structured data file to a database table.
Provides a class SQL language to implement the full SQL query functionality.
1. SQL statements can be converted to MapReduce task run, very suitable for data Warehouse statistical analysis.
Deficiencies:
1. Store and read data in the form of row storage (Sequencefile).
2. Inefficient: When you want to read a column of data table data, you need to first remove all the data and then extract a column of data, the efficiency is very low.
3. Occupy more disk space.
As a result of the above deficiencies, Dr. Charlie introduced a distributed data processing system in the record-unit storage structure into a storage structure as a unit, thereby reducing the number of disk access, improve query processing performance. In this way, because the same property values have the same data type and similar data attributes, compressed storage with attribute values is more compressed and saves more storage space.
Comparison diagram of row and column storage
HBase is a distributed, column-oriented open source database, which differs from the general relational database and is a database suitable for unstructured data storage. Another difference is that HBase is based on columns rather than on a row based pattern. HBase uses the same data model as bigtable. The user stores data rows in one table. A data row has a selectable key and any number of columns, one or more columns form a columnfamily, and a fmaily column is located in a hfile that is easy to cache data. Tables are loosely stored, so users can define a variety of different columns for a row. In HBase, the data is sorted by primary key, and the table is divided into multiple hregion by primary key.
HBase Datasheet Chart
"Recommended reading": 1. MapReduce Data flow optimization based on Hadoop system
2.Hadoop Technology Center
3. Using MapReduce and load balancing in the cloud