A Detailed Look at the Internal Mechanisms of the Hadoop Core Architecture: HDFS + MapReduce + HBase + Hive

Source: Internet
Author: User

Editor's note: HDFS and MapReduce are the two cores of Hadoop, and the two core tools HBase and Hive are becoming increasingly important as Hadoop grows. In his blog post "Thinking in Bigdata (8): A Detailed Look at the Internal Mechanisms of the Hadoop Core Architecture: HDFS + MapReduce + HBase + Hive", the author Zhang Zhen analyzes the operating mechanisms of HDFS, MapReduce, HBase, and Hive from the inside out, providing a detailed analysis of Hadoop from the underlying layers up to data management.

CSDN recommendation: You are welcome to subscribe, free of charge, to the "Hadoop and Big Data Weekly" to get more Hadoop technical literature, big data technology analysis, hands-on enterprise experience, and ecosystem development trends.

The following is the author's original text:

This article summarizes a phase of research and analysis: how HDFS, MapReduce, HBase, and Hive work, viewed from the perspective of their internal mechanisms, including the internal implementation of the Hadoop-based data warehouse and distributed database. Any shortcomings will be corrected in follow-up revisions.

The Architecture of HDFS

The Hadoop architecture relies mainly on HDFS for its underlying distributed storage support and on MapReduce (MR) for distributed parallel task processing.

HDFS uses a master/slave structural model. An HDFS cluster consists of one NameNode and several DataNodes (support for multiple NameNodes has been implemented in the latest Hadoop 2.2 release; some large companies had previously achieved this by modifying the Hadoop source code, and it is now part of the latest version). The NameNode acts as the master server, managing the file system namespace and client access to files. The DataNodes manage the stored data. HDFS exposes data in the form of files.

Internally, a file is divided into data blocks that are stored on a set of DataNodes. The NameNode performs namespace operations on the file system, such as opening, closing, and renaming files or directories, and is also responsible for mapping data blocks to specific DataNodes. The DataNodes serve read and write requests from file system clients and handle the creation, deletion, and replication of blocks under the unified scheduling of the NameNode. The NameNode manages all HDFS metadata; user data never flows through the NameNode.

HDFS Architecture Diagram

The figure covers three roles: NameNode, DataNode, and Client. The NameNode is the manager, the DataNodes are the file storage nodes, and the client is the application that needs to access the distributed file system.

File write:

1) The client initiates a file write request to the NameNode.

2) The NameNode returns information about the DataNodes it manages to the client, based on the file size and the file block configuration.

3) The client divides the file into blocks and writes them, block by block, to the appropriate DataNodes according to the DataNode addresses.

File read:

1) The client initiates a file read request to the NameNode.

2) The NameNode returns the DataNode locations where the file's blocks are stored.

3) The client reads the file data from those DataNodes.
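
The same write and read flow can be driven from the HDFS Java client. Below is a minimal sketch, assuming a reachable cluster; the NameNode address hdfs://namenode:9000 and the path /wh/demo.txt are placeholders. The client library performs the NameNode/DataNode interaction described in the steps above.

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsReadWriteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; in practice this comes from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/wh/demo.txt");

                // Write: the client asks the NameNode for target DataNodes,
                // then streams the blocks to them.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
                }

                // Read: the NameNode returns the block locations,
                // and the client reads directly from the DataNodes.
                try (FSDataInputStream in = fs.open(file)) {
                    IOUtils.copyBytes(in, System.out, 4096, false);
                }
            }
        }
    }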

As a distributed file system, HDFS has several data management characteristics worth using as a reference:

File block placement: a block has three replicas: one on the DataNode specified by the NameNode, one on a DataNode that is not on the same machine as the specified DataNode, and one on a DataNode on the same rack as the specified DataNode. The purpose of replication is data safety; this placement takes the failure of an entire rack into account, as well as the performance of reading from different copies of the data.
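
The replica count itself is configurable. Below is a small sketch, reusing the placeholder path from the previous example; dfs.replication and FileSystem.setReplication are standard HDFS client settings, and the NameNode decides where the copies are placed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Default replica count for files created by this client (cluster default is usually 3).
            conf.set("dfs.replication", "3");
            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/wh/demo.txt");
                // Ask the NameNode to keep three copies of an existing file;
                // it schedules the replicas across machines and racks.
                fs.setReplication(file, (short) 3);
                short current = fs.getFileStatus(file).getReplication();
                System.out.println("replication = " + current);
            }
        }
    }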

MapReduce Architecture

The MR framework consists of a single JobTracker running on the master node and a TaskTracker running on each slave node. The master node is responsible for scheduling all the tasks that make up a job; these tasks are distributed across different slave nodes. The master node monitors their execution and re-runs tasks that have failed, while the slave nodes only execute the tasks assigned to them by the master. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers' execution. The JobTracker can run on any computer in the cluster. TaskTrackers are responsible for executing tasks and must run on DataNodes, so a DataNode is both a storage node and a compute node. The JobTracker distributes map tasks and reduce tasks to idle TaskTrackers, which run them in parallel while it monitors their progress. If a TaskTracker fails, the JobTracker transfers its tasks to another idle TaskTracker to be re-run.

HDFS and MR together form the core of the Hadoop distributed system architecture. HDFS provides the distributed file system on the cluster, and MR provides distributed computing and task processing on the cluster. HDFS supplies file operations and storage during MR task processing, while MR distributes, tracks, and executes tasks on top of HDFS and collects the results. The two interact with each other to accomplish the main work of a distributed cluster.

Parallel application development on Hadoop is based on the MR programming framework. The principle of the MR programming model is to take a set of input key-value pairs and produce a set of output key-value pairs. The MR library implements this model through two functions, map and reduce. A user-defined map function takes an input key-value pair and produces a set of intermediate key-value pairs. MR groups all intermediate values that share the same key and passes them to a reduce function. The reduce function takes a key and the set of values associated with it and merges those values into a smaller collection. The intermediate values are usually supplied to the reduce function through an iterator (the iterator's job is to gather these values), which allows it to handle collections of values too large to fit in memory all at once.
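
The classic word count is a compact illustration of this model. The following is a minimal sketch using the standard org.apache.hadoop.mapreduce API; the input and output paths are taken from the command line and are placeholders.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map: (K1 = byte offset, V1 = line of text) -> intermediate (K2 = word, V2 = 1)
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // reduce: (K2 = word, iterator over V2 values) -> (K3 = word, V3 = total count)
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The packaged job would typically be submitted with the hadoop jar command; the cluster then schedules the map and reduce tasks as described above.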

Note: (the third figure was drawn by a teammate of the author)

In short, large datasets are divided into small chunks of data, and these chunks are processed on nodes in the cluster to produce intermediate results. For a task on a single node, the map function reads the data record by record as (K1, V1) pairs; the records go into a buffer, the map function runs, and the framework sorts the map output by key, producing (K2, V2) pairs. Every machine performs the same operation. The (K2, V2) pairs from different machines are then merged and sorted during the shuffle (the shuffle can be understood as the step that happens before reduce), and finally reduce merges them into (K3, V3) pairs, which are written out to files on HDFS.

Before reduce runs, intermediate <key, value> pairs that share the same key can be merged by a combine step. The combine process is similar to reduce, but the combiner is part of the map task and executes only after the map function has finished. Combining reduces the number of intermediate key-value pairs and therefore reduces network traffic.
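
In the word-count sketch above, the reducer's summation is associative and commutative, so (as a sketch, not a general rule) the same class can double as the combiner; registering it is one extra line in the job driver:

    // In the word-count driver, after setReducerClass(...): run the same summing logic
    // on each map node before the shuffle, so far fewer (word, 1) pairs cross the network.
    // Safe here only because summation can be applied zero, one, or several times per key
    // without changing the final result.
    job.setCombinerClass(IntSumReducer.class);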

After combine and partition are done, the intermediate results of a map task are stored as files on the local disk. The location of an intermediate result file is reported to the master JobTracker, which then tells the reduce tasks which DataNode to fetch the intermediate results from. All the intermediate results generated by the map tasks are divided into r parts according to their keys, and each of the r reduce tasks is responsible for one key range. Each reduce task fetches, from many map task nodes, the intermediate results that fall within its key range and then executes the reduce function, producing a final result. With r reduce tasks there are r final results, and in many cases these r results do not need to be merged into one, because they can serve as the input to another computation and start another parallel job. This is what produces the multiple output data fragments (HDFS replicas) in the figure above.
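
Which of the r parts a key falls into is decided by the partitioner; by default Hadoop's HashPartitioner hashes the key modulo the number of reduce tasks. Below is a minimal sketch of an equivalent custom partitioner, assuming the Text/IntWritable types from the word-count example; it would be wired in with job.setPartitionerClass(WordPartitioner.class) and job.setNumReduceTasks(r).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assigns each intermediate key to one of r reduce tasks (r = numPartitions),
    // so every reduce task is responsible for one slice of the key space.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }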

HBase Data Management

HBase is the Hadoop database. How does it differ from traditional databases such as MySQL and Oracle? What is the difference between column-oriented and row-oriented data? What distinguishes a NoSQL database from a traditional relational database?

HBase vs. Oracle

1. HBase is suited to workloads with large volumes of inserts combined with reads: hand in a key to get a value, or hand in some keys to get the corresponding values. (A sketch of this access pattern with the HBase Java client appears after this comparison.)

2. HBase's bottleneck is hard-disk transfer speed. HBase can insert data and can also update data, but an update is actually an insert: it just inserts a new row with a new timestamp. A delete is also an insert, simply inserting a row with a delete marker. All of HBase's operations are append-style inserts; HBase is a log-structured database, stored much like a log file, written to disk in bulk, usually as sequential file reads and writes. This read/write speed depends on how fast the disk can transfer data to the machine. Oracle's bottleneck, by contrast, is hard-disk seek time. It frequently performs random read and write operations: to update a piece of data, it first locates the block on disk, reads it into memory, modifies it in the in-memory cache, and writes it back some time later. Because the blocks being sought differ each time, this produces random reads. A drive's seek time is mainly determined by its rotational speed, and seek time has barely improved over the years, which creates the seek-time bottleneck.

3. HBase data can keep many versions with different timestamps (that is, the same cell can hold many different versions; this data redundancy is also an advantage). Data is sorted by time, so HBase is particularly well suited to finding the top N items in chronological order: what a user has browsed recently, the last N blog posts they wrote, their last N actions, and so on. This is why HBase is used so widely on the Internet. (The sketch after this comparison also shows a multi-version read.)

4. HBase's limitations. It can only perform very simple key-value queries. It fits scenarios with high-speed inserts and a large volume of reads, and that scenario is fairly extreme; not every company has this kind of demand. In some companies, ordinary OLTP (online transaction processing) with random reads and writes is the norm, and in that case Oracle offers better reliability and a lower overall system burden than HBase. Another limitation is that HBase only has a primary-key (row-key) index, which creates problems when modeling data. For example, if you want to query a table on conditions over many different columns, you can only build fast lookups on the row key. So it cannot be said in general terms that one technology is superior to the other.

5. Oracle is a row-oriented database, and HBase is a column-oriented database. The advantage of a column-oriented database shows up in data analysis scenarios. Data analysis differs from traditional OLTP: analysis queries often filter on one column and return only certain columns rather than all of them, and in that situation a row-oriented database responds inefficiently.

Row-oriented database: in Oracle, for example, the basic unit of a data file is the block/page, and the data inside a block is written row by row. The problem is that when we want to read only some columns from a block, we cannot read just those columns: we must read the whole block into memory and then extract the contents of those columns. In other words, in order to read a few columns of a table, you have to read all of its rows. That is the worst aspect of a row-oriented database.

Column-oriented database: data is stored with the column as the unit, and the elements of the same column are packed into the same block. When you want to read certain columns, you only need to read the relevant column blocks into memory, so the amount of I/O is much smaller. In addition, the data elements of the same column usually have similar formats, which means the data can be compressed very effectively. A column-oriented database therefore has a great advantage in data compression: compression saves not only storage space but also I/O. (This can be exploited when the data volume reaches the millions or tens of millions and queries need to be optimized for performance, as in the scenarios described above.)
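
Below is a minimal sketch of the key-value access from point 1 and the multi-version read from point 3, using the HBase Java client (the setMaxVersions call is the HBase 1.x-style API; 2.x uses readVersions). The table name zz, the column family cf, the qualifier, and the row key are hypothetical placeholders; the table is assumed to exist and the column family is assumed to retain more than one version.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseAccessSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("zz"))) {

                // Point 1: an insert/update is just an append under (row key, family, qualifier);
                // each write gets its own timestamp.
                Put put = new Put(Bytes.toBytes("user-001"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("city"), Bytes.toBytes("beijing"));
                table.put(put);

                // Point 1: hand in a key, get the latest value back.
                Result latest = table.get(new Get(Bytes.toBytes("user-001")));
                System.out.println(Bytes.toString(
                        latest.getValue(Bytes.toBytes("cf"), Bytes.toBytes("city"))));

                // Point 3: ask for up to the three most recent timestamped versions of the cell.
                Get versioned = new Get(Bytes.toBytes("user-001"));
                versioned.setMaxVersions(3);
                List<Cell> cells = table.get(versioned)
                        .getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("city"));
                for (Cell cell : cells) {
                    System.out.println(cell.getTimestamp() + " -> "
                            + Bytes.toString(CellUtil.cloneValue(cell)));
                }
            }
        }
    }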

Hive Data Management

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extracting, transforming, and loading data, and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Structured data files in Hadoop can be mapped to tables in Hive, which provides SQL-like query capabilities (updates, indexes, and transactions, which SQL supports, are not supported here). As a SQL-to-MapReduce translator, Hive converts SQL statements into MapReduce jobs to run, and it offers shell, JDBC/ODBC, Thrift, web, and other interfaces. Its advantage is low cost: simple MapReduce statistics can be implemented quickly through SQL-like statements. As a data warehouse, Hive's data management can be described from three angles according to the level at which it is used: metadata storage, data storage, and data exchange.
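
That SQL-like access path is typically reached through the HiveServer2 JDBC driver. A minimal sketch follows; the host, port, database, credentials, and the zz table are placeholders, and on classic Hive the query is compiled into MapReduce jobs behind the scenes (newer engines may use Tez or Spark instead).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            // Requires the hive-jdbc driver on the classpath.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 JDBC URL; host, port, and database are placeholders.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement();
                 // A simple aggregation; Hive translates it into a MapReduce job.
                 ResultSet rs = stmt.executeQuery("SELECT city, COUNT(*) FROM zz GROUP BY city")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }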

(1) Metadata storage

Hive stores its metadata in an RDBMS, and there are three ways to connect to that database:

• Embedded mode: the metadata is kept in the built-in Derby database; this is typically used for unit testing and allows only one session to connect at a time

• Multi-user mode: install MySQL locally and store the metadata in MySQL

• Remote mode: Metadata is placed in a remote MySQL database

(2) Data storage

First, Hive has no dedicated data storage format and does not build indexes on the data. Users can organize tables in Hive quite freely: they only need to tell Hive the column and row delimiters of the data when creating a table, and Hive can then parse the data.

Second, all of Hive's data is stored in HDFS, and Hive has four data models: Table, External Table, Partition, and Buckets.

Table: similar to a table in a traditional database; each table has a corresponding directory in Hive in which its data is stored. For example, a table zz has the HDFS path /wh/zz, where wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml. All table data (excluding external tables) is stored in this directory.

Partition: similar to an index on a column in a traditional database. In Hive, a partition of a table corresponds to a subdirectory under the table's directory, and all of the partition's data is stored in that directory. For example, if the zz table has the two partition columns ds and city, then the HDFS subdirectory for ds=20140214, city=beijing is /wh/zz/ds=20140214/city=beijing.

Buckets: a hash is computed over a specified column and the data is split according to the hash value, with the aim of enabling parallel processing; each bucket corresponds to one file. For example, to spread the user column across 32 buckets, a hash is first computed over the user column's values; the HDFS path for hash=0 is /wh/zz/ds=20140214/city=beijing/part-00000, and the path for hash=20 is /wh/zz/ds=20140214/city=beijing/part-00020.

External Table: points to data that already exists in HDFS, and partitions can be created for it. It has the same metadata organization as a Table but differs greatly in actual storage. For a Table, creation and data loading can be done in a single statement; the actual data is moved into the data warehouse directory, and later access happens directly in that directory. When a Table is dropped, both its data and its metadata are deleted. An External Table involves only one step, because loading the data and creating the table happen at the same time: the actual data is kept at the HDFS path specified by LOCATION and is not moved into the data warehouse.
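
These data models map directly onto Hive's DDL. The sketch below reuses the JDBC Statement style from the query example above; the table, column, and path names are placeholders chosen to echo the zz example.

    import java.sql.Statement;

    public class HiveDdlSketch {
        // Uses a java.sql.Statement obtained from the HiveServer2 JDBC connection shown earlier.
        static void createTables(Statement stmt) throws Exception {
            // Managed, partitioned, bucketed table; its data lives under
            // ${hive.metastore.warehouse.dir}/zz, e.g. /wh/zz/ds=20140214/city=beijing/part-00000.
            stmt.execute(
                "CREATE TABLE zz (uid STRING, page STRING) " +
                "PARTITIONED BY (ds STRING, city STRING) " +
                "CLUSTERED BY (uid) INTO 32 BUCKETS " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'");

            // External table: only the metadata is created; the data stays at LOCATION,
            // is not moved into the warehouse directory, and DROP TABLE keeps the files.
            stmt.execute(
                "CREATE EXTERNAL TABLE zz_ext (uid STRING, page STRING) " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
                "LOCATION '/data/raw/zz'");
        }
    }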

(3) Data exchange

• User interface: Includes client, web interface, and database interface

• Metadata storage: typically stored in a relational database such as MySQL or Derby

• Hadoop: uses HDFS for storage and MapReduce for computation.

Key point: Hive stores its metadata in a database such as MySQL or Derby. Hive's metadata includes the names of tables, their columns and partitions with their attributes, table properties (such as whether a table is an external table), the directory where each table's data resides, and so on.

Hive's data is stored in HDFS, and most queries are executed by MapReduce.

Summary:

This article introduced the most central parts of the Hadoop distributed computing platform: the distributed file system HDFS, MapReduce processing, the data warehousing tool Hive, and the distributed database HBase. It essentially covers all the technical cores of the Hadoop platform, from architecture to data definition to data storage and processing, from the macro level down to the micro level, laying a foundation for large-scale data storage and task processing on the Hadoop platform.

Citation link: A Detailed Look at the Internal Mechanisms of the Hadoop Core Architecture: HDFS + MapReduce + HBase + Hive
