Detailed Hadoop Core Architecture: HDFS + MapReduce + HBase + Hive


This article introduces the core components of the Hadoop distributed computing platform: the Hadoop Distributed File System (HDFS), MapReduce processing, the data warehouse tool Hive, and the distributed database HBase. Together they cover the technical core of the Hadoop platform.
Based on this stage of research and summary, the article analyzes in detail, from their internal mechanisms, how HDFS, MapReduce, HBase, and Hive operate, and how the data warehouse and the distributed database are built on top of Hadoop. Any deficiencies will be corrected in follow-up revisions.
HDFS Architecture
The overall Hadoop architecture provides distributed storage through HDFS and distributed parallel task processing through MapReduce (MR).
HDFS adopts a master/slave structure: an HDFS cluster consists of one NameNode and several DataNodes (support for multiple NameNodes was added in the Hadoop 2.2 release; some large companies had previously implemented the same feature by modifying the Hadoop source code). The NameNode, as the master server, manages the file system namespace and client access to files. The DataNodes manage the stored data. HDFS exposes data in the form of files.
Internally, a file is divided into several data blocks, which are stored on a set of DataNodes. The NameNode performs namespace operations of the file system, such as opening, closing, and renaming files or directories, and also maintains the mapping of data blocks to specific DataNodes. The DataNodes handle read and write requests from file system clients, and create, delete, and replicate data blocks under the unified scheduling of the NameNode. The NameNode manages all HDFS metadata, and user data never flows through the NameNode.

Three roles are involved: NameNode, DataNode, and Client.

[Diagram: NameNode, DataNode, Client] The NameNode is the manager, the DataNodes are the file stores, and the Client is the application that needs to access the distributed file system.
File write:
1) The client initiates a file write request to the NameNode.
2) The NameNode returns information about the DataNodes it manages, based on the file size and the block configuration.
3) The client divides the file into blocks and writes them to the DataNodes in sequence, according to the DataNode addresses.
File read:
1) The client initiates a file read request to the NameNode.
2) The NameNode returns the DataNode information for the stored file.
3) The client reads the file data from those DataNodes.
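As a hedged illustration of the read and write flows above, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem client API; the NameNode address and file path are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:8020"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/demo.txt"); // placeholder path

            // Write: the client asks the NameNode for target DataNodes,
            // then streams the blocks to those DataNodes through this stream.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client asks the NameNode which DataNodes hold the
            // blocks, then reads the data directly from those DataNodes.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```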
As a distributed file system, HDFS offers data management ideas worth borrowing:
Placement of blocks: each block has three replicas; one is placed on the DataNode designated by the NameNode, one on a DataNode that is not on the same rack as the designated DataNode, and one on another DataNode in the same rack as the designated DataNode. The backups exist for data safety, in a way that accounts for the failure of an entire rack while keeping the performance cost of replicating data across racks in check.
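A small, hedged sketch of how the replication factor behind this three-copy policy is controlled from the same client API; the dfs.replication property is standard, while the file path is only an example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3); // cluster-wide default replication factor

        try (FileSystem fs = FileSystem.get(conf)) {
            // Override the replication factor for one (example) file.
            fs.setReplication(new Path("/tmp/demo.txt"), (short) 3);
        }
    }
}
```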
MapReduce Architecture
The MR framework consists of a JobTracker running on the master node and a TaskTracker running on each slave node of the cluster. The master node is responsible for scheduling all the tasks that make up a job; these tasks are distributed across different slave nodes. The master node monitors their execution and restarts tasks that have failed, while the slave nodes are only responsible for the tasks assigned to them by the master. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the execution of the TaskTrackers. The JobTracker can run on any machine in the cluster. A TaskTracker is responsible for executing tasks and must run on a DataNode, so a DataNode is both a storage node and a compute node. The JobTracker distributes map tasks and reduce tasks to idle TaskTrackers, which run them in parallel while the JobTracker monitors their progress. If a TaskTracker fails, the JobTracker transfers its tasks to another idle TaskTracker to run again.
HDFS and MR together form the core of the Hadoop distributed system architecture: HDFS implements the distributed file system across the cluster, while MR implements distributed computing and task processing on top of it. HDFS supplies file storage and I/O during MR task processing, and MR, on top of HDFS, handles task distribution, tracking, and execution and collects the results; together they accomplish the main work of the distributed cluster.
Parallel application development on Hadoop is based on the MR programming framework. The principle of the MR programming model is to take a set of input key-value pairs and produce a set of output key-value pairs. The MR library exposes this framework through two functions, map and reduce. The user-defined map function takes an input key-value pair and produces a set of intermediate key-value pairs. MR groups together all intermediate values that share the same key and passes them to a reduce function. The reduce function accepts a key and the associated set of values and merges these values into a smaller set of values. The intermediate values are usually supplied to the reduce function through an iterator (whose role is to deliver these values one at a time), so that sets of values too large to fit in memory can still be processed.
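A minimal sketch of this model, using the classic word-count example with the org.apache.hadoop.mapreduce API (class names here are illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // map: (K1 = line offset, V1 = line text) -> list of (K2 = word, V2 = 1)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce: (K2 = word, [V2...]) -> (K3 = word, V3 = total count);
    // the Iterable lets the framework feed values that may not fit in memory.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```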




In short, a large data set is divided into small chunks, several chunks are assigned to nodes in the cluster for processing, and intermediate results are produced. Within a single node's task, the map function reads the data row by row as (K1, V1) pairs, the output enters a buffer, the framework sorts the map output by key, and the result is emitted as (K2, V2) pairs. Every machine performs the same operation. The (K2, V2) pairs produced on different machines are then merged and sorted (the shuffle, which can be understood as the stage before reduce), and finally reduce merges them into (K3, V3) pairs and writes the output to an HDFS file.
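A hedged driver sketch that wires the stages above together, reusing the illustrative word-count classes from the previous sketch; the input and output paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenMapper.class);  // (K1,V1) -> (K2,V2)
        job.setReducerClass(WordCount.SumReducer.class);  // (K2,[V2]) -> (K3,V3)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Example HDFS paths; the framework sorts and shuffles the map
        // output by key before handing it to the reduce tasks.
        FileInputFormat.addInputPath(job, new Path("/input/text"));
        FileOutputFormat.setOutputPath(job, new Path("/output/wordcount"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```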


Before the data reaches reduce, the intermediate data can be combined (Combine), merging intermediate <key, value> pairs that share the same key. The combine process is similar to reduce, but a combine runs as part of the map task and executes only after the map function has finished. Combining can reduce the number of intermediate key-value pairs and therefore reduce network traffic.
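Continuing the word-count driver sketch above, the reducer can double as a combiner because summation is associative; this single (illustrative) line enables it:

```java
// Run the summing logic on each map task's local output before the shuffle,
// shrinking the number of intermediate <word, 1> pairs sent over the network.
job.setCombinerClass(WordCount.SumReducer.class);
```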

The intermediate results of a map task are stored as files on the local disk after combine and partition are finished. The locations of the intermediate result files are reported to the master JobTracker, which then tells the reduce tasks which DataNodes to fetch their intermediate results from. The intermediate results produced by all map tasks are divided into R pieces by a hash function on their keys, and each of the R reduce tasks is responsible for one key range. Each reduce task fetches, from a number of map task nodes, the intermediate results that fall within its key range and then executes the reduce function, producing a final result. With R reduce tasks there are R final result files; in many cases these R results do not need to be merged into one, because they can serve directly as the input to another parallel computing task. This is what produces the multiple output data fragments (HDFS replicas) shown in the illustration above.
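A minimal sketch of the hash partitioning described here; Hadoop's built-in HashPartitioner behaves essentially like this, and the explicit class below is only for illustration.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Each intermediate key is hashed into one of R buckets; reduce task i then
// pulls bucket i from the local output of every map task.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

In the driver it would be registered with job.setPartitionerClass(HashKeyPartitioner.class) and job.setNumReduceTasks(R).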


HBase Data Management


HBase is the Hadoop database. How does it differ from traditional databases such as MySQL and Oracle? That is, how does column-oriented data differ from row-oriented data, and how does a NoSQL database differ from a traditional relational database?


HBase vs. Oracle


1. HBase is suited to workloads with a large volume of inserts and concurrent reads: enter a key to get a value, or enter some keys to get some values.


2. HBase's bottleneck is hard disk transfer speed. HBase can insert new data and can also "update" data, but an update is actually an insert: it just inserts a new row version with a newer timestamp. A delete is also an insert, just an insert of a row carrying a delete marker. Every HBase operation is an append-style insert. HBase is a log-structured database: it writes to disk the way a log file is written, in large sequential batches, typically as streaming file reads and writes, so its speed depends on how fast the disk can transfer data. Oracle's bottleneck, by contrast, is disk seek time. It reads and writes randomly: to update one piece of data it must first locate the block on disk, read it into memory, modify it in the memory cache, and write it back later. Because different updates touch different blocks, the access pattern is random reads. Seek time is determined mainly by the drive's rotational speed, and since that technology has barely changed, seek time becomes the bottleneck.
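A hedged sketch of this append-only behaviour using the org.apache.hadoop.hbase.client API; the table, row, column family, and qualifier names are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseAppendStyle {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_actions"))) { // placeholder table

            // "Update" = insert a new cell version with a newer timestamp.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"), Bytes.toBytes("Beijing"));
            table.put(put);

            // "Delete" = insert a delete marker (tombstone); the old cells are
            // physically removed later, during compaction.
            Delete delete = new Delete(Bytes.toBytes("user-42"));
            delete.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"));
            table.delete(delete);
        }
    }
}
```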


3. HBase can keep many versions of a cell with different timestamps (that is, the same data can exist in many versions; this redundancy is also an advantage). Because data is ordered by time, HBase is especially good at finding the top N items in chronological order: what someone browsed recently, their latest N blog posts, their latest N actions, and so on, which is why HBase is widely used at Internet companies.
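A hedged sketch of reading several timestamped versions of one cell with the same client API (same placeholder names as above; setMaxVersions is renamed readVersions in newer HBase clients):

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersions {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_actions"))) { // placeholder table

            // Ask for up to 5 timestamped versions of one cell; HBase returns
            // them newest first, which is what makes "recent N" queries cheap.
            Get get = new Get(Bytes.toBytes("user-42"));
            get.setMaxVersions(5);
            get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"));

            Result result = table.get(get);
            List<Cell> cells = result.getColumnCells(Bytes.toBytes("info"), Bytes.toBytes("city"));
            for (Cell cell : cells) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```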


4. HBase's limitations. It supports only very simple key-value queries. It fits scenarios with high-speed inserts combined with a large volume of reads; that is a fairly extreme scenario, and not every company has that kind of demand. For many companies doing ordinary OLTP (online transaction processing) with random reads and writes, Oracle is reliable and the overall system is less of a burden than HBase. A further limitation is that HBase has only a primary-key (row-key) index, which becomes a problem when modeling data: if you want conditional queries on many columns of a table, you can still only build fast lookups on the row key. So it cannot be said in general that one technology is simply superior.


5. Oracle is a row-oriented database, while HBase is column-oriented. The advantage of a column-oriented database shows up in data-analysis scenarios, which differ from traditional OLTP: data analysis often uses one column as the query condition and returns only certain columns rather than all of them. In that situation a row-oriented database responds inefficiently.


Row-oriented databases: take Oracle as an example, where the basic unit of a data file is the block/page and the data in a block is written row by row. The problem is that when we want to read only some columns from a block, we cannot read just those columns; we must read the whole block into memory and then pick out the column values. In other words, to read certain columns of a table, you must read all the rows of the entire table before you can extract those columns. That is the worst aspect of a row-oriented database.


Column-oriented databases: data is stored column by column, so the elements of the same column are packed together in a block. When you want to read certain columns, you only need to read the blocks of those columns into memory, which greatly reduces I/O. In addition, the elements of the same column usually have a similar format, and similar data can be compressed very effectively. A column-oriented database therefore has a big advantage in compression, which saves not only storage space but also I/O. (This becomes useful when data reaches the millions or tens of millions of rows, for query optimization, better performance, and reporting scenarios.)


Hive Data Management


Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for data extraction, transformation, and loading, a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. A structured data file in Hadoop can be mapped to a table in Hive, which then offers SQL-like query functionality, supporting most of SQL except updates, indexes, and transactions. The SQL-like statements are converted into MapReduce jobs to run, acting as a mapper from SQL to MapReduce. Hive provides shell, JDBC/ODBC, Thrift, and web interfaces. Its advantage is low cost: simple MapReduce statistics can be implemented quickly through SQL-like statements. As a data warehouse, Hive's data management can be described at three levels: metadata storage, data storage, and data exchange.
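A minimal sketch of the JDBC interface mentioned above, running a SQL-like query that Hive compiles into MapReduce jobs; the HiveServer2 address and credentials are placeholders, the zz table is the example table used later in this article, and the hive-jdbc driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 address and database.
        String url = "jdbc:hive2://hiveserver-host:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hive turns this aggregate query into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery("SELECT city, COUNT(*) FROM zz GROUP BY city")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```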


(1) Metadata storage


Hive stores its metadata in an RDBMS; there are three ways to connect to the metadata database:


• Embedded mode: metadata is kept in the embedded Derby database; this mode is typically used for unit testing and allows only one session to connect at a time.


• Local (multi-user) mode: install MySQL locally and store the metadata in MySQL.


• Remote mode: metadata is stored in a remote MySQL database.


(2) Data storage


First, Hive has no dedicated data storage format and builds no indexes on the data. Users can organize tables in Hive very freely: it is enough to tell Hive the column and row delimiters of the data when creating the table, and Hive can then parse the data.


Second, all Hive data is stored in HDFS, and Hive has four data models: Table, External Table, Partition, and Bucket.


Table: similar to a table in a traditional database; each table has a corresponding directory in HDFS where its data is stored. For example, for a table zz, its path in HDFS is /wh/zz, where wh is the data warehouse directory specified in hive-site.xml, and all table data (excluding external tables) is kept in this directory.


Partition: similar to an index on a partition column in a traditional database. In Hive, each partition of a table corresponds to a subdirectory, and all the data for that partition is stored in that subdirectory. For example, if the zz table has two partition columns, ds and city, then the HDFS subdirectory corresponding to ds=20140214, city=beijing is /wh/zz/ds=20140214/city=beijing.


Buckets: a hash is computed on a specified column and the data is split by the hash value, which makes parallel processing easier; each bucket corresponds to one file. For example, if the user column is hashed into 32 buckets, the hash of each user value is computed first; the HDFS path for hash = 0 is /wh/zz/ds=20140214/city=beijing/part-00000, and the path for hash = 20 is /wh/zz/ds=20140214/city=beijing/part-00020.
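A hedged sketch (again through JDBC) of a table definition matching the layout described above: ds and city partition columns, with the user column (named user_id here purely for illustration) hashed into 32 buckets.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreatePartitionedBucketedTable {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://hiveserver-host:10000/default"; // placeholder HiveServer2 address
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // Partition columns become subdirectories (.../ds=20140214/city=beijing),
            // and CLUSTERED BY ... INTO 32 BUCKETS hashes user_id into part-000NN files.
            stmt.execute(
                "CREATE TABLE zz (user_id STRING, score INT) " +
                "PARTITIONED BY (ds STRING, city STRING) " +
                "CLUSTERED BY (user_id) INTO 32 BUCKETS " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");
        }
    }
}
```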


External Table: points to data that already exists in HDFS, and it can also have partitions. It has the same metadata organization as a Table, but the actual storage differs significantly. For a Table, creation and data loading can be done with a single statement; the actual data is moved into the data warehouse directory, and later access to the data goes directly through the warehouse directory. When the table is dropped, both the data and the metadata are deleted. For an External Table there is only one step, because creating the table and loading the data happen at the same time: the actual data is kept at the HDFS path given after LOCATION and is not moved into the data warehouse.
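Continuing with the same kind of JDBC Statement as in the previous sketch, a hedged example of an external table whose data stays at its HDFS LOCATION (path and names are placeholders):

```java
// Data remains at the given path and is not moved into the warehouse directory;
// dropping the table later removes only the metadata.
stmt.execute(
    "CREATE EXTERNAL TABLE zz_ext (user_id STRING, score INT) " +
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' " +
    "LOCATION '/data/existing/zz'");
```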


(3) Data exchange


  · User interface: includes the command-line client, the web interface, and the database (JDBC/ODBC) interface.


  · Metadata storage: usually kept in a relational database such as MySQL or Derby.


  · Hadoop: Store with HDFS and compute using MapReduce.


Key point: Hive stores metadata in a database such as MySQL or Derby. The metadata in Hive includes the table name, the table's columns and partitions and their properties, the table's own properties (such as whether it is an external table), the directory where the table's data resides, and so on.


Hive data is stored in HDFS, and most queries are executed by MapReduce jobs.


Summary:


This article introduced the core of the Hadoop distributed computing platform: HDFS, the MapReduce processing flow, the data warehouse tool Hive, and the distributed database HBase, covering essentially all the technical core of the Hadoop platform. From system architecture to data definition to data storage, and from the macro view down to the micro mechanisms, it lays the foundation for large-scale data storage and task processing on the Hadoop platform.
