Hadoop data management mainly covers three components: Hadoop's distributed file system HDFS, the distributed database HBase, and the data warehouse tool Hive.
1. HDFS Data Management
HDFS is the cornerstone of distributed computing in Hadoop. The Hadoop distributed file system shares many characteristics with other distributed file systems:
A single namespace for the entire cluster;
Data consistency: a write-once-read-many model. A client cannot see a file until the file has been successfully created;
Files are divided into blocks, each block is distributed to DataNodes, and replica blocks are configured to keep the data safe.
HDFS manages the file system through three important roles: the NameNode, the DataNode, and the client. The NameNode can be viewed as the manager of the distributed file system: it is mainly responsible for the file system namespace, cluster configuration information, and block replication. The NameNode keeps the file system metadata in memory; this metadata mainly includes file information, the list of blocks that make up each file, and the DataNodes on which each block resides. The DataNode is the basic unit of file storage: it stores blocks in its local file system, keeps the metadata of all its blocks, and periodically reports all existing block information to the NameNode. The client is the application that needs to access files in the distributed file system. The following three operations describe how HDFS manages data.
File Writing
1) The client initiates a file write request to the NameNode.
2) Based on the file size and the file block configuration, the NameNode returns to the client information about the DataNodes it manages.
3) The client divides the file into blocks and, using the DataNode address information, writes the blocks to the DataNodes in sequence (a minimal client-side sketch follows).
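Here is that sketch, using the Hadoop FileSystem API. It assumes a cluster reachable through the default Configuration; the path /tmp/example.txt is a hypothetical example. The NameNode interaction of steps 1) and 2) happens inside fs.create(), and the returned stream pipelines the blocks to the DataNodes as in step 3).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);     // client handle to the distributed file system
        Path file = new Path("/tmp/example.txt"); // hypothetical target path
        // create() asks the NameNode for target DataNodes; the stream then
        // splits the written bytes into blocks and sends them to those DataNodes
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```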
File Reading
1) The client initiates a file read request to the NameNode.
2) The NameNode returns information about the DataNodes that store the file's blocks.
3) The client reads the file data from those DataNodes (the corresponding read sketch follows).
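The read path against the same hypothetical file, again as a minimal sketch: fs.open() covers steps 1) and 2), and reading from the stream covers step 3).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // open() asks the NameNode which DataNodes hold the file's blocks;
        // the stream then reads the block data from those DataNodes
        try (FSDataInputStream in = fs.open(new Path("/tmp/example.txt"))) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```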
File Block Replication
1) The NameNode finds that some file blocks do not meet the minimum number of replicas, or that some DataNodes have failed.
2) It notifies the DataNodes to replicate blocks to one another.
3) The DataNodes replicate the blocks directly among themselves.
As a distributed file system, HDFS has several mechanisms in its data management that are worth borrowing:
Placement of file blocks: each block has three replicas. One is stored on the DataNode specified by the NameNode, one on a DataNode on a different machine from the specified DataNode, and one on a DataNode in a different rack from the specified DataNode. The purpose of replication is data safety, and this placement policy weighs the risk of an entire rack failing against the performance cost of copying data between racks.
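The replication factor behind this placement policy is configurable, both cluster-wide and per file. A small sketch, reusing the hypothetical path from above; dfs.replication is the standard Hadoop setting, and setReplication() is the per-file API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3"); // default number of replicas per block
        FileSystem fs = FileSystem.get(conf);
        // raise an existing file's replication to 4; the NameNode schedules
        // the extra copies onto DataNodes according to the placement policy
        fs.setReplication(new Path("/tmp/example.txt"), (short) 4);
        fs.close();
    }
}
```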
Heartbeat detection: the NameNode uses heartbeats to check the health of each DataNode. If a problem is found, the replicated data is used to keep the data safe.
Data replication (in scenarios such as DataNode failure, rebalancing DataNode storage utilization, and balancing data-access pressure): Hadoop lets you configure a threshold for the HDFS balancer command (for example, hdfs balancer -threshold 10) to balance disk utilization across DataNodes. When the balancer runs, it first computes the average disk utilization of all DataNodes; then, if a DataNode's disk utilization exceeds that average by more than the threshold, blocks are moved from it to DataNodes with low disk utilization. This is very useful when new nodes are added.
Data verification: CRC32 is used for data verification. When a file block is written, checksum information is written along with the data; when the data is read back, it is verified against the checksum before being handed over.
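The principle is easy to illustrate with the CRC32 class from the JDK: the checksum computed at write time is stored with the block, and a mismatch at read time marks the block as corrupt. This is a simplified sketch of the idea, not HDFS's internal implementation.

```java
import java.util.zip.CRC32;

public class ChecksumExample {
    public static void main(String[] args) {
        byte[] block = "example block data".getBytes();

        // write path: compute a CRC32 checksum and store it with the block
        CRC32 writer = new CRC32();
        writer.update(block, 0, block.length);
        long storedChecksum = writer.getValue();

        // read path: recompute and compare before handing data to the client
        CRC32 reader = new CRC32();
        reader.update(block, 0, block.length);
        if (reader.getValue() != storedChecksum) {
            throw new RuntimeException("block corrupt: read another replica");
        }
        System.out.println("checksum ok");
    }
}
```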
Safe mode: when the distributed file system starts up, it is in safe mode (safe mode can also be entered with a command, for example hdfs dfsadmin -safemode enter). While the file system is in safe mode, its contents cannot be modified or deleted until safe mode ends. Safe mode is mainly used to check the validity of the data blocks on each DataNode at startup, and to copy or delete data blocks as necessary according to policy. In practice, if you modify or delete a file while the system is starting, you will get an error saying the file cannot be modified in safe mode; you only need to wait a moment.
2. HBase Data Management
HBase is a distributed database modeled on Bigtable. Like Bigtable, it is a sparse, persistent (stored on disk), multidimensional sorted table. The table is indexed by row key, column key, and timestamp, and each value is an uninterpreted string with no type. Users store data rows in tables; each row has a sortable row key and an arbitrary number of columns. Because the data is stored sparsely, rows in the same table can have very different columns. Column names have the form "<family>:<label>", both parts being arbitrary strings. Each table has a fixed set of column families, similar to a table schema: you can only change a table's family set by altering the table structure, whereas the labels can vary from row to row.
HBase stores data belonging to the same column family in the same directory, and HBase write operations lock the row: each row is an atomic element that can be locked. Every database update carries a timestamp, and each update creates a new version; HBase retains a configurable number of versions. The client can choose to get the version closest to a given point in time, or fetch all versions at once. For details, see http://jiajun.iteye.com/blog/899632.
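A brief sketch with the classic (pre-1.0) HBase client API of that era shows the "<family>:<label>" addressing and the version mechanism. The table name pages and the family content are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseVersionExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "pages");

        // each Put locks its row, applies atomically, and is stamped with a timestamp
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"), // family:label
                Bytes.toBytes("<html>v2</html>"));
        table.put(put);

        // fetch up to three retained versions of the content:html cell
        Get get = new Get(Bytes.toBytes("row1"));
        get.setMaxVersions(3);
        Result result = table.get(get);
        System.out.println(result);
        table.close();
    }
}
```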
How does HBase, as a distributed database, manage the data of a cluster? HBase manages data in a distributed cluster through an architecture consisting of ZooKeeper, the master, the region servers, and the client.
1) ZooKeeper
ZooKeeper guarantees that there is only one active master in the cluster at any time; stores the addressing entry point of all regions; monitors the status of the region servers in real time, notifying the master as region servers come online or go offline; and stores the HBase schema, including the tables and each table's column families.
2) Master
An HBase cluster deploys only one active master server, using a leader election algorithm to guarantee that exactly one master is active; ZooKeeper stores the master's server address. If the master fails, a new master is chosen from the standby servers by the same leader election algorithm.
The master assigns regions to the region servers and is responsible for load balancing across them; it also detects failed region servers and reassigns the regions that were on them.
3) Region Server
The region server maintains the regions assigned to it by the master and handles IO requests to those regions. It is also responsible for splitting regions that grow too large at runtime.
4) Client
The client provides the interfaces for accessing HBase and maintains caches to speed up access, such as the location information of regions.
3. Hive Data Management
Hive is a data warehouse infrastructure built on top of Hadoop. It provides a series of tools for extracting, transforming, and loading data, and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called QL, which lets users familiar with SQL query the data. As a data warehouse, Hive's data management can be described from three aspects: metadata storage, data storage, and data exchange.
1) Metadata Storage
Hive stores its metadata in an RDBMS. There are three modes for connecting to the metadata database:
Single User Mode: connects to an in-memory Derby database; generally used for unit tests.
Multi User Mode: connects to a database over a network; this is the most commonly used mode.
Remote Server Mode: used by non-Java clients to access the metadata database. A MetaStoreServer is started on the server side, and clients talk to the metadata database through the MetaStoreServer using the Thrift protocol.
Whichever mode is used, the metadata itself is stored in a relational database, such as MySQL or Derby.
2) Data Storage
First, Hive has no special data storage format and builds no indexes on its data. Users can organize tables in Hive quite freely: when creating a table, you only need to tell Hive the column separator and row separator used in the data, and Hive can then parse it.
Secondly, all Hive data is stored in HDFS. Hive contains four data models: table, external table, partition, and bucket (see the DDL sketch below).
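A sketch of the separators and data models expressed in DDL, issued here through the HiveServer2 JDBC driver; the connection URL, the table name logs, and the column layout are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveDdlExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // only the column and row separators are declared; Hive uses them to
        // parse the files stored under the table's HDFS directory
        stmt.execute("CREATE TABLE logs (ts STRING, msg STRING) "
                + "PARTITIONED BY (dt STRING) "       // partition = HDFS subdirectory
                + "CLUSTERED BY (ts) INTO 4 BUCKETS " // bucket = file within it
                + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                + "LINES TERMINATED BY '\\n'");
        conn.close();
    }
}
```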
3) Data Exchange
Data exchange is divided into the following parts:
User Interface: includes the client, Web interface, and database interface.
Metadata storage: "data about data." Hive's metadata includes table names, the columns and partitions of each table and their attributes, table attributes (such as whether a table is external), and the directory where each table's data is located.
Interpreter, compiler, and optimizer: these take an HQL query statement through lexical analysis, syntax analysis, compilation, and optimization to a query plan. The generated query plan is stored in HDFS and then executed by MapReduce.
Hadoop: Hive data is stored in HDFS, and most queries are completed by MapReduce (queries containing only *, such as SELECT * FROM tbl, do not generate MapReduce jobs).
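The contrast can be seen from the same JDBC connection (again a sketch against the assumed logs table): a bare SELECT * streams the table's files straight out of HDFS, while an aggregation is first compiled into a MapReduce job.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // served by reading the table's HDFS files directly: no MapReduce job
        ResultSet all = stmt.executeQuery("SELECT * FROM logs");

        // compiled by the interpreter/compiler/optimizer into a MapReduce job
        ResultSet perDay = stmt.executeQuery(
                "SELECT dt, count(*) FROM logs GROUP BY dt");
        while (perDay.next()) {
            System.out.println(perDay.getString(1) + "\t" + perDay.getLong(2));
        }
        conn.close();
    }
}
```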
4. Integration of Hive and HBase
HBase is a distributed, non-relational database based on column storage. Its query efficiency is very high, mainly because queries locate and return results directly.
Hive, by contrast, is a distributed data warehouse service with a relational, SQL-like interface. It is mainly used for parallel distributed processing of large amounts of data. Every Hive query except "select * from table;" must be executed through MapReduce; because of that overhead, querying a table with only one row and one column can take 8 to 9 seconds if it is not queried with "select * from table;". But Hive is better at processing huge volumes of data: when there is a lot of data to process and the Hadoop cluster has enough capacity, its advantages show. Hive and HBase can be integrated through Hive's storage interface.
The integration of Hive and HBase is implemented through the external API interfaces that the two systems themselves expose; the communication between them relies mainly on the hive_hbase-handler.jar tool class (Hive Storage Handlers), as sketched below.
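What the storage handler looks like in practice, as a hedged sketch: a Hive table is declared over an existing HBase table, with hbase.columns.mapping binding Hive columns to the row key and to family:label columns. The table and column names here are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveHBaseIntegration {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();
        // ":key" maps to the HBase row key, "cf:val" to a column in family "cf"
        stmt.execute("CREATE TABLE hbase_logs (key STRING, val STRING) "
                + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val') "
                + "TBLPROPERTIES ('hbase.table.name' = 'logs')");
        conn.close();
    }
}
```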
The above has described Hadoop data management in terms of Hadoop's distributed file system HDFS, the distributed database HBase, and the data warehouse tool Hive. Through their own data definitions and architectures, they implement data management at every level from the macroscopic to the microscopic, together completing large-scale data storage and task processing on the Hadoop platform.