By introducing the core distributed file system HDFS and the MapReduce processing flow of the Hadoop distributed computing platform, together with the data warehouse tool Hive and the distributed database HBase, this article covers the main technical cores of the Hadoop distributed platform.
This summary of my study so far analyzes, from the point of view of their internal mechanisms, how HDFS, MapReduce, HBase and Hive work, and how a data warehouse and a distributed database are actually built on top of Hadoop. Any deficiencies will be corrected in follow-up revisions.
HDFS Architecture
The architecture of Hadoop as a whole relies on HDFS for the underlying support of distributed storage and on MapReduce (MR) for the programmatic support of distributed parallel task processing.
HDFS uses a master/slave structure: an HDFS cluster consists of one NameNode and several DataNodes (support for multiple NameNodes was introduced in Hadoop 2.2; before that, some large companies achieved the same by modifying the Hadoop source code). The NameNode, as the master server, manages the file system namespace and client access to files; the DataNodes manage the data they store. HDFS exposes data in the form of files.
Internally, a file is split into data blocks, which are stored on a set of DataNodes. The NameNode executes namespace operations on the file system, such as opening, closing and renaming files or directories, and maintains the mapping from data blocks to specific DataNodes. The DataNodes serve read and write requests from file system clients and, under the unified scheduling of the NameNode, create, delete and replicate data blocks. The NameNode manages all HDFS metadata; user data never flows through the NameNode.
Figure: HDFS architecture diagram
The figure involves three roles: NameNode, DataNode, and Client. The NameNode is the manager, the DataNodes store the files, and the Client is the application that needs to access the distributed file system.
File write:
1) The client initiates a file write request to the NameNode.
2) Based on the file size and the block configuration, the NameNode returns to the client information about the DataNodes it manages.
3) The client splits the file into blocks and writes each block, in order, to the DataNodes at the returned addresses.
File read:
1) The client initiates a file read request to the NameNode.
2) The NameNode returns the information of the DataNodes that store the file.
3) The client reads the file data from those DataNodes.
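The following is a minimal sketch of this write/read flow from the client's point of view, using the standard Hadoop FileSystem Java API; the NameNode address and the file path are hypothetical placeholders.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");          // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the file's blocks to those DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client asks the NameNode which DataNodes hold the blocks,
        // then reads the block data directly from those DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}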
As a distributed file system, HDFS has several aspects of data management worth borrowing.
Placement of blocks: each block has three replicas, one on the DataNode specified by the NameNode, one on a DataNode that is not on the same rack as the specified one, and one on a DataNode that is on the same rack as the specified one. The purpose of replication is data safety, and this placement takes into account both the failure of an entire rack and the performance of reading the different replicas.
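This placement can be observed from the client side by asking the NameNode where the replicas of a file's blocks ended up; a small sketch using the standard FileSystem API follows, with a hypothetical file path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/demo/hello.txt"));
        // Ask the NameNode where the replicas of each block were placed;
        // with the default replication factor of 3, each block lists three hosts.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}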
MapReduce Architecture
The MapReduce (MR) framework consists of a single JobTracker running on the master node and a TaskTracker running on each slave node of the cluster. The master node is responsible for scheduling all the tasks that make up a job, and these tasks are distributed across different slave nodes; the master node monitors their execution and re-runs tasks that have failed, while the slave nodes are responsible only for the tasks assigned to them. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the execution of the TaskTrackers. The JobTracker can run on any machine in the cluster. The TaskTrackers execute the tasks and must run on DataNodes, so a DataNode is both a storage node and a compute node. The JobTracker distributes map tasks and reduce tasks to idle TaskTrackers, which run them in parallel, and it monitors how the tasks are running; if a TaskTracker fails, the JobTracker transfers its tasks to another idle TaskTracker.
HDFS and MapReduce together form the core of the Hadoop distributed system architecture: HDFS implements the distributed file system across the cluster, and MapReduce provides distributed computing and task processing on the cluster. HDFS supplies the file storage and I/O support that MapReduce needs while processing tasks, and MapReduce on top of HDFS carries out task distribution, tracking and execution and collects the results; together they accomplish the main work of the distributed cluster.
Parallel application development on Hadoop is based on the MapReduce programming model, whose principle is: from a set of input key/value pairs, produce a set of output key/value pairs. The MapReduce library exposes this model through two functions, map and reduce. The user-defined map function takes an input key/value pair and produces a set of intermediate key/value pairs. MapReduce groups together all the intermediate values that share the same key and passes them to the reduce function. The reduce function accepts a key together with the values associated with it and merges those values into a smaller set of values. The intermediate values are usually fed to the reduce function through an iterator (whose role is to gather up these values), so that lists of values too large to fit in memory can still be handled.
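As a concrete illustration of the model, here is the classic word-count example written against the Hadoop MapReduce Java API; the class names are my own, and the (K1,V1)/(K2,V2)/(K3,V3) comments match the notation used below.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: (K1,V1) = (line offset, line text) -> intermediate (K2,V2) = (word, 1)
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce: (K2, list of V2) -> final (K3,V3) = (word, total count);
    // the values arrive through an iterator, so they need not all fit in memory.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}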
In short, a large data set is split into small chunks, and each chunk is assigned to a node of the cluster, which processes it and produces intermediate results. Within a single map task, the map function reads the input one record (K1,V1) at a time into a buffer, and the framework sorts the map output by key, producing intermediate pairs (K2,V2). Every machine performs the same operation. The (K2,V2) pairs produced on the different machines are then merged and sorted during the shuffle (which can be understood as the stage before reduce); finally reduce merges them into (K3,V3) and the output is written to HDFS files.
Before the data reaches reduce, intermediate pairs that share the same key can be merged locally (combine). The combine step works much like reduce, but it is part of the map task and runs only after the map function has finished. Combining reduces the number of intermediate key/value pairs and therefore the network traffic.
After combine and partition, the intermediate results of a map task are stored as files on the local disk. The locations of these intermediate files are reported to the master JobTracker, which then tells the reduce tasks which nodes to fetch their intermediate results from. All the intermediate results produced by the map tasks are divided, by a hash function on the key, into R partitions, and each of the R reduce tasks is responsible for one key range. Each reduce task fetches from a number of map-task nodes the intermediate results that fall within its key range and then executes the reduce function, producing one final result file. With R reduce tasks there are R result files, and in many cases these R files are not merged into a single result, because they can be used directly as the input of another parallel computing task. This is what the multiple output data fragments (HDFS replicas) in the figure above represent.
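To tie these pieces together, here is a sketch of a job driver for the word-count classes above, showing where the combiner, the hash partitioner and the number R of reduce tasks are configured; the input/output paths and R = 4 are arbitrary choices for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setCombinerClass(WordCount.SumReducer.class); // local merge inside each map task
        job.setReducerClass(WordCount.SumReducer.class);
        job.setPartitionerClass(HashPartitioner.class);   // hash(key) mod R (this is the default)
        job.setNumReduceTasks(4);                         // R = 4 reduce tasks -> 4 output files
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}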
HBase Data Management
HBase is the Hadoop database. How does it differ from traditional databases such as MySQL and Oracle? In other words, what distinguishes column-oriented storage from row-oriented storage, and what distinguishes a NoSQL database from a traditional relational database?
HBase vs. Oracle
1. HBase suits scenarios with a large volume of inserts and, at the same time, many reads: give it a key and get back a value, or give it some keys and get back some values (a sketch of this access pattern follows this list).
2. HBase's bottleneck is hard disk transfer speed. HBase can insert data and also update it, but an update is really an insert: it simply inserts a new row with a newer timestamp. A delete is also an insert, just an insert of a row carrying a delete marker. Every HBase operation is therefore an append-style insert. HBase is a log-structured database: it stores data much like a log file, writing to disk in large sequential batches, so read/write speed depends on how fast data can be transferred between the disk and the machine. Oracle's bottleneck, by contrast, is disk seek time. It frequently reads and writes randomly: to update a piece of data it must first locate the block on disk, read it into memory, modify it in the buffer cache, and write it back later. Because different operations touch different blocks, access is random, and seek time is determined mainly by the rotational speed of the drive; seek technology has barely improved, so it becomes the bottleneck.
3. HBase can keep many versions of the same data, distinguished by timestamp (the same data can be replicated into many different versions; this redundancy is also an advantage). Because the data is ordered by time, HBase is especially good at finding the top N items in chronological order, such as what someone browsed recently or their latest N blog posts or N actions, which is why HBase is used so heavily on the Internet.
4. HBase has limitations. It can only answer very simple key-value queries, and it fits scenarios of high-speed insertion combined with a large volume of reads. That is a rather extreme scenario, and not every company has that kind of demand; many run ordinary OLTP (online transaction processing) workloads with random reads and writes, and in that case Oracle's reliability gives it the edge over HBase. Another limitation is that HBase has only a row-key (primary key) index, which makes data modeling awkward: if you want conditional queries on many columns of a table, fast lookups can only be built on the row key. So one cannot say in general that the technology is superior.
5. Oracle is a row-oriented database, while HBase is column-oriented. The advantage of a column-oriented database shows in data analysis scenarios, which differ from traditional OLTP: data analysis usually filters on one column and returns only certain columns rather than all of them, and for that access pattern a row-oriented database responds inefficiently.
Row-oriented database: take Oracle, whose data files use the block/page as the basic unit, with the data in a block written row by row. The problem is that when we want to read only some columns in a block, we cannot read just those columns; we must read the whole block into memory and then pick the column values out of it. In other words, to read some columns of a table you must read all of its rows first, which is the weakest point of a row-oriented database.
Column-oriented database: data is stored column by column, so the values of the same column are packed together in a block. When you want to read certain columns, you only need to read the blocks of those columns into memory, so much less I/O is done. In addition, the values of a single column usually share a similar format, which means they compress very well, so a column-oriented database has a big advantage in compression: compression saves not only storage space but also I/O. (This can be used to optimize queries and improve performance once the data grows to millions or tens of millions of rows.)
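Here is a minimal sketch of the HBase access pattern described in points 1-3, using the HBase client Java API: a Put is an append with a new timestamp, a Get fetches by row key, and several timestamped versions of a cell can be requested. The table, column family and row key names are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_actions"))) {

            // Insert: every write (including an "update") is appended with a new timestamp.
            Put put = new Put(Bytes.toBytes("user123"));
            put.addColumn(Bytes.toBytes("a"), Bytes.toBytes("last_page"), Bytes.toBytes("/home"));
            table.put(put);

            // Read by row key, asking for up to 3 timestamped versions of each cell.
            Get get = new Get(Bytes.toBytes("user123"));
            get.setMaxVersions(3);
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("a"), Bytes.toBytes("last_page"))));
        }
    }
}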
Hive Data Management
Hive is a data warehouse infrastructure built on Hadoop. It provides a series of tools for data extraction, transformation and loading, and a mechanism for storing, querying and analyzing large-scale data kept in Hadoop. A structured data file in Hadoop can be mapped to a table in Hive, which then offers SQL-like query functionality, except that updates, indexes and transactions are not supported. SQL statements are translated into MapReduce jobs for execution, so Hive acts as an SQL-to-MapReduce mapper. It provides shell, JDBC/ODBC, Thrift and web interfaces. Its advantage is low cost: simple MapReduce statistics can be implemented quickly through SQL-like statements. As a data warehouse, Hive's data management can be described, according to how it is used, from three aspects: metadata storage, data storage, and data exchange.
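For instance, through the JDBC interface a client can submit an SQL-like (HiveQL) query that Hive turns into MapReduce jobs. Below is a minimal sketch; the host name, port, credentials and queried table are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // The SQL-like query below is compiled by Hive into MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT city, COUNT(*) FROM zz GROUP BY city")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}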
(1) Metadata storage
Hive stores its metadata in an RDBMS, and there are three ways to connect to that database:
• Embedded mode: the metadata is kept in an embedded Derby database; this is typically used for unit testing and allows only one session to connect at a time.
• Multi-user mode: install MySQL locally and keep the metadata in MySQL.
• Remote mode: the metadata is kept in a MySQL database on a remote server.
(2) Data storage
First, Hive has no dedicated data storage format and does not build indexes on the data; tables can be organized in Hive very freely. You only need to tell Hive the column and row delimiters of the data when creating the table, and Hive can then parse the data.
Second, all Hive data is stored in HDFS, and Hive has four data models: Table, External Table, Partition, and Bucket.
Table: similar to a table in a traditional database; every table has a corresponding directory in HDFS where its data is stored. For example, a table zz has the HDFS path /wh/zz, where /wh is the data warehouse directory specified by ${hive.metastore.warehouse.dir} in hive-site.xml, and all table data (excluding external tables) is kept under this directory.
Partition: analogous to an index on a partition column in a traditional database. In Hive, each partition of a table corresponds to a subdirectory of the table's directory, and all the data of a partition is stored in that directory. For example, if the zz table has the two partition columns ds and city, then the HDFS subdirectory for ds=20140214, city=beijing is /wh/zz/ds=20140214/city=beijing.
Bucket: a hash is computed over a specified column and the data is split by the hash value, to make parallel processing easier; each bucket corresponds to one file. For example, spreading the user column over 32 buckets first computes a hash of the user column's value; the HDFS directory corresponding to hash=0 is /wh/zz/ds=20140214/city=beijing/part-00000, and the directory corresponding to hash=20 is /wh/zz/ds=20140214/city=beijing/part-00020.
External Table: points to data that already exists in HDFS, and partitions can be created for it as well. It is organized the same way as a Table in the metadata, but the actual storage differs significantly. For a Table, creation and data loading can be done in one statement; the actual data is moved into the data warehouse directory, later access to the data is served directly from that directory, and when the table is dropped, both the data and the metadata are deleted. An External Table involves only one step, because loading the data and creating the table happen at the same time: the actual data is kept at the HDFS path given after LOCATION and is not moved into the data warehouse directory.
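As a sketch of the DDL behind these storage models, issued through the same JDBC interface as in the earlier example: the column names and the external location are hypothetical, while the table name, partition columns and bucket count follow the zz example above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveDdlDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // Managed table with partition columns (ds, city) and 32 buckets on user_id:
            // its data lands under <warehouse dir>/zz/ds=.../city=.../part-000xx in HDFS.
            stmt.execute("CREATE TABLE zz (user_id STRING, score INT) "
                    + "PARTITIONED BY (ds STRING, city STRING) "
                    + "CLUSTERED BY (user_id) INTO 32 BUCKETS "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

            // External table: only metadata is created; the data stays at the LOCATION
            // path and is not moved into the data warehouse directory.
            stmt.execute("CREATE EXTERNAL TABLE zz_ext (user_id STRING, score INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
                    + "LOCATION '/data/zz_ext'");
        }
    }
}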
(3) Data exchange
• User interface: includes the client, the web interface, and the database (JDBC/ODBC) interface.
• Metadata storage: usually kept in a relational database such as MySQL or Derby.
• Hadoop: data is stored with HDFS and computation is done with MapReduce.
Key point: Hive stores its metadata in a database such as MySQL or Derby. Hive's metadata includes the table name, the table's columns and partitions and their properties, the table's own properties (for example whether it is an external table), the directory where the table data resides, and so on.
Hive's data is stored in HDFS, and most queries are executed as MapReduce jobs.
Summary:
This article has introduced the core distributed file system HDFS and the MapReduce processing flow of the Hadoop distributed computing platform, as well as the data warehouse tool Hive and the distributed database HBase, essentially covering the technical cores of the Hadoop distributed platform. Moving from system architecture to data definition to data storage, and from the macro level down to the details, it lays the foundation for large-scale data storage and task processing on the Hadoop platform.
This article is from the China Statistics Network; please indicate the source when reprinting.