Hadoop Learning: Basic Concepts


HDFS is designed to store large files. It typically runs on clusters of commodity servers and uses a streaming data access model. HDFS keeps the metadata of all files in the memory of the name node (NameNode), which allows it to manage "large" files (gigabytes or bigger) efficiently in a distributed way. For applications with huge numbers of small files, however, the name node's memory comes under great pressure and is likely to become a performance bottleneck. Furthermore, HDFS was designed for the MapReduce computing framework: the data it stores is mainly used for subsequent processing and analysis, and its access model is "write once, read many". Once data is stored in HDFS, new data can only be appended to the end of a file; existing file contents cannot be modified. Finally, HDFS is optimized for the efficient transfer of large files, and to achieve this it makes a significant concession on low latency, so it is not suitable for applications that require very short access delays.
HDFS is a file system that works in user space, and its directory tree is independent. Unlike a traditional file system working in kernel space, it is not mounted into the operating system's directory tree, so the usual file and directory management commands such as ls and cat cannot be used on it directly.
Name Node (NameNode) and Data Node (DataNode)

The name node manages the HDFS namespace, the tree-organized metadata of directories and files. This metadata is persisted on the name node's local disk as two files: the namespace image and the edit log. The name node does not store data blocks itself; it only needs to know which blocks each file consists of, that is, on which data nodes each block is actually stored. However, the name node does not persist this block location information, because it is rebuilt when the HDFS cluster starts from the block reports sent by each data node.
The main tasks of a data node are to store or read data blocks according to requests from the name node or from clients, and to periodically report the blocks it holds to the name node.
By default, HDFS keeps three replicas of every data block in the cluster to ensure data reliability, availability, and performance. In a large cluster, the three replicas are typically placed on data nodes in different racks to cope with two common failures: the failure of a single data node, and a network failure that takes all hosts in one rack offline. In addition, as described in the earlier MapReduce running model, keeping multiple replicas of a block also helps MapReduce handle node failures transparently while a job is executing, and gives its data-local processing a practical basis for better performance. Based on the periodic reports from the data nodes, the name node checks whether the number of replicas of each block meets the requirement: blocks with fewer replicas than configured are replenished, and extra replicas are discarded.
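As an illustration of how replica counts are managed, the short Java sketch below reads a file's current replication factor and asks the name node to keep three replicas of each of its blocks. It is a minimal sketch against the standard org.apache.hadoop.fs.FileSystem API; the file path is a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationCheck {
        public static void main(String[] args) throws Exception {
            // Uses the cluster settings found on the classpath (core-site.xml and friends).
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/data.log");   // placeholder path

            FileStatus status = fs.getFileStatus(file);
            System.out.println("current replication: " + status.getReplication());

            // Ask the name node to keep 3 replicas of every block of this file;
            // the actual copying or discarding of replicas is done later by the data nodes.
            fs.setReplication(file, (short) 3);
        }
    }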

Accessing the HDFS File System

HDFS user interfaces:
1. the hadoop dfs command-line interface;
2. the hadoop dfsadmin command-line interface;
3. the web interface;
4. the HDFS API.
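To make interface 4 concrete, here is a minimal, hedged sketch of writing and then reading a small file through the Java HDFS API. The name node address hdfs://namenode:8020 and the file path are placeholders; substitute the values from your own core-site.xml.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsApiExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address; in practice taken from fs.default.name in core-site.xml.
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // Write: the name node allocates blocks, the client streams data to data nodes.
            Path file = new Path("/user/demo/hello.txt");
            FSDataOutputStream out = fs.create(file);
            out.write("hello hdfs\n".getBytes("UTF-8"));
            out.close();

            // Read: the name node returns block locations, data is fetched from data nodes.
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
            System.out.println(in.readLine());
            in.close();
            fs.close();
        }
    }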

Writing Data

When a client needs to store a file and write data, it first sends a namespace update request to the name node. The name node checks the user's access permissions and whether the file already exists; if there is no problem, it picks an appropriate data node and allocates a free data block to the client. The client then sends the data directly to that data node, and once the block is stored, the data node copies it to other nodes according to the name node's instructions.

1. Before saving data to the HDFS cluster, the HDFS client needs to know the block size used by the target file system and the replication factor, that is, the number of replicas to keep for each block. Before submitting the data, the client splits the file to be saved according to the block size and sends a block storage request to the name node, asking for as many free blocks as the replication factor, assumed here to be 3;
2. The name node identifies at least 3 data nodes (matching the replication factor) with free blocks available and responds to the client with the addresses of these 3 nodes, ordered from nearest to farthest from the client;
3. The client sends the data only to the nearest data node (assume DN1). When this nearest node has stored the block, it copies the block to one of the remaining data nodes (assume DN2), and once that transfer completes, DN2 synchronizes the block to the last data node (assume DN3). This chain is known as the replication pipeline;
4. When all three data nodes have stored the block, they notify the name node that the store is complete, and the name node in turn notifies the client;
5. The client stores all remaining data blocks in the same way and asks the name node to close the file once every block has been stored; the name node then writes the file's metadata to persistent storage.
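The block size and replication factor mentioned in step 1 can also be chosen per file when it is created. The sketch below uses the FileSystem.create overload that takes both values explicitly; the 64 MB block size and the path are illustrative assumptions, not cluster defaults you must use.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithBlockSize {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            FSDataOutputStream out = fs.create(
                    new Path("/user/demo/big.dat"),  // placeholder path
                    true,                            // overwrite if the file exists
                    4096,                            // I/O buffer size in bytes
                    (short) 3,                       // replication factor for each block
                    64L * 1024 * 1024);              // block size: 64 MB
            // ... write the file's data here, then close so the name node can finalize it.
            out.close();
        }
    }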

Reading Data

HDFS provides a POSIX-like access interface, and all data operations are transparent to client programs. When a client needs to access data in HDFS, it first establishes a TCP/IP connection to the port the name node listens on and then issues a read request through the client protocol. The name node returns the identifiers (block IDs) of the blocks that make up the requested file, together with the data nodes storing each block. The client then sends requests to the ports those data nodes listen on and retrieves the blocks it needs.

1. The client asks the name node for access to a file;
2. The name node responds with two lists: (a) all the data blocks the file consists of, and (b) for each block, the data nodes that store it;
3. The client reads each data block from the nearest data node in its list and merges the blocks locally.
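Step 2 can be observed directly through the API: FileSystem.getFileBlockLocations returns, for each block of a file, the hosts holding a replica, which is the list a client uses to pick the nearest data node. A minimal sketch (the path is a placeholder):

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/user/demo/big.dat")); // placeholder

            // One BlockLocation per block, covering the whole file (offset 0 to its length).
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset()
                        + ", length " + b.getLength()
                        + ", hosts " + Arrays.toString(b.getHosts()));
            }
        }
    }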

Reliability of the Name Node

If the name node goes down, all data in the HDFS file system becomes unavailable; and if the namespace image or the edit log on the name node is corrupted, the entire HDFS cannot even be rebuilt and all data is lost. For the sake of data availability and reliability, additional mechanisms must therefore be provided to cope with such failures, and Hadoop offers two solutions.
The simplest approach is to keep multiple copies of the name node's persistent metadata on different storage devices in real time. Through property configuration, the Hadoop name node can use several namespace storage devices, and it writes to all of them synchronously. When the name node fails, a new physical host can load an available copy of the namespace image and of the edit log to rebuild the namespace. Depending on the size of the edit log and of the cluster, however, this process may take a long time.
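For illustration, the value of the relevant property is simply a comma-separated list of directories, one of which is often an NFS mount. It is normally set in hdfs-site.xml on the name node; the Java sketch below only shows the value format, and the property name dfs.name.dir (the classic pre-2.x name) and the paths are assumptions.

    import org.apache.hadoop.conf.Configuration;

    public class NameDirFormat {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Each directory receives a full, synchronously written copy of the
            // namespace image and edit log; the third path might be an NFS mount.
            conf.set("dfs.name.dir", "/data/1/dfs/nn,/data/2/dfs/nn,/mnt/nfs/dfs/nn");
            System.out.println("namespace storage: " + conf.get("dfs.name.dir"));
        }
    }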
Another approach is to run a secondary name node (secondary namenode). The secondary name node does not actually act as a name node; its main task is to periodically merge the edit log into the namespace image so that the edit log does not grow too large. It runs on a separate physical host and needs as much memory as the name node to perform the merge. It also keeps a copy of the namespace image. Because of the way it works, however, the secondary name node always lags behind the primary, so some data loss is still unavoidable when the name node fails.
Hadoop 0.23 introduced a high-availability mechanism for the name node: two name nodes work in an active/standby model, and all services are transferred to the standby node immediately when the active node fails. For large-scale HDFS clusters, the same release also introduced HDFS Federation to prevent the name node from becoming a system bottleneck. With HDFS Federation, each name node manages a namespace volume consisting of the namespace metadata and a block pool containing information about all of its blocks; the namespace volumes of different name nodes are isolated from one another, so the failure of one name node does not prevent the others from continuing to provide service.
In an HDFS cluster, each data node periodically (every 3 seconds) sends a heartbeat message to the name node to report that it is healthy. If the name node receives no heartbeat from a data node for 10 minutes, the node is considered failed and is removed from the list of available data nodes, regardless of whether the node itself or a network problem caused the failure.
Data transfer between the client and a data node is based on the TCP protocol; for every packet the client sends, the data node returns a response message. If the client still fails to receive a normal response after several retries, it gives up on that data node and uses the next data node in the list provided by the name node instead.
Noise in network transmission can corrupt data. To keep data nodes from storing bad data, the client sends a checksum along with the data, and the data node stores the checksum together with the block. Data nodes periodically report the blocks they hold to the name node, and before sending the report they verify each block against its checksum; if a block turns out to be corrupt, the data node no longer advertises it to the name node, which thereby learns that the data node holds a corrupted block.
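On the client side, checksum verification during reads is on by default and can be switched off through the FileSystem API, which is occasionally done to salvage what remains of a corrupt file. A minimal sketch, with a placeholder name node address and path:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadWithoutChecksum {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf); // placeholder

            // By default the client verifies each block against its stored checksum while
            // reading; disabling verification lets a partly corrupt file be copied out as-is.
            fs.setVerifyChecksum(false);
            IOUtils.copyBytes(fs.open(new Path("/user/demo/maybe-corrupt.dat")),
                    System.out, 4096, false);
        }
    }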

MapReduce

MapReduce is a programming framework: it gives programmers an environment for quickly developing programs that process massive amounts of data, and it allows programs developed on this model to run in a stable, fault-tolerant way on large clusters of commodity hardware. At the same time, MapReduce is a running framework: it provides the runtime environment for programs developed on the MapReduce model and transparently manages the details of their execution. Each program that the MapReduce framework runs is called a MapReduce job. A job is submitted by a client to a dedicated node in the cluster that is responsible for receiving jobs, and the framework provides a suitable running environment according to the cluster configuration and the attributes of the job. Execution is divided into two phases, the map phase and the reduce phase; in each phase, a number of tasks (that is, processes) are started to do the actual data processing, depending on the properties of the job itself, the resources available in the cluster, and the user's configuration.
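As a concrete (if classic) illustration of the two phases, the sketch below shows a word-count mapper and reducer written against the org.apache.hadoop.mapreduce API; the class names are my own and this is a minimal sketch rather than a tuned program.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: each map task processes one input split and emits (word, 1) pairs.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, each reduce task receives
    // (word, [1, 1, ...]) and sums the counts.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
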
In a MapReduce cluster, the host responsible for receiving jobs submitted by clients is called the master node, and the process on it that receives jobs is called the JobTracker. The nodes that run map or reduce tasks are called slave nodes, and the process on them that handles tasks is the TaskTracker. By default, a slave node can run two map tasks and two reduce tasks at the same time.

MapReduce Logical Architecture


The MapReduce Running Framework

A MapReduce program, also known as a MapReduce job, typically consists of mapper code, reducer code, and configuration parameters (such as where to read the input data and where to save the output). A prepared job is submitted through the JobTracker (the job submission node), and the running framework is then responsible for everything that follows. This follow-up work mainly covers the following areas.
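For example, a minimal driver that packages the mapper and reducer sketched earlier, sets the output types and input/output paths, and submits the job might look like the following (the paths come from the command line; all class names are from the earlier illustrative sketch).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");          // the MapReduce job to submit

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);      // mapper code
            job.setReducerClass(WordCountReducer.class);    // reducer code
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Configuration parameters: where to read the input and where to write the output.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit the job and wait; the running framework handles everything that follows.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
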
1. Scheduling
Each MapReduce job is divided into smaller units called tasks. A large job may be split into more tasks than the cluster can run at once, so the scheduler must maintain a task queue and track the state of running tasks in order to dispatch queued tasks to nodes that become available. The scheduler is also responsible for coordinating tasks that belong to different jobs.
For a running job, the intermediate data is grouped, sorted, and sent to the reduce tasks only after the map tasks have completed, so the completion time of the map phase depends on its slowest task. Likewise, the job's final results are available only when the last task of the reduce phase has finished. The completion speed of a MapReduce job is therefore determined by the stragglers of the two phases, which in the worst case can delay the job for a long time. To improve execution, both Hadoop and Google MapReduce implement a speculative execution mechanism: the same task is started in several copies on different hosts, and the framework takes the result from whichever copy finishes first. Speculative execution cannot, however, remove other sources of lag, such as the speed at which intermediate key-value pairs are delivered.
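Speculative execution can be turned on or off per job. The property names below are the classic Hadoop 1.x era ones (newer releases rename them to mapreduce.map.speculative and mapreduce.reduce.speculative), so treat this as a hedged sketch rather than the definitive switch for your version.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculationSettings {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Let the framework launch backup copies of straggling map and reduce tasks.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

            Job job = new Job(conf, "job with speculative execution");
            // ... set the mapper, reducer, and input/output paths as in the driver above, then submit.
        }
    }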
2. Data and code co-location
The phrase "data distribution" can be misleading here, because what MapReduce actually tries to ensure is that the code to be executed is delivered to the nodes where the data resides, since the code is usually much smaller than the data itself. Of course, MapReduce does not eliminate data transfer entirely: if the node holding a task's data is already running as many tasks as it can, the task has to run on another available node. In that case, since servers in the same rack share more network bandwidth, the better choice is to pick a node in the same rack as the data node to run the task.
3. Synchronization
In an asynchronous environment, a group of concurrent processes that cooperate by sending messages to each other, constrain one another directly, and wait for one another so that they proceed at compatible speeds is said to be synchronized; this covers both process (or thread) synchronization and data synchronization. In programming, the main tools for maintaining inter-process synchronization are memory barriers, mutexes, semaphores, locks, monitors, messages, pipes, and so on. MapReduce achieves synchronization by separating the map phase from the reduce phase: only when all map tasks have completed are the intermediate key-value pairs they produced grouped by key, sorted, and sent over the network to the reducers, which then start the reduce phase. This process is therefore also known as shuffle and sort.
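The grouping by key that shuffle and sort guarantees is decided by a partitioner: every intermediate pair with the same key is sent to the same reduce task. The sketch below mirrors the default hash-based behaviour using the org.apache.hadoop.mapreduce.Partitioner class (the class name is illustrative).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assigns each intermediate (key, value) pair to one of the reduce tasks.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Pairs sharing a key always hash to the same partition, so they all
            // reach the same reducer; this is how grouping by key is enforced.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

A job would select it with job.setPartitionerClass(WordPartitioner.class); leaving it unset falls back to Hadoop's default hash partitioner, which behaves the same way.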
4. Error and fault handling
The MapReduce running framework is designed to run on commodity servers that are prone to failure, so it must be highly fault tolerant. When any kind of hardware failure occurs, the framework restarts the tasks that were running on the affected node on a newly selected node. Likewise, when a program fails, the framework catches the exception, logs it, and recovers from it automatically. In addition, in a large cluster the framework must also survive other failures that go beyond what the programmer anticipated.

Hadoop Cluster

The core components of Hadoop are MapReduce and HDFS: HDFS provides the storage capacity for big data, while MapReduce provides the development and runtime environment for programmers to write programs that process that data.
When a job is scheduled, its map tasks can run directly on the HDFS data nodes that store the data to be processed. This avoids a large amount of data transfer and achieves data-local processing, which greatly improves the efficiency of the whole job; it is also the usual way a Hadoop cluster is deployed.

In a small cluster of fewer than 50 nodes, the NameNode and the JobTracker can run together on the same node. The whole cluster then runs five kinds of core processes: the JobTracker and TaskTracker of the MapReduce cluster, and the NameNode, DataNode, and SecondaryNameNode of the HDFS cluster.

The Hadoop Ecosystem

Hadoop Operating Environment

Hadoop is developed in the Java language, so running it depends first of all on the JDK (Java Development Kit), and many Hadoop features rely on capabilities provided by Java 6 and later versions. Besides the JDK, the normal operation of a Hadoop cluster may also depend on other software for maintenance, monitoring, and management, depending on the actual environment: cron, NTP, SSH, Postfix/Sendmail, rsync, and the like. Cron is typically used for periodic tasks such as cleaning up expired temporary files and archiving and compressing logs; NTP keeps the time of the cluster nodes synchronized; SSH is not strictly necessary, but it is used when the whole cluster is started at once from the MapReduce or HDFS master node; Postfix/Sendmail delivers cron's execution results to the administrator; and rsync can be used to synchronize configuration files.
1. Node host names
Hadoop has its own particular way of referring to nodes by host name, which has long been a headache for many Hadoop administrators. In practice, every node in the cluster, and especially the slave nodes (DataNode and TaskTracker), should avoid using localhost as its host name, unless it runs in a pseudo-distributed environment.
2. Users, groups, and directories
A complete Hadoop cluster contains a MapReduce cluster and an HDFS cluster. The MapReduce cluster contains two classes of long-running processes, the JobTracker and the TaskTrackers, plus many on-demand task processes (such as map tasks); the HDFS cluster contains three classes of processes: the NameNode, the SecondaryNameNode, and the DataNodes. For security, these processes should be started as regular users, and the MapReduce processes and the HDFS processes should run as different users, for example mapred and hdfs respectively. When Hadoop is installed from the CDH RPM packages these users are created automatically; when it is installed from the tar package they have to be created manually.

Hadoop has three modes of operation: local (standalone) mode, pseudo-distributed mode, and fully distributed mode.
There are four levels of configuration in Hadoop: cluster, process (daemon), job, and individual operation. The first two are configured by the cluster administrator, while the latter two belong to the programmer's work.

Hadoop Configuration Files

Among Hadoop's configuration files, core-site.xml, mapred-site.xml, and hdfs-site.xml are the most critical.

hadoop-env.sh: defines settings related to Hadoop's runtime environment, such as the JAVA_HOME environment variable, specific options for the Hadoop JVMs, the directory where log files are kept, and the location of the masters and slaves files;

core-site.xml: defines system-level parameters, such as the HDFS URL, Hadoop's temporary directory, and the rack-awareness configuration. The parameters defined here override the defaults in core-default.xml; they describe the characteristics of the Hadoop cluster and apply to all processes and clients;

hdfs-site.xml: settings for how the HDFS cluster works, such as the number of file replicas, the block size, and whether to enforce permissions; the parameters defined here override the defaults in hdfs-default.xml;

mapred-site.xml: settings for how the MapReduce cluster works, such as the default number of reduce tasks and the default upper and lower memory limits for tasks; the parameters defined here override the defaults in mapred-default.xml;

slaves: the list of slave hosts in the Hadoop cluster; when the master starts, it connects to every host in this list through SSH and starts the DataNode and TaskTracker processes on them.
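For reference, client-side code sees these files through the Configuration class: core-default.xml and core-site.xml are loaded automatically from the classpath, and the other site files can be added as extra resources. The property names below are the classic Hadoop 1.x era ones, used here purely as an illustration.

    import org.apache.hadoop.conf.Configuration;

    public class ShowConfiguration {
        public static void main(String[] args) {
            // Loads core-default.xml and then core-site.xml; site values override defaults.
            Configuration conf = new Configuration();
            conf.addResource("hdfs-site.xml");     // HDFS-specific overrides
            conf.addResource("mapred-site.xml");   // MapReduce-specific overrides

            System.out.println("fs.default.name    = " + conf.get("fs.default.name"));
            System.out.println("dfs.replication    = " + conf.get("dfs.replication", "3"));
            System.out.println("mapred.job.tracker = " + conf.get("mapred.job.tracker"));
        }
    }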
