Apache Hadoop Distributed File System Explained (Java)

Source: Internet
Author: User
Tags: documentation, mkdir, hadoop, mapreduce, hadoop ecosystem, hadoop fs

Original from: https://examples.javacodegeeks.com/enterprise-java/apache-hadoop/apache-hadoop-distributed-file-system-explained/


In this example, we will discuss the Apache Hadoop Distributed File System (HDFS), its components, and its architecture in detail. HDFS is one of the core components of the Apache Hadoop ecosystem.

1. Introduction

Apache Hadoop provides a distributed file system and a framework for transforming large datasets using the MapReduce paradigm. HDFS is designed to reliably store very large datasets while running on commodity hardware. It is fault tolerant and provides high-throughput access to the stored data. Although the HDFS interface is patterned after the Unix file system, it relaxes a few POSIX requirements to improve the performance of the applications it targets and to provide streaming access to the data stored in the file system.
2. HDFS Design

The following are the properties of HDFS that distinguish it from other file systems and enable it to handle large amounts of data reliably.
2.1 System Failure

HDFS is designed to work on a set of commodity hardware, so system failures are considered the norm rather than the exception. Because HDFS depends on a large number of components, each with a non-trivial probability of failure, some component is effectively always failing. Detecting failures and recovering from them automatically while still providing the required performance is therefore one of the core design goals of HDFS.
2.2 Handling Large Amounts of Data

HDFS is designed for applications that rely on large amounts of data, which can range from gigabytes to terabytes or even petabytes. HDFS is therefore tuned to support such large datasets and to scale to large clusters of systems that store this data without sacrificing data throughput.
2.3 Coherency Model

HDFS is tuned for applications that write data once (or only a few times) and read it many times. Because applications are assumed to follow this write-once-read-many model, data coherency issues are simplified and HDFS can provide high-throughput data access.
2.4 Portability

HDFS is designed to be portable across heterogeneous hardware and software platforms. This makes HDFS very easy to adopt and makes it a platform of choice for applications that depend on large distributed datasets.
3. HDFS Nodes

HDFS has two main components: the NameNode and the DataNodes.
3.1 NameNode

HDFS follows a master-slave architecture, in which the NameNode is the node that acts as the master. An HDFS cluster contains only one NameNode. The main function of the NameNode is to manage the file system namespace and regulate client access to the files stored in the HDFS cluster. It also maintains the mapping of data blocks to the different DataNodes on which they are stored.
3.2 DataNode

DataNodes are the nodes that, as the name suggests, store the actual data in the cluster. There are multiple DataNodes in a cluster, and their number is usually the same as the number of hardware nodes in the cluster. DataNodes serve read and write requests from clients and also handle operations related to data blocks, such as block creation, deletion, and replication.
4. HDFS Architecture

In this section, we will look at the basic architecture of the Hadoop Distributed File System (HDFS).
4.1 How the NameNode and DataNodes Work

HDFS is a block-structured file system, which means that each individual file is divided into small blocks of data with a fixed block size. These blocks are then stored on the DataNodes across the cluster of machines. The NameNode handles operations such as opening, closing, and renaming files or directories. It also manages the mapping of data in the cluster, which means the NameNode keeps track of which block of data is stored on which DataNode and handles the replication of that data.
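As a rough illustration of this mapping, the Java API discussed later in section 7 can be used to ask the NameNode which DataNodes hold the blocks of a given file. The following is only a minimal sketch; the file path in it is a placeholder and not something taken from the original article:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        // Connect to the file system described by the cluster configuration
        FileSystem fs = FileSystem.get(new Configuration());

        Path file = new Path("/user1/project1/textfile.txt");  // placeholder path
        FileStatus status = fs.getFileStatus(file);

        // The NameNode reports, for each block, the DataNodes that store a replica of it
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}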
4.2 HDFS Namespace

The HDFS namespace defines how data is stored and accessed in the cluster. HDFS supports a traditional hierarchical organization of files and directories. It also supports almost all of the functions required for namespace operations, such as creating or deleting files or directories and moving files or directories from one place to another.

As we discussed in section 3, the NameNode is the component that maintains the namespace of the HDFS file system. Any operation on the namespace, such as creating or deleting files or moving files and directories, is maintained by the NameNode.
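To make these namespace operations concrete, here is a small sketch (not part of the original article; the directory names are placeholders) that performs the same kind of create, move, and delete operations through the Java API covered in section 7:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Create a directory in the namespace (like hadoop fs -mkdir)
        fs.mkdirs(new Path("/user1/project1"));

        // Move (rename) the directory to another place in the namespace
        fs.rename(new Path("/user1/project1"), new Path("/user1/project2"));

        // Delete the directory and its contents (recursive = true)
        fs.delete(new Path("/user1/project2"), true);

        fs.close();
    }
}

Each of these calls goes to the NameNode, which records the change in the namespace.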
4.3 Data Replication

HDFS is designed to store large amounts of data reliably and securely on a set of commodity hardware. Because this hardware is prone to failure, HDFS needs to handle data in a way that makes it easy to retrieve in the event of a hardware failure on one or more systems. HDFS uses data replication as its strategy for providing fault tolerance. Applications that use HDFS can configure the replication factor and the block size of the data as required.

The question now is how replica placement is decided, and what happens if all replicas of a block sit in a single rack of the cluster and the entire rack fails. HDFS tries to maintain a rack-aware replication policy, which in practice requires a lot of tuning and experience. A simple but non-optimal policy is to place each replica of a block on a unique rack, so that even if an entire rack is lost, at least one replica remains safe on another rack.

In most production systems, a replication factor of three is used. In that case, HDFS uses a slightly different version of the unique-rack policy: it typically places one replica on a node in the local rack, another on a node in a completely different remote rack, and the third on a different node in that same remote rack. This policy improves write speed because writes cross between only two racks instead of three, while still providing backup in the event of a node failure as well as a rack failure. Write performance is improved without affecting data reliability.
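As a sketch of how an application can configure these values (the property values and the file path below are illustrative assumptions, not taken from the original article), the replication factor and block size can be set through the client configuration, and the replication factor of an existing file can also be changed through the Java API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettings {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Replication factor and block size used for files created by this client
        conf.set("dfs.replication", "3");
        conf.set("dfs.blocksize", "134217728");  // 128 MB

        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file to two
        fs.setReplication(new Path("/user1/project1/textfile.txt"), (short) 2);

        fs.close();
    }
}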
4.4 Failures

The primary goal of the Hadoop Distributed File System (HDFS) is to provide reliable access to data even in the presence of failures. Because failures are the norm rather than the exception in clusters of commodity hardware, HDFS needs a strategy to handle them. Three common types of failures are:

  • NameNode failure
  • DataNode failure
  • Network partition


Each DataNode in the cluster sends a periodic message to the NameNode, called a heartbeat. The heartbeat tells the NameNode that the particular DataNode is working properly and is alive. If a DataNode fails, no heartbeat reaches the NameNode from that DataNode. Similarly, in the case of a network partition, a subset of DataNodes may lose their connection to the NameNode and stop sending heartbeats. Once the NameNode stops receiving heartbeats from a particular DataNode or set of DataNodes, it declares those nodes dead and begins the recovery process. This includes checking whether all of the blocks stored on the dead DataNodes still have enough replicas; if not, the NameNode starts creating additional replicas until the minimum number of replicas configured for the application is reached.

A NameNode failure is more serious, because the NameNode is the single point of failure for the entire HDFS cluster. If the NameNode system fails, the whole cluster becomes useless, and manual intervention is required to set up another NameNode.
4.5 Data Accessibility

To allow applications to access the data stored in an HDFS cluster, HDFS provides a Java API for applications to use. A C language wrapper over this Java API is also provided for applications that need to use C.

In addition to the Java and C APIs, there is also an option to access HDFS data through a web browser, over the web port that can be configured in the HDFS settings.

A third option is to use the file system shell. HDFS provides a command-line interface called the FS shell that lets users interact with the data in HDFS. The syntax of this command-line interface is similar to Linux shell commands. For example:

#To make a new directory
hadoop fs -mkdir /user1/project1

#List the contents of the directory
hadoop fs -ls /user1/project1

#Upload a file from the local system to HDFS
hadoop fs -put desktop/textfile.txt /user1/project1

For more examples of and instructions about the FS shell commands, you can read the article Apache Hadoop FS Commands Example.

5. Configuring HDFS

Configuring HDFS is very simple, and setting up an HDFS cluster does not take much time. All of the HDFS configuration files are included in the Hadoop package by default and can be modified directly.

Note: We assume that the Hadoop package has been downloaded, uncompressed, and placed in the desired directory. In this article, we will discuss only the configuration required for HDFS. For detailed articles on how to set up Hadoop and Hadoop clusters, follow these tutorials:

  • How to Install Apache Hadoop
  • Apache Hadoop Cluster Setup Example (with Virtual Machines)

5.1 Configuring HDFS

HDFS is configured with a set of XML files that are present by default in the Hadoop configuration directory. This configuration directory is located in the root of the Hadoop folder and is named conf.

First, we will modify the conf/hdfs-site.xml file. We need to set three properties in this file: dfs.replication, dfs.namenode.name.dir, and dfs.datanode.data.dir.

To modify the file, open it in an editor and add the following lines:

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>2</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>/usr/local/hadoop/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/usr/local/hadoop/hdfs/datanode</value>
   </property>
</configuration>

The first property we set here is dfs.replication, which defines the replication factor to be used by the distributed file system. In this case, we set it to two.

The next property defines the NameNode path, dfs.namenode.name.dir, whose value must be the directory where the NameNode information will be stored.

The third and last property we need to set defines the DataNode path, dfs.datanode.data.dir, which specifies the path to the directory where the DataNode information will be stored.


Note: Make sure that the NameNode and DataNode directories are created, that the directories where the data will be stored are owned by the user who will run Hadoop, and that this user has read and write permissions on them.
5.2 Formatting the NameNode

The next step is to format the NameNode that we just configured. The following command is used to format the NameNode:

hdfs namenode -format

This command should execute without any errors in the console output. If it does, we are ready to start the Apache Hadoop instance on our Ubuntu system.
5.3 Starting HDFS

Now we are ready to start the Hadoop file system. To start HDFS, run the start-dfs.sh script using the following command:

/usr/local/hadoop/sbin/start-dfs.sh


Once this script executes without any errors, HDFs will start and run.
6. Interacting with HDFS Using the Shell

Now we will look at some of the commands used to interact with HDFS through the shell. In this section, we cover only the basic introductory commands, using just the command-line interface. The commands that communicate with the cluster are provided by the bin/hadoop script. This script loads the Hadoop package in the Java Virtual Machine (JVM) and executes the user command.
6.1 Creating a directory

Usage:

hadoop fs -mkdir <paths>

Example:

hadoop fs -mkdir /user/root/dir1

We can then list the contents of the path with the ls command (covered in the next section) to confirm that dir1 was created.


6.2 Listing the contents of the directory

Usage:

hadoop fs -ls <args>

Example:

hadoop fs -ls /user/root/

This command is similar to the ls command of the Unix shell.


6.3 Uploading files to HDFS

The put command is used to copy one or more files from the local system to the Hadoop file system.

Usage:

hadoop fs -put <localsrc> ... <destination>

Example:

hadoop fs -put desktop/testfile.txt /user/root/dir1/


6.4 Downloading files from HDFS

The get command downloads files from HDFS to the local file system.

Usage:

hadoop fs -get <source> <local-destination>

Example:

hadoop fs -get /user/root/dir1/testfile.txt Downloads/

Like the put command, the get command fetches, or downloads, the file from the Hadoop file system into the Downloads folder on the local file system.


Note: For more information about file system commands and other important commands, see the Apache Hadoop FS Commands Example article, or check the full documentation of the shell commands on the Apache Hadoop web site: File System Shell commands and HDFS commands.

7. Interacting with HDFS Using MapReduce

As we discussed, HDFS is a basic component of Hadoop and MapReduce. Hadoop MapReduce jobs obtain their data from HDFS and store the final result data back in HDFS.

Hadoop also provides a Java API through which we can perform HDFS functions in Java applications. In this section, we'll see how to use the Java API in Java code.

package com.javacodegeeks.examples.HDFSJavaApi;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
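// What follows is a minimal sketch of how this Java API can be used; the class name,
// directory, and file names below are illustrative placeholders rather than values
// from the original article.
public class HdfsJavaApiExample {

    public static void main(String[] args) throws IOException {
        // Load the cluster configuration (core-site.xml, hdfs-site.xml) from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/root/dir1");        // placeholder directory
        Path file = new Path(dir, "apiExample.txt");   // placeholder file name

        // Create the directory if it does not exist yet (like hadoop fs -mkdir)
        if (!fs.exists(dir)) {
            fs.mkdirs(dir);
        }

        // Write a file into HDFS (like hadoop fs -put)
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("Hello HDFS from the Java API");
        out.close();

        // Read the file back from HDFS (like hadoop fs -get)
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
    }
}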
