"Book pick": Big data development, HDFS in depth

Keywords: cloud computing, big data, Hadoop, authoritative guide
This article is an excerpt from the book "The Authoritative Guide to Hadoop", published by Tsinghua University Press, written by Tom White and translated by the School of Data Science and Engineering, East China Normal University. The book begins with the origins of Hadoop and combines theory and practice to present Hadoop as an ideal tool for high-performance processing of massive datasets. It consists of 16 chapters and 3 appendices, covering: Hadoop; MapReduce; the Hadoop Distributed File System; Hadoop I/O; MapReduce application development; how MapReduce works; MapReduce types and formats; MapReduce features; how to set up a Hadoop cluster; how to administer Hadoop; Pig; HBase; Hive; ZooKeeper; and the open source tool Sqoop. It closes with a rich set of case studies. The book is an authoritative reference for Hadoop: programmers can learn how to analyze massive datasets, and administrators can learn how to install and run a Hadoop cluster.

The following covers the contents of Chapter 3:

3.1 Design of HDFS

3.2 HDFS concepts

3.3 Command-line interface

3.4 Hadoop filesystems

3.5 Java interface

3.6 Data flow

3.7 Importing data with Flume and Sqoop

3.8 Parallel copying with distcp

3.9 Hadoop archives

When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines. Filesystems that manage storage across a network of machines are called distributed filesystems. Because they are network based, all the complications of network programming come into play, which makes distributed filesystems more complex than regular disk filesystems. For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.

Hadoop has a distributed filesystem called HDFS, the Hadoop Distributed Filesystem. In informal or older documentation, as well as in configuration files, it is sometimes referred to as DFS; the two names mean the same thing. HDFS is Hadoop's flagship filesystem and the focus of this chapter, but Hadoop actually has a general-purpose filesystem abstraction, so we will also see how Hadoop integrates with other storage systems (such as the local filesystem and Amazon S3).

3.1 Design of HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let's examine this statement in more detail.

Very large files: "Very large" here means files that are hundreds of megabytes, hundreds of gigabytes, or even hundreds of terabytes in size. There are already Hadoop clusters that store petabytes of data.

Streaming data access: HDFS is built around the idea that the most efficient access pattern is write once, read many times. A dataset is typically generated by a source or copied from one, and then analyzed many times over a long period. Each analysis involves a large proportion of the dataset, if not all of it, so the time to read the whole dataset matters more than the latency of reading the first record.

Commodity hardware: Hadoop does not need to run on expensive, highly reliable hardware. It is designed to run on clusters of commodity hardware (commonly available hardware that can be bought from many vendors), for which the chance of node failure is high, at least for large clusters. HDFS is designed to carry on working in the face of such failures without a noticeable interruption to the user.

It is also worth examining the applications for which HDFS is not a good fit. Some of them do not work well on HDFS today, although this may change in the future.

Low-latency data access: Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember that HDFS is optimized for high data throughput, possibly at the expense of latency. For low-latency access, HBase (see Chapter 12) is currently a better choice.

Lots of small files: Because the namenode holds filesystem metadata in memory, the number of files a filesystem can hold is limited by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you have 1 million files and each file occupies one block, you need at least 300 MB of memory. Although storing millions of files is feasible, storing billions is beyond the capability of current hardware.

Multiple writers, arbitrary file modifications: A file in HDFS may have only a single writer, and writes are always made at the end of the file. There is no support for multiple writers or for modifications at arbitrary offsets in the file. These might be supported in the future, but they are likely to be relatively inefficient.
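The small-files rule of thumb above can be turned into a quick estimate. The following is a minimal sketch, assuming roughly 150 bytes per file and per block and ignoring directories; the class and method names are hypothetical and are not part of Hadoop.

```java
/**
 * Back-of-the-envelope estimate of the namenode memory needed for metadata.
 * The 150-byte figure is the rule of thumb quoted in the text, not a
 * measured value, and directories are ignored for simplicity.
 */
final class NamenodeMemoryEstimate {
    // Approximate in-memory cost of one file or block object (assumption).
    static final long BYTES_PER_OBJECT = 150L;

    /** Estimated bytes for `files` files, each made of `blocksPerFile` blocks. */
    static long estimateBytes(long files, long blocksPerFile) {
        long objects = files + files * blocksPerFile; // file entries plus block entries
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long oneMillion = estimateBytes(1_000_000L, 1);
        long oneBillion = estimateBytes(1_000_000_000L, 1);
        System.out.println("1 million files: " + oneMillion / 1_000_000L + " MB");
        System.out.println("1 billion files: " + oneBillion / 1_000_000_000L + " GB");
    }
}
```

With one block per file, a million files come to about 300 MB, while a billion files come to about 300 GB, which illustrates why very large numbers of small files are a problem for a single namenode.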

3.2 HDFS concepts

3.2.1 Data Block

Each disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in filesystem blocks, whose size is an integer multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, while disk blocks are normally 512 bytes. This is generally transparent to the filesystem user, who simply reads or writes a file of whatever length. However, there are tools for filesystem maintenance, such as df and fsck, that operate at the filesystem block level.

HDFS also has the concept of a block, but it is a much larger unit: 64 MB by default. As in a filesystem for a single disk, files in HDFS are broken into block-sized chunks, which are stored as independent units. Unlike a filesystem for a single disk, however, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage. Unless otherwise stated, the term "block" in this book refers to a block in HDFS.

Why is a block in HDFS so large?

HDFS blocks are large compared to disk blocks in order to minimize the cost of seeks. If the block is large enough, the time it takes to transfer the data from the disk is significantly longer than the time it takes to seek to the start of the block. Thus, the time to transfer a large file made of multiple blocks is governed by the disk transfer rate.

Let's do a quick calculation. If the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time we need a block size of around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
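The arithmetic behind this estimate can be written out explicitly. This is a minimal sketch, assuming a 10 ms seek time and an illustrative transfer rate of 100 MB/s (the rate is an assumed figure); the class and method names are hypothetical.

```java
/**
 * Works out how large a block must be for the seek time to be a given
 * fraction of the transfer time. All inputs are illustrative assumptions.
 */
final class BlockSizeEstimate {
    /**
     * Returns the block size in MB for which seeking takes `seekFraction`
     * of the time spent transferring the block.
     */
    static double minBlockSizeMb(double seekMs, double transferMbPerSec, double seekFraction) {
        // If seek = fraction * transfer, then transfer = seek / fraction.
        double transferSeconds = (seekMs / 1000.0) / seekFraction;
        return transferSeconds * transferMbPerSec;
    }

    public static void main(String[] args) {
        double blockMb = minBlockSizeMb(10.0, 100.0, 0.01);
        System.out.println("Block size for a 1% seek overhead: " + blockMb + " MB");
    }
}
```

With these inputs the transfer must last one second, so the block size works out to about 100 MB. A faster disk with the same seek time pushes the figure higher, which is why block sizes tend to grow over time.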

This argument should not be taken too far, however. Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than the number of nodes in the cluster), your jobs will run slower than they otherwise could.

Having a block abstraction in a distributed filesystem brings several benefits. The first and most obvious is that a file can be larger than any single disk in the network. Nothing requires the blocks of a file to be stored on the same disk, so they can take advantage of any of the disks in the cluster. In fact, although it would be unusual, an HDFS cluster could store a single file whose blocks filled all the disks in the cluster.

The second benefit is that making the unit of abstraction a block rather than a file simplifies the storage subsystem. Simplicity is something to strive for in all systems, but it is especially important in a distributed system, where the failure modes are so varied. The storage subsystem deals with blocks, which simplifies storage management (because blocks are a fixed size, it is easy to calculate how many can be stored on a given disk). It also removes metadata concerns (blocks are just chunks of data to be stored, so file metadata such as permissions does not need to be stored with the blocks and can be handled separately by another system).

Furthermore, blocks fit well with replication for providing fault tolerance and availability. To guard against corrupted blocks and against disk and machine failure, each block is replicated to a small number of physically separate machines (three by default). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client. A block that is no longer available because of corruption or machine failure can be replicated from its alternative locations to other live machines, bringing the replication factor back to the normal level. (See the discussion of data integrity in Section 4.1 for more on guarding against corrupt data.) Similarly, some applications may choose to set a high replication factor for the blocks of a popular file to spread the read load across the cluster.

Like its disk filesystem cousin, HDFS's fsck command understands blocks. For example, running the following command lists the blocks that make up each file in the filesystem (see Section 10.1.4.2):

% hadoop fsck / -files -blocks

3.2.2 Namenodes and datanodes

An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and multiple datanodes (workers). The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. The namenode also knows the datanodes on which all the blocks for a given file are located, but it does not store block locations persistently, because this information is reconstructed from the datanodes when the system starts.

A client accesses the filesystem on behalf of the user by communicating with the namenode and the datanodes. The client presents a filesystem interface similar to POSIX (Portable Operating System Interface), so user code does not need to know about the namenode and datanodes in order to function.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of the blocks they are storing.

Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were destroyed, all the files on the filesystem would be lost, because there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.

The first mechanism is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration is to write to the local disk as well as to a remote NFS mount.

The second approach is to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine, because the merge requires plenty of CPU and as much memory as the namenode. It keeps a copy of the merged namespace image, which can be used if the namenode fails. However, the state of the secondary namenode lags behind that of the primary, so if the primary fails completely, some data loss is almost certain. The usual course of action in this case is to copy the namenode's metadata files from NFS to the secondary and run it as the new primary.
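The merge performed by the secondary namenode can be pictured as replaying the edit log on top of the last namespace image. The following is a deliberately simplified sketch: the image is modeled as a sorted set of paths and the edit log as a list of create and delete operations. These types are hypothetical and bear no relation to Hadoop's actual on-disk formats.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy model of checkpointing: new image = old image + replayed edits. */
final class CheckpointSketch {
    enum Op { CREATE, DELETE }

    /** One entry in the (hypothetical) edit log. */
    static final class Edit {
        final Op op;
        final String path;

        Edit(Op op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    /** Replays the edit log over a copy of the old image and returns the merged image. */
    static TreeSet<String> checkpoint(TreeSet<String> image, List<Edit> editLog) {
        TreeSet<String> merged = new TreeSet<>(image); // the old image is left untouched
        for (Edit edit : editLog) {
            if (edit.op == Op.CREATE) {
                merged.add(edit.path);
            } else {
                merged.remove(edit.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        TreeSet<String> image = new TreeSet<>();
        image.add("/user/tom/quangle.txt");

        List<Edit> editLog = new ArrayList<>();
        editLog.add(new Edit(Op.CREATE, "/user/tom/books"));
        editLog.add(new Edit(Op.DELETE, "/user/tom/quangle.txt"));

        System.out.println("Merged image: " + checkpoint(image, editLog));
    }
}
```

Once the merged image has been saved, the edits it contains no longer need to be kept, which is what stops the edit log from growing without bound.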

For more information, see the discussion of the filesystem image and edit log in Section 10.1.1.

3.2.3 HDFS Federation

The namenode keeps a reference to every file and block in the filesystem in memory, which means that on very large clusters with many files, memory becomes the limiting factor for scaling (see Section 9.4.2). HDFS Federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace. For example, one namenode might manage all the files under /user, and a second namenode might handle the files under /share.

Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace and a block pool containing all the blocks for the files in that namespace. Namespace volumes are independent of each other: namenodes do not communicate with one another, and the failure of one namenode does not affect the availability of the namespaces managed by the others. Block pool storage is not partitioned, however, so datanodes register with every namenode in the cluster and store blocks from multiple block pools.

To access a federated HDFS cluster, clients use client-side mount tables to map file paths to namenodes. This is managed in configuration using ViewFileSystem and viewfs:// URIs.
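The idea of a client-side mount table can be sketched as a longest-prefix lookup from a file path to the namenode responsible for it. This is an illustration only, not the ViewFileSystem API; the class, its methods, and the example.com hostnames are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy client-side mount table: maps path prefixes to namenode URIs. */
final class MountTableSketch {
    private final Map<String, String> mounts = new LinkedHashMap<>();

    /** Registers the namenode that owns everything under pathPrefix. */
    void addMount(String pathPrefix, String namenodeUri) {
        mounts.put(pathPrefix, namenodeUri);
    }

    /** Returns the namenode URI for the longest matching mount point, or null if none matches. */
    String resolve(String path) {
        String best = null;
        for (String prefix : mounts.keySet()) {
            // Match whole path components only, so /user does not match /username.
            boolean matches = path.equals(prefix) || path.startsWith(prefix + "/");
            if (matches && (best == null || prefix.length() > best.length())) {
                best = prefix;
            }
        }
        return best == null ? null : mounts.get(best);
    }

    public static void main(String[] args) {
        MountTableSketch table = new MountTableSketch();
        table.addMount("/user", "hdfs://nn1.example.com");  // hypothetical host
        table.addMount("/share", "hdfs://nn2.example.com"); // hypothetical host
        System.out.println(table.resolve("/user/tom/quangle.txt"));
        System.out.println(table.resolve("/share/data"));
    }
}
```

Because the mapping lives entirely on the client, the namenodes themselves never need to know about each other, which matches the independence of namespace volumes described above.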

3.2.4 HDFS high availability

The combination of replicating namenode metadata on multiple filesystems and using the secondary namenode to create checkpoints protects against data loss, but it does not provide high availability of the filesystem. The namenode is still a single point of failure (SPOF). If it fails, all clients, including MapReduce jobs, are unable to read, write, or list files, because the namenode is the sole repository of the metadata and of the file-to-block mapping. In such an event, the whole Hadoop system is effectively out of service until a new namenode can be brought online.

To recover from a failed namenode in this situation, an administrator starts a new primary namenode with one of the filesystem metadata replicas and configures datanodes and clients to use this new namenode. The new namenode cannot serve requests until it has (1) loaded its namespace image into memory, (2) replayed its edit log, and (3) received enough block reports from the datanodes to leave safe mode. On large clusters with many files and blocks, the time it takes for a namenode to start from cold can be 30 minutes or more.

The long recovery time is also a problem for routine maintenance. In fact, because unexpected failure of the namenode is so rare, planned downtime is actually more important in practice.

The 2.x release series of Hadoop remedies this situation by adding support for HDFS high availability (HA). In this implementation, there is a pair of namenodes in an active-standby configuration. If the active namenode fails, the standby takes over its duties and continues servicing client requests without a significant interruption. A few architectural changes are needed to make this happen.

The namenodes must use highly available shared storage to share the edit log. (The initial HA implementation requires an NFS filer for this, but later releases will provide more options, such as a BookKeeper-based system built on ZooKeeper.) When the standby namenode takes over, it reads to the end of the shared edit log to synchronize its state with the active namenode, and then continues to read new entries as they are written by the active namenode.

Datanodes must send block reports to both namenodes, because the block mappings are stored in a namenode's memory and not on disk.

Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.

If the active namenode fails, the standby can take over very quickly (in a few tens of seconds), because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping. The failover time actually observed is longer in practice (around a minute or so), because the system needs to be conservative in deciding that the active namenode has really failed.

In the unlikely event that the standby is also down when the active namenode fails, the administrator can still start the standby from cold. This is no worse than the non-HA case, and from an operational point of view it is an improvement, because the process is a standard operational procedure built into Hadoop.

Failover and fencing

The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. Failover controllers are pluggable, but the first implementation uses ZooKeeper to ensure that only one namenode is active. Each namenode runs a lightweight failover controller process whose job is to monitor its namenode for failures (using a simple heartbeat mechanism) and to trigger a failover if the namenode fails.

Failover may also be initiated manually by an administrator, for example during routine maintenance. This is known as a graceful failover, because the failover controller arranges an orderly transition in which the two namenodes switch roles.

In the case of an ungraceful failover, however, it is impossible to be sure that the failed namenode has stopped running. For example, a slow network or a network partition can trigger a failover even though the previously active namenode is still running and still believes it is the active namenode. The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption, a method known as fencing. The system employs a range of fencing mechanisms, including killing the namenode's process, revoking its access to the shared storage directory (typically by using a vendor-specific NFS command), and disabling its network port via a remote management command. As a last resort, the previously active namenode can be fenced with a technique rather graphically known as STONITH ("shoot the other node in the head"), which uses a specialized power distribution unit to forcibly power down the host machine.

Client failover is handled transparently by the client library. The simplest implementation uses client-side configuration to control failover. The HDFS URI uses a logical hostname that is mapped to a pair of namenode addresses (in the configuration file), and the client library tries each namenode address until the operation succeeds.
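The "try each namenode address until the operation succeeds" behavior can be sketched as a simple retry loop. This is an illustration of the idea only and does not reflect the actual HDFS client library; the class name, method name, and addresses are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Toy client-side failover: try each configured namenode address in turn. */
final class ClientFailoverSketch {
    /**
     * Applies the operation to each address in order and returns the first
     * successful result. A RuntimeException is treated as "this namenode did not answer".
     */
    static <T> T callWithFailover(List<String> namenodeAddresses, Function<String, T> operation) {
        RuntimeException lastFailure = null;
        for (String address : namenodeAddresses) {
            try {
                return operation.apply(address);
            } catch (RuntimeException e) {
                lastFailure = e; // remember the failure and move on to the next address
            }
        }
        throw new IllegalStateException("no namenode responded", lastFailure);
    }

    public static void main(String[] args) {
        // Hypothetical pair of addresses behind one logical hostname.
        List<String> addresses = Arrays.asList("nn1.example.com:8020", "nn2.example.com:8020");
        String reply = callWithFailover(addresses, address -> {
            if (address.startsWith("nn1")) {
                throw new RuntimeException("connection refused"); // simulate a failed active namenode
            }
            return "listing served by " + address;
        });
        System.out.println(reply);
    }
}
```

The caller only ever sees the logical name and the final result, which is what makes the failover transparent to user code.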

3.3 Command-Line interface

We will now look at HDFS by interacting with it from the command line. There are many other interfaces to HDFS, but the command line is one of the simplest and, to many developers, the most familiar.

We are going to run HDFS on one machine, so first follow the instructions for setting up Hadoop in pseudo-distributed mode in Appendix A. Later we will see how to run HDFS on a cluster of machines to get scalability and fault tolerance.

Two properties that we set in the pseudo-distributed configuration deserve further explanation. The first is fs.default.name, set to hdfs://localhost/, which sets the default filesystem for Hadoop. Filesystems are specified by a URI, and here we have used an hdfs URI to configure Hadoop to use HDFS by default. The HDFS daemons use this property to determine the host and port of the HDFS namenode. We will run the namenode on localhost, on the default HDFS port, 8020. HDFS clients also use this property to work out where the namenode is running so that they can connect to it.

The second property, dfs.replication, is set to 1 so that HDFS does not replicate filesystem blocks by the default factor of three. When running with a single datanode, HDFS cannot replicate blocks to three datanodes, so it would continually warn about blocks being under-replicated. This setting solves that problem.

Basic filesystem operations

The filesystem is ready to be used, and we can perform all the usual filesystem operations, such as reading files, creating directories, moving files, deleting data, and listing directories. You can type hadoop fs -help to get detailed help on every command.

Start by copying a file from the local filesystem to HDFS:

% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt

This command invokes Hadoop's filesystem shell command fs, which supports a number of subcommands; in this case we are running -copyFromLocal. The local file quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on localhost. In fact, we could have omitted the scheme and host of the URI and picked up the default, hdfs://localhost, as specified in core-site.xml:

% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt

We could also have used a relative path and copied the file to our home directory in HDFS, which in this case is /user/tom:

% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt

Let's copy the file back to the local filesystem and check whether it is the same:

% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
MD5 (quangle.copy.txt) = a16f231da6b05e2ba7a339320e7dacd9

The MD5 digests are the same, showing that the file survived its trip to HDFS and back intact. Finally, let's look at an HDFS file listing. We create a directory first, just to see how it is displayed in the listing:

% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x   - tom supergroup          0 2009-04-02 22:41 /user/tom/books
-rw-r--r--   1 tom supergroup        118 2009-04-02 22:29 /user/tom/quangle.txt

The information returned is very similar to the output of the Unix command ls -l, with a few minor differences. The first column shows the file mode. The second column is the replication factor of the file (something a traditional Unix filesystem does not have). Because we set the default replication factor for the whole filesystem to 1, the same value is shown here. The entry in this column is empty for directories because the concept of replication does not apply to them: directories are treated as metadata and stored by the namenode, not the datanodes. The third and fourth columns show the file's owner and group. The fifth column is the size of the file in bytes, or zero for directories. The sixth and seventh columns are the last modified date and time. Finally, the eighth column is the absolute name of the file or directory.


File permissions in HDFS

HDFS has a permissions model for files and directories that is much like POSIX.

There are three types of permission: the read permission (r), the write permission (w), and the execute permission (x). The read permission is required to read files or list the contents of a directory. The write permission is required to write a file or, for a directory, to create or delete files or directories in it. The execute permission is ignored for a file, because you cannot execute a file on HDFS (unlike POSIX), but for a directory it is required to access its children.

Each file and directory has an owner, a group, and a mode. The mode is made up of the permissions for the owner, the permissions for members of the group, and the permissions for all other users.

By default, a client's identity is determined by the username and groups of the process it is running in. Because clients are remote, anyone can become an arbitrary user simply by creating an account of that name on the remote system. Permissions should therefore be used only in a cooperative community of users, as a mechanism for sharing filesystem resources and for avoiding accidental data loss, and not for securing resources in a hostile environment. (Note that the latest versions of Hadoop support Kerberos authentication, which removes these restrictions; see the "Security" section.) Despite these limitations, it is worthwhile having permissions enabled (as they are by default; see the dfs.permissions property) to avoid accidental modification or deletion of substantial parts of the filesystem, whether by users or by automated tools and programs.

When permissions checking is enabled, the owner permissions are checked if the client's username matches the owner, and the group permissions are checked if the client is a member of the group; otherwise, the other permissions are checked.
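The owner, then group, then other order of the check can be sketched with ordinary POSIX-style mode bits. This is a minimal illustration, not HDFS's actual permission checker; it also leaves out the superuser, and the class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.Set;

/** Toy POSIX-style permission check: owner bits, then group bits, then other bits. */
final class PermissionCheckSketch {
    static final int READ = 4;
    static final int WRITE = 2;
    static final int EXECUTE = 1;

    /**
     * Returns true if `user` (a member of `userGroups`) is granted every
     * requested permission bit on an entry with the given mode, owner, and group.
     * `mode` is an octal mode such as 0644.
     */
    static boolean isAllowed(int mode, String owner, String group,
                             String user, Set<String> userGroups, int requested) {
        int bits;
        if (user.equals(owner)) {
            bits = (mode >> 6) & 7;      // owner permissions
        } else if (userGroups.contains(group)) {
            bits = (mode >> 3) & 7;      // group permissions
        } else {
            bits = mode & 7;             // permissions for everyone else
        }
        return (bits & requested) == requested;
    }

    public static void main(String[] args) {
        Set<String> noGroups = Collections.emptySet();
        // -rw-r--r-- owned by tom:supergroup, as in the listing shown earlier.
        System.out.println("tom may write: "
            + isAllowed(0644, "tom", "supergroup", "tom", noGroups, WRITE));
        System.out.println("another user may write: "
            + isAllowed(0644, "tom", "supergroup", "alice", noGroups, WRITE));
    }
}
```

Note that only one of the three permission sets is consulted for a given client, so an owner whose own bits deny access is refused even if the group or other bits would allow it.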

There is also the concept of a superuser, which is the identity of the namenode process. Permissions checks are not performed for the superuser.

3.4 Hadoop filesystems

Hadoop has an abstract notion of a filesystem, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations, as shown in Table 3-1.

Table 3-1. Hadoop filesystems

Filesystem | URI scheme | Java implementation (all under org.apache.hadoop) | Description
Local | file | fs.LocalFileSystem | A filesystem for a locally connected disk with client-side checksums. Use RawLocalFileSystem for a local filesystem with no checksums. See Section 4.1.2.
HDFS | hdfs | hdfs.DistributedFileSystem | Hadoop's distributed filesystem. HDFS is designed to work efficiently in conjunction with MapReduce.
HFTP | hftp | hdfs.HftpFileSystem | A filesystem providing read-only access to HDFS over HTTP (despite its name, HFTP has no connection with FTP). Often used with distcp (see Section 3.8) to copy data between HDFS clusters running different versions.
HSFTP | hsftp | hdfs.HsftpFileSystem | A filesystem providing read-only access to HDFS over HTTPS (again, this has no connection with FTP).
WebHDFS | webhdfs | hdfs.web.WebHdfsFileSystem | A filesystem providing secure read-write access to HDFS over HTTP. WebHDFS is intended as a replacement for HFTP and HSFTP.
HAR | har | fs.HarFileSystem | A filesystem layered on another filesystem for archiving files. Hadoop archives are typically used for archiving files in HDFS to reduce the namenode's memory usage. See Section 3.9.
KFS (CloudStore) | kfs | fs.kfs.KosmosFileSystem | CloudStore (formerly the Kosmos filesystem) is a distributed filesystem like HDFS or Google's GFS, written in C++. See http://kosmosfs.sourceforge.net/.
FTP | ftp | fs.ftp.FTPFileSystem | A filesystem backed by an FTP server.
S3 (native) | s3n | fs.s3native.NativeS3FileSystem | A filesystem backed by Amazon S3. See http://wiki.apache.org/hadoop/AmazonS3.
S3 (block based) | s3 | fs.s3.S3FileSystem | A filesystem backed by Amazon S3 that stores files in blocks (much like HDFS) to overcome S3's 5 GB file size limit.
Distributed RAID | hdfs | hdfs.DistributedRaidFileSystem | A "RAID" version of HDFS designed for archival storage. For each file in HDFS, a (smaller) parity file is created, which allows the HDFS replication to be reduced from three to two. This reduces disk usage by 25% to 30% while keeping the probability of data loss the same. Distributed RAID requires a RaidNode daemon to be running on the cluster.
View | viewfs | viewfs.ViewFileSystem | A client-side mount table for other Hadoop filesystems. Commonly used to create mount points for federated namenodes. See Section 3.2.3.

Hadoop provides many interfaces to its filesystems, and it generally uses the URI scheme to pick the correct filesystem instance to communicate with. For example, the filesystem shell that we met in the previous section works with all Hadoop filesystems. To list the files in the root directory of the local filesystem, type:

% hadoop fs -ls file:///

Although it is possible (and sometimes very convenient) to run MapReduce programs that access any of these filesystems, when you are processing large volumes of data you should choose a distributed filesystem that has the data locality optimization, notably HDFS (see Section 2.4).

Interfaces

Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide filesystem operations. The other filesystem interfaces are discussed briefly in this section. These interfaces are most commonly used with HDFS, because the other filesystems in Hadoop typically have existing tools for accessing the underlying filesystem (FTP clients for FTP, S3 tools for S3, and so on), but many of them will work with any Hadoop filesystem.

1. HTTP

There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons serve HTTP requests from clients; and via a proxy (or several proxies), which accesses HDFS on the client's behalf using the usual DistributedFileSystem API. The two approaches are illustrated in Figure 3-1.

Figure 3-1. Accessing HDFS over HTTP directly and via a bank of HDFS proxies

In the first case, directory listings are served by the namenode's embedded web server (which runs on port 50070) in XML or JSON format, while file data is streamed from the datanodes by their web servers (running on port 50075).

The original direct HTTP interfaces (HFTP and HSFTP) were read-only, whereas the new WebHDFS implementation supports all filesystem operations, including Kerberos authentication. WebHDFS must be enabled by setting dfs.webhdfs.enabled to true before you can use webhdfs URIs.

The second approach relies on one or more standalone proxy servers to access HDFS over HTTP. (Because the proxies are stateless, they can run behind a standard load balancer.) All traffic to the cluster passes through the proxy. This allows stricter firewall and bandwidth-limiting policies to be put in place. It is common to use a proxy for transfers between Hadoop clusters located in different data centers.

The original HDFS proxy (in src/contrib/hdfsproxy) was read-only and could be accessed by clients using the HSFTP FileSystem implementation (hsftp URIs). From release 1.0.0, there is a new proxy called HttpFS that has read and write capabilities and exposes the same HTTP interface as WebHDFS, so clients can access either one using webhdfs URIs.

The HTTP REST API that WebHDFS exposes is formally defined in a specification, so it is likely that clients written in languages other than Java will come to use it directly over time.

2. C language

Hadoop provides a C language library called Libhdfs, a mirror of the Java,filesystem interface class (which is written to access the C language library of HDFs, but it can actually access all Hadoop file systems). It uses the Java Native Interface (Java Native interface,jni) to invoke the Java file System client.

The C language API is very similar to the Java API, but its development generally lags behind the Java API, so some new features may not be supported at the moment. CAPI documentation can be found in the Libhdfs/docs/api directory of the Hapdoop release package.

Hadoop comes with prebuilt libhdfs binaries for 32-bit Linux, but for other platforms you will need to build them yourself by following the instructions at http://wiki.apache.org/hadoop/LibHDFS.

3. FUSE

Filesystem in Userspace (FUSE) allows file systems that are implemented in user space to be integrated as UNIX file systems. Hadoop's Fuse-DFS contrib module allows any Hadoop file system (but typically HDFS) to be mounted as a standard file system. You can then use UNIX utilities (such as ls and cat) to interact with the file system, as well as POSIX libraries to access it from any programming language.

Fuse-DFS is implemented in C and uses libhdfs as its interface to HDFS. Documentation for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of the Hadoop distribution.

3.5 Java Interface

In this section, we dig into Hadoop's FileSystem class: the API for interacting with one of Hadoop's file systems. Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across file systems. This is very useful when testing your program, for example, because you can rapidly run tests using data stored on the local file system.

3.5.1 Reading Data from a Hadoop URL

The simplest way to read a file from a Hadoop file system is to use a java.net.URL object to open a stream to read the data from. The general idiom is:

    InputStream in = null;
    try {
      in = new URL("hdfs://host/path").openStream();
      // process in
    } finally {
      IOUtils.closeStream(in);
    }

There's a little more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop. The next section discusses an alternative.
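The once-per-JVM restriction can be observed with the JDK alone. The following is a minimal sketch, not Hadoop code: NoOpFactory is a hypothetical factory that returns null for every protocol (so the JDK falls back to its built-in handlers), and tryRegister() is a helper invented for this example. Catching Error is done here only to make the behavior visible.

```java
import java.net.URL;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class HandlerFactoryOnce {
    // Hypothetical factory: returning null lets the JDK use its default handlers.
    static class NoOpFactory implements URLStreamHandlerFactory {
        @Override
        public URLStreamHandler createURLStreamHandler(String protocol) {
            return null;
        }
    }

    // Attempts to install a factory; returns false if one is already installed.
    static boolean tryRegister() {
        try {
            URL.setURLStreamHandlerFactory(new NoOpFactory());
            return true;
        } catch (Error alreadyDefined) {
            // The JDK signals a second registration with java.lang.Error.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryRegister()); // first attempt in this JVM
        System.out.println(tryRegister()); // any later attempt is rejected
    }
}
```

Whichever component registers first wins, which is why a library that installs its own factory prevents the URL-based approach from working.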

Example 3-1 shows a program for displaying files from Hadoop file systems on standard output, like the UNIX cat command.

Example 3-1. Displaying files from a Hadoop file system on standard output using a URLStreamHandler

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
    import org.apache.hadoop.io.IOUtils;

    public class URLCat {

      // Register the Hadoop handler factory once, when the class is loaded.
      static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
      }

      public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
          in = new URL(args[0]).openStream();
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }

We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out in this case). The last two arguments to the copyBytes method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn't need to be closed.


Here's a sample run:

    % hadoop URLCat hdfs://localhost/user/tom/quangle.txt
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

3.5.2 Reading Data Using the FileSystem API

As explained in the previous section, it is sometimes impossible to set the URLStreamHandlerFactory instance in an application. In this case, you need to use the FileSystem API to open the input stream for a file.

A file in a Hadoop file system is represented by a Hadoop Path object (and not a java.io.File object, since its semantics are too closely tied to the local file system). You can think of a Path as a Hadoop file system URI, such as hdfs://localhost/user/tom/quangle.txt.

FileSystem is a general file system API, so the first step is to retrieve an instance for the file system we want to use, HDFS in this case. There are several static factory methods for getting a FileSystem instance:

    public static FileSystem get(Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf, String user) throws IOException

A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as conf/core-site.xml. The first method returns the default file system (as specified in conf/core-site.xml, or the default local file system if not specified there). The second uses the given URI's scheme and authority to determine the file system to use, falling back to the default file system if no scheme is specified in the given URI. The third retrieves the file system as the given user, which is important in the context of security (see section 9.6).
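To make the role of the URI concrete, the sketch below uses only java.net.URI to pull out the scheme and authority that the second factory method inspects. It does not call Hadoop; the class UriParts and its describe() helper are hypothetical names, and the wording of the returned strings is purely illustrative.

```java
import java.net.URI;

public class UriParts {
    // Describes which file system a URI would select, based only on its scheme and authority.
    static String describe(String uri) {
        URI u = URI.create(uri);
        if (u.getScheme() == null) {
            // No scheme: this is the case in which the default file system is used.
            return "default file system";
        }
        return u.getScheme() + " file system at " + u.getAuthority();
    }

    public static void main(String[] args) {
        // prints "hdfs file system at localhost"
        System.out.println(describe("hdfs://localhost/user/tom/quangle.txt"));
        // prints "default file system"
        System.out.println(describe("/user/tom/quangle.txt"));
    }
}
```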

In some cases, you may want to retrieve a local file system instance. For this, you can use the convenience method getLocal():

    public static LocalFileSystem getLocal(Configuration conf) throws IOException

With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file:

    public FSDataInputStream open(Path f) throws IOException
    public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

The first method uses the default buffer size of 4 KB.

Finally, we rewrite example 3-1 to get example 3-2.

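Example 3-2. Displaying files from a Hadoop file system on standard output by using the FileSystem directly

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {

      public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Select the file system from the URI's scheme and authority.
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
          in = fs.open(new Path(uri));
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }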

The program runs as follows:

    % hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

FSDataInputStream

In fact, the open() method on FileSystem returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:

    package org.apache.hadoop.fs;

    public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable {
      // implementation elided
    }

The Seekable interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file (getPos()):

    public interface Seekable {
      void seek(long pos) throws IOException;
      long getPos() throws IOException;
      boolean seekToNewSource(long targetPos) throws IOException;
    }

Calling seek() with a position that is greater than the length of the file results in an IOException. Unlike the skip() method of java.io.InputStream, which can only move forward relative to the current position, seek() can move to an arbitrary, absolute position in the file.
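The difference between relative and absolute positioning can be tried out locally without Hadoop. In this sketch, java.io.RandomAccessFile is only a stand-in for a seekable stream, and the class SeekVersusSkip and its helpers are invented for the example. One difference to keep in mind: seeking past the end of a RandomAccessFile does not throw, unlike the seek() described above.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class SeekVersusSkip {
    static final byte[] DATA = "0123456789".getBytes(StandardCharsets.US_ASCII);

    // skip() is relative: two skips of 3 bytes leave the stream at offset 6.
    static char afterTwoSkips() {
        try (InputStream in = new ByteArrayInputStream(DATA)) {
            in.skip(3);
            in.skip(3);
            return (char) in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // seek() is absolute: two seeks to offset 3 leave the file at offset 3.
    static char afterTwoSeeks() {
        try {
            File tmp = File.createTempFile("seek-demo", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), DATA);
            try (RandomAccessFile file = new RandomAccessFile(tmp, "r")) {
                file.seek(3);
                file.seek(3);
                return (char) file.read();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(afterTwoSkips()); // prints 6
        System.out.println(afterTwoSeeks()); // prints 3
    }
}
```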

Example 3-3 is a simple extension of Example 3-2 that writes a file to standard output twice: after writing it once, it seeks to the start of the file and streams through it once again.

Example 3-3. Displaying files from a Hadoop file system on standard output twice, by using seek()

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemDoubleCat {

      public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
          in = fs.open(new Path(uri));
          IOUtils.copyBytes(in, System.out, 4096, false);
          in.seek(0); // go back to the start of the file
          IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
          IOUtils.closeStream(in);
        }
      }
    }

Here's the result of running it on a small file:

    % hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.
    On the top of the Crumpetty Tree
    The Quangle Wangle sat,
    But his face you could not see,
    On account of his Beaver Hat.

FSDataInputStream also implements the PositionedReadable interface for reading parts of a file at a given offset:

    public interface PositionedReadable {

      public int read(long position, byte[] buffer, int offset, int length)
          throws IOException;

      public void readFully(long position, byte[] buffer, int offset, int length)
          throws IOException;

      public void readFully(long position, byte[] buffer) throws IOException;
    }

The read() method reads up to length bytes from the given position in the file into the buffer, starting at the given offset in the buffer. The return value is the number of bytes actually read; callers should check this value, as it may be less than length. The readFully() methods read length bytes into the buffer (or buffer.length bytes for the version that takes only a byte array), unless the end of the file is reached, in which case an EOFException is thrown.


All of these methods preserve the current offset in the file and are thread-safe (although FSDataInputStream is not designed for concurrent access, so it is better to create multiple instances). They therefore provide a convenient way to access another part of the file (metadata, perhaps) while reading the main body of the file. In fact, they are implemented on top of the Seekable interface: the current position is saved, the stream seeks to the requested position and reads, and the saved position is then restored.
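A positioned read that preserves the current offset can be built on any seekable stream by saving the position, seeking, reading, and seeking back. The sketch below shows that idea with java.io.RandomAccessFile as a local stand-in; it is not the Hadoop implementation, the class PositionedReadSketch is a hypothetical name, and it adds no synchronization, so it is not thread-safe.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class PositionedReadSketch {
    // Reads up to length bytes at the given position, then restores the previous offset.
    static int read(RandomAccessFile file, long position, byte[] buffer, int offset, int length)
            throws IOException {
        long oldPos = file.getFilePointer(); // remember where the caller was
        try {
            file.seek(position);
            return file.read(buffer, offset, length);
        } finally {
            file.seek(oldPos); // put the offset back, even if the read failed
        }
    }

    // Demo: a positioned read at offset 6 does not disturb sequential reading.
    static String demo() {
        try {
            File tmp = File.createTempFile("pread-demo", ".txt");
            tmp.deleteOnExit();
            Files.write(tmp.toPath(), "0123456789".getBytes(StandardCharsets.US_ASCII));
            try (RandomAccessFile file = new RandomAccessFile(tmp, "r")) {
                char first = (char) file.read(); // reads '0'; the offset is now 1
                byte[] buf = new byte[4];
                int n = read(file, 6, buf, 0, buf.length); // may return fewer than 4 bytes
                char next = (char) file.read(); // continues at offset 1, reads '1'
                return first + new String(buf, 0, n, StandardCharsets.US_ASCII) + next;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the return value of read() is checked through n, the demo stays correct even if fewer bytes than requested come back.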

Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.

3.5.3 Writing Data

The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:

    public FSDataOutputStream create(Path f) throws IOException

There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file, and file permissions.


The create() methods create any parent directories of the file to be written that don't already exist. Though convenient, this behavior may be unexpected. If you want the write to fail when the parent directory doesn't exist, you should check for the existence of the parent directory first by calling the exists() method.

There's also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:

    package org.apache.hadoop.util;

    public interface Progressable {
      public void progress();
    }

As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):

    public FSDataOutputStream append(Path f) throws IOException

The append operation allows a single writer to modify an already written file by opening it and writing data from the final offset in the file. With this API, applications that produce unbounded files, such as log files, can write to an existing file after having closed it. The append operation is optional and not implemented by all Hadoop file systems. For example, HDFS supports append (the implementation in Hadoop 1.x has had reliability problems, so it is generally recommended to use append only in releases after 1.x, which contain a new, rewritten implementation), but the S3 file systems do not.


Example 3-4 shows how to copy a local file to a Hadoop file system. We illustrate progress by printing a period every time the progress() method is called by Hadoop, which is after each 64 KB packet of data is written to the datanode pipeline. (Note that this particular behavior is not specified by the API, so it is subject to change in later versions of Hadoop. The API merely lets you infer that something is happening.)

Example 3-4. Copying a local file to a Hadoop file system

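    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.util.Progressable;

    public class FileCopyWithProgress {
      public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // Print a period each time Hadoop reports progress.
        OutputStream out = fs.create(new Path(dst), new Progressable() {
          public void progress() {
            System.out.print(".");
          }
        });

        // The final argument closes both streams when the copy completes.
        IOUtils.copyBytes(in, out, 4096, true);
      }
    }

Typical usage:

    % hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/1400-8.txt
    ........

Currently, none of the other Hadoop file systems call progress() during writes. Progress is important in MapReduce applications, as you will see in later chapters.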
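The callback mechanism itself can be sketched without a cluster. In the example below, ProgressListener is a hypothetical stand-in for a progress callback (it is not a Hadoop type), and the copy loop invokes it each time another packetSize bytes have been written. The 64 KB packet size used in the demo is an assumption chosen for illustration; as noted above, the API does not promise any particular reporting interval.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class CopyWithProgress {
    // Hypothetical stand-in for a progress callback; not a Hadoop type.
    interface ProgressListener {
        void progress();
    }

    // Copies in to out, calling listener.progress() each time packetSize more bytes are written.
    static long copy(InputStream in, OutputStream out, int packetSize, ProgressListener listener) {
        byte[] buffer = new byte[4096];
        long total = 0;
        long sinceLastReport = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
                total += n;
                sinceLastReport += n;
                while (sinceLastReport >= packetSize) {
                    listener.progress();
                    sinceLastReport -= packetSize;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    // Counts the callbacks made during an in-memory copy of the given number of bytes.
    static int countCallbacks(int bytes, int packetSize) {
        int[] calls = {0};
        copy(new ByteArrayInputStream(new byte[bytes]), new ByteArrayOutputStream(),
                packetSize, () -> calls[0]++);
        return calls[0];
    }

    public static void main(String[] args) {
        // 200 KB copied with 64 KB packets gives three callbacks.
        System.out.println(countCallbacks(200 * 1024, 64 * 1024));
    }
}
```

A caller only learns that data is moving, not how much remains, which mirrors the limited guarantee the callback gives.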


FSDataOutputStream

The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:

    package org.apache.hadoop.fs;

    public class FSDataOutputStream extends DataOutputStream implements Syncable {

      public long getPos() throws IOException {
        // implementation elided
      }

      // implementation elided
    }

However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file or appends to an already written file. In other words, there is no support for writing anywhere other than the end of the file, so there is no value in being able to seek while writing.


3.5.4 Directories

FileSystem provides a method to create a directory:

    public boolean mkdirs(Path f) throws IOException

This method creates all of the necessary parent directories if they don't already exist, just like the mkdirs() method of java.io.File. It returns true if the directory (and all parent directories) was successfully created.


Often, you don't need to explicitly create a directory, because writing a file by calling create() automatically creates any parent directories.
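Since the behavior is modeled on java.io.File, the local equivalent gives a quick feel for it. This is a minimal sketch that runs against the local disk rather than a Hadoop file system; the class MkdirsDemo and the directory names are made up for the example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

public class MkdirsDemo {
    // Creates base/a/b/c in one call and reports whether every level now exists.
    static boolean createNested() {
        try {
            File base = Files.createTempDirectory("mkdirs-demo").toFile();
            File nested = new File(base, "a/b/c");
            boolean created = nested.mkdirs(); // also creates the missing parents a and a/b
            return created && new File(base, "a/b").isDirectory() && nested.isDirectory();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(createNested()); // prints true
    }
}
```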

3.5.5 Querying the File System

1. File metadata: FileStatus

An important feature of any file system is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates file system metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.

The getFileStatus() method on FileSystem provides a way of getting a FileStatus object for a single file or directory. Example 3-5 shows an example of its use.

Example 3-5. Demonstrating file status information

    public class ShowFileStatusTest {

      private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing
      private FileSystem fs;

      @Before
      public void setUp() throws IOException {
        Configuration conf = new Configuration();
        if (System.getProperty("test.build.data") == null) {
          System.setProperty("test.build.data", "/tmp");
        }
        cluster = new MiniDFSCluster(conf, 1, true, null);
        fs = cluster.getFileSystem();
        OutputStream out = fs.create(new Path("/dir/file"));
        out.write("content".getBytes("UTF-8"));
        out.close();
      }

      @After
      public void tearDown() throws IOException {
        if (fs != null) { fs.close(); }
        if (cluster != null) { cluster.shutdown(); }
      }

      @Test(expected = FileNotFoundException.class)
      public void throwsFileNotFoundForNonExistentFile() throws IOException {
        fs.getFileStatus(new Path("no-such-file"));
      }

      @Test
      public void fileStatusForFile() throws IOException {
        Path file = new Path("/dir/file");
        FileStatus stat = fs.getFileStatus(file);
        assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
        assertThat(stat.isDir(), is(false));
        assertThat(stat.getLen(), is(7L));
        assertThat(stat.getModificationTime(),
            is(lessThanOrEqualTo(System.currentTimeMillis())));
        assertThat(stat.getReplication(), is((short) 1));
        assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));
        assertThat(stat.getOwner(), is("tom"));
        assertThat(stat.getGroup(), is("supergroup"));
        assertThat(stat.getPermission().toString(), is("rw-r--r--"));
      }

      @Test
      public void fileStatusForDirectory() throws IOException {
        Path dir = new Path("/dir");
        FileStatus stat = fs.getFileStatus(dir);
        assertThat(stat.getPath().toUri().getPath(), is("/dir"));
        assertThat(stat.isDir(), is(true));
        assertThat(stat.
