About HDFS
Put plainly, Hadoop is a cluster framework for storing and processing big data for analytics; its most important component is HDFS, the Hadoop Distributed File System.
1.
HDFS is a filesystem that stores very large files with a streaming data-access pattern (write once, read many times). It does not require high-end hardware; ordinary commodity hardware is sufficient.
Applications for which HDFS is currently not a good fit: low-latency data access, large numbers of small files, and multiple writers or arbitrary file modifications.
2.
HDFS stores data in blocks, with a default block size of 64 MB. Blocks are made this large mainly to reduce seek time: data-transfer rates keep increasing, and if HDFS had to seek frequently while processing big data, seeking would inevitably dominate the running time.
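To see how a given file is actually split into blocks, and which datanodes hold the replicas, you can run the fsck tool (the path below is illustrative):
% hadoop fsck /user/tom/test.txt -files -blocks -locations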
An HDFS cluster has two types of nodes: a single namenode and multiple datanodes, operating in a manager-worker pattern. The namenode manages the filesystem namespace: it maintains the filesystem tree and the metadata for all files and directories, and it knows the datanodes on which all the blocks for a given file are located. The datanodes store and retrieve the blocks themselves. Losing the namenode therefore paralyzes HDFS, so Hadoop offers two mechanisms to guard against this:
One is to replicate the persistent files that make up the filesystem metadata, for example by writing them to the local disk as well as to a remote NFS mount.
The other is to run a secondary namenode.
3.
HDFS provides a command-line interface for interacting with the filesystem.
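For example (the file and directory names are illustrative):
% hadoop fs -mkdir /user/tom
% hadoop fs -copyFromLocal test.txt /user/tom/test.txt
% hadoop fs -ls /user/tom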
4.
Hadoop has an abstract notion of a filesystem, of which HDFS is one concrete implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and there are several concrete implementations of it.
Hadoop provides interfaces to many different filesystems, and it generally uses the URI scheme to determine which filesystem implementation to communicate with.
5.
Hadoop is written in Java, so the Java interface is undoubtedly the most important one. Below are some concrete uses of the Java interface.
(1) Data read:
Reading data using a URL
Java can be made to recognize Hadoop's hdfs URL scheme by calling the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory.
Note: this method can be called at most once per Java virtual machine, so it is typically invoked from a static block. A consequence is that if some other part of your program (perhaps a third-party component outside your control) sets a URLStreamHandlerFactory, you can no longer use this approach to read data from Hadoop.
Code:
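A minimal sketch of the URLCat program, consistent with the run and output below (error handling kept to the essentials):

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // setURLStreamHandlerFactory may only be called once per JVM,
        // hence the static initializer
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            // java.net.URL now understands the hdfs:// scheme
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}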
Sample run:
% hadoop URLCat hdfs://localhost/user/tom/test.txt
Results:
Hello World, Hello World
Hello World
Hello World, Hello World
Reading data using the FileSystem API
The code below speaks for itself; see the comments.
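A minimal sketch of such a program (reading the same file as in the URL example):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // The URI scheme (hdfs://...) selects the FileSystem implementation
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // open() returns an FSDataInputStream, which also supports seek()
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}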
(2) Data write
The FileSystem class has a series of methods for creating files.
public FSDataOutputStream create(Path f) throws IOException
Note that create() will create any parent directories of the file that do not already exist; if you want to avoid this, you can first call exists() to check whether the parent directory exists.
There is also an overloaded method that takes a Progressable, a callback interface through which your application is notified of the progress of the data being written to the datanodes (see the sketch at the end of this subsection):
package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
As an alternative to creating a new file, you can open an existing file for appending using the following method:
public FSDataOutputStream append(Path f) throws IOException
This method allows data to be appended at the end of an open file.
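Putting create() and Progressable together, here is a sketch that copies a local file into HDFS and prints a dot on each progress callback (file names come from the command line):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // progress() is called back as data is written to the datanodes
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });
        // The final argument closes both streams when the copy completes
        IOUtils.copyBytes(in, out, 4096, true);
    }
}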
(3) Directories
FileSystem provides a method for creating a directory:
public boolean mkdirs(Path f) throws IOException
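Like java.io.File.mkdirs(), this creates any missing parent directories in one call and returns true on success. A minimal sketch, assuming a FileSystem fs obtained as in the listings above (the path is illustrative):
// Creates /user/tom/new as well if it does not exist yet
boolean ok = fs.mkdirs(new Path("/user/tom/new/dir"));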
(4) Querying the file system
The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
FileSystem's getFileStatus() method provides a way of getting the status object for a single file or directory.
If you only want to know whether a file exists, you can use the exists(Path f) method mentioned earlier.
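A short sketch of reading these fields, assuming a FileSystem fs as in the listings above (the path is illustrative):
FileStatus stat = fs.getFileStatus(new Path("/user/tom/test.txt"));
System.out.println(stat.getLen());              // file length in bytes
System.out.println(stat.getBlockSize());        // block size in bytes
System.out.println(stat.getReplication());      // replication factor
System.out.println(stat.getModificationTime()); // milliseconds since the epoch
System.out.println(stat.getOwner() + ":" + stat.getGroup());
System.out.println(stat.getPermission());       // e.g. rw-r--r--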
It is common to query a batch of files with a single wildcard expression, so Hadoop provides operations for expanding wildcard characters (globbing).
Hadoop supports the same set of wildcard characters as the Unix bash shell, through two FileSystem methods:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
The wildcard characters are the familiar shell ones: * (zero or more characters), ? (a single character), [ab] (character class), [^ab] (negated class), [a-b] (range), {a,b} (alternation), and \c (escapes the metacharacter c).
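A sketch of globbing in practice, again assuming a FileSystem fs as above (the /logs/2007/12 layout is a made-up example):
// Expand the pattern into all matching paths
FileStatus[] statuses = fs.globStatus(new Path("/logs/2007/12/*"));
for (Path p : FileUtil.stat2Paths(statuses)) { // FileUtil is in org.apache.hadoop.fs
    System.out.println(p);
}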
(5) Delete data
The delete() method on FileSystem permanently removes files and directories:
public boolean delete(Path f, boolean recursive) throws IOException
If f is a file or an empty directory, the value of recursive is ignored; a non-empty directory is removed only when recursive is true (otherwise an IOException is thrown).
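For example, assuming the same fs (the path is illustrative):
// Recursively remove a whole directory tree
boolean deleted = fs.delete(new Path("/user/tom/old-logs"), true);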
Hadoop: The Definitive Guide, learning notes (3)