HDFS Introduction
Disclaimer: This article contains my personal understanding and notes based on Hadoop: The Definitive Guide. It is for reference only. If you find any mistakes, please point them out so that we can learn and improve together.
To put it bluntly, Hadoop is a cluster framework for big data processing and analysis. Its most important component is HDFS, the Hadoop Distributed File System.
1. The Design of HDFS
HDFS is a system for storing very large files with streaming data access (a write-once, read-many pattern). It does not require high-end hardware; ordinary commodity hardware is sufficient.
HDFS is currently not well suited to applications that require low-latency data access, that handle large numbers of small files, or that need arbitrary file modifications by multiple writers.
2. HDFS Concepts
HDFS storage is block-based; the default block size is 64 MB. HDFS blocks are this large mainly to minimize the cost of seeks: when a block is large, the time to transfer its data from disk dominates the time to seek to its start. If blocks were small, processing big data would involve frequent seeks and the total running time would grow.
An HDFS cluster has two types of nodes: one namenode and multiple datanodes. The namenode acts as the manager and the datanodes act as workers. The namenode maintains the filesystem tree and the metadata for all files and directories, including which blocks make up each file; the datanodes store and retrieve the blocks themselves. Losing the namenode therefore leaves HDFS unusable, so Hadoop provides two mechanisms to guard against this:
One is to back up the files that make up the persistent state of the filesystem metadata: for example, writing them to the local disk and, at the same time, to a remote NFS mount.
The other is to run a secondary namenode, which periodically merges the namespace image with the edit log.
3. The Command-Line Interface
HDFS provides a command-line interface for interaction.
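As a sketch of what this interaction looks like (assuming a running HDFS cluster and that the `hadoop` command is on the PATH; the paths are illustrative), the `hadoop fs` command exposes familiar Unix-style file operations:

```shell
# List the contents of a directory (paths without a scheme default to HDFS)
hadoop fs -ls /user/tom

# Copy a local file into HDFS, then back out again
hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
hadoop fs -copyToLocal /user/tom/quangle.txt quangle.copy.txt

# Create a directory and inspect the result
hadoop fs -mkdir /user/tom/books
hadoop fs -ls /user/tom
```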
4. Hadoop Filesystems
Hadoop has an abstract notion of a filesystem, of which HDFS is just one concrete implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop, and it has several concrete implementations.
Hadoop provides interfaces to many filesystems, and it usually uses the URI scheme to determine which filesystem instance to communicate with.
5. The Java Interface
Hadoop is written in Java, so the Java interface is undoubtedly the most important. Below are some concrete uses of the Java API.
(1) Reading data
Reading data using a URL:
One way to make Java programs recognize Hadoop's hdfs URL scheme is to call the setURLStreamHandlerFactory method of URL with an instance of FsUrlStreamHandlerFactory.
Note: this method can only be called once per Java virtual machine, so it is usually executed in a static block. Consequently, if another part of your program (perhaps a third-party component outside your control) has already set a URLStreamHandlerFactory, you will not be able to use this approach to read data from Hadoop.
Code:
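The original code listing appears to have been lost from this copy; the sketch below follows the well-known URLCat example from Hadoop: The Definitive Guide (it assumes hadoop-common is on the classpath and an HDFS cluster is reachable):

```java
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Displays a file from a Hadoop filesystem on standard output,
// opening the stream through java.net.URL.
public class URLCat {

    static {
        // May only be called once per JVM, hence the static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            // Copy the stream to stdout in 4 KB chunks; don't close stdout.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```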
Run:
% hadoop URLCat hdfs://localhost/user/tom/test.txt
Result:
Hello world
Hello world
Hello world
Hello world
Reading data using the FileSystem API:
Read the code directly, paying attention to the comments.
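The listing itself seems to be missing from this copy; the sketch below follows the canonical FileSystemCat example from the book (again assuming hadoop-common on the classpath and a reachable cluster):

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Displays a file from a Hadoop filesystem on standard output,
// using the FileSystem API directly instead of java.net.URL.
public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        // Look up the FileSystem instance for this URI's scheme (e.g. hdfs://)
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            // open() returns an FSDataInputStream, which also supports seek()
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```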
(2) Writing data
The FileSystem class has a series of methods for creating files:

public FSDataOutputStream create(Path f) throws IOException
Note that create() will create any parent directories of the file that do not already exist; if this is not what you want, you can first call exists() to check whether the parent directory exists.
There is also an overloaded version of create() that takes a Progressable callback, so that the application we write is notified of the progress of data being written to the datanodes:
package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
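Putting the two together, a sketch along the lines of the book's FileCopyWithProgress example copies a local file into HDFS and prints a dot each time the callback fires:

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

// Copies a local file to a Hadoop filesystem, printing "." whenever
// the Progressable callback fires (i.e. as data reaches the datanodes).
public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });

        // Close both streams when the copy finishes (last argument: true).
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
```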
A file can also be created with the following method:

public FSDataOutputStream append(Path f) throws IOException

This method allows you to append data to the end of an existing file.
(3) Directories
FileSystem provides a method for creating a directory:

public boolean mkdirs(Path f) throws IOException

Like java.io.File.mkdirs(), it creates all necessary parent directories and returns true on success.
(4) Querying the filesystem
The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
FileSystem's getFileStatus() method provides a way to obtain a FileStatus object for a file or directory.
If you only want to check whether a file exists, you can use the exists(Path f) method mentioned above.
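A small sketch of getFileStatus() in use (assuming hadoop-common on the classpath; the URI is passed on the command line):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints a few of the FileStatus fields for a given path.
public class ShowFileStatus {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FileStatus stat = fs.getFileStatus(new Path(uri));
        System.out.println("path:        " + stat.getPath());
        System.out.println("length:      " + stat.getLen());
        System.out.println("block size:  " + stat.getBlockSize());
        System.out.println("replication: " + stat.getReplication());
        System.out.println("modified:    " + stat.getModificationTime());
        System.out.println("owner:       " + stat.getOwner());
        System.out.println("permission:  " + stat.getPermission());
    }
}
```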
Wildcards are often used in Hadoop to process batches of files in a single operation. FileSystem therefore provides two globbing methods that support the same wildcards as the Unix bash shell:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
Supported wildcards (the same set as bash globbing):
* matches zero or more characters
? matches a single character
[ab] matches a single character in the set {a, b}
[^ab] matches a single character not in the set {a, b}
[a-b] matches a single character in the range a to b
{a,b} matches either expression a or expression b
\c escapes the metacharacter c
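A usage sketch of globStatus() (the pattern, e.g. /2007/*/31, is passed on the command line; assumes hadoop-common on the classpath):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Lists all paths matching a glob pattern such as /2007/*/31.
public class GlobList {
    public static void main(String[] args) throws Exception {
        String pattern = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(pattern), conf);

        FileStatus[] statuses = fs.globStatus(new Path(pattern));
        for (FileStatus status : statuses) {
            System.out.println(status.getPath());
        }
    }
}
```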
(5) Deleting data
The delete() method of FileSystem permanently removes files or directories:

public boolean delete(Path f, boolean recursive) throws IOException

If f is a file or an empty directory, the value of recursive is ignored; a non-empty directory is deleted only when recursive is true (otherwise an IOException is thrown).
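A minimal deletion sketch (the path is passed on the command line; assumes hadoop-common on the classpath):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Recursively deletes the given path; delete() returns true on success.
public class DeletePath {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        // recursive=true: a non-empty directory is removed along with
        // its contents; for files the flag is ignored.
        boolean deleted = fs.delete(new Path(uri), true);
        System.out.println("deleted: " + deleted);
    }
}
```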