What is a distributed file system
As data volumes grow beyond what a single operating system can manage, the data must be spread across disks managed by many operating systems, so a file system is needed to manage files across multiple machines: this is the distributed file system. A distributed file system is a file system that allows files to be shared across multiple hosts over a network, letting users on multiple machines share files and storage space.
HDFS concept
HDFS is short for the Hadoop Distributed File System, Hadoop's distributed file system. One of HDFS's design principles is that it can run on ordinary commodity hardware; even when hardware fails, its fault-tolerance strategy keeps the data available.
HDFS Architecture: NameNode and DataNode
The NameNode is the management node of the entire file system. It maintains the directory tree of the file system, the metadata of each file and directory, and the list of data blocks for each file. It also receives and handles users' requests.
Its files on disk include:
fsimage: the metadata image file; a snapshot of the NameNode's in-memory metadata at a point in time.
edits: the operation (edit) log file.
fstime: the time when the last checkpoint was taken.
All of these files are stored in the local Linux file system.
The DataNode provides storage for the actual file data.
Block: the most basic unit of storage. In Hadoop 2, the default HDFS block size is 128 MB. Unlike an ordinary file system, if a file in HDFS is smaller than one block, it does not occupy the whole block's storage space.
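As a quick check, the block size in effect can be read through the Java API. A minimal sketch, assuming a placeholder NameNode address of hdfs://namenode:9000:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        // hdfs://namenode:9000 is a placeholder; use your cluster's NameNode address.
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration());
        // Default block size for new files under the given path (128 MB in Hadoop 2).
        System.out.println("Default block size: " + fs.getDefaultBlockSize(new Path("/")) + " bytes");
        fs.close();
    }
}
```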
The components of the HDFS framework interact as follows:
First, the client communicates with the NameNode to request a file's metadata; the NameNode looks up the corresponding metadata and returns it to the client (note: the metadata is kept both in memory and on disk, one copy each, which is both safe and fast);
The client then starts reading the data blocks; note that they are read sequentially, not simultaneously, and the nearest replica is preferred (the data-locality principle);
At the same time, the NameNode sends instructions to the DataNodes, and the DataNodes replicate the data blocks among themselves.
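The same flow is visible from client code: opening a file fetches block metadata from the NameNode, while the bytes themselves are streamed from DataNodes. A minimal sketch, assuming a placeholder NameNode address and file path:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFlowDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration());
        // open() asks the NameNode for the file's block locations.
        FSDataInputStream in = fs.open(new Path("/hellohdfs.txt"));
        byte[] buf = new byte[1024];
        int n = in.read(buf); // the bytes are streamed from a DataNode replica
        System.out.println(new String(buf, 0, Math.max(n, 0)));
        in.close();
        fs.close();
    }
}
```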
Metadata Storage Details
The storage format for the metadata is as follows:
```
NameNode(FileName, replicas, block-ids, id2host...)
```
Example:
```
/test/a.log, 3, {blk_1, blk_2}, [{blk_1: [h0, h1, h3]}, {blk_2: [h0, h2, h4]}]
```
Description
a.log is stored with 3 replicas; the file is split into two blocks, blk_1 and blk_2. The first block is stored on the three machines h0, h1, and h3; the second block on h0, h2, and h4.
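This file → blocks → hosts mapping can also be queried through the Java API. A minimal sketch, assuming the placeholder NameNode address and the example file above:

```java
import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/test/a.log"));
        // Ask the NameNode which blocks make up the file and which DataNodes hold each one.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}
```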
HDFS Shell Common Commands
File system (FS) shell commands are invoked in the form bin/hadoop fs <args>.
All FS shell commands take URI paths as arguments.
The URI format is scheme://authority/path. For HDFS the scheme is hdfs; for the local file system the scheme is file. The scheme and authority are optional; if unspecified, the default scheme from the configuration is used.
```
hadoop fs -cat hdfs://host1:port1/file1 hdfs://host2:port2/file2
```
Prints the contents of the specified files to standard output;
```
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
```
Copies files from the source path to the destination path. Multiple source paths are allowed, in which case the destination must be a directory;
```
hadoop fs -get /user/hadoop/file localfile
```
Copies files to the local file system;
```
hadoop fs -put localfile /user/hadoop/hadoopfile
```
Copies one or more source paths from the local file system to the target file system.
Using the Java API to operate HDFS
After creating a new Java project, add the required jars. The HDFS-related jar files are under share/hadoop/common, share/hadoop/common/lib, and share/hadoop/hdfs; if you use Maven, you do not need to import them one by one (a single artifact such as hadoop-client pulls them in).
Since Hadoop is written in Java, all interactions with Hadoop file systems can be invoked through the Java API. The FileSystem class provides the file system operations.
Use FileSystem to download a file from HDFS to the local file system; the code is as follows:
```java
FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration()); // specify the NameNode (the address is a placeholder)
InputStream in = fs.open(new Path("/hellohdfs.txt"));              // the file on HDFS
OutputStream out = new FileOutputStream(new File("/root/myfile")); // write to myfile under /root
IOUtils.copyBytes(in, out, 4096, true);                            // copy the contents of in into out, closing both streams
```
Upload a local file to HDFS with the following code:
```java
InputStream in = new FileInputStream(new File("C:/w.txt"));  // read the file on the local system
FSDataOutputStream out = fs.create(new Path("/words.txt"));  // create the file on HDFS
IOUtils.copyBytes(in, out, 4096, true);
```

If the upload fails with a "no permission" error, you can use:

```java
FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration(), "root");
```
Impersonating the root user in this way resolves the permission problem.
Delete a file in HDFS:
```java
boolean flag = fs.delete(new Path("/words.txt"), true); // the path is an example; true means delete recursively
```
Create a new directory in HDFS:
```java
boolean flag = fs.mkdirs(new Path("/root/hellohdfs"));
```
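To verify the operations above, you can list what is stored in HDFS. A minimal sketch using FileSystem.listFiles, again with the placeholder NameNode address:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListFilesDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration());
        // true = recurse into subdirectories.
        RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/"), true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            System.out.println(f.getPath() + " (" + f.getLen() + " bytes)");
        }
        fs.close();
    }
}
```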
RPC - Remote Procedure Call Protocol
RPC (Remote Procedure Call) is a protocol for requesting a service from a program on a remote computer over a network, without needing to know the underlying network technology.
RPC uses the client/server model. The requester is the client; the service provider is the server. First, the client sends a call message containing the procedure parameters to the server process and waits for a reply. On the server side, the process stays dormant until a call message arrives; when one does, the server extracts the procedure parameters, computes the result, sends back a reply message, and waits for the next call. Finally, the client receives the reply, extracts the result, and continues execution.
Hadoop's entire system is built on top of RPC.
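To make this concrete, Hadoop ships its own RPC framework in org.apache.hadoop.ipc.RPC. Below is a minimal sketch of a custom protocol under Hadoop 2.x (using the default WritableRpcEngine); the interface, method, bind address, and port are all invented for illustration:

```java
import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;

// Shared protocol interface; Hadoop RPC expects a versionID field on it.
interface GreetingProtocol {
    long versionID = 1L;
    String hello(String name);
}

public class RpcDemo implements GreetingProtocol {
    public String hello(String name) { return "hello, " + name; }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Server side: expose the implementation on a port (address/port are arbitrary here).
        RPC.Server server = new RPC.Builder(conf)
                .setProtocol(GreetingProtocol.class)
                .setInstance(new RpcDemo())
                .setBindAddress("localhost")
                .setPort(9527)
                .build();
        server.start();

        // Client side: obtain a proxy and call the remote method as if it were local.
        GreetingProtocol proxy = RPC.getProxy(GreetingProtocol.class, GreetingProtocol.versionID,
                new InetSocketAddress("localhost", 9527), conf);
        System.out.println(proxy.hello("hdfs"));
        RPC.stopProxy(proxy);
        server.stop();
    }
}
```

This client/server exchange is exactly the pattern described above; the client-NameNode and DataNode-NameNode protocols in HDFS are built in the same way.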