Concept
HDFS
HDFS (Hadoop Distributed File System) is a file system designed specifically for large-scale distributed data processing under frameworks such as MapReduce. A large data set (for example, 100 TB) can be stored in HDFS as a single file, something most other file systems cannot handle.
Data blocks (block)
The default basic storage unit of HDFS is a 64 MB data block (64 MB was the default in Hadoop 1.x; later releases default to 128 MB).
As in an ordinary file system, files in HDFS are split into blocks, here 64 MB each, and stored block by block.
Unlike an ordinary file system, a file in HDFS that is smaller than one block does not occupy the whole block's storage space.
Metadata node (NameNode), secondary metadata node (Secondary NameNode), and data node (DataNode)
The metadata node (NameNode) manages the namespace of the file system.
It stores the metadata of all files and directories in a file system tree.
This information is also persisted on disk as two files: the namespace image and the edit log.
It also records which data blocks make up each file and which data nodes hold those blocks. This information is not stored on disk, however; it is collected from the data nodes when the system starts.
Secondary metadata node (Secondary NameNode)
The secondary metadata node is not a standby that takes over when the metadata node fails; the two nodes have different responsibilities.
Its main job is to periodically merge the metadata node's namespace image with the edit log so that the log does not grow too large.
A copy of the merged namespace image is also kept on the secondary metadata node and can be used for recovery if the metadata node fails.
The data node (DataNode) is where the data in the file system is actually stored.
A client or the metadata node (NameNode) can request that a data block be written to or read from a data node.
Each data node periodically reports the data blocks it stores back to the metadata node.
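The block reports described above can be pictured with a small sketch. It is only an illustration under assumed names (build_block_map, the node names, and the block IDs are all hypothetical), not the NameNode's actual data structures: it shows how a block-to-location map could be rebuilt purely from what each data node reports.

```python
from collections import defaultdict


def build_block_map(block_reports):
    """Build a block -> data nodes map from per-node block reports.

    block_reports maps a data node name to the list of block IDs it stores.
    """
    locations = defaultdict(set)
    for datanode, block_ids in block_reports.items():
        for block_id in block_ids:
            locations[block_id].add(datanode)
    # Sort the node names so the result is deterministic.
    return {block_id: sorted(nodes) for block_id, nodes in locations.items()}


# Hypothetical reports sent by three data nodes at startup.
reports = {
    "datanode1": ["blk_1", "blk_2"],
    "datanode2": ["blk_1", "blk_3"],
    "datanode3": ["blk_2", "blk_3"],
}
block_map = build_block_map(reports)
print(block_map["blk_1"])  # ['datanode1', 'datanode2']
```

Because the map is derived entirely from the reports, nothing about block locations has to be written to the metadata node's disk.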
HDFS (Hadoop Distributed File System) is a distributed file system that provides high-throughput access to application data. To external clients, HDFS looks like a traditional hierarchical file system: you can create, delete, move, or rename files, and so on. The architecture of HDFS, however, is built on a specific set of nodes, which follows from its design. These nodes are the NameNode (only one), which provides the metadata service inside HDFS, and the DataNodes, which provide block storage for HDFS. Because there is only one NameNode, it is a single point of failure, which is a weakness of HDFS.
Files stored in HDFS are split into blocks, and the blocks are then replicated to multiple computers (DataNodes). This is very different from a traditional RAID architecture. The block size (typically 64 MB) and the number of replicas are determined by the client when the file is created. The NameNode controls all file operations. All communication inside HDFS is based on the standard TCP/IP protocol.
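To make the block and replica arithmetic concrete, here is a minimal sketch. The 64 MB block size comes from the text; the 200 MB file, the replication factor of 3, and the function names are assumptions chosen for illustration only.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size cited in the text


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return (number_of_blocks, size_of_last_block) for a file of file_size bytes."""
    if file_size == 0:
        return 0, 0
    full_blocks, remainder = divmod(file_size, block_size)
    if remainder == 0:
        return full_blocks, block_size
    # The last block only holds the leftover bytes, not a full block.
    return full_blocks + 1, remainder


def raw_storage(file_size, replication):
    """Bytes consumed across the cluster when every block is stored `replication` times."""
    return file_size * replication


MB = 1024 * 1024
size = 200 * MB  # a hypothetical 200 MB file
blocks, last = block_layout(size)
print(blocks, last // MB)            # 4 8
print(raw_storage(size, 3) // MB)    # 600
```

The last block holds only 8 MB, which matches the point that a partially filled block does not consume a full block of storage.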
Introduction to the basic concepts of HDFS
1. Data block (block): the default basic storage unit of HDFS is a 64 MB data block. As in an ordinary file system, files in HDFS are split into 64 MB blocks for storage. Unlike an ordinary file system, a file in HDFS that is smaller than one block does not occupy the whole block's storage space.
2. Metadata node (NameNode) and data node (DataNode)
The metadata node stores:
a. The namespace of the file system, which it manages. It keeps the metadata of all files and directories in a file system tree. This information is also saved on disk in two files: the namespace image and the edit log.
b. Which data blocks make up each file and which data nodes hold those blocks. This information is not stored on disk, however; it is collected from the data nodes when the system starts.
The data node stores:
The actual data of the file system. A client or the metadata node (NameNode) can request that a data block be written to or read from a data node. Each data node periodically reports the data blocks it stores back to the metadata node.
3. Secondary metadata node (Secondary NameNode)
The secondary metadata node is not a standby that takes over when the metadata node fails; it has different responsibilities from the metadata node. Its main job is to periodically merge the metadata node's namespace image with the edit log so that the log does not grow too large. This is covered in more detail later. A copy of the merged namespace image is also kept on the secondary metadata node and can be used for recovery if the metadata node fails.
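The merge of the namespace image with the edit log can be sketched as replaying logged operations onto a copy of the image. This is a simplified model under stated assumptions: the dictionary layout, the three operation types, and the function names are hypothetical and are not Hadoop's on-disk formats.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (path -> metadata)."""
    op = edit["op"]
    if op == "create":
        namespace[edit["path"]] = {"replication": edit["replication"]}
    elif op == "delete":
        namespace.pop(edit["path"], None)
    elif op == "rename":
        namespace[edit["dst"]] = namespace.pop(edit["src"])
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(image, edit_log):
    """Merge the edit log into a copy of the image; return (new_image, empty_log)."""
    merged = {path: dict(meta) for path, meta in image.items()}
    for edit in edit_log:
        apply_edit(merged, edit)
    # After the merge the log can start over empty, which keeps it small.
    return merged, []


image = {"/user/mdss/a.txt": {"replication": 3}}
edit_log = [
    {"op": "create", "path": "/user/mdss/b.txt", "replication": 3},
    {"op": "rename", "src": "/user/mdss/a.txt", "dst": "/user/mdss/c.txt"},
    {"op": "delete", "path": "/user/mdss/b.txt"},
]
new_image, new_log = checkpoint(image, edit_log)
print(sorted(new_image))  # ['/user/mdss/c.txt']
```

The original image is left untouched, so the merged copy is what a recovery would start from.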
Basic file commands
HDFS file system commands take the form:
hadoop fs -cmd <args>
where cmd is a specific file command and <args> is a variable number of arguments. The cmd name is usually the same as its UNIX counterpart. For example, the command to list files is: hadoop fs -ls.
Here is a look at the most common file management tasks in Hadoop.
Adding files and directories
hadoop fs -mkdir /user/mdss
The Hadoop mkdir command automatically creates any missing parent directories, like the UNIX mkdir command with the -p option. (This is the behavior of the Hadoop 1.x shell; later releases require an explicit -p option.)
hadoop fs -ls
This command lists directory and file information.
hadoop fs -lsr
This command recursively lists directories, subdirectories, and file information.
hadoop fs -put example.txt /user/mdss
This command copies the example.txt file from the local file system into the /user/mdss directory of the HDFS file system.
Retrieving Files
hadoop fs -get /user/mdss/example.txt .
This command retrieves the example.txt file from HDFS into the current directory of the local file system; it is the inverse of the -put command.
hadoop fs -cat /user/mdss/example.txt
Displays the contents of the example.txt file in the HDFS file system.
Hadoop file commands can be used with UNIX pipes, so their output can be sent to other UNIX commands for further processing. For example, if a file is very large (as Hadoop files typically are) and you want to check its contents quickly, you can pipe the output of the Hadoop cat command into the UNIX head command.
hadoop fs -cat /user/mdss/example.txt | head
Hadoop natively supports a tail command that shows the last kilobyte of a file.
hadoop fs -tail /user/mdss/example.txt
Deleting files
The rm command deletes the example.txt file from the HDFS file system; rm can also delete empty directories.
hadoop fs -rm /user/mdss/example.txt
The rmr command deletes recursively, removing directories together with the files and subdirectories they contain.
hadoop fs -rmr /user/mdss/
This deletes the /user/mdss/ directory and its subdirectories.
Copying Files
To copy a file from the local file system to the HDFS file system, use copyFromLocal:
hadoop fs -copyFromLocal example.txt /user/mdss/example.txt
To copy a file from the HDFS file system to the local file system, use copyToLocal:
hadoop fs -copyToLocal /user/mdss/example.txt example.txt
Check Help
To view the help for a specific command, use:
hadoop fs -help ls
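The commands in this section can also be driven from a script. The following is a minimal sketch, assuming the hadoop executable is on the PATH; the helper names hadoop_fs_argv and run_hadoop_fs are hypothetical. It only builds the hadoop fs -cmd <args> form described above and hands it to the operating system.

```python
import subprocess


def hadoop_fs_argv(cmd, *args):
    """Build the argument list for `hadoop fs -<cmd> <args>`."""
    return ["hadoop", "fs", f"-{cmd}", *args]


def run_hadoop_fs(cmd, *args):
    """Run the command and return its standard output as text.

    Assumes the `hadoop` executable is on PATH; raises
    subprocess.CalledProcessError if the command exits with a nonzero status.
    """
    result = subprocess.run(
        hadoop_fs_argv(cmd, *args), check=True, capture_output=True, text=True
    )
    return result.stdout


# Building the argument list does not require a Hadoop installation.
print(hadoop_fs_argv("put", "example.txt", "/user/mdss"))
# ['hadoop', 'fs', '-put', 'example.txt', '/user/mdss']
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.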
HDFS file command list
cat: hadoop fs -cat FILE [FILE ...]
Displays the contents of the files. To read a compressed file, use the text command instead.
chgrp: hadoop fs -chgrp [-R] GROUP PATH [PATH ...]
Changes the group of files and directories. The -R option applies the change recursively. The user must be the owner of the files or a superuser.
chmod: hadoop fs -chmod [-R] MODE[,MODE ...] PATH [PATH ...]
Changes the access permissions of files and directories. As with its UNIX counterpart, MODE can be a three-digit octal number or {augo}+/-{rwxX}. The -R option applies the change recursively. The user must be the owner of the files or a superuser.
chown: hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH [PATH ...]
Changes the owner of files and directories. The -R option applies the change recursively. The user must be a superuser.
copyFromLocal: hadoop fs -copyFromLocal LOCALSRC [LOCALSRC ...] DST
Equivalent to put; copies files from the local file system.
copyToLocal: hadoop fs -copyToLocal [-ignorecrc] [-crc] SRC [SRC ...] LOCALDST
Equivalent to get; copies files to the local file system.
count: hadoop fs -count [-q] PATH [PATH ...]
Displays the number of subdirectories, the number of files, the number of bytes used, and the name for each file or directory identified by PATH. The -q option also displays quota information.
cp: hadoop fs -cp SRC [SRC ...] DST
Copies files from the source to the destination. If more than one source is specified, the destination must be a directory.
du: hadoop fs -du PATH [PATH ...]
Displays file sizes. If PATH is a directory, the size of each file in that directory is shown. File names are given with the full URI protocol prefix. Note that although du stands for disk usage, the figures should not be taken too literally, because actual disk usage depends on the block size and the replication factor.
dus: hadoop fs -dus PATH [PATH ...]
Similar to du, but for a directory dus reports the sum of the file sizes rather than listing each file.
expunge: hadoop fs -expunge
Empties the trash. If the trash feature is enabled, a deleted file is first moved into the temporary .Trash/ directory and is permanently deleted only after a user-configurable delay. The expunge command forces the deletion of all files in the .Trash/ directory.
get: hadoop fs -get [-ignorecrc] [-crc] SRC [SRC ...] LOCALDST
Copies files to the local file system. If more than one source file is specified, the local destination must be a directory. If LOCALDST is -, the files are copied to stdout.
getmerge: hadoop fs -getmerge SRC [SRC ...] LOCALDST [addnl]
Retrieves all the files identified by SRC, merges them into a single file, and writes it to LOCALDST in the local file system. The addnl option adds a newline character at the end of each file.
help: hadoop fs -help [CMD]
Displays usage information for the command CMD. If CMD is omitted, usage information for all commands is displayed.
ls: hadoop fs -ls PATH [PATH ...]
Lists files and directories. Each entry shows the name, permissions, owner, group, size, and modification time. File entries also show their replication factor.
lsr: hadoop fs -lsr PATH [PATH ...]
The recursive version of ls.
mkdir: hadoop fs -mkdir PATH [PATH ...]
Creates directories, including any missing parent directories in the path (similar to the UNIX mkdir -p).
moveFromLocal: hadoop fs -moveFromLocal LOCALSRC [LOCALSRC ...] DST
Similar to put, except that the local source is deleted after it has been copied.
moveToLocal: hadoop fs -moveToLocal [-crc] SRC [SRC ...] LOCALDST
Displays a "Not implemented yet" message.
mv: hadoop fs -mv SRC [SRC ...] DST
Moves files from the source to the destination. If more than one source file is specified, the destination must be a directory. Moving across file systems is not allowed.
put: hadoop fs -put LOCALSRC [LOCALSRC ...] DST
Copies files or directories from the local file system to HDFS. If LOCALSRC is -, the input is read from stdin and DST must be a file.
rm: hadoop fs -rm PATH [PATH ...]
Deletes files and empty directories.
rmr: hadoop fs -rmr PATH [PATH ...]
The recursive version of rm.
setrep: hadoop fs -setrep [-R] [-w] REP PATH [PATH ...]
Changes the target replication factor of the given files to REP. The -R option recursively changes the target replication factor of all files under the directories identified by PATH. It takes some time for the actual replication factor to reach the target value. The -w option waits until the replication factor matches the target value.
stat: hadoop fs -stat [FORMAT] PATH [PATH ...]
Displays statistics about PATH. The FORMAT string is printed as is, except that the following format specifiers are replaced:
%b: file size in blocks (later Hadoop documentation describes this value as the size in bytes)
%F: the string "directory" or "regular file", depending on the file type
%n: file name
%o: block size
%r: replication factor
%y: modification time in UTC, in yyyy-MM-dd HH:mm:ss format
%Y: modification time as the number of milliseconds since January 1, 1970 (UTC)
tail: hadoop fs -tail [-f] FILE
Displays the last 1 KB of data in FILE.
test: hadoop fs -test -[ezd] PATH
Performs one of the following checks on PATH:
-e: whether PATH exists. Returns 0 if PATH exists.
-z: whether the file is empty. Returns 0 if its length is 0.
-d: whether PATH is a directory. Returns 0 if PATH is a directory.
text: hadoop fs -text FILE [FILE ...]
Displays the text content of the files. For a text file this is the same as cat. Files in a known compressed format (gzip, or Hadoop's binary sequence file format) are decompressed first.
touchz: hadoop fs -touchz FILE [FILE ...]
Creates files of length 0. An error is reported if a file already exists and its length is not 0.
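The chmod entry in the list above accepts a three-digit octal mode. As a reading aid, the sketch below converts such a mode into the rwx notation that ls displays. It is a standalone illustration (the function name octal_to_symbolic is hypothetical) and does not call Hadoop.

```python
def octal_to_symbolic(mode):
    """Convert a three-digit octal mode such as "754" to "rwxr-xr--"."""
    if len(mode) != 3 or any(ch not in "01234567" for ch in mode):
        raise ValueError(f"not a three-digit octal mode: {mode!r}")
    parts = []
    for digit in mode:  # one digit each for user, group, and other
        bits = int(digit, 8)
        parts.append(
            ("r" if bits & 4 else "-")
            + ("w" if bits & 2 else "-")
            + ("x" if bits & 1 else "-")
        )
    return "".join(parts)


print(octal_to_symbolic("754"))  # rwxr-xr--
```

For example, hadoop fs -chmod 754 PATH would give the owner full access, the group read and execute access, and everyone else read-only access.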