Overview of HDFS fundamentals and basic operations


HDFS Fundamentals

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on common (commodity) hardware. It has much in common with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive machines. It provides high-throughput data access and is well suited to applications with large-scale data sets. HDFS relaxes a subset of POSIX constraints to enable streaming access to file system data. HDFS was first developed as infrastructure for the Apache Nutch search engine project and is now part of the Apache Hadoop core project.

In short, HDFS is characterized by high fault tolerance on low-cost hardware, high throughput for applications with very large data sets, and relaxed POSIX requirements in favor of streaming access to data in the file system.

The main components of HDFS: NameNode, SecondaryNameNode, DataNode.

The basic operations of HDFS: read, write, balance.

Figures (images not included):

Figure 1, NameNode
Figure 2, SecondaryNameNode
Figure 3, the client's read request
Figure 4, reading blocks on the DataNodes
Figure 5, the write file request
Figure 6, preparing to write
Figure 7, single-block pipelined write to the DataNodes
Figure 8, multi-block pipelined write to the DataNodes
Figure 9, rewriting a damaged replica of a data block
Figure 10, an unbalanced cluster
Figure 11, a balanced cluster



Applications that run on HDFS have very large data sets; a typical HDFS file is gigabytes to terabytes in size. HDFS is therefore tuned to support large files. It should provide high aggregate data bandwidth, scale to hundreds of nodes in a single cluster, and support tens of millions of files in a single cluster.

Most HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it does not need to be changed. This assumption simplifies data consistency issues and makes high-throughput data access possible. A MapReduce application or a web crawler fits this model perfectly.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates client access to files, together with a number of DataNodes, usually one per node in the cluster, each of which manages the storage attached to the node it runs on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored on a set of DataNodes. The NameNode performs file system namespace operations such as opening, closing, and renaming files and directories; it also determines the mapping of blocks to DataNodes. The DataNodes serve read and write requests from the file system's clients, and also carry out block creation, deletion, and replication instructions from the NameNode.

The NameNode and DataNodes are pieces of software that run on ordinary machines, typically under GNU/Linux. HDFS is written in Java, so any machine that supports Java can run a NameNode or a DataNode, and the high portability of the Java language makes it easy to deploy HDFS on a wide range of machines. A typical deployment dedicates one machine to running the NameNode software, while each of the other machines in the cluster runs one DataNode instance. The architecture does not preclude running multiple DataNodes on a single machine, but real deployments rarely do so. Having only one NameNode in the cluster greatly simplifies the architecture of the system: the NameNode is the arbitrator and repository for all HDFS metadata, and the user's actual data never flows through the NameNode.
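
A quick way to see this master/slave structure on a live cluster is to ask the NameNode for a report of the DataNodes it knows about. A minimal sketch, assuming the classic Hadoop 1.x shell is on the PATH (newer releases spell it hdfs dfsadmin):

    $ hadoop dfsadmin -report   # prints capacity, usage, and one section per live DataNode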

HDFS is a good distributed file system with many advantages, but it also has drawbacks: it is not suitable for low-latency data access, it cannot efficiently store large numbers of small files, and it does not support multiple concurrent writers or arbitrary modification of files.

Basic operation commands commonly used in HDFS

1, The HDFS shell

To see what the HDFS shell offers, run hadoop fs.
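
With no arguments, the shell prints a usage summary listing every command option. A sketch of what this looks like (the exact wording and list vary by Hadoop version):

    $ hadoop fs
    Usage: java FsShell
               [-ls <path>]
               [-du <path>]
               [-mv <src> <dst>]
               [-cp <src> <dst>]
               [-rm <path>]
               ...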

2, Common commands

-ls Show the directory structure

This command option lists the directory structure of the specified path and is followed by an HDFS path (the example below uses the HDFS root). The output format is very similar to that of the Linux command ls -l; each line breaks down as follows:

1. The initial character indicates whether the entry is a directory ("d") or a file ("-");

2. The next 9 characters represent the permissions;

3. The following number or "-" indicates the number of replicas: a file shows its replication factor as a number, while a directory shows "-";

4. The following "root" indicates the owner;

5. The following "supergroup" denotes the group;

6. The following number ("0", "6176", "37645" in these examples) is the file size in bytes;

7. The time that follows is the last modification time;

8. The last item is the file path.
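
For example, listing the HDFS root might produce output like this (entries are illustrative; the sizes 0, 6176, and 37645 match the examples above):

    $ hadoop fs -ls /
    Found 3 items
    drwxr-xr-x   - root supergroup          0 2016-03-01 10:15 /tmp
    -rw-r--r--   3 root supergroup       6176 2016-03-01 10:20 /input.txt
    -rw-r--r--   2 root supergroup      37645 2016-03-02 09:05 /data.log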

-du Show the size of each file under a directory
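
For example, with a hypothetical /user/hadoop directory (older releases print the size and the full URI of each file; the exact format varies by version):

    $ hadoop fs -du /user/hadoop
    Found 2 items
    6176        hdfs://namenode:9000/user/hadoop/input.txt
    37645       hdfs://namenode:9000/user/hadoop/data.log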

-mv Move

This command option moves an HDFS file into the specified HDFS directory. It is followed by two paths: the first is the source file, and the second is the destination directory.
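
For example, moving a file into an archive directory (both paths are hypothetical):

    $ hadoop fs -mv /user/hadoop/a.txt /user/hadoop/archive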

-cp Copy

This command option copies the specified HDFS file into the specified HDFS directory. It is followed by two paths: the first is the file to copy, and the second is the destination.
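
For example (hypothetical paths):

    $ hadoop fs -cp /user/hadoop/a.txt /backup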

-rm Delete files or empty directories

This command option deletes the specified file or empty directory.

-rmr Delete recursively

This command option recursively deletes the specified directory together with all subdirectories and files under it.
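
For example, deleting a single file and then a whole directory tree (hypothetical paths; -rmr is irreversible, so use it with care):

    $ hadoop fs -rm /user/hadoop/a.txt
    $ hadoop fs -rmr /user/hadoop/old-logs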

-put Upload files

This command option copies files from the local Linux file system into HDFS.

-copyFromLocal Copy from local

Behaves the same as -put.

-moveFromLocal Move from local

This command option moves files from the local Linux file system into HDFS.
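
The three upload variants side by side (local and HDFS paths are hypothetical):

    # copy a local file into HDFS, keeping the local copy
    $ hadoop fs -put /tmp/a.txt /user/hadoop/
    # identical in effect to -put
    $ hadoop fs -copyFromLocal /tmp/b.txt /user/hadoop/
    # upload, then delete the local copy
    $ hadoop fs -moveFromLocal /tmp/c.txt /user/hadoop/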

-getmerge Merge to local

This command option merges the contents of all files in the specified HDFS directory into a single local Linux file.
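
For example, collecting the part files of a job's output directory into one local file (hypothetical paths):

    $ hadoop fs -getmerge /user/hadoop/output merged.txt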

-cat View the contents of a file

This command option displays the contents of a file.

-text View the contents of a file

This command option behaves and is used in the same way as -cat, so it is not described further here.
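
For example (hypothetical path):

    $ hadoop fs -cat /user/hadoop/a.txt
    $ hadoop fs -text /user/hadoop/a.txt   # same effect for plain text files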

-mkdir Create an empty directory

This command option creates a directory; it is followed by the HDFS path of the directory to be created.
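
For example (hypothetical path):

    $ hadoop fs -mkdir /user/hadoop/newdir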

-setrep Set the number of replicas

This command option changes the number of replicas kept for a stored file; it is followed by the replica count and then the file path.
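
For example, setting the replication factor of a file to 2 (hypothetical path; an -R flag applies it recursively to a directory):

    $ hadoop fs -setrep 2 /user/hadoop/a.txt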

-touchz Create an empty file

This command option creates a zero-length file in HDFS.
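
For example (hypothetical path):

    $ hadoop fs -touchz /user/hadoop/empty.txt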

-stat Display statistics for a file

This command option displays statistics about a file.
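
For example (hypothetical path; with no format string, -stat prints the modification time):

    $ hadoop fs -stat /user/hadoop/a.txt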

-tail View the end of a file

This command option displays the last 1 KB of a file and is typically used to view logs. With the -f option, new content is printed automatically as the file changes.
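
For example (hypothetical path):

    $ hadoop fs -tail /logs/app.log
    $ hadoop fs -tail -f /logs/app.log   # keep printing as the file grows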

-chmod Modify file permissions

This command option works like chmod in the Linux shell: it modifies the permissions of a file.
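
For example (hypothetical paths):

    $ hadoop fs -chmod 644 /user/hadoop/a.txt
    $ hadoop fs -chmod -R 755 /user/hadoop/dir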

-chown Modify the owner

This command option changes the owner of a file.
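
For example (hypothetical paths; an owner and optionally a group are given, and -R applies the change recursively):

    $ hadoop fs -chown hadoop /user/hadoop/a.txt
    $ hadoop fs -chown -R hadoop:supergroup /user/hadoop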

-help Help

This command option displays help information and is followed by the name of the command option to look up.
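
For example, to read the help text for -ls:

    $ hadoop fs -help ls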


3, The data read and write process

Pipeline (pipelined) write: when a client writes data to an HDFS file, the data is first written to a local file. Suppose the HDFS file has a replication factor of 3. When the local file accumulates a full block of data, the client obtains a list of DataNodes from the NameNode; this list names the DataNodes that will hold the replicas of that block. The client then flushes the block to the first DataNode. The first DataNode receives the data in small 4 KB portions, writes each portion to its local repository, and transfers it to the second DataNode in the list. The second DataNode likewise writes each portion to its local repository and passes it on to the third DataNode, which writes it to its local repository. A DataNode can thus receive data from the previous node and forward it to the next node at the same time, so the data is pipelined from one DataNode to the next. As a result, writing to three nodes takes little more time than writing to one.
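
You can see where the pipeline placed each block's replicas with fsck. A sketch, assuming a hypothetical file path:

    $ hadoop fsck /user/hadoop/big.dat -files -blocks -locations
    # lists every block of the file, its replication factor,
    # and the DataNode addresses holding each replica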

