HDFS theory and basic commands

Source: Internet
Author: User
Tags: posix hdfs dfs hadoop fs

I have not yet verified the code and commands from chapter six; I am recording them here first and will update this post if anything changes after verification.

I. What is HDFS

1. HDFS is an easily scalable distributed file system.

2. It can run on large numbers of ordinary, low-cost machines and provides fault-tolerance mechanisms.

3. It provides file access services with good performance to a large number of users.

II. Advantages

High fault tolerance: data is automatically saved as multiple replicas, and when a replica is lost it is automatically recovered.

Suitable for batch processing: computation is moved to the data rather than data to the computation, and data locations are exposed to the computing framework.

Suitable for big data processing: data volumes of GB, TB, or even PB; file counts beyond millions; clusters of 10K+ nodes.

Streaming file access: write once, read many times, which keeps data consistency simple.

Can be built on inexpensive machines: reliability is improved through multiple replicas, together with fault-tolerance and recovery mechanisms.

III. Shortcomings

Low-latency data access: HDFS is not designed for millisecond-level latency; it trades latency for high throughput.

Small-file storage: large numbers of small files consume a great deal of NameNode memory, and seek time comes to exceed read time.

Concurrent writes and random file modification: a file can have only one writer at a time, and only appends are supported.

IV. HDFS Architecture

NameNode: the master (there is only one active NameNode); it manages the HDFS namespace and block mapping information, applies the replica placement policy, and handles client read and write requests.

Secondary/Standby NameNode: periodically merges the fsimage and edit log and pushes the result to the NameNode; in an HA setup, when the active NameNode fails, a standby quickly takes over as the new active NameNode.

DataNode: the slaves (there are many); they store the actual data blocks and perform block reads and writes.

Client: splits files into blocks; interacts with the NameNode to obtain file location information; interacts with DataNodes to read or write data; provides commands to manage and access HDFS.

HDFS block: a file is cut into fixed-size data blocks. The default block size is 64 MB and is configurable; a file smaller than the block size is stored as a single, smaller block. Blocks are made this large so that data transfer time exceeds seek time, which gives high throughput. When a file is stored in HDFS it is split into blocks by size and the blocks are distributed across different nodes; by default each block has three replicas.
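Block size, replication factor, and the DataNodes holding each block can be inspected from the Java API. A minimal sketch, assuming the usual Configuration/FileSystem setup described later in this article, an enclosing method that may throw IOException, and an illustrative path:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FileStatus status = fs.getFileStatus(new Path("/hdfs/data/file.txt")); // illustrative path
System.out.println("block size:  " + status.getBlockSize());   // e.g. 64 MB by default here
System.out.println("replication: " + status.getReplication()); // default 3
// one BlockLocation per block, listing the DataNodes that hold its replicas
for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
    System.out.println(loc.getOffset() + "+" + loc.getLength()
            + " on " + Arrays.toString(loc.getHosts()));
}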

Write process, read process, and typical physical topology: (diagrams not reproduced here)

V. HDFS Strategies

Block replica placement policy:

Replica 1: on the node where the client runs;

Replica 2: on a node in a different rack;

Replica 3: on another node in the same rack as replica 2;

Other replicas: placed on randomly selected nodes.

Reliability Policy:

Three common failure conditions: file corruption; network or machine failure; NameNode failure.

File integrity: CRC32 checksums are used; a corrupted replica is replaced by another copy.

Network or machine failure: heartbeats are used; each DataNode periodically sends a heartbeat to the NameNode.

NameNode failure: metadata is protected by the fsimage (file system image) and the edit log (operation log), both stored in multiple places, plus real-time failover between active and standby NameNodes.

Why HDFS is not suitable for storing small files:

1. Metadata is kept in NameNode memory, and a single node's memory is limited. Accessing a large number of small files also wastes a great deal of seek time; compare copying many small files with copying one large file of the same total size.

2. The number of blocks the NameNode can track is limited: the metadata for one block consumes roughly 200 bytes of memory, so 100 million blocks need about 20 GB. If each file is only 10 KB, then 100 million such files hold only about 1 TB of data, yet still consume that ~20 GB of NameNode memory.

VI. HDFS Access Modes

HDFS shell commands;

HDFS Java API;

HDFS REST API;

HDFS Fuse: implements the FUSE protocol;

HDFS libhdfs: C/C++ access interface;

HDFS programming APIs for other languages: implemented with Thrift; supports C++, Python, PHP, C#, and other languages.

HDFS Shell Commands

1. Upload a local file to HDFS

bin/hadoop fs -copyFromLocal /local/data /hdfs/data

2. Delete a file or directory

bin/hadoop fs -rmr /hdfs/data

3. Create a directory

bin/hadoop fs -mkdir /hdfs/data
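These shell commands have straightforward equivalents in the Java API introduced later in this article (upload is shown there); a minimal sketch for create and delete, with illustrative paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();   // picks up core-site.xml etc. from the classpath
FileSystem fs = FileSystem.get(conf);
fs.mkdirs(new Path("/hdfs/data"));          // like "hadoop fs -mkdir"
fs.delete(new Path("/hdfs/data"), true);    // recursive delete, like "hadoop fs -rmr"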

4. Some scripts

Under the Sbin directory: Start-all.sh;start-dfs.sh;start-yarn.sh;hadoop-deamon (s). Sh;

To start a service individually:

hadoop-deamon.sh start Namenode;

hadoop-deamons.sh start Namenode (login to each node via SSH);

5. File management command fsck:

Check the health status of files in HDFS

Find missing blocks and blocks with too few or too many replicas

View all block locations of a file

Delete corrupted data blocks

6. Data block rebalancing

bin/start-balancer.sh -threshold <percentage of disk capacity>

Percentage of disk capacity: the disk-usage deviation allowed once HDFS is considered balanced; the lower the value, the more evenly balanced the nodes become, but the longer balancing takes.

7. Set directory quotas

Limit the disk space a directory may use:

bin/hadoop dfsadmin -setSpaceQuota 1t /user/username

Limit the maximum number of subdirectories and files a directory may contain:

bin/hadoop dfsadmin -setQuota 10000 /user/username

8. Add/remove nodes

Adding a new DataNode:

Step 1: Copy the installation package (including configuration files, etc.) from an existing DataNode to the new DataNode;

Step 2: Start the new DataNode: sbin/hadoop-daemon.sh start datanode

Removing an old DataNode:

Step 1: Add the DataNode to the blacklist and refresh it: on the NameNode, add the DataNode's hostname or IP to the file specified by the dfs.hosts.exclude configuration option.

Step 2: Remove the DataNode: bin/hadoop dfsadmin -refreshNodes

HDFS Java API Introduction

Configuration class: an object of this class encapsulates the configuration information, which comes from core-*.xml;

FileSystem class: a file system class whose methods are used to manipulate files and directories. A FileSystem object is usually obtained through the static method FileSystem.get();

FSDataInputStream and FSDataOutputStream classes: the input and output streams of HDFS, obtained from FileSystem.open() and FileSystem.create() respectively.

The above classes all come from the Java package org.apache.hadoop.fs.

For example, copy a local file to HDFS:

Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
Path srcPath = new Path(srcFile);
Path dstPath = new Path(dstFile);
hdfs.copyFromLocalFile(srcPath, dstPath);

Create an HDFS file:

byte[] buff: the file contents to write

Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
Path path = new Path(fileName);
FSDataOutputStream outputStream = hdfs.create(path);
outputStream.write(buff, 0, buff.length);
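To read the file back, FileSystem.open() returns an FSDataInputStream. A minimal sketch that streams an HDFS file to standard output, reusing fileName from the example above and assuming an enclosing method that may throw IOException:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
FSDataInputStream in = hdfs.open(new Path(fileName));
IOUtils.copyBytes(in, System.out, 4096, false); // 4 KB buffer; do not close System.out
in.close();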

Supplement (from Baidu Encyclopedia): a rack is used to hold patch panels, enclosures, and equipment in a telecommunications cabinet, typically 19 inches wide and 7 feet high. In the IT industry it can simply be understood as the cabinet that holds servers; standard racks are also known as "19-inch" racks. Rack servers look more like network devices such as switches and routers than like tower machines, and are installed in a standard 19-inch cabinet; servers of this form factor are mostly function-oriented servers.

VII. New Features of Hadoop 2.0

NameNode HA

NameNode Federation

HDFS Snapshot (snapshot)

HDFS Cache (In-memory caches)

HDFS ACL

Heterogeneous tiered storage architecture (heterogeneous Storage hierarchy)

Heterogeneous Tiered Storage Architecture

Previously, HDFS abstracted all storage media on a node as disks with identical performance, for example:

<property>

<name>dfs.datanode.data.dir</name>

<value>/dir0,/dir1,/dir2,/dir3</value>

</property>

Background:

Storage media are increasingly varied, so a cluster may contain heterogeneous media such as disk, SSD, and RAM.

Multiple types of tasks try to run in the same Hadoop cluster at the same time, so batch, interactive, and real-time processing all need to be accommodated.

Data with different performance requirements is best stored on different types of storage media.

Principle:

Each node can be composed of a variety of heterogeneous storage media, for example:

<property>

<name>dfs.datanode.data.dir</name>

<value>[disk]/dir0,[disk]/dir1,[ssd]/dir2,[ssd]/dir3</value>

</property>

HDFS only provides the heterogeneous storage structure; it does not itself know the performance characteristics of each storage medium;

HDFS provides users with an API to control which media a directory or file is written to;

HDFS provides administrators with tools to limit each user's available share of each medium; at the time of writing this is only partially complete:

Phase 1: DataNode support for heterogeneous storage media (HDFS-2832, complete)

Phase 2: user-facing API (HDFS-5682, unfinished)

HDFS ACLs: POSIX-style ACL support

Background: limitations of the existing permission management;

A supplement to the current POSIX-style file permission model (HDFS-4685);

Enabling the feature:

Set dfs.namenode.acls.enabled to true

Usage:

hdfs dfs -setfacl -m user:tom:rw- /bank/exchange

hdfs dfs -setfacl -m user:lucy:rw- /bank/exchange

hdfs dfs -setfacl -m group:team2:r-- /bank/exchange

hdfs dfs -setfacl -m group:team3:r-- /bank/exchange
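The same kind of ACL entry can also be applied programmatically. A minimal, hedged sketch using FileSystem.modifyAclEntries from the Java API, mirroring the first command above (the path and user name are taken from that example; an enclosing method that may throw IOException is assumed):

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.AclEntry;
import org.apache.hadoop.fs.permission.AclEntryScope;
import org.apache.hadoop.fs.permission.AclEntryType;
import org.apache.hadoop.fs.permission.FsAction;

FileSystem fs = FileSystem.get(new Configuration());
// equivalent of: hdfs dfs -setfacl -m user:tom:rw- /bank/exchange
AclEntry entry = new AclEntry.Builder()
        .setScope(AclEntryScope.ACCESS)
        .setType(AclEntryType.USER)
        .setName("tom")
        .setPermission(FsAction.READ_WRITE)
        .build();
fs.modifyAclEntries(new Path("/bank/exchange"), Collections.singletonList(entry));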

HDFS Snapshot

Background: the files and directories in HDFS are constantly changing; snapshots help users preserve data as it was at a certain point in time;

Purpose: to protect against accidental deletion by users and to support data backup.

Usage:

A directory can have snapshots if and only if it has been made snapshottable;

bin/hdfs dfsadmin -allowSnapshot <path>

Create/delete snapshots:

bin/hdfs dfs -createSnapshot <path> [<snapshotName>]

bin/hdfs dfs -deleteSnapshot <path> <snapshotName>

Snapshot location and characteristics: snapshots are read-only and cannot be modified.

Snapshot location:

<snapshottable_dir_path>/.snapshot

<snapshottable_dir_path>/.snapshot/snap_name
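Snapshots can also be created and deleted from the Java API once the directory has been made snapshottable with dfsadmin. A minimal sketch with an illustrative directory and snapshot name, assuming an enclosing method that may throw IOException:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
Path dir = new Path("/bank/exchange");                    // must already be snapshottable
Path snap = fs.createSnapshot(dir, "snap-before-import"); // returns <dir>/.snapshot/snap-before-import
System.out.println("created " + snap);
fs.deleteSnapshot(dir, "snap-before-import");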

HDFS Cache

Background:

1. HDFS itself provided no data-caching capability and relied on the OS cache instead, which easily wastes memory, e.g. all three replicas of a block may end up cached at the same time.

2. Multiple computing frameworks coexist, with HDFS as the shared storage system:

MapReduce: offline computing, making full use of disk

Impala: low-latency computing, making the most of memory

Spark: in-memory computing framework

3. HDFS should let these mixed workloads coexist in one cluster and make reasonable use of memory, disk, and other resources; for example, frequently accessed files should stay cached as long as possible rather than being evicted to disk.

Implementation:

Users must explicitly add a directory or file to, or remove it from, the cache with commands; block-level caching and automatic caching are not supported, and a cache expiration time can be set.

Caching a directory caches only the files directly under it; nested files and directories are not cached recursively.

Cache resources are organized into pools, dividing the cache much as YARN divides resources; each pool has Linux-like permission management, a cache cap, expiration times, and so on.

The cache memory is managed independently and is not integrated with the YARN resource manager; the cache size on each DataNode can be set independently of YARN.
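Cache pools and directives are normally managed with the hdfs cacheadmin command. Below is a hedged sketch of the rough Java-level equivalent via DistributedFileSystem; the pool and path names are illustrative, and the exact classes may differ between Hadoop versions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.CacheDirectiveInfo;
import org.apache.hadoop.hdfs.protocol.CachePoolInfo;

DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(new Configuration());
dfs.addCachePool(new CachePoolInfo("hot-data"));        // like: hdfs cacheadmin -addPool hot-data
dfs.addCacheDirective(new CacheDirectiveInfo.Builder()  // like: hdfs cacheadmin -addDirective ...
        .setPath(new Path("/bank/exchange"))
        .setPool("hot-data")
        .build());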

