Chapter six. The code and commands below that I have not yet verified are recorded first; I will update them after verification if anything changes.
I. What is HDFS
1. HDFS is an easily scalable distributed file system
2. It can run on large numbers of ordinary, low-cost machines and provides fault-tolerance mechanisms
3. It can provide file access services with good performance to a large number of users
II. Advantages
High fault tolerance: data is automatically saved in multiple replicas, and a lost replica is recovered automatically
Suitable for batch processing: move the computation rather than the data; data locations are exposed to the computing framework
Suitable for big data processing: GB, TB, even PB of data; millions of files or more; clusters of 10K+ nodes
Streaming file access: write once, read many times; guarantees data consistency
Can be built on inexpensive machines: reliability is improved through multiple replicas, with fault-tolerance and recovery mechanisms
III. Shortcomings
Low-latency data access: e.g. millisecond-level access; HDFS favors high throughput over low latency;
Small-file access: consumes large amounts of NameNode memory; seek time exceeds read time;
Concurrent writes and random file modification: a file can have only one writer at a time; only append is supported
IV. HDFS Architecture
NameNode: the master (only one): manages the HDFS namespace, manages block mapping information, configures replica policies, and handles client read/write requests
Secondary/Standby NameNode: periodically merges the fsimage and edit log and pushes the result to the NameNode; in an HA setup, when the active NameNode fails it quickly takes over as the new active NameNode
DataNode: the slaves (there are many): store the actual data blocks and perform block reads/writes
Client: splits files; interacts with the NameNode to obtain file location information; interacts with DataNodes to read or write data; manages and accesses HDFS
HDFS block: files are cut into fixed-size data blocks; the default block size is 64MB and is configurable. A file smaller than 64MB is still stored as a single (smaller) block. Blocks are made this large so that data transfer time dominates seek time (high throughput). When a file is stored in HDFS it is split into blocks by size and the blocks are stored on different nodes; by default each block has three replicas.
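As an illustrative sketch (not part of the original notes; the path and sizes are made up), the block size can be inspected or set per file from the Java API described later:
Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
Path path = new Path("/hdfs/data/demo.txt");
// block size the cluster would use by default for this path (e.g. 64MB)
long defaultBlockSize = hdfs.getDefaultBlockSize(path);
// create a file with replication 3 and an explicit 64MB block size
FSDataOutputStream out = hdfs.create(path, true, 4096, (short) 3, 64L * 1024 * 1024);
out.write("hello HDFS".getBytes("UTF-8"));
out.close();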
Write process:
Read process:
Typical physical topology:
V. HDFS Strategies
Block replica placement policy:
Replica 1: on the same node as the client;
Replica 2: on a node in a different rack;
Replica 3: on another node in the same rack as replica 2;
Other replicas: chosen at random
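To see where the replicas of a file actually landed, here is a hedged sketch using the Java API (the path is hypothetical):
Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
FileStatus status = hdfs.getFileStatus(new Path("/hdfs/data/demo.txt"));
// one BlockLocation per block; getHosts() lists the DataNodes holding its replicas
BlockLocation[] blocks = hdfs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation blk : blocks) {
System.out.println(String.join(",", blk.getHosts()));
}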
Reliability policy:
Three common failure cases: file corruption; network or machine failure; NameNode failure.
File integrity: CRC32 checksums; a corrupted copy is replaced with another replica
Network or machine failure: heartbeats; each DataNode periodically sends a heartbeat to the NameNode;
NameNode failure: metadata is protected by keeping the fsimage (file system image) and the edit log (operation log) in multiple storage locations, with real-time switchover between active and standby NameNodes
HDFS is not suitable for storing small files:
1. Metadata is kept in NameNode memory, and the memory of a single node is limited; accessing a large number of small files also costs a great deal of seek time, much as copying many small files is far slower than copying one large file of the same total size.
2. The number of blocks the NameNode can store is limited: the metadata of one block consumes roughly 200 bytes of memory, so storing 100 million blocks needs about 20GB of memory. If each file is 10KB, 100 million such files hold only about 1TB of data, yet they consume about 20GB of NameNode memory.
VI. HDFS Access Modes
HDFS shell commands;
HDFS Java API;
HDFS REST API;
HDFS FUSE: implements the FUSE protocol;
HDFS libhdfs: C/C++ access interface;
HDFS APIs for other programming languages:
implemented with Thrift; supports C++, Python, PHP, C#, and other languages;
HDFS Shell Commands
1. Upload a local file to HDFS
bin/hadoop fs -copyFromLocal /local/data /hdfs/data
2. Delete a file/directory
bin/hadoop fs -rmr /hdfs/data
3. Create a directory
bin/hadoop fs -mkdir /hdfs/data
4. Some scripts
In the sbin directory: start-all.sh; start-dfs.sh; start-yarn.sh; hadoop-daemon(s).sh;
To start a service individually:
hadoop-daemon.sh start namenode;
hadoop-daemons.sh start namenode (logs in to each node via SSH);
5. File management command fsck (example invocations follow this list):
Check the health status of files in HDFS
Find missing blocks and blocks with too few or too many replicas
View all block locations of a file
Delete corrupted data blocks
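For reference (these invocations are not from the original notes, and the available flags can differ between Hadoop versions), typical usage looks roughly like:
bin/hdfs fsck /hdfs/data -files -blocks -locations
bin/hdfs fsck /hdfs/data -list-corruptfileblocks
bin/hdfs fsck /hdfs/data -delete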
6. Data block rebalancing
bin/start-balancer.sh -threshold <percentage of disk capacity>
Percentage of disk capacity: the disk-usage deviation HDFS aims for when balancing; the lower the value, the more balanced the nodes become, but the longer balancing takes.
7. Set directory quotas
Limit the disk space a directory may use:
bin/hadoop dfsadmin -setSpaceQuota 1t /user/username
Limit the maximum number of subdirectories and files a directory may contain:
bin/hadoop dfsadmin -setQuota 10000 /user/username
8. Add/remove nodes
Add a new DataNode:
Step 1: copy the installation package (including configuration files, etc.) from an existing DataNode to the new DataNode;
Step 2: start the new DataNode: sbin/hadoop-daemon.sh start datanode
Remove an old DataNode:
Step 1: add the DataNode to the blacklist and update the blacklist: on the NameNode, add the DataNode's host name or IP to the file specified by the dfs.hosts.exclude configuration option
Step 2: remove the DataNode: bin/hadoop dfsadmin -refreshNodes
HDFS Java API Introduction
Configuration class: an object of this class encapsulates the configuration information, which comes from core-*.xml;
FileSystem class: a file system class; files/directories are manipulated using the methods of this class. A FileSystem object is generally obtained via the static FileSystem.get method;
FSDataInputStream and FSDataOutputStream classes: the input and output streams of HDFS, obtained via FileSystem's open and create methods respectively.
The file system classes above come from the Java package org.apache.hadoop.fs (Configuration is in org.apache.hadoop.conf).
Example: copy a local file to HDFS;
Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
// srcFile is a local path, dstFile is the target HDFS path
Path srcPath = new Path(srcFile);
Path dstPath = new Path(dstFile);
hdfs.copyFromLocalFile(srcPath, dstPath);
Example: create an HDFS file;
byte[] buff: the file contents to write
Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
Path path = new Path(fileName);
FSDataOutputStream outputStream = hdfs.create(path);
outputStream.write(buff, 0, buff.length);
outputStream.close();
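For completeness, a hedged sketch of reading the file back with FSDataInputStream (obtained via the open method mentioned above); fileName is assumed to refer to an existing HDFS file:
Configuration config = new Configuration();
FileSystem hdfs = FileSystem.get(config);
Path path = new Path(fileName);
// open returns an FSDataInputStream for the file
FSDataInputStream inputStream = hdfs.open(path);
byte[] data = new byte[(int) hdfs.getFileStatus(path).getLen()];
// readFully(position, buffer) reads the whole file into the buffer
inputStream.readFully(0, data);
inputStream.close();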
Supplement (from Baidu Encyclopedia): a rack is used to secure patch panels, enclosures, and equipment inside a telecom cabinet. It is usually 19 inches wide and 7 feet tall. For the IT industry it can simply be understood as the cabinet that holds servers. Standard racks are also known as "19-inch" racks. Rack servers do not look like ordinary computers; they look more like switches, routers, and similar devices. Rack servers are installed in standard 19-inch cabinets, and servers of this form are mostly functional servers.
VII. New Features of Hadoop 2.0
NameNode HA
NameNode Federation
HDFS Snapshot (snapshot)
HDFS Cache (In-memory caches)
HDFS ACL
Heterogeneous tiered storage architecture (heterogeneous Storage hierarchy)
Heterogeneous Tiered Storage Architecture
HDFS used to abstract all storage media as disks with the same performance, configured for example as:
<property>
<name>dfs.datanode.data.dir</name>
<value>/dir0,/dir1,/dir2,/dir3</value>
</property>
Background:
Storage media are increasingly diverse, and a single cluster may contain heterogeneous media, e.g. disk, SSD, RAM, etc.
Multiple types of tasks attempt to run in the same Hadoop cluster at the same time, so batch, interactive, and real-time processing all need to be handled.
Data with different performance requirements is best stored on different types of storage media.
Principle:
Each node can be composed of several heterogeneous storage media, for example:
<property>
<name>dfs.datanode.data.dir</name>
<value>[disk]/dir0,[disk]/dir1,[ssd]/dir2,[ssd]/dir3</value>
</property>
HDFS only provides the heterogeneous storage structure; it does not itself know the performance characteristics of the storage media;
HDFS provides users with an API to control which medium a directory/file is written to;
HDFS provides administrators with management tools to limit each user's available share of each medium; the current level of completeness is low:
Phase 1: DataNode supports heterogeneous storage media (HDFS-2832, complete)
Phase 2: provide user-facing access APIs (HDFS-5682, unfinished)
HDFS ACL: an implementation based on POSIX ACLs
Background: limitations of the existing permission management
A supplement to the current POSIX-style file permission model (HDFS-4685);
Enabling the feature:
Set dfs.namenode.acls.enabled to true
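In hdfs-site.xml this corresponds roughly to:
<property>
<name>dfs.namenode.acls.enabled</name>
<value>true</value>
</property>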
Usage:
hdfs dfs -setfacl -m user:tom:rw- /bank/exchange
hdfs dfs -setfacl -m user:lucy:rw- /bank/exchange
hdfs dfs -setfacl -m group:team2:r-- /bank/exchange
hdfs dfs -setfacl -m group:team3:r-- /bank/exchange
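The resulting ACL can then be checked with getfacl, for example:
hdfs dfs -getfacl /bank/exchange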
HDFS Snapshot
Background: the files and directories on HDFS change constantly; snapshots help users preserve the data as it was at a given moment;
Purpose: guarding against data loss from user mis-operation, and data backup.
Usage:
A directory can have snapshots taken if and only if it is snapshottable;
bin/hdfs dfsadmin -allowSnapshot <path>
Create/delete a snapshot:
bin/hdfs dfs -createSnapshot <path> [<snapshotName>]
bin/hdfs dfs -deleteSnapshot <path> <snapshotName>
Snapshot storage location and characteristics: snapshots are read-only and cannot be modified
Snapshot locations:
<snapshottable_dir_path>/.snapshot
<snapshottable_dir_path>/.snapshot/snap_name
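As a usage sketch (directory, snapshot, and file names are hypothetical), a file deleted by mistake can be copied back out of a snapshot:
bin/hdfs dfs -cp /hdfs/data/.snapshot/snap1/lost_file /hdfs/data/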
HDFS Cache
Background:
1. HDFS itself provided no data caching; it relied on the OS cache, which easily wastes memory, e.g. all three replicas of a block may be cached at the same time.
2. Multiple computing frameworks coexist, with HDFS as the shared storage system:
MapReduce: offline computing, making full use of disk
Impala: low-latency computing, making the most of memory
Spark: in-memory computing framework
3. HDFS should let these mixed computing types coexist in one cluster and use memory, disk, and other resources sensibly; for example, files that are accessed frequently should be cached as long as possible to keep them from being evicted to disk
Implementation:
Users must explicitly add a directory or file to, or remove it from, the cache by command; block-level caching is not supported, automatic caching is not supported, and a cache expiration time can be set.
Caching a directory caches only the files at its top level; it does not recursively cache all files and subdirectories.
Cache resources are organized into pools, and the cache is divided into pools in a manner similar to YARN's resource management; each pool has Linux-like permission management, cache limits, expiration times, and so on.
The cache memory is managed independently and is not integrated with the YARN resource manager; users can set the cache size on each DataNode independently of YARN, with commands like those sketched below.
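A hedged example with the cacheadmin tool (pool and path names are made up; the available options may differ by Hadoop version):
bin/hdfs cacheadmin -addPool testPool
bin/hdfs cacheadmin -addDirective -path /hdfs/data -pool testPool -ttl 1d
bin/hdfs cacheadmin -listDirectives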
HDFS theory and basic commands