1.Definition and characteristics of HDFS
Disadvantages of using a whole file as the basic storage unit:
Load balancing is hard to achieve: file sizes differ and are controlled by the user, so some nodes end up holding far more data than others;
Parallel processing is hard to achieve: only one node's resources can be used to process a given file, so the rest of the cluster's resources go unused.
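The two problems above are why HDFS stores fixed-size blocks rather than whole files. A minimal sketch of the difference, assuming a 128 MB block size and a simple round-robin placement (both are illustrative assumptions, and the function names are hypothetical, not HDFS APIs):

```python
BLOCK_SIZE_MB = 128  # assumed block size, for illustration only


def place_whole_files(file_sizes_mb, num_nodes):
    """Assign each whole file to one node round-robin; return per-node load in MB."""
    load = [0] * num_nodes
    for i, size in enumerate(file_sizes_mb):
        load[i % num_nodes] += size
    return load


def split_into_blocks(size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Split one file into fixed-size blocks; the last block may be smaller."""
    full, rest = divmod(size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_blocks(file_sizes_mb, num_nodes, block_size_mb=BLOCK_SIZE_MB):
    """Assign individual blocks to nodes round-robin; return per-node load in MB."""
    load = [0] * num_nodes
    next_node = 0
    for size in file_sizes_mb:
        for block in split_into_blocks(size, block_size_mb):
            load[next_node] += block
            next_node = (next_node + 1) % num_nodes
    return load


files = [1024, 64, 64, 64]          # one large file and three small ones
print(place_whole_files(files, 4))  # [1024, 64, 64, 64]
print(place_blocks(files, 4))       # [320, 320, 320, 256]
```

With whole files, one node holds the entire 1024 MB file and is the only node that can process it. With blocks, the same data is spread almost evenly, and every node can work on its share in parallel.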
Definition of HDFS: a distributed file system that is easy to scale out, runs on a large number of inexpensive machines, provides a fault-tolerance mechanism, and offers a well-performing file storage service to a large number of users;
Advantages: high fault tolerance (multiple copies of the data are saved automatically, and a lost copy is recovered automatically); suitable for batch processing (the computation is moved to the data rather than the data to the computation, and data locations are exposed to the computing framework); suitable for big data processing; streaming file access; can be built on inexpensive machines
Not good at: low-latency data access; access to small files; concurrent writes and random modification of files
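The fault-tolerance point (multiple copies, automatic recovery after a copy is lost) can be sketched as follows. This is only a toy model, assuming a replication factor of 3 and a naive round-robin placement; real HDFS placement is rack-aware, and all names here are hypothetical:

```python
REPLICATION = 3  # assumed replication factor, for illustration only


def place_replicas(block_ids, nodes, replication=REPLICATION):
    """Return {block_id: set of node names}, each block on distinct nodes."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = {nodes[(i + k) % len(nodes)] for k in range(replication)}
    return placement


def handle_node_failure(placement, failed, live_nodes, replication=REPLICATION):
    """Drop the failed node's copies, then re-replicate under-replicated blocks."""
    for holders in placement.values():
        holders.discard(failed)
        # Live nodes that do not yet hold a copy of this block.
        candidates = [n for n in live_nodes if n not in holders]
        while len(holders) < replication and candidates:
            holders.add(candidates.pop(0))
    return placement


nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_replicas(["b0", "b1"], nodes)
handle_node_failure(placement, "dn2", ["dn1", "dn3", "dn4"])
print({block: sorted(holders) for block, holders in placement.items()})
```

After dn2 fails, both blocks are copied to a surviving node that did not hold them before, so each block is back to three copies without any action from the user.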
2.HDFS Architecture
NameNode: the master; manages the HDFS namespace, manages block mapping information, configures the replica policy, and handles client read and write requests
DataNode: the slave; stores the actual data blocks and performs block reads and writes
Client: splits files into blocks; interacts with the NameNode to obtain file location information; interacts with DataNodes to read or write data; manages and accesses HDFS
Secondary NameNode: not a hot standby for the NameNode; it assists the NameNode and shares its workload; periodically merges fsimage and fsedits (the edit log) and pushes the result to the NameNode; helps recover the NameNode in an emergency
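The division of work between the four roles can be sketched with a toy in-memory model. This is not the Hadoop API: the class and method names, the 4-byte block size, and the block id format are all assumptions made for illustration, and replication is left out to keep the sketch short.

```python
import copy


class DataNode:
    """Slave: stores the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Master: holds the namespace and block mapping, but no file data."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.namespace = {}  # path -> list of block ids (in memory)
        self.block_map = {}  # block id -> DataNode name
        self.fsimage = {}    # last checkpoint of the namespace
        self.edits = []      # changes logged since that checkpoint

    def allocate_block(self, path):
        """Pick a DataNode for a new block and record the mapping."""
        index = len(self.block_map)
        block_id = f"blk_{index}"  # hypothetical id format
        node = self.datanodes[index % len(self.datanodes)]
        self.namespace.setdefault(path, []).append(block_id)
        self.block_map[block_id] = node.name
        self.edits.append(("add_block", path, block_id))
        return block_id, node

    def locate(self, path):
        """Return (block id, DataNode name) pairs for a file, in order."""
        return [(b, self.block_map[b]) for b in self.namespace[path]]


class Client:
    """Splits files, asks the NameNode where blocks go, talks to DataNodes."""
    def __init__(self, namenode, block_size=4):
        self.namenode = namenode
        self.block_size = block_size
        self.nodes = {dn.name: dn for dn in namenode.datanodes}

    def write(self, path, data):
        for start in range(0, len(data), self.block_size):
            block_id, node = self.namenode.allocate_block(path)
            node.blocks[block_id] = data[start:start + self.block_size]

    def read(self, path):
        return b"".join(self.nodes[name].blocks[block_id]
                        for block_id, name in self.namenode.locate(path))


class SecondaryNameNode:
    """Helper: merges the edit log into fsimage and pushes the result back."""
    def checkpoint(self, namenode):
        merged = copy.deepcopy(namenode.fsimage)
        for op, path, block_id in namenode.edits:
            if op == "add_block":
                merged.setdefault(path, []).append(block_id)
        namenode.fsimage = merged
        namenode.edits = []
        return merged


namenode = NameNode([DataNode("dn1"), DataNode("dn2")])
client = Client(namenode)
client.write("/a.txt", b"hello hdfs")
print(namenode.locate("/a.txt"))
print(client.read("/a.txt"))
SecondaryNameNode().checkpoint(namenode)
print(namenode.fsimage, namenode.edits)
```

Note that file data never passes through the NameNode: the client only asks it where blocks live and then reads or writes the DataNodes directly. The checkpoint keeps the edit log short, so a restarting NameNode would have less to replay.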
3.HDFS Working principle
4.HDFS combined with other systems
Hadoop study notes, part 2: application scenarios, deployment principles, and basic framework of HDFS