Chapter 3. The storage behind a search engine's parallel distributed file system is at least on the terabyte scale. How can we effectively manage and organize these resources, and still return results in a very short time? The paper "MapReduce: Simplified Data Processing on Large Clusters" provides a good analysis.
A distributed file system must implement two critical resource interfaces: a mapping table from file names to the namespace, and a block table mapping each block to a list of node machines. The namespace maps file names onto a group of machines; the choice of hash function depends on the namespace size, and this lookup is essentially a map step. The block table maps blocks to the machine list, which is essentially a reduce step: blocks are stored on the managed group of node machines (inodes). In other words, this is a master-slave architecture. The more streamlined the communication mode, the more efficiently the underlying protocol executes. For a concrete implementation, refer to Hadoop.
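The two tables described above can be sketched in a few lines. This is a minimal illustrative model, not Hadoop's actual API; the class and method names (`MasterMetadata`, `create_file`, `place_replica`, `locate`) are assumptions made for this sketch.

```python
# Minimal sketch of the master's two metadata tables: a namespace table
# mapping file names to block IDs, and a block table mapping each block
# ID to the node machines (inodes) holding a replica of it.
from collections import defaultdict

class MasterMetadata:
    def __init__(self):
        self.namespace = {}                    # file name -> [block IDs]
        self.block_table = defaultdict(list)   # block ID  -> [node names]

    def create_file(self, name, block_ids):
        self.namespace[name] = list(block_ids)

    def place_replica(self, block_id, node):
        self.block_table[block_id].append(node)

    def locate(self, name):
        """Resolve a file name to its blocks and their replica nodes."""
        return {b: self.block_table[b] for b in self.namespace.get(name, [])}

master = MasterMetadata()
master.create_file("/logs/day1", ["blk_1", "blk_2"])
master.place_replica("blk_1", "node-a")
master.place_replica("blk_1", "node-b")
master.place_replica("blk_2", "node-c")
print(master.locate("/logs/day1"))
# → {'blk_1': ['node-a', 'node-b'], 'blk_2': ['node-c']}
```

A client never reads data through the master: it resolves names to block locations here, then fetches block contents directly from the node machines, which is what keeps the master from becoming a bandwidth bottleneck.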
Of course, many details remain to be considered: handling inode machines that come up or go down, identifying new machines in real time, and automatically removing disabled machines from the list; choosing the number of data replicas and load balancing among them; maintaining the inode configuration files; and presenting a virtualized, unified file system to end users.
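Two of these details, detecting dead machines and keeping enough replicas, are commonly handled with heartbeats. The sketch below is a hypothetical illustration (the timeout, replication factor, and function names are assumptions, not taken from any real system): nodes that go silent past a timeout are dropped from the live set, and any block whose surviving replica count falls below the target is flagged for re-replication.

```python
# Sketch of heartbeat-based membership and re-replication detection.
HEARTBEAT_TIMEOUT = 10.0   # seconds; assumed value
TARGET_REPLICAS = 3        # assumed replication factor

def live_nodes(heartbeats, now, timeout=HEARTBEAT_TIMEOUT):
    """Nodes whose last heartbeat arrived within the timeout window."""
    return {n for n, t in heartbeats.items() if now - t <= timeout}

def under_replicated(block_table, live, target=TARGET_REPLICAS):
    """Blocks whose count of replicas on live nodes is below the target."""
    return [b for b, nodes in block_table.items()
            if len([n for n in nodes if n in live]) < target]

now = 1000.0
heartbeats = {"node-a": 999.0, "node-b": 998.0, "node-c": 900.0}  # node-c stale
blocks = {"blk_1": ["node-a", "node-b", "node-c"],
          "blk_2": ["node-a", "node-b"]}
live = live_nodes(heartbeats, now)
print(sorted(live))                   # → ['node-a', 'node-b']
print(under_replicated(blocks, live)) # → ['blk_1', 'blk_2']
```

In a real deployment the master would then pick lightly loaded live nodes as re-replication targets, which is where the load-balancing concern above comes in.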