What is HDFS?
HDFS (Hadoop Distributed File System) is a distributed file system that allows files to be shared across multiple hosts on a network,
so that multiple users on multiple machines can share files and storage space.
Characteristics:
1. Transparency. Files are actually accessed over the network, but from the program's and the user's point of view,
it looks just like accessing a local disk.
2. Fault tolerance. Even if some nodes in the system go offline, the system as a whole can keep running
without any data loss.
Applicable scenarios:
Suited to write-once, read-many workloads. It does not support concurrent writes, and it is a poor fit for large numbers of small files.
The architecture of HDFS
Master-slave structure
Master node, only one: NameNode
Slave nodes, many: DataNodes
NameNode is responsible for:
Receiving user operation requests
Maintaining the directory structure of the file system
Managing the mapping between files and blocks, and between blocks and DataNodes
DataNode is responsible for:
Storing files
Files are partitioned into blocks, which are stored on disk
To keep data safe, each file has multiple replicas
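The split-into-blocks-with-replicas model above can be sketched in a few lines of Python. This is a simplified illustration, not actual HDFS code; the 64 MB block size and replication factor of 3 are the HDFS defaults, and the round-robin placement is a made-up policy for demonstration.

```python
# Simplified model of how HDFS splits a file into blocks and assigns
# replicas to DataNodes (illustration only, not real HDFS code).

BLOCK_SIZE = 64 * 1024 * 1024   # default HDFS block size: 64 MB
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes occupies."""
    full, last = divmod(file_size, block_size)
    return [block_size] * full + ([last] if last else [])

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` nodes."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file
print(len(blocks))                              # 4 blocks: 64 + 64 + 64 + 8 MB
print(place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"]))
```

Note that the last block of a file only occupies as much space as it actually holds (8 MB here), not a full 64 MB.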
NameNode (can be understood as the boss)
It is the management node for the entire file system. It maintains the file system's directory tree,
the metadata for each file/directory, and the list of data blocks for each file. It receives the user's operation requests.
It keeps the following files (all three are stored in the Linux file system):
fsimage: a metadata image file storing a snapshot of the NameNode's in-memory metadata at a point in time.
edits: the operation log file.
fstime: records the time of the last checkpoint.
Working characteristics:
1. The NameNode always keeps the metadata in memory in order to serve "read requests".
2. When a "write request" arrives, the NameNode first writes the operation to the edit log on disk
(the edits file); only after the log write succeeds does it modify the in-memory metadata and return to the client.
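The log-before-memory ordering in step 2 is essentially write-ahead logging. A minimal Python sketch of the idea (hypothetical class and method names, not HDFS code):

```python
import json
import os
import tempfile

# Write-ahead-log sketch of the NameNode's "write request" path:
# append the operation to the edits log on disk first, and only after
# the log write is durable, update the in-memory metadata and return
# to the client. (Illustration only; names are made up.)

class MiniNameNode:
    def __init__(self, edits_path):
        self.edits_path = edits_path
        self.metadata = {}          # in-memory metadata: path -> attrs

    def apply_write(self, op):
        # 1. Append the operation to the edits file and force it to disk.
        with open(self.edits_path, "a") as log:
            log.write(json.dumps(op) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only now modify the in-memory metadata and reply to the client.
        self.metadata[op["path"]] = op["attrs"]
        return "ok"

edits = os.path.join(tempfile.mkdtemp(), "edits")
nn = MiniNameNode(edits)
nn.apply_write({"path": "/user/a.txt", "attrs": {"replication": 3}})
print(nn.metadata)
```

The point of the ordering is crash safety: if the NameNode dies after the log write but before the memory update, the operation can still be replayed from the edits file.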
3. Hadoop maintains an fsimage file, which is an image of the NameNode's metadata.
The fsimage is not kept consistent with the in-memory metadata at all times;
instead, its contents are updated periodically by merging in the edits file. The Secondary NameNode
is the component that merges the fsimage and edits files to update the NameNode's metadata.
DataNode (can be understood as the worker)
A storage service that holds the actual file data.
The most basic storage unit is the block (file block); the default size is 64 MB.
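The 64 MB default can be changed per cluster. In Hadoop 1.x the relevant hdfs-site.xml property is dfs.block.size (renamed dfs.blocksize in later versions); for example, to raise it to 128 MB:

```xml
<!-- hdfs-site.xml: override the default 64 MB block size.
     Hadoop 1.x property name; later versions call it dfs.blocksize. -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, in bytes -->
</property>
```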
Secondary NameNode (can be understood as the boss's assistant)
A solution for HA (high availability), although it does not support hot standby.
By default it is installed on the same node as the NameNode, but that is... not safe!
(In a production environment, installing it on a separate machine is recommended.)
Execution process:
It downloads the metadata files (fsimage, edits) from the NameNode, merges the two to generate
a new fsimage, saves it locally, and pushes it to the NameNode, replacing the old fsimage.
Workflow:
1. The SecondaryNameNode notifies the NameNode to roll over to a new edits file.
2. The SecondaryNameNode fetches the fsimage and edits files from the NameNode (via HTTP).
3. The SecondaryNameNode loads the fsimage into memory and then merges in the edits.
4. The SecondaryNameNode sends the new fsimage back to the NameNode.
5. The NameNode replaces the old fsimage with the new one.
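The checkpoint steps above amount to replaying the edits log over the fsimage snapshot. A simplified sketch (the real fsimage and edits are Hadoop-specific binary formats; the dict/list representation here is an assumption for illustration):

```python
# Simplified model of the SecondaryNameNode checkpoint: load the old
# fsimage (a metadata snapshot), replay every entry in the edits log
# on top of it, and produce a new fsimage. (Illustration only.)

def merge_checkpoint(fsimage, edits):
    """Return a new fsimage dict = old snapshot + replayed edit entries."""
    new_image = dict(fsimage)            # step 3: load fsimage into memory
    for entry in edits:                  # ...then merge in the edits
        if entry["op"] == "create":
            new_image[entry["path"]] = entry["attrs"]
        elif entry["op"] == "delete":
            new_image.pop(entry["path"], None)
    return new_image                     # step 4: the new fsimage

old_fsimage = {"/a.txt": {"blocks": 1}}
edits_log = [
    {"op": "create", "path": "/b.txt", "attrs": {"blocks": 2}},
    {"op": "delete", "path": "/a.txt"},
]
print(merge_checkpoint(old_fsimage, edits_log))   # {'/b.txt': {'blocks': 2}}
```

Offloading this merge to the SecondaryNameNode keeps the NameNode free to serve requests; the NameNode only has to swap in the finished fsimage.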
The entire architecture of Hadoop is built on top of RPC.
RPC (Remote Procedure Call) operates in client/server mode.
It is a protocol for requesting a service from a program on a remote computer over the network,
without needing to understand the underlying network technology.
Specific process:
First, the client's calling process sends a call message containing the procedure parameters to the server process,
then waits for the reply message. On the server side, the process stays asleep until a call message arrives.
When a call arrives, the server extracts the procedure parameters, computes the result, sends the reply message,
and then waits for the next call message.
Finally, the client's calling process receives the reply, extracts the result, and resumes execution.
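The call/wait/reply cycle just described can be demonstrated with Python's standard-library XML-RPC modules. This is a toy echo-style service, not Hadoop's RPC implementation (which has its own wire format), but the flow is the same: the client sends the procedure name and parameters and blocks; the server sleeps until a call arrives, computes, and replies.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: register a procedure and wait for one incoming call.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]          # OS-assigned ephemeral port

# "The process stays asleep until the call information arrives":
# handle_request() blocks in a background thread until a call comes in.
threading.Thread(target=server.handle_request, daemon=True).start()

# Client side: the remote call looks like a local function call, but the
# parameters travel over the network and the client blocks on the reply.
client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.add(2, 3)
print(result)          # 5
server.server_close()
```

Hadoop's own RPC follows this pattern but layers versioned interfaces and its own serialization on top, which is why the server-side object must implement an interface extending VersionedProtocol.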
The object exposed by the server must implement an interface, and that interface must extend VersionedProtocol.
The methods the client can invoke on the object must be declared in that interface.
http://m.oschina.net/blog/212102
Summary: HDFS is one of the two main cores of Hadoop.