<1> a distributed file system that shares files across multiple hosts over the network, allowing users on many hosts to share files and storage space.
<2> transparency. Files are accessed through the network, but to programs and users it looks just like accessing a local disk.
<3> fault tolerance. Even if some nodes in the system go offline, the system continues to operate without data loss.
<4> suited to write-once, read-many access. Concurrent writes are not supported, and HDFS is not suitable for large numbers of small files.
3. directory structure
<1> since the NameNode maintains so much information, where is it stored?
There is a file in the Hadoop source code called hdfs-default.xml.
<2> open this file.
Lines 149 and 158 contain two configuration properties: dfs.name.dir and dfs.name.edits.dir. These indicate the storage locations of the NameNode's core files, fsimage and edits.
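The two entries in hdfs-default.xml look roughly like this (a sketch of the Hadoop 1.x defaults; the default value of dfs.name.edits.dir is the same as dfs.name.dir):

```xml
<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name</value>
</property>
<property>
  <name>dfs.name.edits.dir</name>
  <value>${dfs.name.dir}</value>
</property>
```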
The value of each property is written as ${...}, a variable expression: the variable's value is substituted when the program reads the configuration file. The variable used on line 150 is hadoop.tmp.dir, that is, the Hadoop temporary storage path.
But in the core-site.xml file we configured in the previous chapter, the value of that variable is /usr/local/hadoop/tmp.
<3> look in the Linux file system.
Run the command cd /usr/local/hadoop/conf and then view the file with more core-site.xml.
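Our core-site.xml from the previous chapter contains roughly this entry (a sketch):

```xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/tmp</value>
</property>
```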
We can see that the two files are stored under the /usr/local/hadoop/tmp/dfs/name directory of the Linux file system.
<4> enter this directory and view its contents.
It can be seen that the NameNode's core files fsimage and edits are stored in this directory. The name directory also contains a file called in_use.lock, which is empty when viewed; it ensures that only one NameNode process can access the directory at a time. You can try it yourself: when Hadoop is not running, there is no in_use.lock file in this directory; the file is generated only after Hadoop starts.
<5> the file fsimage is the NameNode's core file.
This file is very important: if it is lost, the NameNode cannot work. So how can we prevent the loss of this file from causing serious consequences? Look again at the relevant description in hdfs-default.xml.
According to the description, this property determines where the DFS NameNode stores the name table (fsimage) on the local file system. If the value is a comma-separated list of directories, the name table is replicated into all of them for redundancy (backup to ensure data safety). For example, if the value is ${hadoop.tmp.dir}/dfs/name,~/name2,~/name3,~/name4, then fsimage is also copied into the ~/name2, ~/name3, and ~/name4 directories. These directories are therefore usually placed on different machines, disks, or folders; the more dispersed, the better, to keep the data safe. Someone may ask how to implement this across multiple hosts: Linux provides the NFS file-sharing system, which is not covered in detail here.
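As a sketch, a redundant configuration in the site configuration file might look like this (the paths mirror the example above and are purely illustrative):

```xml
<!-- hdfs-site.xml: illustrative redundant fsimage storage; paths are examples -->
<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name,~/name2,~/name3,~/name4</value>
</property>
```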
<6> check the description of edits.
View the corresponding entry in hdfs-default.xml.
According to the description, the dfs.name.edits.dir property determines where the DFS NameNode stores the transaction file (edits) on the local file system. If the value is a comma-separated list of directories, the transaction file is replicated into all of the directories for redundancy. The default value is the same as dfs.name.dir. (edits records the transaction log.)
IV. Basic Structure of HDFS: DataNode
1. Role
The DataNode is what actually stores the data in HDFS.
2. block
<1> if a file is very large, such as 100 GB, how is it stored in DataNodes? A DataNode reads and writes data in blocks; the block is the basic unit of reading and writing in HDFS.
<2> assume the file size is 100 GB. Starting from byte offset 0, every 64 MB of data is divided into one block, so the file is split into many blocks, each 64 MB in size.
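The splitting above can be sketched as a few lines of Python. This is only an illustration of the arithmetic; the real splitting happens inside HDFS, and the helper function here is hypothetical:

```python
# Illustrative sketch of HDFS block splitting (not real HDFS code).
BLOCK_SIZE = 64 * 1024 * 1024          # 64 MB default block size, in bytes

def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE):
    """Return (offset, length) pairs describing each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

file_size = 100 * 1024 * 1024 * 1024   # a 100 GB file
blocks = split_into_blocks(file_size)
print(len(blocks))                      # 100 GB / 64 MB = 1600 blocks
```

Note that the last block of a file may be shorter than 64 MB; here 100 GB happens to divide evenly.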
2.1 let's take a look at the org.apache.hadoop.hdfs.protocol.Block class.
It has the following attributes.
It can be seen that none of the attributes in the class holds file data, so a block is essentially a logical concept: it does not actually store data, but only describes how a file is divided.
2.2 Why must it be divided into 64 MB?
Because this is the value set in the default configuration file; look at hdfs-default.xml.
The dfs.block.size parameter indicates the block size. Its value is 67108864 bytes, which converts to 64 MB. If we do not want a 64 MB size, we can override this value in hdfs-site.xml. Note that the unit is bytes.
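As a sketch, such an override might look like this (the 128 MB value is only an illustration, not a recommendation):

```xml
<!-- hdfs-site.xml: illustrative override of the default block size -->
<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, specified in bytes -->
</property>
```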
2.3 Copies
<1> copies are backups for security purposes. Because the cluster environment is not reliable, the copy mechanism is used to ensure data security.
<2> the disadvantage of a copy is that it occupies a large amount of storage space. The more copies, the more space occupied. Compared with the risk of data loss, the cost of storage space is worthwhile.
<3> how many copies of a file are appropriate? Let's look at the hdfs-default.xml file.
As shown in Figure 4.3, the default number of copies is 3. This means that each data block in HDFS has three replicas, and each replica is placed on a different DataNode server whenever possible. Imagine: if all three copies of a block were on the same server and that server went down, wouldn't that data be lost entirely?
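The corresponding entry in hdfs-default.xml looks roughly like this (a sketch of the Hadoop 1.x default):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```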
3. directory structure
3.1 the DataNode stores files as blocks
So where are the split blocks stored? Look at the file hdfs-default.xml.
The value of dfs.data.dir is the location in the Linux file system where blocks are stored. The variable hadoop.tmp.dir was described earlier; its value is /usr/local/hadoop/tmp, so the complete path for dfs.data.dir is /usr/local/hadoop/tmp/dfs/data. Run the Linux command, and the result is shown in Figure 4.5.
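The default entry, as a sketch, would read:

```xml
<property>
  <name>dfs.data.dir</name>
  <value>${hadoop.tmp.dir}/dfs/data</value>
</property>
```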
3.2 upload a file
First, use PieTTY to open another Linux terminal and upload the file jdk-6u24-linux-i586.bin, which is 84927175 bytes in size.
Then, back on the original terminal, we can view the uploaded file under the /usr/local/hadoop/tmp/dfs/data directory of the Linux file system.
Files whose names start with "blk_" are the blocks that store the actual data; the names follow a regular pattern. Besides the block files, there are also files with the ".meta" suffix: these are the blocks' metadata files. Since the uploaded file is larger than 64 MB but smaller than 128 MB, there are only two block files.
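The block count can be sanity-checked with a quick calculation (an illustrative sketch, assuming the 84927175-byte size given above and the 64 MB default block size):

```python
# How many 64 MB blocks does an 84,927,175-byte file occupy?
import math

BLOCK_SIZE = 64 * 1024 * 1024      # 67108864 bytes
file_size = 84927175               # size of jdk-6u24-linux-i586.bin

num_blocks = math.ceil(file_size / BLOCK_SIZE)
last_block = file_size - (num_blocks - 1) * BLOCK_SIZE
print(num_blocks, last_block)      # 2 blocks; the second one is not full
```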
Note: the file uploaded from Linux to HDFS is a single complete file on the Linux side, but in HDFS there is no single corresponding file; it is divided into many blocks. In addition, because our Hadoop installation is pseudo-distributed, there is only one node, and the DataNode and NameNode run on the same node, so the uploaded blocks are still on this Linux system.
V. Basic Structure of HDFS: SecondaryNameNode
The SecondaryNameNode is a partial answer to NameNode failure, but it is not a hot standby, and it must be configured explicitly. The more data operations there are, the larger the edits file grows, and it cannot be allowed to expand without limit, so the edits log must periodically be merged into fsimage. Meanwhile the NameNode must respond quickly to user operation requests; to keep it responsive, the merging work is handed over to the SecondaryNameNode, which therefore also keeps a backup of part of the fsimage.
Execution process: it downloads the metadata (fsimage and edits) from the NameNode, merges the two, generates a new fsimage, saves it locally, pushes it to the NameNode, and resets the NameNode's edits. By default it is installed on the same node as the NameNode, which is, well... not safe!
The figure shows the merging principle.
From: http://www.thebigdata.cn/Hadoop/11962.html
Address: http://www.linuxprobe.com/hdfs-concept.html