I. Overview
In recent years, big data technology has developed rapidly, and how to store massive amounts of data has become one of today's hot and difficult problems. The HDFS distributed file system, as the distributed storage foundation of the Hadoop project, also provides data persistence for HBase, and it is very widely used in big data projects.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is a core subproject of the Hadoop project: a distributed file system with high fault tolerance, high reliability, high scalability, and high throughput, which can be used to store massive data in cloud computing or other big data applications (mainly large files).
Based on the author's and colleagues' study and practice of HDFS, this paper first introduces the characteristics of HDFS and its important shell commands (the hadoop and hdfs commands), then compares the C access interface provided by HDFS, libhdfs, with the common C file system API, then describes how to use the libhdfs interface to implement a simple HDFS client together with relevant application examples, and finally describes and analyzes the problems encountered while writing an HDFS client.
II. Introduction to HDFS
HDFS is the core subproject of the Hadoop project: a distributed file system with high fault tolerance, high reliability, high scalability, and high throughput.
1. HDFS Features
As a distributed file system, HDFS mainly has the following features:
1) It is mainly used to store and manage big data files (since the default HDFS block size is 128 MB, it is best suited to storing files of hundreds of megabytes and above).
2) Its data nodes can be scaled horizontally, and inexpensive commodity hardware can be used.
3) It is designed for a "write once, read many times" access pattern.
4) It currently does not support modifying file content at arbitrary positions; only append operations at the end of a file are possible.
5) It is not suitable for applications needing low-latency (tens of milliseconds) data access (such applications can consider the HBase distributed database or an ES + distributed file system architecture).
2. Introduction to Common HDFS Shell Commands
Hadoop has two very important shell commands: hadoop and hdfs. For managing the HDFS file system, the functionality of the hadoop and hdfs scripts overlaps heavily; the two commands are described separately below.
1) Using the hadoop command
Almost all of Hadoop's management commands are integrated into one shell script, bin/hadoop; the Hadoop modules are managed by executing this script with parameters. Since this article focuses on the HDFS file system, this section mainly introduces how to operate on HDFS with the hadoop script; the relevant commands and parameters are shown in Figure 1.
Figure 1: hadoop fs command and parameter description
Here is a concrete example: view all the directories and files in the root directory of the HDFS file system, create a test directory, grant it 777 permissions, and finally delete that directory. The main commands are as follows:
hadoop fs -ls /
hadoop fs -mkdir /test
hadoop fs -chmod 777 /test
hadoop fs -rmdir /test
The hadoop fs command can manage user groups and permissions, manage directories and files, and upload and download files, but it can do nothing about deeper HDFS management such as file system checks (including bad-block cleanup), node management, snapshot management, and formatting. For these, the hdfs command is needed.
2) Using the hdfs command
All HDFS management commands are integrated into one shell script, bin/hdfs; the HDFS file system is managed by executing this script with parameters, covering basic file, directory, permission, and group operations as well as data block management and formatting functions.
The hdfs script is very powerful, and the hdfs dfs command is exactly equivalent to the hadoop fs command (for example, hdfs dfs -ls / behaves the same as hadoop fs -ls /), so all of the hadoop fs functionality is available by running hdfs dfs with the parameters in Figure 1. The following mainly describes the HDFS formatting and block management operations.
To format an HDFS file system, use the following command:
hdfs namenode -format
To delete the bad blocks found in the HDFS file system together with the corresponding corrupted files, use the following command:
hdfs fsck / -delete -files
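Before deleting anything, it is usually prudent to inspect the damage first. As a hedged example (these fsck flags are standard, but verify them against your Hadoop version), the following command lists the files under the root directory together with their blocks and block locations:
hdfs fsck / -files -blocks -locations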
III. Introduction to the libhdfs Interface
Hadoop's FileSystem APIs are Java client APIs; HDFS does not provide a native C-language access interface. However, HDFS does offer a JNI-based C interface, libhdfs, which makes accessing HDFS from C very convenient.
The header and library files of the libhdfs interface are already included in the Hadoop release and can be used directly. The header file hdfs.h is generally located in the ${HADOOP_HOME}/include directory, and the library file libhdfs.so is usually located in the ${HADOOP_HOME}/lib/native directory. Because Hadoop is constantly being updated, the libhdfs interface often differs between versions, mainly in the form of added functionality.
Accessing the HDFS file system through libhdfs is similar to accessing an ordinary operating system's file system through the C API, but it has some deficiencies, as follows:
a) The functionality of the libhdfs interface is only a subset of the Java client API's functionality, and there may be bugs that have not yet been discovered or fixed, such as problems releasing handle resources when the file system is accessed from multiple threads or processes.
b) Because Java classes are called through JNI, the application consumes more memory, and the interface can produce a large number of exception logs; managing these logs is a problem in itself. In addition, when an operation on the HDFS file system fails, errno does not necessarily carry the correct hint, which further increases the difficulty of troubleshooting.
c) At present there are few libhdfs use cases to refer to, and even fewer covering large-volume, multi-threaded reads and writes to the HDFS file system through libhdfs.
d) At present libhdfs does not support modifying file content at arbitrary positions; it can only append at the end of a file or truncate a whole file. This is mainly a consequence of the HDFS design: big data storage generally lacks update support, so it can only be worked around at the business level.
e) The libhdfs features carried by Hadoop releases tend to grow from lower to higher versions, so every Hadoop version update may force us to recompile our application, which increases the difficulty of upgrades.
Although the libhdfs interface currently has these deficiencies, its functionality and stability can be expected to improve greatly as the interface version is updated.
IV. Application Practice of Accessing HDFS from C
1. Build and Runtime Environment
In order to compile the C-language client program successfully, we need to pre-install the Java JDK (7.0 or later) and a Hadoop release; together these provide libjvm.so, libhdfs.so, and the other libraries required to link against libhdfs.
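As a hedged illustration (the library paths below depend on your JDK layout and Hadoop installation, and hdfs_client.c is a hypothetical source file name), a typical compile line looks like this:
gcc hdfs_client.c -I${HADOOP_HOME}/include -L${HADOOP_HOME}/lib/native -lhdfs -L${JAVA_HOME}/jre/lib/amd64/server -ljvm -o hdfs_client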
In order to run the C-language client program successfully, in addition to installing the software mentioned above, we also need to set several key environment variables correctly, including LD_LIBRARY_PATH and CLASSPATH.
For the LD_LIBRARY_PATH environment variable, the paths of the libjvm.so and libhdfs.so libraries must be added; for CLASSPATH, the full paths of all the jar packages provided by Hadoop are required (they can be collected with a combination of the find and awk commands). Otherwise the C-language client program will keep reporting missing-class errors and will not run.
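A minimal sketch of these settings, assuming HADOOP_HOME and JAVA_HOME are already exported and that libjvm.so lives under the JRE's amd64/server directory (the exact location varies by JDK):
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${HADOOP_HOME}/lib/native:${JAVA_HOME}/jre/lib/amd64/server
export CLASSPATH=$(find ${HADOOP_HOME} -name "*.jar" | awk '{printf "%s:", $0}')
The trailing colon left by awk is harmless; it merely adds the current directory to the class path.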
2. Simple Application Practice of the libhdfs Interface
Here are some examples of how several of the APIs are used.
1) Obtain the capacity of the HDFS file system and the amount of space already used, as shown in the GetHdfsInfo function:
void GetHdfsInfo(void)
{
    hdfsFS pfs = NULL;
    int iRet = 0;
    tOffset iTmp = 0;

    pfs = hdfsConnect("hdfs://127.0.0.1:9000/", 0);    /* connect to the HDFS file system */
    if (NULL == pfs) {
        WriteLogEx(LOG_ERROR, ("GetHdfsInfo(): hdfsConnect failed! errno=%d.", errno));
        return;
    }
    WriteLog(LOG_INFO, "GetHdfsInfo(): hdfsConnect success!");

    iTmp = hdfsGetCapacity(pfs);    /* get the capacity of the HDFS file system */
    if (-1 == iTmp) {
        WriteLogEx(LOG_ERROR, ("GetHdfsInfo(): hdfsGetCapacity failed! errno=%d.", errno));
        hdfsDisconnect(pfs);
        pfs = NULL;
        return;
    }
    WriteLogEx(LOG_INFO, ("GetHdfsInfo(): hdfsGetCapacity success! offset=%ld.", iTmp));

    iTmp = hdfsGetUsed(pfs);    /* get the space occupied by all files, i.e. the used amount */
    if (-1 == iTmp) {
        WriteLogEx(LOG_ERROR, ("GetHdfsInfo(): hdfsGetUsed failed! errno=%d.", errno));
        hdfsDisconnect(pfs);
        pfs = NULL;
        return;
    }
    WriteLogEx(LOG_INFO, ("GetHdfsInfo(): hdfsGetUsed success! offset=%ld.", iTmp));

    iRet = hdfsDisconnect(pfs);    /* close the connection to the HDFS file system */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("GetHdfsInfo(): hdfsDisconnect failed! errno=%d.", errno));
        return;
    }
    WriteLogEx(LOG_INFO, ("GetHdfsInfo(): hdfsDisconnect success! ret=%d.", iRet));
    pfs = NULL;
    return;
}
2) Create a file in the HDFS file system and write data to it, as shown in the HdfsWriteTest function:
void HdfsWriteTest(hdfsFS pfs)
{
    int iRet = 0;
    hdfsFile pfile = NULL;
    char szTestFile[64] = "/test/write.test";

    if (NULL == pfs) {
        WriteLog(LOG_ERROR, "HdfsWriteTest(): pfs is null.");
        return;
    }

    pfile = hdfsOpenFile(pfs, szTestFile, O_WRONLY | O_CREAT, 0, 0, 0);    /* open the file handle */
    if (NULL == pfile) {
        WriteLogEx(LOG_ERROR, ("HdfsWriteTest(): hdfsOpenFile failed! szFilePath=%s, errno=%d.", szTestFile, errno));
        return;
    }
    WriteLogEx(LOG_INFO, ("HdfsWriteTest(): hdfsOpenFile success! szFilePath=%s.", szTestFile));

    iRet = hdfsWrite(pfs, pfile, "Hello world!", strlen("Hello world!"));    /* write data */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("HdfsWriteTest(): hdfsWrite failed! ret=%d, errno=%d.", iRet, errno));
        hdfsCloseFile(pfs, pfile);
        pfile = NULL;
        return;
    }
    WriteLogEx(LOG_INFO, ("HdfsWriteTest(): hdfsWrite success! ret=%d.", iRet));

    iRet = hdfsHFlush(pfs, pfile);    /* flush buffered data out to the data nodes */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("HdfsWriteTest(): hdfsHFlush failed! ret=%d, errno=%d.", iRet, errno));
        hdfsCloseFile(pfs, pfile);
        pfile = NULL;
        return;
    }
    WriteLogEx(LOG_INFO, ("HdfsWriteTest(): hdfsHFlush success! ret=%d.", iRet));

    iRet = hdfsCloseFile(pfs, pfile);    /* close the file handle and release resources */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("HdfsWriteTest(): hdfsCloseFile failed! ret=%d, errno=%d.", iRet, errno));
        return;
    }
    WriteLogEx(LOG_INFO, ("HdfsWriteTest(): hdfsCloseFile success! ret=%d.", iRet));
    pfile = NULL;
    return;
}
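Since append is the only way to extend an existing file (see the deficiencies listed in section III), it is worth showing what that looks like. The following HdfsAppendTest function is a minimal sketch written for this article, not part of the original client: it assumes that append is enabled on the cluster and that /test/write.test already exists, and it reuses the WriteLog/WriteLogEx logging helpers from the examples above. Opening with O_WRONLY | O_APPEND is the libhdfs way to request an append.

void HdfsAppendTest(hdfsFS pfs)
{
    int iRet = 0;
    hdfsFile pfile = NULL;
    const char *szData = "appended data.";

    if (NULL == pfs) {
        WriteLog(LOG_ERROR, "HdfsAppendTest(): pfs is null.");
        return;
    }

    /* O_WRONLY | O_APPEND opens an existing file for append; the file must already exist */
    pfile = hdfsOpenFile(pfs, "/test/write.test", O_WRONLY | O_APPEND, 0, 0, 0);
    if (NULL == pfile) {
        WriteLogEx(LOG_ERROR, ("HdfsAppendTest(): hdfsOpenFile failed! errno=%d.", errno));
        return;
    }

    iRet = hdfsWrite(pfs, pfile, szData, strlen(szData));    /* data lands at the end of the file */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("HdfsAppendTest(): hdfsWrite failed! errno=%d.", errno));
    }

    if (-1 == hdfsCloseFile(pfs, pfile)) {    /* closing the handle completes the append */
        WriteLogEx(LOG_ERROR, ("HdfsAppendTest(): hdfsCloseFile failed! errno=%d.", errno));
    }
    pfile = NULL;
    return;
}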
3. Description and Analysis of the Main Problems Encountered
The disadvantages of the libhdfs interface have already been outlined in section III of this article (Introduction to the libhdfs Interface).
In actual performance testing, the problems caused by the libhdfs interface fell mainly into two categories: lease recovery exceptions, and exceptions when the program released handle resources.
After trying a variety of test models, we basically confirmed that the libhdfs interface produces the above problems in certain exceptional cases (such as frequent append operations on the same file). Therefore, if the libhdfs interface is to be used in a real project, the client program's processing flow needs to be improved to avoid or reduce these problems. The following methods can be used:
1) Reduce the read and write density on HDFS by adding a cache between the client program and the HDFS file system (see the sketch after this list);
2) Reduce update operations on the HDFS file system; for example, once a file has been written, perform no further append operations on it, only reads.
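To illustrate method 1), here is a minimal sketch of client-side write buffering, written for this article rather than taken from the original client (the buffer size and the helper names BufferedWriter, BufferedWrite, and BufferedFlush are assumptions): writes are staged in a memory buffer and handed to hdfsWrite only when the buffer fills, so the number of calls against HDFS drops by roughly the ratio of the buffer size to the average write size. It requires hdfs.h and string.h.

#define WRITE_BUF_SIZE (4 * 1024 * 1024)    /* 4 MB staging buffer; an assumed value */

typedef struct {
    hdfsFS   pfs;      /* open HDFS connection */
    hdfsFile pfile;    /* open HDFS file handle */
    char     buf[WRITE_BUF_SIZE];
    size_t   used;     /* bytes currently staged */
} BufferedWriter;

/* Stage data in memory; call hdfsWrite only when the buffer is full.
   Returns 0 on success, -1 on error. */
int BufferedWrite(BufferedWriter *w, const char *data, size_t len)
{
    while (len > 0) {
        size_t space = WRITE_BUF_SIZE - w->used;
        size_t n = (len < space) ? len : space;
        memcpy(w->buf + w->used, data, n);
        w->used += n;
        data += n;
        len -= n;
        if (w->used == WRITE_BUF_SIZE) {    /* buffer full: one large HDFS write */
            if (-1 == hdfsWrite(w->pfs, w->pfile, w->buf, (tSize)w->used))
                return -1;
            w->used = 0;
        }
    }
    return 0;
}

/* Push out whatever is still staged; call before closing the file. */
int BufferedFlush(BufferedWriter *w)
{
    if (w->used > 0) {
        if (-1 == hdfsWrite(w->pfs, w->pfile, w->buf, (tSize)w->used))
            return -1;
        w->used = 0;
    }
    return hdfsHFlush(w->pfs, w->pfile);
}

Note that a BufferedWriter of this size should be heap-allocated (for example with malloc) rather than placed on the stack.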
V. Summary
This paper has described HDFS and the C-language operations for accessing it in some detail, for the reference of relevant project developers.
As a distributed file system, HDFS is not omnipotent; for example, it is unsuitable for scenarios dominated by small files, for applications that require low access latency, or for systems that must update data frequently. Even where HDFS is a good fit, in order to use it as efficiently as possible, it may still be necessary to work around some of its drawbacks by adjusting the business layering or the logical implementation.