A Brief Introduction to HDFS and C-Language Access to the HDFS Interface in Practice

Source: Internet
Author: User
Tags: hdfs, dfs, hadoop, fs

I. Overview
In recent years big data technology has been in full swing, and how to store huge volumes of data has become one of today's hot and difficult problems. The HDFS distributed file system, as the distributed storage foundation of the Hadoop project, also provides data persistence for HBase, and it is widely used in big data projects.
The Hadoop Distributed File System (HDFS) is designed as a distributed file system suitable for running on general-purpose (commodity) hardware. HDFS is a core subproject of the Hadoop project: a distributed file system with high fault tolerance, high reliability, high scalability, and high throughput, which can be used to store massive data in cloud computing and other big data applications (mainly for large-file storage).
This article, combining the author's and colleagues' experience in learning and practicing HDFS, first introduces the features of HDFS and its most important shell commands (the hadoop and hdfs commands). It then introduces the similarities and differences between libhdfs, the C access interface provided by HDFS, and the C API of an ordinary file system. Next it shows how to use the libhdfs interface to implement a simple HDFS client, with relevant application examples, and finally it describes and analyzes the problems encountered while writing that client.

II. A brief introduction to HDFS
HDFS is the core subproject of the Hadoop project: a distributed file system characterized by high fault tolerance, high reliability, high scalability, and high throughput.


1. HDFS features
As a distributed file system, HDFS mainly has the following features:
1) It is primarily used to store and manage large data files (the default HDFS block size is 128 MB, so it is mainly suited to files of hundreds of MB and above).
2) Its data nodes can be scaled horizontally, and inexpensive commodity hardware can be used.
3) It is designed around "write once, read many times".
4) Modifying file contents at arbitrary offsets is not supported; only append operations at the end of a file are allowed.
5) It is not suitable for low-latency (tens of ms) data access applications (such applications can consider the HBase distributed database, or an ES + distributed file system architecture).
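To make feature 1) concrete, the block count for a file follows from the default 128 MB block size. The sketch below is plain arithmetic with no HDFS calls; the function name is the author's own, not part of any Hadoop API:

```c
#include <stddef.h>

/* Default HDFS block size since Hadoop 2.x: 128 MB. */
#define HDFS_BLOCK_SIZE (128LL * 1024 * 1024)

/* Number of HDFS blocks a file of the given size occupies (ceiling division).
 * A 1 GB file spans 8 blocks, while a 1 KB file still costs one whole block
 * entry in the NameNode's metadata -- which is why HDFS favours large files. */
long long BlockCount(long long fileSizeBytes)
{
    if (fileSizeBytes <= 0) {
        return 0;
    }
    return (fileSizeBytes + HDFS_BLOCK_SIZE - 1) / HDFS_BLOCK_SIZE;
}
```

Note that HDFS does not pad the last block on disk; the cost of many small files is the per-block and per-file metadata held in NameNode memory.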

2. A brief introduction to commonly used HDFS shell commands
Hadoop has two very important shell commands: hadoop and hdfs. For managing the HDFS file system, the functionality of the hadoop and hdfs scripts overlaps heavily; the two commands are described separately below.


1) Using the hadoop command
Almost all Hadoop management commands are integrated into a single shell script, bin/hadoop; running the hadoop script with appropriate parameters manages the Hadoop modules.

Since this article focuses on the HDFS file system, here is how to use the hadoop script to operate on it. The relevant commands and parameters are shown in Figure 1.

Figure 1 Hadoop FS Command and description of the parameters
Here is a concrete example: list all the folders and files under the root directory of the HDFS file system, create a test folder, give it 777 permissions, and finally delete the folder. The main commands to perform these operations are as follows:

hadoop fs -ls /
hadoop fs -mkdir /test
hadoop fs -chmod 777 /test
hadoop fs -rmdir /test

The hadoop fs commands cover the management of users, groups, and permissions, of folders and files, and of file uploads and downloads, but they can do little for deeper management of the HDFS file system (such as bad-block cleanup), node management, snapshot management, or formatting. For those, the hdfs command is needed.

2) Using the hdfs command
All HDFS management commands are integrated into a single shell script, bin/hdfs. Running the hdfs script with appropriate parameters manages the HDFS file system, including key file, folder, permission, and group operations, as well as data-block management and formatting.
The hdfs script also implements one powerful feature: the hdfs dfs command is fully consistent with the hadoop fs command, so hdfs dfs accepts all the parameters in Figure 1 and provides the complete functionality of hadoop fs. The following mainly introduces formatting and data-block management operations on the HDFS file system.


To format an HDFS file system, use a command such as:
hdfs namenode -format
To delete the bad blocks in an HDFS file system along with the corresponding corrupted files, use a command such as:
hdfs fsck / -delete

III. A brief introduction to the libhdfs interface
The Hadoop FileSystem APIs are Java client APIs; HDFS does not provide a native C-language access interface.

However, HDFS does provide libhdfs, a JNI-based C call interface, which makes C-language access to HDFS very convenient.


The header and library files of the libhdfs interface are already included in the Hadoop release and can be used directly. The header file hdfs.h is generally located in the ${HADOOP_HOME}/include directory, and the library file libhdfs.so is usually located in the ${HADOOP_HOME}/lib/native directory.

Because Hadoop is constantly being updated, the libhdfs interfaces of different versions often differ, mainly in the form of added functionality.
Accessing the HDFS file system through libhdfs is similar to accessing an ordinary operating-system file system through the C-language API, but there are some deficiencies, as follows:
a) The functionality of the libhdfs interface is only a subset of the Java client API, and compared with the Java client API it may contain many bugs that have not yet been found or fixed, such as handle-resource release problems when the file system is accessed from multiple threads or processes.
b) In addition, because Java classes are invoked through JNI, the application consumes more memory, and the interface can produce a large volume of exception logs; how to manage these logs is a problem. Moreover, when an HDFS operation fails, errno does not necessarily carry the correct hint, which adds to the difficulty of troubleshooting.
c) There are still few published use cases for libhdfs, and even fewer examples of high-volume, multi-threaded reads and writes of the HDFS file system through libhdfs.
d) At present, libhdfs does not support modifying file contents at arbitrary offsets; only append operations at the end of a file, or truncate operations on a whole file, can be performed. This is mainly a consequence of the HDFS design: big-data storage generally lacks update support, and we can only work around this at the business level.
e) The libhdfs functionality carried by successive Hadoop releases keeps growing, so every Hadoop version update may force us to recompile our applications, which adds upgrade difficulty.


For now the libhdfs interface has these deficiencies, but I believe that as it continues to be updated, its functionality and stability will improve considerably.

IV. C-language access to HDFS in practice
1. Build and run environment
To compile the C-language client program successfully, a Java JDK (1.7 or above) and a Hadoop distribution must be installed in advance; these provide libraries such as libjvm.so and libhdfs.so, which libhdfs needs at link time.
To run the C-language client program successfully, several key environment variables must also be set, including LD_LIBRARY_PATH and CLASSPATH.
For the LD_LIBRARY_PATH environment variable, the main point is to add the paths of the libjvm.so and libhdfs.so libraries; for CLASSPATH, the full paths of all the jar packages shipped with Hadoop are required (they can be collected with a find + awk command combination). Otherwise the C-language client program will keep reporting missing-class errors and will not run.
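As a sketch of the setup just described (the JDK and Hadoop install paths below are assumptions and must be adjusted to the actual system):

```shell
# Hypothetical install locations -- adjust for your system.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop

# LD_LIBRARY_PATH must cover libjvm.so (from the JDK) and libhdfs.so (from Hadoop).
export LD_LIBRARY_PATH=$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

# CLASSPATH must list every jar shipped with Hadoop; the find+awk combination
# mentioned above collects them into one colon-separated string.
export CLASSPATH=$(find "$HADOOP_HOME" -name '*.jar' | awk '{printf "%s:", $0}')$CLASSPATH
```

The exact libjvm.so subdirectory varies with the JDK version and platform, so it is worth locating the file first (e.g. with find "$JAVA_HOME" -name libjvm.so).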

2. Simple libhdfs application practice
Here is a demonstration of the use of some of the APIs.
1) Get the capacity of the HDFS file system and the amount of space already used, as shown in the GetHdfsInfo function:

void GetHdfsInfo(void)
{
    hdfsFS pFS = NULL;
    int iRet = 0;
    tOffset iTmp = 0;

    pFS = hdfsConnect("hdfs://127.0.0.1:9000/", 0);    /* connect to the HDFS file system */
    if (NULL == pFS) {
        WriteLogEx(LOG_ERROR, ("GetHdfsInfo(): hdfsConnect failed! errno=%d.", errno));
        return;
    }
    WriteLog(LOG_INFO, "GetHdfsInfo(): hdfsConnect success!");

    iTmp = hdfsGetCapacity(pFS);    /* get the capacity of the HDFS file system */
    if (-1 == iTmp) {
        WriteLogEx(LOG_ERROR, ("GetHdfsInfo(): hdfsGetCapacity failed! errno=%d.", errno));
        hdfsDisconnect(pFS);
        pFS = NULL;
        return;
    }
    WriteLogEx(LOG_INFO, ("GetHdfsInfo(): hdfsGetCapacity success! offset=%ld.", iTmp));

    iTmp = hdfsGetUsed(pFS);    /* get the amount of space already used in the HDFS file system */
    if (-1 == iTmp) {
        WriteLogEx(LOG_ERROR, ("GetHdfsInfo(): hdfsGetUsed failed! errno=%d.", errno));
        hdfsDisconnect(pFS);
        pFS = NULL;
        return;
    }
    WriteLogEx(LOG_INFO, ("GetHdfsInfo(): hdfsGetUsed success! offset=%ld.", iTmp));

    iRet = hdfsDisconnect(pFS);    /* close the connection to the HDFS file system */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("GetHdfsInfo(): hdfsDisconnect failed! errno=%d.", errno));
        return;
    }
    WriteLogEx(LOG_INFO, ("GetHdfsInfo(): hdfsDisconnect success! ret=%d.", iRet));
    pFS = NULL;
    return;
}

2) Create a file in the HDFS file system and write data to it, as shown in the HdfsWriteTest function:

void HdfsWriteTest(hdfsFS pFS)
{
    int iRet = 0;
    hdfsFile pFile = NULL;
    char szTestFile[64] = "/test/write.test";

    if (NULL == pFS) {
        WriteLog(LOG_ERROR, "HdfsWriteTest(): pFS is NULL.");
        return;
    }

    pFile = hdfsOpenFile(pFS, szTestFile, O_WRONLY | O_CREAT, 0, 0, 0);    /* open a file handle */
    if (NULL == pFile) {
        WriteLogEx(LOG_ERROR, ("HdfsWriteTest(): hdfsOpenFile failed! szFilePath=%s, errno=%d.", szTestFile, errno));
        return;
    }
    WriteLogEx(LOG_INFO, ("HdfsWriteTest(): hdfsOpenFile success! szFilePath=%s.", szTestFile));

    iRet = hdfsWrite(pFS, pFile, "Hello world!", strlen("Hello world!"));    /* write data */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("HdfsWriteTest(): hdfsWrite failed! ret=%d, errno=%d.", iRet, errno));
        hdfsCloseFile(pFS, pFile);
        pFile = NULL;
        return;
    }
    WriteLogEx(LOG_INFO, ("HdfsWriteTest(): hdfsWrite success! ret=%d.", iRet));

    iRet = hdfsHFlush(pFS, pFile);    /* flush buffered data out to the datanodes */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("HdfsWriteTest(): hdfsHFlush failed! ret=%d, errno=%d.", iRet, errno));
        hdfsCloseFile(pFS, pFile);
        pFile = NULL;
        return;
    }
    WriteLogEx(LOG_INFO, ("HdfsWriteTest(): hdfsHFlush success! ret=%d.", iRet));

    iRet = hdfsCloseFile(pFS, pFile);    /* close the file handle, releasing resources */
    if (-1 == iRet) {
        WriteLogEx(LOG_ERROR, ("HdfsWriteTest(): hdfsCloseFile failed! ret=%d, errno=%d.", iRet, errno));
        return;
    }
    WriteLogEx(LOG_INFO, ("HdfsWriteTest(): hdfsCloseFile success! ret=%d.", iRet));
    pFile = NULL;
    return;
}

3. Description and analysis of the main problems encountered
The shortcomings of the libhdfs interface were roughly described in Part III of this article.
During actual performance testing, the problems caused by the libhdfs interface mainly included lease-recovery exceptions and handle-resource release exceptions in the program.


We tried a variety of test models and basically confirmed that the libhdfs interface produces the problems above in certain exceptional cases (such as frequently appending to the same file).

Therefore, if we need to use the libhdfs interface in a real project, we must improve the client program's processing flow to avoid or reduce the problems above. Methods such as the following can be used:
1) Add a cache between the client program and the HDFS file system to reduce the read/write pressure on HDFS;
2) Reduce update operations on the HDFS file system; for example, after a file has been written, perform only read operations on it and no further appends.
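Method 1) can be sketched as a small client-side write buffer: many small writes accumulate in memory and are handed to the backend in one large operation, so the number of expensive HDFS writes drops. The code below is a hypothetical illustration; the flush callback stands in for a wrapper around hdfsWrite, and all names are the author's own, not part of libhdfs:

```c
#include <stdlib.h>
#include <string.h>

/* Flush callback: pushes one contiguous chunk to the backend (e.g. hdfsWrite).
 * Returns 0 on success, -1 on failure. */
typedef int (*FlushFn)(const char *data, size_t len);

typedef struct {
    char   *buf;     /* in-memory accumulation buffer            */
    size_t  cap;     /* buffer capacity in bytes                 */
    size_t  used;    /* bytes currently buffered                 */
    FlushFn flush;   /* called when the buffer must be drained   */
} WriteCache;

int CacheInit(WriteCache *c, size_t cap, FlushFn flush)
{
    c->buf = malloc(cap);
    if (c->buf == NULL) return -1;
    c->cap = cap;
    c->used = 0;
    c->flush = flush;
    return 0;
}

/* Push all buffered bytes to the backend and reset the buffer. */
int CacheFlush(WriteCache *c)
{
    if (c->used == 0) return 0;
    if (c->flush(c->buf, c->used) != 0) return -1;
    c->used = 0;
    return 0;
}

/* Buffer a write; drain first if it would not fit. Oversized writes bypass the buffer. */
int CacheWrite(WriteCache *c, const char *data, size_t len)
{
    if (len > c->cap) {
        if (CacheFlush(c) != 0) return -1;
        return c->flush(data, len);
    }
    if (c->used + len > c->cap && CacheFlush(c) != 0) return -1;
    memcpy(c->buf + c->used, data, len);
    c->used += len;
    return 0;
}

void CacheFree(WriteCache *c)
{
    free(c->buf);
    c->buf = NULL;
}

/* Demo flush callback: counts bytes and calls instead of touching HDFS. */
static size_t g_flushedBytes = 0;
static int    g_flushCalls   = 0;
static int DemoFlush(const char *data, size_t len)
{
    (void)data;
    g_flushedBytes += len;
    g_flushCalls++;
    return 0;
}
```

With a 1000-byte buffer, 100 writes of 10 bytes each cost zero backend calls until the buffer overflows; the same pattern applied to hdfsWrite would turn hundreds of small appends into a few large ones. A final CacheFlush before hdfsCloseFile is required so that no tail data is lost.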

V. Summary
This article has described HDFS and C-language access to HDFS in some detail, for the reference of developers of related projects.
HDFS, as a distributed file system, is not a panacea: for example, it is unsuitable for applications that store large numbers of small files, that require low latency, or that update data frequently. Even where the HDFS file system is a good fit, getting the most out of it may still require adjusting the business layering or logic to work around some of HDFS's drawbacks.

My WeChat public account: ZHOUZXI.

