Using OpenStack Swift object storage as the underlying storage of Hadoop


Hadoop has the concept of an abstract file system, with several different subclass implementations; one of them is HDFS, represented by the DistributedFileSystem class. In the Hadoop 1.x releases, HDFS has a NameNode single point of failure, and it is designed for streaming access to large files rather than random reads and writes of a large number of small files. This article explores using another storage system, OpenStack Swift object storage, as the underlying storage of Hadoop: it adds Swift support to Hadoop's storage layer and presents test results, ultimately achieving a functional proof of concept (POC). It is worth noting that adding Swift support to Hadoop is not a substitute for HDFS; rather, it lets Hadoop MapReduce and its associated tools analyze the data stored in Swift directly. As a first attempt, this article does not yet consider data locality, which is left for future work. In addition, Hadoop 2.x provides a highly available HDFS solution, which is not covered by this article.

This article is intended for software developers and administrators who are interested in Hadoop and OpenStack Swift, and assumes that readers have a basic understanding of both. The versions used in this article are Hadoop 1.0.4, OpenStack Swift 1.7.4, version 1.8 of the Swift Java client API, and Swauth 1.0.4 for authentication.

The integration of Hadoop and OpenStack Swift object storage

Imagine the following scenario: you have stored large amounts of data in Swift, but want to use Hadoop to analyze that data and mine useful information from it. One option is to export the data from the Swift cluster to an intermediate server and then import it into HDFS, where MapReduce jobs can analyze it. If the amount of data is very large, the whole import process takes a long time and consumes additional storage space.

If Hadoop and OpenStack Swift can instead be integrated so that Hadoop accesses the Swift object store directly and runs MapReduce jobs against the data stored in Swift, efficiency improves and hardware costs are reduced.

Hadoop Abstract File System API

org.apache.hadoop.fs.FileSystem is Hadoop's abstract base class for a generic file system. It abstracts the operations a file system performs on files and directories, such as creating, copying, moving, renaming, and deleting files and directories, reading and writing files, and reading and writing file metadata, along with other common file system operations. The main methods of the FileSystem abstract class and their meanings are shown in Table 1.

Table 1. The main methods of the FileSystem abstract class and their meanings

void initialize(URI, Configuration): initializes the file system from a configuration
FSDataInputStream open(Path, int): opens the file at the given path and returns an input stream
FSDataOutputStream create(Path, FsPermission, boolean, int, short, long, Progressable): creates a file at the given path and returns an output stream
boolean rename(Path, Path): renames a file or directory
boolean delete(Path, boolean): deletes a file or directory
boolean mkdirs(Path, FsPermission): creates a directory
FileStatus getFileStatus(Path): gets the metadata of a file or directory
FileStatus[] listStatus(Path): gets the metadata of all files and directories in a directory
URI getUri(): gets the URI of the file system
Path getWorkingDirectory(): gets the current working directory
void setWorkingDirectory(Path): sets the current working directory
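
As a minimal illustration of how this API is used, the following sketch obtains the configured file system and exercises a few of the methods from Table 1. The path name is hypothetical, and a configured Hadoop client is assumed to be on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the file system named by fs.default.name in core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Create a file at a hypothetical path and write to its output stream.
        Path file = new Path("/demo/hello.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello");
        out.close();

        // List the metadata of all entries in the parent directory.
        for (FileStatus status : fs.listStatus(file.getParent())) {
            System.out.println(status.getPath() + " " + status.getLen());
        }
        fs.close();
    }
}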

The FileSystem abstract class has several different subclass implementations, including a local file system implementation, a distributed file system implementation, an in-memory file system implementation, an FTP file system implementation, third-party (non-Apache) storage system implementations, and implementations that access a distributed file system over the HTTP and HTTPS protocols. The LocalFileSystem class represents a local file system with client-side checksums and is the default file system when Hadoop is not otherwise configured. The distributed file system implementation is the DistributedFileSystem class, that is, HDFS, which is used to store massive amounts of data; the typical application is a data set larger than the total disk capacity of a single machine. The third-party storage system implementations are open source implementations provided by vendors other than Apache, such as the S3FileSystem and NativeS3FileSystem classes, which use Amazon S3 as the underlying storage.

By reading Hadoop's file system related source code and Javadoc, with the help of tools, one can work out the meaning and usage of each abstract method of the FileSystem class, as well as the inheritance and dependency relationships among the classes of the file system API. The org.apache.hadoop.fs package contains Hadoop's file system related interfaces and classes, such as the file input stream class FSDataInputStream, the output stream class FSDataOutputStream, and the file metadata class FileStatus. All input/output stream classes are associated by composition with the FSDataInputStream and FSDataOutputStream classes, and all file system subclass implementations inherit from the FileSystem abstract class. The class diagram of the Hadoop file system API is shown in Figure 1.

Figure 1. Hadoop FileSystem API class diagram

Take S3FileSystem as an example: its underlying storage system is Amazon S3. It inherits the FileSystem abstract class as a concrete implementation of it, and implements the input/output streams for Amazon S3. The user can set the fs.default.name property in Hadoop's configuration file core-site.xml to the URI of the Amazon S3 storage system, so that Hadoop can access Amazon S3 and run MapReduce jobs on it.
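
For instance, the default file system can be pointed at an S3 bucket either in core-site.xml or programmatically. The sketch below uses a hypothetical bucket name and assumes the AWS credentials (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey) are already configured:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3ConfigDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Equivalent to setting fs.default.name in core-site.xml; the
        // s3n:// scheme selects NativeS3FileSystem, s3:// selects S3FileSystem.
        conf.set("fs.default.name", "s3n://example-bucket");
        FileSystem fs = FileSystem.get(conf); // returns a NativeS3FileSystem
        System.out.println(fs.getUri());      // prints s3n://example-bucket
    }
}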

Swift's Java client API

Swift provides its storage service over the HTTP protocol through a RESTful API. Swift itself is implemented in Python, but client APIs are available for a variety of programming languages, such as Python, Java, PHP, C#, and Ruby. These client APIs interact with the proxy nodes of a Swift cluster by sending HTTP requests and receiving HTTP responses. On top of the REST API, the Swift client APIs provide higher-level operations on containers and objects, making it easier for programmers to write programs that access Swift.
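
To make the protocol concrete, the sketch below performs the v1.0 authentication handshake used by Swauth directly over HTTP: the client sends its credentials in the X-Auth-User and X-Auth-Key headers and receives back a token and a storage URL for subsequent container and object requests. The proxy address and credentials are placeholders:

import java.net.HttpURLConnection;
import java.net.URL;

public class SwiftAuthDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical proxy node address; Swauth serves /auth/v1.0.
        URL url = new URL("http://127.0.0.1:8080/auth/v1.0");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("X-Auth-User", "system:root"); // account:user
        conn.setRequestProperty("X-Auth-Key", "testpass");     // password

        // On success, Swift replies 200 OK with a token and the
        // account's storage URL for use in later requests.
        System.out.println("HTTP " + conn.getResponseCode());
        System.out.println("X-Auth-Token: " + conn.getHeaderField("X-Auth-Token"));
        System.out.println("X-Storage-Url: " + conn.getHeaderField("X-Storage-Url"));
    }
}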

Swift's Java client API, called java-cloudfiles, is an open source project. Its FilesClient class provides a variety of operations on the Swift object store, including logging in to Swift; creating and deleting accounts, containers, and objects; getting account, container, and object metadata; and reading and writing objects. Other related classes include FilesContainer, FilesObject, FilesContainerInfo, and FilesObjectMetaData, which represent the containers and objects in Swift and their corresponding metadata, such as the number of objects a container holds and an object's size and modification time. Version 1.8 of java-cloudfiles is compatible with the open source version of Swift. The main methods of the FilesClient class and their meanings are shown in Table 2.

Table 2. The main methods of the FilesClient class and their meanings

FilesClient(String, String, String, String, int): constructor; the parameters include the proxy node URL, account, user name, password, and timeout
boolean login(): logs in to Swift
void createContainer(String): creates a container
boolean deleteContainer(String): deletes a container
boolean containerExists(String): checks whether a container exists
boolean storeObject(String, byte[], String, String, Map<String,String>): stores the contents of a byte array as an object, keeping the metadata in extended attributes
byte[] getObject(String, String): gets an object's content from Swift into a byte array
List<FilesContainer> listContainers(): lists all the containers in an account
List<FilesObject> listObjects(String): lists all the objects in a container
FilesContainerInfo getContainerInfo(String): gets the metadata of a container
FilesObjectMetaData getObjectMetaData(String, String): gets the metadata of an object
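
A minimal sketch of the client in use, based on the methods listed in Table 2. The proxy URL and credentials are placeholders, and the constructor's exact parameter order should be checked against the java-cloudfiles source:

import java.util.HashMap;
import com.rackspacecloud.client.cloudfiles.FilesClient;

public class SwiftClientDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials and proxy URL; timeout in milliseconds.
        FilesClient client = new FilesClient("system:root", "testpass",
                "http://127.0.0.1:8080/auth/v1.0", null, 5000);
        if (client.login()) {
            client.createContainer("demo");
            // Store a small object, then read it back.
            byte[] data = "hello swift".getBytes("UTF-8");
            client.storeObject("demo", data, "text/plain", "hello.txt",
                    new HashMap<String, String>());
            byte[] back = client.getObject("demo", "hello.txt");
            System.out.println(new String(back, "UTF-8"));
        }
    }
}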

To sum up, the Hadoop file system API is open to new file system implementations, and Swift can be accessed from the Java language; these two points make it possible to extend Hadoop's abstract file system to Swift.
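
As a sketch of what such an extension might look like (a hypothetical SwiftFileSystem skeleton, not a finished implementation; the swift.* property names are made up for illustration), a new subclass would wrap FilesClient and map the FileSystem operations of Table 1 onto the client calls of Table 2:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

import com.rackspacecloud.client.cloudfiles.FilesClient;

// Hypothetical skeleton: a FileSystem subclass that delegates to the
// Swift Java client API. Each stub below would map a Table 1 operation
// onto the corresponding FilesClient call from Table 2.
public class SwiftFileSystem extends FileSystem {

    private FilesClient client;
    private URI uri;
    private Path workingDir = new Path("/");

    @Override
    public void initialize(URI uri, Configuration conf) throws IOException {
        super.initialize(uri, conf);
        this.uri = uri;
        // The swift.* property names are illustrative; connection details
        // would be read from core-site.xml, like fs.default.name is.
        client = new FilesClient(conf.get("swift.auth.user"),
                conf.get("swift.auth.key"),
                conf.get("swift.auth.url"), null, 5000);
        try {
            client.login();
        } catch (Exception e) {
            throw new IOException("Unable to log in to Swift: " + e);
        }
    }

    @Override
    public URI getUri() { return uri; }

    @Override
    public Path getWorkingDirectory() { return workingDir; }

    @Override
    public void setWorkingDirectory(Path dir) { workingDir = dir; }

    @Override
    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        throw new IOException("not yet implemented"); // would use getObject
    }

    @Override
    public FSDataOutputStream create(Path f, FsPermission permission,
            boolean overwrite, int bufferSize, short replication,
            long blockSize, Progressable progress) throws IOException {
        throw new IOException("not yet implemented"); // would use storeObject
    }

    @Override
    public FSDataOutputStream append(Path f, int bufferSize,
            Progressable progress) throws IOException {
        throw new IOException("not yet implemented");
    }

    @Override
    public boolean rename(Path src, Path dst) throws IOException {
        throw new IOException("not yet implemented");
    }

    @Override
    @Deprecated
    public boolean delete(Path f) throws IOException {
        return delete(f, true);
    }

    @Override
    public boolean delete(Path f, boolean recursive) throws IOException {
        throw new IOException("not yet implemented"); // would use deleteContainer
    }

    @Override
    public FileStatus[] listStatus(Path f) throws IOException {
        throw new IOException("not yet implemented"); // would use listObjects
    }

    @Override
    public boolean mkdirs(Path f, FsPermission permission) throws IOException {
        throw new IOException("not yet implemented"); // would use createContainer
    }

    @Override
    public FileStatus getFileStatus(Path f) throws IOException {
        throw new IOException("not yet implemented"); // would use getObjectMetaData
    }
}

In Hadoop 1.x, such a class would then be registered by setting the fs.swift.impl property (following the fs.<scheme>.impl convention) in core-site.xml, so that swift:// URIs resolve to it.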
