Hadoop has the concept of an abstract file system with several subclass implementations, one of which is HDFS, represented by the DistributedFileSystem class. In Hadoop 1.x, the HDFS NameNode is a single point of failure, and HDFS is designed for streaming access to large files rather than for random reads and writes of large numbers of small files. This article explores using another storage system, OpenStack Swift object storage, as the underlying storage of Hadoop, adding support for OpenStack Swift to Hadoop's storage layer, and presents test results, ultimately achieving a functional proof of concept (POC). It is worth noting that adding Swift support to Hadoop is not meant to replace HDFS, but to let data already stored in Swift be analyzed directly with Hadoop MapReduce and its associated tools. As a first-stage attempt, this article does not consider data locality, which will be part of future work. In addition, Hadoop 2.x provides a highly available HDFS solution, which is beyond the scope of this article.
This article is intended for software developers and administrators who are interested in Hadoop and OpenStack Swift, and assumes that readers have a basic understanding of both. The versions used in this article are Hadoop 1.0.4, OpenStack Swift 1.7.4, version 1.8 of the Swift Java client API (java-cloudfiles), and Swauth 1.0.4 for authentication.
Integration of Hadoop with OpenStack Swift object storage
Consider the following scenario: a large amount of data is stored in Swift, and you want to use Hadoop to analyze it and mine useful information. One option is to export the data from the Swift cluster to an intermediate server and then import it into HDFS, where it can be analyzed by running MapReduce jobs. If the amount of data is very large, the import process takes a long time and consumes extra storage space.
If Hadoop and OpenStack Swift can be integrated so that Hadoop accesses the Swift object store directly and runs MapReduce jobs on the data stored in Swift, efficiency improves and hardware costs drop.
Hadoop Abstract File System API
org.apache.hadoop.fs.FileSystem is the abstract base class for file systems in Hadoop. It abstracts the operations a file system performs on files and directories, such as creating, copying, moving, renaming, and deleting files and directories, reading and writing files, and reading and writing file metadata, as well as other common file system operations. The methods of the FileSystem abstract class and their meanings can be found in the official Hadoop API documentation.
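As a brief illustration, here is a minimal sketch of the FileSystem API in use (not code from the article; the path and contents are placeholders):

```java
// Minimal sketch of the Hadoop 1.x FileSystem API; path and data are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileSystemDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Returns the implementation configured via fs.default.name
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/tmp/demo.txt");
        FSDataOutputStream out = fs.create(file);    // create a file
        out.writeBytes("hello, filesystem");
        out.close();

        FileStatus status = fs.getFileStatus(file);  // read file metadata
        System.out.println("size = " + status.getLen());

        fs.rename(file, new Path("/tmp/demo2.txt")); // rename
        fs.delete(new Path("/tmp/demo2.txt"), false); // delete (non-recursive)
    }
}
```

Because client code is written against this abstract class, any subclass implementation can be swapped in underneath it without changing the application.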
The FileSystem abstract class has several subclass implementations, including a local file system implementation, a distributed file system implementation, an in-memory file system implementation, an FTP file system implementation, third-party (non-Apache) storage system implementations, and implementations that access distributed file systems over the HTTP and HTTPS protocols. The LocalFileSystem class represents a local file system with client-side checksums and is the default file system when Hadoop is not configured otherwise. The distributed file system implementation is the DistributedFileSystem class, that is, HDFS, used to store large amounts of data; the typical application is a data set larger than the total disk capacity of a single machine. Third-party storage system implementations are open source implementations provided by vendors other than Apache, such as the S3FileSystem and NativeS3FileSystem classes, which use Amazon S3 as the underlying storage.
By reading Hadoop's file-system-related source code and Javadoc, and with the help of tools, you can work out the meaning and usage of each abstract method of the FileSystem class, as well as the inheritance and dependency relationships among the file system APIs. The org.apache.hadoop.fs package contains the Hadoop file system interfaces and classes, such as the file input stream class FSDataInputStream, the output stream class FSDataOutputStream, and the file metadata class FileStatus. All input/output stream classes are composed with the FSDataInputStream and FSDataOutputStream classes, and all file system subclass implementations inherit from the FileSystem abstract class. The class diagram of the Hadoop file system API is shown in the following illustration.
The S3FileSystem class, for example, uses Amazon S3 as its underlying storage system. It inherits from the FileSystem abstract class as a concrete implementation and provides the input/output streams for Amazon S3. A user can set the fs.default.name property in Hadoop's core-site.xml configuration file to the URI of the Amazon S3 storage system, so that Hadoop can access Amazon S3 and run MapReduce jobs on it.
Swift's Java client API
Swift provides storage services externally over the HTTP protocol through a RESTful API. Swift itself is implemented in Python, but client APIs are available for a variety of programming languages, such as Python, Java, PHP, C#, and Ruby. These client APIs interact with the proxy nodes of a Swift cluster by sending HTTP requests and receiving HTTP responses, and on top of the REST API they provide higher-level operations on containers and objects, making it easier for programmers to write programs that access Swift.
Swift's Java client API, called java-cloudfiles, is also an open source project. The FilesClient class provides a variety of operations on the Swift object store, including logging in to Swift; creating and deleting accounts, containers, and objects; obtaining account, container, and object metadata; and reading and writing objects. Other related classes include FilesContainer, FilesObject, FilesContainerInfo, and FilesObjectMetaData, which represent the containers and objects in Swift and their corresponding metadata, such as the number of objects a container holds and the size and modification time of an object. java-cloudfiles version 1.8 is compatible with the open source version of Swift. The main methods of the FilesClient class and their meanings are shown in the table below.
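A minimal sketch of how these classes are used follows; the auth URL and credentials are placeholders, and exact constructor signatures may differ slightly between java-cloudfiles versions:

```java
// Sketch of java-cloudfiles usage; auth URL and credentials are placeholders.
import java.util.HashMap;
import com.rackspacecloud.client.cloudfiles.FilesClient;

public class FilesClientDemo {
    public static void main(String[] args) throws Exception {
        FilesClient client = new FilesClient(
                "account:username", "password",
                "http://proxy.example.com:8080/auth/v1.0");
        if (client.login()) {                        // authenticate (e.g. via Swauth)
            client.createContainer("demo");          // create a container
            client.storeObject("demo", "hello".getBytes(),
                    "application/octet-stream", "greeting.txt",
                    new HashMap<String, String>());  // metadata map (extended attributes)
            byte[] data = client.getObject("demo", "greeting.txt");
            System.out.println(new String(data));
        }
    }
}
```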
To sum up, two points make it possible to extend the Hadoop abstract file system: the Hadoop file system API accepts new file system implementations through its pluggable mechanism, and the Java client API makes it possible to interact with Swift from the Java language.
Design of the Swift Adapter
As noted above, extending the abstract file system of Hadoop requires two things: inheriting from and implementing the FileSystem abstract class, and using Swift's Java client API in the implementation class to perform the various file operations. The design of the extended system therefore follows the object adapter pattern from software design patterns. The object adapter pattern is used for interface adaptation: it converts the interface of a class into another interface that the client program expects, so that classes that could not otherwise work together because of incompatible interfaces can cooperate.
In the extended system, the Swift adapter invokes Swift's Java client API to operate on the Swift object store, and the Hadoop MapReduce API invokes the Hadoop file system API; to MapReduce, the underlying HDFS and Swift are transparent. The API layer in which the Swift adapter sits, compared with HDFS, is shown in the following illustration.
The detailed design of the Swift adapter is as follows. SwiftAdapter is the adapter class and FilesClient is the adaptee: the SwiftAdapter class inherits from the FileSystem abstract class and has a composition relationship with the FilesClient class, holding a reference to it. The FilesClient class is a class in Swift's Java client API. The Swift input/output streams are as follows: SwiftInputStream is the input stream for Swift; SwiftByteArrayInputStream is an input stream containing a byte-array buffer; SwiftInputStream holds a reference to a SwiftByteArrayInputStream; and SwiftOutputStream is the output stream for Swift. Each inherits from the corresponding file system input/output stream base class or interface, so the input stream supports seek and the output stream supports flush, among other operations. The class diagram of the Swift adapter is shown in the following illustration.
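The skeleton below sketches this structure. It is declared abstract because most FileSystem methods are omitted for brevity; the credential property names are illustrative, and SwiftInputStream is sketched in the implementation section below:

```java
// Structural sketch of the adapter: SwiftAdapter (adapter) composes FilesClient
// (adaptee). Most FileSystem methods are omitted, so the class is abstract here.
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.rackspacecloud.client.cloudfiles.FilesClient;

public abstract class SwiftAdapter extends FileSystem {
    private FilesClient client;   // composition: reference to the adaptee
    private URI uri;

    @Override
    public void initialize(URI fsUri, Configuration conf) throws IOException {
        super.initialize(fsUri, conf);
        this.uri = fsUri;
        try {
            // Property names below are illustrative, not fixed by the article.
            client = new FilesClient(conf.get("fs.swift.user"),
                    conf.get("fs.swift.password"),
                    "http://" + fsUri.getHost() + ":8080/auth/v1.0");
            client.login();
        } catch (Exception e) {
            throw new IOException("cannot log in to Swift", e);
        }
    }

    @Override
    public URI getUri() { return uri; }

    @Override
    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        // Adapter method: translates a FileSystem call into FilesClient calls.
        return new FSDataInputStream(new SwiftInputStream(client, f));
    }
    // create(), delete(), listStatus(), mkdirs(), getFileStatus(), ... omitted.
}
```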
The following table details the relationships among the classes in the Swift adapter:
Implementation of the Swift Adapter
Implementation Details
Before interacting with Swift you must log in to it, so the implementation uses an account, user name, and password that were created in Swift. The implementation details are as follows.
Invoke Swift's Java client API to implement the input/output streams for Swift.
In Hadoop, all input stream classes need to inherit from and implement the FSInputStream abstract class, with the focus on implementing the read and seek methods. The read method reads the next byte from the input stream and is the most basic method of an input stream class; the seek method sets the read position of the input stream, and if a byte array is used as a buffer, random access to any byte can be implemented. The SwiftByteArrayInputStream class inherits from the ByteArrayInputStream class and implements the Seekable interface, using a byte array as its buffer. The SwiftInputStream class inherits from the FSInputStream abstract class and holds a reference to a SwiftByteArrayInputStream; it invokes Swift's Java client API to read an object into the byte-array buffer. With this implementation, the Swift input stream class SwiftInputStream supports the basic input stream operations read and seek.
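A condensed sketch of these two classes follows; the container/object lookup and error handling are simplified:

```java
// Condensed sketch of the Swift input streams; helper logic is simplified.
import java.io.ByteArrayInputStream;
import java.io.IOException;
import org.apache.hadoop.fs.FSInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Seekable;
import com.rackspacecloud.client.cloudfiles.FilesClient;

// A byte-array buffer that also supports seek(), as Hadoop requires.
class SwiftByteArrayInputStream extends ByteArrayInputStream implements Seekable {
    public SwiftByteArrayInputStream(byte[] buf) { super(buf); }
    public void seek(long position) { this.pos = (int) position; } // random access
    public long getPos() { return this.pos; }
    public boolean seekToNewSource(long targetPos) { return false; }
}

public class SwiftInputStream extends FSInputStream {
    private final SwiftByteArrayInputStream in;

    public SwiftInputStream(FilesClient client, Path path) throws IOException {
        String p = path.toUri().getPath().substring(1); // e.g. container/dir/file
        String[] parts = p.split("/", 2);               // [container, object]
        try {
            // Fetch the whole object into the byte-array buffer.
            byte[] data = client.getObject(parts[0], parts[1]);
            this.in = new SwiftByteArrayInputStream(data);
        } catch (Exception e) {
            throw new IOException("cannot read object for " + path, e);
        }
    }

    @Override
    public int read() throws IOException { return in.read(); }

    @Override
    public void seek(long pos) throws IOException { in.seek(pos); }

    @Override
    public long getPos() throws IOException { return in.getPos(); }

    @Override
    public boolean seekToNewSource(long targetPos) throws IOException { return false; }
}
```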
In Hadoop, an output stream class only needs to be a subclass of the OutputStream abstract class, with the focus on implementing the write and flush methods; it may optionally implement the sync method of the Syncable interface, which synchronizes buffered data with the underlying storage device. The write method writes a byte to the output stream and is the most basic method of an output stream class. The SwiftOutputStream class inherits from ByteArrayOutputStream, a subclass of the OutputStream abstract class, and invokes the Swift Java client API in its flush method to store all the bytes in the buffer to an object in Swift. With this implementation, the Swift output stream class SwiftOutputStream supports the basic output stream operations write and flush.
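A condensed sketch of the output stream follows: bytes accumulate in the ByteArrayOutputStream buffer and are written to Swift on flush.

```java
// Condensed sketch of the Swift output stream; error handling is simplified.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;
import com.rackspacecloud.client.cloudfiles.FilesClient;

public class SwiftOutputStream extends ByteArrayOutputStream {
    private final FilesClient client;
    private final String container;
    private final String object;

    public SwiftOutputStream(FilesClient client, String container, String object) {
        this.client = client;
        this.container = container;
        this.object = object;
    }

    @Override
    public void flush() throws IOException {
        try {
            // Store the entire buffer as a Swift object via a proxy node.
            client.storeObject(container, toByteArray(),
                    "application/octet-stream", object,
                    new HashMap<String, String>());
        } catch (Exception e) {
            throw new IOException("cannot store object " + object, e);
        }
    }

    @Override
    public void close() throws IOException {
        flush();
        super.close();
    }
}
```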
Invoke Swift's Java client API to implement the various file operations of SwiftAdapter.
The operations implemented include: opening a file and returning an input stream, creating a file and returning an output stream, deleting a path, determining whether a path exists, obtaining a path's metadata, obtaining the file system's URI, obtaining the working directory, and creating a directory. A directory corresponds to a container in Swift, and a file corresponds to an object in Swift. Several issues required special treatment during implementation.
First, because the namespace in the Swift object store is flat, with no directory hierarchy, paths require special processing; specifically, file names are allowed to contain slashes (/). In an ordinary POSIX-compliant file system, a slash cannot be part of a file name and is an illegal character, but it is allowed in Swift. In this way, a virtual directory hierarchy can be implemented. The root path component is then the name of the container, and the entire path after it is the name of the object.
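For example, the mapping can be done as follows (a hypothetical helper, not the article's code):

```java
// Hypothetical helper illustrating the virtual directory hierarchy:
// first path component -> container name, remainder -> object name.
public class SwiftPathDemo {
    static String[] containerAndObject(String absolutePath) {
        String[] parts = absolutePath.substring(1).split("/", 2);
        return new String[] { parts[0], parts.length > 1 ? parts[1] : "" };
    }

    public static void main(String[] args) {
        String[] co = containerAndObject("/logs/2012/11/access.log");
        System.out.println("container = " + co[0]); // logs
        System.out.println("object    = " + co[1]); // 2012/11/access.log (slashes allowed)
    }
}
```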
Secondly, because the Swift object store is not a real file system, it does not, unlike an ordinary file system, carry read, write, and execute permission information for the owning user, the group, and other users, so permissions require special processing. The approach taken is to store the permission information in the extended attributes of the object. The storeObject method of the FilesClient class takes a java.util.Map parameter, so the permission information of the user, the group, and other users can be stored as entries in a java.util.Map object, with a string naming the type of permission as the key and the number representing the permission as the value, for example <"acl-user", "6">, <"acl-group", "4">, <"acl-others", "4">. By passing the java.util.Map object containing the permission information to the storeObject method as a parameter, the permission information is stored in the object's extended attributes.
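A small sketch of building such a map, using the keys and values from the example above:

```java
// Sketch: permission bits stored as extended attributes, per the scheme above.
import java.util.HashMap;
import java.util.Map;

public class AclDemo {
    static Map<String, String> permissionMap() {
        Map<String, String> acl = new HashMap<String, String>();
        acl.put("acl-user", "6");    // owner: read + write
        acl.put("acl-group", "4");   // group: read-only
        acl.put("acl-others", "4");  // others: read-only
        return acl;
    }

    public static void main(String[] args) {
        // This map is passed as the metadata parameter of FilesClient.storeObject(),
        // which stores it in the object's extended attributes.
        System.out.println(permissionMap());
    }
}
```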
The correspondence of interface transformations in the SwiftAdapter class is shown in Table 4; the following table lists the correspondence between the methods of the SwiftAdapter class and those of the FilesClient class.
Compile the source code and package it into a JAR file, then deploy the JAR file and the class libraries it depends on to the $HADOOP_PREFIX/share/hadoop/lib directory on all nodes of the Hadoop cluster.
For Hadoop installed from RPM files, the default class library directory is /usr/share/hadoop/lib. This is like installing a plug-in into Hadoop without modifying the original software.
Modify the core-site.xml configuration file on all nodes of the Hadoop cluster so that the file system URI points to Swift's proxy node, and specify the account, user name, and password in Swift.
These properties are read by the Swift adapter. If multiple proxy nodes are deployed in the Swift cluster, a dedicated load balancer or round-robin DNS can point to them, and the file system URI in core-site.xml can then point to the load balancer or the round-robin DNS. The properties of the core-site.xml configuration file are shown in the following table.
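A sketch of such a core-site.xml follows; fs.swift.impl is named in this article, while the host name and the credential property names are placeholders:

```xml
<!-- Sketch of core-site.xml; host name and credential property names are
     placeholders, not fixed by the article. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>swift://proxy.example.com:8080</value>
  </property>
  <property>
    <name>fs.swift.impl</name>
    <value>swift.SwiftAdapter</value>
  </property>
  <!-- illustrative property names for the Swift account, user, and password -->
  <property>
    <name>fs.swift.user</name>
    <value>account:username</value>
  </property>
  <property>
    <name>fs.swift.password</name>
    <value>password</value>
  </property>
</configuration>
```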
Topological structure
The Hadoop cluster contains one JobTracker node and multiple slave nodes running TaskTracker, all with the Swift adapter JAR file and its dependent class libraries added. The Swift cluster contains multiple proxy nodes and storage nodes, and a round-robin DNS server is deployed that points to the proxy nodes of the Swift cluster. The topology of the entire extended system is shown in the following illustration.
Main Processes
Taking the initialization of a file system instance, opening a file and reading data, and creating a file and writing data in the Swift adapter as examples, this section describes each process and shows it with a UML sequence diagram.
Hadoop's file system command-line client corresponds to the org.apache.hadoop.fs.FsShell class. When this command-line program is used to interact with a file system, Hadoop first looks up the file system implementation class corresponding to the scheme specified in the configuration file and initializes it. The org.apache.hadoop.fs.FileSystem class has a static inner class, FileSystem.Cache, that caches file system instances in a Java Map: the key is the scheme name of the file system, such as "hdfs", and the value is the corresponding file system instance, such as an instance of the DistributedFileSystem class. In the implementation described in this article, the scheme name of the Swift adapter is "swift" and the corresponding file system class is swift.SwiftAdapter, so the fs.swift.impl property in the configuration file is set to swift.SwiftAdapter. The detailed process of initializing a file system instance is as follows: if an instance for the scheme named "swift" exists in the cache, FileSystem.Cache returns the swift.SwiftAdapter instance directly through its get method; otherwise, the FileSystem class calls the static method createFileSystem, which calls the newInstance method of the ReflectionUtils class and ultimately the constructor, obtaining an instance of the Swift adapter class by reflection, and finally calls the initialize method to perform the necessary initialization. The UML sequence diagram for initializing a file system instance is shown in the following illustration.
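A minimal sketch of this lookup from the client side (the host name is a placeholder):

```java
// Sketch: FileSystem.get() resolves the "swift" scheme via fs.swift.impl,
// consulting FileSystem.Cache first; the host name is a placeholder.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SchemeLookupDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.swift.impl", "swift.SwiftAdapter"); // scheme -> class mapping

        FileSystem fs = FileSystem.get(
                URI.create("swift://proxy.example.com:8080/"), conf);
        System.out.println(fs.getClass().getName());     // swift.SwiftAdapter

        // A second call with the same scheme and authority hits FileSystem.Cache
        // and returns the same instance instead of reflecting again.
        FileSystem cached = FileSystem.get(
                URI.create("swift://proxy.example.com:8080/"), conf);
        System.out.println(fs == cached);                // true
    }
}
```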
The detailed process of opening a file and reading data is as follows. When a file is opened, the client calls the open method of the SwiftAdapter class, and the SwiftAdapter object first initializes an instance of the Swift input stream class SwiftInputStream. The SwiftInputStream object then invokes the getObject method of the FilesClient object, which sends an HTTP request to a proxy node of the Swift cluster to fetch the object from Swift and stores its data in the byte-array buffer inside the SwiftByteArrayInputStream object. After that, the client program calls the read method of the SwiftInputStream object to read the bytes from the buffer, and when reading is finished, the close method closes the Swift input stream. The UML sequence diagram for opening a file and reading data is shown in the following illustration.
UML sequence diagram for opening a file and reading data
The detailed process of creating a file and writing data is as follows. When a file is created, the client calls the create method of the SwiftAdapter class, and the SwiftAdapter object first initializes an instance of the Swift output stream class SwiftOutputStream. The client then calls the write method of the SwiftOutputStream object to write data into its internal byte-array buffer until the flush or close method is called, at which point the SwiftOutputStream object invokes the storeObject method of the FilesClient object, which sends an HTTP request to a proxy node of the Swift cluster to write the bytes in the buffer to an object in Swift. The UML sequence diagram for creating a file and writing data is shown in the following illustration.
Future work
With the Swift adapter, using the highly available Swift object store as the underlying storage system of Hadoop makes Hadoop highly available at the storage level. Deploying the Swift adapter to an existing Hadoop cluster is quick and easy. MapReduce applications originally used to analyze data stored in HDFS can analyze data stored in Swift without modification.
However, after integrating Hadoop with the Swift object store through the Swift adapter, the system as a whole loses the advantage of data locality. In HDFS, the NameNode knows on which DataNode each file block is stored. While a MapReduce job runs, the binaries of the user's MapReduce application are therefore dispatched by the MapReduce framework to the node nearest to the data whenever possible, ideally the DataNode where the file block resides. The TaskTracker process launches the map task there, and the map task reads its input file from the local file system, which avoids transferring large amounts of data between nodes in the Hadoop cluster, saves network bandwidth, and speeds up the map phase of MapReduce.
With the Swift adapter, the Swift object store serves as the underlying storage system of Hadoop, but to the Hadoop cluster Swift is an external storage system: TaskTrackers and files are not on the same nodes, so during the map phase of a MapReduce job every file read transfers data over the network. The Swift object store is a black box to the Hadoop cluster, and the MapReduce framework knows nothing about the internal details of the storage system.
The purpose of this article is to add support for OpenStack Swift to the storage layer of Hadoop, not to replace HDFS. As a first-stage attempt, the data locality problem is not considered or solved here; it will be part of future work.
Test Results
The Swift adapter enables the Swift object store to serve as the underlying storage system for Hadoop, which is verified in two ways: first, accessing the Swift object store using Hadoop's file system command line; second, running MapReduce jobs to analyze the data stored in Swift.
Using the Hadoop file system command line to access the Swift object store
The ls command lists the files in a directory; the implementation does not read a file's actual modification time, so it defaults to 1970-1-1, as shown in the figure.
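For example (illustrative commands, assuming fs.default.name points at the Swift proxy; the container and file names are placeholders):

```sh
# Illustrative commands; container and file names are placeholders.
$ hadoop fs -mkdir /container1                      # becomes a Swift container
$ hadoop fs -put README.txt /container1/docs/README.txt
$ hadoop fs -ls /container1/docs                    # modification time shows 1970-1-1
$ hadoop fs -cat /container1/docs/README.txt
```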
Running MapReduce jobs to analyze the data stored in Swift
First, submit a MapReduce job in the Hadoop cluster, then access the MapReduce administration page of the JobTracker node at the URL http://<jobtracker-ip-address>:50030/jobtracker.jsp. Click the link for the specific job to view its results page; the file scheme (swift://) on that page shows that Hadoop is running the MapReduce job on the Swift object store, as shown in Figure 8.
Summary
This article has analyzed the Hadoop file system API and the Swift Java client API, as well as the feasibility of integrating Hadoop with OpenStack Swift, and has introduced the design and implementation details of the Swift adapter, which ultimately lets the OpenStack Swift object store serve as the underlying storage of Hadoop, enabling the two to work together and adding support for OpenStack Swift to the storage layer of Hadoop.