Linux New Technology: The Object Storage File System

With the evolution of high-performance computing from traditional hosts to networked clusters, host-based storage architectures have gradually given way to networked storage, and the separation of computing and storage has become increasingly pronounced. To address the shortcomings of SAN and NAS, a new class of file system for Linux clusters has emerged internationally: the object storage file system. This paper describes the architecture and technical features of object storage file systems and presents an initial test of the Lustre object storage file system. The results show that object storage file systems offer significant improvements in scalability, performance, and ease of use. As networked storage technology matures, the object storage file system will become an important direction of development.

1. Introduction

High-performance computing has evolved from the traditional host model to the cluster model: in the TOP500 list, only 2 systems were clusters in 1998, but by 2003 there were 208. With this shift in architecture, traditional host-based storage has become a new bottleneck and can no longer meet the needs of cluster systems. A cluster storage system must effectively address two key issues: (1) provide shared access to data, to ease the writing of cluster applications and to balance the storage load; and (2) provide high-performance storage whose I/O rate and data throughput can satisfy the aggregate accesses of hundreds or thousands of Linux cluster servers. At present, networked storage has become an effective technical approach to high-performance storage for cluster systems.

Internationally there are two main networked storage architectures, distinguished by their command sets. The first is the SAN (Storage Area Network) architecture, which employs the SCSI block I/O command set and provides high-performance random I/O and data throughput through direct data access at the disk or FC (Fibre Channel) level. With its high bandwidth and low latency, it has a niche in high-performance computing; SGI's CXFS file system, for example, is built on a SAN for high-performance file storage. However, because SAN systems are expensive and scale poorly, they cannot satisfy systems with thousands of CPUs. The second is the NAS (Network Attached Storage) architecture, which accesses data using the NFS or CIFS command sets, with the file as the unit of transfer, over TCP/IP networks. It is scalable, inexpensive, and easy for users to manage, but if an NFS file system is used in cluster computing, the high protocol overhead, low bandwidth, and large latency of NAS work against high-performance cluster applications.

In response to the Linux cluster's requirements for high performance and data sharing in storage systems, new storage architectures and file systems have been studied abroad, in the hope of effectively combining the advantages of SAN and NAS: accessing disks directly for performance, while sharing files and metadata for simplified management. Object storage file systems have become a research hotspot for Linux cluster systems; examples include Lustre from Cluster File Systems, Inc., and the ActiveScale file system from Panasas. The Lustre file system is based on object storage technology that grew out of the Coda project at Carnegie Mellon University; Lustre 1.0 was released in December 2003, and version 2.0 is expected in 2005. Lustre has seen initial deployment on high-performance computing systems of the U.S. Department of Energy (DOE) at Lawrence Livermore National Laboratory, Los Alamos National Laboratory, Sandia National Laboratories, and Pacific Northwest National Laboratory, and the Blue Gene system being developed by IBM will also use the Lustre file system for its high-performance storage. The ActiveScale file system technology comes from Dr. Garth Gibson of Carnegie Mellon University and began with the NASD (Network Attached Secure Disks) project supported by DARPA; it is now one of the more influential object storage file systems in the industry and won the Computerworld 2004 innovative technology award.

2. The Object Storage File System

2.1 Object Storage File System Architecture

The core idea of the object storage file system is to separate the data path (reads and writes) from the control path (metadata) and to build the storage system on object-based storage devices (OSDs). Each OSD has a degree of intelligence and can automatically manage the distribution of the data it holds. An object storage file system usually consists of the following components.

1. Object

An object is the basic unit of data storage in the system. An object is essentially a combination of file data and a set of attributes; the attributes can define file-level RAID parameters, data distribution, and quality of service. Traditional storage systems use files or blocks as the basic storage unit; in a block storage system, the system must also keep track of the attributes of every block, whereas an object maintains its own attributes and communicates them to the storage system. Every object in a storage device has an object identifier, and OSD commands use this identifier to access the object. There are usually several kinds of objects: the root object on a storage device identifies the device and its various attributes, while group objects are collections of objects that share resource management policies on the device.
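As a rough illustration (not Lustre's actual format; all names here are hypothetical), an object can be pictured as a blob of data bundled with a self-describing attribute set keyed by an object identifier:

    #include <stdint.h>

    /* Hypothetical sketch of an object: data plus self-describing
     * attributes. Field names are illustrative, not any real format. */
    typedef struct {
        uint64_t object_id;    /* unique identifier within the OSD  */
        uint64_t size;         /* current length of the object data */
        uint32_t raid_level;   /* file-level RAID parameter         */
        uint32_t stripe_size;  /* data-distribution hint            */
        uint32_t qos_class;    /* quality-of-service attribute      */
    } object_attrs_t;

    typedef struct {
        object_attrs_t attrs;  /* maintained by the object itself   */
        uint8_t       *data;   /* the file data the object carries  */
    } object_t;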

2. Object Storage Device

An object storage device is intelligent: it has its own CPU, memory, network, and disk subsystem, and blade enclosures are currently the usual way to implement one. The OSD provides three main functions:

(1) Data storage. The OSD manages object data and places it on a standard disk subsystem. The OSD does not expose a block interface; the client requests data using an object ID and an offset for reads and writes (see the sketch after this list).

(2) Intelligent data distribution. The OSD uses its own CPU and memory to optimize data layout and supports data prefetching. Because the OSD can intelligently prefetch objects, it can optimize disk performance.

(3) Per-object metadata management. The OSD manages the metadata of the objects stored on it. This metadata is similar to traditional inode metadata, typically covering an object's data blocks and its length. In a traditional NAS system, a file server maintains this metadata; the object storage architecture instead delegates most metadata management to the OSD, reducing the client's overhead.
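A minimal sketch of such an object-granular access interface follows. The function names and signatures are assumptions for illustration; the real command set is defined by the OSD protocol, not by these declarations. The key point is that reads and writes address data by object ID and byte offset, never by disk block:

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/types.h>

    typedef struct osd_device osd_device_t;  /* opaque handle to an OSD */

    /* Read `len` bytes of object `oid` starting at `offset` into `buf`;
     * returns bytes read, or -1 on error. No block addresses appear. */
    ssize_t osd_read(osd_device_t *osd, uint64_t oid,
                     uint64_t offset, void *buf, size_t len);

    /* Write `len` bytes to object `oid` at `offset` from `buf`;
     * the OSD itself decides which disk blocks back this range. */
    ssize_t osd_write(osd_device_t *osd, uint64_t oid,
                      uint64_t offset, const void *buf, size_t len);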

3. Metadata Server (MDS)

The MDS controls the interaction between clients and OSD objects, providing the following functions:

(1) Object storage access. The MDS constructs and manages the views that describe each file's layout, allowing clients to access objects directly. The MDS grants the client a capability for the objects that make up a file, and the OSD verifies that capability on every request before granting access (see the sketch after this list).

(2) File and directory access management. The MDS builds the file system structure on top of the storage, including quota control, creation and deletion of directories and files, access control, and so on.

(3) Client cache consistency. To improve client performance, object storage file system designs usually support a client-side cache. The client cache introduces a consistency problem, so the MDS tracks which clients cache which files and, when a cached file changes, notifies those clients to flush their caches, preventing inconsistency.
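The capability check in (1) can be sketched roughly as follows, assuming an HMAC-style token and a key shared between the MDS and the OSD. This is a simplification for illustration; the real protocol carries more state, and compute_mac stands in for a real keyed MAC such as HMAC-SHA256:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdbool.h>
    #include <string.h>

    /* Hypothetical capability issued by the MDS, checked by the OSD. */
    typedef struct {
        uint64_t object_id;  /* which object the holder may access */
        uint32_t rights;     /* e.g. READ | WRITE bit flags        */
        uint64_t expiry;     /* validity deadline (epoch seconds)  */
        uint8_t  mac[32];    /* keyed MAC over the fields above    */
    } osd_capability_t;

    /* Placeholder declaration for a real keyed MAC primitive. */
    void compute_mac(const void *msg, size_t len,
                     const uint8_t key[32], uint8_t out[32]);

    /* The OSD re-derives the MAC with the key it shares with the MDS;
     * a client cannot forge a capability without that key. */
    bool osd_verify_capability(const osd_capability_t *cap,
                               const uint8_t shared_key[32],
                               uint64_t now)
    {
        uint8_t expect[32];
        compute_mac(cap, offsetof(osd_capability_t, mac),
                    shared_key, expect);
        return now < cap->expiry &&
               memcmp(expect, cap->mac, sizeof expect) == 0;
    }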

4. Object Storage File System Client

To let applications access objects on the OSDs effectively, an object storage file system client must be implemented on each compute node. It typically provides a POSIX file system interface, so applications can perform standard file system operations unchanged.
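Because the client presents POSIX semantics, application code needs no modification. The snippet below uses only standard POSIX calls, with /mnt/lustre as an assumed mount point for the cluster file system:

    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        /* An ordinary POSIX open on an assumed Lustre mount point;
         * the object machinery underneath is invisible here. */
        int fd = open("/mnt/lustre/results.dat",
                      O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        const char msg[] = "written through the object storage client\n";
        if (write(fd, msg, sizeof msg - 1) < 0)
            perror("write");  /* data lands on OSDs, striped below us */
        close(fd);
        return 0;
    }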

2.2 Key Technologies of the Object Storage File System

1. Distributed Metadata

A metadata server in a traditional storage architecture usually provides two main functions: (1) presenting compute nodes with a logical view of the stored data (the VFS layer), i.e. file names and the directory structure; and (2) organizing the layout of the data on the physical storage media (the inode layer). The object storage architecture separates the logical view of the stored data from the physical view and distributes the load, avoiding the bottleneck that a central metadata server creates (as in NAS systems). The VFS part typically accounts for about 10% of a metadata server's load, while the remaining 90% (the inode part) lies in distributing the data onto physical storage blocks. In the object storage architecture, the inode work is distributed to the intelligent OSDs, each of which manages the distribution and retrieval of its own data; roughly 90% of the metadata management work is thus pushed out to the storage devices, improving the system's metadata performance. Moreover, with distributed metadata management, adding OSDs to the system increases metadata performance and storage capacity at the same time.
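A minimal sketch of this division of labor, with assumed names and a fixed-width layout for brevity: the MDS resolves a path to the objects and devices that hold a file, while each OSD resolves its own objects down to blocks:

    #include <stdint.h>

    /* What the MDS knows: the namespace, and which objects on which
     * OSDs make up a file (the ~10% "VFS" share of metadata work). */
    typedef struct {
        uint64_t object_ids[4];  /* objects the file is striped over */
        uint32_t osd_addrs[4];   /* which OSD holds each object      */
        uint32_t stripe_count;
    } file_layout_t;

    file_layout_t mds_lookup(const char *path);      /* runs on the MDS */

    /* What each OSD knows: which disk blocks back a given object
     * (the ~90% "inode" share, now spread across all OSDs). */
    uint64_t osd_block_of(uint64_t object_id, uint64_t offset);  /* on an OSD */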

2. Concurrent Data Access

The object storage architecture defines a new, more intelligent disk interface: the OSD. An OSD is a network-attached device that contains its own storage media, such as disk or tape, and enough intelligence to manage the data stored on it. Compute nodes communicate directly with the OSDs and access the data they store; because the OSD is intelligent, no file server needs to intervene. If file system data is striped across multiple OSDs, aggregate I/O rate and data throughput grow linearly with their number, and for most Linux cluster applications, aggregate sequential I/O bandwidth and throughput matter more as the number of compute nodes grows. The object storage architecture delivers performance that other storage architectures currently find hard to reach; for example, the ActiveScale object storage file system's bandwidth can reach 10 GB/s.
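The striping arithmetic behind this linear growth can be sketched as follows, as a generic round-robin scheme under assumed names rather than Lustre's exact layout: each file offset maps to exactly one object, so clients touching different stripes drive different OSDs in parallel.

    #include <stdint.h>

    /* Generic round-robin striping: the stripe a file offset falls
     * into selects which object (and thus which OSD) serves it. */
    typedef struct {
        uint64_t stripe_size;   /* bytes per stripe, e.g. 1 MiB     */
        uint32_t stripe_count;  /* number of objects/OSDs in a file */
    } stripe_layout_t;

    /* Map a file offset to (object index, offset within that object). */
    static void map_offset(const stripe_layout_t *l, uint64_t file_off,
                           uint32_t *obj_idx, uint64_t *obj_off)
    {
        uint64_t stripe_no = file_off / l->stripe_size;
        *obj_idx = (uint32_t)(stripe_no % l->stripe_count);
        /* each object holds every stripe_count-th stripe */
        *obj_off = (stripe_no / l->stripe_count) * l->stripe_size
                 + file_off % l->stripe_size;
    }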

2.3 The Lustre Object Storage File System

The Lustre object storage file system consists of three main parts: clients, storage servers (OSTs, Object Storage Targets), and metadata servers (MDSs). A Lustre client runs the Lustre file system and interacts with the OSTs for file data I/O and with the MDSs for namespace operations. To improve performance, the clients, OSTs, and MDSs usually run on separate machines, although these subsystems can also run on the same system. The three main parts are shown in Figure 1.


Figure 1 The composition of the Lustre file system

Lustre is a transparent global file system: a client can access data in the cluster file system transparently, without knowing where that data is actually stored. Clients read data from the servers over the network; the storage servers perform the actual file system reads and writes and connect to the storage devices; and the metadata servers manage the directory structure, file permissions, and extended file attributes, maintain the consistency of the whole file system, and respond to client requests. Lustre treats files as objects located through the metadata server, which directs actual file I/O requests to the storage servers; these in turn manage the physical storage on object-based disk groups. This separation of metadata from file data lets computing and storage resources be fully decoupled: client machines can concentrate on user and application requests, while the storage servers and metadata servers concentrate on reading, transmitting, and writing data. Server-side operations such as data backup, storage configuration, and storage server expansion do not affect the client, and neither the storage servers nor the metadata servers become a performance bottleneck.

The Lustre global namespace provides all clients of the file system with a single, globally unique directory tree, while the data is partitioned and distributed across the individual storage servers, giving a more flexible form of shared access than the "block sharing" of traditional SANs. The global directory tree eliminates configuration information on the client and remains valid when the configuration is updated.

3. Testing and Conclusions

1. Lustre IOzone Test

As a preliminary evaluation of the object storage file system, we tested the Lustre file system with the following configuration:

3 dual-Xeon systems: CPU 1.7 GHz, memory 1 GB, Gigabit Ethernet
Lustre file system: lustre-1.0.2
Linux version: Red Hat 8
Test program: IOzone
The test results are as follows:

Block write bandwidth (MB/s per thread):

            Single thread         Two threads
            1 OST     2 OST       1 OST     2 OST
Lustre      21.7      50          12.8      24.8
NFS         12        --          5.8       --

The tests above show that the write bandwidth of a single OST is better than that of NFS, and that 2 OSTs scale well, showing the effect of striping; the aggregate bandwidth of two threads roughly equals the saturation bandwidth. However, CPU utilization on the Lustre client side was very high (over 90%), and the test system was small (three nodes), so we did not scale up the number of OSTs and clients. In addition, Lustre's cache gives it better file-write performance than NFS. We also ran a preliminary test of Lustre's data handling with bonnie++: file creation was comparatively fast, while readdir was slower than under NFS.

2. Lustre Small-Scale Test Data (file write test, unit: KB/s)

Hardware: dual Xeon 1.7 GHz, GigE, SCSI Ultra160. Software: Red Hat 8, IOzone.

Figure 2 2 OST/1 MDS


Figure 3 1 OST/1 MDS


Figure 4 NFS Test


These preliminary tests indicate that Lustre's performance and scalability are good. Compared with traditional file systems, the object storage file system has the following advantages:

(1) Performance. The object storage architecture has no metadata-manager bottleneck of the kind found in other shared storage systems. NAS systems use a centralized file server as the metadata manager, and some SAN file systems use a centralized lock manager; either way, metadata management eventually becomes the bottleneck. The object storage architecture is similar to a SAN in that every node can access its storage devices directly, but it improves on the SAN in that the RAID controller is no longer a bottleneck; as compute nodes scale up, this advantage becomes obvious, and the total throughput of all nodes is ultimately limited only by the size of the storage system and the performance of the network. A compute node sends data to the OSDs, and the OSDs automatically optimize the data distribution, which reduces the burden on compute nodes and allows parallel reads and writes to multiple OSDs, maximizing single-client throughput.

(2) Scalability. Distributing the load across many intelligent OSDs, bound together organically by the network and software, eliminates the scalability problem. Each OSD has its own memory, processor, and disk subsystem, so its storage processing capacity can grow independently of the rest of the system. If the storage system lacks processing capacity, OSDs can simply be added, giving a linear increase in performance.

(3) The OSDs take over most metadata service work. Metadata management is often the bottleneck of a shared storage system, since every compute node and storage node must go through it. In the object storage architecture, the metadata service splits into two parts: inode metadata, which manages the distribution of storage blocks on the media, and file metadata, which manages the file system's hierarchy and directories. The object storage architecture makes metadata access scalable: each OSD is responsible for its own inode metadata, so adding an OSD adds disk capacity and metadata management resources at the same time. A traditional NAS server, by contrast, slows down as more disks are added, while an object storage system sustains a consistent throughput rate as capacity grows.

(4) Manageability. The intelligent, distributed object storage architecture simplifies storage management and the optimization of data placement. For example, new storage capacity can be merged into the storage system automatically, because the OSDs can accept object requests from compute nodes directly. System administrators do not need to create LUNs, resize partitions, rebalance logical volumes, or update file servers. RAID stripes can be extended automatically to new objects, taking full advantage of each new OSD.

(5) Security. Traditional storage systems often rely on client identity authentication and a private network to keep the system secure. The object storage architecture provides security mechanisms at every level, including authentication of the storage devices, authentication of the nodes, authentication of node commands, integrity protection of all commands, and IPSec-based privacy for data and commands. These levels of security let users run on more efficient and accessible networks such as Ethernet. Panasas has already launched ActiveScale, a commercial object storage global file system; object storage is attracting attention; and Lustre is already deployed (ALC, MCR), or soon will be (Red Storm), on a number of large clusters. The object storage file system will therefore become an important direction for the future of cluster storage.

