Disclaimer:
This article is reposted from an online article, shared only as a personal favorite and for knowledge sharing. If there is any infringement, please contact the blogger to have it deleted.
Original author: Cui Binghua
http://blog.csdn.net/i_chips/article/details/17787017

1 Overview

OpenStack Object Storage (Swift) is one of the sub-projects of OpenStack, the open-source cloud computing project. Swift's purpose is to use commodity hardware to build redundant, scalable distributed object storage clusters with storage capacities of up to petabytes.
Swift is not a file system or a real-time data storage system; it is an object storage system for the long-term storage of permanent, static data, which can be retrieved, adjusted, and updated when necessary. Examples of the data types best suited to this kind of storage are virtual machine images, photos, email, and archival backups.
Swift does not require RAID (redundant arrays of disks), nor does it have a central unit or master node. Swift introduces consistent hashing and data redundancy at the software level, sacrificing a certain degree of data consistency to achieve high availability (HA) and scalability. It supports a multi-tenant model and container and object read/write operations, and is well suited to the unstructured data storage problems of Internet application scenarios.
2 Technical Features

2.1 Key features of Swift

The main features of Swift are as follows:
- Extremely high data durability.
- A fully symmetric system architecture: "symmetric" means that all nodes in Swift are completely equivalent, which greatly reduces system maintenance costs.
- Unlimited scalability: storage capacity can be scaled out without limit, and Swift's performance (e.g. QPS and throughput) improves linearly as nodes are added.
- No single point of failure (SPOF): Swift's metadata is stored fully evenly and randomly distributed across the cluster and, like object data, is kept in multiple copies. No role in the entire Swift cluster is a single point, and the architecture and design guarantee this freedom from single points of failure.
- Simple and reliable.
2.2 Technical differences between Swift and HDFS

Swift and the Hadoop Distributed File System (HDFS) share the same purpose: redundant, fast, networked storage. Their main technical differences are as follows:
- In Swift, metadata is distributed and replicated across the cluster. In HDFS, a central system (the NameNode) maintains the file metadata, which amounts to a single point of failure and makes HDFS harder to scale to very large environments.
- Swift was designed with a multi-tenant architecture in mind, while HDFS was not.
- In Swift, a file can be written many times; in a concurrent environment, the last write wins. In HDFS, a file can only be written once.
- Swift is written in Python, while HDFS is written in Java.
- Swift was designed as a general-purpose storage solution that can reliably store very large numbers of files of varying sizes; HDFS was designed to store a moderate number of large files (it is optimized for large files) in support of data processing.
3 Key Technologies

3.1 Consistent hashing

A key issue in a distributed object storage service is how data is placed. Swift is based on consistent hashing: through hash computation, objects are distributed evenly across virtual nodes in a virtual space, which greatly reduces the amount of data that has to move when nodes are added or removed. The size of the virtual space is usually a power of two (2^n) so that efficient bit-shift operations can be used; a dedicated data structure, the ring, then maps the virtual nodes onto the actual physical storage devices to complete the addressing process.
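To illustrate the basic idea, here is a minimal consistent-hashing sketch in Python (purely illustrative, not Swift's actual implementation): node identifiers are hashed onto a ring, and a key is assigned to the first node encountered clockwise from the key's hash position, so adding or removing a node only remaps the keys in the affected arc.

```python
# Minimal consistent-hashing sketch (illustrative only, not Swift's code).
import hashlib
from bisect import bisect_left

def _hash(value: str) -> int:
    # Use the first 8 hex digits of MD5 as a position on the ring.
    return int(hashlib.md5(value.encode()).hexdigest()[:8], 16)

class HashRing:
    def __init__(self, nodes):
        # Sort node positions so lookups can binary-search the ring.
        self._ring = sorted((_hash(node), node) for node in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def get_node(self, key: str) -> str:
        # Walk clockwise: the first node whose position is >= the key's
        # hash, wrapping around to the start of the ring if necessary.
        idx = bisect_left(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node1", "node2", "node3"])
print(ring.get_node("account/container/object"))
```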
Figure 1 Consistent hash ring structure
Consistent hashing is usually evaluated against the following four properties:
- Balance: hash results should be distributed as evenly as possible, so that all of the available buffer space is used.
- Monotonicity: if some content has already been hashed to its buffer and a new buffer is then added to the system, the hash result should ensure that previously assigned content is only ever remapped to the new buffer, never to another buffer in the old set.
- Spread: in a distributed environment, different terminals may see different sets of buffers and therefore hash the same content to different buffers; good spread keeps such inconsistency to a minimum.
- Load: load is the other dimension of spread. Since different terminals may map the same content to different buffers, a particular buffer may also end up being assigned different content by different users; load measures this effect.
Swift uses this algorithm primarily to satisfy monotonicity: when the number of nodes in the cluster changes (servers are added or removed), the existing mapping between keys and nodes changes as little as possible.
With a plain hashing scheme and a small number of nodes, changing the node count causes huge data migrations. To solve this problem, consistent hashing introduces the concept of the "virtual node" (vnode, also known as a partition): a virtual node is a replica of an actual node in the ring space; one actual node corresponds to several virtual nodes, and the virtual nodes are arranged in the hash space according to their hash values.
In general there are two mappings in Swift. For a given file, the corresponding virtual node is found through the hash algorithm (MD5), a one-to-one mapping; the virtual node is then mapped to devices through the mapping relationship stored in the ring file (a two-dimensional array), a many-to-many mapping. Together these complete the mapping from a file to the devices on which it is stored.
Figure 2 Mapping between objects, virtual nodes, and nodes
When choosing the number of virtual nodes, the expected scale of the system must be taken into account. If the cluster will not exceed 6,000 nodes, the number of virtual nodes can be set to 100 times the number of nodes; a change in the load of any single node then affects only 1% of the data items. In that case there are 6,000,000 vnodes, and 2 bytes (0-65535) are enough to store a node number, so the basic memory consumption is 6 × 10^6 × 2 bytes = 12 MB, which is easily affordable for a server.
As another example, assume there are 65,536 (2^16) nodes and 128 (2^7) times as many partitions, i.e. 2^23 partitions (partition_power = 23). Because only the first 32 bits of the MD5 hash are used, partition_shift (equal to 32 - partition_power) is used to map the MD5 hash value of a data item into the space of 2^23 partitions.
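A hedged sketch of that computation follows, modeled on the idea of shifting the first 32 bits of the MD5 hash down to a partition number; the device lookup at the end is a toy stand-in for the ring's real assignment table, invented here purely for illustration.

```python
import hashlib
import struct

PART_POWER = 23               # 2**23 partitions, as in the example above
PART_SHIFT = 32 - PART_POWER  # = 9

def key_to_partition(path: str) -> int:
    # Take the first 4 bytes (32 bits) of the MD5 digest as a big-endian
    # unsigned integer, then shift right so only PART_POWER bits remain.
    digest = hashlib.md5(path.encode()).digest()
    return struct.unpack_from(">I", digest)[0] >> PART_SHIFT

def partition_to_devices(part: int, num_devices: int = 4, replicas: int = 3):
    # Toy stand-in for the ring's _replica2part2dev_id lookup; the real ring
    # uses a precomputed two-dimensional table, not a modulo formula.
    return [(part + r) % num_devices for r in range(replicas)]

part = key_to_partition("/account/container/object")
print(part, partition_to_devices(part))  # partition number and 3 device ids
```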
3.2 Data consistency model

According to Eric Brewer's CAP theorem (Consistency, Availability, Partition tolerance), Swift gives up strict consistency (the ACID transaction level) and adopts an eventual consistency model instead, in order to achieve high availability and unlimited horizontal scalability.
To achieve this goal, Swift uses a quorum protocol:
- Definitions: N is the total number of data replicas; W is the number of replicas that must acknowledge a write; R is the number of replicas read.
- Strong consistency: R + W > N guarantees that the replica sets touched by reads and writes intersect, so a read always sees the latest version. With W = N and R = 1, every replica must be updated on write, which gives strong consistency for read-heavy, write-light workloads. With R = N and W = 1, only one replica is updated on write and all replicas are read, which gives strong consistency for write-heavy, read-light workloads.
- Weak consistency: R + W <= N. The read and write replica sets may not intersect, so dirty (stale) data may be read. This suits scenarios with low consistency requirements.
Swift targets workloads with frequent reads and writes, so it adopts a compromise: a write must succeed on more than half of the replicas (W > N/2), and the read and write replica sets must intersect in at least one replica (R + W > N).
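Written as a small illustrative helper (not part of Swift), the overlap condition looks like this:

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    # R + W > N guarantees that the set of replicas written and the set of
    # replicas read share at least one copy, so a read sees the newest write.
    return r + w > n

print(is_strongly_consistent(3, 2, 2))  # True  -> reads always see the latest write
print(is_strongly_consistent(3, 2, 1))  # False -> a stale read is possible
```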
In a distributed system, data must never exist as a single copy. If only one replica is left online, the data is in danger: if that replica is also lost, the error becomes permanent. If N were set to 2, the loss of a single storage node would already leave a single copy, so N must be greater than 2. But the larger N is, the higher the maintenance and overall cost, so the industry usually sets N to 3.
Swift's default configuration is N = 3, W = 2 > N/2, and R = 1 or 2. That is, each object has three replicas, which are placed on nodes in different regions whenever possible. W = 2 means that at least two replicas must be updated before a write is considered successful.
With R = 1, a successful read returns immediately, so it may return an old version (the weak consistency model).
With R = 2, the read request must carry the header X-Newest: true; the metadata of two replicas is read and their timestamps compared to determine the latest version (the strong consistency model).
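For example, a strong-consistency read might look like the following sketch, which uses the python-requests library; the storage URL, token, container, and object name are placeholders, not values from the original article.

```python
import requests

# Placeholders: a real deployment obtains these from the authentication service.
storage_url = "http://proxy.example.com:8080/v1/AUTH_demo"
token = "<auth-token>"

resp = requests.get(
    f"{storage_url}/photos/vacation.jpg",
    headers={
        "X-Auth-Token": token,
        # Ask the proxy to consult the replicas and return the copy with the
        # newest timestamp (the strong-consistency read described above).
        "X-Newest": "true",
    },
)
print(resp.status_code, len(resp.content))
```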
If data inconsistency does occur, the background service processes detect it and complete data synchronization through the replication protocol within a certain time window, guaranteeing eventual consistency.
Figure 3 Quorum protocol example
3.3 The ring

The ring is the most important component in Swift. It records the mapping between stored objects and their physical locations. Whenever account, container, or object information is queried, the cluster's ring information must be consulted.
The ring is designed to map virtual nodes (partitions) evenly onto a group of physical storage devices while providing redundancy. Its data structure consists of the following information: the list of storage devices, where each device entry includes a unique ID, region/zone, weight, IP address, port, device name, and metadata; the partition assignment list; and the partition shift value (described in detail in section 4.3).
Swift defines separate rings for accounts, containers, and objects, and the lookup process is the same for each. In the ring, every partition has three replicas in the cluster by default; the location of each partition is maintained by the ring and stored in its mapping.
The ring uses zones to guarantee the physical isolation of data: the replicas of a partition must be placed in different zones. A zone is only an abstract concept; it can be a disk drive, a server, a rack (cabinet), a switch, or even a data center, providing the highest level of redundancy. Deploying at least five zones is recommended.
The weight parameter is a relative value that can be adjusted according to disk capacity; a larger weight means more space and therefore more partitions assigned to the device.
When a storage node in the cluster goes down, or storage nodes or zones are added or removed, the mapping between partitions and nodes has to change; the ring file is then updated through a rebalance operation. When virtual nodes must be moved, the ring guarantees that the minimum number of them is moved at a time, and that only one replica of a given virtual node is moved at a time.
In summary: the ring builds on consistent hashing to reduce the number of data items that move when nodes are added, improving monotonicity; partitions are introduced to reduce the amount of data that moves when the number of nodes is small; replicas are introduced to avoid single copies of data and improve redundancy; zones are introduced to guarantee the physical isolation of replicas; and weights are introduced to keep partition allocation balanced.
4 Architecture Design

4.1 Swift data model

Swift uses a hierarchical data model with a three-level logical structure: account/container/object. The number of nodes at each level is unlimited and can be expanded as needed. Here an account is not the same thing as a personal user account; it can be understood as a tenant, serves as the top-level isolation mechanism, and can be shared by multiple personal accounts. A container is similar to a folder and encapsulates a group of objects. An object consists of metadata and data.
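For illustration, the three levels map directly onto the resource paths used by the REST API; the names below are invented, and the AUTH_ prefix depends on the authentication middleware in use.

```python
# Illustrative resource paths for the account/container/object hierarchy.
account_path   = "/v1/AUTH_acme"                         # account (tenant)
container_path = "/v1/AUTH_acme/backups"                 # container within the account
object_path    = "/v1/AUTH_acme/backups/db-2014.tar.gz"  # object within the container
```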
4.2 Swift system architecture

Swift adopts a fully symmetric, resource-oriented distributed system architecture. Every component can be scaled out, which prevents any single point of failure from spreading and affecting the operation of the whole system; communication uses non-blocking I/O, which improves the system's throughput and responsiveness.
Swift components include:
- Proxy server (proxy-server): Swift exposes an HTTP-based REST API through the proxy server. Using the ring information, the proxy looks up the service address and forwards user requests to the appropriate account, container, or object service for CRUD (create, read, update, delete) operations. Because the REST protocol is stateless, proxy servers can be scaled out horizontally to balance load. Before accessing the Swift service, a client must first obtain an access token from the authentication service and then add the X-Auth-Token header to every request it sends. The proxy server is responsible for communicating with the other components of the Swift architecture, and it also handles a large number of failure cases: for example, if a storage node is unavailable for an object PUT request, it asks the ring for a handoff server and forwards the request there. Objects stream to and from the object server through the proxy server directly to and from the user; the proxy does not buffer them. (An end-to-end request sketch follows Figure 4 below.)
- Authentication server (auth server): verifies the identity of the accessing user and issues an access token, which remains valid for a certain period of time; it also verifies the validity of access tokens and caches them until they expire.
- Cache service (cache server): the cached content includes service tokens and account and container information, but not the object data itself. The cache service can be backed by a memcached cluster; Swift uses a consistent hashing algorithm to distribute cache entries across it.
- Account server (account-server): provides account metadata and statistics and maintains the list of containers the account holds. Each account's information is stored in a SQLite database.
- Container server (container-server): provides container metadata and statistics (such as the total number of objects and the container's usage) and maintains the list of objects the container holds. The container service does not know where objects are stored, only which objects are stored in a given container. The object lists are stored as SQLite database files and, like objects, are replicated across the cluster.
- Object server (object-server): provides object metadata and content services, storing, retrieving, and deleting objects on local devices. Objects are stored in the file system as binary files, with their metadata kept in the file system's extended attributes (xattrs), so the underlying file system needs to support xattrs. Each object is stored under a path derived from the hash of the object name plus the timestamp of the operation; the last write always wins, which ensures the latest version of the object is served. A deletion is also treated as a version of the file: a zero-byte file with the ".ts" suffix (for tombstone) marks the object as deleted.
- Replication service (replicator): checks whether the local partition replicas are consistent with the remote replicas by comparing hash files and high-water marks, and pushes updates to the remote replicas when differences are found. Object replication uses rsync to synchronize files to the peer node, while account and container replication pushes the missing records over HTTP or entire database files with rsync. The replicator is also responsible for ensuring that items marked as deleted are removed from the file system: when an item (object, container, or account) is deleted, a tombstone file is written as its latest version; the replicator detects the tombstone file and makes sure the item is removed from the entire system.
- Update service (updater): when an update cannot be applied immediately because of high load or a system failure, it is serialized to the local file system and queued, so that the service can apply it asynchronously once it recovers. For example, if the container server cannot update its object list in time after an object is successfully created, the container update is queued; after the system returns to normal, the update service scans the queue and applies the pending updates.
- Audit service (auditor): repeatedly scans the local server, checking the integrity of objects, containers, and accounts. If bit-level corruption is found, the file is quarantined and replication replaces the corrupted local copy from another replica. Other kinds of errors (such as an object list that cannot be found on any container server) are recorded in the log.
- Account cleanup service (account reaper): removes accounts marked as deleted, along with all of their containers and objects. The deletion process is quite direct: for each container in the account, every object is deleted first and then the container itself. A failed delete request does not stop the process, but it does cause the overall pass to fail in the end (for example, if deleting an object times out, its container will not be deleted, and therefore the account cannot be deleted either). The process keeps going even when it fails, so that one troublesome problem does not prevent cluster space from being reclaimed. The account reaper keeps trying to delete the account until it is empty, at which point db_replicator reclaims the account database and finally removes the database file.
Figure 4 Swift system architecture
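To make the proxy and authentication interaction concrete, here is a hedged end-to-end sketch using the python-requests library and a TempAuth-style v1.0 login; the endpoint, credentials, and names are placeholders, and Keystone-based deployments authenticate differently.

```python
import requests

# 1. Authenticate: a TempAuth-style v1.0 request returns a token and storage URL.
auth_resp = requests.get(
    "http://proxy.example.com:8080/auth/v1.0",
    headers={"X-Auth-User": "demo:demo", "X-Auth-Key": "secret"},
)
token = auth_resp.headers["X-Auth-Token"]
storage_url = auth_resp.headers["X-Storage-Url"]
headers = {"X-Auth-Token": token}

# 2. Create a container, upload (PUT) an object, then read it back (GET).
requests.put(f"{storage_url}/backups", headers=headers)
requests.put(f"{storage_url}/backups/hello.txt", headers=headers, data=b"hello swift")
obj = requests.get(f"{storage_url}/backups/hello.txt", headers=headers)
print(obj.status_code, obj.content)
```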
The following table lists the operations supported by Swift:
Table 1 Summary of the Swift RESTful API
4.3 Ring data structure
The ring data structure consists of three top-level parts:
- The device list, which describes the devices in the cluster and is called devs inside the Ring class;
- The partition assignment list, which stores the mapping between each replica and a device and is called _replica2part2dev_id inside the Ring class;
- The partition shift value, which indicates how far the hash of a data item is shifted and is called _part_shift inside the Ring class.
Reading the data stored in /etc/swift/object.ring.gz with Python yields a dict whose keys are devs, part_shift, and replica2part2dev_id.
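A minimal sketch of that read, assuming the older gzipped-pickle ring serialization described here; newer Swift releases use a different on-disk format, for which the swift.common.ring module is the safer entry point.

```python
import gzip
import pickle

# Assumes the pickle-based ring serialization described in the text.
with gzip.open("/etc/swift/object.ring.gz", "rb") as f:
    ring = pickle.load(f)

print(ring["part_shift"])                # e.g. 9 when partition_power = 23
print(len(ring["devs"]))                 # number of devices in the cluster
print(len(ring["replica2part2dev_id"]))  # one row per replica
```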
4.4 Swift storage structure
Storage nodes run Linux and use the XFS file system. Logically, the consistent hashing algorithm maps a fixed number of partitions onto each storage node, and each data item is mapped to a partition using the same hash algorithm.
Stored content generally lives under a path such as /srv/node/sdb1, with the following directory structure: accounts, async_pending, containers, objects, quarantined, and tmp. accounts, containers, and objects are the storage directories for accounts, containers, and objects respectively; async_pending holds updates waiting to be applied asynchronously, quarantined is the isolation directory, and tmp is the temporary directory.
- objects: each partition has its own directory under objects. A partition directory contains a number of suffix_path directories and a hashes.pkl file. Each suffix_path directory contains directories named after the objects' hash_path values, and each hash_path directory holds the object's data and metadata: the object data is stored in files with the .data suffix, its metadata in files with the .meta suffix, and a deleted object is marked by a zero-byte file with the .ts suffix. (A sketch of the resulting path layout follows Figure 5 below.)
- accounts: each partition is stored under the accounts directory, and each partition directory consists of a number of suffix_path directories. Each suffix_path directory contains directories named after the accounts' hash values, and the account's SQLite DB is stored inside the hash directory. The account DB file contains four tables: account_stat, container, incoming_sync, and outgoing_sync. The account_stat table records account information such as the name, creation time, and container count statistics; the container table records information about the containers; the incoming_sync table records incoming synchronization items; and the outgoing_sync table records the items pushed out for synchronization.
- containers: the directory structure and the way it is generated are similar to accounts. The container database has five tables; incoming_sync and outgoing_sync have the same schema as in the account DB, and the other three tables are container_stat, object, and sqlite_sequence. The container_stat table is similar to account_stat, except that it stores information about the container.
- tmp: the tmp directory is the temporary directory used before the account/container/object server writes data into a partition directory. For example, when a client uploads a file to the server, the object server calls the mkstemp method of the DiskFile class to create a temporary file under the path /device/tmp; after the data has been uploaded, the put() method moves the data to its final path.
- async_pending: async_pending stores updates that could not be applied in time and were added to the update queue. When the local server fails to establish an HTTP connection to the remote server, or times out while sending data, the update fails and the corresponding file is placed in the async_pending directory. This often happens when the system is failing or under heavy load. Failed updates are added to the queue, and the updater then keeps processing them. Pending files are handled differently for the account/container databases and for objects: for a DB pending file, each data item is deleted from the pending file once it has been applied; for object data, the pending file is moved to the target directory once the update has been applied.
- quarantined: the quarantined path is used to isolate corrupted data. The auditor processes scan the disks on the local server at regular intervals, checking the integrity of accounts, containers, and objects. Once corrupted data is found, the file is quarantined, i.e. moved into this quarantined directory. To keep the auditor from consuming too many system resources, the default scan interval is 30 seconds, the maximum number of files scanned per second is 20, and the maximum scan rate is 10 MB/s. The account and container auditors scan at much longer intervals than the object auditor.
Figure 5 Process of isolating objects
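As a rough, illustrative sketch of the objects path layout described above: the hashing prefix/suffix values and all names below are placeholders, since a real cluster takes its hashing secrets from swift.conf and the exact layout is determined by Swift's own code.

```python
import hashlib

# Placeholders: real clusters read these secrets from /etc/swift/swift.conf.
HASH_PATH_PREFIX = b""
HASH_PATH_SUFFIX = b"changeme"

def object_dir(device_path, partition, account, container, obj):
    # hash_path-style name: MD5 over prefix + "/account/container/object" + suffix.
    name = f"/{account}/{container}/{obj}".encode()
    name_hash = hashlib.md5(HASH_PATH_PREFIX + name + HASH_PATH_SUFFIX).hexdigest()
    suffix = name_hash[-3:]  # the suffix_path directory: last 3 hex characters
    return f"{device_path}/objects/{partition}/{suffix}/{name_hash}"

# The .data / .meta / .ts files described above live inside this directory,
# named by the timestamp of the operation, e.g. 1388730279.12345.data
print(object_dir("/srv/node/sdb1", 190245, "AUTH_acme", "backups", "hello.txt"))
```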
5 Summary
Swift sacrifices a certain degree of data consistency to achieve high availability and scalability. It supports a multi-tenant model and container and object read/write operations, and is well suited to solving the unstructured data storage problems of Internet application scenarios.
There is reason to believe that, thanks to its fully open nature and its broad user base and community of contributors, Swift may become an open standard for cloud storage, breaking Amazon S3's monopoly on the market and pushing cloud computing in a more open and interoperable direction.