Understanding OpenStack Swift (2): Architecture, Principles and Functions


This series of articles focuses on learning and researching OpenStack Swift, including environment building, principles, architecture, monitoring, and performance.

(1) OpenStack + Three-node Swift cluster + HAProxy + UCARP installation and configuration

(2) Principles, architecture and performance

(3) Monitoring

1. Architecture

1.1 Overall architecture

Swift's overall architecture is clear, and its tiers are independent:

Access tier:

1. Load Balancer: a hardware (e.g. F5) or software (e.g. HAProxy) load balancer that distributes client requests to the stateless proxy services according to the configured policy. Choose a hardware or software LB according to actual requirements.

2. Proxy Server:

  • provides the REST API to clients
  • a stateless WSGI service, deployed as a cluster of multiple proxy servers
  • forwards each client request to an Account, Container or Object service on a storage node

The proxy tier is CPU and network bandwidth sensitive. Deployment considerations:

  • allocate more CPU and network bandwidth, for example a 10GbE network
  • use Memcached to cache tokens and account and container data
  • deploy it separately from the storage services
  • put a load balancer in front of it
  • a common starting point is two proxy nodes per five storage nodes, scaling out from there
  • separate the public network (used for client access) from the backend network (used for access to the storage services)
  • for security, SSL termination can be placed on the Proxy Server

Storage tier (capacity tier):

3. Account Server: provides the account operation REST API.

4. Container Server: provides the container operation REST API.

5. Object Server: provides the object operation REST API.

6. Consistency Servers: background services such as Replicators, Updaters and Auditors that ensure object consistency.

The storage tier is disk performance and network sensitive. Deployment considerations:

  • use better disks, such as SAS or SSD, for the Account and Container services to improve performance
  • use 1GbE or 10GbE networks
  • if necessary, separate out the replication network
  • the Account and Container services can be deployed on separate servers if needed

This is a fairly classic Swift physical deployment diagram:

1.2 Network Architecture

Taking as an example a cluster that provides object storage services externally, its network architecture can be:

    • External traffic is placed on a separate VLAN (purple in the diagram) that terminates at the LB
    • The control (management) network connects all nodes
    • The Swift front-end (public) network connects the LB and all Proxy Server nodes
    • The Swift back-end (private) network connects all Proxy Server nodes and storage nodes
    • If needed, the replication network can be further isolated from the back-end network

On network bandwidth selection:

    • Given the large volume of replicated data (often starting at several terabytes), the back-end network commonly uses 10GbE
    • Depending on the front-end load, the front-end network can use 1GbE, or 10GbE when conditions allow
    • The management network commonly uses 1GbE bandwidth
2. Data storage

2.1 Swift data storage

2.1.1 Swift's data model

Swift's data model uses the following three concepts (see Fig. 1):

    • Account: account/tenant. Swift natively supports multi-tenancy. If OpenStack Keystone is used for authentication, the account is equivalent to the OpenStack project/tenant concept. Swift's tenant isolation applies to metadata, not to the object data itself. An account's data includes its own metadata and its container list, stored in a SQLite database.
    • Container: a container, similar to a directory in a file system, defined by the user; it includes its own metadata and the list of objects within it. The data is stored in a SQLite database. Newer versions of Swift support creating folders within a container.
    • Object: an object, comprising the data and its metadata, saved as a file on the file system.

(Fig. 1) (Fig. 2)
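To make the three-level model concrete, here is a minimal sketch using python-swiftclient; the auth endpoint, credentials and names are assumptions for illustration, not values from this article:

```python
# Minimal sketch of the account/container/object model with
# python-swiftclient. Auth URL, credentials and names are placeholders.
import swiftclient

conn = swiftclient.Connection(authurl='http://keystone:5000/v2.0',
                              user='tenant:user', key='password',
                              auth_version='2')

conn.put_container('container1')                # container under the account
conn.put_object('container1', 'hello.txt',      # object inside the container
                contents=b'hello swift',
                content_type='text/plain')

# The account keeps its own metadata plus the container list; the
# container keeps its metadata plus the object list (SQLite-backed
# on the server side).
acct_headers, containers = conn.get_account()
cont_headers, objects = conn.get_container('container1')
print([c['name'] for c in containers], [o['name'] for o in objects])
```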

2.1.2 Selecting the data storage location

Swift saves each object with multiple copies and, based on physical location, tries to place the copies in different physical locations to ensure the geographic reliability of the data. It mainly considers the following location properties:

    • Region: geographically separate areas, such as data centers in different cities or even countries, mainly for disaster recovery.
    • Zone: a failure-isolation domain within a data center, based on physical network, power, cooling and other infrastructure; often the servers within one rack (Rack) form a Zone.
    • Node: a physical server.
    • Disk: a disk on a physical server.

When determining where an object is placed, Swift tries to place the object and its copies in physical locations that are unlikely to be lost at the same time. See Fig. 2.

2.1.3 Ensure data consistency

After an object and its copies are placed on disks, Swift uses background services such as Replicators, Updaters and Auditors to ensure the eventual consistency of the data.

    • Replicator: copies objects to keep the system eventually consistent; recovers from disk failures and network outages
    • Updater: updates metadata; recovers from failures caused by high load on container and account metadata
    • Auditor: deletes problematic accounts, containers and objects and re-replicates them from other servers; recovers databases or files with bit-rot problems

The Replicator service runs periodically at a configurable interval, the default being 30s. It replicates data with the partition as the smallest unit of replication and the node as its scope. For the detailed procedure, please refer to the reference documentation at the end of this document. Since Swift achieves eventual rather than strong consistency, it is not suitable for applications that require strong data consistency, such as bank deposits or ticketing systems. Situations that require replication include, but are not limited to:

    • The write of the third replica fails while Proxy Server is writing the copies; it still returns success to the client, and the background service writes the third copy later
    • A background process finds that one replica's data is corrupted and re-writes it in a new location
    • In a cross-region deployment, Proxy Server writes only to the storage in its own region, and the data in the remote region is written by the background replication process
    • When a disk is replaced or a disk is added, the data needs to be rebalanced
2.2 How Swift implements these requirements: the Ring + hash algorithm

Swift uses a relatively straightforward algorithm, based on the Ring configured by the administrator, to determine where objects are stored. Each object is saved as a file in the local file system, with the object's metadata held in the file's extended attributes. Swift therefore needs a file system that supports extended attributes; the current official recommendation is XFS.
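To illustrate why extended attributes matter, here is a hedged sketch of storing object metadata in a file's xattrs. The key name user.swift.metadata matches what Swift uses, but the serialization is simplified here (real Swift may split large metadata across several numbered keys):

```python
import os
import pickle

METADATA_KEY = 'user.swift.metadata'  # the xattr key Swift uses

def write_metadata(path, metadata):
    # Store the object's metadata dict in an extended attribute of the
    # data file itself (simplified: real Swift chunks large values
    # across numbered keys).
    os.setxattr(path, METADATA_KEY, pickle.dumps(metadata))

def read_metadata(path):
    return pickle.loads(os.getxattr(path, METADATA_KEY))

# Usage (requires a filesystem with xattr support, e.g. XFS; path is
# a placeholder):
# write_metadata('/srv/node/sdb1/test.data',
#                {'Content-Type': 'application/octet-stream'})
```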

2.2.1 Ring content and algorithm

In short, Swift's Proxy Server determines where each piece of data is stored based on the respective Rings of Account, Container and Object; the account and container database files are themselves treated as objects.

Therefore, Swift requires that the Ring files be distributed to all proxy nodes. At the same time, the consistency servers need them to determine the locations of object copies in the background, so they must be deployed on all storage nodes as well. The Ring is saved in the form of files:

    • object.ring.gz
    • container.ring.gz
    • account.ring.gz

Before analyzing how the ring works, consider several key configurations of the ring:

    • Region, Zone and Disk: described above, skipped here
    • Partition: Swift divides each disk into a number of partitions, which are the basic unit the back-end consistency services use for replication
    • Replica: the total number of copies of an object, with a generally recommended value of 3

The administrator uses the ring generation tool provided by Swift (swift-ring-builder, located under the bin directory of the source tree; it is Swift's most basic command, implementing ring file creation, device addition and rebalancing together with the code under swift/common/ring/), plus various configuration parameters, to produce the ring content. Take the Object ring as an example:

```
root@swift:/etc/swift# swift-ring-builder object.builder
object.builder, build version 6
1024 partitions, 3.000000 replicas, 1 regions, 3 zones, 6 devices, 0.00 balance, 0.00 dispersion
The minimum number of hours before a partition can be reassigned is 1
The overload factor is 0.00% (0.000000)
Devices:  id region zone ip address     port replication ip  replication port name weight partitions balance meta
           0      1    1 9.115.251.235  6000 9.115.251.235              6000 sdb1 100.00        512    0.00
           1      1    1 9.115.251.235  6000 9.115.251.235              6000 sdc1 100.00        512    0.00
           2      1    2 9.115.251.234  6000 9.115.251.234              6000 sdb1 100.00        512    0.00
           3      1    2 9.115.251.234  6000 9.115.251.234              6000 sdc1 100.00        512    0.00
           4      1    3 9.115.251.233  6000 9.115.251.233              6000 sdb1 100.00        512    0.00
           5      1    3 9.115.251.233  6000 9.115.251.233              6000 sdc1 100.00        512    0.00
```

The Ring is configured as: 1 region, 3 zones, 3 nodes, 6 disks, and 512 partitions per disk.

Internally, Swift saves the Ring's configuration in its _replica2part2dev data structure:

It is read as follows:

    • Columns: all partitions in the cluster, numbered sequentially, with a unique ID for each partition
    • Rows: one row per replica; each cell holds the device ID that uniquely identifies a disk in the Ring

Using this data structure, Swift can easily find out which disk, behind which storage service on which node, each replica should be placed on.
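A toy illustration of such a table and the lookup it enables (not Swift's actual code; the numbers are made up):

```python
# Toy version of _replica2part2dev: one row per replica, one column per
# partition; each cell is the id of the disk holding that copy.
replica2part2dev = [
    [0, 2, 4, 1, 3, 5],   # replica 0: device id for partitions 0..5
    [2, 4, 0, 3, 5, 1],   # replica 1
    [4, 0, 2, 5, 1, 3],   # replica 2
]

def devices_for_partition(partition):
    """Return the ids of the disks holding every replica of a partition."""
    return [row[partition] for row in replica2part2dev]

print(devices_for_partition(1))   # -> [2, 4, 0]
```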

In addition to generating the ring, the other important operation on the ring is rebalance. After you modify the builder file (for example, adding or removing devices), this operation regenerates the ring file so that the partition distribution in the system is rebalanced. Of course, after a rebalance, the services of the system need to be restarted. For more information, please refer to OpenStack Swift source code analysis (ii): ring file generation.

2.2.2 Data placement and reading process

When it receives a PUT request for an object that needs to be saved, Proxy Server:

    1. Computes the hash of the object's full path (/account[/container[/object]]); the number of hash bits used depends on the total number of partitions in the cluster.
    2. Maps the top bits of the hash value to a partition ID, then uses the ring to find the replica-count disks assigned to that partition (see the sketch after this list).
    3. Determines from the ring the IP and port of the storage services holding those disks.
    4. Tries to connect to these services' ports in turn. If half of the services cannot be connected to, the request is rejected.
    5. When creating an object, the storage service saves the object as a file on a disk. (The object server asynchronously invokes the container service to update the container database after it completes the file store.)
    6. After 2 of the 3 copies have been written successfully, Proxy Server returns success to the client.
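Steps 1 and 2 can be sketched as follows. This is a simplification of Swift's real code: actual Swift also mixes a per-cluster hash path prefix/suffix from swift.conf into the hash, and the suffix value below is a placeholder:

```python
import hashlib
import struct

PART_POWER = 10                    # 2**10 = 1024 partitions, as in the ring above
PART_SHIFT = 32 - PART_POWER
HASH_PATH_SUFFIX = b'changeme'     # per-cluster secret from swift.conf (placeholder)

def get_partition(account, container=None, obj=None):
    """Hash the full object path and keep the top PART_POWER bits of the
    first four digest bytes; the result is the partition id."""
    path = '/' + '/'.join(p for p in (account, container, obj) if p)
    digest = hashlib.md5(path.encode() + HASH_PATH_SUFFIX).digest()
    return struct.unpack_from('>I', digest)[0] >> PART_SHIFT

# The partition id is then looked up in _replica2part2dev (see above)
# to find the replica-count devices, and from them the service IPs/ports.
print(get_partition('AUTH_test', 'container1', 'tmpubuntu'))
```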

When Proxy server receives a GET request for an object, it:

(1)-(4) Same as for the PUT request above: determine all the disks that hold all the replicas.

(5) Sorts these nodes and tries to connect to the first one. If successful, it returns the binary data to the client; if not, it tries the next one, until it succeeds or all have failed.

The process is straightforward, in keeping with Swift's overall design style. For the specific hashing algorithm implementation, interested readers can consult the related papers. In general, it implements "unique-as-possible", that is, an "as unique as possible" placement algorithm, considering Zone, node and Disk in that order: for a replica, Swift first chooses a zone that does not yet hold a replica of the object; if there is no such zone, it selects an unused node in an already-used zone; if there is no such node, it selects an unused disk on an already-used node. A simplified sketch of this order of preference follows.
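The sketch below is an illustration under assumed data shapes, not Swift's actual implementation:

```python
def pick_device(devices, used):
    """Choose a device for the next replica, 'as uniquely as possible'.
    devices: list of dicts like {'id': 0, 'zone': 1, 'node': '9.115.251.235'}.
    used: the devices already holding replicas of this object."""
    used_zones = {d['zone'] for d in used}
    used_nodes = {(d['zone'], d['node']) for d in used}
    used_ids = {d['id'] for d in used}
    # 1. Prefer a zone that holds no replica of the object yet.
    for d in devices:
        if d['zone'] not in used_zones:
            return d
    # 2. Otherwise, an unused node within an already-used zone.
    for d in devices:
        if (d['zone'], d['node']) not in used_nodes:
            return d
    # 3. Otherwise, an unused disk on an already-used node.
    for d in devices:
        if d['id'] not in used_ids:
            return d
    return None
```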

2.2.3 Object Segmentation

Swift stores small files directly, without segmentation. For large files (the size threshold is configurable and defaults to 5GB), the system automatically segments them. The user can also specify the segment size used to store a file. For example, for a 590MB file, setting the segment size to 100MB splits it into 6 segments, which are uploaded to the cluster in parallel:

```
root@swift:~/s1# swift upload container1 -S 100000000 tmpubuntu
tmpubuntu segment 5
tmpubuntu segment 2
tmpubuntu segment 4
tmpubuntu segment 1
tmpubuntu segment 3
tmpubuntu segment 0
tmpubuntu
root@swift:~/s1# swift list container1
admin-openrc.sh
cirros-0.3.4-x86_64-disk.raw
tmpubuntu
```

From swift stat it can be seen that a manifest file holds the segment information:

```
root@swift:~/s1# swift stat container1 tmpubuntu
       Account: AUTH_dea8b51d28bf41599e63464828102759
     Container: container1
        Object: tmpubuntu
  Content Type: application/octet-stream
Content Length: 591396864
 Last Modified: Fri, 13 Nov 2015 ... GMT
          ETag: "fa561512dcd31b21841fbc9dbace118f"
      Manifest: container1_segments/tmpubuntu/1446907333.484258/591396864/100000000/
    Meta Mtime: 1446907333.484258
 Accept-Ranges: bytes
    Connection: keep-alive
   X-Timestamp: 1447439512.09744
    X-Trans-Id: txae548b4b35184c71aa396-0056462d72
```

But swift list still shows only one file, because the segments are saved in a separate container (container1_segments). There you can see the 6 segment objects:

```
root@swift:~/s1# swift list container1_segments
tmpubuntu/1446907333.484258/591396864/100000000/00000000
tmpubuntu/1446907333.484258/591396864/100000000/00000001
tmpubuntu/1446907333.484258/591396864/100000000/00000002
tmpubuntu/1446907333.484258/591396864/100000000/00000003
tmpubuntu/1446907333.484258/591396864/100000000/00000004
tmpubuntu/1446907333.484258/591396864/100000000/00000005
```

Each segment object is 100MB (for storage efficiency, segment objects smaller than 100MB are not recommended):

```
root@swift:~/s1# swift stat container1_segments tmpubuntu/1446907333.484258/591396864/100000000/00000000
       Account: AUTH_dea8b51d28bf41599e63464828102759
     Container: container1_segments
        Object: tmpubuntu/1446907333.484258/591396864/100000000/00000000
  Content Type: application/octet-stream
Content Length: 100000000
```

The user can operate on each segment individually, for example modifying a single segment. Swift is only responsible for concatenating all segments into the large object the user sees.

For more details on large file support, refer to the official documentation and the Rackspace documentation. As the description above shows, Swift's support for file segmentation is relatively rudimentary (fixed-size, inflexible), so object striping schemes have been proposed, such as the one below; it is unclear whether it has been or will be supported.

2.3 Region

By storing an object's replicas in regions at different physical locations, you can further enhance data availability. The basic principle is: with N replicas and M regions, each region holds floor(N/M) replicas, and the remaining N mod M replicas are placed in randomly selected regions. Taking N = 3, M = 2 as an example, one region holds 1 replica and the other holds 2, as shown in the figure. A small sketch of this distribution rule follows.
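The arithmetic can be made concrete with a few lines of Python (illustrative only):

```python
import random

def replicas_by_region(n_replicas, regions):
    """Each of the M regions gets N // M replicas; the remaining
    N % M replicas go to randomly chosen regions."""
    base, extra = divmod(n_replicas, len(regions))
    counts = {r: base for r in regions}
    for r in random.sample(regions, extra):
        counts[r] += 1
    return counts

print(replicas_by_region(3, ['region1', 'region2']))
# e.g. {'region1': 2, 'region2': 1}
```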

For a PUT operation, Proxy Server writes replicas only to nodes in its own region; the replicas in the remote region are written by the Replicator. Swift's algorithm therefore tries to ensure that the region where the proxy server resides holds relatively more of the replicas; this is known as the proxy server's replica affinity.

Clearly, data replication across the region exacerbates the need for network bandwidth.

There are two modes of writing replicas to a remote region:

(1) Replicas are written to the remote region in real time

(2) Replicas are written to the remote region asynchronously

2.4 Storage Policies

In the description above, a Swift cluster supports only one set of Ring configurations, meaning the configuration is uniform across the whole cluster. Similar to the definition of a pool in Ceph, Swift added a major feature in version 2.0 (included in the OpenStack Juno release): storage policies. In the new implementation, a Swift cluster can be configured with multiple rings, and each ring's configuration can differ. For example, Ring 1 keeps 3 copies of each object while Ring 2 keeps 2. Several characteristics:

    • A policy is implemented at the container level
    • You can specify a policy when creating a container. Once specified, it cannot be changed.
    • A policy can be shared by multiple containers

With this new feature, Swift users can define different storage strategies for the storage needs of different applications. For example, data for critical applications can use a storage policy that places it on SSDs, while less critical data can use a policy that keeps only 2 copies to save disk space. For example:
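A hedged sketch with python-swiftclient: binding a container to a storage policy at creation time via the X-Storage-Policy header. The policy name and connection parameters are placeholders, not values from this article:

```python
import swiftclient

conn = swiftclient.Connection(authurl='http://keystone:5000/v2.0',
                              user='tenant:user', key='password',
                              auth_version='2')

# The policy can only be chosen at container creation and cannot be
# changed afterwards; many containers may share the same policy.
conn.put_container('fast-container',
                   headers={'X-Storage-Policy': 'ssd-3replica'})
```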

For more information, please refer to the official OpenStack documentation and SwiftStack official documentation.

3. Versions and main features

3.1 Juno and earlier: major releases and features

(1) Large object support

    • Swift limitation: a single object is at most 5GB
    • Split objects to manage large objects
    • Segmented objects are managed via a manifest file
    • Ref: http://docs.openstack.org/developer/swift/overview_large_objects.html

(2) Static web hosting

    • Upload static web files to serve a web site, including index and error files
    • Uses the StaticWeb middleware
    • Ref: http://docs.openstack.org/developer/swift/middleware.html#staticweb

(3) S3 compatible API

    • Supports the S3 API
    • Supports a limited subset of the APIs (less than 40%), via the swift3 middleware
    • Ref: https://github.com/stackforge/swift3

(4) Object expiration

    • Schedule deletion of objects
    • Use the X-Delete-At and X-Delete-After headers with an object PUT or POST (see the sketch below)
    • X-Delete-At: delete the object at the specified time
    • X-Delete-After: delete the object after the specified number of seconds
    • Ref: http://docs.openstack.org/developer/swift/overview_expiring_objects.html
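A hedged sketch of setting these headers with python-swiftclient (connection parameters and object names are placeholders):

```python
import swiftclient

conn = swiftclient.Connection(authurl='http://keystone:5000/v2.0',
                              user='tenant:user', key='password',
                              auth_version='2')

# Delete 24 hours after the request (X-Delete-After is in seconds) ...
conn.post_object('container1', 'tmpfile',
                 headers={'X-Delete-After': '86400'})

# ... or at an absolute Unix timestamp.
conn.post_object('container1', 'tmpfile',
                 headers={'X-Delete-At': '1450000000'})
```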

(5) Temp URL

    • Provides a URL for time-limited access (see the sketch below)
    • Requires a temp_url_expires time in the request
    • Uses the TempURL middleware
    • Ref: http://docs.openstack.org/developer/swift/api/temporary_url_middleware.html
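The signature scheme is a plain HMAC-SHA1 over the method, expiry and path, as in this sketch; the host and key are placeholders, and the key must first be set on the account as X-Account-Meta-Temp-URL-Key:

```python
import hmac
import time
from hashlib import sha1

key = b'mysecretkey'                  # X-Account-Meta-Temp-URL-Key (placeholder)
method = 'GET'
expires = int(time.time()) + 3600     # link valid for one hour
path = '/v1/AUTH_dea8b51d28bf41599e63464828102759/container1/tmpubuntu'

body = f'{method}\n{expires}\n{path}'.encode()
signature = hmac.new(key, body, sha1).hexdigest()

print(f'https://swift.example.com{path}'
      f'?temp_url_sig={signature}&temp_url_expires={expires}')
```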

(6) Global cluster

    • Build a single cluster across distant sites
    • Read/write affinity
    • Deferred replication
    • Ref: http://docs.openstack.org/developer/swift/admin_guide.html, https://swiftstack.com/blog/2012/09/16/globally-distributed-openstack-swift-cluster/

(7) Storage policies

    • Support multiple policies in a single storage cluster
    • Implemented using multiple ring files
    • Ref: http://docs.openstack.org/developer/swift/admin_guide.html
3.2 Updates in the Kilo version

New features

Erasure Code (Beta)

Swift now supports an erasure code (EC) storage policy type. This allows deployers to achieve high availability, as with replica storage, using much less raw capacity. However, EC requires more CPU and network resources, so it is not suitable for all scenarios. EC is well suited to large volumes of rarely accessed data stored in a single region.

Swift's erasure code implementation is transparent to the user: there is no difference in the API between the replica and erasure code storage types.

To support erasure codes, Swift now depends on PyECLib and liberasurecode. liberasurecode is a pluggable library that allows the EC algorithm to be implemented by a backend library of your choice.

For more detailed documentation, see http://swift.openstack.org/overview_erasure_code.html

Composite Tokens

Composite tokens allow other OpenStack services to store data in Swift on behalf of a client, such that neither the client nor the service alone can update the data without the other's authorization.

A typical example: a user asks Nova to store a snapshot of a VM. Nova passes the request to Glance, and Glance writes the image as a set of objects in a Swift container. In this scenario, the user cannot modify the snapshot data directly without a valid token from the service; likewise, the service itself cannot update the data without a valid token from the user. Yet the data lives in the user's Swift account, which makes account management easier.

For more detailed documentation, see http://swift.openstack.org/overview_backing_store.html

Data placement updates for smaller or unbalanced clusters

Swift data placement is now determined by hardware weights. Operators can now incrementally add new zones and regions without immediately triggering a large-scale data migration. In addition, if a cluster is unbalanced (for example, a one-region cluster with two zones, where one zone has twice the capacity of the other), Swift uses the existing space more efficiently and warns when replicas would run out of cluster space.

Global cluster replication optimization

When replicating between regions, only one copy of the data is shipped per replication pass; the remote region then replicates it internally, avoiding sending additional copies over the wide area network (WAN).

Known issues
    • As a beta feature, erasure code (EC) support is nearing completion, but some functions (such as multi-range reads) are still incomplete, and there is no complete performance evaluation yet. The feature depends on ssync for durability. Deployers are urged to test at larger scale and not to use erasure-coded storage policies in production deployments.
Upgrade Tips

As always, you can upgrade to this version of Swift without compromising the end user experience.

    • To support erasure codes, Swift has new dependencies: PyECLib (and liberasurecode, etc.). The minimum required version of eventlet has also been raised.
3.3 Updates in the Liberty version

The Liberty release does not include major new Swift features; details can be found in the official documentation.

3.4 Advantages

Other reference documents:

http://www.florentflament.com/blog/openstack-swift-ring-made-understandable.html

https://www.mirantis.com/blog/configuring-multi-region-cluster-openstack-swift/

OpenStack Swift Source Analysis

