Introduction to Distributed Storage

Last Update:2016-08-05 Source: Internet

Author: User


This article covers the first three of the following sections:

First, the storage type

Second, the file system

Third, storage media

Fourth, RAID and replicas

Fifth, the structure of SrvSAN

Sixth, the safety hazards of SrvSAN

Seventh, solutions

First, the storage type

In general, storage is divided into four types: direct-attached storage (DAS) local to the server, network-attached storage (NAS), storage area network (SAN) storage, and object storage. Object storage combines SAN and NAS storage and draws on the strengths of both.

Figure 1

Let's look at how an application gets the file it wants from storage, using the familiar Windows as an example, as shown in Figure 1.

1. The application issues the instruction "read the first 1K of data from the Readme.txt file in this directory".
2. The instruction is passed through memory to the directory layer, which converts the relative directory into the actual directory: "read the first 1K of data from C:\test\readme.txt".
3. The file system, such as FAT32, queries the file allocation table and directory entries to obtain the LBA address where the file is stored, its permissions, and other information.

The file system first checks whether the data is in the cache. If it is, the data is returned directly. If not, the file system passes the command "read 1024 bytes starting at LBA1000" through memory to the next layer.

4. The volume (LUN) management layer translates the LBA address into a physical address on the storage, encapsulates the request in a protocol such as SCSI, and passes it to the next layer.
5. The disk controller fetches the corresponding data from the disk according to the command.

If the disk sector size is 4K, the actual I/O reads 4K: the head reads 4K of data into the server's memory, and the file system hands only the first 1K to the application. If an application later issues the same request, the file system can serve it directly from the server's memory.
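The read path above can be sketched in a few lines of Python. This is only an illustrative model, not how Windows or FAT32 is actually implemented: the class names `FakeDisk` and `CachingFileSystem` are hypothetical, and the 4 KB sector size is taken from the example.

```python
SECTOR_SIZE = 4096  # assumed sector size, as in the example above


class FakeDisk:
    """Hypothetical disk that always transfers whole sectors."""

    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0  # number of physical sector reads

    def read_sector(self, lba: int) -> bytes:
        self.reads += 1
        start = lba * SECTOR_SIZE
        return self.data[start:start + SECTOR_SIZE]


class CachingFileSystem:
    """Hypothetical file system layer that keeps sectors in memory."""

    def __init__(self, disk: FakeDisk):
        self.disk = disk
        self.cache = {}  # lba -> sector bytes

    def read(self, lba: int, length: int) -> bytes:
        if lba not in self.cache:            # cache miss: go to the disk
            self.cache[lba] = self.disk.read_sector(lba)
        return self.cache[lba][:length]      # return only what was requested


disk = FakeDisk(bytes(range(256)) * 64)      # 16 KB of sample data
fs = CachingFileSystem(disk)
first = fs.read(lba=1, length=1024)          # disk reads 4 KB, caller gets 1 KB
second = fs.read(lba=1, length=1024)         # served from memory, no disk read
print(len(first), disk.reads)
```

The second call never touches the disk, which is the effect the paragraph describes.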

The data access process is similar whether the storage is DAS, NAS, or SAN. DAS packages compute and storage in a single server. The computers we all use every day are DAS systems, as in Figure 1.

Figure 2

If compute and storage are separated, storage becomes an independent device with its own file system that manages the data itself. That is NAS, as shown in Figure 2.

Compute and storage are generally connected over Ethernet, using the CIFS or NFS protocol. Servers can share one file system: whether a server speaks Shanghainese or the Hangzhou dialect, everything is translated into Mandarin when it reaches the NAS file system over the network.

NAS storage can therefore be shared by different hosts. The server only needs to make requests and hands much of the work to the storage, so the CPU resources it saves can be spent on the server's own tasks. In other words, compute-intensive workloads are well suited to NAS.

Figure 3

If compute and storage are separated and the storage becomes a standalone device that only accepts commands and performs no complex processing, doing just two things, reading and writing data, it is called SAN, as shown in Figure 3.

Because it has no file system, it is also called "raw storage", and some applications, such as databases, require raw devices. The storage accepts only simple, direct commands, and the complex work is done on the server side. Combined with an FC network, this kind of storage reads and writes data at high speed.

Each server manages the storage with its own file system. The storage is not picky: it saves whatever data it is given without needing to know what it is. English or French, it is all recorded faithfully.

But only someone who understands English can read the English data, and only someone who understands French can read the French data. As a result, a server and its SAN storage area are generally "monogamous", and SAN storage does not share well. Of course, hosts with a clustered file system can share the same storage area.

From the analysis above, storage speed is determined by the network and by the complexity of the commands.

Memory communication > bus communication > network communication

Network communication includes FC networks and Ethernet. FC networks can now reach 8 Gb/s, while 10 Gb/s Ethernet over fiber is already widespread and 40 Gb/s network cards are also in use. This means that traditional Ethernet is no longer a bottleneck for storage. Besides FC SAN, IP SAN is also an important member of the SAN family.

Besides the familiar read and write, storage operations also include create, open, get attributes, set attributes, find, and so on.

With SAN storage, the "brain" (the file system) sits on the server, so commands other than read/write can be completed in local memory at very high speed.

With NAS storage, the server has no such "brain": every command sent to the storage must be encapsulated in IP and transmitted over Ethernet to the NAS server, which is much slower than memory communication.

DAS is the fastest, but can only be used by its own server;
NAS is slow, but shares well;
SAN is fast, but shares poorly.

In general, object storage combines the high-speed direct disk access of SAN with the distributed sharing of NAS.

The basic unit of NAS storage is the file, the basic unit of SAN storage is the data block, and the basic unit of object storage is the object. An object can be thought of as file data plus a set of attributes, which can define file-based RAID parameters, data distribution, and quality of service.

Object storage separates "control information" from "data storage". The client reads and writes using an object ID plus an offset: it first obtains the real address of the data from the "control information", then accesses the "data storage" directly.
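The two-step access pattern can be illustrated with a small sketch. It is a toy model under stated assumptions, not the API of any real object store: `MetadataService`, `StorageNode`, `client_read`, and the node and object names are all hypothetical.

```python
class MetadataService:
    """Hypothetical 'control information' service: object ID -> storage node."""

    def __init__(self):
        self.locations = {}

    def register(self, object_id, node_name):
        self.locations[object_id] = node_name

    def locate(self, object_id):
        return self.locations[object_id]


class StorageNode:
    """Hypothetical 'data storage' node that holds object payloads."""

    def __init__(self):
        self.objects = {}

    def put(self, object_id, data):
        self.objects[object_id] = data

    def read(self, object_id, offset, length):
        return self.objects[object_id][offset:offset + length]


nodes = {"node-a": StorageNode(), "node-b": StorageNode()}
meta = MetadataService()

nodes["node-b"].put("obj-42", b"hello object storage")
meta.register("obj-42", "node-b")


def client_read(object_id, offset, length):
    node_name = meta.locate(object_id)        # step 1: ask the control side
    node = nodes[node_name]
    return node.read(object_id, offset, length)  # step 2: read from the data side


print(client_read("obj-42", 6, 6))
```

The metadata service is consulted only for the location; the payload itself never passes through it.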

Object storage is heavily used on the Internet; the network drives that everyone uses are typical object storage. Object storage is highly extensible and scales linearly. Through interface encapsulation it can also provide NAS and SAN storage services.

VMware's vSAN is in essence an object store. Distributed object storage is a kind of SrvSAN and carries security risks, which come from the x86 servers it is built on.

Second, the file system

The computer's file system is the "bookkeeper" that manages files.

First, he has to manage the warehouse and know where all the goods are placed;
then he has to control goods going in and out, and keep the goods safe.

Without this "bookkeeper", if everyone were free to go in and out of the warehouse, the warehouse would become cluttered and goods would be lost.

It is like the year the Textile City computer room first opened. Everyone's goods were piled up in the computer room with no unified management. When equipment needed to be racked, people searched for their own items in a big pile, and after installation nobody cleaned up the garbage. In the end the room was so cluttered that there was nowhere to put anything, and people who could not find their own goods sometimes just used someone else's.

Everyone complained, so a warehouse was set up and a warehouse administrator was hired, who used a ledger to record goods in and out and their storage locations and established a check-in and check-out system. The problem was solved. This is what a file system does.

The file system manages the interface for accessing files, the organization and allocation of file storage, and file attributes (such as file ownership, permissions, and creation time).

Each operating system has its own file systems. For example, Windows commonly uses FAT, FAT32, and NTFS, while Linux has the ext family up to ext4.

There are many kinds of "warehouse" for storing files. The main ones in use today are (mechanical) disks, SSDs, CDs, and tape.

When you get one of these media, the first thing to do is "format" it, which is the process of building the file storage organization and its "ledger". For example, formatting a USB drive with FAT32 produces the layout and ledger shown in Figure 4:

Figure 4

Main boot area: records the overall and basic information about the storage device, such as the sector size, the cluster size, the number of heads, the total number of sectors, the FAT tables, and the partition boot code.

File allocation table (FAT): the ledger of this storage. If it is lost, the data is lost, so two copies are generally kept, FAT1 and FAT2. The table records how each cluster is used: an empty entry means the cluster is unused, a special mark denotes a bad cluster, and for a cluster that holds data the entry gives the location of the file's next block.

Directory area: records directories and the location information of files.

Data area: the region that stores the actual contents of files.

The following example helps explain how a FAT file system works.

Suppose that 8 sectors make up one cluster, so the cluster size is 512*8=4K, and the Readme.txt file in the root directory is 10K, as shown in Figure 5:

Figure 5

1. Look up Readme.txt in the root directory; its entry gives the file's starting position in the FAT table, 0004.
2. Read file block Readme (1) from the 8 sectors of the cluster for entry 0004 into memory, and get the next block position, 0005.
3. Read file block Readme (2) from the 8 sectors of the cluster for entry 0005 into memory, and get the next block position, 0008.
4. Read file block Readme (3) from 4 sectors of the cluster for entry 0008 into memory, and get the end-of-file flag.
5. Combine Readme (1), Readme (2), and Readme (3) into the Readme file.
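The chain walk in these steps can be sketched as follows. This is a simplified in-memory model that assumes the cluster numbers from the example (0004, 0005, 0008); the dictionaries `fat`, `directory`, and `data_area` are hypothetical stand-ins for the on-disk structures, not a real FAT parser.

```python
CLUSTER_SIZE = 8 * 512   # 8 sectors of 512 B = 4 KB, as in the example
END_OF_CHAIN = None      # stand-in for the FAT end-of-file marker

# Hypothetical FAT: cluster number -> next cluster of the same file
fat = {4: 5, 5: 8, 8: END_OF_CHAIN}

# Hypothetical directory entry: file name -> (first cluster, size in bytes)
directory = {"Readme.txt": (4, 10 * 1024)}

# Hypothetical data area: cluster number -> cluster contents
data_area = {
    4: b"A" * CLUSTER_SIZE,
    5: b"B" * CLUSTER_SIZE,
    8: b"C" * CLUSTER_SIZE,
}


def cluster_chain(name):
    """Follow the FAT chain from the file's first cluster to the end marker."""
    cluster, _size = directory[name]
    chain = []
    while cluster is not END_OF_CHAIN:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


def read_file(name):
    """Concatenate the clusters in chain order and trim to the file size."""
    _first, size = directory[name]
    raw = b"".join(data_area[c] for c in cluster_chain(name))
    return raw[:size]    # the last cluster is only partly used (2 KB of 4 KB)


print(cluster_chain("Readme.txt"), len(read_file("Readme.txt")))
```

Each cluster can only be found after the previous FAT entry has been read, which is why the lookup is inherently sequential.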

In this example we see that the FAT file system determines where a file is stored by querying the FAT table and directory entries. A file is distributed in cluster-sized data blocks, and a "chain" indicates where the file's data is saved.

To read a file, you must start from the head of the file and follow the chain, so reading is not very efficient.

Linux file systems are broadly similar to one another. Take the ext file system as an example, shown in Figure 6.

Figure 6

The boot block is used to boot the server. It is reserved even if the partition is not a boot partition.

The super block stores information about the file system, including the file system type, the number of inodes, and the number of data blocks.

The inode blocks store the inode information of files. Each file corresponds to one inode, which contains the file's metadata, specifically the following:

The number of bytes in the file

User ID of the owner of the file

The group ID of the file

Read, write, and execute permissions for files

The file's timestamps, three in total: ctime is when the inode was last changed, mtime is when the file content was last changed, and atime is when the file was last accessed.

Number of links, that is, how many file names point to this inode

Location of file data block

When you view a directory or file, the file attributes and the pointers to its data are first looked up in the inode table, and the data is then read from the data blocks.
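Most of the inode fields listed above can be inspected from Python with the standard `os.stat` call. The sketch below creates a temporary file purely for illustration; the dictionary labels are our own, and the ctime meaning described above applies to Unix-like systems. The location of the data blocks is not exposed by `os.stat`.

```python
import os
import stat
import tempfile

# Create a small throwaway file to inspect
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello inode")
    path = f.name

info = os.stat(path)
inode_view = {
    "inode number": info.st_ino,
    "size in bytes": info.st_size,
    "owner uid": info.st_uid,
    "group gid": info.st_gid,
    "permissions": stat.filemode(info.st_mode),
    "ctime": info.st_ctime,
    "mtime": info.st_mtime,
    "atime": info.st_atime,
    "link count": info.st_nlink,
}
for key, value in inode_view.items():
    print(f"{key}: {value}")

os.remove(path)  # clean up the temporary file
```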

Data blocks: Store directory and file data.

To understand the ext file system, follow the flow of reading the /var/readme.txt file, shown in Figure 7.

Figure 7

1. The root directory / corresponds to inode 2, and the data block of that inode is D1.
2. Searching the contents of D1 shows that the directory var corresponds to inode=28, whose data block is D5.
3. Searching the contents of D5 shows that readme.txt corresponds to inode=70.
4. Inode 70 points to data blocks D2, D3, and D6. Read these blocks and combine D2, D3, and D6 in memory.
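The lookup above can be modeled with two small tables. This is a sketch under assumptions: the inode numbers and block names follow the example, while the dictionaries `inodes` and `blocks` and the sample block contents are hypothetical and far simpler than real ext structures.

```python
ROOT_INODE = 2

# Hypothetical inode table: inode number -> type and data block list
inodes = {
    2: {"type": "dir", "blocks": ["D1"]},
    28: {"type": "dir", "blocks": ["D5"]},
    70: {"type": "file", "blocks": ["D2", "D3", "D6"]},
}

# Hypothetical data blocks: directory blocks hold name -> inode mappings
blocks = {
    "D1": {"var": 28},
    "D5": {"readme.txt": 70},
    "D2": b"part1-",
    "D3": b"part2-",
    "D6": b"part3",
}


def lookup(path):
    """Resolve a path to an inode number, one directory at a time."""
    inode_no = ROOT_INODE
    for name in [p for p in path.split("/") if p]:
        entries = {}
        for block_id in inodes[inode_no]["blocks"]:
            entries.update(blocks[block_id])   # read the directory's entries
        inode_no = entries[name]
    return inode_no


def read_file(path):
    """Read every data block listed in the file's inode and join them."""
    inode = inodes[lookup(path)]
    return b"".join(blocks[b] for b in inode["blocks"])


print(lookup("/var/readme.txt"), read_file("/var/readme.txt"))
```

Because the inode lists all of the file's blocks up front, they can be fetched in any order, unlike the FAT chain.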

When the hard disk is formatted, the operating system automatically divides the hard disk into two zones.

One is the data area, storing the file data;
The other is the Inode area, which holds the information contained in the Inode.

When the inodes are used up, no new files can be written even though the data area still has free space.
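On a Unix-like system, the remaining inodes and the remaining space can be checked separately with the standard `os.statvfs` call. The sketch below inspects the root file system as an assumed example path; some file systems may not report meaningful inode counts.

```python
import os

stats = os.statvfs("/")              # "/" is an assumed example mount point

total_inodes = stats.f_files         # total number of inodes
free_inodes = stats.f_ffree          # inodes still available
free_bytes = stats.f_bavail * stats.f_frsize  # space available to ordinary users

print(f"inodes: {free_inodes} free of {total_inodes}")
print(f"space:  {free_bytes} bytes free")

if total_inodes and free_inodes == 0:
    print("No inodes left: new files cannot be created even if space remains")
```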

Summary: the FAT file system on Windows locates file blocks "serially" by following a chain, while the Linux ext file system locates them "in parallel" through the block list in the inode.

Next, let's look at distributed file systems.

Suppose the persistence layer that provides storage space is not one device but many, connected to each other over the network, with the data scattered across multiple storage devices. The metadata must then record not only which data blocks hold the data, but also which data node holds each block.

In that case the metadata would need to be stored on every data node and synchronized in real time, which is very difficult to do. If the metadata is instead split out onto a separate metadata server in a "master-slave" architecture, the data nodes no longer need to maintain metadata tables, which simplifies data maintenance and improves efficiency.

Hadoop's file system, HDFS, is a typical distributed file system.

Figure 8

1. The client splits FileA into 64M blocks, giving two blocks, Block1 and Block2.
2. The client sends a write request to the NameNode (purple dashed line 1).
3. The NameNode records the block information and returns the available DataNodes to the client (red dashed line 2).

Block1: host11, host22, host31

Block2: host11, host21, host32

4. The client sends Block1 to the DataNodes; the data is written as a stream.

Streaming write process:

1) The 64M Block1 is divided into 64K packets;

2) The client sends the first packet to host11;

3) After host11 receives the first packet, it forwards it to host22, while the client sends the second packet to host11;

4) host22 receives the first packet and forwards it to host31, while receiving the second packet from host11;

5) And so on, as shown by black dashed line 3, until all of Block1 has been sent;

6) host11, host22, and host31 notify the NameNode and the client that the data has been received;

7) After receiving these notifications, the client tells the NameNode that it has finished writing. Only then is the write really complete;

8) After Block1 has been sent, Block2 is sent to host11, host21, and host32, as shown by blue dashed line 4.

..........
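The packet forwarding in the streaming write can be sketched with a toy pipeline. This is not the HDFS client API: `DataNode`, `build_pipeline`, and `write_block` are hypothetical names, the host names come from the example, and the sketch forwards packets one after another rather than overlapping transfers as a real pipeline does. A 256 KB payload stands in for the 64 MB block to keep the example light.

```python
BLOCK_SIZE = 64 * 1024 * 1024    # 64 MB block, as in the example
PACKET_SIZE = 64 * 1024          # 64 KB packet
packets_per_block = BLOCK_SIZE // PACKET_SIZE


def split(data, size):
    """Cut data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DataNode:
    """Hypothetical data node that stores a packet and forwards it on."""

    def __init__(self, name):
        self.name = name
        self.received = []
        self.next_node = None

    def receive(self, packet):
        self.received.append(packet)
        if self.next_node is not None:
            self.next_node.receive(packet)   # pass it down the pipeline


def build_pipeline(names):
    nodes = [DataNode(n) for n in names]
    for upstream, downstream in zip(nodes, nodes[1:]):
        upstream.next_node = downstream
    return nodes


def write_block(block, pipeline):
    for packet in split(block, PACKET_SIZE):
        pipeline[0].receive(packet)          # the client only talks to the first node
    # every replica should now hold the complete block
    return all(b"".join(n.received) == block for n in pipeline)


block1 = bytes(256 * 1024)                   # small stand-in for a 64 MB block
pipeline = build_pipeline(["host11", "host22", "host31"])
ok = write_block(block1, pipeline)
print(ok, len(pipeline[-1].received), packets_per_block)
```

The client uploads each packet once; the replicas are produced by the data nodes forwarding to each other.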

HDFS is an early form of distributed storage; distributed storage will be described in detail later.

Third, storage media

There are many kinds of storage media in the warehouse. The most commonly used today are disks and SSDs, along with CDs, tape, and so on. Disks have long held the dominant position thanks to their cost-effectiveness.

A common disk consists of round magnetic platters sealed in a rectangular box. The platters are the media that actually hold the data, and heads "float" above the front and back surfaces of each platter.

A platter is divided into many concentric circles called tracks. Each track is divided into small sectors, and each sector stores 512B of data. As the platter spins at high speed and the head keeps changing tracks, data can be read or written.

In fact, the platter is responsible only for spinning at high speed, and the head only for moving laterally across the platter. Disk performance is determined mainly by the rotational speed of the platters, the time the head takes to change tracks, the capacity of each platter, and the interface speed. The higher the rotational speed, the shorter the track-change time, and the higher the capacity per platter, the better the disk performs.

Figure 9

Figure 10

Figure 11

Disk performance is measured mainly by two parameters, IOPS and throughput.

IOPS is the number of read and write operations the disk completes per second.

Throughput is the amount of data read or written per second.

These indicators depend on the conditions: large or small blocks, reads or writes, random or sequential access. The IOPS figures that disk manufacturers publish are generally measured with small blocks and sequential reads, so they are generally maximum values.
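The two parameters are linked by the block size: throughput is roughly IOPS multiplied by the size of each operation. The short sketch below makes that relation concrete; the figure of 100 operations per second is a made-up illustration, not a measurement, and the function name is hypothetical.

```python
def throughput_mb_per_s(iops, block_size_kb):
    """Approximate throughput as operations per second times block size."""
    return iops * block_size_kb / 1024


# Same hypothetical disk, same 100 operations per second, different block sizes
small_blocks = throughput_mb_per_s(iops=100, block_size_kb=4)
large_blocks = throughput_mb_per_s(iops=100, block_size_kb=1024)
print(small_blocks, large_blocks)
```

This is why an IOPS or throughput figure means little unless the block size and access pattern are stated with it.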

The performance of the SATA and SAS disks that we commonly use in x86 servers is as follows:

Figure 12

From production experience, we estimate that a 7,200 rpm SATA disk provides about 60 IOPS and about 70 MB/s of throughput.

The data persistence layer of our first SrvSAN, deployed in 2014 with about 2 PB of raw capacity, consisted of 57 x86 servers, each with 12 built-in 7,200 rpm 3 TB SATA HDDs. That is 684 disks in total, providing approximately 41,040 IOPS and 47.88 GB/s.
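The totals follow directly from the per-disk estimates, as this short calculation shows. The variable names are our own; the inputs are the figures quoted in the text, and 1 GB is taken as 1,000 MB.

```python
servers = 57
disks_per_server = 12
iops_per_disk = 60           # estimated IOPS of one 7,200 rpm SATA disk
mb_per_s_per_disk = 70       # estimated throughput of one disk in MB/s
disk_capacity_tb = 3

total_disks = servers * disks_per_server
total_iops = total_disks * iops_per_disk
total_gb_per_s = total_disks * mb_per_s_per_disk / 1000
raw_capacity_tb = total_disks * disk_capacity_tb

print(total_disks, total_iops, total_gb_per_s, raw_capacity_tb)
```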

These figures are clearly not enough to meet storage needs, so we have to find ways to "accelerate".

Mechanical disks also include many optimizations. For example, sector addresses are not numbered sequentially.

The platter spins very fast (7,200 rpm is 120 revolutions per second, so one revolution takes 8.3 milliseconds, which means the maximum rotational delay when reading or writing the same track is 8.3 milliseconds). To keep the head from missing the next sector while reading or writing, sector addresses are not numbered consecutively but interleaved, for example with a 2:1 interleave factor (1, 10, 2, 11, 3, 12 ...).
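Both numbers can be reproduced with a short sketch. The rotation time is plain arithmetic; the interleave layout assumes a track of 17 sectors, which is our own assumption chosen because it yields the sequence quoted above, and the function names are hypothetical.

```python
def rotation_time_ms(rpm):
    """Time for one full revolution, in milliseconds."""
    return 60_000 / rpm


def interleave_layout(sector_count, factor):
    """Physical order of logical sector numbers for a given interleave factor."""
    slots = [None] * sector_count
    pos = 0
    for sector in range(1, sector_count + 1):
        while slots[pos] is not None:        # skip slots that are already taken
            pos = (pos + 1) % sector_count
        slots[pos] = sector
        pos = (pos + factor) % sector_count  # leave a gap before the next sector
    return slots


print(round(rotation_time_ms(7200), 1))
print(interleave_layout(sector_count=17, factor=2))
```

With a 2:1 factor, logically consecutive sectors sit two physical slots apart, giving the controller one sector's worth of time to get ready for the next.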

The disk also has a cache and a queue. It does not write each I/O as it arrives but accumulates a number of I/Os and orders them based on the head position and a scheduling algorithm. I/O is therefore not necessarily "first come, first served" but served in the most efficient order.

The best way to accelerate is to use SSDs. A disk consists of a mechanical part plus a control circuit, and the speed limits of the mechanical part prevent any big breakthrough in disk performance. An SSD is fully electronic and therefore achieves very good performance.

An SSD is a storage device that uses flash memory as the storage medium together with a suitable controller chip. Three types of NAND flash are currently used to make solid state drives:

Single-level cell (SLC, stores 1 bit per cell)

Multi-level cell (MLC, stores 2 bits per cell, that is, 4 states)

Triple-level cell (TLC, stores 3 bits per cell, that is, 8 states)

SLC has the highest cost, the longest life, and the fastest access; TLC has the lowest cost, the shortest life, and the slowest access. To reduce cost, the enterprise-class SSDs used in servers are MLC, while TLC is used in USB flash drives.

SSDs also have drawbacks: higher cost, a limited number of writes, unrecoverable data after damage, and performance that slows as the write count rises or as the drive approaches full.

Corresponding to the sector, the smallest I/O unit of a disk, the page is the smallest I/O unit of an SSD.

For example, each page stores 512B of data plus 218b of error-correcting code, 128 pages make up a block (64KB), 2048 blocks make up a region, and a flash chip consists of 2 regions. The larger the page, the greater the capacity of the flash chip.

But SSDs have a bad habit: modifying the data in a single page affects the whole block. The entire block that the page belongs to must be read into the cache, the block is then erased so that every bit is 1, and finally the data is written back from the cache.
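The read, erase, and write-back cycle can be modeled in a few lines. This is a deliberately simplified sketch of the behavior described above, not a model of any real SSD controller; the `FlashBlock` class is hypothetical and uses the page and block sizes from the example.

```python
PAGE_SIZE = 512          # bytes of data per page, as in the example
PAGES_PER_BLOCK = 128    # 128 pages * 512 B = 64 KB per block


class FlashBlock:
    """Hypothetical flash block that can only be erased as a whole."""

    def __init__(self):
        self.pages = [b"\xff" * PAGE_SIZE for _ in range(PAGES_PER_BLOCK)]
        self.erase_count = 0

    def erase(self):
        # Erasing sets every bit in the block to 1
        self.pages = [b"\xff" * PAGE_SIZE for _ in range(PAGES_PER_BLOCK)]
        self.erase_count += 1

    def rewrite_page(self, index, data):
        if len(data) > PAGE_SIZE:
            raise ValueError("data does not fit in one page")
        cached = list(self.pages)                        # 1. read the block into cache
        cached[index] = data.ljust(PAGE_SIZE, b"\xff")   # 2. change one page in cache
        self.erase()                                     # 3. erase the whole block
        self.pages = cached                              # 4. write everything back


block = FlashBlock()
block.rewrite_page(3, b"new data")
print(block.erase_count, block.pages[3][:8])
```

Changing a few bytes costs one erase of the full 64 KB block, which is why the limited write count matters.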

For SSDs speed may not be a problem, but the number of writes is limited, so a bigger block is not necessarily better. Mechanical disks have a similar trade-off: the larger the block, the faster reads and writes are, but the more space is wasted, because data that does not fill a block still occupies a whole block.

SSD performance varies greatly between models. The SSD we use as the cache for our distributed block storage has the following parameters:

PCIe 2.0 interface, 1.2T capacity, 260,000 mixed read/write IOPS (4K blocks), 1.55 GB/s read throughput, and 1 GB/s write throughput.

Each SrvSAN server is configured with one SSD as the cache and twelve 7,200 rpm 3T SATA disks. The disks provide only 1,200 IOPS and 1,200 MB/s of throughput, far less than the cache SSD provides, so serving requests directly from the cache gives high storage performance. The key to SrvSAN is the algorithm that identifies hot data and raises its cache hit rate.

In short, use high-cost SSD as the cache and inexpensive SATA disks as the capacity layer.
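To show what a cache hit rate means in practice, here is a minimal sketch of a least-recently-used (LRU) cache in front of a slow capacity layer. The article does not say which hot-data algorithm SrvSAN uses, so LRU is only a common stand-in chosen for illustration; the `LRUCache` class and the sample access pattern are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical SSD-like cache that evicts the least recently used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, block_id, load_from_disk):
        if block_id in self.entries:
            self.entries.move_to_end(block_id)    # mark as most recently used
            self.hits += 1
            return self.entries[block_id]
        self.misses += 1
        data = load_from_disk(block_id)           # slow path: capacity layer
        self.entries[block_id] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict the coldest block
        return data

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = LRUCache(capacity=2)
for block_id in [1, 2, 1, 1, 3, 1, 2]:
    cache.read(block_id, load_from_disk=lambda b: f"block-{b}")

print(cache.hits, cache.misses, round(cache.hit_rate(), 2))
```

Block 1 is read repeatedly and stays in the cache, while blocks 2 and 3 push each other out; a better hot-data policy would raise the share of reads that never reach the SATA disks.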


This article is an English version of an article originally published in Chinese on aliyun.com.
