Distributed Storage Weed-FS: Source Code Analysis


This analysis is based on source version 0.67. Weed-fs is also named Seaweed-fs.

Weed-fs is a very good open source distributed storage project developed in Go. Although it had only 50+ stars on github.com when I first started following it, I think it is an excellent project of the thousands-of-stars magnitude. Weed-fs's design is based on Facebook's image storage paper, Facebook Haystack. The paper is long, but the principle can actually be stated in a few sentences; you can have a look at Facebook's image storage system Haystack. In my opinion, weed-fs is a case of the student surpassing the master.

Weed-fs covers a lot of ground, and it is hard to explain everything clearly in one article, so I will only cover the main parts as clearly as possible.

Source Directory Structure

Core modules

    • weed: the entry directory
    • weed/weed_server: the entry directory for the HTTP services
    • topology: core module, mainly the three-layer topology structure of "DataCenter, Rack, DataNode"
    • storage: core module, mainly the three large chunks of storage-related source: "Store, Volume, Needle"

Auxiliary modules

    • sequence: responsible for globally ordered generation of FileIds
    • filer: provides a file server supporting HTTP REST operations; it actually stores file names and directory structure in LevelDB
    • stats: modules related to operating system memory and disk usage
    • operation: code generated by protobuf
    • proto: stores the protobuf description files
    • glog: the log module
    • images: the image service
    • util: utility functions
    • tools: tools; for now there is only one, for reading index files

Topology: maintenance of the data node topology

The Topology module has three core data structures:

    • DataCenter
    • Rack
    • DataNode

The topology is a tree structure. DataNode is the leaf node of the tree, while DataCenter and Rack are non-leaf nodes, with DataCenter being the parent node of Rack. For example:

            DataCenter
                |
                |
       ------------------
       |                |
       |                |
      Rack            Rack
       |
       |
   ------------
   |          |
   |          |
DataNode  DataNode

That is, the topology maintained in the MasterServer stores the VolumeServer-related information in DataNode, so code like the following can be seen:

dc := t.GetOrCreateDataCenter(dcName)
rack := dc.GetOrCreateRack(rackName)
dn := rack.FindDataNode(*joinMessage.Ip, int(*joinMessage.Port))

Each time the corresponding DataNode is needed, the lookup walks down through DataCenter, then Rack, then DataNode.

Data storage

Understanding the Fid

curl -F "file=@/tmp/test.pdf" "127.0.0.1:9333/submit"{"fid":"1,01f96b93eb","fileName":"test.pdf","fileUrl":"localhost:8081/1,01f96b93eb","size":548840}

Among them "fid":"1,01f96b93eb" is Fid,fid consists of three parts "Volumeid, Needleid, Cookie" composition.

    • VolumeId: 1 — the ID of the physical volume, stored as 32 bits.
    • NeedleId: 01 — a 64-bit globally unique NeedleId; every stored file has a different one (except for the backups of the same file).
    • Cookie: f96b93eb — a 32-bit cookie value, used for security to prevent malicious guessing attacks.

The VolumeId is assigned to the VolumeServer by the MasterServer. Each VolumeServer maintains n Volumes, and each Volume has an exclusive VolumeId; this is elaborated later. A Needle is a unit inside a Volume, also described later.
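As a minimal sketch (not the project's actual parsing code), a Fid string such as "1,01f96b93eb" can be split into these three parts as follows:

import (
    "fmt"
    "strconv"
    "strings"
)

// parseFid is an illustrative helper that splits a Fid such as "1,01f96b93eb"
// into VolumeId "1", NeedleId "01" and Cookie "f96b93eb".
func parseFid(fid string) (volumeId uint32, needleId uint64, cookie uint32, err error) {
    parts := strings.SplitN(fid, ",", 2)
    if len(parts) != 2 || len(parts[1]) <= 8 {
        return 0, 0, 0, fmt.Errorf("malformed fid: %s", fid)
    }
    vid, err := strconv.ParseUint(parts[0], 10, 32) // VolumeId, decimal
    if err != nil {
        return 0, 0, 0, err
    }
    rest := parts[1]
    nid, err := strconv.ParseUint(rest[:len(rest)-8], 16, 64) // NeedleId, hex
    if err != nil {
        return 0, 0, 0, err
    }
    ck, err := strconv.ParseUint(rest[len(rest)-8:], 16, 32) // Cookie, last 8 hex chars
    if err != nil {
        return 0, 0, 0, err
    }
    return uint32(vid), nid, uint32(ck), nil
}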

Volume

type Volume struct {
    Id         VolumeId
    dir        string
    Collection string
    dataFile   *os.File
    nm         NeedleMapper
    readOnly   bool

    SuperBlock

    accessLock       sync.Mutex
    lastModifiedTime uint64 //unix time in seconds
}
    • Id is easy to understand: in "fid":"3,01f9896771", the 3 in front of the comma is the VolumeId.
    • dir is the directory where the Volume resides.
    • Collection is very useful: each Volume can only correspond to one Collection, and files of different Collections are stored in different Volumes, which is mentioned later.
    • So the same Volume can only serve one Collection, while the files of the same Collection may be distributed across different Volumes. dataFile is the corresponding file handle.
    • nm NeedleMapper looks like a map, but is actually a list of multiple Needles, which is covered later.
    • readOnly: whether the Volume is read-only.
    • SuperBlock: the super block, discussed later.
    • accessLock: a mutual exclusion lock.
    • lastModifiedTime: the last modified time.

The two key points above are SuperBlock and NeedleMapper. The layout of a Volume data file is as follows:

+-------------+
|SuperBlock   |
+-------------+
|Needle1      |
+-------------+
|Needle2      |
+-------------+
|Needle3      |
+-------------+
|Needle ...   |
+-------------+

1 Volume = 1 SuperBlock + n Needles

SuperBlock

/*
* Super block currently has 8 bytes allocated for each volume.
* Byte 0: version, 1 or 2
* Byte 1: Replica Placement strategy, 000, 001, 002, 010, etc
* Byte 2 and byte 3: Time to live. See TTL for definition
* Rest bytes: Reserved
*/
type SuperBlock struct {
    version          Version
    ReplicaPlacement *ReplicaPlacement
    Ttl              *TTL
}

The data maintained in the SuperBlock is essentially the metadata of the Volume.

    • ReplicaPlacement: discussed later in the Replication section
    • Ttl: time to live, used for the timed deletion feature
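Based only on the byte layout in the comment above, a minimal sketch of producing the 8 superblock bytes might look as follows; the function and its parameters are illustrative, not the project's actual API:

// superBlockBytes builds the 8-byte header described in the comment above:
// byte 0 is the version, byte 1 the replica placement strategy, bytes 2-3 the
// TTL, and bytes 4-7 are reserved.
func superBlockBytes(version byte, replicaPlacement byte, ttl [2]byte) []byte {
    header := make([]byte, 8)
    header[0] = version          // byte 0: version, 1 or 2
    header[1] = replicaPlacement // byte 1: replica placement strategy (000, 001, 010, ...)
    header[2] = ttl[0]           // bytes 2-3: time to live
    header[3] = ttl[1]
    return header                // bytes 4-7: reserved, left as zero
}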

"TTL"

The timed deletion feature feels cool, but its implementation in weed-fs is very simple: Volumes are partitioned by TTL. When a user uploads a file carrying a TTL (a file that should be deleted after some time), the file is stored in an appropriate Volume (how the appropriate Volume is selected is described later). Each file is stored together with its TTL attribute; when it is read back and found to have expired, a Not Found result is returned. Each Volume also maintains a maximum expiry time. When that time arrives, every file in the Volume has expired, the VolumeServer notifies the MasterServer, and the Volume is marked as Dead, which means the MasterServer will no longer assign new Fids to it. Then, after a reasonable safety interval, the VolumeServer removes the Volume from disk. For details, see the ttl document in weed-fs/docs.
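As a rough illustration of the read-path check described above, the sketch below (with assumed names, not the actual weed-fs code) reports a needle as expired once its TTL has elapsed, which the server then surfaces as a Not Found response:

// isExpired is a hypothetical helper mirroring the behaviour described above:
// a needle whose TTL has elapsed is treated as not found, even though its
// bytes may still sit on disk until the whole Volume is reclaimed.
func isExpired(lastModified uint64, ttlSeconds uint64, now uint64) bool {
    if ttlSeconds == 0 {
        return false // no TTL set: the file never expires
    }
    return now > lastModified+ttlSeconds
}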

Needle

/*
* A Needle means an uploaded and stored file.
* Needle file size is limited to 4GB for now.
*/
type Needle struct {
    Cookie uint32 `comment:"random number to mitigate brute force lookups"`
    Id     uint64 `comment:"needle id"`
    Size   uint32 `comment:"sum of DataSize,Data,NameSize,Name,MimeSize,Mime"`

    Data         []byte `comment:"The actual file data"`
    DataSize     uint32 `comment:"Data size"` //version2
    Flags        byte   `comment:"boolean flags"` //version2
    NameSize     uint8  //version2
    Name         []byte `comment:"maximum 256 characters"` //version2
    MimeSize     uint8  //version2
    Mime         []byte `comment:"maximum 256 characters"` //version2
    LastModified uint64 //only store LastModifiedBytesLength bytes, which is 5 bytes to disk
    Ttl          *TTL

    Checksum CRC    `comment:"CRC32 to check integrity"`
    Padding  []byte `comment:"Aligned to 8 bytes"`
}

The Cookie and Id inside the Needle structure are the Cookie and NeedleId of the Fid mentioned above; the rest are storage-related fields, nothing special. It is simply a storage structure.
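The Padding field's comment ("Aligned to 8 bytes") means the on-disk needle record is padded to a multiple of 8 bytes. A minimal sketch of that calculation, with the header and checksum sizes assumed here for illustration, could be:

// The sizes below are assumptions for illustration: a fixed needle header
// (cookie + id + size), a CRC32 checksum, and 8-byte alignment.
const (
    needleHeaderSize   = 16 // cookie (4) + id (8) + size (4)
    needleChecksumSize = 4  // CRC32
    needlePaddingSize  = 8
)

// paddingLength returns how many padding bytes are appended so that the
// on-disk needle record length is a multiple of 8 bytes.
func paddingLength(dataSize uint32) uint32 {
    total := needleHeaderSize + dataSize + needleChecksumSize
    if rest := total % needlePaddingSize; rest != 0 {
        return needlePaddingSize - rest
    }
    return 0
}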

Replication: data backup

Replication is heavily tied to the topology. Multiple backup modes can be configured in the configuration, as described in weed-fs/docs.

+-----+-------------------------------------------------------------+
| 001 | replicate once on the same rack                             |
+-----+-------------------------------------------------------------+
| 010 | replicate once on a different rack in the same data center  |
+-----+-------------------------------------------------------------+
| 100 | replicate once on a different data center                   |
+-----+-------------------------------------------------------------+
| 200 | replicate twice on two other different data centers         |
+-----+-------------------------------------------------------------+

For example, in 001 mode a copy is backed up on a different DataNode in the same rack. Assume rack1 contains three data nodes, DataNode1, DataNode2 and DataNode3: two of them are selected "randomly", say DataNode1 and DataNode2, and the data is written to both. If rack1 has only one data node and the backup mode is 001, the backup cannot be performed and the service reports an error.

Note that the backup data nodes are chosen "randomly": two of the three data nodes are selected at random. For example:

curl -v -F "file=@/tmp/test.json" localhost:8081/5,1ce2111f1
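As a hypothetical sketch of the random selection described above (not the project's actual placement code), picking two distinct data nodes from a rack could look like this:

import "math/rand"

// pickTwoNodes randomly picks two distinct data nodes out of a rack's node
// list, as in the 001 example above. It assumes len(nodes) >= 2; with fewer
// nodes the placement fails, just like the error case described earlier.
func pickTwoNodes(nodes []string) (string, string) {
    perm := rand.Perm(len(nodes)) // a random permutation of the indexes
    return nodes[perm[0]], nodes[perm[1]]
}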

Topo.NextVolumeId is responsible for generating VolumeIds, and VolumeGrowth is responsible for allocating Volumes. Generating a globally unique new VolumeId matters because weed-fs supports multi-MasterServer clusters: when there are multiple MasterServers, the new VolumeId must still be globally unique, and in weed-fs this is achieved through goraft.

"Strong consistency"

The backup implementation of weed-fs is strongly consistent. When a VolumeServer accepts a POST request to upload a file, it writes the file as a Needle into the local Volume, determines from the VolumeId assigned to the file whether a backup is required, and if so forwards the request to the other VolumeServers. This process is described in ReplicatedWrite (topology/store_replicate.go). Only when the backups are complete does it reply to the POST request. So every time users upload a file, once they receive the response they can assume the backups are complete. This is different from eventual consistency; it is strong consistency.
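A simplified sketch of that flow follows; the helper signatures are assumptions for illustration, while the real logic lives in ReplicatedWrite in topology/store_replicate.go: write locally first, forward to every replica synchronously, and only report success once all replicas have acknowledged.

// replicatedWrite is a hypothetical outline of the strongly consistent write
// described above: the local write happens first, then each replica is
// written synchronously, and any failure fails the whole upload.
func replicatedWrite(localWrite func() error, replicaURLs []string, forward func(url string) error) error {
    if err := localWrite(); err != nil {
        return err
    }
    for _, url := range replicaURLs {
        if err := forward(url); err != nil {
            return err // one failed replica fails the whole POST
        }
    }
    return nil // the client sees success only after every replica is written
}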

In implementing strong consistency, one prerequisite is that "the VolumeServer needs to know which other VolumeServers hold backups of the same Volume." In weed-fs this is implemented with the help of the MasterServer: because the basic unit of backup is the Volume, the MasterServer maintains the list of backup machines for each VolumeId. This can be viewed with the following example command:

curl "localhost:9333/dir/lookup?volumeId=4&pretty=y"{  "volumeId": "4",  "locations": [    {      "url": "127.0.0.1:8081",      "publicUrl": "localhost:8081"    },    {      "url": "127.0.0.1:8080",      "publicUrl": "localhost:8080"    }  ]}

As can be seen above, the Volume with volumeId=4 has two machines in its backup list: "127.0.0.1:8081" and "127.0.0.1:8080".

In fact, each VolumeServer also finds the other backup machines by querying the MasterServer through an HTTP API like the one above. It just does not ask every time: once the answer has been obtained it is cached, and the MasterServer is queried again only when the entry cannot be found in the cache.
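A minimal sketch of this lookup-with-cache behaviour (the cache and function names are assumptions for illustration):

// locationCache is a hypothetical cache of volumeId -> replica URLs; the real
// implementation differs, and this sketch is not safe for concurrent use.
var locationCache = map[string][]string{}

func lookupVolume(volumeId string, askMaster func(vid string) ([]string, error)) ([]string, error) {
    if urls, ok := locationCache[volumeId]; ok {
        return urls, nil // cache hit: no request to the MasterServer
    }
    urls, err := askMaster(volumeId) // cache miss: ask the master's /dir/lookup
    if err != nil {
        return nil, err
    }
    locationCache[volumeId] = urls
    return urls, nil
}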

"Collection"

Examples are as follows:

Start the MasterServer

weed master

Start the VolumeServer

weed volume -dir="/tmp/data1" -max=5  -mserver="localhost:9333" -port=8080

Apply for a Fid

curl "http://127.0.0.1:9333/dir/assign?collection=pictures"{"fid":"4,01d50c6fbf","url":"127.0.0.1:8080","publicUrl":"localhost:8080","count":1}
curl "http://127.0.0.1:9333/dir/assign?collection=mp3"{"error":"No free volumes left!"}
curl "http://127.0.0.1:9333/dir/assign?collection=pictures"{"fid":"5,0147ed0fb7","url":"127.0.0.1:8080","publicUrl":"localhost:8080","count":1}

The Fid application example above is explained below:

    1. By default, no Volume is allocated when the VolumeServer starts. The first time /dir/assign is called, Volumes are allocated; because weed volume was started with -max=5, 5 Volumes are allocated at once, and the Collection attribute of all 5 Volumes is pictures. This can even be seen in the result of ls /tmp/data1:

/tmp/data1:
pictures_1.dat pictures_1.idx pictures_2.dat pictures_2.idx pictures_3.dat
pictures_3.idx pictures_4.dat pictures_4.idx pictures_5.dat pictures_5.idx

You can see that the file name of each Volume is prefixed with its Collection name.

    2. Because the Collection attribute of the existing 5 Volumes is pictures, applying via /dir/assign for a Fid of a different Collection (mp3) fails at this point.

    3. Applying for a Fid that belongs to the pictures Collection succeeds.

That is, every time a Fid is applied for, the Collection is checked, ensuring that every Needle stored in a Volume belongs to the same Collection as that Volume. In real-world applications, Collections can be used for sharding by category.
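A minimal sketch of the Collection check described above (the types and function are assumptions, not the master's actual layout code):

// volumeInfo and pickWritableVolume are illustrative only.
type volumeInfo struct {
    Id         uint32
    Collection string
    Writable   bool
}

// pickWritableVolume only considers volumes whose Collection matches the one
// requested in /dir/assign; if none is found, the caller reports an error
// like the "No free volumes left!" response shown above.
func pickWritableVolume(volumes []volumeInfo, collection string) (uint32, bool) {
    for _, v := range volumes {
        if v.Writable && v.Collection == collection {
            return v.Id, true
        }
    }
    return 0, false
}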

"Size limit for Volume"

Each time the VolumeServer sends a heartbeat message to the MasterServer, it reports the current size of each Volume in storage.VolumeInfo.Size. It is therefore possible to limit the size of a Volume, with the following function:

func (vl *VolumeLayout) isWritable(v *storage.VolumeInfo) bool {
    return uint64(v.Size) < vl.volumeSizeLimit &&
        v.Version == storage.CurrentVersion &&
        !v.ReadOnly
}

When VolumeInfo.Size is greater than VolumeLayout.volumeSizeLimit, the Volume is marked as not writable. The value of VolumeLayout.volumeSizeLimit can be configured when the MasterServer is started; from weed help master:

-volumeSizeLimitMB=30000: Master stops directing writes to oversized volumes.

The maximum size of each Volume is 30 GB by default, and each VolumeServer can be configured with n Volumes according to the sizes of the machine's disks; from weed help volume:

-max="7": maximum numbers of volumes, count[,count]...

The default maximum number of Volumes for each VolumeServer is 7.

So by default, when a VolumeServer uses more than 7 * 30 GB = 210 GB of disk, the VolumeServer becomes read-only and the MasterServer no longer assigns new Fids to it.

But in fact there is a loophole: if at this point you do not request a Fid from the MasterServer but construct the Fid yourself and POST the file directly to the VolumeServer, the VolumeServer will still accept the uploaded file, until the size exceeds a constant hard-coded in storage/needle.go:

MaxPossibleVolumeSize = 4 * 1024 * 1024 * 1024 * 8

In fact, the VolumeServer also maintains a variable called volumeSizeLimit:

type Store struct {
    ...
    volumeSizeLimit uint64 //read from the master
    ...
}

The value of this variable is obtained from the MasterServer. Each time the VolumeServer writes a Needle to a Volume, it checks whether the Volume size exceeds volumeSizeLimit; when it does, an error log is written, but the uploaded file is not rejected. The write is rejected only when the size exceeds MaxPossibleVolumeSize.
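A simplified sketch of that check (the function name is an assumption; the constant is the one quoted above):

import (
    "fmt"
    "log"
)

// maxPossibleVolumeSize is the hard limit quoted above from storage/needle.go.
const maxPossibleVolumeSize = 4 * 1024 * 1024 * 1024 * 8

// checkVolumeSize is a hypothetical outline of the behaviour described above:
// exceeding the volumeSizeLimit learned from the master only logs an error,
// while exceeding MaxPossibleVolumeSize actually rejects the write.
func checkVolumeSize(currentSize, needleSize, volumeSizeLimit uint64) error {
    if currentSize+needleSize > maxPossibleVolumeSize {
        return fmt.Errorf("volume is over the maximum possible size, rejecting the write")
    }
    if currentSize+needleSize > volumeSizeLimit {
        log.Printf("volume size %d exceeds the limit %d reported by the master", currentSize, volumeSizeLimit)
        // the write is still accepted; only the log complains
    }
    return nil
}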

Expansion

For weed-fs, expansion is very simple.

Start the MasterServer:

./weed master -mdir="/tmp/weed_master_tmp"

Start VolumeServer1:

weed volume -dir="/tmp/data1" -max=5  -mserver="localhost:9333" -port=8080

When VolumeServer1 can no longer accept uploads because it has reached the Volume size limit, submitting upload data through the MasterServer returns an error (because VolumeServer1 is marked as not writable in the MasterServer):

curl -F "file=@/tmp/test.pdf" "127.0.0.1:9333/submit"{"error":"No free volumes left!"}

VolumeServer2 can then be started directly:

weed volume -dir="/tmp/data2" -max=5  -mserver="localhost:9333" -port=8081

When VolumeServer2 starts, its new DataNode is automatically registered in the MasterServer's topology, and when the MasterServer accepts a new submit request, the uploaded file is written to VolumeServer2 (because VolumeServer1 is not writable at this time).

In other words, if there is a capacity problem in production, expansion only requires adding machines: simple and effective.

Summary

    • Each MasterServer maintains multiple VolumeServers through the topology.
    • Each VolumeServer maintains multiple Volumes.
    • Each Volume contains multiple Needles; a Needle is a file.
    • Multi-machine backup between multiple VolumeServers is strongly consistent.
    • The master-slave relationship between multiple MasterServers is achieved through goraft.