A Golang-based open-source distributed storage project

Project Address: https://code.google.com/p/weed-fs/

Weed-fs is a simple, high-performance distributed storage system with two goals:

1. Store a massive number of files
2. Serve the saved files fast

Weed-fs uses a key-to-file mapping for file addressing rather than the mechanisms a POSIX filesystem already provides, which makes it somewhat like a NoSQL system; you could call it "NoFS".

Weed-fs manages file metadata on the volume servers instead of on a central node: each volume server manages its own files and their metadata. This greatly relieves the pressure on the central node. The metadata is kept in the volume server's memory, and files are automatically compressed with gzip, which keeps file access fast.

Weed-fs's theoretical model is based on Facebook's Haystack design paper: http://www.usenix.org/event/osdi10/tech/full_papers/Beaver.pdf

How to use

The master node of Weed-fs runs on port 9333 by default, and volume nodes run on port 8080. For example, you can start one master node and two volume nodes as follows. Here all nodes run on localhost, but they can be spread across multiple machines.

Start the master node:

./weed master

Start the volume nodes:

weed volume -dir="/tmp" -volumes=0-4 -mserver="localhost:9333" -port=8080 -publicUrl="localhost:8080" &
weed volume -dir="/tmp/data2" -volumes=5-7 -mserver="localhost:9333" -port=8081 -publicUrl="localhost:8081" &

Write a file

Here is a simple example of saving a file.

The first step is to send an HTTP GET request to the master to obtain a fid and a volume server URL:

curl http://localhost:9333/dir/assign
{"fid":"3,01637037d6","url":"127.0.0.1:8080","publicUrl":"localhost:8080"}

The second step is to send an HTTP multipart POST request to url + '/' + fid to store the actual file:

curl -F file=@/home/chris/myphoto.jpg http://127.0.0.1:8080/3,01637037d6
{"size": 43234}
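
The same two-step flow can be written as a short Go program. The sketch below is only an illustration under the assumptions already shown (a local master on port 9333, the JSON field names above, and the example file path); it is not code from the weed-fs project itself.

// Minimal sketch of the two-step write flow: ask the master for a fid,
// then multipart-POST the file to the returned volume server.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
)

// assignResult mirrors the JSON shape shown above for /dir/assign.
type assignResult struct {
	Fid       string `json:"fid"`
	URL       string `json:"url"`
	PublicURL string `json:"publicUrl"`
}

func main() {
	// Step 1: GET a file id and volume server location from the master.
	resp, err := http.Get("http://localhost:9333/dir/assign")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	var a assignResult
	if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
		panic(err)
	}

	// Step 2: multipart POST the actual file to url + "/" + fid.
	f, err := os.Open("/home/chris/myphoto.jpg")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, _ := w.CreateFormFile("file", "myphoto.jpg")
	io.Copy(part, f)
	w.Close()

	post, err := http.Post("http://"+a.URL+"/"+a.Fid, w.FormDataContentType(), &buf)
	if err != nil {
		panic(err)
	}
	defer post.Body.Close()
	body, _ := io.ReadAll(post.Body)
	fmt.Printf("stored as fid %s: %s\n", a.Fid, body)
}

Note that Go's JSON decoder matches field names case-insensitively, so minor differences in key casing (fid vs FID) do not matter here.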

Save File ID

You can save the fid, 3,01637037d6 in this example, in any database. The 3 before the comma is the volume id, an unsigned 32-bit integer. The 01 after the comma is the file key, an unsigned 64-bit integer. The trailing 637037d6 is a file cookie, an unsigned 32-bit integer, which protects the URL from guessing. The file key and cookie are both hex-encoded. You can store the tuple in your own format, or simply keep the fid as a string; in theory you would need 8+1+16+8 = 33 bytes.
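
To make the layout concrete, here is a small Go sketch that splits a fid into its three parts. The parsing rule (decimal volume id before the comma, then a hex string whose last 8 digits are the cookie) follows the description above; this is an illustrative helper, not the project's actual parser.

// Hypothetical fid parser for strings like "3,01637037d6".
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func parseFid(fid string) (volumeID uint32, key uint64, cookie uint32, err error) {
	parts := strings.SplitN(fid, ",", 2)
	if len(parts) != 2 || len(parts[1]) <= 8 {
		return 0, 0, 0, fmt.Errorf("malformed fid: %q", fid)
	}
	// Volume id is decimal, before the comma.
	v, err := strconv.ParseUint(parts[0], 10, 32)
	if err != nil {
		return 0, 0, 0, err
	}
	rest := parts[1]
	// File key: everything after the comma except the last 8 hex digits.
	k, err := strconv.ParseUint(rest[:len(rest)-8], 16, 64)
	if err != nil {
		return 0, 0, 0, err
	}
	// Cookie: the last 8 hex digits.
	c, err := strconv.ParseUint(rest[len(rest)-8:], 16, 32)
	if err != nil {
		return 0, 0, 0, err
	}
	return uint32(v), k, uint32(c), nil
}

func main() {
	v, k, c, err := parseFid("3,01637037d6")
	if err != nil {
		panic(err)
	}
	fmt.Printf("volume=%d key=%d cookie=%x\n", v, k, c)
}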

Read file

Here is an example of how to look up a file's URL by its volume id:

curl http://localhost:9333/dir/lookup?volumeId=3
{"Url":"127.0.0.1:8080","PublicUrl":"localhost:8080"}

First, the volume server's URL is looked up by volumeId. Since volume servers are usually few and do not change often, you can cache the lookup result. Now you can load the file from the volume server via its URL:

http://localhost:8080/3,01637037d6.jpg
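
Here is a minimal Go sketch of this read path with the suggested caching, assuming the single-location JSON shape shown above (again an illustration, not project code):

// Resolve a volume id via the master once, cache it, then fetch the file
// directly from the volume server.
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

type location struct {
	URL       string `json:"url"`
	PublicURL string `json:"publicUrl"`
}

// cache maps volume id -> volume server location; volume servers are few
// and change rarely, so entries can be kept for a long time.
var cache = map[string]location{}

func lookupVolume(volumeID string) (location, error) {
	if loc, ok := cache[volumeID]; ok {
		return loc, nil
	}
	resp, err := http.Get("http://localhost:9333/dir/lookup?volumeId=" + volumeID)
	if err != nil {
		return location{}, err
	}
	defer resp.Body.Close()
	var loc location
	if err := json.NewDecoder(resp.Body).Decode(&loc); err != nil {
		return location{}, err
	}
	cache[volumeID] = loc
	return loc, nil
}

func main() {
	loc, err := lookupVolume("3")
	if err != nil {
		panic(err)
	}
	resp, err := http.Get("http://" + loc.PublicURL + "/3,01637037d6.jpg")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	data, _ := io.ReadAll(resp.Body)
	fmt.Printf("fetched %d bytes\n", len(data))
}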

Architecture

In most distributed storage systems, a file is split into many chunks, and a central node stores the mapping from file names to chunk indexes, where a chunk index holds the chunk server and chunk handle information. This design cannot handle large numbers of small files efficiently, and because every access request goes through the master node, responses are slow under high concurrency.

In Weed-fs, volume servers manage the data. Each volume is 32GB and can hold a large number of files, and each storage node can have multiple volumes. The master node only needs to manage volume metadata, while the actual file metadata is stored in each volume. Since each metadata entry is only 16 bytes, all file lookups can be served from memory; the only disk operation is reading the actual file.

Master server and volume server

The architecture is very simple: the actual data is stored in volumes, a volume server holds multiple volumes, and each volume supports both read and write operations. All volumes are managed by the master server, which holds the mapping between volumes and volume servers; this mapping is fairly static and can easily be cached.

For each write request, the master server also generates a key for the file. Because writes are much less frequent than reads, one master server can handle a large number of requests.

Read and write files

When a client issues a write request, the master server returns a fid and the volume server's URL; the client then sends a POST request to that volume node, transferring the file contents RESTfully. When the client needs to read a file, it obtains the volume server's URL from the master server (or from its cache) and finally fetches the content through the public URL.

Storage size

In the current design, each volume can store 32GB of data, so the size of a single file is limited by the volume size; this capacity can be adjusted in the code.

Memory storage

All the metadata on a volume server is kept in memory, so lookups never need to read from disk; each entry in the mapping table is only 16 bytes.
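
For illustration, one plausible shape for such a 16-byte entry, following the Haystack design cited earlier (an assumption, not the project's actual struct), is an 8-byte file key plus a 4-byte offset and a 4-byte size:

// Hypothetical 16-byte in-memory index record (8 + 4 + 4 bytes),
// modeled on the Haystack paper; not the project's actual definition.
package main

import "fmt"

type needleEntry struct {
	Key    uint64 // file key assigned by the master
	Offset uint32 // position of the file inside the volume file
	Size   uint32 // stored size of the file
}

func main() {
	// One entry per file: a million files cost only ~16 MB of RAM, so a
	// lookup never touches disk; the disk is read once, for the file itself.
	index := map[uint64]needleEntry{
		1: {Key: 1, Offset: 0, Size: 43234},
	}
	fmt.Println("entries in memory:", len(index))
}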

Comparison with similar products

HDFS splits up large files and handles reading and writing of big files well. Weed-fs is aimed at small files, pursuing higher speed and concurrency.

MogileFS has three layers of components: trackers, database, storage nodes. Weed-fs has two layers: directory server and storage nodes. One extra layer means slower access, more complex operation, and a higher probability of error.

GlusterFS is fully POSIX-compliant and therefore more complex; Weed-fs is only partially POSIX-compatible.

Mongo's GridFS uses MongoDB to manage the chunks a file is split into, so every read and write request requires a database query for the metadata. That is fine for a small number of requests, but under high concurrency it easily collapses. In Weed-fs the volume servers manage the actual data, and lookup work is spread across the volume nodes, so high-load scenarios are handled easily.
