Overview of Data Storage and Distribution in Distributed Systems

Last Update:2015-02-02 Source: Internet

Author: User

0. Data classification

　　1. Unstructured data: text, images, audio, video, and so on. This type of data is commonly stored as BLOBs (Binary Large Objects).

　　2. Structured data: organized in tables with a corresponding schema (attributes, data types, and relationships between data). The schema and the content are separate, and the schema must be defined in advance. Typically stored in a relational database.

　　3. Semi-structured data: sits between structured and unstructured data. It is self-describing, with schema and content mixed together, as in HTML.

1. Distributed storage systems

　　1. Distributed file system: stores unstructured data such as text, images, audio, and video. Examples include Google's GFS/Colossus and the open-source HDFS.

　　2. Distributed key-value system: stores simple semi-structured data. It is a distributed NoSQL system that provides only key-based create, read, update, and delete (CRUD) operations. An example is Amazon's Dynamo.

　　3. Distributed table system: stores more complex semi-structured data. Compared with a distributed key-value system, it also supports range lookups by key. Compared with a relational database, however, it does not support complex operations such as multi-table joins and nested queries. Typical examples are Google's Bigtable and the open-source HBase.

　　4. Distributed database: stores structured data. It is a distributed relational database that provides SQL as its query language. Examples include Google's Spanner and the open-source Hive (strictly speaking, Hive is a SQL-like data warehouse layer on Hadoop rather than a full relational database).

2. Single-machine storage engines

The basic functions of a storage system are create, read, update, and delete (CRUD). Different storage engines make different trade-offs for different workloads, which leads to differences in performance and functionality.

Currently, the main single-machine storage engines are:

　　1. Hash storage: hash-based CRUD on a single key is the fastest. The disadvantage is that sequential scans are not supported. Bitcask is a storage system built on a hash table structure. It appends every write (including deletion markers) to the end of a file and periodically merges old files to discard stale records.
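The append-and-merge idea can be sketched as follows. This is a minimal illustration, not Bitcask's actual file format: the class name `AppendOnlyStore`, the tab-separated record layout, and the `__deleted__` marker are all assumptions made for the example, and keys and values are assumed to contain no tabs or newlines.

```python
import os


class AppendOnlyStore:
    """Toy hash-indexed, append-only store (illustrative only)."""

    TOMBSTONE = "__deleted__"  # hypothetical deletion marker

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the newest record for that key
        open(path, "ab").close()  # create the log file if it does not exist

    def _append(self, key, value):
        # Every write goes to the end of the file; nothing is updated in place.
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(f"{key}\t{value}\n".encode("utf-8"))
        return offset

    def put(self, key, value):
        self.index[key] = self._append(key, value)

    def delete(self, key):
        # A delete is also an append: write a marker, then drop the index entry.
        self._append(key, self.TOMBSTONE)
        self.index.pop(key, None)

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split("\t", 1)[1]

    def compact(self):
        # Merge step: rewrite only the live records, then swap the files.
        live = {k: self.get(k) for k in self.index}
        tmp_path = self.path + ".compact"
        open(tmp_path, "wb").close()
        original_path, self.path = self.path, tmp_path
        self.index = {}
        for k, v in live.items():
            self.put(k, v)
        os.replace(tmp_path, original_path)
        self.path = original_path
```

Reads cost one hash lookup plus one disk seek, but there is no way to iterate keys in order, which is the scan limitation noted above.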

　　2. B-tree: I have long looked for a structure that supports both random reads and range lookups, and my personal conclusion is that the B-tree was made for this. Lookup time complexity is O(log_d n), where d is the out-degree (fan-out) of each node. MySQL's InnoDB engine and OS file systems use B+ trees. (Why the B+ tree variant is chosen over a plain B-tree is left for interested readers to explore. Hint: disk reads.)
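To make the effect of the fan-out d concrete, the small helper below counts how many levels an idealized tree needs to address n keys. It is a simplified model (every node is assumed full, and `levels_needed` is a name invented for this example), but it shows why a large fan-out keeps the number of node visits, and therefore disk reads, small.

```python
def levels_needed(n_keys, fanout):
    """Levels of an idealized, completely full tree that can address n_keys."""
    levels, capacity = 1, fanout
    while capacity < n_keys:
        capacity *= fanout
        levels += 1
    return levels


# One million keys: a binary tree versus a tree with 100-way nodes.
binary_levels = levels_needed(1_000_000, 2)    # 20 levels
wide_levels = levels_needed(1_000_000, 100)    # 3 levels
```

If each level costs one disk read, the wide tree answers a lookup in a handful of reads where the binary tree would need about twenty.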

　　3. LSM tree (Log-Structured Merge tree): can be seen as a write-optimized alternative to the B+ tree. The idea is to buffer incremental writes in memory and flush them to disk once a threshold is exceeded, which reduces random disk writes. Reads must merge the on-disk data with the in-memory writes. It is implemented with a memtable and SSTables; the implementation details are not explored in depth here. It suits write-heavy workloads. HBase stores the data of a column family using an LSM tree.
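The memtable/SSTable flow can be sketched in a few lines. This is a toy model under stated assumptions: the "SSTables" are sorted in-memory lists rather than files, deletes are not handled, and the names `TinyLSM` and `memtable_limit` are invented for the example.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a dict as memtable, sorted lists standing in for SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # oldest first; each entry is (sorted_keys, values)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory until the threshold is reached.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sequential write of a sorted run replaces many random writes.
        if self.memtable:
            items = sorted(self.memtable.items())
            self.sstables.append(([k for k, _ in items], [v for _, v in items]))
            self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.sstables):
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None

    def compact(self):
        # Merge all runs into one; later (newer) runs overwrite older values.
        merged = {}
        for keys, values in self.sstables:
            merged.update(zip(keys, values))
        items = sorted(merged.items())
        self.sstables = (
            [([k for k, _ in items], [v for _, v in items])] if items else []
        )
```

The read path shows the cost of the design: a lookup may have to consult several runs, which is why real systems compact runs in the background.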

3. Data distribution in distributed systems

The above covers only single-machine storage engines. In a distributed system, the most basic problem is how to partition data. There are two main ways to distribute data:

　　1. Hash distribution: well suited to key-value systems. As with single-machine hash storage, the disadvantage is that sequential scans are not supported. The choice of hash key matters:

a) If records are hashed essentially at random (for example, by record key), multiple records of the same user may land on different nodes, which makes it hard to operate on several records of one user at once.

b) If records are hashed by user, an uneven distribution of data across users can cause data skew, and individual nodes come under heavier load.

Traditional hashing (for example, hash modulo the number of nodes) also has a more serious problem: when a node fails or nodes are added (unless the node count is doubled), most of the data must be redistributed, which causes a large amount of data migration. The solution is the well-known "consistent hashing", which the author describes in detail in the next section.
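The migration problem can be illustrated by comparing modulo placement with a simple hash ring. This is a sketch under assumptions, not the scheme of any particular system: the node names, the use of SHA-256, the `vnodes` count, and the helpers `modulo_node`, `HashRing`, and `moved_fraction` are all invented for the example.

```python
import bisect
import hashlib


def h(text):
    """Stable integer hash of a string (SHA-256 chosen only for determinism)."""
    return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)


def modulo_node(key, nodes):
    """Traditional placement: hash modulo the number of nodes."""
    return nodes[h(key) % len(nodes)]


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        ring = sorted((h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes))
        self.points = [point for point, _ in ring]
        self.owners = [node for _, node in ring]

    def node_for(self, key):
        # A key belongs to the first ring point clockwise from its hash.
        i = bisect.bisect_right(self.points, h(key)) % len(self.points)
        return self.owners[i]


def moved_fraction(keys, before, after):
    """Fraction of keys whose node changes between two placement functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
old_nodes = ["n0", "n1", "n2", "n3"]
new_nodes = old_nodes + ["n4"]

modulo_moved = moved_fraction(
    keys, lambda k: modulo_node(k, old_nodes), lambda k: modulo_node(k, new_nodes)
)
old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = moved_fraction(keys, old_ring.node_for, new_ring.node_for)
```

Going from four to five nodes, modulo placement is expected to move roughly four keys in five, while the ring moves only the keys that the new node takes over, roughly one in five.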

　　2. Sequential (range) distribution: suited to distributed table systems and distributed relational databases. The advantage is clear: it supports range scans. A master server manages the data distribution and maintains an index structure similar to a B+ tree: the root table records the locations of the meta tables, and the meta tables record the locations of the actual user tables. As records are inserted and deleted, user tables are split or merged according to their size, and each such change is recorded in the meta table.
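A range-partitioned lookup can be sketched with a sorted list of split keys. For brevity this flattens the root/meta hierarchy described above into a single level; the class `RangePartitionIndex`, its method names, and the server names in the usage note are hypothetical.

```python
import bisect


class RangePartitionIndex:
    """Toy single-level range index mapping key ranges to servers."""

    def __init__(self, first_server):
        # split_keys[i] is the smallest key of partition i + 1;
        # partition 0 holds every key below split_keys[0].
        self.split_keys = []
        self.servers = [first_server]

    def locate(self, key):
        """Index of the partition whose range contains key."""
        return bisect.bisect_right(self.split_keys, key)

    def server_for(self, key):
        return self.servers[self.locate(key)]

    def split(self, split_key, new_server):
        """Split the partition containing split_key; keys >= split_key move."""
        if split_key in self.split_keys:
            raise ValueError("a partition already starts at this key")
        i = self.locate(split_key)
        self.split_keys.insert(i, split_key)
        self.servers.insert(i + 1, new_server)

    def servers_for_range(self, start, end):
        """Servers that hold any key in the closed range [start, end]."""
        return self.servers[self.locate(start): self.locate(end) + 1]
```

For example, after `split("m", "s1")` and `split("t", "s2")` on an index created with `"s0"`, a scan over `"n"` to `"u"` touches only `s1` and `s2`, in key order. This is the range-scan property that hash distribution lacks.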

--------------------------END----------------------------------

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
