Overview of data storage and distribution in distributed systems

Source: Internet
Author: User

0. Data classification

  1. Unstructured data: text, images, audio, video, and so on. Such data is commonly stored as BLOBs (binary large objects).

  2. Structured data: organized in tables with a corresponding schema (attributes, data types, and the relationships between data). The schema is separate from the content and must be defined beforehand. Typically stored in a relational database.

  3. Semi-structured data: sits between structured and unstructured data. It is self-describing: schema and content are mixed together, as in HTML.
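
The three categories can be made concrete with a small Python sketch. It is only an illustration: the names `blob`, `schema`, `rows`, `docs`, and `validate` are invented here, and JSON stands in for HTML as the self-describing format.

```python
import json

# Unstructured: an opaque byte string (a BLOB); the store knows nothing about its layout.
blob = b"\x89PNG\r\n\x1a\n...image bytes..."

# Structured: the schema is declared up front, separately from the rows.
schema = [("user_id", int), ("name", str), ("age", int)]
rows = [(1, "alice", 30), (2, "bob", 25)]


def validate(row, schema):
    """Return True if the row has the declared number of columns and column types."""
    return len(row) == len(schema) and all(
        isinstance(value, col_type) for value, (_, col_type) in zip(row, schema)
    )


# Semi-structured: each document carries its own field names, so two
# records in the same collection may have different shapes.
docs = [
    json.loads('{"user_id": 1, "name": "alice", "tags": ["admin"]}'),
    json.loads('{"user_id": 2, "email": "bob@example.com"}'),
]

all_valid = all(validate(r, schema) for r in rows)
field_sets = [sorted(d) for d in docs]
```

Here every row must match `schema` before it is accepted, while the two entries in `docs` describe themselves and do not share the same fields.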

1. Distributed storage systems

  1. Distributed file systems: store unstructured data such as text, images, audio, and video. Examples are Google's GFS/Colossus and the open-source HDFS.

  2. Distributed key-value systems: store simple semi-structured data. They are distributed NoSQL stores that provide only key-based create, read, update, and delete (CRUD) operations. An example is Amazon's Dynamo.

  3. Distributed table systems: store more complex semi-structured data. Compared with distributed key-value systems, they also support key-based range lookups. Compared with relational databases, they do not support complex operations such as multi-table joins and nested queries. Typical systems are Google's Bigtable and the open-source HBase.

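  4. Distributed databases: store structured data. They are distributed relational databases that provide SQL as the relational query language. Examples are Google's Spanner and the open-source Hive (strictly speaking, Hive is a SQL data warehouse on top of Hadoop rather than a full relational database).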
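
To illustrate the difference between the key-value and table interfaces in the list above, here is a minimal single-process sketch. The class and method names (`KeyValueStore`, `TableStore`, `scan`) are hypothetical and do not correspond to the API of Dynamo, Bigtable, or HBase. The point is only that a table system keeps row keys ordered, which is what makes range lookups possible.

```python
import bisect


class KeyValueStore:
    """Hypothetical key-value interface: point operations by key only."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


class TableStore(KeyValueStore):
    """Hypothetical table interface: the same point operations plus a key-range scan."""

    def __init__(self):
        super().__init__()
        self._keys = []  # row keys kept in sorted order

    def put(self, key, value):
        if key not in self._data:
            bisect.insort(self._keys, key)
        super().put(key, value)

    def delete(self, key):
        if key in self._data:
            self._keys.pop(bisect.bisect_left(self._keys, key))
        super().delete(key)

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(k, self._data[k]) for k in self._keys[lo:hi]]


table = TableStore()
for k in ["user3", "user1", "user2", "item9"]:
    table.put(k, k.upper())
table.delete("user2")
result = table.scan("user1", "user9")  # all remaining keys in ["user1", "user9")
```

A plain `KeyValueStore` can only answer "what is stored under this exact key", whereas `TableStore.scan` returns every row in a key range without touching unrelated keys.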

2. Single-machine storage engines

The basic operations a storage system provides are create, read, update, and delete (CRUD). Different storage schemes emphasize different workloads, which leads to differences in the performance and functionality of the storage system.

Currently, the main single-machine storage engines are:

  1. Hash storage: hash-based CRUD is the fastest, but the drawback is that sequential scans are not supported. Bitcask is a storage system based on a hash table. It appends every write (including deletion markers) to the end of a file and periodically merges old files to discard stale records.

  2. B-tree: I had long been looking for a structure that supports both random reads and range lookups, and my personal conclusion is that B-trees were made for exactly this. The lookup time complexity is O(log_d n), where d is the out-degree (fan-out) of each node. MySQL's InnoDB engine and operating system file systems use a B+ tree. (Why choose the B+ tree variant over a plain B-tree? Interested readers can explore this. Hint: disk reads.)

  3. LSM tree (log-structured merge tree): designed as an improvement on the B+ tree. The idea is to keep incremental writes in memory and flush them to disk once a threshold is exceeded, which reduces random disk writes. Reads have to merge the on-disk data with the in-memory writes. It is implemented with a memtable and SSTables; the implementation details are not explored here. It suits write-heavy workloads. HBase stores the data of each column family in an LSM tree.
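
The LSM idea from the last item can be sketched in a few lines. This is a toy in-memory model under loud assumptions: `TinyLSM`, `memtable_limit`, and `compact` are names made up for illustration, and the "SSTables" are plain sorted Python lists rather than files. Real engines add a write-ahead log, on-disk formats, and leveled compaction.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: one in-memory memtable plus immutable sorted runs ("SSTables")."""

    TOMBSTONE = object()  # marker recording that a key was deleted

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # flush threshold (assumed value)
        self.memtable = {}
        self.sstables = []  # newest run last; each run is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, self.TOMBSTONE)

    def _flush(self):
        # One sequential write of a sorted run stands in for many random writes.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then runs from newest to oldest.
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is self.TOMBSTONE else value
        for run in reversed(self.sstables):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                value = run[i][1]
                return None if value is self.TOMBSTONE else value
        return None

    def compact(self):
        # Merge all runs, keeping the newest value per key and dropping tombstones.
        merged = {}
        for run in self.sstables:  # oldest to newest, so later writes overwrite
            merged.update(run)
        live = sorted((k, v) for k, v in merged.items() if v is not self.TOMBSTONE)
        self.sstables = [live] if live else []


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable reaches the limit and is flushed as the first run
db.put("a", 10)
db.delete("b")   # second flush: the run holds the newer "a" and a tombstone for "b"
db.put("c", 3)   # stays in the memtable
```

Note how `get` has to consult several places, which is the read-side cost mentioned above, while every write only touches the in-memory memtable until a flush.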

3. Data distribution in distributed systems

The above covers only single-machine storage engines. In a distributed system, the most basic problem is how to partition the data. There are two main ways to distribute data:

  1. Hash distribution: best suited to key-value systems. As with single-machine hash storage, the drawback is that sequential scans are not supported. The choice of hash key matters:

a) If you hash on a per-record key, placement is effectively random, so multiple records of the same user may land on different nodes. That makes operations touching several records of one user difficult.

b) If you hash on the user ID, an uneven distribution of user data can cause data skew, putting heavy load on a single node.

The traditional hashing algorithm also has a more serious problem: when a node fails or nodes are added (other than by doubling the node count), most of the data has to be redistributed across the nodes, causing large-scale data migration. The solution is the familiar "consistent hashing", which the author covers in detail in the next part.

  2. Sequential distribution: suited to distributed table systems and distributed relational databases. The advantage is obvious: range scans are supported. A master server manages the data distribution and maintains an index structure similar to a B+ tree: the root table records the locations of the meta tables, and the meta tables record the locations of the tables that hold the actual user data. As rows are inserted and deleted, user tables are split or merged according to their size, and each such change is recorded in the meta table.
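
The two schemes in this list can be compared in a short sketch. It is an illustration under assumptions, not any system's actual implementation: `stable_hash`, `hash_node`, `moved_fraction`, `RangeIndex`, `find_tablet`, and the node and tablet names are all hypothetical. The first half measures how many keys change nodes under plain modulo hashing when a cluster grows from 4 to 5 nodes. The second half models the root table, meta table, and user table lookup with sorted split points.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic hash; Python's built-in hash() of str is salted per process."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def hash_node(key, num_nodes):
    """Hash distribution: node index = hash(key) mod number of nodes."""
    return stable_hash(key) % num_nodes


def moved_fraction(keys, old_nodes, new_nodes):
    """Fraction of keys whose node changes when the cluster is resized."""
    moved = sum(hash_node(k, old_nodes) != hash_node(k, new_nodes) for k in keys)
    return moved / len(keys)


class RangeIndex:
    """Sequential distribution: sorted split points map key ranges to locations."""

    def __init__(self, split_points, locations):
        # locations[i] serves keys below split_points[i]; the last one serves the rest.
        assert len(locations) == len(split_points) + 1
        self.split_points = split_points
        self.locations = locations

    def locate(self, key):
        return self.locations[bisect.bisect_right(self.split_points, key)]


# Two-level lookup as in the text: root table -> meta table -> user table.
meta_a = RangeIndex(["g"], ["node1/user-tablet-0", "node2/user-tablet-1"])
meta_b = RangeIndex(["t"], ["node3/user-tablet-2", "node1/user-tablet-3"])
root = RangeIndex(["n"], [meta_a, meta_b])


def find_tablet(key):
    """Resolve a row key to the (hypothetical) tablet that stores it."""
    return root.locate(key).locate(key)


keys = [f"user{i}" for i in range(10000)]
fraction = moved_fraction(keys, 4, 5)
```

With a uniform hash, a key keeps its node only when `h % 4 == h % 5`, so about four out of five keys should move when going from 4 to 5 nodes, whereas doubling from 4 to 8 moves only about half. In the range scheme, adjacent keys resolve to the same tablet, which is what makes range scans cheap, and splitting a tablet only requires adding one split point to the corresponding meta table.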

--------------------------END----------------------------------
