Brief Reading Notes on "NoSQL Distilled"

Last Update:2015-12-15 Source: Internet

Author: User

Tags database sharding

I. Why NoSQL?

Two reasons to pay attention to NoSQL:
- Application development efficiency (NoSQL simplifies data interaction)
- Large-scale data (NoSQL is designed for clustered environments)
NoSQL does not stand alone and will not replace relational databases; the database field is moving into an era of polyglot persistence.
The benefits of a relational database:
- Standardized modeling
- Easier handling of relationships
- Concurrency handled through transactions
- Persistent storage
The disadvantages of relational databases:
- Impedance mismatch: the storage structures of a relational database (schemas, tables, tuples) must be translated to and from the data structures used in the application. ORM frameworks address this problem, but they can reduce performance.
- Application databases versus integration databases: with the rise of SOA, the internal database is decoupled from the services used for external communication.
- Clustering issues: scalability (vertical versus horizontal) and licensing fees. Big data is a major driver here.
The definition of NoSQL: an open-source, distributed, non-relational database
- Open source
- Distributed
- Non-relational (schemaless)
The main reasons to use NoSQL (in other cases, keep using a relational database):
- Large data volumes with demanding access performance requirements
- To solve the impedance mismatch problem

II. Aggregate Data Models

The main categories of NoSQL
- Aggregate-oriented (the term "aggregate" comes from Domain-Driven Design):
  - Key-value model: the aggregate is opaque to the database and can only be read as a whole
  - Document model: the aggregate is transparent, so the database can see its structure and read parts of it (better access in exchange for restrictions on structure)
  - Column-family model: a "two-level map", that is, a two-level aggregate structure, so it is both row oriented and column oriented
- Not aggregate-oriented: graph databases
Benefits of aggregates:
- In a cluster, it is relatively simple to operate on data in units of aggregates
- Aggregates also make it easier for applications to manipulate data structures (friendlier to programmers)
- Operations on a single aggregate are atomic
- Fewer nodes need to be queried to gather data
Disadvantages of aggregates:
- Atomicity across multiple aggregates must be maintained by application code, which is often more complex
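
To make the difference between the three aggregate-oriented models more concrete, here is a minimal in-memory sketch. It is not how any particular product works: the "order" aggregate, the store dictionaries, and the helper names (kv_get, doc_get_field, cf_get_family) are all assumptions made up for illustration.

```python
import json

# A hypothetical "order" aggregate, used only for illustration.
order = {
    "id": "order:1001",
    "customer": "alice",
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Key-value model: the store holds an opaque blob and returns it whole.
kv_store = {order["id"]: json.dumps(order)}

def kv_get(key):
    """Fetch and decode the entire aggregate; partial reads are not possible."""
    return json.loads(kv_store[key])

# Document model: the store understands the structure and can return a part.
doc_store = {order["id"]: order}

def doc_get_field(key, path):
    """Follow a path of dict keys / list indexes inside one document."""
    value = doc_store[key]
    for step in path:
        value = value[step]
    return value

# Column-family model: a two-level map, row key -> column family -> columns.
cf_store = {
    order["id"]: {
        "profile": {"customer": "alice"},
        "items": {"A1": 2, "B7": 1},
    }
}

def cf_get_family(row_key, family):
    """Read one column family of one row without touching the others."""
    return cf_store[row_key][family]
```

For example, doc_get_field("order:1001", ["items", 0, "sku"]) reads a single field, whereas kv_get("order:1001") always has to return the whole order.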

III. Data Models in Detail

Relationships: if the data being processed contains many relationships, a relational database is the usual choice (although graph databases are actually stronger in this respect)
Aggregates: operating on a single aggregate is convenient, but operating across multiple aggregates is awkward
Graph databases:
- A graph consists of nodes and edges
- Simple nodes, richly interconnected
- Traversing relationships is very fast in a graph database but slow in a relational database
- Typically run on a single server
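
As a rough illustration of why traversal is cheap when edges are stored directly with the nodes, the sketch below walks a small adjacency list breadth-first. The "knows" edges and the function name reachable_within are hypothetical; a real graph database has its own storage format and query language.

```python
from collections import deque

# Hypothetical "knows" edges between people, stored as an adjacency list.
edges = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": [],
}

def reachable_within(start, max_hops):
    """Return {node: hop count} for nodes reachable in at most max_hops edges."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in edges.get(node, []):
            if neighbour not in seen:
                seen[neighbour] = seen[node] + 1
                queue.append(neighbour)
    del seen[start]  # the start node itself is not a result
    return seen
```

Each hop is a direct lookup of a node's neighbours; in a relational schema the same question would typically need one join per hop.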
Schemaless databases:
- In practice there is always an implicit schema
- In essence, a schemaless database shifts the schema into the application code that accesses the data
- If the stored data is not uniform, a schemaless database is the better fit
- The flexibility of being schemaless only applies inside an aggregate
- Data migration is difficult whether or not the database has a schema
Materialized views
- Purpose: let clients treat base data and derived data in the same way
- Computing a materialized view is relatively complex and time consuming
- Update strategies:
  - Update immediately whenever the base data changes
  - Update periodically through batch jobs (typically via Map-reduce)
- Within an aggregate, a materialized view can be updated as part of the same atomic operation
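
The two update strategies can be sketched as follows. This is only an in-memory illustration under assumed names (orders, total_by_customer, add_order_eager, rebuild_view_batch); it says nothing about how a specific database maintains its views.

```python
from collections import defaultdict

orders = []                            # base data: (customer, amount) pairs
total_by_customer = defaultdict(int)   # materialized view (derived data)

def add_order_eager(customer, amount):
    """Eager strategy: update the view in the same step as the base data."""
    orders.append((customer, amount))
    total_by_customer[customer] += amount

def rebuild_view_batch():
    """Batch strategy: recompute the whole view from the base data."""
    fresh = defaultdict(int)
    for customer, amount in orders:
        fresh[customer] += amount
    return fresh
```

The eager path keeps the view current at the cost of extra work on every write; the batch path keeps writes cheap but lets the view go stale between runs.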

IV. Distribution Models

Scaling out (horizontally) is easier than scaling up (vertically)
Data distribution:
- Replication
  - Master-slave
  - Peer-to-peer
- Sharding
- The two approaches are orthogonal and can be used together
Single server
- A single server can run either SQL or NoSQL
- Prefer the single-server setup whenever you do not need to distribute data
Sharding
- Place different data on different database servers
- Improves both read and write performance
- Reduces resilience to failures (because more database servers have to be kept running)
- Optimizations:
  - Geography: place the data close to the users who access it
  - Load balancing
  - Auto-sharding (offered by most NoSQL databases, which decouples application code from the sharding logic)
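
One simple way to decide which server holds a key is to hash the key and take the result modulo the number of shards. The sketch below assumes three hypothetical servers (db-0, db-1, db-2) and a made-up helper name shard_for; databases with auto-sharding do this routing for you, usually with more elaborate schemes.

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2"]  # hypothetical server names

def shard_for(key, shards=SHARDS):
    """Map a key to one shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

A stable hash (rather than Python's built-in hash(), which is randomized per process for strings) is used so that every client routes the same key to the same shard. Note that with plain modulo routing, changing the number of shards moves most keys to a different server.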
Master-slave replication
- The system can be treated as single-server storage with a hot backup
- Advantages:
  - Adding slave nodes makes it easy to scale horizontally and handle more read requests, provided that reads are routed to the slaves
  - Improves read resilience
    - If the master fails, the slaves can still serve reads
    - A slave holding the same content as the master can quickly be appointed as the new master to replace the failed node
  - Lower probability of write conflicts
- Disadvantages:
  - Data inconsistency
  - The master is a performance bottleneck
Peer-to-peer replication
- Advantages:
  - All nodes can accept reads and writes
  - Easy to scale
  - No single node is a performance bottleneck
- Disadvantages:
  - Data consistency issues
Two extremes for resolving write conflicts:
- Always coordinate between nodes so that conflicts never occur; it is enough for a majority of replicas to agree
- Allow conflicting writes between nodes, and try to merge them afterwards
Sharding + replication
- Peer-to-peer replication + sharding
- Each shard is stored on a number of peer nodes equal to the replication factor
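
Combining sharding with peer-to-peer replication means each key lives on several nodes. A minimal sketch, assuming five hypothetical nodes arranged in a ring and a made-up helper replicas_for: hash the key to pick a starting node, then take the following nodes until the replication factor is reached. Real systems use their own placement strategies.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical

def replicas_for(key, nodes=NODES, replication_factor=3):
    """Pick replication_factor distinct nodes for a key, walking a ring."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds node count")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]
```

With a replication factor of 3, any single node can fail and two copies of every key it held remain available on its neighbours.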

V. Consistency

The trade-offs need to be understood and weighed
- Strong consistency: always consistent
- Eventual consistency: there is an inconsistency window, but the data eventually becomes consistent
Update consistency
- Problem: write-write conflicts
- Common solutions:
  - Prerequisite: all nodes must apply updates in the same order
    - Sequential consistency
  - Pessimistic approach: prevent conflicts from occurring
    - Write locks
    - Greatly reduces system responsiveness
    - Prone to deadlocks
  - Optimistic approach: let conflicts occur, then detect and resolve them
    - Conditional updates
    - Save all updates, flag the conflicts, and merge them
Read consistency
- Problem: read-write conflicts (inconsistent reads; logical consistency is violated)
- NoSQL support for transactions:
  - Aggregate-oriented databases: atomic updates within one aggregate are supported, but transactions spanning multiple aggregates are not
  - Graph databases: supported
- Inconsistency window: the length of time during which the data is logically inconsistent
- Replication consistency: consistency of the data across replicas
- The consistency level can usually be specified per request, and reasonably lowering it for some requests can improve performance
- Read-your-writes consistency
  - Session consistency
    - Sticky sessions
      - Reads and writes are bound to one node
      - Reduces the effectiveness of the load balancer
    - Version stamps
Relaxing consistency constraints
- Choose a reasonable isolation level and relax consistency requirements where appropriate
- CAP theorem
  - Consistency
  - Availability
  - Partition tolerance
  - A common misunderstanding:
    - "Only two of the three can be satisfied at the same time."
    - In fact: when the system may be partitioned, consistency has to be traded off against availability, and this is not a binary choice.
- It is sometimes appropriate to relax consistency, allow conflicts to occur, and resolve them in application code guided by domain knowledge, in exchange for better concurrency.
- The BASE properties advocated by NoSQL:
  - Basically Available
  - Soft state
  - Eventual consistency
  - In essence, the trade-off is mostly between consistency and latency
Relaxing durability constraints
- Non-durable writes
  - For example, Redis writes to memory first and then flushes to disk periodically
- Replication durability
  - If the node handling a write fails before replication completes, data can be lost
  - The guarantee of replication has to be weighed against the responsiveness of the database
Quorums (a way to avoid conflicts)
- Write quorum
  - Under the peer-to-peer distribution model:
    - W > N/2
    - W: number of nodes participating in the write
    - N: replication factor
  - Under the master-slave distribution model:
    - Write to the master only
- Read quorum
  - Under the peer-to-peer distribution model:
    - R + W > N
    - R: number of nodes contacted for a read
  - Under the master-slave distribution model:
    - Read from the master only
- The specific quorum settings still need to be chosen according to the actual situation
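
The two quorum inequalities are easy to check mechanically. The helpers below are a small sketch with made-up names; W, R and N have the meanings given in the list above.

```python
def write_quorum_ok(w, n):
    """True if W > N/2, so two conflicting writes cannot both reach a majority."""
    return 2 * w > n

def read_quorum_ok(r, w, n):
    """True if R + W > N, so every read overlaps at least one up-to-date replica."""
    return r + w > n

def min_write_quorum(n):
    """Smallest W that satisfies W > N/2."""
    return n // 2 + 1

def min_read_quorum(n, w):
    """Smallest R that satisfies R + W > N."""
    return n - w + 1
```

With a replication factor of N = 3, a write needs W = 2 acknowledgements, and given W = 2 a read must contact R = 2 nodes. Choosing W = 3 would allow R = 1, trading slower writes for faster reads.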

VI. Version Stamps

Business transactions vs. system transactions
- Business transaction: from the user's point of view, the whole interaction
- System transaction: what runs after the user submits the work to the system
- Problem: there is a long time window between the start of the business transaction and the system transaction
- Solution: offline concurrency techniques
  - Optimistic offline lock (a form of conditional update): compare-and-set (CAS)
  - Compare the version stamp read at the start of the business transaction with the version stamp seen when the system transaction starts, to check whether the information has changed and decide whether the update can proceed
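
A compare-and-set update with a counter as the version stamp could look like the sketch below. The Store class, its method names and the VersionConflict exception are assumptions for illustration; the code is single-process and not thread-safe, so it only shows the logic of the check.

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches the one that was read."""

class Store:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key has version 0 and value None."""
        return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, new_value):
        """Write only if nobody else has updated the key since it was read."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(
                "expected version %d, found %d" % (expected_version, current_version)
            )
        self._data[key] = (current_version + 1, new_value)
        return current_version + 1
```

A client reads the version at the start of the business transaction, lets the user work, and passes that same version to compare_and_set when submitting. If another client has written in the meantime, the call raises VersionConflict and the application decides how to proceed.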
Common types of version stamp:
- Counter
  - Easy to compare
  - But needs a single master server to generate values and ensure that different versions never get the same counter value
- GUID (globally unique identifier)
  - Anyone can generate one
  - But the values are large and cannot be compared directly to tell which is newer
- Hash of the resource content
  - Globally unique if the hash is large enough
  - Anyone can generate one
  - The hash value is deterministic (the same content always gives the same hash)
  - But long, and cannot be compared directly to tell which is newer
- Timestamp of the last update
  - Short and easy to generate
  - Can be compared directly
  - But clocks on different servers must be synchronized, otherwise data corruption is likely
  - The right precision is hard to choose: too coarse and updates cannot be told apart; too fine and the stamp must be updated very frequently
- Composite stamp
  - Combines several of the methods above
Generating version stamps in a multi-node environment
- Single-server or master-slave replication model:
  - A basic version-stamp scheme is sufficient: a counter
  - Timestamps also work, but not as well as counters
- Multiple master nodes:
  - Each node keeps a version stamp history
  - Decide which version is newer by checking whether one stamp is an ancestor of the other; if neither is an ancestor of the other, a conflict is detected
- Peer-to-peer replication model
  - Vector stamp
  - Maintains one counter per node, for example [Server01:1, Server02:4, Server03:5]
  - Missing entries are treated as 0, for example Server04:0
  - This makes it easy to add nodes
Version stamps only detect conflicts; they do not resolve them. Conflict resolution relies on domain knowledge.
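
Comparing two vector stamps can be sketched as below, representing each stamp as a dict from node name to counter. The function name and the returned labels are made up for illustration. Missing entries count as 0, and a conflict is reported when each stamp is ahead of the other on at least one node.

```python
def compare_vector_stamps(a, b):
    """Return 'equal', 'a_newer', 'b_newer' or 'conflict' for two vector stamps."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "conflict"  # concurrent updates: neither descends from the other
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"
```

For instance, {Server01:1, Server02:4, Server03:5} is older than {Server01:1, Server02:5, Server03:5}, while {Server01:2, Server02:4, Server03:5} conflicts with the latter because each side has an update the other has not seen. What to do about the conflict is left to the application.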

VII. Map-reduce

A form of the scatter-gather pattern
Part of the computation logic is placed on the database servers
The input is a collection, and the output is a collection of key-value pairs
The main functions are:
- Mapper
- Combiner
- Reducer
A combinable reducer is usually required, so that the same function can serve as both reducer and combiner
Stages are typically composed using the pipes-and-filters pattern
Incremental Map-reduce: intermediate results usually need to be saved for reuse in the next run
Classic implementation: Hadoop
- Pig: a dedicated language
- Hive: an SQL-like language
VIII. Common NoSQL Implementations

Key-value
- Memcached
- Redis
- Riak
Document
- CouchDB
- MongoDB
Column family
- HBase
- Cassandra
- Amazon SimpleDB
Graph
- Neo4j
- HyperGraphDB

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
