I. Why NoSQL?
- Two reasons to adopt NoSQL:
- Application development productivity (NoSQL simplifies data interaction)
- Large-scale data (NoSQL is designed for clustered environments)
- NoSQL does not stand alone and will not replace relational databases; the database field is moving into an era of polyglot persistence.
- The benefits of a relational database:
- Standardized modeling
- Easier handling of relationships
- Concurrency handled through transactions
- Durable persistence
- The disadvantages of relational databases:
- Impedance mismatch: the storage structures of a relational database (schema, table, tuple) must be translated to and from the data structures used in the application. ORM frameworks ease this problem, but they can reduce performance.
- Application databases versus integration databases: with the rise of SOA, each application's internal database is decoupled from the services it uses for external communication.
- Clustering issues: scaling (vertical versus horizontal) and licensing fees; big data is a major driving force.
- The definition of NoSQL: Open source distributed non-relational database
- Open source
- Distributed
- Non-relational (schemaless)
- The main reasons to use NoSQL (in other cases, keep using a relational database):
- Large data volumes with demanding access-performance requirements
- To solve the "impedance mismatch" problem
II. Aggregate Data Models
- The main categories of NoSQL
- Aggregate-oriented (the term comes from Domain-Driven Design):
- Key-value model: the aggregate is opaque to the database and can only be read as a whole
- Document model: the aggregate is transparent; the database can see its structure and read parts of it (some restrictions on structure are accepted in exchange for better access)
- Column-family model: a "two-level map", that is, a two-level aggregate structure, so it is both row-oriented and column-oriented
- Not aggregate-oriented: graph databases
- Benefits of Aggregation:
- In a cluster, it is relatively simple to distribute and operate on data in units of aggregates
- Aggregates also make data structures easier for applications to manipulate (friendlier to programmers)
- Operations are atomic at the level of a single aggregate
- Fewer nodes need to be queried to fetch the data
- Disadvantages of aggregation:
- Atomicity across multiple aggregates has to be maintained by application code, which is often more complex
III. Data Models in Detail
- Relationships: if the data being processed contains a large number of relationships, a relational database is usually called for (although graph databases are actually stronger in this regard)
- Aggregates: manipulating a single aggregate is convenient, but manipulating several aggregates at once is awkward
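The difference between an opaque key-value aggregate and a transparent document aggregate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `KeyValueStore`, `DocumentStore`, and the fields of the order are hypothetical in-memory stand-ins, not the API of any real database.

```python
import json

# Hypothetical order aggregate: one unit that is read and written as a whole.
order = {
    "id": 99,
    "customer": "Ann",
    "items": [{"sku": "book-1", "qty": 2}, {"sku": "pen-7", "qty": 1}],
}


class KeyValueStore:
    """Toy key-value store: the value is an opaque blob to the database."""

    def __init__(self):
        self._data = {}

    def put(self, key, aggregate):
        # The store only ever sees bytes, never the structure inside.
        self._data[key] = json.dumps(aggregate).encode("utf-8")

    def get(self, key):
        # Only the whole aggregate can be fetched; the client decodes it.
        return json.loads(self._data[key].decode("utf-8"))


class DocumentStore:
    """Toy document store: the database can see inside the aggregate."""

    def __init__(self):
        self._data = {}

    def put(self, key, aggregate):
        self._data[key] = aggregate

    def get_field(self, key, path):
        # Follow a dotted path such as "items.0.sku" into the document.
        node = self._data[key]
        for part in path.split("."):
            node = node[int(part)] if isinstance(node, list) else node[part]
        return node


kv = KeyValueStore()
kv.put("order:99", order)
whole = kv.get("order:99")  # the entire aggregate comes back

docs = DocumentStore()
docs.put("order:99", order)
first_sku = docs.get_field("order:99", "items.0.sku")  # only one part is read
```

In both cases the order is stored and updated as one unit, which is what makes single-aggregate operations easy to keep atomic.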
- Graph database:
- A graph consisting of nodes and edges;
- Simple nodes, richly interconnected structure
- Traversing relationships is very fast in a graph database but slow in a relational database
- Typically runs on a single server
- Schemaless databases:
- In practice there is always an "implicit schema"
- Essentially, a schemaless database shifts the schema into the application code that accesses the data
- If the stored data turns out to be non-uniform, a schemaless database is the better fit
- The flexibility of having no schema is limited to the inside of an aggregate
- Data migration is difficult whether or not the database has a schema
- Materialized views
- Purpose: hide the difference between base data and derived data from clients
- Computing a materialized view is relatively complex and time-consuming
- Approaches:
- Update immediately whenever the underlying data changes
- Update periodically through batch jobs (typically via map-reduce)
- A materialized view can be kept inside an aggregate, so that it is updated within the same atomic operation
IV. Distribution Models
- Scaling out (horizontally) is easier than scaling up (vertically)
- Data distribution:
- Replication
- Master-slave
- Peer-to-peer
- Sharding
- These two techniques (replication and sharding) are orthogonal and can be used together
- Single Server
- A single server can be either SQL or NoSQL
- When you do not need to distribute data, always choose the single-server option
- Sharding
- Put different data on different database servers
- Both read and write performance can be improved
- Reduces resilience to failure (because more database servers have to be maintained)
- Optimization techniques:
- Geography: place the data on servers close to the users who access it
- Load balancing
- Auto-sharding (offered by most NoSQL databases; it decouples application code from the sharding logic)
- Master-slave replication
- You can treat your system as a single-server storage scenario with immediate backup capability
- Advantages:
- Adding slave nodes scales reads horizontally: more read requests can be handled, provided reads are routed to the slaves
- Improves the resilience of read operations
- If the master fails, the slaves can still serve reads
- A slave with the same content as the master can quickly be appointed as the new master to replace the failed node
- Lower probability of write conflicts
- Disadvantages:
- Data inconsistency between nodes
- The master is a performance bottleneck
- Peer-to-peer replication
- Advantages
- All nodes can read and write
- Easy to scale
- There are no performance bottleneck nodes
- Disadvantages:
- Write conflicts must be resolved; the two extremes are:
- Coordinate between nodes on every write so that conflicts cannot occur; agreement among a majority of the replicas is enough
- Allow conflicting writes on different nodes, then try to merge them
- Sharding + replication
- Peer-to-peer replication + sharding
- Each shard is stored on a fixed number of peer nodes (the replication factor)
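As a rough illustration of combining sharding with replication, the sketch below maps a key to a shard and then places that shard on a number of nodes equal to the replication factor. The function names, node names, and the "next nodes in the ring" placement rule are assumptions made for this example; real databases use their own placement strategies.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard using a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    to keep the mapping identical across processes and machines.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def replicas_for(shard, nodes, replication_factor):
    """Place a shard on `replication_factor` consecutive nodes, wrapping around."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds cluster size")
    return [nodes[(shard + i) % len(nodes)] for i in range(replication_factor)]


# Hypothetical four-node cluster with eight shards and a replication factor of 3.
nodes = ["node-a", "node-b", "node-c", "node-d"]
shard = shard_for("customer:42", num_shards=8)
owners = replicas_for(shard, nodes, replication_factor=3)
```

Because the shard is derived from the key alone, any client can work out which nodes hold a given key without consulting a central directory.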
V. Consistency
- The trade-offs need to be understood and weighed:
- Strong consistency: always consistent
- Eventual consistency: there is an inconsistency window, but the data becomes consistent in the end
- Update consistency
- Problem: write-write conflicts
- Common solutions:
- Prerequisite: update operations must be applied in a consistent order
- Sequential consistency
- Pessimistic approach: prevent conflicts from occurring
- Write locks
- Greatly reduce the responsiveness of the system
- Prone to deadlock
- Optimistic approach: let conflicts occur, then detect and resolve them
- Conditional updates
- Save all conflicting updates, mark them as conflicting, and merge them
- Read Consistency
- Problem: read-write conflicts / inconsistent reads (logical consistency)
- NoSQL support for transactions:
- Aggregate-oriented databases: atomic updates within a single aggregate are supported, but transactions spanning multiple aggregates are not
- Graph databases: transactions are supported
- Inconsistency window: the length of time during which the data is logically inconsistent
- Replication consistency: consistency of the data across different replicas
- The consistency level can typically be specified per request; reasonably lowering it for some requests can improve performance
- Read-your-writes consistency: after writing, a client always sees its own updates
- Session consistency
- Sticky sessions
- Reads and writes are bound to a single node
- Reduces the effectiveness of the load balancer
- Version stamps
- Relaxing consistency
- Choose a reasonable isolation level and relax consistency requirements where appropriate
- The CAP theorem
- Consistency (consistency)
- Availability (availability)
- Partition tolerance
- Common misconception:
- Only two of the three can be satisfied at the same time.
- In fact: when the system may be partitioned, consistency has to be traded off against availability, and this is not a binary choice.
- It is sometimes appropriate to relax consistency, allow conflicts to occur, and resolve them in application code guided by domain knowledge, in exchange for better concurrency.
- The BASE properties advocated by NoSQL:
- Basically Available
- Soft state
- Eventual consistency
- In essence, the trade-off is mostly between consistency and latency
- Relaxing durability
- Non-durable writes
- For example, Redis writes to memory first and then periodically writes to disk
- Replication durability
- If the node handling a write fails before replication completes, data can be lost
- Replication guarantees have to be weighed against the responsiveness of the database
- Quorums (a way to avoid conflicts)
- Write quorum
- Under the peer distribution model:
- W > N/2
- W: number of nodes participating in the write
- N: replication factor
- Under the master-slave distribution model:
- Read Quorum
- Under the peer distribution model:
- R + W > N
- R: number of nodes contacted for a read
- Under the master-slave distribution model:
- The actual quorum settings still have to be chosen to suit the situation at hand
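The two quorum inequalities for the peer-to-peer model can be checked with a few lines of Python. This is a small sketch; the function names are made up for the example, and `N`, `W`, and `R` have the meanings given above.

```python
def write_quorum_ok(w, n):
    """Write quorum: more than half of the N replicas must acknowledge the write."""
    return w > n / 2


def read_quorum_ok(r, w, n):
    """Read quorum: R + W > N, so every read overlaps at least one up-to-date replica."""
    return r + w > n


def min_read_nodes(w, n):
    """Smallest R that satisfies R + W > N for a given W and N."""
    return n - w + 1


# With a replication factor of 3, writing to 2 nodes satisfies the write quorum,
# and reads must then contact at least 2 nodes.
n = 3
w = 2
r = min_read_nodes(w, n)
```

Raising `W` lets `R` drop, which favours reads at the cost of slower writes; the reverse favours writes.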
VI. Version Stamps
- Business transactions vs. system transactions
- Business transaction: the whole interaction, as seen by the user
- System transaction: the transaction the system runs once the user submits
- Problem: there can be a long time window between the start of the business transaction and the system transaction
- Solution: offline concurrency techniques
- Optimistic offline lock (a form of conditional update): compare-and-set (CAS)
- Compares the version stamp read at the start of the business transaction with the version stamp at the start of the system transaction, checks whether the information has changed in between, and so decides whether the update can proceed
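A minimal sketch of a conditional update with a counter version stamp follows. `VersionedStore`, `StaleVersionError`, and the keys are hypothetical names for this example; a real store would also have to make the compare and the set one atomic step, which this single-threaded sketch does not attempt.

```python
class StaleVersionError(Exception):
    """Raised when the stored version no longer matches the version that was read."""


class VersionedStore:
    """Toy in-memory store that keeps a counter version stamp per key."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Version 0 means "no value stored yet".
        return self._rows.get(key, (0, None))

    def compare_and_set(self, key, expected_version, new_value):
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            raise StaleVersionError(
                f"{key}: expected version {expected_version}, found {current_version}"
            )
        self._rows[key] = (current_version + 1, new_value)
        return current_version + 1


store = VersionedStore()
store.compare_and_set("profile:7", 0, {"email": "old@example.com"})

# Two business transactions read the same version...
version_a, _ = store.read("profile:7")
version_b, _ = store.read("profile:7")

# ...the first one to commit succeeds, the second one is rejected.
store.compare_and_set("profile:7", version_a, {"email": "a@example.com"})
try:
    store.compare_and_set("profile:7", version_b, {"email": "b@example.com"})
    conflict_detected = False
except StaleVersionError:
    conflict_detected = True
```

The rejected writer can then re-read the data and decide, with domain knowledge, whether to retry, merge, or give up.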
- Common types of version stamp:
- Counter
- easy to compare
- but requires a single master to generate the values, so that no counter value is duplicated
- GUID (globally unique identifier):
- anyone can generate one
- but the values are large and cannot be compared directly to tell which is newer
- Hash of the resource's content
- unique, provided the hash is large enough
- anyone can generate one
- deterministic: the same content always yields the same hash
- but long, and cannot be compared directly to tell which is newer
- Timestamp of the last update
- short and easy to generate
- can be compared directly
- but the clocks of different servers must be synchronized, otherwise data can easily be corrupted
- and the precision of the timestamp is hard to choose: if it is too coarse, frequent updates cannot be told apart
- Composite stamp
- Generating version stamps in a multi-node environment
- Single-server or master-slave replication model:
- A basic version-stamp scheme is sufficient: a counter
- Timestamps also work, but not as well as counters
- Multiple master nodes:
- Each node keeps a version stamp history
- Decide which version is newer by checking whether one stamp is an ancestor of the other; if neither is an ancestor, a conflict has been detected
- Peer Replication Model
- Vector stamp
- Maintains a vector of counters, one per node, for example [Server01:1, Server02:4, Server03:5]
- Missing values are treated as 0, for example Server04:0
- This makes it easy to add nodes
- Version stamps only detect conflicts; they do not resolve them. Conflict resolution relies on domain knowledge
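Comparing two vector stamps can be sketched as follows. The function name and its return labels are assumptions for this example; the stamps are plain dictionaries from node name to counter, with missing nodes counted as 0 as described above.

```python
def compare_vector_stamps(a, b):
    """Compare two vector stamps given as {node_name: counter} dictionaries.

    Returns "a_newer", "b_newer", "equal", or "conflict".
    A node that is missing from a stamp is treated as counter 0.
    """
    nodes = set(a) | set(b)
    a_ahead = any(a.get(node, 0) > b.get(node, 0) for node in nodes)
    b_ahead = any(b.get(node, 0) > a.get(node, 0) for node in nodes)
    if a_ahead and b_ahead:
        return "conflict"  # each side has seen an update the other has not
    if a_ahead:
        return "a_newer"
    if b_ahead:
        return "b_newer"
    return "equal"


base = {"Server01": 1, "Server02": 4, "Server03": 5}
after_server02 = dict(base, Server02=5)  # Server02 applied one more update
after_server04 = dict(base, Server04=1)  # a newly added node applied an update

newer = compare_vector_stamps(after_server02, base)
added_node = compare_vector_stamps(base, after_server04)
concurrent = compare_vector_stamps(after_server02, after_server04)
```

The function only reports that two versions conflict; choosing or merging a winner is left to the application.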
VII. Map-Reduce
- A form of the scatter-gather pattern
- Moves part of the computation onto the database servers
- The input is a collection, and the output is a collection of key-value pairs
- The main functions include the following
- A combinable reducer is usually preferred: the same function then serves as both reducer and combiner
- Pipes and filters are typically used to chain several map-reduce stages together
- Incremental map-reduce: intermediate results usually need to be saved for reuse in later runs
- Classic implementation: Hadoop
- Pig: a dedicated language
- Hive: an SQL-like language
VIII. Common NoSQL Implementations
- Key-value
- Document
- Column family
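The map, combine, and reduce steps can be imitated in plain Python to show how the pieces fit together. This is a single-process sketch, not Hadoop code: the order data, the function names, and the idea of treating each inner list as one node's data are all assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical order aggregates, split across two "nodes" (one list per node).
partitions = [
    [
        {"items": [{"product": "tea", "qty": 2, "price": 3},
                   {"product": "coffee", "qty": 1, "price": 5}]},
        {"items": [{"product": "tea", "qty": 1, "price": 3}]},
    ],
    [
        {"items": [{"product": "coffee", "qty": 2, "price": 5}]},
    ],
]


def map_order(order):
    """Map: turn one aggregate into (key, value) pairs of product and revenue."""
    for item in order["items"]:
        yield item["product"], item["qty"] * item["price"]


def reduce_totals(a, b):
    """Reduce: combinable, because its output has the same form as its input."""
    return a + b


def group(pairs):
    """Collect the values of (key, value) pairs under their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_map_reduce(partitions):
    partials = []
    for partition in partitions:
        pairs = (pair for order in partition for pair in map_order(order))
        # Combiner: the reducer is applied locally to shrink what each node sends.
        local = {k: reduce(reduce_totals, vs) for k, vs in group(pairs).items()}
        partials.append(local)
    # Final reduce over the partial results gathered from all nodes.
    merged = group(pair for partial in partials for pair in partial.items())
    return {k: reduce(reduce_totals, vs) for k, vs in merged.items()}


totals = run_map_reduce(partitions)
```

Because `reduce_totals` is combinable, running it once per node and once more on the gathered partial results gives the same totals as a single global reduce.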
- HBase
- Cassandra
- Amazon SimpleDB
- Graph
Brief reading notes on "NoSQL Distilled"