https://mp.weixin.qq.com/s?__biz=MzAwMDU1MTE1OQ==&mid=2653550127&idx=1&sn= 93f79e007d757a2ba887e8dc9e6c47f6&chksm= 813a67b7b64deea1093304b21065d76073f0eb6e4b2923181211d0cb4159e1dc8c3104f7436b&scene=0&key= f9325dcb38245ddc7742d7cded4d414f3281d7d65fa874e0f7fc88b5427449e479341ba5713746a72604354fc9dee61561ea327a1ef14a3e92076444b 3223ba4f3c0af86cfdc217e49ff5008dfbb9cee&ascene=1&uin=mjgwmtewndqxng%3d%3d&devicetype= windows-qqbrowser&version=6103000b&lang=zh_cn&pass_ticket=dxc1954%2bk1sgtbnf0bfrohv9qhwlnnepi% 2bhwkn5vyuopmhizjz4o33viful%2fvdwv
The traditional relational database has been unable to cope with these problems, and it has exposed a lot of difficulties to overcome.
As a result, a variety of NoSQL (not only SQL) databases as a powerful complement to traditional relational data has developed rapidly.
This article will analyze some problems existing in the traditional database, and how to solve these problems by several kinds of NoSQL, hoping to provide some reference for the selection of storage technology in different business scenarios.
Disadvantages of traditional databases
The traditional database has several disadvantages as follows:
I/O is higher in big data scenarios because the data is stored on a row, even if only one of the columns is being operated on, and the relational database reads the entire row of data from the storage device into memory, causing the I/O to be high.
A row record is stored and the data structure cannot be stored.
table Structure Schema extension is inconvenient, such as to modify the table structure, you need to execute DDL (data definition Language), statement modification, during the modification will cause the lock table, some services are not available.
The full-text search function is weak, only the matching query of substring can be carried out under relational database, and when the data of the table becomes larger, the match of like query will be very slow, even in the case of index. Moreover, the relational database should not index text fields.
The ability to store and handle complex relational data is weak, and many applications need to understand and navigate the relationships between highly connected data to enable use cases such as social applications, referral engines, fraud detection, knowledge maps, life sciences, and it/networks.
However, traditional relational databases are not good at dealing with the relationships between data points. Their tabular data models and strict patterns make it difficult for them to add new or different kinds of association information.
NoSQL Solutions
NoSQL, which refers to non-relational databases, can be understood as a powerful complement to SQL.
In many aspects of NoSQL performance is much better than non-relational database, and often accompanied by some features of the missing, more common is the Transaction library transaction function is missing.
The four basic elements of the correct execution of database transactions are ACID as follows:
The following is a description of the disadvantages of the 5 major categories of NoSQL data for traditional relational databases and the solutions provided:
Column Database
A column database is a database of data stored in a column-dependent storage schema, and is primarily suitable for batch data processing and instant querying.
Corresponding to the row-based database, the data in the row-related storage architecture for spatial allocation, mainly suitable for small batches of data processing, often used in online transaction-type data processing.
The Column-column storage feature based on the column database can solve the problem of high-level relational database I/O in some specific scenarios.
Fundamentals
A traditional relational database is a row-based database, called a row database, and a column database stores data in columns.
There are two ways to put a table into a storage system, and most of us use row storage. The row storage method is to put each row into a contiguous physical location, much like a traditional record and file system.
The Columnstore method is to store data in a database as columns, similar to row storage. is a graphical interpretation of two storage methods:
Common Column Database
HBase: is an open source, non-relational, distributed database (NoSQL) that references Google's BigTable modeling and implements the Java programming language.
It is part of the Apache Software Foundation's Hadoop project and runs on the HDFS file system, providing Hadoop with services similar to BigTable size. As a result, it can store large amounts of sparse data in a fault-tolerant manner.
BigTable: is a compressed, high-performance, highly scalable data storage system based on Google file System,gfs for storing large-scale structured data for cloud computing.
Related Features
The advantages are as follows:
Efficient Storage Space utilization: the column database because of its different columns of data characteristics of the different algorithms to make it often more than the row database high compression ratio.
Ordinary row-type database generally compression rate of 3:1 to 5:1, while the compression rate of the column database is generally 8:1 to 30:1 or so.
It's more common to compress data through a dictionary table: that's what the table looks like. After data compression in the dictionary table, the strings in the table become numbers.
Because each string appears only once in the dictionary table, it achieves the purpose of compression (a bit like canonicalization and denormalized Normalize and Denomalize).
High Query efficiency: the same column that reads multiple data is efficient because the columns are stored together, and one disk operation can read all the specified columns of data into memory.
The benefits of Columnstore (and data compression) are illustrated by the execution of a query.
The following steps are performed:
Go to the dictionary table to find the string corresponding number (only one string comparison).
Match the number to the list, and the position on the match is set to 1.
A bitwise operation of the matching results of different columns is performed to obtain a record subscript that meets all the criteria.
Use this subscript to assemble the final result set.
The column database is also suitable for aggregation operations, and is suitable for large amounts of data rather than small data.
Disadvantages are as follows:
Not suitable for scanning small amounts of data.
Not suitable for random updates.
Not suitable for real-time operations that contain deletions and updates.
Single-row data is ACID, multi-line transactions, does not support normal rollback of transactions, supports I (isolation) Isolation (transactional serial commit), D (durability) persistence, does not guarantee A (atomicity) atomicity, C (consistency) consistency.
Usage Scenarios
Take HBase as an example to illustrate:
Large data volumes (100s terabytes of data), with the need for fast, random access.
write-intensive applications , with a large number of writes per day, and relatively small numbers of applications, such as IM history messages, game logs, and so on.
does not require complex query conditions to query the application of data, HBase only supports Rowkey-based queries, and for HBase, a single record or a small range of queries can be acceptable.
A wide range of queries may have a performance impact due to distributed reasons, and HBase does not apply to data models that have joins, multilevel indexes, and table relationships complex.
applications with very high performance and reliability requirements, because HBase itself has no single point of failure, the availability is very high.
An application that has a large amount of data and an unpredictable amount of growth that requires an elegant data extension for HBase to support online expansion, even if the volume of data is growing well over a period of time, can also be extended through HBase to meet functionality.
Storage of structured and semi-structured data.
K-v Database
Refers to a database stored using key values (Key-value), whose data is organized, indexed, and stored in the form of key-value pairs.
K-V storage is well suited for data that does not involve too much data-relational business relationships, and can effectively reduce the number of read and write disks, better read and write performance than SQL database storage, and solve the problem that relational databases cannot store data structures.
Common K-V Databases
Redis: is an open source, support network, memory-based, optional persistent key-value pair store database written in ANSI C.
Since June 2015, the development of Redis has been sponsored by Redis Labs, while its development was sponsored by Pivotal from May 2013 to June 2015.
Prior to May 2013, its development was sponsored by VMware. According to the monthly ranking website db-engines.com data, Redis is the most popular key-value pair store database.
Cassandra: Apache Cassandra (commonly referred to as c* in the community) is a set of open source distributed NoSQL database systems.
Originally developed by Facebook to store simple format data such as Inbox, Google BigTable's data model integrates with Amazon Dynamo's fully distributed architecture.
Facebook will Cassandra Open source in 2008, and since then, due to Cassandra's good scalability and performance.
It has been adopted by famous websites such as Apple,comcas,instagram,spotify,ebay,rackspace,netflix, and has become a popular distributed structured data storage scheme.
LevelDB: is a key/value pair (Key/value pair) embedded database management System programming library developed by Google, released with an open source BSD license.
Related Features
For Redis, for example, the K-V database has the following advantages:
extremely high performance: Redis can support more than 10W of TPS.
rich data types: Redis support includes string,hash,list,set,sorted Set,bitmap and Hyperloglog.
Rich features: Redis also supports features such as publish/subscribe, notifications, key expiration, and so on.
Disadvantages are as follows:
In particular, there is no guarantee that atomicity is a transactional operation for Redis because transactions do not support rollback (roll back), and because Redis's single-threaded model, the normal operation of Redis is atomic.
Most businesses do not need to strictly adhere to the ACID principles, such as the game real-time leaderboards, fan concerns and other scenarios, even if some data persist failure, in fact, the business impact is very small. Therefore, when designing the solution, you need to choose according to the business characteristics and requirements.
Usage Scenarios
Applicable scenarios:
Store user information (such as sessions), configuration files, parameters, shopping carts, and more. This information is typically linked to the ID (key).
Scenario Not applicable:
you need to query by value, not by key. There is no way to query a value in the Key-value database.
the relationship between data needs to be stored. data cannot be associated with two or more keys in the Key-value database.
requires the support of the transaction. cannot be rolled back when a failure occurs in the Key-value database.
Document Database
A document database (also known as a document database) is a database that is designed to store semi-structured data as a document. Document databases typically store data in JSON or XML format.
Because of the no-schema nature of the document database, arbitrary data can be stored and read.
Because the data format used is JSON or BSON, because the JSON data is self-describing, it is not necessary to define a field before use, to read a field that does not exist in the JSON, and not to cause SQL-like syntax errors, it can solve the problem that the relational database table structure Schema extension is inconvenient.
Common Document Databases
MongoDB: A document-oriented database management system, written by C + +, to address a large number of real-world problems in the application development community. In October 2007, MongoDB was developed by the 10gen team. Launched in February 2009.
CouchDB: Apache CouchDB is an open source database that focuses on ease of use and becomes a "fully embraced Web database".
It is a NoSQL database that uses JSON as the storage format, JavaScript as the query language, MapReduce and HTTP as the API.
One notable feature is multi-master replication. The first version of CouchDB was released in 2005, becoming the Apache project in 2008.
Related Features
Taking MongoDB as an example, the documentation database has the following advantages:
The new field is simple, do not need to like a relational database to execute DDL statements to modify the table structure, program code directly read and write.
easy to be compatible with historical data , for historical data, even without new fields, will not result in errors, only return null values, when the code is compatible with the processing.
easy to store complex data, JSON is a powerful description language that can describe complex data structures.
Compared with the traditional relational database, the disadvantage of the document database is that the transaction support of multiple data records is weaker, which is embodied as follows:
atomicity (atomicity), supports only single-line/document-level atomicity, does not support multiple lines, multiple files, multi-statement atomicity.
solation (Isolation), the isolation level only supports Read committed levels, which can result in non-repeatable read, Phantom read issues.
complex queries, such as join queries, are not supported, and the database needs to be manipulated multiple times if a join query is required.
MONGONDB also supports consistency (consistency) and durability (persistence) for multi-document transactions, although it is officially announced that MongoDB will formally launch multi-document ACID transaction support in version 4.0, and the final landing situation remains to be seen.
Usage Scenarios
Applicable scenarios:
The amount of data is large or the future will become large.
table structure is ambiguous, and the field is increasing, such as content management system, information management system.
Scenario Not applicable:
you need to add a transaction on a different document. The document-oriented database does not support transactions between documents.
complex queries, such as joins, are required between multiple documents .
Full-Text search engine
Traditional relational database mainly through index to achieve the purpose of fast query, in full-text search business, index is powerless, mainly reflected in:
The conditions of full-text search can be arranged arbitrarily, and if satisfied by index, the number of indexes is very large.
the fuzzy matching of full-text search, index can not meet, can only use like query, and like query is the whole table scan, the efficiency is very low.
The emergence of full-text search engine is to solve the problem of the weak full-text search function of the relational database.
Fundamentals
The technical principle of the full-text search engine, called "Inverted Index" (inverted), is an index method whose basic principle is to establish the index of the word to the document. In contrast to the "positive row index", the basic principle is to create a document-to-word index.
The following collections of documents are now available:
The positive row index is indexed as follows:
By visible, a positive row index is useful for querying document content based on the document name. The simple inverted index is as follows:
The inverted index with Word frequency information is as follows:
From the above, the inverted index is useful for querying document content based on keywords.
Common Full-text search engines
Elasticsearch: is a Lucene-based search engine. It provides a distributed, multi-tenancy, ability to full-text search with engine HTTP Web interface and no schema JSON files.
Elasticsearch is developed in Java and published as an open source according to the Apache License terms.
According to Db-engines rankings, Elasticsearch is the most popular corporate search engine, followed by Lucene-based Apache SOLR.
SOLR: an open source enterprise search platform for Apache Lucene projects. Its main functions include full-text search, hit-mark, faceted search, dynamic clustering, database integration, and rich text (such as Word, PDF) processing. SOLR is highly extensible and provides distributed search and index replication.
Related Features
Taking Elasticsearch as an example, the full-text search engine has the following advantages:
The query efficiency is high, and the massive data is processed in near real time.
Scalability, based on the cluster environment can be easily scale-out, can host petabytes of data.
High Availability, Elasticsearch cluster resiliency, they will discover new or failed nodes, reorganize and rebalance data to ensure that data is secure and accessible.
Disadvantages are as follows:
acid support is insufficient, data for a single document is ACID, transactions with multiple documents do not support normal rollback of transactions, support for I (isolation) isolation (based on optimistic locking mechanism), D (durability) persistence, does not support A ( atomicity) atomicity, C (consistency) consistency.
Weak support for complex multi-table association operations with foreign keys in similar databases.
read-write has a certain delay, write data, the fastest 1s can be retrieved.
update performance is low, the underlying implementation is to delete the data before inserting new data.
The memory footprint is large because Lucene loads the index part into memory.
Usage Scenarios
The applicable scenarios are as follows:
A distributed search engine and data analysis engine.
Full-Text Search, structured search, data analysis.
In the near real-time processing of massive data, we can distribute massive data to multiple servers to store and retrieve.
The scenario below is not applicable:
Graphics database
Graph database applies graph theory to store relationship information between entities. The most common example is the interpersonal relationship between social networks.
Relational databases are not good for storing "relational" data, and their queries are complex, slow, and exceed expectations.
The unique design of the graphic database makes up for this flaw, and solves the problem that relational database storage and processing complex relational data function is weak.
Common Graphics Database
neo4j: It is a graphical database management system developed by NEO4J,INC. Described by its developers as an ACID-compliant transactional database with native graph storage and processing, NEO4J is the most popular graphical database based on Db-engines rankings.
Arangodb: is a native multi-model database system developed by Triagens GmbH. The database system supports three important data models (key/value, document, graphics), which contains a database core and a unified Query Language AQL (arangodb query language).
The query language is declarative, allowing different data access patterns to be combined in a single query. Arangodb is a NoSQL database system, but AQL is similar to SQL in many ways.
Titan: is an extensible graphics database optimized for storing and querying graphs that contain tens of billions of vertices and edges distributed across multi-cluster clusters.
Titan is a transactional database that enables thousands of concurrent users to perform complex graphical traversal in real time.
Related Features
Taking neo4j as an example, NEO4J uses the concept of graph in data structures to model. The two most basic concepts in neo4j are nodes and edges.
Nodes represent entities, and edges represent relationships between entities. Both nodes and edges can have their own properties. Different entities are linked together through a variety of relationships to form complex object graphs.
For relational data, the storage structure of the two databases is different:
In neo4j, the storage node uses "Index-free adjacency", that is, each node has a pointer to its neighbor node, allowing us to find the Neighbor node in O (1) time.
In addition, according to the official statement, in the neo4j is the most important, that is, "first-class entities", so separate storage, which is conducive to the graph in the time of the traverse to improve speed, can also be easily traversed in any direction.
The advantages are as follows:
high performance, graph traversal is a unique algorithm of graph data structure, that is, starting from a node, according to its connection relationship, can quickly and conveniently find its adjacent nodes.
This method of finding data is not affected by the size of the data, because proximity queries always look for limited local data and do not search the entire database.
design flexibility, natural stretching characteristics of data structures, and unstructured data formats make the design of graph database much more scalable and flexible.
Because the nodes, relationships, and their attributes that increase as the requirements change, do not affect the normal use of the original data.
The agility of development , the intuitive data model, from the discussion of the need, to the development and implementation of the program, and ultimately the way it is stored in the database, does not seem to change or even be the same.
fully supports acid, unlike other NoSQL databases, NEO4J also has a full transaction management feature that fully supports ACID transaction management.
Disadvantages are as follows:
Has a limit on the number of nodes, relationships, and attributes that are supported.
Splitting is not supported.
Usage Scenarios
The applicable scenarios are as follows:
In some highly relational data, such as social networks.
Recommended engine. If we show the data in the form of graphs, it will be very useful to recommend the formulation.
The scenario below is not applicable:
Record large amounts of event-based data, such as log entries or sensor data.
Processing large-scale distributed data is similar to Hadoop.
Suitable for structured data that is stored in a relational database.
Binary data storage.
Summarize
The selection of relational databases and NoSQL databases often requires several indicators to be considered:
Common software system Database selection reference is as follows:
internal use of the management system , such as operating system, the data volume is small, concurrent volume, the preferred relationship-based.
large flow system, such as e-commerce single page, the background to consider the choice of relationship, the front desk consider selecting the memory type.
log-type system, the original data considering the choice of columns, log search consider the inverted index.
Search-based systems, such as in-site search, non-generic search, such as product search, the background to consider the relationship of choice, the foreground is considered to choose Inverted Index.
transactional systems, such as inventory, transactions, bookkeeping, consider the relationship type + cache + Consistency protocol.
offline computing, such as a large number of data analysis, consider the choice of columns or relational type can also.
Real-time calculation, monitoring, you can consider the choice of memory-based or column database.
In the design practice, we should be based on demand, business-driven architecture, regardless of the choice of Rdb/nosql/drdb, must be demand-oriented, the final data storage scheme is necessarily a comprehensive design of various tradeoffs.
References:
Learn architecture from 0--alibaba Li Yunhua
A ramble on NoSQL
Graphic database neo4j development in combat
9 Big Key-value Storage databases in the Big Data era
Transaction--redis Official documentation
How does MongoDB implement the acid of a transaction?
MySQL dirty read, virtual read, Phantom read
Comprehensive grooming of relational databases and NoSQL usage scenarios
Analysis of characteristics of column database
One minute to understand the column and row database
HBase Basic Concepts
NoSQL Databases, why we should use, and which one we should choose
Traditional relational database and distributed database Knowledge point
Chen Yaahua
Editor: Tao Jiarong, Sun Shujuan
NoSQL or SQL? This article speaks clearly