1. Preface
In order to meet the requirements of big data scenarios, new architectures such as Hadoop and NoSQL, which differ completely from traditional enterprise platforms, are rapidly emerging. A fundamental revolution in the underlying technology inevitably affects the superstructure: data models and algorithms. Simply copying the traditional model, built on normal-form, structured relational databases, onto the new engines is a poor fit: it not only increases the difficulty and complexity of big data application development, but also fails to unlock the potential of the new frameworks.
How do you build a NoSQL-based data model? The open knowledge currently available for reference is either an empty slogan such as "denormalize", or a crude wide table (a "row" that puts every field the query and application need into one structured table with many columns), or else tool-specific implementation details for a particular scenario, such as how to design HBase row keys in HBase: The Definitive Guide. At the model architecture level there is no methodology to follow comparable to design patterns in programming.
When comparing different NoSQL databases, various metrics such as scalability, performance, and consistency are often used in addition to features. Since these aspects usually form the core motivation for NoSQL in the first place, they have been studied fairly well from both theoretical and practical points of view, and fundamental results for distributed systems, such as the CAP theorem, apply to NoSQL systems as well. NoSQL data modeling, on the other hand, has not been studied nearly as well and lacks the systematic theory found in relational databases.
In this article, I make a fairly simple comparison of the NoSQL system families from the point of view of data modeling, and briefly introduce several common modeling techniques.
2. NoSQL Data Model View
To explore data modeling techniques, we must start with a reasonably systematic view of the NoSQL data model, which can help us uncover trends and relationships. The following depicts an imaginary "evolution" of the major NoSQL system families, namely key-value stores, BigTable-style databases, document databases, full-text search engines, and graph databases:
First, we should note that, in general, SQL and the relational model were designed long ago for end-user interaction. This user-oriented nature has very deep consequences:
End users tend to be interested in aggregated report information rather than individual data items, so SQL does a lot of work to aggregate and summarize data.
Users who act as natural persons cannot be expected to explicitly control concurrency, integrity, consistency, or data type validity. That is why SQL focuses on transactional guarantees, schemas, and referential integrity.
Software applications, on the other hand, tend to have little interest in in-database aggregation and, at least in many cases, are able to control data integrity and validity themselves. Moreover, eliminating the impact of these features on performance and scalability is of paramount importance to the stores.
With this in mind, the evolution of the new data models proceeds as follows:
Key-value storage is a very simple but very powerful model. Many of the techniques described below are fully applicable to it.
One of the most serious drawbacks of the key-value model is that it handles poorly the scenarios that require processing ranges of keys. An ordered key-value model overcomes this limitation and significantly improves aggregation capabilities.
An ordered key-value model is very powerful, but it does not provide any modeling framework for values. In general, value modeling can be left to the application, but BigTable-style databases go further and model values as a map-of-maps-of-maps; to be precise, the levels are column families, columns, and timestamped versions.
Document databases bring two significant improvements to the BigTable model. First, values can be declared with arbitrarily complex schemas, not just a map-of-maps. Second, at least some products provide database-managed indexes. In this sense, full-text search engines can also be thought of as providing flexible schemas and automatic indexes. The main difference between them is that document databases group indexes by field name, whereas search engines group indexes by field value. It is also worth noting that key-value stores such as Oracle Coherence are gradually evolving toward document databases by adding indexes and in-database entry processors.
Finally, the graph data model can be thought of as the ordered key-value model evolving in another direction. Graph databases allow business entities to be modeled very transparently (this depends on that), while the hierarchy modeling techniques described later use different data models yet remain competitive in this area. Graph databases are closely related to document databases, because many implementations allow values to be modeled as maps or documents.
3. General Considerations for NoSQL Data Modeling
Unlike relational modeling, NoSQL data modeling usually starts from the queries of a specific application:
Relational modeling is typically driven by the structure of the data at hand. The design revolves around "What answers do I have?"
NoSQL data modeling is typically driven by the access patterns of a specific application, i.e., the types of queries that need to be supported. The design revolves around "What questions do I have?"
NoSQL data modeling often requires a deeper understanding of data structures and algorithms than relational database modeling does. In this article I describe several well-known data structures that are not specific to NoSQL, but are very useful in practical NoSQL modeling.
Data duplication and denormalization are first-class citizens.
Relational databases are not very convenient for modeling and processing hierarchical or graph data. Graph databases are obviously a perfect solution in this area, but in fact most NoSQL systems are also surprisingly good at such problems. That is why this article has a separate chapter on hierarchical data modeling.
Although data modeling techniques are largely independent of any particular implementation, these are the products I had in mind while writing this article:
Key-value stores: Oracle Coherence, Redis, Kyoto Cabinet
BigTable-style databases: Apache HBase, Apache Cassandra
Document databases: MongoDB, CouchDB
Full-text search engines: Apache Lucene, Apache Solr
Graph databases: Neo4j, FlockDB
4. Conceptual Techniques
This section describes the basic principles of NoSQL data modeling.
1. Denormalization
Denormalization can be defined as copying the same data into multiple documents or tables in order to simplify or optimize query processing, or to fit user data into a particular data model. Most of the techniques in this article leverage denormalization in one form or another.
In general, denormalization trades off along the following lines:
Query data volume or IO per query vs. total data volume. With denormalization, all the data needed by a query can be grouped together and stored in one place. This usually means that different queries over the same data will access different combinations of it. Hence the data has to be duplicated, which increases the total data volume.
Processing complexity vs. total data volume. Modeling-time normalization and the corresponding query-time joins obviously increase the complexity of the query processor, especially in distributed systems. Denormalization allows data to be stored in a query-friendly structure, simplifying query processing.
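To make the first tradeoff concrete, here is a minimal Python sketch (a plain dict stands in for the key-value or document store; all keys and field names are invented for illustration). The same user data is copied into two query-oriented records, so each query reads exactly one entry at the cost of extra storage:

# Hypothetical normalized source data.
user = {"id": "u1", "name": "Alice", "city": "Berlin"}
orders = [{"id": "o1", "user_id": "u1", "total": 30},
          {"id": "o2", "user_id": "u1", "total": 70}]

# Denormalized, query-oriented copies: each query reads one record,
# at the cost of storing the user's name twice.
store = {}
store["user_profile:u1"] = {"name": user["name"], "city": user["city"]}
store["user_orders:u1"] = {"name": user["name"],   # duplicated on purpose
                           "orders": [{"id": o["id"], "total": o["total"]}
                                      for o in orders]}

# "Show the order page for user u1" is now a single lookup, no join.
print(store["user_orders:u1"])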
Applicability: key-value stores, document databases, BigTable-style databases
2. Aggregates
All major NoSQL systems provide soft schema support in one form or another:
Key-value stores and graph databases typically do not constrain values, so values may come in arbitrary formats. Alternatively, a single business entity can be represented as multiple records with composite keys. For example, a user account can be modeled as a set of entities keyed by composite keys such as UserID_name, UserID_email, UserID_messages. If a user has no email or messages, the corresponding entries simply do not exist.
The BigTable model also supports soft schemas, because a column family is a variable set of columns and a cell can store an indefinite number of data versions.
Document databases are inherently schema-less, although some of them allow incoming data to be validated against a user-defined schema.
Soft schemas allow entities with complex internal structures (nested entities) to be formed, and also allow the structure of particular entities to vary. This brings two important conveniences:
Nesting entities minimizes one-to-many relationships and therefore reduces joins.
Heterogeneous business entities can be modeled with a single collection of documents or a single table; the soft schema masks the "technical" differences between such entities and their modeling.
We use the following example to illustrate these conveniences. It depicts the modeling of a product entity in the e-commerce domain. First we can assume that all products have an ID, a price, and a description. Looking further, we find that different types of products have different attributes: books carry author information, while jeans have a length attribute. Some of these attributes are by nature one-to-many or many-to-many, such as the tracks of a music album.
Further still, there may be entities that cannot be modeled with any fixed type at all. For example, the attributes of jeans are not fixed across brands, and every manufacturer produces jeans with an inconsistent set of attributes. Although all of these problems can be solved in a normalized relational data model, the solutions are very inelegant. A soft schema allows all types of products and their attributes to be modeled with a single aggregate (product), as sketched below:
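Here is a minimal sketch in Python (plain dicts stand in for a document collection; all field names are illustrative). Every product lives in the same collection even though each type carries different fields:

products = [
    {"id": 1, "price": 9.99,  "description": "A paperback novel",
     "type": "book",  "author": "J. Doe"},                      # books have an author
    {"id": 2, "price": 49.99, "description": "Slim-fit jeans",
     "type": "jeans", "length": 32, "fabric": "denim"},         # jeans have ad-hoc attributes
    {"id": 3, "price": 14.99, "description": "A music album",
     "type": "album", "tracks": ["Intro", "Song A", "Song B"]}, # one-to-many nested inside
]

# One "collection" models heterogeneous entities; queries simply ignore
# fields that a particular product type does not have.
albums = [p for p in products if p.get("type") == "album"]
print(albums[0]["tracks"])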
Keep in mind that embedded denormalized entities have a significant impact on the performance and consistency of update operations, so update flows deserve special attention.
Applicability: key-value stores, document databases, BigTable-style databases
3. Application-Side Joins
Few NoSQL solutions support joins. A consequence of the "question-oriented" nature of NoSQL is that joins are typically handled at design time, whereas the relational model handles joins at query-execution time. Handling joins at query time almost always carries a performance penalty, but in many cases joins can be avoided by using denormalization and aggregates, i.e., by embedding nested entities. Of course, in many cases joins are inevitable and must be handled by the application. The main use cases are:
Many-to-many relationships are often modeled by links and require joins.
Aggregates are often unsuitable for scenarios in which their internal entities are frequently modified. It is usually better to record what happens as new records and join all the records at query time, instead of changing existing values. For example, a messaging system could be modeled as a nested User entity containing Message entities. However, if messages are appended frequently, it may be better to extract Message as a standalone entity and join it to the user at query time, as in the sketch below:
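A minimal Python sketch of an application-side join (plain in-memory collections stand in for the store; entity and field names are invented for illustration):

# Independent entities: new messages are plain inserts, never in-place updates.
users = {"u1": {"name": "Alice"}}
messages = [
    {"id": "m1", "user_id": "u1", "text": "hello"},
    {"id": "m2", "user_id": "u1", "text": "world"},
]

def user_with_messages(user_id):
    """Application-side join: fetch the user, then fetch and attach messages."""
    user = dict(users[user_id])
    user["messages"] = [m for m in messages if m["user_id"] == user_id]
    return user

print(user_with_messages("u1"))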
Applicability: key-value stores, document databases, BigTable-style databases, graph databases
5. General Modeling Techniques
In this section we discuss general modeling techniques that apply to a variety of NoSQL implementations.
1. Atomic Aggregates
Many NoSQL solutions offer limited transactional support, and some offer none at all. In some cases transactional behavior can also be achieved with distributed locks or application-managed MVCC mechanisms, but it is common to model data using the aggregates technique in order to guarantee some of the ACID properties.
One reason why a powerful transaction mechanism is indispensable in relational databases is that normalized data typically needs to be updated in multiple places. Aggregates, on the other hand, allow a single business entity to be stored as one document, row, or key-value pair, so that it can be updated atomically:
Of course, as a data modeling technique, atomic aggregates are not a complete transactional solution, but if the store provides certain guarantees of atomicity, locks, or TAS (test-and-set) instructions, then atomic aggregates become applicable.
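As a rough illustration of the test-and-set idea (this is not any particular product's API; the store here is just an in-memory Python dict, and the version field is an assumption), a whole aggregate is re-read, modified, and written back only if its version has not changed:

import copy

store = {"account:u1": {"balance": 100, "orders": [], "version": 1}}

def update_aggregate(key, mutate):
    """Test-and-set loop: re-read, mutate a copy, and write back only if the
    version is unchanged (stands in for a store-provided CAS primitive)."""
    while True:
        current = store[key]
        candidate = copy.deepcopy(current)
        mutate(candidate)
        candidate["version"] = current["version"] + 1
        if store[key]["version"] == current["version"]:   # the "atomic" check
            store[key] = candidate
            return candidate

def place_order(agg):
    agg["orders"].append({"id": "o1", "total": 30})
    agg["balance"] -= 30

print(update_aggregate("account:u1", place_order))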
(Translator Note: Aggregate the business data that needs to be operated on transactionally into a data structure for which the NoSQL store, or the application, provides atomic operations. When using HBase, storing all of a user's business data in a single row is an application of this pattern.)
Applicability: key-value stores, document databases, BigTable-style databases
2. Enumerable Keys
Perhaps the greatest benefit of an unordered key-value data model is that entries can be partitioned across multiple servers simply by hashing the key. Ordering makes things more complicated, but even if the store does not provide such a feature, an application can sometimes take advantage of ordered keys. Let's take modeling e-mail messages as an example:
Some NoSQL stores provide atomic counters that generate sequential IDs. In this case messages can be stored under a composite key userID_messageID. If the latest message ID is known, it is possible to traverse backwards through earlier messages; likewise, for any given message ID, one can traverse forward or backward.
Messages can also be grouped into buckets, for example one bucket per day. This allows a mailbox to be traversed forward or backward starting from any specified date or from the current date.
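A minimal Python sketch of the composite-key idea (an ordinary dict plays the role of the ordered key-value store; the zero-padded counter format is an assumption made for illustration):

# Keys look like "<userID>_<messageID>" with a zero-padded, counter-generated ID,
# so lexicographic key order matches message order.
mailbox = {
    "u1_0000000001": "first message",
    "u1_0000000002": "second message",
    "u1_0000000003": "third message",
}

def messages_before(user_id, latest_id, count):
    """Walk backwards from a known message ID by scanning the ordered key space."""
    keys = sorted(k for k in mailbox if k.startswith(user_id + "_"))
    upper = keys.index("%s_%010d" % (user_id, latest_id))
    return [mailbox[k] for k in keys[max(0, upper - count):upper]]

print(messages_before("u1", 3, 2))   # the two messages preceding message 3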
Applicability: key-value stores
(Translator Note: By exploiting natural or business-dimension features of the key, random reads and writes can be turned into sequential ones, which improves traversal performance and also simplifies the application's write logic. However, beware of the impact of concurrent writes in distributed deployments and of over-coupling the key design to the business. A discussion of unordered versus ordered keys can be found in the schema design section of HBase: The Definitive Guide.)
3. Dimensionality Reduction
Dimensionality reduction is a technique that allows a multidimensional data model to be mapped onto a key-value model or another non-multidimensional model.
Traditional geographic information systems use some variant of a quadtree or an R-tree for indexing. These structures need to be updated in place, so the maintenance cost is significant when data volumes are large. Another approach is to traverse the two-dimensional structure and flatten it into a plain list of entries. A well-known example of this technique is the Geohash. A Geohash scans the two-dimensional space along a Z-shaped route, and each move is encoded as 0 or 1 depending on its direction; the bits produced by longitude moves and latitude moves are interleaved (in the original encoding illustration, black and red bits represent longitude and latitude, respectively).
An important feature of a Geohash is its ability to estimate the distance between regions from the proximity of their bit-wise codes. Geohash encoding makes it possible to store geographic information in simple, ordinary data models, such as ordered key-values that preserve spatial relationships. The dimensionality reduction technique for BigTable is described in [6.1]. More information on Geohashes and related techniques can be found in [6.2] and [6.3].
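The following toy Python sketch illustrates only the bit-interleaving idea; it is not a standards-compliant Geohash implementation (real Geohashes add base-32 text encoding and other details):

def interleave_encode(lat, lon, bits=16):
    """Toy Geohash-like encoding: repeatedly halve the latitude and longitude
    ranges and interleave one longitude bit and one latitude bit per step."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    code = ""
    for _ in range(bits):
        mid = (lon_range[0] + lon_range[1]) / 2          # longitude bit
        if lon >= mid:
            code += "1"; lon_range[0] = mid
        else:
            code += "0"; lon_range[1] = mid
        mid = (lat_range[0] + lat_range[1]) / 2          # latitude bit
        if lat >= mid:
            code += "1"; lat_range[0] = mid
        else:
            code += "0"; lat_range[1] = mid
    return code

# Nearby points share long key prefixes, so an ordered key-value store keeps them close.
print(interleave_encode(52.52, 13.40))   # Berlin
print(interleave_encode(52.50, 13.39))   # a point a few kilometres away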
Applicability: key-value stores, document databases, BigTable-style databases
(Translator Note: Storing data that is intrinsically multidimensional, such as a cube, in a one-dimensional key-value store by means of interleaved encoding is a very important modeling pattern. It guarantees that data adjacent in multidimensional space, at different zoom levels, is still stored contiguously, so traversals are efficient; at the same time, the prefix similarity between different keys matches their distance in space, so the "closeness" of positions can be judged simply from the ordering of the key values.
Its applications go far beyond representing geographic information: any data with multiple dimensional attributes at different granularities can use this technique. For example, offline sales transaction data usually carries both time and branch-hierarchy information. Users typically first query sales for a large region over a fairly long time span, and then narrow down the region and the time range step by step to drill into the details. The branch hierarchy and time then act like the two axes of longitude and latitude: by interleaving the time and branch-hierarchy encodings into the key, this scenario can be served very well. Keying or indexing on time alone or on branch alone cannot cope with it.)
4. Index Table
An index table is a very simple technique that provides index support on top of stores that do not support indexes internally. The most important class of such stores is the BigTable-style database. The idea is to create and maintain a special table whose keys follow the access pattern. For example, there is a master table that stores user accounts, accessible directly by user ID. Querying all users from a given city can then be supported by an additional table keyed by city:
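A minimal Python sketch (dicts stand in for the master table and the application-maintained index table; all names are illustrative):

# Master table: user records keyed by user ID.
users = {
    "u1": {"name": "Alice", "city": "Berlin"},
    "u2": {"name": "Bob",   "city": "Paris"},
    "u3": {"name": "Carol", "city": "Berlin"},
}

# Index table maintained by the application: city -> list of user IDs.
users_by_city = {}
for uid, u in users.items():
    users_by_city.setdefault(u["city"], []).append(uid)

def find_users_in(city):
    """Two reads instead of a full scan: index table first, then the master table."""
    return [users[uid] for uid in users_by_city.get(city, [])]

print(find_users_in("Berlin"))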
The index table can be updated on every update of the master table record, or in batch mode. Either way it incurs an additional performance penalty and raises data consistency issues.
An index table can be thought of as the analog of a materialized view in a relational database.
Applicability: BigTable-style databases
5. Composite Key Index
Composite keys are a very generic technique, but they are especially useful when keys are stored in order. A composite key combined with ordering makes it possible to build a kind of multidimensional index, which is similar in principle to the dimensionality reduction technique described above. For example, suppose we have a set of records, each of which is a user statistic. If we want to aggregate these statistics by the region the user comes from, we can use a key format such as (State:City:UserID). If the store supports selecting a key range by partial key match (as BigTable-style databases do), we can iterate over the records of a particular state or city:
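Sketched in Python, with a sorted dict standing in for an ordered store and startswith() standing in for a partial-key range scan (both are simplifications; the key values are invented):

# Ordered store simulated by sorted keys of the form "State:City:UserID".
stats = {
    "CA:LosAngeles:u17":   {"clicks": 4},
    "CA:SanFrancisco:u02": {"clicks": 9},
    "CA:SanFrancisco:u31": {"clicks": 2},
    "NY:NewYork:u05":      {"clicks": 7},
}

def scan_prefix(prefix):
    """Range scan by partial key match, e.g. everything in one state or one city."""
    return {k: stats[k] for k in sorted(stats) if k.startswith(prefix)}

print(scan_prefix("CA:"))               # all records for California
print(scan_prefix("CA:SanFrancisco:"))  # narrowed down to one city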
Applicability: BigTable-style databases
(Translator Note: When using BigTable-style databases such as HBase, it is simply a waste if the row key carries information from only one field. A composite key built from multiple fields not only solves the problem that key values must be unique, but also improves the performance of secondary lookups over subsets of the data.)
6. Aggregation with Composite Keys
Composite keys can be used not only for indexing, but also for different kinds of grouping. Let's look at an example. There is a huge array of log records with information about internet users and their visits to different websites (a clickstream). Our goal is to count the number of unique users for each site, which is similar to a SQL query of the form SELECT count(distinct(user_id)) FROM clicks GROUP BY site.
We can model this situation with composite keys that use the user ID as a prefix:
The idea is to keep all of the records for one user together, so that they can all be loaded into memory (one user does not produce too many events) and duplicate sites can be eliminated with a hash table or some other method. An alternative technique is to keep one entry per user and append the site to it each time an event arrives. However, in most implementations, modifying an entry is generally less efficient than inserting one.
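A minimal Python sketch of this scan-and-deduplicate idea (a sorted list of made-up keys plays the role of the ordered store; the key layout "UserID:SiteID:EventID" is an assumption):

# All events of one user (and within it, one site) are collocated by the key.
clicks = [
    "u1:siteA:e1", "u1:siteA:e2", "u1:siteB:e3",
    "u2:siteA:e4", "u2:siteC:e5",
]

unique_users_per_site = {}
prev_user, seen_sites = None, set()

def flush(sites):
    for s in sites:
        unique_users_per_site[s] = unique_users_per_site.get(s, 0) + 1

# One sequential scan; each user's "frame" is small enough to hold in memory.
for key in sorted(clicks):
    user, site, _event = key.split(":")
    if user != prev_user and prev_user is not None:
        flush(seen_sites)
        seen_sites = set()
    prev_user = user
    seen_sites.add(site)
if prev_user is not None:
    flush(seen_sites)

print(unique_users_per_site)   # {'siteA': 2, 'siteB': 1, 'siteC': 1}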
Applicability: ordered key-value stores, BigTable-style databases
7. Inverted Search - Direct Aggregation
This technique is more of a data-processing pattern than a data-modeling pattern; nevertheless, data models are affected by its use. The main idea is to use an index to find the data that satisfies some criteria, but to perform the aggregation with direct scans or full table scans. Let's consider an example. There is a set of log records with information about internet users and their visits to different websites (a clickstream). Assume each record includes a user ID, the categories the user belongs to (men, women, bloggers, etc.), the user's city, and the visited URL. Our goal is to find the audience that meets some criteria (URL, city, and so on) and to break down the users in that audience by the categories they belong to.
Clearly, the users that meet the criteria can be found very efficiently through inverted indexes such as {category -> [user IDs]} or {site -> [user IDs]}. With such inverted indexes one can intersect or union the corresponding sets of user IDs (this can be done very efficiently if user IDs are stored as sorted lists or bitmaps) to obtain the target audience. But an audience described by an aggregation query such as "count the users in each category" cannot be processed efficiently using inverted indexes if the number of categories is large. To cope with this, one can build a direct index of the form {user ID -> [categories]} and then iterate over it to build the final report. This architecture is illustrated below:
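The following Python sketch shows the two steps on toy data (all index names and IDs are invented): the inverted indexes select the audience, and the direct index drives the aggregation:

# Inverted indexes (criterion -> user IDs) and a direct index (user -> categories).
users_by_site = {"siteA": {"u1", "u2", "u4"}, "siteB": {"u2", "u3"}}
users_by_city = {"Berlin": {"u1", "u2"}, "Paris": {"u3", "u4"}}
categories_of = {"u1": ["Men"], "u2": ["Women", "Bloggers"],
                 "u3": ["Men"], "u4": ["Women"]}

# Step 1: use the inverted indexes to find the audience (set intersection).
audience = users_by_site["siteA"] & users_by_city["Berlin"]

# Step 2: aggregate the audience by category using the direct index.
report = {}
for uid in audience:
    for cat in categories_of[uid]:
        report[cat] = report.get(cat, 0) + 1

print(report)   # {'Men': 1, 'Women': 1, 'Bloggers': 1}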
Finally, we should take into account that randomly retrieving the records for each user ID in the audience may be inefficient. This problem can be tackled with batch query processing: some number of user sets can be precomputed (for different report criteria), and then all the reports for this batch of audiences can be computed with one full scan of the direct or inverted index tables.
Applicability: key-value stores, BigTable-style databases, document databases
(Translator Note: When you need to quickly find entities by attribute or category, an index is indispensable. The currently fashionable practice of building user profiles on big data cannot do without structures like {user -> [user tags]}.)
6. Hierarchy Modeling Techniques
8. Tree Aggregation
A tree, or even an arbitrary graph (with the help of denormalization), can be modeled as a single record or document.
This technique is efficient when the tree is accessed all at once (for example, when the entire comment tree of a blog post is read and displayed on the article's page), as in the sketch following these notes.
Searching for and accessing arbitrary entries may be problematic.
Updates are inefficient in most NoSQL implementations (compared with nodes stored as independent records).
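A minimal sketch (a single Python dict models the aggregated document; field names are illustrative): the whole comment tree of one post is read and rendered in a single access:

post = {
    "id": "post1",
    "title": "NoSQL modeling",
    "comments": [
        {"author": "Alice", "text": "Nice overview",
         "replies": [
             {"author": "Bob", "text": "Agreed", "replies": []},
         ]},
        {"author": "Carol", "text": "What about graphs?", "replies": []},
    ],
}

def render(comments, depth=0):
    """One read of the document is enough to show the whole tree."""
    for c in comments:
        print("  " * depth + "%s: %s" % (c["author"], c["text"]))
        render(c["replies"], depth + 1)

render(post["comments"])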
Applicability: key-value stores, document databases
(Translator Note: This is the home turf of document-oriented NoSQL. For scenarios without cross-tree joins, it not only delivers high read/write efficiency, but also supports local transactional applications well.)
9. Adjacency Lists
Adjacency lists are a straightforward way of modeling graphs: each node is modeled as an independent record that contains an array of direct ancestors or an array of descendants. This allows nodes to be searched for by the identifiers of their parents or children and, of course, allows a graph to be traversed one hop per query. This approach is usually inefficient for retrieving the entire subtree of a given node, whether the traversal is depth-first or breadth-first, as the sketch below illustrates.
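Here, plain Python dicts model the node records, and the subtree fetch shows the one-lookup-per-node cost (all names are illustrative):

nodes = {
    "electronics": {"parents": [],              "children": ["cameras", "phones"]},
    "cameras":     {"parents": ["electronics"], "children": ["dslr"]},
    "phones":      {"parents": ["electronics"], "children": []},
    "dslr":        {"parents": ["cameras"],     "children": []},
}

def subtree(node_id):
    """Breadth-first traversal: one lookup (one 'query') per visited node,
    which is exactly why fetching deep subtrees is expensive."""
    result, frontier = [], [node_id]
    while frontier:
        current = frontier.pop(0)
        result.append(current)
        frontier.extend(nodes[current]["children"])
    return result

print(subtree("electronics"))   # ['electronics', 'cameras', 'phones', 'dslr']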
Applicability: key-value stores, document databases
10. Materialized Paths
Materialized paths are a technique that helps to avoid recursive traversals of tree structures. It can also be seen as a kind of denormalization. The idea is to attribute each node with the identifiers of all its parents or children, so that all the ancestors or descendants of a node can be determined without traversal:
Because this technique turns hierarchical structures into flat documents, it is particularly useful for full-text search engines. As the example shows, all products and subcategories under the Men's Shoes category can be retrieved with a very short query that is simply the category name.
A materialized path can be stored either as a set of IDs or as a single string of concatenated IDs. The latter option allows regular expressions to be used to find nodes whose path satisfies a certain partial-path condition. This option is illustrated below (the path includes the node itself):
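A minimal Python sketch of the concatenated-string variant (category names and paths are made up):

import re

# Each record carries its full path of category IDs (the path includes the node itself).
records = [
    {"name": "Men's Shoes",   "path": "home:men:shoes"},
    {"name": "Running shoes", "path": "home:men:shoes:running"},
    {"name": "Sandals",       "path": "home:men:shoes:sandals"},
    {"name": "Women's Hats",  "path": "home:women:hats"},
]

def descendants_of(category):
    """All descendants are found without traversal: just match the path string."""
    pattern = re.compile(r"(^|:)%s(:|$)" % re.escape(category))
    return [r["name"] for r in records if pattern.search(r["path"])]

print(descendants_of("shoes"))   # ["Men's Shoes", 'Running shoes', 'Sandals']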
Applicability: key-value stores, document databases, search engines
11. Nested Sets
Nested sets are a standard technique for modeling tree-like structures. They are widely used in relational databases, but are equally applicable to key-value stores and document databases. The idea is to store the leaves of the tree in an array (Translator: each leaf corresponds to a positional index in the array) and to map each non-leaf node to a range of leaves, i.e., the array indexes of its first and last leaf, as shown below:
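Sketched in Python (leaf names and index ranges are invented for illustration):

# Leaves stored in an array; each non-leaf node keeps the [start, end] index
# range of its leaves, so membership is just an index-range check.
leaves = ["sandals", "boots", "sneakers", "hats", "scarves"]
nodes = {
    "shoes":       {"start": 0, "end": 2},
    "accessories": {"start": 3, "end": 4},
    "catalog":     {"start": 0, "end": 4},
}

def leaves_of(node):
    """All leaves of a node without traversing any tree structure."""
    r = nodes[node]
    return leaves[r["start"]:r["end"] + 1]

print(leaves_of("shoes"))   # ['sandals', 'boots', 'sneakers']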
This structure is very efficient for immutable data, because it has a small memory footprint and allows all the leaves of a given node to be fetched without traversing the tree. However, inserts and updates are rather costly, because adding one leaf causes a large number of index-range updates.
Applicability: key-value stores, document databases
(Translator Note: Applies to dictionary tables, historical log data tables sorted by date, and so on.) )
12. Nested Documents Flattening: Numbered Field Names
Search engines typically work with flat documents, i.e., each document is a flat list of field names and their values. The goal of data modeling is to map business entities onto such plain documents, but this becomes tricky if an entity has a complex internal structure. A typical difficulty is a hierarchical model, such as a document nested inside another document. Let's consider the following example:
Each business entity is a kind of CV that contains the person's name and enumerates the skills he or she has together with the corresponding skill levels. An obvious way to model such an entity is to create a flat document containing lists of Skill and Level fields. This schema allows a person to be searched for by skill or by level, but queries that combine the two fields easily lead to false matches, as illustrated. (Translator: a simple AND of the two conditions cannot tell which skill corresponds to which level.)
A way to solve this problem was proposed in [4.6]. The main idea of this technique is to combine each skill with its corresponding level into a pair and to index them as numbered fields Skill_i and Level_i. The query then has to enumerate all of these pairs at once (the number of OR'ed conditions in the query equals the maximum number of skills one person can have):
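The following Python sketch merely emulates such a query against a single flat document (the field names skill_i/level_i follow the technique; the matcher itself is not any search engine's API):

# A flat "CV" document where each (skill, level) pair gets its own numbered fields.
doc = {
    "name": "John",
    "skill_1": "poetry",  "level_1": "excellent",
    "skill_2": "cooking", "level_2": "average",
}

def matches(document, skill, level, max_pairs=10):
    """Emulates the OR'ed pairwise query: (skill_1=skill AND level_1=level)
    OR (skill_2=skill AND level_2=level) OR ... up to max_pairs terms."""
    return any(document.get("skill_%d" % i) == skill and
               document.get("level_%d" % i) == level
               for i in range(1, max_pairs + 1))

print(matches(doc, "poetry", "excellent"))   # True: the pair really co-occurs
print(matches(doc, "cooking", "excellent"))  # False: no false match across pairs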
This approach does not really scale, because query complexity grows rapidly as the number of nested structures increases.
Applicability: search engines
(Translator Note: When indexes cannot be used to speed up a query, especially when you have no idea how the data will be queried in the future and therefore cannot optimize at model design time, this is the last straw to clutch at: use a full-text search tool. At least it can find something within an acceptable time ^o^.)
13. Nested Documents Flattening: Proximity Queries
Another technique for handling nested documents is also described in [4.6]. The idea is to use proximity queries that limit the distance between words in the document to an acceptable range. In the example below, all skills and levels are indexed into one field called SkillAndLevel, and a query for "excellent" and "poetry" will only match entries where the two words are adjacent:
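The sketch below imitates such a proximity check in plain Python over the tokenized SkillAndLevel field; it is meant to show the idea, not an actual Lucene/Solr query:

# All skill/level words indexed into a single field; a proximity query requires
# the two terms to appear within a small distance of each other.
skill_and_level = "excellent poetry average cooking".split()

def near(tokens, word_a, word_b, max_distance=1):
    """True if word_a and word_b occur within max_distance positions of each other."""
    positions_a = [i for i, t in enumerate(tokens) if t == word_a]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b]
    return any(abs(a - b) <= max_distance for a in positions_a for b in positions_b)

print(near(skill_and_level, "excellent", "poetry"))   # True: adjacent pair
print(near(skill_and_level, "excellent", "cooking"))  # False: too far apart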
[4.3] describes a successful case of using this technique on top of Solr.
Applicability: search engines
(Translator Note: Same as above, but since the recall of such queries is higher, there can still be false matches and dirty data.)
14. Batch Graph Processing
Graph databases such as Neo4j are exceptionally good at exploring the neighborhood of a given node or the relationships between two or a few nodes. However, general-purpose graph databases are not very efficient at global processing of large graphs because of their limited scalability. Distributed graph processing can be implemented with MapReduce or the message passing pattern; one of my previous articles describes such a pattern. This approach makes it possible to process large graphs with key-value stores, document databases, and BigTable-style databases.
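As a rough illustration of the message-passing style (plain Python over an in-memory edge list, not a distributed implementation; node names are invented), each iteration is a full pass over all the nodes, the way one MapReduce superstep would be:

# Edges stored as plain records, as they would be in a key-value or BigTable-style store.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

# Iterative batch processing: repeat full passes until no node's value changes.
distance = {"a": 0}                     # hop counts, starting from node "a"
changed = True
while changed:
    changed = False
    for node, neighbors in edges.items():
        if node not in distance:
            continue
        for n in neighbors:
            if n not in distance or distance[n] > distance[node] + 1:
                distance[n] = distance[node] + 1
                changed = True

print(distance)   # hop counts from "a": {'a': 0, 'b': 1, 'c': 1, 'd': 2}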