NoSQL Data Modeling Techniques

Source: Internet
Author: User

This article was reproduced from: http://coolshell.cn/articles/7270.html

================================================

This is a full translation of the article "NoSQL Data Modeling Techniques", which comes from outside the firewall; please forgive any flaws in the translation. After reading it, you should have a better feel for NoSQL data structures. My own impression is that a relational database tries to do consistency, integrity, indexing, and CRUD well, whereas a NoSQL store does one thing well and sacrifices many others. Overall, I think NoSQL is better suited to serving as a cache. Here is the text:

NoSQL databases are often compared on non-functional characteristics such as scalability, performance, and consistency. These aspects have been studied extensively in both theory and practice, with research focusing on non-functional properties tied to performance and distribution, and the CAP theorem is widely applied to NoSQL systems. (Chenhao's note: CAP stands for Consistency, Availability, and Partition tolerance. In a distributed system, at most two of these three properties can be guaranteed at the same time, and NoSQL systems generally relax consistency.) NoSQL data modeling, on the other hand, is not well studied, because it lacks the kind of systematic theory that relational databases have. This article compares the NoSQL system families from a data modeling point of view and discusses several common data modeling techniques.

To discuss data modeling techniques, we first need a rough picture of how NoSQL data models have evolved, so that we can see how they relate to one another. The figure below shows the evolution of the NoSQL families: the key-value era, the BigTable era, the document era, the full-text search era, and the graph database era. (Chenhao's note: notice the remark about SQL in the figure: the road NoSQL is travelling leads back to SQL, haha.)


NoSQL Data Models

First, note that SQL and the relational model were designed long ago to interact with the end user. This user-oriented nature means that:

    • End users are generally more interested in aggregated reports than in individual data items, and SQL serves this need well.
    • Human users cannot be expected to control concurrency, integrity, consistency, or data type validity by hand. That is why SQL pays so much attention to transactional guarantees, schemas (two-dimensional table structures), and referential integrity.

Software applications, on the other hand, often do not need aggregation inside the database and can, in many cases, control integrity and validity themselves. Moreover, dropping these features is a great help for performance and distributed storage. This is where the evolution of data models began:

    • Key-value storage is a very simple yet powerful model, and many of the techniques described below build on it. Its fatal weakness is that it handles queries over a range of keys poorly. (Chenhao's note: anyone who has studied hash tables knows that a hash table is an unordered container; unlike arrays, linked lists, or queues, it does not let us control the order in which data is stored.) The ordered key-value model was designed to overcome this limitation, and it fundamentally improves the handling of data sets.
    • The ordered key-value model is also very powerful, but it provides no model for the value itself; the value can only be parsed and accessed by the application. This is inconvenient, which led to BigTable-style databases. Their data model is a map of maps of maps, that is, key-value pairs nested layer by layer (each value is itself a set of key-value pairs), where the levels are column families, columns, and timestamped versions. (Chenhao's note: timestamp-based versioning mainly addresses concurrency in data storage, that is, so-called optimistic locking; see "Multi-version concurrency control (MVCC) in distributed systems".)
    • Document databases improve on the BigTable model in two significant ways. First, a value can have a schema of arbitrary complexity rather than just a map of maps. Second, they provide indexes. Full-text search engines can be seen as a variant of document databases, since they also offer a flexible schema and automatic indexing. The difference is that a document database groups its indexes by field name, whereas a full-text search engine groups them by field value.
    • Graph databases can be regarded as a side branch that evolved from the ordered key-value model. They allow the data model to be expressed as a graph. They are related to document databases because many implementations allow a value to be a map or a document.
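
To make the "map of maps of maps" description of the BigTable model concrete, here is a minimal Python sketch. It is only an in-memory illustration; the table contents, the integer timestamps, and the helper name read_cell are assumptions made for this example, not the API of any real store.

```python
# row key -> column family -> column -> {timestamp: value}
# (all names and values below are made up for illustration)
table = {
    "user:1": {
        "profile": {
            "name": {1: "Ann", 5: "Ann B."},
            "city": {2: "Berlin"},
        },
    },
}

def read_cell(row_key, family, column, as_of=None):
    """Return the newest version of a cell, optionally no newer than `as_of`."""
    versions = table[row_key][family][column]
    candidates = [ts for ts in versions if as_of is None or ts <= as_of]
    return versions[max(candidates)] if candidates else None

latest_name = read_cell("user:1", "profile", "name")
older_name = read_cell("user:1", "profile", "name", as_of=3)
```

Each level of nesting is an ordinary key-value lookup, which is why this model can be seen as layered on top of the ordered key-value model.
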
NoSQL Data Model Summary

The rest of this article introduces specific data modeling techniques and patterns. Before doing so, a few preliminary remarks:

    • NoSQL data modeling typically starts from the application-specific queries, rather than from the relationships between data:
      • Relational modeling is basically driven by the structure of the data and the relationships within it. Its design theme is: "What answers do I have?"
      • NoSQL data modeling is basically driven by how the application accesses the data, that is, by the queries that must be supported. Its design theme is: "What questions do I have?"
    • NoSQL data modeling requires a deeper understanding of data structures and algorithms than relational modeling does. This article describes several well-known data structures that are not specific to NoSQL but are very helpful in NoSQL data modeling.
    • Data duplication and denormalization are first-class citizens.
    • Relational databases are inconvenient for hierarchical and graph-like data. NoSQL is a very good fit for graph data, and almost all NoSQL databases handle such problems well. That is why this article devotes a separate section to hierarchical data modeling.

The following NoSQL categories and products were used while writing this article:

    • Key-value stores: Oracle Coherence, Redis, Kyoto Cabinet
    • BigTable-style databases: Apache HBase, Apache Cassandra
    • Document databases: MongoDB, CouchDB
    • Full-text search engines: Apache Lucene, Apache Solr
    • Graph databases: Neo4j, FlockDB

Conceptual Techniques

This section focuses on the basic principles of the NoSQL data model.

(1) Denormalization

Denormalization can be defined as copying the same data into multiple documents or tables in order to simplify and optimize queries, or to fit the user's data into a particular data model. Most of the techniques described in this article rely on it to some degree.

In general, denormalization involves the following trade-offs:

    • Query data volume, or I/O per query, versus total data volume. With denormalization, all the data needed by a query can be grouped and stored in one place. This often means that the same data, when needed by different queries, has to be stored in several places. The result is a lot of redundant data and a larger total data volume.
    • Processing complexity versus total data volume. Joins in a normalized schema obviously increase the complexity of query processing, especially in distributed systems. A denormalized data model lets us store data in a query-friendly structure and thereby simplify query processing.
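
The trade-off above can be sketched in a few lines of Python. This is a hypothetical in-memory illustration; the customers and orders data and the function names are invented for the example.

```python
# Normalized: the customer name lives in exactly one place.
customers = {"c1": {"name": "Ann"}}
orders_normalized = [
    {"order_id": "o1", "customer_id": "c1", "total": 30},
    {"order_id": "o2", "customer_id": "c1", "total": 12},
]

# Denormalized: every order carries its own copy of the customer name.
orders_denormalized = [
    {**o, "customer_name": customers[o["customer_id"]]["name"]}
    for o in orders_normalized
]

def order_summary(order):
    """Answer the query from a single record; no join is needed."""
    return f'{order["order_id"]}: {order["customer_name"]} ({order["total"]})'

def rename_customer(customer_id, new_name):
    """The price of duplication: every copy has to be rewritten."""
    customers[customer_id]["name"] = new_name
    touched = 0
    for o in orders_denormalized:
        if o["customer_id"] == customer_id:
            o["customer_name"] = new_name
            touched += 1
    return touched
```

Reads touch one record, while a rename has to touch every order that holds a copy of the name.
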

Applicability: Key-value stores, document databases, BigTable-style databases.

(2) Aggregates

All major types of NoSQL databases provide a flexible schema (that is, loose restrictions on data structure and format):

    • Key-value stores and graph databases typically place no constraints on the value, so a value can have any format. This also lets us vary the set of records for a business entity by using composite keys. For example, a user account can be modeled as a set of entries with keys such as UserID_name, UserID_email, UserID_messages, and so on. If a user has no email or messages, the corresponding entry simply does not exist.
    • The BigTable model supports a flexible schema through a variable set of columns within a column family. It can also keep several versions of the same record, distinguished by timestamp.
    • Document databases are inherently schema-less, although some of them allow incoming data to be validated against a schema.

A flexible schema lets us store a group of related business entities as one record with a nested internal structure (Chenhao's note: an encapsulation format similar to JSON). This brings two benefits:

    • Fewer one-to-many relationships: entities can be nested, which means fewer joins.
    • The internal storage structure can stay close to the business entities, even heterogeneous ones, which can all be kept in one document collection or one table.

Both benefits are illustrated in the figure below, which depicts a product model from e-commerce. (Chenhao's note: this reminds me of the product classification database design challenge in the article "Challenges Everywhere".)

    • First, every product has an ID, a price, and a description.
    • Next, different types of products have different attributes. For example, author is an attribute of a book, and length is an attribute of jeans. Some of these attributes have a one-to-many or many-to-many nature, such as the tracks of a music album.
    • Finally, some entities cannot be modeled with fixed types. For example, the attributes of jeans are not the same across brands, and some brands have very specific attributes.

In a relational database, such a data model is not easy to design, and the result is definitely far from elegant. The flexible schema of NoSQL lets us use a single aggregate (product) to model all the different kinds of products and their different attributes:

Entity Aggregation
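
As a rough stand-in for the kind of aggregate described above, the following Python sketch stores three heterogeneous products in one collection. The field names and values are assumptions made for illustration, not the contents of the original figure.

```python
# One collection holds products of different types; each record nests
# whatever attributes its type needs (all values are illustrative).
products = [
    {"id": 1, "price": 12.5, "description": "A novel",
     "type": "book", "author": "J. Doe"},
    {"id": 2, "price": 9.9, "description": "An album", "type": "album",
     "tracks": [{"title": "Intro", "seconds": 61},
                {"title": "Outro", "seconds": 75}]},
    {"id": 3, "price": 40.0, "description": "Blue jeans", "type": "jeans",
     "length": 32, "brand_specific": {"fit": "slim"}},
]

COMMON_FIELDS = ("id", "price", "description")

def attribute_names(product):
    """List the type-specific attributes of one product aggregate."""
    return sorted(k for k in product if k not in COMMON_FIELDS)

# The one-to-many "tracks" relationship is read without any join.
album_seconds = sum(t["seconds"] for t in products[1]["tracks"])
```
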

Comparing this with the relational approach makes the difference clear. However, denormalized storage has a significant impact on both the performance and the consistency of updates, and this is the trade-off that needs the most attention.

Applicability: Key-value stores, document databases, BigTable-style databases.

(3) Application Side Joins

Joins are rarely supported by NoSQL databases. As noted earlier, NoSQL modeling is question-oriented rather than answer-oriented, and the lack of joins is a consequence of that: joins are handled at design time rather than at query execution time. Joins at query time are very expensive (Chenhao's note: anyone who has worked with SQL joins knows what a Cartesian product is; see the earlier CoolShell article "Database Table Joins Illustrated"), but with denormalization and aggregates, for example nested entities, joins can mostly be avoided. Of course, if data still has to be joined, the join must be done in the application. The main use cases are:

    • Many-to-many relationships, which often have to be modeled with links and therefore require joins.
    • Entities whose fields change frequently, for which aggregates are a poor fit. Frequently changing fields should be split out into separate records and joined at query time. For example, a messaging system could model a user entity with its messages nested inside it. But if messages are appended constantly, it is better to store messages as separate entities and join them to the user at query time.
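
The user and message example can be sketched as follows. This is a minimal illustration in which two Python collections stand in for two separate stores; all names and data are hypothetical.

```python
# Users and messages are stored separately because messages change often.
users = {"u1": {"name": "Ann"}, "u2": {"name": "Bob"}}
messages = [
    {"id": "m1", "user_id": "u1", "text": "hello"},
    {"id": "m2", "user_id": "u2", "text": "hi"},
    {"id": "m3", "user_id": "u1", "text": "bye"},
]

def user_with_messages(user_id):
    """Join on the application side: two lookups, combined in code."""
    user = users[user_id]                       # lookup 1: the user entity
    own = [m["text"] for m in messages          # lookup 2: that user's messages
           if m["user_id"] == user_id]
    return {"name": user["name"], "messages": own}
```

Appending a message only adds one small record, and the cost of combining the data moves to read time.
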

Applicability: Key-value stores, document databases, BigTable-style databases, graph databases.

General Modeling Techniques

In this section, we discuss general data modeling techniques that apply to a variety of NoSQL implementations.

(4) Atomic Aggregates

Many NoSQL databases (though not all) have limited transaction support. In some cases, transactional behavior can be achieved with distributed locks or application-managed MVCC (Chenhao's note: see "Multi-version concurrency control (MVCC) in distributed systems"), but in general aggregates are the only technique available for guaranteeing some of the ACID properties.

One reason relational databases need a powerful transaction mechanism is that normalized data is stored in several different places. Aggregates, in contrast, let us store a business entity as a single document, a single row, or a single key-value pair, so that it can be updated atomically:


Atomic Aggregates

Of course, atomic aggregates as a data modeling technique are not a complete transactional solution, but if the store provides atomicity guarantees, locks, or test-and-set instructions, atomic aggregates can be applied.
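
The following sketch shows how a whole aggregate might be updated with a test-and-set style operation. VersionedStore is a hypothetical single-process class written for this example; a real store would have to perform the compare-and-set step atomically on the server side.

```python
class VersionedStore:
    """Toy key-value store with a compare-and-set operation (illustrative)."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version, value):
        """Write only if nobody changed the record since it was read."""
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False
        self._data[key] = (current_version + 1, value)
        return True

store = VersionedStore()
store.compare_and_set("order:1", 0, {"status": "new", "items": ["book"]})

# Read the whole aggregate, change it locally, write it back in one step.
version, order = store.get("order:1")
updated = {**order, "status": "paid", "items": order["items"] + ["pen"]}
first_write = store.compare_and_set("order:1", version, updated)

# A second writer holding the old version is rejected instead of clobbering.
stale_write = store.compare_and_set("order:1", version, {"status": "cancelled"})
```

Because the status and the item list live in one record, both change together or not at all.
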

Applicability: Key-value stores, document databases, BigTable-style databases.

(5) Enumerable Keys

Perhaps the greatest benefit of an unordered key-value model is that entries can be partitioned across multiple servers simply by hashing the key. Sorted keys make things more complex, but sometimes an application can benefit a lot from ordered keys even if the store itself does not provide this feature. Consider modeling email messages as an example:

    1. Some NoSQL stores provide atomic counters that can generate sequential IDs. In this case, messages can be stored with UserID_MessageID as a composite key. If the latest message ID is known, it is possible to walk back through earlier messages, and from any given message ID to reach the messages before and after it.
    2. Messages can be grouped into buckets, for example one bucket per day. This lets us traverse the messages for a specified time period.
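
The first option can be sketched as below. A plain dict stands in for an unordered key-value store and itertools.count stands in for the store's atomic counter; both, along with the key format and function names, are assumptions made for illustration.

```python
import itertools

store = {}      # unordered key-value store (stand-in)
counters = {}   # per-user counters (stand-in for an atomic counter)

def put_message(user_id, text):
    """Store a message under UserID_MessageID and remember the latest ID."""
    counter = counters.setdefault(user_id, itertools.count(1))
    message_id = next(counter)
    store[f"{user_id}_{message_id}"] = text
    store[f"{user_id}_latest"] = message_id
    return message_id

def latest_messages(user_id, n):
    """Walk backward from the latest ID using only point lookups."""
    latest = store.get(f"{user_id}_latest", 0)
    ids = range(latest, max(latest - n, 0), -1)
    return [store[f"{user_id}_{i}"] for i in ids]

for text in ("a", "b", "c"):
    put_message("u1", text)
```

No range scan is needed: because the IDs are sequential, the neighbors of any message can be computed from its key.
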

Applicability: Key-value stores.

(6) Dimensionality Reduction

Dimensionality reduction is a technique that maps multidimensional data onto a key-value model or another non-multidimensional model.

Traditional geographic information systems use structures such as a quadtree or an R-tree for geo-indexing. These structures have to be updated in place, which is expensive when data volumes are large. An alternative is to traverse the two-dimensional structure and flatten it into a plain list. A well-known example is Geohash. A Geohash scans the two-dimensional space along a Z-shaped route, and each move of the traversal can be encoded as 0 or 1, so that a string of bits is produced along the way. The figure below illustrates the algorithm. (Chenhao's note: first divide the map into four parts, with the longitude bit first and the latitude bit second. For longitude, the left half is 0 and the right half is 1; likewise for latitude, the upper half is 1 and the lower half is 0. Together they give the four values 01, 11, 00, and 10, which identify four regions. Each region is then divided into four again, recursively, which yields a string of 1s and 0s. Finally, the bits are encoded in Base32 using the 32 characters 0-9 and b-z with a, i, l, and o removed, giving for example an 8-character code. That is the Geohash algorithm.)


Geohash Index

An important feature of Geohash is that the distance between two regions can be estimated from how close their codes are, using simple bitwise operations, as shown in the figure (Chenhao's note: look at the two adjacent boxes; this works much like IP addresses). Geohash turns two-dimensional coordinates into a one-dimensional data model, which is exactly what dimensionality reduction means. For dimensionality reduction with BigTable, see [6.1] in the references at the end of the article. More about Geohash and related techniques can be found in [6.2] and [6.3].
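
The bit interleaving described in the note above can be sketched in Python as follows. This is a minimal illustration of the commonly used Geohash encoding; the function name geohash_encode and its parameters are chosen for this example and do not come from the article.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # 0-9 and b-z without a, i, l, o

def geohash_encode(lat, lon, length=8):
    """Encode a latitude/longitude pair as a Geohash string of `length` chars."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # bits alternate, starting with longitude
    while len(bits) < length * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:          # right half -> 1
                bits.append(1)
                lon_lo = mid
            else:                   # left half -> 0
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:          # upper half -> 1
                bits.append(1)
                lat_lo = mid
            else:                   # lower half -> 0
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon
    chars = []
    for i in range(0, len(bits), 5):  # every 5 bits become one Base32 char
        value = 0
        for b in bits[i:i + 5]:
            value = (value << 1) | b
        chars.append(BASE32[value])
    return "".join(chars)

code = geohash_encode(57.64911, 10.40744, length=6)
```

A shorter code is always a prefix of a longer code for the same point, so a shared prefix means that two points fall into the same enclosing cell. Points that lie close together but on opposite sides of a cell boundary can still have different prefixes.
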

Applicability: Key-value stores, document databases, BigTable-style databases.

(7) Index Table

An index table is a very straightforward technique that brings the benefits of indexes to stores that do not support indexes internally. The most important class of such stores is the BigTable-style database. The idea is to create and maintain a special table whose keys follow the access pattern. For example, suppose there is a master table of user accounts that is accessed by user ID. A query that finds all users in a given city can be supported by an additional table in which the city is the key and the IDs of all users in that city are the value, as follows:


Index Table Example

As the figure shows, the city index table has to be kept consistent with the master user table, so each update of the master table may require an update of the index table, either immediately or in a batch. Either way, this costs some performance, because the two tables must stay consistent.
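
A minimal sketch of such an index table, maintained by the application on every write, might look like this. Two Python dicts stand in for the master table and the index table; the names put_user and users_in_city are hypothetical.

```python
users = {}        # master table: user_id -> record
city_index = {}   # index table:  city -> set of user_ids

def put_user(user_id, name, city):
    """Write to the master table and keep the index table in step."""
    old = users.get(user_id)
    if old is not None and old["city"] != city:
        city_index[old["city"]].discard(user_id)  # drop the stale index entry
    users[user_id] = {"name": name, "city": city}
    city_index.setdefault(city, set()).add(user_id)

def users_in_city(city):
    """Answer the query from the index table alone."""
    return sorted(city_index.get(city, set()))

put_user("u1", "Ann", "Berlin")
put_user("u2", "Bob", "Berlin")
put_user("u1", "Ann", "Paris")   # moving a user touches both tables
```

Every write now costs two updates, which is the performance price mentioned above. In a real distributed store the two writes are usually not atomic, so the index can lag behind the master table.
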

An index table can be considered an analog of a materialized view in a relational database.

Applicability: BigTable-style databases.

(8) Composite Key Index

A composite key is a very generic technique, and it is especially beneficial when the store supports ordered keys. Combining a composite key with secondary sorting makes it possible to build a kind of multidimensional index, which is fundamentally similar to the dimensionality reduction technique described above. For example, suppose we need to access user statistics. To analyze the distribution of users by region, we can design keys in the format State:City:UserID, which lets us iterate over the records for a particular state or city, provided the store supports selecting key ranges by partial key match (as BigTable-style systems do):

    SELECT Values WHERE state="CA:*"
    SELECT Values WHERE city="CA:San Francisco*"


Composite Key Index
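
The partial key match can be sketched with a sorted list and binary search. This is an in-memory illustration; the rows, the prefix_scan helper, and the use of "\uffff" as an upper bound for the prefix are assumptions made for the example.

```python
import bisect

# Rows sorted by composite key State:City:UserID (illustrative data).
rows = sorted([
    ("CA:San Francisco:u1", "Ann"),
    ("CA:San Francisco:u7", "Bob"),
    ("CA:San Jose:u3", "Cid"),
    ("NY:New York:u2", "Dee"),
])
keys = [k for k, _ in rows]

def prefix_scan(prefix):
    """Return the values of all rows whose key starts with `prefix`."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")  # just past the prefix range
    return [v for _, v in rows[lo:hi]]

in_state = prefix_scan("CA:")
in_city = prefix_scan("CA:San Francisco:")
```

Because the keys are ordered, each query reads one contiguous slice, and the order of the key components decides which queries are cheap.
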

Applicability: BigTable-style databases.

(9) Aggregation with Composite Keys

Composite keys can be used not only for indexing but also for different kinds of grouping. Consider an example: there is a huge array of log records with information about Internet users and the sites they came from. The goal is to count the number of unique users per site, which in a relational database would correspond to the following SQL query:

    SELECT count(distinct(user_id)) FROM clicks GROUP BY site

We can create the following data models in NoSQL:


Counting Unique Users using Composite Keys

In this way, the data is kept sorted by user ID, so all records for one user can be processed together (a single user does not generate too many events) and duplicate sites can be removed with a hash table or any other method. An alternative is to keep one entry per user and append each site to that entry as events arrive, although this makes updates less efficient.
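
The counting procedure can be sketched as below, assuming keys of the form UserID:Site held in sorted order. The key format and sample events are assumptions for illustration; a real log would usually carry an event ID in the key as well.

```python
from itertools import groupby

# Log records keyed by "UserID:Site", kept in key order by the store.
events = sorted([
    "u1:siteA", "u2:siteA", "u1:siteA", "u1:siteB", "u3:siteB", "u2:siteA",
])

def unique_users_per_site(sorted_keys):
    """Count distinct users per site in one pass over user-ordered keys."""
    counts = {}
    for _, group in groupby(sorted_keys, key=lambda k: k.split(":")[0]):
        sites = {k.split(":")[1] for k in group}   # de-duplicate within one user
        for site in sites:
            counts[site] = counts.get(site, 0) + 1
    return counts

report = unique_users_per_site(events)
```

Only one user's events have to be held in memory at a time, which is what the ordering by user ID buys.
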

Applicability: Ordered key-value stores, BigTable-style databases.


(10) Inverted Search – Direct Aggregation

This is more a data processing pattern than a data modeling technique, but it does affect the data model. The main idea is to use an index to find the data that satisfies some criteria, and then to aggregate that data using the original representation or a full scan. Consider an example. As in the previous section, there are a lot of log records about Internet users and the sites they came from. Assume that each record contains a user ID, the categories the user belongs to (men, women, bloggers, and so on), the user's city, and the visited site. The goal is to count, for each user category, the unique users who meet certain criteria (site, city, and so on).

Clearly, the users who meet the criteria can be found efficiently with inverted indexes such as {Category -> [user IDs]} or {Site -> [user IDs]}. With such indexes, two or more sets of user IDs can be intersected or unioned (this is easy and fast if the user IDs are stored in sorted order). Generating the report by user category is more troublesome, however, because it would require a statement like this:

    SELECT count(distinct(user_id)) ... GROUP BY category

But with an inverted index this kind of query is inefficient when there are many categories. To deal with this, we can build a direct index of the form {UserID -> [Categories]} and use it to generate the report:


Counting Unique Users using inverse and Direct Indexes
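
The combination of the two kinds of index can be sketched as follows. The index contents and the report function are hypothetical in-memory stand-ins for illustration.

```python
# Inverted indexes: criterion value -> set of user IDs.
inverted_site = {"siteA": {"u1", "u2", "u4"}, "siteB": {"u3"}}
inverted_city = {"Berlin": {"u1", "u3", "u4"}, "Paris": {"u2"}}

# Direct index: user ID -> categories.
direct_categories = {
    "u1": ["men", "bloggers"],
    "u2": ["women"],
    "u3": ["men"],
    "u4": ["women"],
}

def report(site, city):
    """Find users via the inverted indexes, then aggregate via the direct index."""
    matching = inverted_site.get(site, set()) & inverted_city.get(city, set())
    counts = {}
    for user_id in matching:
        for category in direct_categories[user_id]:
            counts[category] = counts.get(category, 0) + 1
    return counts
```

The search step never looks at categories and the aggregation step never looks at sites or cities, so each index is used only for what it is good at.
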

Finally, keep in mind that randomly retrieving the records for each user ID can be inefficient. This can be addressed with batch query processing: for some sets of users, results can be precomputed for different query criteria.

Applicability: Key-value stores, document databases, BigTable-style databases.

Hierarchy Modeling Techniques

(11) Tree Aggregation

Trees, and even arbitrary graphs (with the help of denormalization), can be modeled as a single record or document.

    • This is very efficient when the whole tree is retrieved at once (for example, to display a blog post's tree of comments).
    • Searching within the tree and arbitrary access to its entries can be problematic.
    • Updates are inefficient in most NoSQL implementations (compared with storing nodes independently).


Tree Aggregation

Applicability: Key-value stores, document databases.

(12) Adjacency Lists

An adjacency list is a straightforward way to model a graph: each node is a separate record that contains arrays of its direct parents or children. This makes it possible to find nodes by the identifiers of their parents or children and, of course, to traverse the graph one hop per query. The approach is usually inefficient for deep or wide traversals and for retrieving the entire subtree of a given node.
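
The cost of hop-by-hop traversal can be seen in a small sketch. The node records and the subtree helper below are illustrative assumptions; each dictionary access stands in for one query against the store.

```python
from collections import deque

# Each node is a separate record holding its parent and direct children.
nodes = {
    "root": {"parent": None, "children": ["a", "b"]},
    "a": {"parent": "root", "children": ["a1", "a2"]},
    "b": {"parent": "root", "children": []},
    "a1": {"parent": "a", "children": []},
    "a2": {"parent": "a", "children": []},
}

def subtree(node_id):
    """Collect all descendants breadth-first; also count the lookups made."""
    found, lookups = [], 0
    queue = deque([node_id])
    while queue:
        record = nodes[queue.popleft()]  # one store lookup per visited node
        lookups += 1
        found.extend(record["children"])
        queue.extend(record["children"])
    return found, lookups

descendants, lookups = subtree("root")
```

Fetching a subtree costs one lookup per node in it, which is why this model suits local navigation better than whole-subtree queries.
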

Applicability: Key-value stores, document databases.


(13) Materialized Paths

Materialized paths help avoid recursive traversal of tree-like structures. The technique can be considered a kind of denormalization. The idea is to attribute each node with the identifiers of all its parents or children, so that all descendants or ancestors of a node can be determined without traversal:


Materialized Paths for EShop Category Hierarchy

This technique is especially helpful for full-text search engines because it turns a hierarchical structure into flat documents. The figure above shows that all products or subcategories within the Men's Shoes category can be retrieved with a short query that contains nothing but the category name.

Materialized paths can be stored as a set of IDs or as a single string of concatenated IDs. The latter makes it possible to search for nodes on a particular partial path with a regular expression. This option is illustrated in the figure below (the path includes the node itself):


Query materialized Paths using REGEXP
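
The string variant can be sketched with Python's re module. The category names, the "/" separator, and the in_branch helper are assumptions made for this example.

```python
import re

# Each node stores the full path from the root, including itself.
categories = {
    "shoes": "shoes",
    "mens": "shoes/mens",
    "boots": "shoes/mens/boots",
    "womens": "shoes/womens",
    "hats": "hats",
}

def in_branch(branch_path):
    """Return the nodes whose path equals or lies below `branch_path`."""
    pattern = re.compile(r"^" + re.escape(branch_path) + r"(/|$)")
    return sorted(name for name, path in categories.items()
                  if pattern.search(path))

mens_branch = in_branch("shoes/mens")
```

The trailing "(/|$)" keeps a branch such as "shoes/mens" from also matching a sibling whose name merely starts with the same characters.
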

Applicability: Key-value stores, document databases, search engines.

(14) Nested Sets

Nested sets are a standard technique for modeling tree-like structures. It is widely used in relational databases, but it is fully applicable to key-value stores and document databases. The idea is to store the leaves of the tree in an array and to map each non-leaf node to a range of leaves using start and end indexes, as shown here:


Modeling of ECommerce Catalog using Nested sets

This structure is very efficient for immutable data, because it has a small memory footprint and makes it possible to fetch all the leaves of a given node without traversing the tree. Inserts and updates, however, are costly, because adding one leaf requires an extensive update of indexes.
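
A read-only sketch of this layout is shown below. The catalog contents and the choice of half-open [start, end) ranges are assumptions made for illustration.

```python
# All leaves of the tree in one array, in tree order.
leaves = ["boots", "sneakers", "sandals", "caps", "beanies"]

# Each non-leaf node maps to a half-open [start, end) slice of `leaves`.
ranges = {
    "catalog": (0, 5),
    "shoes": (0, 3),
    "mens shoes": (0, 2),
    "hats": (3, 5),
}

def leaves_under(node):
    """Fetch every leaf below a node with one slice, no tree traversal."""
    start, end = ranges[node]
    return leaves[start:end]
```

Inserting a new leaf in the middle would shift the array and force the stored indexes of many nodes to be rewritten, which is the update cost mentioned above.
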

Applicability: Key-value stores, document databases.

(15) Nested Documents Flattening: Numbered Field Names

Search engines typically work with flat documents, that is, each document is a flat list of fields and values. The goal of data modeling here is to map business entities to plain documents, which can be challenging if the entities have a complex internal structure. A typical challenge is mapping documents with a hierarchical structure, that is, documents with nested documents inside. Consider the following example:


Nested Documents problem

Each business entity is a kind of resume: it contains a person's name and a list of skills with their levels. An obvious way to map such a hierarchical document to a plain document is to create Skill and Level fields. This model allows searching for a person by skill or by level, but queries that combine the two, like the one marked in the figure, fail. (Chenhao's note: because it is no longer clear whether "excellent" refers to math or to poetry.)

One solution is given in reference [4.6]. Each skill and its level are indexed as a numbered pair of fields, Skill_i and Level_i, so that each pair can be searched separately (the query uses OR to go through all possible field pairs):


Nested Document Modeling using numbered Field Names
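
The difference between the two flattenings can be sketched in plain Python, without a search engine. The resume data, the function names, and the max_fields limit are all assumptions made for illustration.

```python
resume = {"name": "John", "skills": [("Math", "Excellent"), ("Poetry", "Poor")]}

def flatten_naive(doc):
    """Two multivalued fields; the skill/level pairing is lost."""
    return {"name": doc["name"],
            "skill": [s for s, _ in doc["skills"]],
            "level": [l for _, l in doc["skills"]]}

def flatten_numbered(doc):
    """One skill_i / level_i field pair per nested entry."""
    flat = {"name": doc["name"]}
    for i, (skill, level) in enumerate(doc["skills"], start=1):
        flat[f"skill_{i}"] = skill
        flat[f"level_{i}"] = level
    return flat

def match_naive(flat, skill, level):
    return skill in flat["skill"] and level in flat["level"]

def match_numbered(flat, skill, level, max_fields=10):
    """OR over every possible numbered pair, as the text describes."""
    return any(flat.get(f"skill_{i}") == skill and flat.get(f"level_{i}") == level
               for i in range(1, max_fields + 1))

false_hit = match_naive(flatten_naive(resume), "Poetry", "Excellent")
correct_miss = match_numbered(flatten_numbered(resume), "Poetry", "Excellent")
```

The query has to enumerate as many field pairs as the largest document may contain, so its size grows with the data.
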

This approach does not scale well: for more complex structures it only increases code complexity and maintenance effort.

Applicability: Search engines.

(16) Nested Documents Flattening: Proximity Queries

The problem with nested documents can also be solved by another technique described in [4.6]. It uses proximity queries, which limit the acceptable distance between words in a document. In the figure below, all skills and levels are indexed in a single field called SkillAndLevel, and the query requires the words "Excellent" and "Poetry" to follow one another:


Nested Document Modeling using Proximity Queries
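
The idea behind a proximity query can be imitated on a token list. This sketch is not the query syntax of Lucene or Solr; the flatten and phrase_match helpers and the slop parameter are assumptions made for illustration.

```python
def flatten(skills):
    """Index all levels and skills as one token stream: level, skill, level, ..."""
    text = " ".join(f"{level} {skill}" for skill, level in skills)
    return text.lower().split()

def phrase_match(tokens, first, second, slop=0):
    """True if `second` follows `first` with at most `slop` tokens in between."""
    first, second = first.lower(), second.lower()
    positions = [i for i, t in enumerate(tokens) if t == first]
    return any(second in tokens[i + 1:i + 2 + slop] for i in positions)

tokens = flatten([("Math", "Excellent"), ("Poetry", "Poor")])
```

With no slack allowed, "Excellent" matches only the skill that directly follows it. Allowing too much slack would bring back the false match from the previous section.
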

[4.3] describes a successful case study of this technique on top of Solr.

Applicability: Search engines.

(17) Batch Graph Processing

Graph databases such as Neo4j are exceptionally good at exploring the neighborhood of a given node or the relationships between two or a few nodes. However, processing large amounts of graph data is inefficient, because performance and scalability for that kind of workload are not what graph databases are designed for. Distributed graph processing can be done with MapReduce and the message passing pattern, as described, for example, in one of my previous articles. This approach makes key-value stores, document databases, and BigTable-style databases suitable for processing large graphs.

Applicability: Key-value stores, document databases, BigTable-style databases.

References

Finally, here is a list of useful links related to NoSQL data modeling:

  1. Key-Value Stores:
    1. http://www.devshed.com/c/a/MySQL/Database-Design-Using-KeyValue-Tables/
    2. http://antirez.com/post/Sorting-in-key-value-data-model.html
    3. http://stackoverflow.com/questions/3554169/difference-between-document-based-and-key-value-based-databases
    4. http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html
  2. BigTable-style Databases:
    1. http://www.slideshare.net/ebenhewitt/cassandra-datamodel-4985524
    2. http://www.slideshare.net/mattdennis/cassandra-data-modeling
    3. http://nosql.mypopescu.com/post/17419074362/cassandra-data-modeling-examples-with-matthew-f-dennis
    4. http://s-expressions.com/2009/03/08/hbase-on-designing-schemas-for-column-oriented-data-stores/
    5. http://jimbojw.com/wiki/index.php?title=Understanding_hbase_and_bigtable
  3. Document Databases:
    1. http://www.slideshare.net/mongodb/Mongodb-schema-design-richard-kreuters-mongo-berlin-preso
    2. http://www.michaelhamrah.com/blog/2011/08/data-modeling-at-scale-mongodb-mongoid-callbacks-and-denormalizing-data-for-efficiency/
    3. http://seancribbs.com/tech/2009/09/28/modeling-a-tree-in-a-document-database/
    4. http://www.mongodb.org/display/Docs/schema+design
    5. http://www.mongodb.org/display/DOCS/Trees+in+MongoDB
    6. http://blog.fiesta.cc/Post/11319522700/walkthrough-mongodb-data-modeling
  4. Full Text Search Engines:
    1. http://www.searchworkings.org/blog/-/blogs/query-time-joining-in-lucene
    2. http://www.lucidimagination.com/devzone/technical-articles/Solr-and-rdbms-basics-designing-your-application-best-both
    3. http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html
    4. http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/
    5. http://blog.mgm-tp.com/2011/03/non-standard-ways-of-using-lucene/
    6. http://www.slideshare.net/Markharwood/proposal-for-nested-document-support-in-lucene
    7. http://mysolr.com/tips/Denormalized-data-structure/
    8. http://sujitpal.blogspot.com/2010/10/denormalizing-maps-with-lucene-payloads.html
    9. http://java.dzone.com/articles/Hibernate-search-mapping-entit
  5. Graph Databases:
    1. http://docs.neo4j.org/chunked/stable/tutorial-comparing-models.html
    2. http://blog.neo4j.org/2010/03/modeling-categories-in-graph-database.html
    3. http://skillsmatter.com/podcast/nosql/graph-modelling
    4. http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Schatz_MLG2010.pdf
  6. Dimensionality Reduction:
    1. http://www.slideshare.net/mmalone/scaling-gis-data-in-nonrelational-data-stores
    2. http://blog.notdot.net/2009/11/Damn-Cool-Algorithms-Spatial-indexing-with-Quadtrees-and-Hilbert-Curves
    3. http://www.trisis.co.uk/blog/?p=1287

(End of full text)
