We design relational database schemas with a complete set of solutions, but NoSQL does not. Half a year ago I read this "SQL anti-mode" book, feel very good. Just start to notice, is there an inverse pattern for nosql? A good counter pattern can be used in our design schema to tell where traps and cliffs are. NoSQL propaganda often claims to be schemaless, which can make people misunderstand that they don't need to design schemas. But if you are not aware of the need to design a schema, the trap is always in the dark. This article sums up some other people's design schema errors that have their own deep pain.
NoSQL database is the most mainstream of the document database, the list of databases, key-value database. The three are representative mongodb,hbase and Redis respectively. If you compare NoSQL to a weapon, you can do this (MySQL is a typical relational database, as in the comparison): ">
MySQL was produced earlier, and with the lamp tide matured. Despite its lack of major improvements, the emerging Internet uses the most databases. Like the traditional kitchen knife, the structure is simple, hundreds of years without improvement. But without prejudice to a variety of knife, as long as there is one, can be competent in the kitchen most of the business. MySQL is the same, the core has been stable. But Çeku, tables, backups, monitoring, and so on are readily available. MongoDB is a new thing, providing more flexible schema,capped Collection, asynchronous submission, geo-indexing and other five-flower 10-color features. Like a Swiss army knife, not only can you use it as a knife, you can also open caps and cut nails. But he is no better than MySQL, because there is lack of time sharpening. One is the stability of the system itself, the second is the development, operation and maintenance needs more experience to be popular. HBase is a bullying elephant soldier. With the ecological environment of Hadoop, it can be extended well. But like soldiers, users need an elephant (Hadoop) to drive him. Redis is the representative of key-value storage, the simplest feature. Provides a random data store. Like a stick, there is no extra structure. But that is why he has a particularly good elasticity. Just like Wu empty-handed in the divine, can poke a broken day, small can into a needle. The gain and loss of document database
The relational model attempts to separate the database model from the database implementation, allowing the developer to break away from the underlying good operational data. But the author thought the relational model has the weakness in some application scenarios, now has to face.
SQL Vulnerability One: join must be supported. Because the data can not be duplicated. So the relational model of using paradigm will inevitably have a lot of joins. If the join is a table that is smaller than the memory is fine. But if large table joins or tables are distributed across multiple machines, join is a nightmare of performance. SQL Vulnerability two: computing and storage coupling. As a unified data model, the relational model can be used for both data analysis and online business. But the two emphasize high throughput, one that emphasizes low latency and has evolved completely different architectures. It is obviously inappropriate to abstract with the same set of models. Hadoop is about the part of the calculation. Mongodb,redis, etc. for online business. Both have abandoned the relational model.
Against these two nightmares. The main purpose of a document database, such as MongoDB, is to provide richer data structures to discard joins to accommodate online business. Of course, is not mongodb completely can not use join, can not take to do data analysis, discuss this is a matter of opinion. Therefore, the document database is not stronger than the relational database, because of the weak support for join, the function will be much weaker. When designing a relational model, it is usually only necessary to consider the direct relationship of data and define the data model. When designing a document Database model, it is also necessary to consider how to use the application. So designing a document database schema is more difficult than designing a relational model. In addition, because the document database transaction support is also relatively weak, the general NoSQL will only provide a row lock. This also adds to the difficulty of designing schemas. There is a lot of attention to be paid to the use of document database, this article only focuses on the part of model design.
Anti-pattern One: inertial thinking/following relation model
Relational models are classic models of data storage, and the benefits of using data model paradigms are obvious. However, because the document database does not support join (including foreign key constraints related to foreign keys) and other characteristics, the habitual use of the relational model sometimes there will be problems. Need to use the rich data model provided by the document database to deal with.
It is worth mentioning that the document database design and relational model is different, is flexible and diverse. For the same situation, you can design a variety of models that work, without the best model in the absolute sense.
The following illustration is a comparison of a relational model and a document model.
Relational Model VS document model
This is a blog data model, there are blog,user and other tables. The left side is the relational model, and the right side is the document model. This document model is not entirely reasonable and can be described as a "pros and cons" textbook.
Problem one: There are relational table symptoms that describe Many-to-many: The document database is stored in a pure relational table, for example:
Id |
user_id |
blog_id |
0 |
0 |
0 |
1 |
0 |
1 |
Such a table is also inappropriate in a relational model because the ID is redundant and can be solved with a federated primary key. However, in the document database, because a single primary key must be enforced, this design has to be done.
Harm:
Destroy the completeness of data. Since the ID is the primary key, there is no constraint on the data model to ensure that duplicate user_id,blog_id pairs are not present. Update deletion is a problem once the data has been duplicated. Too many indexes. Because it is a relational table, you must build an index on user_id and blog_id. Affect performance.
Solution: Using a document database typically handles Many-to-many approaches. Instead of creating a relational table, add a list field to one of the documents (such as user).
user_id |
User_name |
Blog_id[] |
...... |
0 |
Jake |
0,1 |
...... |
1 |
Rose |
1,2 |
...... |
Problem two: There is no distinction between "one-to-many relationship" and "many-to-many relationship" Symptoms: The relational model does not distinguish between "one-to-many" and "Many-to-many", and the relational model is only "One-to-many" for document databases. Just like this comment table:
comment_id |
user_id |
Content |
...... |
0 |
0 |
"NoSQL reverse mode is a good article" |
...... |
1 |
0 |
"Yes." |
...... |
If the whole model is such a "many-to-many" relationship needs rethinking.
Harm:
Additional indexes. If the client is known to be user_id, you need to obtain user information and comment information, and you need to perform two queries. One query needs to use an index. And to join themselves on the client side. This may have potential performance problems.
Solution: The core of the problem is whether you know user_id query two tables, or you know comment_id query two tables. If it is known that comment_id such a design is reasonable, but if it is known user_id to query, the relationship in the user table design more reasonable.
user_id |
User_name |
Comment_id[] |
...... |
0 |
Jake |
0,1 |
...... |
1 |
Rose |
1,2 |
...... |
Such a design, you can avoid an index. Likewise, for Many-to-many, the index can be avoided by reasonably arranging the location of the field.
The correct use of the occasion:
The relational model is a very successful data model, and reasonable use is very good. However, due to the characteristics of the document database, it is necessary to adjust the data model, which has the best flexibility, although the performance is not optimal. And also facilitates the conversion of relational databases.
Anti-mode two: reference client join everywhere
Symptom: The database design is full of xx_id word end, need a lot of manual join operation when querying. It involves this inverse pattern. As the relational model of the blog mentioned above, if you know blog_id query comments, you need to execute at least 3 queries and join manually.
Harm:
Manual join, Trouble and error prone. The document database does not support join and there is no foreign key guarantee. Therefore, the need to join in the client, such operations for software development is more cumbersome. Because there is no foreign key guarantee, the acquired ID cannot be guaranteed to have data in the database. In the process of the need to constantly judge, error prone. Multiple queries. If you have too many references, you need to query multiple times to find enough data. Originally the document database is very fast, but due to multiple queries, the database increased the pressure, the time to get all the data will also increase. Transaction processing is cumbersome. Document databases generally do not support transactions in the general sense, and only row locks are supported. If the document database has multiple connections. At the time of insertion, the transaction is a nightmare. Using a transaction in a document database requires a row lock for a large amount of processing. Too cumbersome, interested readers can search.
Solution: Use inline data structures appropriately. Because document databases support more complex data structures, you can convert references to inline data without creating a new table. This can solve some of the above problems, is a recommended scheme. Just like the previous blog example. Simplifies the five tables into two tables. When do I use inline? Generally think
Using inline can solve read performance problems and significantly reduce query times. You can simplify the data model and the relationship between the tables without affecting the flexibility. Transactions can be simplified to a single transaction when the correct use of the occasion:
A normalized usage scenario in which a document database is used by multiple applications. Because the database design is not able to estimate multiple applications for present and future queries, great flexibility is required. At this time, using references is more reliable than inline.
Anti-pattern three abuses in the inner union
One problem prevents the query's inline symptoms: frequently querying some inline fields and discarding other fields.
Harm:
No ID constraint: The use of inline fields and references is different and there is no ID constraint. Therefore, it is not possible to manage by ID (primary key), and it can be very inconvenient if you often need to operate inline objects separately. Index overrun: If the inline field is queried for criteria, an index is required. May result in an index overflow. Performance waste: Most document database implementations are stored in rows, which means that, although querying only one field, DB needs to take the entire row out of the disk. If the field is small enough, the document is large enough to be very uneconomical.
Solution: If the crux of the problem arises, consider using a reference instead of inline. The main purpose of inline characteristics is to improve performance, if the performance does not rise and fall, it is meaningless. If you have strong performance requirements, you can consider using duplicate data, which is also in the inner and exterior of the reference in the inline field. This can combine the performance advantages of inline and reference. The disadvantage is that data duplication, maintenance will be more cumbersome.
Problem two infinite expansion of inline symptoms: The List,map type's inline fields are expanding and there is no limit. Just like the previous mentioned blog's inline field comment. If there is no limit to the number of comment per blog, comment will expand indefinitely. Light affects performance and heavy inserts fail.
blog_id |
Content |
Comment[] |
...... |
0 |
“...” |
"NoSQL is a Good article", "Yes", "unlimited growth" ... |
...... |
Harm:
Insert failed. The document database has a maximum size for each record, and it also has the best recommended size. Generally not more than 4M. As the example just mentioned, if it is a popular blog post, the size of the comment can easily exceed 4M. The document will not be updated and the new comment cannot be inserted. Performance drag. Because the inline field expands, its size will be much larger than other parts, affecting the performance of other parts. And as a result, the record size changes frequently, and the data files within the file database may have a lot of fragmentation.
Solution: Set the maximum number or use a reference. or blog and comment example, can be comment from the blog into a sheet. If you consider performance, you can create a new field in the blog table, such as recent comments. This ensures both performance and the ability to prevent expansion.
blog_id |
Content |
Last_five_comment[] |
...... |
0 |
“...” |
"NoSQL model is a good article", "Yes", "up to 5" ... |
...... |
Problem three failed to maintain inline symptoms: The DBA wants to maintain inline fields separately, but it cannot.
Harm:
Permission management is difficult. The minimum granularity of database permissions management is the table. If you use inline technology, it means that the inline part must be managed with the same permissions as the other fields. There is no way to hide at the db level. It's hard to cut the table. If you find a table of large need to cut the table. This time is more tangled. If across, the choice of partion key, the failure of the index will become a problem. If you think that two tables, will be very good operation, is the overuse of the inline. Backup is difficult. Each table in a relational database can have a different backup strategy. But if it's inline, it's not going to be a backup. Workaround: When designing a database model, you need to consider the following maintenance operations, especially if the inline fields require no separate maintenance. Need to consult with the operational dimension. If the inline field has separate maintenance requirements, you can split it up as a reference.
The problem of the four-stare of the application of the inline symptoms: applications can be very good to run on the database. But it can be a hassle when new apps are plugged in. Because the data model is designed to take into account the query. So when there are new applications, new query access, it will be difficult to use the original model.
Harm:
New applications are difficult to access. When new applications try to use the same database, access is more difficult. Because the query is different, you need to adjust the data model to adapt. But the adjustment model will affect the original application. Integration is difficult. Different relational databases can be integrated together and used together. But for the document database, although the function can complement each other, but because of the difference of the inline data structure, it is also difficult to integrate. ETL difficult. Now most of the data analysis system uses the relational model, even though Hadoop does not use the relational model, but its hive common tool is also according to the relational model design.
Solution:
Use the Paradigm design database, that is, use references instead of inline. Or in the use of inline, to each inline object a globally unique key to ensure that it and the relationship model can have a direct mapping relationship, so as to improve the flexibility of the data model. such as blog table:
blog_id |
Content |
Comment[] |
...... |
0 |
“...” |
[{"id" =1, "content" = "NoSQL reverse mode is Good article"}, {"id" =2, "content" = "yes"} ...] |
...... |
Such designs can take advantage of the benefits of inline and map them to the relational model. It is necessary to manually maintain comment_id to ensure its global uniqueness.
Anti-mode four: Online computing
Symptom (s): There are some long-running query, and the index cannot be resolved because of aggregation calculations. As the data volume grows, it becomes a bottleneck of performance gradually.
Harm:
affect the user experience. In the online business, if a query is larger than 4s, the user experience drops dramatically. Both the primary key and the indexed query meet the requirements. But aggregation operations often need to scan the entire table or a large number of data, as the amount of data increase, query time will be longer, users can not tolerate. Affect database performance. The disadvantages of long queries are not clear. On-line applications, if a long query appears, it may occupy most of the data resources, including IO, connectivity, CPU and so on. Results in other good queries, lightweight performance also drops, and heavy users cannot use the database. Long queries can be called db killers.
Solution: The first thing to weigh is whether this aggregation operation is necessary and must be done in real time. If there is no need to do it in real time, you can take the offline action program. In the dead of night, run a long query, the results are cached for the next day to use. If you have to do this in real time, you can create a new field, and use the "INCR" operation to aggregate the results in real time when you run. Instead of a query, execute a long query. If the logic is more complex, or if a lot of "incr" operations bring pressure to the database system, you can use the real-time data processing framework such as Storm. In a word, be careful with long queries.
Anti-mode five: Use the key of an inline map object as an ID
Symptom (s): The document database supports the inline map type. Use the key of the map as the primary key for the database.
blog_id |
Content |
comment{} |
...... |
0 |
“...” |
{"1" = "NoSQL anti-mode is Good article", "2" = "Yes"} |
...... |
This inverse pattern is easy to make because the map data structure is used in programming languages. But for the database model, this is an uncompromising pattern.
Harm:
Unable to make various (><=) queries through the database. For relational databases, although data structures can be flexible, queries are layered. Like Comment.id,comment.content. That is, the key in its map type can be interpreted as a property name, not as an ID. So once this is used, it is out of the control of the database and cannot use a variety of query functions. Unable to pass index query. Document data can be indexed by the name of the column. Like Comment.id. This data structure cannot be indexed because it does not have a fixed column name.
Solution: Use array +map to resolve. Such as:
blog_id |
Content |
Comment[] |
...... |
0 |
“...” |
[{"id" =1, "content" = "NoSQL reverse mode is Good article"}, {"id" =2, "content" = "yes"} ...] |
...... |
This allows you to use Comment.id as an index, or you can use the query capabilities of the database. Simple and effective. The key in the map type is the property name, and value is the property value. Such usage is the intent of the document database data model, so that the various functions it provides can be exploited. Otherwise it cannot be used.
Anti-pattern Six: Unreasonable IDs
Symptom (s): use string or even more complex data structures as IDs, or all use the bootstrap ID provided by the database. Such as:
ID (this ID system is generated from) |
blog_id |
Content |
...... |
0 |
0 |
... |
...... |
Harm:
ID confusion. If you use the bootstrap ID provided by the database, and there is a blog_id similar to the primary key meaning in the table, this is not good enough to cause logical confusion. Because the document database does not support renaming of IDs, people who are accustomed to relational database practices may then establish their own logical ID fields. This is not necessary. The index is large and performance is low. IDs are a very important part of the database. The length of the ID determines the size of the index (including the primary key index) and affects database performance directly. If the index is smaller than memory, performance will be good. But once the index size exceeds memory, data exchange occurs, and performance drops dramatically. A long occupies 8 bytes, and a 20-character UTF8 string occupies about 60 bytes. The difference is 10 times times the huge, cannot but consider.
Solution: Try to use a field that has some meaning to do the ID and not repeat it in other fields. Do not use complex data types for IDs, only use Int,long or system-supplied primary key types for IDs.
Anti-pattern summarization of document database
With so many anti patterns, here's a checklist that covers all of the above patterns. This list is based on the document Database model. is an example of a document Database model.
Id |
Anti-schema name |
Problem |
0 |
There are relational tables that describe Many-to-many |
[{id:00 Symptom: The document database is stored in a pure relational table Disadvantage: [Destroy data completeness, index too much] Solution: Add a list field },{ Id:01 Symptom: Relational model does not differentiate between "one-to-many" and "Many-to-many" Disadvantage: Extra Index Solution: Reasonable arrangement of the location of the field }] |
1 |
Reference client Join Everywhere |
[{ Id:10 Symptom: A large number of manual join operations are required when querying Disadvantages: [manual join, multiple query, transaction processing cumbersome] Solution: Use inline data structures appropriately. }] |
2 |
Abuse of the inner Union |
[{ Id:20 Symptom: frequently query some inline fields, discard other fields Disadvantages: [No ID constraints, index flooding, performance waste] Solution: Use references instead of inline, allowing duplicate data },{ Id:21 Symptom (s): the inline field of the List,map type is expanding and there is no limit Disadvantages: [Insert failure, performance drag] Solution: Set the maximum number or use a reference. },{ Id:22 Symptom (s): DBA wants to maintain inline fields separately, but cannot do Disadvantages: [Authority management difficult, difficult to cut, backup difficult] Solution: When designing a database model, you need to consider the following maintenance operations },{ Id:23 Symptom: The application can run very well on the database. But it can be a hassle when new apps are plugged in. Inline Stare dead Application Disadvantage: [New application access difficult, integration difficult, ETL difficult] Solution: Use the paradigm to design a database, that is, a reference instead of inline. To ensure that the mapping relationship exists directly with the relational model }] |
3 |
Online Computing |
[{ Id:30 Symptom: There are some long-running query, which is becoming a performance bottleneck. Disadvantages: [Impact user experience, database performance impact] Solution: Eliminate unnecessary aggregation operations. When running, the results are aggregated in real time. Use third party real-time or non-real time tools. such as Hadoop,storm. }] |
4 |
To use the key of an inline map object as an ID |
[{ Id:40 Symptom (s): The document database supports the inline map type. Use the key of the map as the primary key for the database. Disadvantage: [Unable to do various (>< "" =) queries through the database, cannot be indexed query] Solution: Use array +map to resolve. }] |
5 |
Unreasonable ID |
[{ Id:50 Symptom (s): ID with string or even more complex data structure, or all using the bootstrap ID provided by the database. Disadvantages: [ID confusion, index size] Solution: Try to use a field that has some meaning to do the ID. Do not use complex data types for IDs. }] |
This paper attempts to sum up the important document database in the author's inverse mode. Now the discussion about NoSQL data Model design mode is just beginning, and may gradually become a system in the future. For the column database and Key-value model, I wait until there is enough accumulation, and then share with you.