Relational Database Design for Developers
Contents
1. Codd's Twelve Rules: The Origin of the RDBMS
2. Design Phases of a Relational Database
3. Design Principles
4. Naming Rules
Database design is a cornerstone of the success of a software project, yet many practitioners believe it is not really that important. Reality reflects this attitude: developers outnumber dedicated database designers many times over, and because most people only ever use a small subset of a database's features, it is easy to assume that database design is simple. In fact, database design is a discipline in its own right.
In my experience, I favor having developers do the database design early in the project (a DBA is needed for tuning later on). In my projects, developers who are proficient in OOP and ORM tend to produce database designs that are more reasonable and adapt better to changing requirements; my personal guess is that this is because database normalization shares ideas with parts of OO design (such as cohesion). DBAs, on the other hand, have the advantage of designing databases that exploit the full capabilities of the DBMS, using SQL and DBMS features to implement much of the program's logic, and a DBA-tuned database is usually more efficient and stable than a developer's. As the title suggests, this article shares one developer's database design experience; it does not involve complex SQL statements or DBMS-specific usage, and is therefore not limited to any DBMS product. I sincerely hope this article helps developers, and I also hope readers will point out anything I have missed.
1. Codd's Twelve Rules: The Origin of the RDBMS
Edgar Frank Codd is known as the "father of relational databases" and received the Turing Award in 1981 for his outstanding contributions to the theory and practice of database management systems. In 1985, Dr. Codd published twelve rules that succinctly define what a relational database is; they serve as design guidelines for all relational database systems.
- The information rule: all information in a relational database is represented in exactly one way, as values in tables.
- The guaranteed access rule: every data item is guaranteed to be accessible through a combination of table name, primary key value, and column name.
- Systematic treatment of null values: NULL is supported and handled in a systematic way, independent of data type.
- Dynamic online catalog based on the relational model: the database description is self-describing and is represented at the logical level in the same way as ordinary data; that is, the database must contain system tables that describe its own structure, or the database description must be held in tables the user can access.
- The comprehensive data sublanguage rule: a relational system may support several languages and various modes of terminal use, but there must be at least one language whose statements are expressible as strings with a well-defined syntax and that comprehensively supports data definition, view definition, data manipulation, integrity constraints, authorization, and transactions. (That language is SQL.)
- The view updating rule: all views that are theoretically updatable can also be updated by the system.
- High-level insert, update, and delete: the ability to treat a base relation or a derived relation as a single operand applies not only to data retrieval but also to insertion, modification, and deletion; that is, rows are treated as sets in insert, update, and delete operations.
- Physical data independence: applications and terminal activities remain logically unimpaired however the data's storage representation or access methods change.
- Logical data independence: applications and terminal activities remain logically unimpaired when information-preserving changes are made to the tables.
- Integrity independence: integrity constraints specific to a relational database must be definable in the relational data sublanguage and storable in the catalog, not in application programs.
- Distribution independence: whether the data is stored centrally or distributed, and however the distribution strategy changes, the RDBMS's data manipulation sublanguage must keep applications and terminal activities logically unimpaired.
- The nonsubversion rule: if a relational system has a low-level (single-record-at-a-time) language, that language cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher-level (set-at-a-time) language; in other words, users cannot violate database constraints by any means.
2. Design Phases of a Relational Database
(i) Planning phase
The main task of the planning phase is to analyze the necessity and feasibility of the database. Determine whether you need to use a database, what type of database to use, and which database product to use.
(ii) Conceptual phase
The main task of the conceptual phase is to collect and analyze requirements: identify them, chiefly the data entities and business rules. For a given system, the database holds both business data and non-business data, and the definition of the business data depends on the requirements analysis done at this stage. Identify as many business entities and business rules as possible, form a preliminary understanding of the whole system, and understand how data flows through it. In theory, this phase consults or produces several project documents, such as use case diagrams and data flow diagrams. If these artifacts can be produced at this stage, they will undoubtedly be of great help later, though many of them lie beyond the database designer's concerns. Also, if you are not well versed in the domain and the user's business, abandon the idea of completing the requirements analysis on your own. Users are not technical experts; when you cannot play the role of "business consultant", choose to work with the project team, or report the situation to the PM as a risk. To repeat: most of the time, users are industry practitioners, not professional technicians, so we only collect requirements from users; we do not rely on users' technical knowledge.
There are techniques you can use to document users' needs, though some of them may fall outside the database designer's responsibilities:
- Strive to maintain a range of documents containing system design and specification information, such as meeting minutes, interview records, key user expectations, functional specifications, technical specifications, and test specifications.
- Frequently communicate with stakeholders and collect feedback.
- Mark content you added yourself as pending confirmation, distinguishing it from the customer's stated requirements.
- Identify the project scope with all key stakeholders as soon as possible, and seek to freeze the requirements.
In addition, business rules must be rigorously handled and documented in detail. In the later stages, the design will be based on these business rules.
When the stage is over, you should be able to answer the following questions:
- What data do you need?
- How is the data used?
- What rules govern the use of data?
- Who will use what data?
- What do customers want to see on the core functionality interface or report?
- Where is the data now?
- Does the data interact, integrate, or synchronize with other systems?
- What are the main subjects of the data?
- How valuable is the core data, and how stringent are the reliability requirements?
And you should have obtained the following information:
- Entities and relationships
- Properties and Fields
- Business rules that can be enforced in the database
- Business processes that require the use of a database
(iii) Logical phase
The main task of the logical phase is modeling: drawing E-R diagrams. There are many modeling tools, with different methods and notations for presenting the diagrams, but the choice of tool or software is not critical, and I do not recommend spending much time on choosing a modeling method. For most applications, an E-R diagram is sufficient to describe the relationships among entities. The key to modeling is thought, not tools; software is only an aid. Identifying the entity relationships is the focus of this phase.
In addition to entity relationships, we should also consider each attribute's domain (value type, range, constraints).
(iv) Implementation phase
The implementation phase focuses on defining the tables of the E-R diagram for the chosen RDBMS, taking attribute types, ranges, and constraints into account.
(v) Physical phase
The physical phase is a validation and tuning phase: deploy the database on the actual physical devices, then test and tune it.
3. Design Principles
(i) Reduce reliance on database functionality
Functionality should be implemented by the program, not by the DB. The reason: if functionality is implemented in the DB and the DBMS is later replaced by one less powerful than its predecessor, some functionality may become impossible to implement and the code will have to be modified. To prevent this, functionality should be implemented by the program; the database is responsible only for storing data, achieving the lowest possible coupling.
(ii) Principles for defining entity relationships
When defining a relationship between an entity and another entity, you need to consider the following:
- Entities involved: identify all the entities that participate in the relationship.
- Ownership: consider whether one entity "owns" another.
- Cardinality: consider how many instances of one entity are associated with an instance of the other.
Relationships and the number of tables:
- A minimum of 1 table is required to describe a 1:1 relationship.
- A minimum of 2 tables is required to describe a 1:n relationship.
- A minimum of 3 tables is required to describe an n:m relationship (see the sketch below).
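A minimal sketch of the n:m case (the table and column names here are illustrative, not from the original text): an n:m relationship such as students and courses needs a third, junction table.

```sql
-- A 1:1 relationship fits in one table and a 1:n in two;
-- an n:m relationship needs a third (junction) table.
CREATE TABLE Student (
    StudentId int NOT NULL PRIMARY KEY,
    Name      nvarchar(50) NOT NULL
);

CREATE TABLE Course (
    CourseId int NOT NULL PRIMARY KEY,
    Title    nvarchar(50) NOT NULL
);

-- One row per (student, course) pair.
CREATE TABLE StudentCourse (
    StudentId int NOT NULL REFERENCES Student(StudentId),
    CourseId  int NOT NULL REFERENCES Course(CourseId),
    PRIMARY KEY (StudentId, CourseId)
);
```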
(iii) A column holds a single value
If you are representing the coordinates (0,0), you should use two columns rather than storing "0,0" in one column.
(iv) Order of the columns
The order of the columns does not matter to the table, but it is customary to order them as "primary key + foreign keys + entity data + non-entity data", which obviously makes the table more readable.
(v) Defining primary and foreign keys
A data table must define its primary key, and its foreign keys if any exist. Defining primary and foreign keys is not only an RDBMS requirement but also a development requirement: almost all code generators need this information to generate code for common methods (including SQL and references), so defining primary and foreign keys is a must during the development phase. It is a must during development specifically because many teams, after extensive testing to ensure that referential integrity has no major defects, remove all foreign keys from the production DB for optimal performance. My view is that foreign keys should be kept as long as performance is acceptable, and even when performance really is a problem, the SQL should be optimized rather than the foreign key constraints dropped. (A sketch follows.)
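A minimal sketch of this practice, using invented table names; the final ALTER statement shows how a team could still drop the constraint after testing, as described above.

```sql
CREATE TABLE Customer (
    CustomerId int NOT NULL PRIMARY KEY,
    Name       nvarchar(100) NOT NULL
);

CREATE TABLE CustomerOrder (
    OrderId    int NOT NULL PRIMARY KEY,
    -- The named foreign key documents the relationship and lets
    -- the DBMS enforce referential integrity.
    CustomerId int NOT NULL
        CONSTRAINT FK_CustomerOrder_Customer
        REFERENCES Customer(CustomerId)
);

-- Only after thorough testing, and only if performance truly demands it:
ALTER TABLE CustomerOrder DROP CONSTRAINT FK_CustomerOrder_Customer;
```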
(vi) Selecting keys
1. Artificial keys and natural keys
Artificial key: an unnatural attribute of the entity, imposed on it as needed by people, such as a GUID; it has no meaning for the entity itself. Natural key: a natural attribute of the entity, such as a citizen's ID number.
Benefits of the artificial key:
- The key value never changes.
- It is always a single column.
Disadvantages of the artificial key:
- Because an artificial key is a unique value with no real-world meaning, it cannot be used to prevent duplicate rows.
I recommend using artificial keys. The reasons are as follows:
- In the design phase we cannot predict what the code will really need, so simply give up guessing at natural keys and use artificial keys.
- An artificial key handles the complexity of entity relationships and takes no part in describing attributes, so it is highly decoupled from the entity and its content.
Another suggestion of mine: every table should also have a natural key that is meaningful to the user; in special cases no single such attribute may exist, in which case a composite key can be used. I do not use this key as the unique identifier in the program, but it comes in handy when querying the database directly.
Another disadvantage of artificial keys stems from query performance, so the form of the artificial key (the column's type) matters:
- An auto-increment value is a lightweight type and is more efficient to query, but its range of values is limited.
- A GUID is less efficient to query than a numeric type, but its supply of values is unlimited, and developers find it friendlier. (Both forms are sketched below.)
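A minimal sketch of the two forms in T-SQL (table names invented; NEWSEQUENTIALID() is one of several ways to default a GUID):

```sql
-- Auto-increment key: lightweight and fast to query,
-- but int tops out around 2.1 billion values.
CREATE TABLE ProductA (
    ProductId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Name      nvarchar(50) NOT NULL
);

-- GUID key: practically unlimited and easy to generate in application
-- code, but wider than int and slower to query.
CREATE TABLE ProductB (
    ProductId uniqueidentifier NOT NULL
        CONSTRAINT DF_ProductB_Id DEFAULT NEWSEQUENTIALID()
        PRIMARY KEY,
    Name      nvarchar(50) NOT NULL
);
```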
2. Smart keys and non-smart keys
Smart key: a key value that carries extra information, encoded according to an agreed coding scheme, so that certain facts can be read from the key value itself. Non-smart key: a simple, meaningless key value, such as an auto-increment number or a GUID.
The smart key is a double-edged sword. Developers favor key values that carry information, which are convenient for the program to work with; database administrators and designers hate smart keys, because a smart key is a latent risk to the database. As mentioned earlier, one design principle is that a combination of independently meaningful values must not be implemented as a single column; multiple independent columns should be used instead. The database designer would rather have developers obtain the smart key by concatenating several columns, used as a composite key where needed, instead of parsing apart the value of one column. Developers should accept this kind of design, but many do not appreciate its advantages. In my view, implementing a smart key as a single column is a risk: at design time we may fail to foresee that the coding rules will change later. For example, a segment of the smart key may run out of values, forcing a change of rule or length, and a change in the coding rules is destructive to the program's validation and smart-key parsing, which is the last thing operations staff want to see. So my advice: if you need smart keys, encapsulate them in the business logic layer (as read-only properties) and do not implement them in the persistence layer, thereby avoiding these problems.
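If the smart key must also be visible at the database level (rather than only as a read-only property in the business layer), a computed column is one way to keep its parts as independent columns; a minimal sketch with an invented encoding rule:

```sql
CREATE TABLE Part (
    PartId       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CategoryCode char(2) NOT NULL,  -- each segment stays independently queryable
    YearCode     char(2) NOT NULL,
    SeqNo        int     NOT NULL,
    -- The "smart" value is derived, never stored as the source of truth,
    -- so a change to the encoding rule touches only this one definition.
    PartNumber AS (CategoryCode + '-' + YearCode + '-'
                   + RIGHT('0000' + CAST(SeqNo AS varchar(10)), 4))
);
```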
(vii) Whether to allow NULL
About NULL we need to understand several of its characteristics (demonstrated in the sketch after this list):
- Concatenating any value with NULL yields NULL.
- Any mathematical operation involving NULL returns NULL.
- Once NULL is introduced, the logic becomes hard to handle.
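A minimal T-SQL demonstration of these behaviors:

```sql
SELECT 'abc' + NULL;  -- NULL: concatenation with NULL yields NULL
SELECT 1 + NULL;      -- NULL: arithmetic with NULL yields NULL

-- Returns 'not equal': NULL = NULL evaluates to UNKNOWN, not TRUE,
-- which is what makes the three-valued logic hard to reason about.
SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'not equal' END;
```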
So should we allow columns to be NULL? I think the answer is influenced by the development language. Take C# as an example: since nullable value types were introduced precisely to handle database values that may be NULL, allowing NULL costs the developer little extra effort. Note, however, that non-null validation must then be handled in the assemblies rather than by relying on the DBMS's NOT NULL constraints, and you must ensure that only complete data (with all required attributes assigned) reaches the DB. In a multi-layer system, we must define which zones are "safe zones", within which the data is guaranteed complete and clean.
(viii) Splitting attributes
One common misconception is that attributes map 1:1 to columns. For developers, we expose properties, not fields. For example, the entity "Employee" has a "Name" attribute; "Name" can be decomposed into "LastName" and "FirstName", and for developers the second data structure (with "LastName" and "FirstName" as two fields) is clearly preferable. So whether to split an attribute is also something to consider during design.
(ix) Normalization: normal forms
Back when I was in college, normal forms were the biggest headache in learning relational databases. Some readers may still not see their value, so simply put: normalization helps us guarantee the validity and completeness of the data. The purposes of normalization are as follows:
- Eliminate duplicate data.
- Avoid writing unnecessary code to keep duplicated data in sync.
- Keep tables lean, reducing the number of read operations needed to fetch data from a single table.
- Maximize the use of clustered indexes, enabling more optimized data access and joins.
- Reduce the number of indexes per table, because the cost of maintaining indexes is high.
Normalization is meant to pick apart complex entities and extract simpler entities from them. The process continues until every table in the database represents exactly one thing and every column describes that thing.
1. Normalizing entities and attributes (removing redundancy)
1NF: Each attribute should only represent a single value, not multiple values.
Several things to consider:
- Attributes are atomic: consider whether each attribute has been decomposed thoroughly enough that it represents a single value. (This is the same principle described in "(iii) A column holds a single value".) The rule of decomposition: when the parts must be processed separately, decompose the value, and decompose it just far enough for the use at hand. (Even if you do not currently need the attribute fully decomposed, consider possible future changes in requirements.)
- All instances of an entity must have the same number of values: an entity has a fixed number of attributes (the table has a fixed number of columns). When designing an entity, each attribute should hold a fixed number of values.
- All occurrences of the entity type (all rows) must be distinct.
Smells indicating that the current design violates 1NF (a before/after sketch follows this list):
- String data containing delimiter characters.
- Attribute names that end in numbers (Phone1, Phone2, and so on).
- Tables with no key defined, or with a badly defined key.
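A minimal before/after sketch of a 1NF fix, with invented names:

```sql
-- Smells: a delimited value and numbered column names.
CREATE TABLE EmployeeBad (
    EmployeeId int NOT NULL PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,  -- holds 'Last,First' in one column
    Skill1     nvarchar(50) NULL,
    Skill2     nvarchar(50) NULL
);

-- 1NF: atomic columns, and the repeating group moved to a child table.
CREATE TABLE Employee (
    EmployeeId int NOT NULL PRIMARY KEY,
    LastName   nvarchar(50) NOT NULL,
    FirstName  nvarchar(50) NOT NULL
);

CREATE TABLE EmployeeSkill (
    EmployeeId int NOT NULL REFERENCES Employee(EmployeeId),
    Skill      nvarchar(50) NOT NULL,
    PRIMARY KEY (EmployeeId, Skill)
);
```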
2. Relationships between attributes (removing redundancy)
2NF: the entity must conform to 1NF, and everything that each attribute describes must pertain to the entire key (this can be understood as the cohesion of a type's properties in OOP).
Smells indicating that the current design violates 2NF (a sketch follows this list):
- Repeated prefixes on key attribute names (redundancy beyond the design), suggesting that those values may describe some additional entity.
- Repeating groups of data (redundancy beyond the design), marking a functional dependency between attributes.
- A composite primary key without a foreign key, marking that parts of the key value may identify several different things rather than one thing.
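A minimal sketch of a 2NF violation and its fix (invented names): in OrderLineBad, ProductName depends on ProductId alone, which is only part of the composite key.

```sql
-- Violates 2NF: ProductName is determined by ProductId, not by
-- the whole (OrderId, ProductId) key.
CREATE TABLE OrderLineBad (
    OrderId     int NOT NULL,
    ProductId   int NOT NULL,
    ProductName nvarchar(50) NOT NULL,
    Quantity    int NOT NULL,
    PRIMARY KEY (OrderId, ProductId)
);

-- Fix: the partially dependent attribute moves to its own entity.
CREATE TABLE Product (
    ProductId   int NOT NULL PRIMARY KEY,
    ProductName nvarchar(50) NOT NULL
);

CREATE TABLE OrderLine (
    OrderId   int NOT NULL,
    ProductId int NOT NULL REFERENCES Product(ProductId),
    Quantity  int NOT NULL,
    PRIMARY KEY (OrderId, ProductId)
);
```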
3NF: the entity must conform to 2NF, and non-key attributes must not describe other non-key attributes. (Unlike 2NF, 3NF deals with the relationships among non-key attributes rather than their relationship to the key attributes.)
Smells indicating that the current design violates 3NF (a sketch follows this list):
- Multiple attributes share the same prefix.
- Repeating groups of data.
- Aggregated data whose referenced source lives in a completely different entity. (Some people like to use views for this; I prefer object collections, handled by the program.)
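A minimal sketch of a 3NF violation (invented names): DepartmentName describes DepartmentId, which is itself a non-key attribute.

```sql
-- Violates 3NF: DepartmentName depends on the non-key DepartmentId.
CREATE TABLE StaffBad (
    StaffId        int NOT NULL PRIMARY KEY,
    Name           nvarchar(50) NOT NULL,
    DepartmentId   int NOT NULL,
    DepartmentName nvarchar(50) NOT NULL
);

-- Fix: the transitive dependency gets its own table.
CREATE TABLE Department (
    DepartmentId   int NOT NULL PRIMARY KEY,
    DepartmentName nvarchar(50) NOT NULL
);

CREATE TABLE Staff (
    StaffId      int NOT NULL PRIMARY KEY,
    Name         nvarchar(50) NOT NULL,
    DepartmentId int NOT NULL REFERENCES Department(DepartmentId)
);
```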
BCNF: the entity satisfies 3NF, every attribute depends fully on a key, and every determinant is a key. (BCNF simply extends the earlier normal forms: an entity may have several candidate keys, and all attributes must depend on one of those keys. Put another way: each key must uniquely identify the entity, and each non-key attribute must describe the entity.)
3. Removing redundancy among entity key combinations
4NF: the entity must satisfy BCNF, and there can be at most one multivalued dependency (where a record is uniquely determined within the whole table by a combination of several values) between any attribute and the entity's key.
Smells indicating that the current design violates 4NF (a sketch follows this list):
- Ternary relationships (entity : entity : entity).
- Latent multivalued attributes (such as multiple phone numbers).
- Temporary data or historical values. (Historical data should be promoted to its own subject entity, or a lot of redundancy results.)
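A minimal sketch of pulling the multiple-phone-numbers example into its own entity (invented names):

```sql
-- Instead of HomePhone / WorkPhone / MobilePhone columns on Person,
-- the multivalued attribute gets its own table.
CREATE TABLE PersonPhone (
    PersonId  int NOT NULL,           -- references the Person table
    PhoneType nvarchar(20) NOT NULL,  -- 'home', 'work', 'mobile', ...
    PhoneNo   varchar(20) NOT NULL,
    PRIMARY KEY (PersonId, PhoneType)
);
```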
4. Try to decompose all relationships into binary relationships
5NF: the entity must satisfy 4NF and, wherever the decomposition is lossless, all relationships are decomposed into binary relationships.
5NF guarantees that any ternary relationship remaining in fourth normal form that can be decomposed into entities has been decomposed. Some ternary relationships can be split into binary relationships without losing information; when splitting into two binary relationships would lose information, the relationship is declared to already satisfy fourth normal form. Hence the recommendation of the fifth normal form: where possible, decompose an existing ternary relationship into three binary relationships.
Note that normalization results in more tables and more complex queries, so how far to take it depends on weighing performance against the data architecture. My suggestion is to normalize to fourth normal form, because judging 5NF is too fuzzy. For example: table X (teacher, student, course) is a ternary relationship that can be decomposed into table A (teacher, student), table B (student, course), and table C (teacher, course). Table X says that a certain teacher teaches a certain course to a certain student; table A says the teacher teaches the student; table B says the student takes the course; table C says the teacher teaches the course. No problem is visible from the tables alone, but at the data level, "table X = table A + table B + table C" does not necessarily hold; that is, the original data cannot always be rebuilt by joining the three tables, because the join may produce extra combinations, losing the business rule that table X conveyed. This phenomenon is easily overlooked in the design phase but usually surfaces during development, and it does not occur very often. (The sketch below reproduces it.)
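A minimal illustration of the join problem described above, with invented data; the projections A, B, and C cannot losslessly reconstruct X:

```sql
-- Table X: "teacher T teaches course C to student S".
CREATE TABLE X (Teacher varchar(20), Student varchar(20), Course varchar(20));
INSERT INTO X VALUES
    ('Smith', 'Ann', 'Physics'),
    ('Smith', 'Bob', 'Math'),
    ('Jones', 'Ann', 'Math');

-- Decompose X into the three binary projections A, B, C.
SELECT DISTINCT Teacher, Student INTO A FROM X;
SELECT DISTINCT Student, Course  INTO B FROM X;
SELECT DISTINCT Teacher, Course  INTO C FROM X;

-- Rejoining yields the spurious row ('Smith', 'Ann', 'Math'),
-- which is not in X: the business rule X carried has been lost.
SELECT A.Teacher, A.Student, B.Course
FROM A
JOIN B ON B.Student = A.Student
JOIN C ON C.Teacher = A.Teacher AND C.Course = B.Course;
```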
Recommended Practice:
- Adhere to the above normalization principles as much as possible.
- Every attribute description should embody the nature of the entity being modeled.
- There must be at least one key that uniquely identifies and describes the nature of the entity being modeled.
- Choose the primary key with caution.
- Normalize as much as possible in the logical phase (performance is not a consideration in the logical phase).
(x) Selecting data types (MS SQL 2008)
Common MS SQL types:
| Category | Notes | Types |
| --- | --- | --- |
| Exact numeric | No loss of precision. | bit, tinyint, smallint, int, bigint, decimal |
| Approximate numeric | Extreme values may lose precision. | float(n), real |
| Date and time | | date, time, smalldatetime, datetime, datetime2, datetimeoffset |
| Binary data | | binary(n), varbinary(n), varbinary(max) |
| Character (string) data | | char(n), varchar(n), varchar(max), nchar(n), nvarchar(n), nvarchar(max) |
| Arbitrary data | | sql_variant |
| Timestamp | | timestamp |
| GUID | | uniqueidentifier |
| XML | Do not try to use this type to circumvent 1NF. | xml |
| Spatial data | | geometry, geography |
| Hierarchical data | | hierarchyid |
Types that are deprecated or poor choices in MS SQL:
- image: replaced by varbinary(max).
- text and ntext: replaced by varchar(max) and nvarchar(max).
- money and smallmoney: awkward to work with during development; decimal is recommended instead.
Common type choices:
The most basic rule of type selection is to choose the lightest type that meets the need, because lighter types are faster to query.
| Scenario | Recommendation |
| --- | --- |
| bool | Use bit rather than char(1): development languages support it well, and it maps directly to bool/Boolean. |
| Large-value data | Use the smallest type that can hold all expected values; the larger the type, the slower the query. Use the max forms only when values exceed 8000 bytes. |
| Primary key | For an auto-increment key, choose int or bigint according to the expected range; for a GUID key, use uniqueidentifier rather than varchar(n). |
(xi) Optimizing for concurrency
When designing the DB, you should plan for concurrency; for example, the timestamp (rowversion) type in MS SQL is an excellent choice for detecting conflicting writes. (See the sketch below.)
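A minimal sketch of optimistic concurrency control with the timestamp (rowversion) type; the table and variable names are invented:

```sql
CREATE TABLE Account (
    AccountId int NOT NULL PRIMARY KEY,
    Balance   decimal(18,2) NOT NULL,
    RowVer    timestamp NOT NULL  -- updated by the server on every write
);

DECLARE @id int = 1,
        @newBalance decimal(18,2) = 100,
        @rowVer binary(8);

-- Read the row together with its current version.
SELECT @rowVer = RowVer FROM Account WHERE AccountId = @id;

-- Write back only if no one has changed the row since the read;
-- 0 rows affected means a concurrent update won.
UPDATE Account
SET Balance = @newBalance
WHERE AccountId = @id AND RowVer = @rowVer;
```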
4. Naming Rules
- Tables: "ModuleName_TableName". Table names are best kept singular, because when developing with an ORM framework, the code generator produces class definitions from the DB, and a table maps to the type definition of a single instance, not a collection of instances. Table names should also not be too long: one reason is that some software limits the maximum length of table names; another is that code generators often derive type names from table names, lazy people then use those names directly, and an overlong name is clearly inadvisable once it crosses a network boundary.
- Fields: name bool columns with "Is", "Can", "Has", and the like; date column names must contain "Date", and time column names must contain "Time".
- Stored procedures: use the "proc_" prefix.
- Views: use the "view_" prefix.
- Triggers: use the "trig_" prefix.
(Repost) Original address: http://www.cnblogs.com/MeteorSeed/archive/2013/03/27/2880054.html