Relational Database Design for Developers (MySQL)

Source: Internet
Author: User
Tags: SQL 2008
Database design is the cornerstone of a successful software project. Yet many practitioners believe database design is not that important, and the reality on the ground seems to bear them out: developers outnumber database designers several times over, and most people use only a small part of the database, so designing one looks easy. In fact, database design is a discipline in its own right.

In my experience, I prefer to have developers design the database in the early stages of a project (a DBA is needed for later optimization). Based on my project experience, a developer proficient in OOP and ORM often designs a more rational database that adapts better to changing requirements; if I had to guess at the reason, it is that database normalization resembles certain OO ideas (such as cohesion). A database designed by a DBA, on the other hand, has the advantage of fully exploiting the DBMS: much logic that would otherwise live in program code can be implemented with SQL and DBMS features, and a DBA-tuned database is more efficient and stable than a developer-designed one. As the title suggests, this article shares one developer's database design experience. It does not involve complex SQL statements or DBMS-specific features, so it is not limited to any particular DBMS product. I sincerely hope it helps fellow developers find and fix gaps in their own practice.

1. Codd's 12 rules: the origin of the RDBMS

Edgar Frank Codd is known as "the father of relational databases"; he won the Turing Award in 1981 for his outstanding contributions to the theory and practice of database management systems. In 1985, Codd published 12 rules that define what it means for a database to be relational, and they remain guiding principles for the design of relational database systems.

  1. The information rule: all information in a relational database is represented in exactly one way, as values in tables.
  2. The guaranteed access rule: every data item is addressable by a combination of table name, primary key value, and column name.
  3. Systematic treatment of NULL values: NULLs are supported and handled in a systematic way, independently of data type.
  4. Dynamic online catalog based on the relational model: the database description is self-describing at the same logical level as ordinary data; that is, the database must contain system tables describing its own structure, and these must be accessible to authorized users.
  5. The comprehensive data sub-language rule: a relational system may support several languages and modes of terminal use, but there must be at least one language whose statements are expressible as strings with a well-defined syntax and that fully supports data definition, view definition, data manipulation, integrity constraints, authorization, and transactions. (In practice, this language is SQL.)
  6. The view updating rule: all views that are theoretically updatable are also updatable by the system.
  7. High-level insert, update, and delete: the system treats a base or derived relation as a single operand not only for data retrieval but also for insertion, modification, and deletion; that is, rows are handled as sets in insert, update, and delete operations.
  8. Physical data independence: applications and terminal activities remain logically unimpaired no matter how the storage representation or access methods change.
  9. Logical data independence: applications and terminal activities remain logically unimpaired when information-preserving changes are made to the base tables.
  10. Integrity independence: integrity constraints specific to a relational database must be definable in the relational sub-language and stored in the catalog, not in application programs.
  11. Distribution independence: the data sub-language of the RDBMS must keep applications and terminal activities logically unimpaired whether the data is physically distributed or not, and whenever the distribution strategy changes.
  12. The nonsubversion rule: if the system provides a low-level (record-at-a-time) language, that language cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher-level (set-at-a-time) language; that is, no user may violate the database's constraints by any means.
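Rule 7 (high-level, set-based operations) is the one developers bump into most often. A minimal sketch, using Python's sqlite3 as a stand-in RDBMS: a single UPDATE statement operates on a whole set of rows, not one record at a time.

```python
import sqlite3

# Sketch of Codd's rule 7 (set-level operations), using SQLite for brevity.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0)])

# One set-based statement applies a 10% increase to every row at once.
updated = conn.execute("UPDATE product SET price = price * 1.1").rowcount
prices = [round(p, 2) for (p,) in
          conn.execute("SELECT price FROM product ORDER BY id")]
```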
2. Design stages

  (1) planning stage

The main task of the planning stage is to analyze the necessity and feasibility of using a database: whether to use a database at all, which type of database, and which database product.

  (2) concept stage

The main task of the concept phase is to collect and analyze requirements, above all to identify data entities and business rules. For a given system, the database holds both business and non-business data; the definition of the business data depends on the analysis of user requirements at this stage. Identify business entities and business rules as completely as you can, build a preliminary understanding of the system as a whole, and understand how data flows through it. In theory, this phase references or produces various documents, such as use case diagrams, data flow diagrams, and other project artifacts. Producing these now is undoubtedly a great help later, though many of them are beyond the database designer's remit. Also, unless you are expert in both this field and the user's business, give up the idea of completing the requirements analysis on your own. The user is not a technical expert; if you cannot play the role of "business consultant", work with the relevant members of the project team, or flag the situation to the PM as a risk. Again, in most cases users are industry practitioners, not technical professionals: we collect requirements from users, we do not rely on users' technical knowledge.

Some techniques help in recording user requirements, though this part may go beyond the database designer's responsibilities:

  • Efforts should be made to maintain a series of documents containing system design and specifications, such as meeting records, interview records, key user expectations, functional specifications, technical specifications, and test specifications.
  • Communicate frequently with stakeholders and collect feedback.
  • Mark content you have added yourself as pending confirmation; it is not something the customer asked for.
  • Confirm the scope of the project as soon as possible with all key stakeholders, and strive to freeze requirements.

In addition, business rules must be handled rigorously and recorded in detail; the design in subsequent stages will be based on them.

When this stage ends, you should be able to answer the following questions:

  • What data is required?
  • How should data be used?
  • Which rules control the use of data?
  • Who will use what data?
  • What do customers want to see on the core function interface or report?
  • Where is the data now?
  • Does data interact, integrate, or synchronize with other systems?
  • What are the subject areas of the data?
  • What are the reliability requirements for the core data?

And get the following information:

  • Entities and relationships
  • Attributes and domains
  • Business rules that can be enforced in the database
  • Business processes requiring database use

  (3) logic phase

The main work of the logic phase is to draw an E-R diagram, in other words, modeling. There are many modeling tools, with different notations and software; which one you use is not critical, and I do not recommend spending much time on the choice of modeling method. For most applications, an E-R diagram is sufficient to describe the relationships between entities. The key to modeling is the thinking, not the tools; software only plays an auxiliary role. Identifying entities and their relationships is the focus of this phase.

In addition to entity relationships, we should also consider the attribute fields: value type, range, and constraints.

  (4) implementation stage

The implementation phase mainly defines, for the selected RDBMS, the tables corresponding to the E-R diagram, taking attribute types, ranges, and constraints into account.
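A minimal sketch of this step, mapping a hypothetical E-R model (one department has many employees) to DDL. SQLite is used here only as a convenient stand-in for the target RDBMS; the table and column names are illustrative.

```python
import sqlite3

# Types, NOT NULL, UNIQUE, and CHECK constraints encode the attribute
# domains decided in the logic phase.
ddl = """
CREATE TABLE department (
    department_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE
);
CREATE TABLE employee (
    employee_id   INTEGER PRIMARY KEY,
    department_id INTEGER NOT NULL REFERENCES department(department_id),
    name          TEXT NOT NULL,
    salary        REAL CHECK (salary >= 0)   -- domain constraint
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
tables = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
```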

  (5) physical stage

The physical stage is one of verification and optimization: the database is deployed on actual physical hardware for testing and tuning.

3. Design principles

  (1) reducing dependency on database functions

Functionality should be implemented by the program rather than the DB. The reason: if a function is implemented in the DB and a replacement DBMS later turns out to be less capable than the previous one, we would have to rewrite code. To guard against this, implement functionality in the program and let the database be responsible only for data storage, achieving the lowest coupling.

  (2) principles for defining object relationships

When defining the relationship between an object and other entities, consider the following:

  • Entities involved: identify all entities involved in the relationship.
  • Ownership: consider whether one entity "owns" another.
  • Cardinality: the number of instances of one entity associated with the other.

Relationships and the number of tables

  • Describing a 1:1 relationship requires at least one table.
  • Describing a 1:n relationship requires at least two tables.
  • Describing an n:n relationship requires at least three tables.
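These table counts can be sketched concretely. A minimal example in SQLite (illustrative entity names): the 1:n case puts a foreign key on the "n" side, and the n:n case introduces a third junction table holding pairs of foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 1:n (one author, many posts): two tables
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post   (post_id   INTEGER PRIMARY KEY,
                         author_id INTEGER REFERENCES author(author_id));

    -- n:n (students take many courses, and vice versa): three tables
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollment (              -- the junction table
        student_id INTEGER REFERENCES student(student_id),
        course_id  INTEGER REFERENCES course(course_id),
        PRIMARY KEY (student_id, course_id)
    );
""")
table_count = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'").fetchone()[0]
```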

  (3) Each column represents a single value

For example, to represent the coordinate (0, 0), use two columns rather than storing "0,0" in one column.
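The coordinate example, sketched in SQLite: stored as two typed columns, each part can be queried and computed on directly in SQL, which the packed-string form cannot do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE point (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.executemany("INSERT INTO point (x, y) VALUES (?, ?)",
                 [(0.0, 0.0), (3.0, 4.0)])

# Each part is directly usable, e.g. a squared-distance computation in SQL.
dist = conn.execute(
    "SELECT x*x + y*y FROM point WHERE id = 2").fetchone()[0]
```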

  (4) column order

Column order is irrelevant to the table, but by convention, ordering columns as "primary key + foreign keys + entity data + non-entity data" clearly improves readability.

  (5) define primary keys and foreign keys

Data tables must define primary keys, and foreign keys where they exist. Defining primary and foreign keys is not only an RDBMS requirement but also a development requirement: almost all code generators need this information to generate code for common methods (including SQL and references), so defining them is necessary in the development phase. I stress "in the development phase" because many teams, after extensive performance testing and after satisfying themselves that referential integrity will not be seriously violated, drop all foreign keys from the database for optimal performance. I believe foreign keys should be retained while there is no performance problem, and even when a real performance problem appears, we should optimize the SQL rather than discard the foreign key constraints.
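A minimal sketch of what a declared foreign key buys you: with the constraint in force, the DBMS itself rejects an orphaned row. (SQLite needs the PRAGMA shown; most other RDBMSs enforce foreign keys by default.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite-specific switch
conn.executescript("""
    CREATE TABLE department (department_id INTEGER PRIMARY KEY);
    CREATE TABLE employee (
        employee_id   INTEGER PRIMARY KEY,
        department_id INTEGER NOT NULL
                      REFERENCES department(department_id)
    );
    INSERT INTO department VALUES (1);
""")
conn.execute("INSERT INTO employee VALUES (1, 1)")       # valid reference: OK

try:
    conn.execute("INSERT INTO employee VALUES (2, 99)")  # no such department
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```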

  (6) selection key

1. Artificial keys vs. natural keys

Artificial key: an attribute that is not natural to the entity but imposed on it as needed, such as a GUID; it has no meaning for the entity itself. Natural key: a natural attribute of the entity, such as an ID card number.

Benefits of artificial keys:

  • The key value never changes.
  • It is always stored as a single column.

Disadvantages of artificial keys:

  • Because an artificial key is a unique value with no real-world meaning, it cannot by itself prevent duplicate rows.

I recommend using artificial keys, for the following reasons:

  • In the design phase we cannot predict which values the code will actually require to be unique, so rather than guess at a natural key, simply use an artificial one.
  • The artificial key takes over the handling of entity relationships and carries no attribute description of its own. This keeps entity relationships highly decoupled from entity content, and the design intent clearer.

A further suggestion: every table should also have a natural key that is meaningful to the user. In special cases no such key can be found, and a composite key can be used instead. This key is not used as the unique identifier in the program, but it can be used for direct queries against the database.

A remaining drawback of artificial keys concerns query performance, so the choice of the artificial key's form (column type) matters:

  • An auto-increment integer is lighter and queries faster, but its range of values is bounded.
  • A GUID queries less efficiently than an integer, but its values are effectively unlimited and it is friendlier to developers.
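The two artificial-key styles, side by side in a SQLite sketch: an auto-increment integer assigned by the database, and a GUID (stored as text here) that the application can generate before the INSERT. Table names are illustrative.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_seq  (id INTEGER PRIMARY KEY AUTOINCREMENT, note TEXT);
    CREATE TABLE order_guid (id TEXT PRIMARY KEY, note TEXT);
""")

cur = conn.execute("INSERT INTO order_seq (note) VALUES ('a')")
seq_id = cur.lastrowid                 # assigned by the database

guid_id = str(uuid.uuid4())            # assigned by the application
conn.execute("INSERT INTO order_guid VALUES (?, 'a')", (guid_id,))

stored = conn.execute("SELECT id FROM order_guid").fetchone()[0]
```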

2. Smart keys vs. non-smart keys

Smart key: a key whose value carries extra information, encoded according to some agreed convention, so that information can be read out of the key value itself. Non-smart key: a simple, meaningless key value, such as an auto-increment number or a GUID.

Smart keys are a double-edged sword. Developers like key values that carry information, and programs come to rely on the data hidden in them; database administrators and designers hate smart keys for an equally obvious reason: they pose a latent risk to the database. As mentioned earlier, one principle of database design is not to combine independently meaningful values into a single column; multiple independent columns should be used. Database designers would rather developers concatenate several columns to obtain the smart key, that is, use a composite key, than decompose the value of a single column. Developers should accept this kind of design, but many cannot see the trade-off between the two approaches. In my view, the risk of implementing a smart key in a single column is that at design time we cannot foresee later changes to the encoding rules: for example, when the values of one component of the smart key are exhausted, the rule or the length changes. Such a change in encoding rules is destructive to program validation and smart-key parsing, and it is the kind of change that operations staff are least able to see coming. My suggestion: if you need a smart key, encapsulate it in the business logic layer (as a read-only property) rather than implementing it in the persistence layer, thereby avoiding the problems above.
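A minimal sketch of that suggestion: persist the parts of a "smart" code as separate columns, and expose the composed code only as a read-only property in the business layer. All names and the encoding rule here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Product:
    category: str     # persisted column
    region: str       # persisted column
    serial: int       # persisted column

    @property
    def smart_code(self) -> str:
        # Composed on demand; if the encoding rule changes, only this
        # property changes, not the stored data.
        return f"{self.category}-{self.region}-{self.serial:05d}"

p = Product(category="EL", region="CN", serial=42)
code = p.smart_code
```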

  (7) whether NULL is allowed

Regarding NULL, we need to know several of its characteristics:

  • Concatenating any value with NULL yields NULL.
  • Any mathematical operation involving NULL returns NULL.
  • Once NULL is introduced, logic becomes three-valued and is no longer easy to reason about.
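These NULL behaviors can be checked directly. A SQLite sketch (SQL Server behaves the same way for concatenation under its default `CONCAT_NULL_YIELDS_NULL` setting); note in particular that `NULL = NULL` is itself NULL, which is what makes NULL-laden predicates easy to get wrong.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
concat  = conn.execute("SELECT 'abc' || NULL").fetchone()[0]   # concatenation
math    = conn.execute("SELECT 1 + NULL").fetchone()[0]        # arithmetic
compare = conn.execute("SELECT NULL = NULL").fetchone()[0]     # comparison
# All three results are NULL (None in Python).
```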

So should we allow columns to be NULL? I think the answer is influenced by our development language. Taking C# as an example: since nullable types were introduced precisely to handle database NULLs for value types, whether a column allows NULL matters little to developers. However, note that non-null validation must be performed in the assembly rather than left to the DBMS's NOT NULL constraints: data must be guaranteed complete (all required attributes assigned) before it reaches the database. This is the idea of a "safe zone": in a multi-tier system we must define the regions within which data can be assumed secure and clean.

  (8) Attribute splitting

It is a mistake to assume the relationship between attributes and columns is 1:1. To developers we expose attributes, not fields. For example, if the entity "employee" has a "name" attribute, "name" can be further split into "last name" and "first name"; for developers the second structure (two fields, "last name" and "first name") is clearly preferable. So during design we should also consider whether an attribute needs to be split.
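The "name" example as a SQLite sketch: persist the two parts as separate columns and derive the display name in the query, rather than storing one concatenated field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    last_name   TEXT NOT NULL,
    first_name  TEXT NOT NULL)""")
conn.execute("INSERT INTO employee VALUES (1, 'Smith', 'Ada')")

# The full name is a derived value, not a stored one.
full = conn.execute(
    "SELECT first_name || ' ' || last_name FROM employee").fetchone()[0]
```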

  (9) Standardization-paradigm

When I was in college, normal forms were the biggest headache in learning relational databases, and I suspect some readers still do not see their value. Simply put, normal forms help us ensure the validity and integrity of data. The purposes of normalization are:

  • Eliminate duplicate data.
  • Avoid writing unnecessary code to synchronize duplicate data.
  • Keep the table thin and reduce the number of read operations required to read data from a table.
  • Maximize the usage of clustered indexes to optimize data access and connection.
  • Reduce the number of indexes used by each table because the cost of maintaining indexes is high.

Normalization works by picking apart complex entities and extracting simpler entities from them. The process continues until every table in the database represents just one thing and every column describes that thing.

1. normalize entities and attributes (remove redundancy)

1NF: each attribute should only represent a single value, not multiple values.

Considerations:

  • Attributes must be atomic: consider whether they are decomposed far enough that each attribute represents a single value. (This echoes "(3) Each column represents a single value" above.) The principle is to decompose a value only as far as each part needs to be handled separately, and no further. (Even where you do not fully decompose an attribute, consider possible future changes in requirements.)
  • Every instance of the entity must have the same number of values: the entity has a fixed number of attributes (the table has a fixed number of columns), and each attribute must be associated with a fixed number of values.
  • Every occurrence of an entity type in the database must be distinct.

"Smells" indicating the current design does not meet 1NF:

  • String data containing delimiter characters.
  • Attribute names ending in a number (phone1, phone2, and so on).
  • Tables with no keys or with poorly defined keys.
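The first smell, made concrete in a SQLite sketch: a delimiter-separated phone list packed into one column, and the 1NF fix, a child table with one row per phone number. Table names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Before (violates 1NF): 'phones' holds several values in one column.
    CREATE TABLE customer_flat (customer_id INTEGER PRIMARY KEY, phones TEXT);
    INSERT INTO customer_flat VALUES (1, '555-0100;555-0101');

    -- After: each phone number is a single atomic value in its own row.
    CREATE TABLE customer       (customer_id INTEGER PRIMARY KEY);
    CREATE TABLE customer_phone (
        customer_id INTEGER REFERENCES customer(customer_id),
        phone       TEXT NOT NULL
    );
    INSERT INTO customer VALUES (1);
    INSERT INTO customer_phone VALUES (1, '555-0100'), (1, '555-0101');
""")
phones = [p for (p,) in conn.execute(
    "SELECT phone FROM customer_phone WHERE customer_id = 1 ORDER BY phone")]
```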

2. Relationships between attributes (remove redundancy)

2NF: the entity must conform to 1NF, and every attribute must describe the entire key (think of it as the cohesion of a type's properties in OOP).

"Smells" indicating the current design does not meet 2NF:

  • Repeated prefixes in key attribute names (redundancy beyond the design) suggest those values may describe an additional entity.
  • Repeating groups of data (redundancy beyond the design) indicate functional dependencies between attributes.
  • A composite primary key without foreign keys suggests the key values may identify many kinds of things rather than one.

3NF: the entity must conform to 2NF, and non-key attributes must not describe other non-key attributes. (Unlike 2NF, 3NF deals with the relationships between non-key attributes and other non-key attributes, not with key attributes.)

"Smells" indicating the current design does not meet 3NF:

  • Multiple attributes sharing the same prefix.
  • Repeating groups of data.
  • Summary data whose source data lives in a completely different entity. (Some people tend to use views for this; I prefer entity collections, that is, handling it in the program.)

BCNF: the entity satisfies 3NF, and every attribute depends fully on a key; if every determinant is a key, the entity satisfies BCNF. (BCNF simply extends the previous normal form: an entity may have several candidate keys, and all attributes must depend on one of them. It can also be read as "each key must uniquely identify the entity, and each non-key attribute must describe the entity.")

3. Remove redundancy in combinations with the entity's key

4NF: the entity must satisfy BCNF, and there must be at most one multi-valued dependency between an attribute and the entity's key (a multi-valued dependency is one where the uniqueness of a record in the whole table is determined by a combination of multiple values).

"Smells" indicating the current design does not meet 4NF:

  • Ternary relationships (entity : entity : entity).
  • Latent multi-valued attributes (for example, multiple mobile phone numbers).
  • Temporary data or historical values. (The subject of historical data should be extracted into its own entity, otherwise there will be a lot of redundancy.)

4. Decompose all relationships into binary relationships wherever possible

5NF: the entity must satisfy 4NF, and, wherever the decomposition is lossless, all relationships are broken down into binary relationships.

5NF ensures that any ternary relationship among 4NF entities that can be losslessly decomposed is so decomposed. Some ternary relationships can be split into binary relationships without losing information; when splitting into binary relationships would lose information, the relationship is declared to be in 4NF and left alone. So the fifth normal form advises splitting existing ternary relationships into three binary relationships where possible.

Note that normalization tends to produce more tables and more complex queries, so how far you take it depends on performance and on the data architecture. I recommend normalizing to fourth normal form, because 5NF judgments are too obscure. For example, table X (teacher, student, course) is a ternary relationship that can be split into table A (teacher, student), table B (student, course), and table C (teacher, course). Table X says that a teacher teaches a particular course to a particular student; table A says the teacher teaches the student; table B says the student takes the course; table C says the teacher teaches the course. Looked at individually, nothing seems wrong; but looking at the data, "table X = table A + table B + table C" does not necessarily hold, that is, the original data cannot always be reconstructed by joining the three tables. Because several combinations may match, the business rule captured by table X is lost. This phenomenon is easy to overlook at design time, it only becomes apparent during development, and it does not occur often.
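The teacher/student/course example can be demonstrated with data. A SQLite sketch: decompose X into A, B, and C, re-join them, and observe that the join manufactures a (teacher, student, course) combination that was never in X, so this particular decomposition is lossy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE x (teacher TEXT, student TEXT, course TEXT);
    INSERT INTO x VALUES ('T1','S1','C1'), ('T2','S1','C2'), ('T1','S2','C2');

    CREATE TABLE a AS SELECT DISTINCT teacher, student FROM x;
    CREATE TABLE b AS SELECT DISTINCT student, course  FROM x;
    CREATE TABLE c AS SELECT DISTINCT teacher, course  FROM x;
""")
rejoined = conn.execute("""
    SELECT COUNT(*) FROM a
    JOIN b ON a.student = b.student
    JOIN c ON c.teacher = a.teacher AND c.course = b.course
""").fetchone()[0]
original = conn.execute("SELECT COUNT(*) FROM x").fetchone()[0]
# rejoined > original: the join produced the spurious row (T1, S1, C2).
```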

Recommended Practices:

  • Observe the preceding normalization principles as far as possible.
  • Every attribute should describe the essence of what the table models.
  • There must be at least one key that uniquely identifies and describes the essence of what is being modeled.
  • Exercise caution when selecting a primary key.
  • In the logic phase, normalize as far as you can (performance is not a concern of the logic phase).

  (10) Selecting data types (MS SQL 2008)

Common MS SQL types:

  • Exact numerics (no loss of precision): bit, tinyint, smallint, int, bigint, decimal
  • Approximate numerics (precision may be lost at extreme values): float(n), real
  • Date and time: date, time, smalldatetime, datetime2, datetimeoffset
  • Binary data: binary(n), varbinary(max)
  • Character (string) data: char(n), varchar(max), nchar(n), nvarchar(max)
  • Arbitrary data: sql_variant
  • Row version: timestamp
  • GUID: uniqueidentifier
  • XML: xml (do not try to use this type to circumvent 1NF)
  • Spatial data: geometry, geography
  • Hierarchical data: hierarchyid

Deprecated or poor type choices in MS SQL:

  • image: replaced by varbinary(max).
  • text and ntext: replaced by varchar(max) and nvarchar(max).
  • money and smallmoney: awkward to work with during development; decimal is recommended instead.

Select common types:

The most basic rule of type selection is to choose the lightest type that meets the requirements, because it queries faster.

  • Bool: use bit rather than char(1), because development languages support bit and can map it directly to bool or bool?.
  • Variable-length data: use the smallest type that can hold the data; the larger the type, the slower the query. Use max only when values exceed 8000 bytes.
  • Primary key: for auto-increment keys choose int or bigint according to the expected range; for GUIDs use uniqueidentifier rather than varchar(n).

  (11) Concurrency

Concurrent access should be taken into account when designing the DB; for example, the timestamp (rowversion) type in MS SQL is an excellent choice for detecting conflicting writes.
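A sketch of the optimistic-concurrency idea behind the timestamp/rowversion suggestion, using a plain integer version column in SQLite: an UPDATE succeeds only if the row still carries the version the writer originally read. The `withdraw` helper and table are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL, version INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100.0, 1)")

def withdraw(conn, account_id, amount, read_version):
    # Conditional update: only applies if no one else changed the row since
    # we read it; the version bump invalidates other writers' reads.
    cur = conn.execute(
        """UPDATE account SET balance = balance - ?, version = version + 1
           WHERE id = ? AND version = ?""",
        (amount, account_id, read_version))
    return cur.rowcount == 1        # False means a concurrent writer won

first  = withdraw(conn, 1, 30.0, read_version=1)   # succeeds, version -> 2
second = withdraw(conn, 1, 30.0, read_version=1)   # stale version: rejected
balance = conn.execute("SELECT balance FROM account WHERE id = 1").fetchone()[0]
```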

4. Naming rules
  • Table: "module name_table name". Avoid plural table names, because when developing with an ORM framework the code generator builds class definitions from the database, and a table maps to the type of a single instance, not to a collection of instances. Table names should not be too long: first, some software limits the maximum length of table names; second, code generators often derive identifiers from table names, lazy developers then use those identifiers as-is, and carrying a long name across network boundaries is unwise.
  • Column: bool columns are named with "Is", "Can", or "Has"; date columns should contain "Date"; time columns should contain "Time".
  • Stored procedure: use the prefix "proc".
  • View: use the prefix "view".
  • Trigger: use the prefix "trig".
