Relational database design

One: The origin of Codd's 12 RDBMS rules

Edgar Frank Codd was hailed as "the father of the relational database" and received the 1981 Turing Award for his outstanding contributions to the theory and practice of database management systems. In 1985, Dr. Codd published 12 rules that succinctly define what a relational database is; they serve as design guidelines for all relational database systems.

1. The information rule: all information in a relational database is represented in exactly one way, as values in tables.
2. The guaranteed access rule: every data item is guaranteed to be accessible through a combination of table name, primary key value, and column name.
3. Systematic treatment of NULL values: NULL values are supported and handled systematically, independently of data type.
4. Dynamic online catalog based on the relational model: the database is self-describing, with its structure represented at the logical level in the same way as ordinary data; that is, the database must contain system tables describing its own structure, or the database description must be included in tables that users can access.
5. The comprehensive data sublanguage rule: a relational database system may support several languages and various modes of terminal use, but there must be at least one language whose statements can be expressed as strings with a well-defined syntax and that fully supports data definition, view definition, data manipulation, integrity constraints, authorization, and transactions. (This language is SQL.)
6. The view updating rule: all views that are theoretically updatable can also be updated by the system.
7. High-level insert, update, and delete: the ability to treat a base or derived relation as a single operand applies not only to data retrieval but also to insertion, modification, and deletion; rows are treated as sets in these operations.
8. Physical data independence: applications and terminal activities remain logically unimpaired no matter how the data's storage representations or access methods change.
9. Logical data independence: applications and terminal activities remain logically unimpaired when information-preserving changes are made to the base tables.
10. Integrity independence: integrity constraints specific to a relational database must be definable in the relational data sublanguage and storable in the catalog, not in application programs.
11. Distribution independence: the data manipulation sublanguage of the RDBMS must keep applications and terminal activities logically unimpaired whether or not the data is physically distributed.
12. The nonsubversion rule: if the system provides a low-level (single-record-at-a-time) language, that low-level language cannot be used to violate or circumvent the integrity rules and constraints expressed in the higher-level (multiple-records-at-a-time) language; that is, users cannot violate database constraints in any way.

Two: The relational database design phases

  (i) Planning phase

The main task of the planning stage is to analyze the necessity and feasibility of the database: determine whether a database is needed at all, which type of database to use, and which database product to use.

  (ii) Conceptual phase

The main task of the conceptual phase is to collect and analyze requirements, primarily by identifying data entities and business rules. For a system, the database contains business data and non-business data, and the definition of the business data depends on the analysis of user requirements at this stage. Identify business entities and business rules as completely as possible, form an overall understanding of the system, and understand how data flows through it. In theory, this phase refers to or produces several documents, such as use case diagrams, data flow diagrams, and other project documents; if these can be produced at this stage, they will undoubtedly be of great help later. Of course, many of these documents are beyond the scope of the database designer's concerns. Moreover, if you are not proficient in this area or in the user's business, give up the idea of completing the user requirements analysis on your own. The user is not a technical expert; when you cannot play the role of "business advisor", choose to collaborate with the project team, or escalate the issue to the PM as a risk. Remember that most of the time the user is just an industry practitioner, not a professional technician: we are only collecting requirements from users, not relying on their technical knowledge.

There are some techniques you can use to record user requirements, although some of them may go beyond the database designer's responsibilities: maintain a set of documents containing system design and specification information, such as meeting minutes, interview records, key user expectations, functional specifications, technical specifications, and test specifications; communicate with stakeholders frequently and collect feedback; clearly mark anything you add that is not part of the customer's request as pending content; and identify the project scope with all key stakeholders as early as possible, seeking to freeze the requirements.

In addition, business rules must be handled rigorously and documented in detail. At a later stage, the design will be based on these business rules.

When this phase is over, you should be able to answer the following questions:

- What data is needed?
- How will the data be used?
- Which rules govern the use of the data?
- Who will use which data?
- What does the customer want to see in the core functional interfaces or reports?
- Where is the data now?
- Does the data interact, integrate, or synchronize with other systems?
- What is the subject data?
- How valuable is the core data, and how reliable must it be?

You should also have obtained the following: the entities and relationships, the attributes and their domains, the business rules that can be enforced in the database, and the business processes that will use the database.

  (iii) Logical phase

The main task of the logical phase is to draw the E-R diagram, that is, to model. There are many modeling tools, with different graphical notations and software. The choice of tool and software is not critical, and I do not recommend that readers spend much time selecting a modeling method. For most applications, an E-R diagram is sufficient to describe the relationships between entities. The key to modeling is thinking, not tools; software plays only an auxiliary role. Identifying the entity relationships is the focus of this phase.

In addition to entity relationships, we should also consider each attribute's domain (value type, range, constraints).

  (iv) Implementation phase

The implementation phase defines, for the selected RDBMS, the tables corresponding to the E-R diagram, taking into account attribute types and ranges as well as constraints.
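To make this concrete, here is a minimal T-SQL sketch of the step, assuming a hypothetical Employee entity from an E-R diagram (all table and column names here are invented for illustration):

    CREATE TABLE Employee
    (
        EmployeeID int           IDENTITY(1,1) NOT NULL, -- artificial key
        EmployeeNo varchar(20)   NOT NULL,               -- natural key, meaningful to users
        LastName   nvarchar(50)  NOT NULL,
        FirstName  nvarchar(50)  NOT NULL,
        HireDate   date          NOT NULL,
        Salary     decimal(10,2) NOT NULL,
        CONSTRAINT PK_Employee PRIMARY KEY (EmployeeID),
        CONSTRAINT UQ_Employee_EmployeeNo UNIQUE (EmployeeNo),
        CONSTRAINT CK_Employee_Salary CHECK (Salary >= 0) -- domain: range constraint
    );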

  (v) Physical phase

The physical phase is a validation and tuning phase: deploy the database on the actual physical equipment, then test and tune it.

Three: Design principles

  (i) Reduce reliance on database functionality

Functionality should be implemented by the program, not by the database. The reason: if functionality is implemented in the DB, and the replacement DBMS is not as powerful as the previous one and lacks certain features, the code will have to be modified. To rule out this situation, functionality should be implemented in the program, and the database should be responsible only for data storage, achieving the loosest possible coupling.

  (ii) Principles for defining entity relationships

When defining a relationship between one entity and another, consider the following. Entities involved: identify all entities that participate in the relationship. Ownership: consider whether one entity "owns" another. Cardinality: consider how many instances of one entity are associated with an instance of the other.

Relationship types and the minimum number of tables needed to describe them: a 1:1 relationship requires at least 1 table; a 1:n relationship requires at least 2 tables; an n:n relationship requires at least 3 tables (see the sketch below).
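As a sketch of the n:n case, with invented Student and Course entities, the third table is the junction table that carries the relationship itself:

    CREATE TABLE Student
    (
        StudentID int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Student PRIMARY KEY,
        Name      nvarchar(50) NOT NULL
    );

    CREATE TABLE Course
    (
        CourseID int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Course PRIMARY KEY,
        Title    nvarchar(100) NOT NULL
    );

    -- The n:n relationship: one row per (student, course) pair
    CREATE TABLE StudentCourse
    (
        StudentID int NOT NULL CONSTRAINT FK_SC_Student REFERENCES Student (StudentID),
        CourseID  int NOT NULL CONSTRAINT FK_SC_Course  REFERENCES Course (CourseID),
        CONSTRAINT PK_StudentCourse PRIMARY KEY (StudentID, CourseID)
    );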

  (iii) A column means a single value

If you need to represent the coordinates (0,0), use two columns instead of placing "0,0" in one column.

  (iv) Order of the columns

The order of the columns does not matter to the table, but it is customary to arrange them in the order "primary key + foreign key + entity data + non-entity data", which clearly improves readability.

  (v) defining primary and foreign keys

Every table must define a primary key, and foreign keys where foreign keys exist. Defining primary and foreign keys is not only an RDBMS requirement but also a development requirement: almost all code generators need this information to generate code for common methods, including SQL and references, so defining primary and foreign keys is necessary during the development phase. The reason for saying "during the development phase" is that many teams, for performance reasons, remove all foreign keys from the DB after extensive testing shows no significant referential-integrity defects, in pursuit of optimal performance. The author believes that foreign keys should be retained as long as performance is not a problem, and that even when performance really is a problem, you should optimize the SQL rather than abandon the foreign key constraints.

  (vi) Selecting keys

1 Artificial keys and natural keys

Artificial key: an attribute imposed on the entity as required by people, such as a GUID; it is not a natural attribute and has no meaning to the entity. Natural key: a natural attribute of the entity, such as an ID-card number.

The benefits of an artificial key: the key value never changes, and it is always stored in a single column.

The disadvantage of an artificial key: because an artificial key is a unique value with no practical meaning, it cannot be used to prevent duplicate rows.

The author recommends using artificial keys throughout, for the following reasons: at design time we cannot predict what the code will really need, so simply give up guessing and use an artificial key. An artificial key is responsible only for handling entity relationships and takes no part in describing attributes; this design decouples the entity relationships from the entity content and keeps the design idea clearer.

Another suggestion of the author: every table should also have a natural key that is meaningful to the user; in particular cases no single such attribute may exist, in which case a composite key can be used. In my programs this key is not used as the unique identifier, but it can be used when querying the database directly.

Another disadvantage of artificial keys concerns query performance, so the form of the artificial key (the column's type) matters: auto-increment types are lightweight and therefore query efficiently, but their value range is limited; GUIDs do not query as efficiently as integer types, but their value space is practically unlimited, and they are friendlier to developers.
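A sketch of both forms, with invented table names. In SQL Server, NEWSEQUENTIALID() (usable only as a column default) produces sequential GUIDs, which fragment indexes less than random NEWID() values:

    -- Form 1: auto-increment integer; lightweight, queries fast, limited range
    CREATE TABLE CustomerA
    (
        CustomerID int IDENTITY(1,1) NOT NULL CONSTRAINT PK_CustomerA PRIMARY KEY,
        Name       nvarchar(50) NOT NULL
    );

    -- Form 2: GUID; queries more slowly, but the value space is practically unlimited
    CREATE TABLE CustomerB
    (
        CustomerID uniqueidentifier NOT NULL
            CONSTRAINT DF_CustomerB_ID DEFAULT (NEWSEQUENTIALID())
            CONSTRAINT PK_CustomerB PRIMARY KEY,
        Name       nvarchar(50) NOT NULL
    );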

2 Smart and non-smart keys

Smart key: a key value that carries extra information, encoded according to an agreed specification, so that some information can be read out of the key value itself. Non-smart key: a simple, meaningless key value, such as an auto-incrementing number or a GUID.

The smart key is a double-edged sword. Developers like key values that carry information, since the program can exploit the embedded data; database administrators and designers hate smart keys, because smart keys are an obvious latent risk to the database. As mentioned earlier, one principle of database design is not to pack a combination of independently meaningful values into a single column; multiple separate columns should be used instead. Database designers want developers to obtain the smart key by concatenating several columns, using a composite primary key, rather than by decomposing the value of a single column. Developers should accept this design, but many do not understand the trade-off between the two. The author believes the risk of implementing a smart key in a single column is that the coding rules may change later in ways we could not anticipate during design. For example, part of the key's value space may be used up, forcing a rule change or a length change; a change in the coding rules is destructive to the program's validation and to the parsing of the smart key, which is the last thing the system's operators want to see. So the author suggests that if you need a smart key, encapsulate it in the business logic layer (as a read-only property), and do not implement it in the persistence layer, to avoid the problem.
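A sketch of the stitched-columns approach, under an invented coding rule (category + region + serial number). The smart key is derived rather than stored, so a rule change touches only one expression; better still, as suggested above, build it as a read-only property in the business logic layer:

    CREATE TABLE Part
    (
        PartID       int IDENTITY(1,1) NOT NULL CONSTRAINT PK_Part PRIMARY KEY,
        CategoryCode char(2) NOT NULL,  -- each component of the code is its own column
        RegionCode   char(3) NOT NULL,
        SerialNo     int     NOT NULL,
        -- Derived smart key; a non-persisted computed column, never stored
        PartCode AS (CategoryCode + '-' + RegionCode + '-'
                     + RIGHT('000000' + CAST(SerialNo AS varchar(10)), 6))
    );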

  (vii) Whether NULL is allowed

Regarding NULL, we need to understand several of its characteristics: concatenating any value with NULL yields NULL; all mathematical operations involving NULL return NULL; and once NULL is introduced, logic becomes difficult to handle.
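These characteristics are easy to verify directly in T-SQL (with the default session settings):

    SELECT 'abc' + NULL; -- string concatenation with NULL: returns NULL
    SELECT 1 + NULL;     -- arithmetic with NULL: returns NULL

    -- NULL = NULL evaluates to UNKNOWN rather than TRUE, so this prints
    -- 'not equal'; this three-valued logic is what makes NULL hard to handle.
    SELECT CASE WHEN NULL = NULL THEN 'equal' ELSE 'not equal' END;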

So should we allow columns to be NULL? I think the answer is influenced by the development language. Taking C# as an example: because nullable value types were introduced precisely to handle database values that are NULL, allowing NULLs costs developers little. One thing to note, though, is that non-null validation must be handled in your own assemblies; instead of relying on the DBMS's NOT NULL constraints, you must ensure that complete data (with all required attributes assigned) reaches the DB. In a multi-tier system, we must define in which zones the data can be considered safe and clean (the so-called "safe zone").

  (viii) Splitting attributes

One wrong idea is that attributes relate to columns 1:1. For developers, we expose attributes rather than fields. For example, if the entity "employee" has a "name" attribute, "name" can be split into "last name" and "first name"; for developers the second data structure is obviously preferable ("last name" and "first name" as two fields). Therefore, during design we should also consider whether to split attributes.

  (ix) Normalization: normal forms

When the author was still in college, normal forms were the most troublesome part of learning relational databases. Some readers may still not see the value of normal forms. Simply put, normal forms help us ensure the validity and integrity of our data. The purposes of normalization are as follows: eliminate duplicate data; avoid writing unnecessary code to synchronize duplicated data; keep tables narrow, reducing the number of read operations needed to read data from a single table; maximize the use of clustered indexes, allowing more optimized data access and joins; and reduce the number of indexes per table, because the cost of maintaining indexes is high.

Normalization picks apart complex entities and extracts simple entities from them. The process continues until every table in the database represents only one thing and every description in the table is about that one thing.

1 Normalizing entities and attributes (removing redundancy)

1NF: Each attribute should only represent a single value, not multiple values.

A few things to consider. Attributes must be atomic: consider whether each attribute has been decomposed thoroughly enough that it represents a single value (this is the same principle described in "(iii) A column means a single value"). The decomposition rule is: when you need to process the parts of a value separately, decompose the value, and decompose it as far as your uses require (even if you do not currently need a thorough decomposition, consider possible future requirement changes). All instances of an attribute must contain the same number of values: an entity has a fixed number of attributes (the table has a fixed number of columns), and when designing an entity, each attribute should have a fixed number of values associated with it. Finally, all occurrences within an entity must be different.

Smells indicating that the current design does not conform to 1NF: string data containing delimiter characters; attributes whose names end in a number; tables with no key defined.
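A sketch of the first smell and its fix, with an invented contact example; the delimited string is replaced by a child table holding one atomic value per row:

    -- Violates 1NF: several phone numbers packed into one column, e.g.
    -- Contact(ContactID, Name, Phones) with Phones = '555-0100;555-0101'

    -- 1NF version:
    CREATE TABLE Contact
    (
        ContactID int NOT NULL CONSTRAINT PK_Contact PRIMARY KEY,
        Name      nvarchar(50) NOT NULL
    );

    CREATE TABLE ContactPhone
    (
        ContactID int NOT NULL CONSTRAINT FK_ContactPhone_Contact REFERENCES Contact (ContactID),
        Phone     varchar(20) NOT NULL,
        CONSTRAINT PK_ContactPhone PRIMARY KEY (ContactID, Phone)
    );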

2 Relationships between attributes (removing redundancy)

2NF: the entity must conform to 1NF, and each attribute must describe the entire key, not just part of it (this can be understood as the cohesion of a type's properties in OOP).

Smells indicating that the design does not conform to 2NF: repeated key-attribute name prefixes (redundancy beyond the design), indicating that the values may describe some additional entity; repeating groups of data (redundancy beyond the design), marking a functional dependency between attributes; and a composite primary key without a foreign key, indicating that the values within the key may identify many kinds of things rather than one thing.

3NF: the entity must conform to 2NF, and non-key attributes cannot describe other non-key attributes. (Unlike 2NF, 3NF deals with the relationships between non-key attributes, not with their relationship to the key.)

Smells indicating that the design does not conform to 3NF: multiple attributes with the same prefix; repeating groups of data; and aggregated data that refers to a completely different entity (some people tend to use views for this; I prefer collections of objects, handled by the program).
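A combined sketch of the 2NF and 3NF points, with invented order tables. In a flat design OrderLine(OrderID, ProductID, ProductName, CustomerID, CustomerCity, Qty), ProductName depends on only part of the key (a 2NF violation) and CustomerCity describes the non-key CustomerID (a 3NF violation). After normalization, each fact is stored exactly once:

    CREATE TABLE Customer  (CustomerID int NOT NULL PRIMARY KEY,
                            City       nvarchar(50) NOT NULL);

    CREATE TABLE Product   (ProductID   int NOT NULL PRIMARY KEY,
                            ProductName nvarchar(100) NOT NULL);

    CREATE TABLE Orders    (OrderID    int NOT NULL PRIMARY KEY,
                            CustomerID int NOT NULL REFERENCES Customer (CustomerID));

    CREATE TABLE OrderLine (OrderID   int NOT NULL REFERENCES Orders (OrderID),
                            ProductID int NOT NULL REFERENCES Product (ProductID),
                            Qty       int NOT NULL,
                            PRIMARY KEY (OrderID, ProductID));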

BCNF: the entity satisfies first normal form, all attributes depend entirely on a key, and if every determinant is a key, the entity satisfies BCNF. (BCNF simply extends the previous normal forms: an entity may have several candidate keys, and all attributes must depend fully on one of them; in other words, "each key must uniquely identify the entity, and each non-key attribute must describe the entity.")

3 Removing redundancy in key combinations

4NF: the entity must satisfy BCNF, and there can be at most one multivalued dependency between an attribute and the entity's key (a multivalued dependency is one in which the uniqueness of a record in the whole table is established by the combination of multiple values).

Smells indicating that the design does not conform to 4NF: ternary relationships (entity:entity:entity); latent multivalued attributes (such as multiple cell phone numbers); and temporary data or historical values (the subject of the historical data should be extracted into its own entity, otherwise there will be a lot of redundancy).

4 Decompose all relationships into binary relationships as far as possible

5NF: the entity must satisfy 4NF, and all relationships are decomposed into binary relationships whenever the decomposition loses no information.

5NF guarantees that any ternary relationship existing under fourth normal form that can be decomposed into entities is decomposed. Some ternary relationships can be decomposed into binary relationships without losing information; when decomposing into two binary relationships would lose information, the relationship is declared to be in fourth normal form and left alone. Therefore, the fifth normal form suggests that ternary relationships which can be split losslessly are best decomposed into three binary relationships.

It should be noted that normalization may result in more tables and more complex queries, so how far to take it depends on weighing performance against data architecture. The recommendation is to normalize to fourth normal form, because judging 5NF is too fuzzy. For example: table X (teacher, student, course) is a ternary relationship that can be decomposed into table A (teacher, student), table B (student, course), and table C (teacher, course). Table X says that a certain teacher teaches a certain course to a certain student; table A says that the teacher teaches the student; table B says that the student takes the course; table C says that the teacher teaches the course. Looking at each table alone, no problem is visible, but from the data's point of view "table X = table B + table C" does not necessarily hold; that is, joining the decomposed tables cannot always rebuild the data that existed before the decomposition, because multiple combinations may exist, and the business rule recorded in table X is lost. This phenomenon is easily overlooked in the design phase and tends to surface only in the development phase, although it does not occur frequently.
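The information loss described above can be made concrete with a few rows of invented data:

    CREATE TABLE X (Teacher varchar(10), Student varchar(10), Course varchar(10));
    INSERT INTO X VALUES ('T1', 'S1', 'C1'), -- T1 teaches C1 to S1
                         ('T2', 'S1', 'C1'), -- T2 also teaches C1 to S1
                         ('T2', 'S2', 'C1'); -- T2 teaches C1 to S2, but T1 does not

    -- Try to rebuild X from B(student, course) and C(teacher, course) alone:
    SELECT c.Teacher, b.Student, b.Course
    FROM  (SELECT DISTINCT Student, Course FROM X) AS b
    JOIN  (SELECT DISTINCT Teacher, Course FROM X) AS c ON b.Course = c.Course;
    -- The join also returns ('T1', 'S2', 'C1'), a row that was never a fact:
    -- "table X = table B + table C" fails, and the business rule in X is lost.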

Recommended practice: adhere to the normalization principles above as far as possible; make all attribute descriptions reflect the nature of the entity being modeled; ensure there is at least one key that uniquely identifies and describes the essence of the entity; choose primary keys carefully; and, in the logical phase, normalize as far as you can (performance is not a consideration in the logical phase).

  (x) Selecting data types (MS SQL 2008)

Common MS SQL types:

Exact numerics (no loss of precision): bit, tinyint, smallint, int, bigint, decimal
Approximate numerics (precision may be lost for extreme values): float(n), real
Date and time: date, time, smalldatetime, datetime, datetime2, datetimeoffset
Binary data: binary(n), varbinary(n), varbinary(max)
Character (string) data: char(n), varchar(n), varchar(max), nchar(n), nvarchar(n), nvarchar(max)
Arbitrary data: sql_variant
Timestamp: timestamp
GUID: uniqueidentifier
XML (do not attempt to circumvent 1NF with this type): xml
Spatial data: geometry, geography
Hierarchical data: hierarchyid

Unsupported or poor type choices in MS SQL: image (replaced by varbinary(max)); text and ntext (replaced by varchar(max) and nvarchar(max)); money and smallmoney (awkward to use during development; decimal is recommended instead).

Common Type selection:

The most basic rule of type selection: choose the lightest type that meets your needs, because it queries faster.

bool: bit is recommended instead of char(1), because development languages support it well and can map it directly to bool or Boolean.
Large-value data: use the smallest of the candidate types, since the larger the type, the slower the query; use the max types only when the data exceeds 8,000 bytes.
Primary key: for an auto-increment primary key, choose int or bigint according to the expected range; for a GUID key, use uniqueidentifier rather than varchar(n).

  (xi) Optimizing for concurrency

When designing the DB, you should take concurrency optimization into account; for example, the timestamp (rowversion) type in MS SQL is an excellent choice for detecting concurrent modifications (see the sketch at the end of this article).

Four: Naming conventions

Tables: "module name_table name". Table names are best not plural, because when you develop with an ORM framework the code generator produces class definitions from the DB, and a table generates the type definition of one instance, not of a collection of instances. Table names should also not be too long: one reason is that some software limits the maximum length of a table name; another is that code generators often derive type names from table names, which lazy developers then use directly, and a name that is too long is clearly inadvisable across a network boundary.
Fields: name bool-type fields with "Is", "Can", or "Has"; the names of date-type fields must contain "Date".
Stored procedures: use the "proc_" prefix.
Views: use the "View_" prefix.
Triggers: use the "Trig_" prefix.
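Returning to (xi) above, here is a minimal sketch of optimistic concurrency detection with the timestamp (rowversion) type, on an invented Account table:

    CREATE TABLE Account
    (
        AccountID int NOT NULL CONSTRAINT PK_Account PRIMARY KEY,
        Balance   decimal(18,2) NOT NULL,
        RowVer    rowversion    NOT NULL -- changes automatically on every update
    );

    -- Read the row and remember its version ...
    DECLARE @ver binary(8);
    SELECT @ver = RowVer FROM Account WHERE AccountID = 1;

    -- ... then write back only if no other session changed the row meanwhile.
    UPDATE Account
    SET    Balance = Balance - 100
    WHERE  AccountID = 1 AND RowVer = @ver;

    IF @@ROWCOUNT = 0
        RAISERROR('Row was changed by another session; reload and retry.', 16, 1);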
