Why is normalization important? Many databases in use today have never been normalized, for a variety of reasons. This article explains some of those reasons and normalizes the claims table of an insurance company through the successive normal forms. Along the way, the table is changed and several additional tables are created, making the database more efficient, less error-prone, and easier to maintain.
Database normalization is the practice of optimizing table structures and organizing data into tables so that the data is clearer. A normalized design lets you change business rules, requirements, and data without restructuring the entire system.
By changing the way data is stored, even slightly, and changing the programs that access it, you can eliminate many sources of erroneous or junk data and reduce the work needed to update information.
One real problem in companies can be summed up in a single sentence: "That is the way we have always done it." We have always stored the information this way. We have always let people write anything they like into <insert field name>. We have always programmed this way. That is not always a bad thing, especially in young companies that are still learning. However, when there is a new system and a better way to get the job done, "the way it has always been done" may need to be re-examined and changed. Normalizing data is one of the useful methods companies often turn to.
Even when the data is used by COBOL programs (with the file layouts familiar to any COBOL programmer), storing it in a relational database exactly as it was stored in flat files is not the best way to get the job done. It usually happens because people do not understand the difference between the two or are afraid of change; it is simply an easy way to carry past ideas into the present.
Note: dictionary.com defines normalizing as making something standard, especially making it conform to a standard or norm, or as imposing acceptance of a standard. Webopedia describes normalization as "the process of organizing data to minimize redundancy in relational database design. Normalization usually involves dividing a database into two or more tables and defining relationships between the tables. The objective is to isolate data so that additions, deletions, and modifications of a field can be made in just one table and then propagated through the rest of the database via the defined relationships." I prefer this definition.
Terms
Before looking at the real-world insurance company example, you need to understand some terms used in the discussion. The following list defines vocabulary that is useful when working with databases, and especially when dealing with normalization:
· Relation: essentially, a relation is a two-dimensional table or array made up of rows and columns.
· Relationship: a relationship is a way of connecting data in different tables. Relationships exist between the data items that make up an entity and between the table entities themselves, and they are at the core of database normalization. There are three basic types of relationship, and it is important to understand them:
One-to-one (1:1): a one-to-one relationship means that each instance of an entity (every one, not just most) corresponds to exactly one instance of another entity. Each person has one unique set of fingerprints. Each phone number corresponds to exactly one private customer who pays the bill (not a company). Each person in the United States has only one Social Security number.
One-to-many (1:M): a one-to-many relationship means that an instance of one entity can be associated with zero, one, or many instances of another entity. Each person may have no children, one child, or several children. Each person may own no car, one car, or several cars.
Many-to-many (M:N): a many-to-many relationship (zero or more instances of one entity are associated with zero or more instances of another) is complex and hard to model directly, so it is usually broken into multiple 1:M relationships. A child may have no parents (an orphan), one parent (a single-parent family), or more than one parent (two parents who are together or divorced, or divorced and remarried, as in blended families), and each parent may have one or more children. A house or property can be left to one or more people, and each of those people may be left one or more houses or properties in wills.
· Attribute: an attribute is a property or characteristic that components of a program or database can set to different values. In a relation, it is a column of the table.
· Tuple: a tuple is an ordered set of values (attribute values) in a relational or non-relational database. In a relation, it is a row.
· Deletion anomaly: a deletion anomaly is the unintended loss of data (information) caused by the deliberate deletion of other data.
· Insertion anomaly: an insertion anomaly is the inability to add information to the database because other data is missing or absent.
· Update anomaly: an update anomaly is a data inconsistency caused by data redundancy or by incomplete updates of redundant data.
· Relation decomposition: relation decomposition means splitting a relation into multiple relations so that they conform to a higher normal form.
· Data redundancy: data redundancy is unnecessary duplication of data in the database.
· Data integrity: data integrity is the consistency of the data in the database. It matters because only then can users trust that the data they depend on is correct and that their queries and programs return accurate, expected results.
· Atomic value: an atomic value is a single value. It is neither a set of values that can be split further nor a repeating group. Each column holds one complete value, and only one, which cannot be divided into parts whether it is used by the database or by the users of the database.
· Referential integrity rule: the referential integrity rule states that the value stored in a non-null foreign key must be a key value in some relation.
· Foreign key: a foreign key is a group of attributes (one or more columns) in a relation that is also a primary key in a relation (the same one or another). It is the logical link between relations. A foreign key that references its own relation is called a recursive foreign key.
· Functional dependency: a functional dependency means that the value of one attribute in a row is determined by the value of another attribute in that row. This usually holds between the primary key (the piece of information that makes a row unique) and the other information in the row. The combination of city and state depends on the zip code, even though a given state contains many zip codes associated with one city. The identity of every legal resident of the United States depends on his or her Social Security number.
· Determinant: the determinant is the attribute on the left side of a functional dependency, which determines the values of other attributes in the row (the zip code determines the city and state; the Social Security number determines the identity of the person; the license plate number and state determine the car's owner).
· Entity integrity rule: the entity integrity rule states that the key attributes of a row may not be null (if you have a city, you have a zip code; if you have a car, you have a license plate number).
· Constraint: a constraint is a rule that restricts the values in the database. A telephone number must be numeric; a dollar amount must be numeric; a state must be a valid state or province; a country must be a valid country; a date cannot be February 31.
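As a rough illustration of the foreign key, referential integrity, and constraint definitions above, the sketch below uses Python's built-in sqlite3 module. The table layout, the policy number 'P-100', and the helper try_insert are assumptions made for this example, not part of the insurance schema discussed in this article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE policy (
    policy_no TEXT PRIMARY KEY
);
CREATE TABLE claim (
    claim_num    INTEGER PRIMARY KEY,
    -- foreign key: must match an existing policy (referential integrity)
    policy_no    TEXT NOT NULL REFERENCES policy(policy_no),
    -- constraint: only a fixed set of status values is allowed
    claim_status TEXT NOT NULL
        CHECK (claim_status IN ('Open', 'Reviewed', 'Reopened', 'Closed'))
);
""")

conn.execute("INSERT INTO policy VALUES ('P-100')")  # hypothetical policy number
conn.execute("INSERT INTO claim VALUES (123456789, 'P-100', 'Open')")


def try_insert(row):
    """Return True if the claim row is accepted, False if a rule rejects it."""
    try:
        conn.execute("INSERT INTO claim VALUES (?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


orphan_accepted = try_insert((234567890, "P-999", "Open"))       # no such policy
bad_status_accepted = try_insert((147258369, "P-100", "Maybe"))  # fails the CHECK
valid_accepted = try_insert((258369147, "P-100", "Closed"))
claim_count = conn.execute("SELECT COUNT(*) FROM claim").fetchone()[0]
```

Only the rows that satisfy both rules are stored, so the database itself, rather than each program that writes to it, guards the consistency of the data.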
Now that you know the terminology, let's look at what the terms mean in practice. The example below is not the typical employee-manager-department or student-professor-course example. Instead, I will work through the database of a hypothetical insurance company. The tables in a real database are far more complex than those used here, but the example resembles what people actually encounter.
Figure 1 shows the unnormalized definition of the claim table. An insurance company's database contains many more tables than this one, but it gives us enough background to look at normalization and its ramifications. Keep in mind that the examples in each section show only some of the columns; this simplifies them and makes it easy to see what has changed.
| Claim_num, comment, claim_status, comment, comment, reported_dt, entered_dt, comment, claim_dt6, comment, claim_dt10, closed_dt, death_dt, tags, tags, award_cd, cause_cd, tags, location, site, coverage_cd, tags, ded_recov, tags, paid_1, reserved_1, paid_2, reserved_2, paid_3, reserved_3, paid_4, reserved_4, paid_5, expires, paid_6, expires, paid_7, expires, paid_8, expires, paid_9, expires, paid_10, expires, expires, key1, key2, key3, key4, key5, key6, key7, key8, key9, key10, success, policy_num, payment_num, SSN, state, success, entry_dt, admin_cd, admin_desc, reopen_dt, insured_name, insured_address, success, fail, claimant_name, claimant_address, claimant_city, claimant_state, claimant_zip, claimant_phone, comment, special_dt_2, comment, comment y_id |
Figure 1: Columns in the unnormalized claims table
First normal form (1NF)
Converting a database, or a table in it, to first normal form is usually fairly simple. First normal form requires that repeating groups be eliminated by creating separate tables for the related data. You identify those tables by examining the data and the table structure.
In first normal form, each repeating group is moved into its own table, and that table is associated with the original through a one-to-many relationship.
No repeating attributes and no repeating values: this sounds simple enough. However, it is sometimes hard to convince people that there is an alternative to simply adding another set of columns to the design, because that is what has always been done.
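To make the idea of removing a repeating group concrete, here is a minimal sketch in plain Python. It takes a trimmed-down flat row with paid_N / reserved_N columns, like those in Figure 1, and splits it into one claim row plus one payment row per used slot. The sample amounts, the payment_seq column, and the function name split_repeating_group are assumptions made for illustration.

```python
# A hypothetical, trimmed-down row from the unnormalized claim table.
flat_claim = {
    "claim_num": 123456789,
    "claim_status": "Open",
    "paid_1": 250.0, "reserved_1": 1000.0,
    "paid_2": 75.5, "reserved_2": 0.0,
    "paid_3": None, "reserved_3": None,  # unused slot in the repeating group
}


def split_repeating_group(row, slots=3):
    """Split one flat row into a claim row plus one payment row per used slot."""
    # Everything that is not part of the repeating group stays on the claim.
    claim = {k: v for k, v in row.items()
             if not k.startswith(("paid_", "reserved_"))}
    payments = []
    for i in range(1, slots + 1):
        paid = row.get(f"paid_{i}")
        reserved = row.get(f"reserved_{i}")
        if paid is None and reserved is None:
            continue  # empty slot: no row is needed in the child table
        payments.append({
            "claim_num": row["claim_num"],  # links the payment back to its claim
            "payment_seq": i,
            "paid": paid,
            "reserved": reserved,
        })
    return claim, payments


claim_row, payment_rows = split_repeating_group(flat_claim)
```

The child table grows by rows rather than by columns, so an eleventh payment needs no change to the table definition.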
To bring the claims table into first normal form, we need to find all the attributes that truly belong to a specific claim. What makes up a claim?
· A claim must have a number.
· A claim has a person who makes it.
· A claim must have a reported date.
· A claim must have an accident or illness date.
· A claim has an amount reserved for the accident or illness.
· A claim belongs to, or is written against, a particular policy.
· A claim can be closed.
· A claim can be reopened.
· Does a claim have a coverage? Or is coverage more a matter of the policy?
· Does a claim have a cause? Or is it the accident or illness that has a cause?
· Is it the claim that is paid? Or is it the invoice that is paid?
· Does a claim have a Social Security number? Or does the Social Security number belong to the person making the claim?
· The date of death is an interesting one. Does a claim die? No, but in life insurance the date may be tied to the claim, so it should be kept.
Trimming the table down to the columns directly related to a claim gives the result shown in Figure 2:
| Claim_num, claim_status, expiration, entered_dt, closed_dt, death_dt, expiration, expiration, adjuster_name, agent_cd, agent_name, award_cd, expiration, payment_num, location, site, website, policy_no, policy_description, state, run_dt, activity_dt, entry_dt, reopen_dt, insured_name, insured_address, expiration, claimant_address, expiration, expires, gross_pd |
Figure 2: Claims table in first normal form
The revised claims table in first normal form contains only information about the claim itself, not about payments or invoices, policies, or accidents.
| Claim_num | Claim_status | Accident_dt | Accident_yr | Reported_dt | Entered_dt |
| 123456789 | Open | 20-jun-2000 | 2000 | 28-jun-2000 | 29-jun-2000 |
| 234567890 | Reviewed | 15-feb-1984 | 1984 | 19-feb-1984 | 20-feb-1984 |
| 147258369 | Reopened | 08-apr-2003 | 2003 | 10-apr-2003 | 11-apr-2003 |
| 258369147 | Closed | 18-dec-1980 | 1980 | 18-dec-1980 | 19-dec-1980 |
If you have a payment table that stores the amounts reserved for a specific claim against its various bills, why not keep the paid and reserved amounts there? In short, since some of this information is already stored in the payment table, why not store all of it there rather than in the claims table?
If the only reason for keeping this information in the claims table is that a user may need it while working on a claim, the claims table and the payment table can be joined, and the totals can be computed as the sum of all payments made on a single claim. And because there are different types of policies (and therefore different types of claims), why not store the payment information for all types of claims in one table? Storing all payment information in the same table is logical: most of the attributes associated with a payment are the same regardless of the type of payment or the type of claim. The accounting information for different types of claims, however, differs somewhat.
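The join-and-sum idea above can be sketched as follows, using Python's built-in sqlite3 module. The payment table layout, the sample amounts, and the function name claim_totals are assumptions for this example; the point is only that per-claim totals can be derived on demand instead of being stored in the claims table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE claim (
    claim_num    INTEGER PRIMARY KEY,
    claim_status TEXT
);
CREATE TABLE payment (
    payment_num INTEGER PRIMARY KEY,
    claim_num   INTEGER REFERENCES claim(claim_num),
    paid        REAL,
    reserved    REAL
);
INSERT INTO claim VALUES (123456789, 'Open'), (234567890, 'Reviewed');
-- hypothetical payment rows for illustration
INSERT INTO payment VALUES
    (1, 123456789, 250.0, 1000.0),
    (2, 123456789, 100.0, 0.0),
    (3, 234567890, 40.0, 500.0);
""")


def claim_totals(connection):
    """Return (claim_num, total_paid, total_reserved) for every claim."""
    return connection.execute("""
        SELECT c.claim_num,
               COALESCE(SUM(p.paid), 0),
               COALESCE(SUM(p.reserved), 0)
        FROM claim AS c
        LEFT JOIN payment AS p ON p.claim_num = c.claim_num
        GROUP BY c.claim_num
        ORDER BY c.claim_num
    """).fetchall()


totals = claim_totals(conn)
```

The LEFT JOIN keeps claims that have no payments yet, and COALESCE reports their totals as zero rather than NULL.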
Second normal form (2NF)
Second normal form deals with removing redundant data. It is usually violated when information in a table depends on columns of that table other than the primary key.
If the new first-normal-form claims table looks like the listing below, the redundant data that is quickest and easiest to spot is the city and state of the insured and the city and state of the claimant. Both city and state depend directly on the zip code, not on anything about the claim.
| Claim_num, claim_status, comment, entered_dt, closed_dt, death_dt, comment, comment, adjuster_name, agent_cd, agent_name, award_cd, comment, location, site, comment, comment, policy_no, policy_description, state, run_dt, activity_dt, entry_dt, reopen_dt, insured_name, insured_address, expiration, expiration, insured_state, expiration, claimant_name, claimant_address, expiration, claimant_state |
Figure 3: Claims table in second normal form
| Claim_num | Claimant_name | Claimant_address | Claimant_city | Claimant_state | Claimant_zip |
| 123456789 | Jennifer Smith | 1234 main | Pittsburgh | PA | 15201 |
| 234567890 | Bill Smith | 7852 eagle | Pittsburgh | PA | 15202 |
| 147258369 | John Jones | 4562 edge | Eighty Four | PA | 15330 |
| 258369147 | Eleanor Stillwater | 7531 West Eastern | Somerset | PA | 15510 |
| Zip_code | City | State |
| 15330 | Eighty Four | PA |
| 15510 | Somerset | PA |
| 15201 | Pittsburgh | PA |
| 15202 | Pittsburgh | PA |
| 15203 | Pittsburgh | PA |
| 15204 | Pittsburgh | PA |
| 15205 | Pittsburgh | PA |
| 15206 | Pittsburgh | PA |
| 15207 | Pittsburgh | PA |
| 15208 | Pittsburgh | PA |
| 15209 | Pittsburgh | PA |
| 15210 | Pittsburgh | PA |
Because Pittsburgh, Eighty Four, Somerset, and PA depend not on the claim but on the zip code, they do not belong directly in the claims table. Moving them into a zip code table does not fix every problem with this table, but it removes the difficulties caused by the dependency between city, state, and zip code.
| Claim_num | Claimant_name | Claimant_address | Claimant_zip |
| 123456789 | Jennifer Smith | 1234 main | 15201 |
| 234567890 | Bill Smith | 7852 eagle | 15202 |
| 147258369 | John Jones | 4562 edge | 15330 |
| 258369147 | Eleanor Stillwater | 7531 West Eastern | 15510 |
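The zip code split shown above can be sketched with Python's built-in sqlite3 module. The table names claimant and zip_code are assumptions for this example; the rows are taken from the sample data in this section. The query shows that city and state are still available through a join even though they are no longer stored with each claimant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE zip_code (
    zip_code TEXT PRIMARY KEY,   -- determinant of city and state
    city     TEXT,
    state    TEXT
);
CREATE TABLE claimant (
    claim_num        INTEGER PRIMARY KEY,
    claimant_name    TEXT,
    claimant_address TEXT,
    claimant_zip     TEXT REFERENCES zip_code(zip_code)
);
INSERT INTO zip_code VALUES
    ('15201', 'Pittsburgh', 'PA'),
    ('15202', 'Pittsburgh', 'PA'),
    ('15330', 'Eighty Four', 'PA'),
    ('15510', 'Somerset', 'PA');
INSERT INTO claimant VALUES
    (123456789, 'Jennifer Smith', '1234 main', '15201'),
    (234567890, 'Bill Smith', '7852 eagle', '15202'),
    (147258369, 'John Jones', '4562 edge', '15330'),
    (258369147, 'Eleanor Stillwater', '7531 West Eastern', '15510');
""")

# Rebuild the wider view (with city and state) by joining on the zip code.
rows = conn.execute("""
    SELECT c.claim_num, c.claimant_name, z.city, z.state, c.claimant_zip
    FROM claimant AS c
    JOIN zip_code AS z ON z.zip_code = c.claimant_zip
    ORDER BY c.claim_num
""").fetchall()
```

Each city and state pair is now stored once per zip code, so a correction to a city name touches one row however many claimants live there.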
Other information that can be moved to its own table to bring the claims table into second normal form includes the combination of award code and award description. Only the award code needs to be stored in the claims table. With this approach, changing the description for a given code means changing one column of one row in the award table, which cannot cause an update anomaly; updating a column in a table where the change affects hundreds of rows can. The same logic applies to adjusters and agents: move their information into their own tables and store only the code column in the claims table, so that the supporting information is easily reached through a join.
| Adjuster_cd | Adjuster_name |
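The difference between the two designs when a name changes can be sketched as follows, again with Python's built-in sqlite3 module. The table claim_flat, the adjuster code 'A1', and the names are invented for this example. An incomplete update of the flat table leaves two names for one code (an update anomaly), while the lookup table needs exactly one row changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Flat design: the adjuster name is repeated on every claim.
CREATE TABLE claim_flat (
    claim_num     INTEGER PRIMARY KEY,
    adjuster_cd   TEXT,
    adjuster_name TEXT
);
INSERT INTO claim_flat VALUES
    (1, 'A1', 'Pat Doe'), (2, 'A1', 'Pat Doe'), (3, 'A1', 'Pat Doe');

-- Lookup design: the name is stored once, keyed by adjuster_cd.
CREATE TABLE adjuster (
    adjuster_cd   TEXT PRIMARY KEY,
    adjuster_name TEXT
);
CREATE TABLE claim (
    claim_num   INTEGER PRIMARY KEY,
    adjuster_cd TEXT REFERENCES adjuster(adjuster_cd)
);
INSERT INTO adjuster VALUES ('A1', 'Pat Doe');
INSERT INTO claim VALUES (1, 'A1'), (2, 'A1'), (3, 'A1');
""")

# Incomplete update of the flat table: only one of three claims is changed.
conn.execute(
    "UPDATE claim_flat SET adjuster_name = 'Pat Smith' WHERE claim_num = 1")
flat_names = conn.execute(
    "SELECT COUNT(DISTINCT adjuster_name) FROM claim_flat "
    "WHERE adjuster_cd = 'A1'").fetchone()[0]

# The same change in the lookup design is a single-row update.
cursor = conn.execute(
    "UPDATE adjuster SET adjuster_name = 'Pat Smith' WHERE adjuster_cd = 'A1'")
rows_touched = cursor.rowcount
lookup_names = conn.execute("""
    SELECT COUNT(DISTINCT a.adjuster_name)
    FROM claim AS c
    JOIN adjuster AS a ON a.adjuster_cd = c.adjuster_cd
""").fetchone()[0]
```

After the updates, the flat table reports two different names for code 'A1', whereas every claim joined to the lookup table sees the same name.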
Third normal form (3NF)
Third normal form seeks to eliminate, from tables already in first and second normal form, attributes that do not depend directly on the primary key. New tables are created for all information that is not tied to the table's primary key; each new table holds the information moved out of the source table together with the key it depends on.
Note: third normal form is often summarized as "the key, the whole key, and nothing but the key".
| Claim_num, claim_status, expiration, expiration, entered_dt, closed_dt, death_dt, expiration, expiration, agent_cd, award_cd, location, site, expiration, expiration, policy_no, state, run_dt, activity_dt, expiration, reopen_dt, insured_name, insured_address, insured_phone, insured_zip, claimant_name, claimant_address, claimant_zip |
Figure 4: Claims table in third normal form
In third normal form, the claims table changes further. The insured's name, address, phone number, and zip code depend on the policy that was written rather than on the claim itself, so the insured's information can be moved into the policy table. What remains in the claims table is then more directly related to the claim, and everything else sits in its own table with no information lost. A simple join between these tables reconstructs the information of the source table, which is the goal of relational algebra and relational operations (the subject of relational theory and relational database dependencies).
| Policy_no | Insured_name | Insured_address | Insured_phone | Insured_zip |
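The claim that a simple join reconstructs the source table can be checked with a small sketch in plain Python. The sample rows, policy numbers, insured names, and the function names decompose and rejoin are all assumptions for illustration. The split is only valid when policy_no really determines the insured's details, so the sketch verifies that before moving the columns.

```python
# Hypothetical claim rows before the insured's details are moved out.
claims_before = [
    {"claim_num": 123456789, "policy_no": "P-100", "insured_name": "Acme Tool Co",
     "insured_zip": "15201", "claim_status": "Open"},
    {"claim_num": 234567890, "policy_no": "P-100", "insured_name": "Acme Tool Co",
     "insured_zip": "15201", "claim_status": "Reviewed"},
    {"claim_num": 147258369, "policy_no": "P-200", "insured_name": "Jones Farm",
     "insured_zip": "15330", "claim_status": "Reopened"},
]

INSURED_COLS = ("insured_name", "insured_zip")


def decompose(rows):
    """Move the insured_* columns into a policy table keyed by policy_no."""
    policy = {}
    claim = []
    for row in rows:
        details = {col: row[col] for col in INSURED_COLS}
        # Valid only if policy_no determines the insured's details.
        if policy.setdefault(row["policy_no"], details) != details:
            raise ValueError(f"conflicting insured data for {row['policy_no']}")
        claim.append({k: v for k, v in row.items() if k not in INSURED_COLS})
    return policy, claim


def rejoin(policy, claim):
    """Rebuild the original rows by joining claim to policy on policy_no."""
    return [{**c, **policy[c["policy_no"]]} for c in claim]


policy_table, claim_table = decompose(claims_before)
rebuilt = rejoin(policy_table, claim_table)
```

Here the rebuilt rows equal the original ones, while the insured's details are stored once per policy instead of once per claim.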
Third normal form is usually the highest level of normalization reached in practice. There are higher normal forms, but the higher the level, the harder it is to reach in simple steps and the closer it is to pure theory.
Normalization or the lack of it, and the extent to which you normalize, is usually a compromise among the people involved. If there is a sufficiently important need to store a piece of information in a particular place, that choice should be respected even when it does not fit the definition of a given normal form. The degree of normalization must also reflect how the tables and the database are used. In decision support systems and data warehouses, for example, the time-variant nature of the data often makes heavily denormalized information desirable (especially in fact tables).
In a team-oriented environment, these are departmental decisions (or decisions made under shared guidelines). We hope these guidelines help you understand normalization and make better-informed decisions.