Database Design (1/9): Metadata (data Elements)

Last Update:2016-05-06 Source: Internet

Author: User

Tags in degrees scalar square root

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Is it completely new for designing and creating a database? It doesn't matter, Joe Celko, one of the most-read SQL authors in the world, will tell you these basics. As always, even the most professional database veteran will surprise them. Joe is Dmbs magazine, the most popular author of the book for many years. He teaches SQL in the United States, Britain, Northern Europe, South America and Africa. He worked for the Ansi/iso SQL Standards Committee for 10 years and made outstanding contributions to the SQL-89 and SQL-92 standards.

Before you start thinking about your database schema or table, you need to think about your data: What type of data you have, and the range of values you use. It should be unique, precise and unambiguous. Then your name should be easy to understand. Joe Celko explains it this way.

Don't be surprised, the first step in database design is data. But in fact, too many programmers don't take the time to design their data, but to design the architecture first. It won't work. SQL handles structured data rather than unstructured data that you want for text or pictures. A basic concept in a relational database management system (RDBMS) is Dr. Codd's so-called information principle (information Principle). The rule reads: In a relational database management system, all data is modeled as a scalar in a table row.

This means that all rows in the table have the same structure. That is, the correct table design does not appear to have a row is the model of the car, the other line is the model of squid, there is a row of Lady Gaga.

Again, this means that each column in the row is a value of the same data element. In the table, a column does not change from shoe size to other temperature values. That is, the range does not change-if you measure the temperature in degrees Celsius, all values in this column use a range of degrees Celsius.

It is important to select the range and type of data. It makes mathematical and comparative sense in some areas, rather than with others. This also explains why SQL is a strongly typed language (strongly typed language). Some languages are weakly typed-that is, variables can modify data types during program execution-some are strongly typed, and those are variables that keep the same data types, and only when you are forced to convert are new data types. Not all mandatory conversions are legal. The classic example of a weakly typed conversion problem is pl/i, if you assign a single array element to the square root of a negative number, it converts the entire array from float to complex type. This occurs without any warning message. There's an old joke about weak data types:

Teacher: "Billy, 6 times 9 how much?" ”

Billy: "The amount ~ ~, red?" ”

Teacher: "No!" Sally, how much is 6 times 9? ”

Sally: "Thursday?" ”

Teacher: "No!" Tommy, how much is 6 times 9? ”

Tommy: "54"

Teacher: "Answer right!" Now tell everyone how you got the answer. ”

Tommy: I'll divide the red by Thursday! ”

When people write weakly typed languages, they often add prefixes or endings to the name of the data element, telling them the original data type and preventing them from modifying the type in the program. This violates the ISO-11179 metadata rule. This standard summarizes what the named data element should be, not where it is stored , or how it is used in the table.

For data elements, the iso_11179 format is "[< role >_]< Properties >_< properties;]" ("[<role>_]<attribute>_<property>"). In the entire architecture, one data element has one and only one name. If there is only one name that is better across the enterprise. Best of all, there is only one name in the universe. Let me tell you another old joke:

"When I was a kid, we had three cats. ”

"What are their names?" ”

"Cats, cats and cats." ”

"It doesn't sound clear; how do you separate them?" ”

"Who cares?" When you call them, they won't come! ”

We want the data to know what it is when it's called. There is a logical identification rule, which is usually simply literally a. More formally, we call it:

is specific; The entity has an identity.
There is nothing special about it; The entity is unique and precise.
It is any common word and nothing, and the entity is unambiguous.

For example, nothing can be an "id"-it's too vague and too generic (is it used to identify cars, squid or Britney Spears?). ）。 It is a property of no nature. A better name would be "vehicle_id" if that is the nature of our properties (attribute property). The best data element name is "VIN", which is universally used and defined by the ISO 4030 "Vehicle identification number (Vehicle identification number)" standard. Vin is accurate and well understood.

Similarly, you will see attributes that are not of a nature. My personal favorite is "sex", which can be "Sex_code" (as defined by ISO 5218, can refer to its marker "sex", although this is not a good idea). "Sex_type" (physiological options of an animal or plant: male or female) or "sex_frequency" (in my wishful thinking).

Perhaps the most ridiculous mistake is the chain of nature. What do you mean by "type_code_id"? If it's an identity, I'm in the data model it's unique and belongs to an entity (think "emp_id"). If it is a code, then it has an external authority (want to "ZIP code"). If it is a type, it has a test (think of "blood type"). It is like a series of adjectives without a noun to modify. Adding an attribute to the chain will not save the ambiguity problem.

Again, the basic rule is that a data element noun tells us what it is. The name does not tell us:

Where it's located in the schema with the table name
How it is used in a particular table (for example, there is no pk-,fk-or vw-such as affixes in the name)
How its data is physically stored (for example, no "I", "str-", etc.) for shaping, character type, etc.
Didn't tell us what the data element is. Be sure and accurate.

The properties component is selected from the standardized list, which you can add as needed. This list becomes part of your data dictionary and needs to be executed.

In the same table, when the same data elements appear in 2 or more roles, use < roles >. For example, in the organization report there will be "supervisor_emp_id" and "subordinate_emp_id", all from the employee identity data element "emp_id" in the personnel table.

Look at the length of the name. The data element name is too long or too short to be good. Too short a name is hard to understand unless it is a standard abbreviation. They can also be ambiguous; My favorite example is a system that has multiple student abbreviations in the metadata name, one of which is "Std"-often mistaken for "standard" abbreviations, and in some places even "sexually transmitted Disease (STD, disease caused by sexual intercourse) ".

Avoid all special characters in the name. Stick with Latin-1 letters, numbers and underscores. They are reused in other programming languages. Yes, the data element name is not just SQL. Similarly, avoid using only Microsoft-supported brackets "[" or Ansi/ios-supported double quotation marks at the same time. This is just a sloppy programming for displaying formatting in the database, not the front-end display. One possible exception would be a language translation problem, and the Latin-1 letter would not work correctly.

Enforce capitalization rules to avoid case-sensitive issues. My rule is that the SQL keyword is uppercase, the scalar is lowercase, and the schema object is uppercase. There's a study in my SQL Programming style (SQL Programming style) book, but I'll skip the details here.

A standardized suffix list of data element names is built on Teradata's internal standards and is common in industry journals (CMP,MKP and other publishers). These suffixes have precise meanings. If you need to invent your own, they are also accurate.

"_id" = identifier. It is unique in the architecture, anywhere in the architecture it refers to an entity. Uniqueness and identity are not the same thing. PI number it is unique, but he cannot identify an entity. Never use the "< table name >_id"; This is based on the location name and tells you it's probably not a real primary key. The simple "id" is too vague, and when you have an unlimited number of such names, you can confuse your data dictionary. Obviously, the self-growth value is also not an indicator. I'll talk about this fallacy.

"_date" or "_dt" = Dimension of DateTime. It is some time--work, birth, termination and so on. There is no such thing as a date, which is bad, because data is a registered keyword in SQL.

"_nbr" or "_num" = label number; This is the number character that is named. Do not use "_no" because it looks like a Boolean value. I prefer "NBR" or "num" because it is a common abbreviation in multiple European languages, and a combination of similar-shaped letters in "_num" is visually confusing.

"_name" or "_nm" = This is a letter name itself can be understood. It also becomes the name scale.

"_code" or "_cd" = code, trusted source, is usually maintained outside of the enterprise standardization. For example, ZIP code is maintained by a America postal service. In its context, the code is well understood, so you may not need to translate it.

"_size" = industry standard or company size for a product, such as clothing, shoes, envelopes or machine screws .

"_seq" = sequence, ordinal. and the number of tags is not the same thing, because it does not break.

"_tot" = total, an aggregated dimension, logically different parts.

"_tally" = number of values, an aggregated dimension. Also known as absolute scale.

"_status" = An internal code that reflects the state that will be modified in the known pattern. Consider the military position. You will be born, and then change the state to legal age. At the same time you can't marry multiple people. You can change from a married state to a divorce, from a divorce to a married one. If you die, you can't get married.

"_cat" = classification, from internal sources requiring official determination of the encoding. For example, a category of five hurricanes. It's not like a code, it needs an official verdict.

"_class" = internal encoding of external sources is not required. A category has a number of things in common, and you have rules for animal classification, breastfeeding or crawling. In some cases, it is difficult to use these rules, such as in Australia's mammals, but the exception becomes their own classification-the single hole mesh.

"_type" = code that has the same meaning both inside and outside. The blood type has a newly defined test procedure. It is usually less formal and overlapping than class,type. Type is the weakest of the three classes, and it requires a judgment. In some states, the tricycle is registered as a motorcycle. In other states, cars are used. In some states, if it has a reversing gear, it will also be used as a car.

In actual use, these 3 aspects will be confused. So, by industry standards, even if it violates the definition described above.

"_addr" or "_loc" = the address or location of the entity. There is a subtle difference between address and location. An address can refer to a street address or some external geographic system. A position refers to an internal structure, such as a warehouse number. The bin number can remain the same even if the physical location changes.

"_img" = picture data type, such as. jpg,. gif, and so on. It is important to use a specific format as a property.

You can add them if you want. But remember to track and standardize what you do.

Database Design (1/9): Metadata (data Elements)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More