File data organization

Last Update:2018-12-03 Source: Internet

Author: User

Tags sorts

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Basic concepts of databases (II)
Zeng shxiong
Original article: http://www.ascc.sinica.edu.tw/nl/83/1009/section3_3.html

Liu Jianwen sorting (http://blog.csdn.net/keminlau)

This article is based on 10th exams, 08, 68 pages

Ii. File introduction (I) file structure

Common computer files include executable program files, batch files, text files, and database files. Executable program files are mainly composed of a series of machine code commands that form programs that can handle specific tasks or solve specific problems. The batch file contains the command and the main file name of the executable program file, which is usually used to set the user environment of the computer system or to simplify the execution program. Text files mainly include file files and original program files, which are completely composed of characters (character. A data file is used to store input and output data during program execution. In computer systems, extended file names are often used to differentiate files of different nature. For example, executable program files are .exe or. com, the batch processing file is. BAT, the program source code file usually uses the name of the program language used as the sub-file name. Sub-file names of text files and database files, except for a few special characters, can be selected by the user according to regulations.

Data File structure: a data file is usually composed of a group of records with the same format but different content, and the number of records contained in any database file is not fixed. In details, each record in a file is composed of the same columns (fields). The data types of these columns can be integers, real numbers, or strings, sometimes some columns are further divided into smaller secondary columns.

By carefully comparing the file structure of the table data type and data file, we can find that the difference between the two is only the difference between the proprietary terms. In other words, if the table content is stored in a computer system, a data file with a specific record format will inevitably be formed. Figure 7 shows some proprietary terms in three fields. In the figure, any three different terms in the same column actually mean the same thing. Note: The "relational" in the rightmost column in Figure 7 is the adjective of the relationship. An exclusive term for relational mathematics: Refers to a subset of the Cartesian Product formed by a set of definite fields. Its definition is extremely rigorous, however, some people often mistakenly think that it refers to the relationship between tables (the correct statement is Association ). From 1970 to June, Dr. codd was published in the academic journal "Communication of ACM, he published his famous essay ''a relational model of data for large shared data banks '', proposing to describe the table based on the relationship of substitute mathematics, since then, it has started a boom in academic research on relational data models and relational databases developed by software providers. After many years, relational databases beat the most popular mesh and hierarchical databases and became the only problem in the field of table-based databases.

Because each record (or row in a table) in a data file has the same format but the number of records is not fixed, when defining a data file (or table, you only need to declare the table name and its record (or row) format. For example, Figure 8 shows the salary order record format for reference to DBASE rules. In figure 8, the straight columns from left to right are: column number (Num), column name (field name), data type (field type), width (width ), decimal places (DEC) and index ). In a data type column, "character" indicates that the content of the column is a character or string, and "Number" indicates that the content of the column is a usable number. Width Column records the length of each column (number of bytes). The principle is: Each English letter or Arabic number occupies 1 byte, and each text occupies 2 bytes. In the index column, "Y" indicates that another index file needs to be created for the column, and "N" indicates that no additional index file is required. The purpose of indexing is to help users find the desired records (or rows) from database files (or tables) through this column, this greatly improves the data retrieval speed. However, indexing has the following advantages: it increases the workload for maintaining the database system.

(2) document organization

The purpose of creating database files is very clear: to store and maintain data so that users can access it at any time. Obviously, how to read data from a file is subject to the predefined data storage method. Therefore, how to arrange the access method of records becomes an important topic in the establishment of database files. The so-called file organization refers to some techniques for storing records in files to facilitate retrieve data retrieval by applications. There are four main types of file organizations, including sequential access, direct access, and indexed sequential access) multi-key retrieval ).

Sequential access organization
: This is a file organization with the simplest structure but the slowest retrieval speed. It stores records in files one by one in the order of first arrival and last arrival without any special processing, 9. In a sequential access file organization, because there is only a difference between the first and second records, but there is no relevance, to read any record from this file, you must start from scratch, read and compare the specified columns of each record one by one until you find the desired record or read the entire file (that is, the query fails ). This approach is called linear search ). Assuming that the total number of records in the file is N and each record is equal to the opportunity to be searched, the linear lookup method is used to find the average number of comparisons required for individual records is n/2.

Sequential access organization
: Because the search efficiency of the simple sequential access files is too poor, the program designer usually sorts the records in the data files by small and large columns, columns used for sorting are called keys ). Ordered sequential access can be used to retrieve data (but only valid for key columns) using binary search ). For files with N records, the binary search method can be used to locate or locate the target record only by comparing (log n) times. Note that the log here is based on 2 instead of 10. For files with 1000 records, the average number of comparisons required by the linear search method is 500, and the number of comparisons required by the binary search method is up to 10. In terms of the large number of database files recorded, the data retrieval speed of the binary search method is much faster than that of the linear search method, but the cost of sorting data files must be paid beforehand. Another advantage of sequential file access is the key used for sorting, which is usually the key column used to classify data. Therefore, jobs that need to process data in sequence, such: file batch maintenance and statistical report creation.

Direct access to an organization
: This type of file organization is characterized by converting the key value of the target record into the address of the target record through the predefined program, and then reading the desired record directly based on the address. In other words, direct-access organizations can quickly retrieve individual records from database files. The main technologies of direct access file organization are as follows:

Directory lookup Method
(Directory lookup): create an additional directory file for the data master file in advance. The content of each directory item in the file is the key column content corresponding to the record and the address recorded in the Data master file. The practice is to read the key columns of each record one by one from the data master file and calculate the address of the record. At the same time, copy these directory items to another Created directory file. Sort the directory files by key value or arrange them into a tree (10 ). When retrieving data, first find the qualified directory items from the directory file based on the given key value, read the address, and then read the target records from the data Master File Based on the address. The advantage is that the retrieval speed is fast, but the cost is that the directory file requires additional memory space, and the directory file must be updated along with the maintenance of the Data master file.

Hash Method
(Hashing): Prepare enough database file space in advance. When each record is stored in a file, you must first use a preset rule (used to be called a hash function) convert a key into an address. For example, divide the key value by the preset Constant h and take the remainder as the address. When retrieving data, you can use the same hash function to calculate the original address of the target key value. Therefore, you can quickly retrieve individual records. The problem is that there will inevitably be a collision during address calculation, that is, different key values are used to calculate the same address. When a collision occurs, try to find a new available address. Based on previous experiences, when the amount of data loaded by database files exceeds 80% of its maximum capacity, the collision rate increases sharply due to space congestion. The key to the success or failure of the hash method is to find a good hash function to reduce the collision rate and properly handle the collision. The hash method does not support sorting between records, so it is not conducive to processing jobs in sequence.

Index sequential access organization
: As the name suggests, it is the file organization consisting of attaching the sorted data master file to the index file. In short, the index sequential access organization sorts the primary data files in figure 10, and the database files that have been sorted are stored in the same sequential order. Therefore, the first feature of such a file organization is that it has the advantages of both sequential access organizations that facilitate sequential processing of jobs and direct access to organizations that facilitate the retrieval of individual records. The second feature of the index sequence access organization is that when any record is added or deleted, the data master file and index file can be updated in a timely manner, and the data master files are still sorted. In order to achieve this, the organization of indexed sequential file access is much more complex than previously mentioned. A common practice is to arrange the primary data file into a block connected by a string, and arrange the index file into a B-tree or B * tree structure, as shown in 11. For details, see chapter 12th and Chapter 20th of data management and file processing by Mary es Loomis.

Multi-key retrieval organization: the aforementioned sequential access, direct access to files, index sequential access, and other three file organizations, you can only create a file organization for a single non-duplicate key (that is, the key values of different records are also different). By convention, the primary key of the database file will be selected ). If you need to retrieve records from other columns (commonly referred to as the secondary key and secondary key) in addition to the primary key, you must create a multi-key retrieval organization, A common practice is to create an inversion file. Take the data table of the Customer Account in Figure 12 as an example. Assume that the column ''group-code/branch-type is the ''sub-Key to create an anti-column file: first, each record in the Data master file is extracted from the columns that are pre-used as the secondary key and the primary key (column ID). Then, you can directly access the organization using the directory lookup method, in addition, the anti-column file of the original primary key pair is created, as shown in Figure 13. (To be continued)

1. http://www.ascc.sinica.edu.tw/nl/83/1008/subsection3_3_1.html
3. http://www.ascc.sinica.edu.tw/nl/83/1009/section3_3.html
4. http://www.ascc.sinica.edu.tw/nl/83/1011/section3_2.html

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More