Relational Database Advanced (ii): Global Overview


In the previous article, Relational Database Advanced (i): Database Foundation, we covered the database fundamentals. Now we need to step back and look at the big picture of a database.

A database is a collection of information that is easy to access and modify. However, a simple pile of files could serve the same purpose. In fact, the simplest databases, such as SQLite, are just a bunch of files. But SQLite is a carefully designed bunch of files, because it allows you to:

1. use transactions to ensure data safety and consistency (see the sketch below);

2. quickly process data, even millions of rows.
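As a minimal sketch of point 1, here is what a transaction looks like in SQLite (the accounts table and the amounts are assumptions for illustration):

BEGIN TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- debit one account
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- credit another
COMMIT;  -- either both updates are persisted, or (with ROLLBACK, or after a crash) neither is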

More generally, a database can be pictured in the following form.

In this article, I won't pay particular attention to how to organize a database or how to name its various processes, because I have chosen my own way of describing these concepts to fit the article. Different databases have different components, but the overall idea is the same: a database is made of a variety of components that interact with one another.

Core components:

Process Manager: Many databases have a pool of processes/threads that needs to be managed. Moreover, to gain precious nanoseconds, some modern databases use their own threads instead of operating system threads.

Network Manager: Network I/O is a big issue, especially for distributed databases, so some databases have their own network manager.

File System Manager: Disk I/O is the primary bottleneck of a database, so it is important to have a manager that handles the OS file system perfectly, or even replaces it.

Memory Manager: A large amount of memory is needed to avoid the penalty of disk I/O, but to handle a large amount of memory you need an efficient memory manager, especially when many queries use memory at the same time.

Security Manager: Used to authenticate and authorize users.

Client Manager: Used to manage client connections.

......

Tools:

Backup Manager: Used to save and restore data.

Recovery Manager: Used to restart the database in a consistent state after a crash.

Monitor Manager: Used to log database activity and to provide tools for monitoring the database.

Administration Manager: Used to store metadata (such as the names and structures of tables) and to provide tools for managing databases, schemas, tablespaces, and so on.

......

Query Manager:

Query parser: Used to check whether a query is valid

Query rewriter: Used to pre-optimize a query

Query optimizer: Used to optimize a query

Query executor: Used to compile and execute a query

Data Manager:

Transaction Manager: Used to handle transactions

Cache Manager: Used to pull data into memory before using it, and to keep data in memory before writing it to disk

Data Access Manager: Used to access the data on disk

In the remainder of this article, I will focus on how a database manages an SQL query, through the following components:

Client Manager

Query Manager

Data Manager (including the Recovery Manager)

Client Manager

The client manager handles communication with clients. A client can be a (web) server, an end user, or an end application. The client manager provides different ways to access the database through a set of well-known APIs (JDBC, ODBC, OLE-DB ...).

Some client managers also provide proprietary APIs for database access.

When you connect to a database:

1. The manager first checks your authentication information (user name and password), then checks whether you are authorized to access the database. These permissions are assigned by the DBA.

2. Then the manager checks whether there is an idle process (or thread) available to handle your query.

3. The manager also checks whether the database is under heavy load.

4. The manager may wait a moment for the required resources. If the wait reaches a timeout, it closes the connection and returns a readable error message.

5. The manager then sends your query to the query manager, and your query is processed.

6. Because query processing is not an "all or nothing" operation, as soon as the manager gets data from the query manager, it stores the partial results in a buffer and starts sending them to you.

7. If a problem occurs, the manager closes the connection, sends you a readable explanation, and releases the resources.

Query Manager

This part is where the power of a database lies. In this part, a poorly written query is transformed into fast executable code, the code is executed, and the results are returned to the client manager. This multi-step operation works as follows:

1. The query is first parsed and checked for validity.

2. It is then rewritten to remove useless operations and to add pre-optimizations.

3. It is then optimized to improve performance and transformed into an execution and data-access plan.

4. The plan is then compiled.

5. Finally, it is executed.

I will not go into the last two steps here, because they are less important.

Query parser

Each SQL statement is sent to the parser, which checks its syntax. If your query is wrong, the parser rejects it. For example, if you write "SLECT ..." instead of "SELECT ...", the story ends right there and the query is rejected.

But that is not the end of the story: the parser also checks that the keywords appear in the correct order. For example, a WHERE before a SELECT is rejected.

The parser then analyzes the tables and fields in the query, using the database metadata to check the following (a short sketch follows the list):

1. whether the tables exist;

2. whether the fields of the tables exist;

3. whether the operations on the field types are possible (for example, you cannot compare an integer with a string, and you cannot use the SUBSTRING() function on an integer).
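As a hedged illustration (the person table and its columns are assumptions), here is a query that each of these checks would reject:

SELECT * FROM persn;                      -- rejected: the table 'persn' does not exist
SELECT nickname FROM person;              -- rejected: the field 'nickname' does not exist
SELECT SUBSTRING(age, 1, 2) FROM person;  -- rejected (in strictly typed databases): SUBSTRING() on an integer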

The parser then checks to see if you have permission to read (or write) the table in the query. Again: These permissions are assigned by the DBA.

During parsing, the SQL query is converted to an internal representation (usually a tree).

If everything is in order, the internal representation is sent to the query rewriter.

Query rewriter

At this step, we have an internal representation of the query. The goals of the rewriter are:

1. to pre-optimize the query;

2. to avoid unnecessary operations;

3. to help the optimizer find the best possible solution.

The rewriter runs the query against a list of known rules. If the query matches the pattern of a rule, the query is rewritten according to that rule. The following is a non-exhaustive list of (optional) rules; a short sketch of two of them follows the list:

View merging: If you use a view in your query, the view is transformed into the underlying SQL code of the view.

Subquery flattening: Subqueries are hard to optimize, so the rewriter tries to rewrite queries to remove their subqueries.

For example:

SELECT PERSON.* FROM PERSON WHERE PERSON.person_key IN (SELECT MAILS.person_key FROM MAILS WHERE MAILS.mail LIKE 'christophe%');

will be converted to:

SELECT PERSON.* FROM PERSON, MAILS WHERE PERSON.person_key = MAILS.person_key AND MAILS.mail LIKE 'christophe%';

Removal of unnecessary operators: For example, if you use DISTINCT on a column that already has a UNIQUE constraint (which by itself prevents duplicate data), the DISTINCT keyword is removed.

Redundant join elimination: If the same join condition appears twice, for example because a join condition is hidden inside a view, or if a join is useless by transitivity, it is removed.

Constant folding: If your query contains a calculation, the calculation is performed once during the rewrite. For example, WHERE age > 10+2 is transformed into WHERE age > 12, and TODATE("date string") is transformed into a date value in the datetime format.

(Advanced) Partition pruning: If you use a partitioned table, the rewriter can work out which partitions to use.

(Advanced) Materialized view rewrite: If you have a materialized view that matches a subset of the predicates in your query, the rewriter checks whether the view is up to date and modifies the query so that it uses the materialized view instead of the raw tables.

(Advanced) Custom rules: If you have custom rules that modify queries (like Oracle policies), the rewriter executes them.

(Advanced) OLAP transformations: Analytical/windowing functions, star joins, ROLLUP ... are also transformed (but I am not sure whether this is done by the rewriter or the optimizer; since the two processes are so closely linked, it must depend on the database).

(Translator's note: a materialized view is a precomputed and stored query result; a predicate is a conditional expression whose evaluation returns true or false.)
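As promised, here is a minimal sketch of two of these rules in action, assuming a hypothetical person table whose email column carries a UNIQUE constraint:

-- Removal of unnecessary operators: with email declared UNIQUE,
SELECT DISTINCT email FROM person;
-- may be rewritten by the database as:
SELECT email FROM person;

-- Constant folding: the arithmetic is evaluated once during the rewrite, so
SELECT * FROM person WHERE age > 10 + 2;
-- becomes:
SELECT * FROM person WHERE age > 12;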

The rewritten query is then sent to the optimizer, and the fun begins.

Statistics

Before we look at how a database optimizes a query, we need to talk about statistics, because a database without statistics is stupid. A database will not analyze its own data unless you explicitly tell it to, and no analysis leads the database to make (very) bad assumptions.

But what type of information does the database need?

First, I have to talk (briefly) about how databases and operating systems store data. The smallest unit they use is called a page or block (4 or 8 KB by default). This means that if you only need 1 KB, it still costs you one page; if the page size is 8 KB, you waste 7 KB.

Now, back to statistics! When you ask a database to gather statistics, it computes the following values:

1. the number of rows and pages in the table;

2. for each column in the table:

the number of distinct values
the length of the values (min, max, average)
the range of the values (min, max, average)

3. information on the table's indexes.

These statistics help the optimizer estimate the disk I/O, CPU, and memory usage required by a query.
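As a hedged sketch of what this looks like in practice, here is how you might gather and inspect these statistics in PostgreSQL (the person table is an assumption):

-- Ask PostgreSQL to (re)compute the statistics for a table
ANALYZE person;

-- Inspect some of the per-column statistics the optimizer will use
SELECT attname, n_distinct, avg_width
FROM pg_stats
WHERE tablename = 'person';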

The per-column statistics are very important.
For example, suppose a table PERSON needs to be joined on 2 columns: LAST_NAME and FIRST_NAME.
From the statistics, the database knows that FIRST_NAME has only 1,000 distinct values while LAST_NAME has 1,000,000 distinct values.
Therefore, the database joins on LAST_NAME, FIRST_NAME rather than FIRST_NAME, LAST_NAME.
Since a LAST_NAME is unlikely to repeat, in most cases comparing the first 2 or 3 characters of LAST_NAME is enough, which drastically reduces the number of comparisons.

However, these are just basic statistics. You can also ask the database to compute advanced statistics called histograms. A histogram describes the distribution of the values within a column. For example:

the most frequent values

the quantiles

...

These additional statistics help the database find an even better query plan, especially for equality predicates (for example: WHERE age = 18) and range predicates (for example: WHERE age > 10 AND age < 40), because the database can better estimate the number of rows concerned by these predicates (note: the technical name for this concept is selectivity).
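In PostgreSQL, for example, these histogram statistics can be inspected through the pg_stats view (table and column names are assumptions):

SELECT most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats
WHERE tablename = 'person' AND attname = 'age';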

Statistics are stored in the database metadata. For example, here is where to find the statistics for (non-partitioned) tables:

Oracle: USER/ALL/DBA_TABLES and USER/ALL/DBA_TAB_COLUMNS

DB2: SYSCAT.TABLES and SYSCAT.COLUMNS

Statistics must be kept up to date. If a table has 1,000,000 rows and the database thinks it has only 500, nothing could be worse. The only downside of statistics is that they take time to compute, which is why most databases do not compute them automatically by default. Computing statistics becomes difficult when a table holds millions of rows, in which case you may choose to compute only basic statistics, or to compute the statistics on a sample of the database.

For example, I was once involved in a project that had to handle a database with hundreds of millions of rows per table, and I chose to compute statistics on only 10% of the data, which saved a huge amount of time. This turned out to be a bad decision, because occasionally the 10% chosen by Oracle 10g for a specific column of a specific table was very different from the overall 100% (which rarely happens for a table with 100 million rows). The wrong statistics turned a query that should have finished in 30 seconds into one that ran for 8 hours, and hunting down the root cause was a nightmare. This example shows how important statistics are.
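For reference, gathering statistics on a sample like the 10% mentioned above might be done in Oracle with the DBMS_STATS package (the schema and table names are assumptions):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname          => 'MY_SCHEMA',  -- assumed schema name
    tabname          => 'BIG_TABLE',  -- assumed table name
    estimate_percent => 10);          -- compute statistics on a 10% sample of the rows
END;
/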

Note: of course, each database has its own specific, more advanced statistics. If you want more information, read the documentation of your database. That said, of all my attempts to understand how statistics are used, the best official documentation I found came from PostgreSQL.

Query optimizer

All modern databases use cost-based optimization (CBO) to optimize queries. The idea is to assign a cost to each operation and to find the cheapest way of getting the result by applying the sequence of operations with the lowest total cost.

To understand how a cost optimizer works, I think it is best to use an example to "feel" the complexity behind the task. Here I will present three common ways of joining two tables, and we will quickly see that even a simple join query is a nightmare to optimize. After that, we will see how real optimizers do the job.

For these join operations, I will focus on their time complexity, although a database optimizer actually computes their CPU cost, disk I/O cost, and memory requirements. The difference between time complexity and CPU cost is that time complexity is very approximate (fine for a lazy guy like me), while for CPU cost I would have to count every single operation: an addition, an "if" test, a multiplication, an iteration ... Moreover:

Each high-level code operation maps to a specific number of low-level CPU operations.

The cost of a CPU operation (in terms of CPU cycles) differs between an Intel Core i7, an Intel Pentium 4, an AMD Opteron ...; in other words, it depends on the CPU architecture.

Using time complexity is much easier (at least for me), and we can still grasp the concept of CBO with it. I will also occasionally mention disk I/O, since it is an important concept. Keep in mind that most of the time the bottleneck is disk I/O, not CPU usage.

Index

We already talked about indexes when we studied B+ trees; remember that an index is already sorted.

For your information: there are other types of indexes, such as bitmap indexes, and their costs in terms of CPU, disk I/O, and memory are not the same as those of B+ tree indexes.

Moreover, many modern databases can dynamically create temporary indexes, used only by the current query, when doing so improves the cost of the execution plan.

Access path

Before applying your join operators, you first need to get the data. Here is how you get it.

Note: since the real issue with all access paths is disk I/O, I won't talk much about time complexity.

(Translator's note: see also "Introduction to four types of Oracle index scans".)

1 Full Scan

If you have ever read an execution plan, you must have seen the words "full scan" (or just "scan"). Simply put, a full scan is the database reading a table or an index in its entirety. In terms of disk I/O, a full table scan is obviously more expensive than a full index scan.
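As a hedged illustration, PostgreSQL reports a full table scan as a "Seq Scan" in its query plans (the person table is an assumption):

EXPLAIN SELECT * FROM person;
-- Typical first line of the plan: Seq Scan on person  (cost=... rows=... width=...)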

2 Range Scan

Another type of scan is the index range scan. It is used, for example, when you use a predicate such as "WHERE age > 20 AND age < 40".

Of course, you need an index on the age field to use an index range scan.

From the first part, we already know that the time cost of a range query is approximately log(N) + M, where N is the amount of data in the index and M is the estimated number of rows within the range. Thanks to statistics, we know the values of both N and M (note: M is the selectivity of the predicate "age > 20 AND age < 40"). Moreover, for a range scan you don't need to read the whole index, so the disk I/O cost is lower than that of a full scan.
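A minimal sketch of a query that can use an index range scan (the index and table names are assumptions; with too little data, the planner may still prefer a full scan):

CREATE INDEX person_age_idx ON person (age);
EXPLAIN SELECT * FROM person WHERE age > 20 AND age < 40;
-- With enough rows, the plan shows an Index Scan (or Bitmap Index Scan) using person_age_idx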

3 Unique Scan

If you only need one value from an index, you can use a unique scan.

4 Access based on row ID

In most cases, when the database uses an index, it then has to look up the rows associated with the index entries. To do this, it uses row-ID-based access.

For example, if you run:

SELECT LASTNAME, FIRSTNAME FROM PERSON WHERE AGE = 28;

If the age column of the PERSON table has an index, the optimizer uses the index to find everyone who is 28, and then it has to read the associated rows in the table, because the index only contains the age and you also want the first and last names.

But if instead you run:

SELECT TYPE_PERSON.* FROM PERSON, TYPE_PERSON WHERE PERSON.AGE = TYPE_PERSON.AGE;

then the index on the PERSON table is used to join with the TYPE_PERSON table, but the PERSON table itself is not accessed via row IDs, because you are not asking for any information stored in that table.

This approach works well for a small number of accesses, but the real issue with this operation is disk I/O. If too many row-ID-based accesses are needed, the database may choose a full scan instead.

Other paths

I have not listed all the access paths here. If you are interested, you can read the documentation of your database.

