Using partitioned Outer join for thickening reports

Last Update:2018-07-24 Source: Internet

Author: User

Tags hash joins

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

background: In a database table, the data stored is often sparse (sparse data) rather than dense (dense data). First to understand what is sparse data, such as a product sales table (such as product name, sales time (accurate to years), sales of 3 columns), assuming some time some products it does not sell, generally will not store the sales of these products as 0, but not storage, This results in a number of missing lines in the Product sales table (gap rows), resulting in a specific product sales data sorted by Time dimension, is discontinuous, or the product sales are missing in the time series. As the name implies, dense data is relative to sparse data, or with the above assumptions, that is, the product is not sold at a certain time, also must store the sales of this product, sales 0 storage, so for a particular product in the time series is continuous, but the fact is often not so, Therefore, the process of dense sparse data, data density in the application of data warehousing is very common.

Of course, the sales table is only a typical case where there are a variety of missing data situations in the actual application. If the decision makers look at the sales statistics, he does not want to have some products by time series intermittent, but should provide him with time series of continuous analysis report, he may need to see each product every time sales, even if not sold at a certain time, also must set 0, such a report for decision-makers only meaningful, and more granular analysis, such as using analytic functions to calculate sales offsets for each product by year and year, makes it easy to compare sales per month for each product, providing a strong guarantee for decision support.

In order to convert sparse data to dense data, ORACLE10G provides the partitioned Outer join syntax, similar to a generic Outer join (but does not support full Outer joins, Only left and right two are supported, only the syntax of partition by IS added, the table is logically partitioned according to partition by, and then a outer JOIN is performed on each partition, so that the missing rows can be filled to achieve data density, Also equivalent to the data in each partition outer JOIN after the union operation, understand this is very important, otherwise often do not know which table which columns that partition, do not know whether to use a LEFT join or right join, in the following example will be detailed analysis of how this syntax is used.

1.1 Partitioned Outer Join syntax
The partitioned Outer join syntax is as follows:

SELECT .....

From Table_reference

PARTITION by (expr [, expr] ...)

Right OUTER JOIN table_reference

SELECT .....

From Table_reference

Left OUTER JOIN table_reference

PARTITION by (expr [, expr] ...)

The partitioned Outer join syntax is simple, which is to add the partition by statement before the on condition of the join table. The above lists only the simplest two tables (inline view, view, and other result sets) connected, the connection of multiple objects is similar, the other complex grammatical structure is omitted, the grammatical structure of partition by can be placed behind any legitimate connection object, and the general partition by no difference, You can have multiple partition columns (expressions), and then use outer joins, and note that it is important to understand whether to use a LEFT join or right join, such as the 1th syntax structure before the join object using partition by, then the 1th object is populated with missing data, So you must use right JOIN, the 2nd syntax structure is similar.

Of course, you can also use join directly, do not outer join, but this can not fill the missing data, no meaning, but also note that you cannot use the 86 outer JOIN syntax +, this is not possible, you must use 92 syntax. In general, depending on the need to determine the key value of PARTITION by, the PARTITION by statement follows the object that needs to be partitioned, and then decides whether to use a LEFT JOIN or right join depending on the location of the PARTITION by, or else it may error or get incorrect results.

1.2 Partitioned Outer Join instance
This section mainly studies the use of partitioned Outer join from relevant examples, the main examples are filling one-dimensional missing data, filling multidimensional missing data, filling the data into the inventory table medium. For example, please refer to code POJ.SQL for the statement of the building table.
1) Fill in one-dimensional missing data
The T table is a product sales table with the following data:

Dingjun123>select * FROM T
2 ORDER by Years,months,product_name;
Years MONTHS Product_Name SALES
---------- ---------- -------------------- ----------
1 A 1000
1 B 1500
2 A 2000
2 B 3000
2 C 1000
3 A 3000
6 rows have been selected.

The table data above is very simple, in practical applications, this data may be the intermediate result of the statement. From the results can be seen, there are 2008 years 1, 2, 3 of the 3 months of sales data, but some products sales data in some months is missing, such as January 2008 product c no data. Now you need a report that can populate all products corresponding to the missing data from March to 2008 years ago, the sales field is 0, how do you want to implement such a report?

Let's take a look at the traditional approach: since populating each product with missing data for the month, it is necessary to construct a result set that stores the data for each time corresponding to each product, so that it can be used to populate the missing data for this purpose, in order to achieve this goal, It is easy to think of the time of year, month, and product in the table to be Cartesian (each part of the data is unique, this is the data to do the Cartesian product), the resulting data for each time of each product is generated, and then connected to the original table. The following is implemented with SQL:

The--1.with clause secures time and product_name, then generates a Cartesian product in the query, so that there is all the time period data for each product, of course
--The WITH clause here can also generate Cartesian product directly
--2. Cartesian product and the original table T do outer joins, thus realizing the data dense process, of course, here can use T right JOIN ... So it doesn't have to be written so
--Multi + number.
--the WITH clause here distinct from the original table, if the original table is large and inefficient, the actual application is often obtained from other related tables or constructed by itself
With
Year_month as
(SELECT DISTINCT years,months from T),
Product AS
(SELECT DISTINCT product_name from T)
SELECT M.YEARS,M.MONTHS,M.PRODUCT_NAME,NVL (t.sales,0) Sales
From T,
(SELECT Years,months,product_name
From Year_month,product) m
WHERE T.years (+) = M.years
and t.months (+) = M.months
and t.product_name (+) = M.product_name
ORDER by;

--Generate Cartesian product directly according to the improved Sql,with above, then use SQL92 new outer join syntax, omit many + numbers, easier to understand
With
M as
(SELECT Years,months,product_name
From
(SELECT DISTINCT years,months from T),
(SELECT DISTINCT product_name from T)
)
SELECT M.YEARS,M.MONTHS,M.PRODUCT_NAME,NVL (t.sales,0) Sales
From T
Right JOIN
M
On t.years = M.years
and t.months = M.months
and t.product_name = M.product_name
ORDER by;

Traditional fill missing data, often through the Cartesian product constructs the complete data set, and then joins the original table outside. According to the above SQL, this result should be generated all the products of all years of sales data, if not in the original table, then the corresponding missing year data is 0, execute the above SQL result is:

Years MONTHS Product_Name SALES
---------- ---------- -------------------- ----------
1 A 1000
1 B 1500
1 C 0
2 A 2000
2 B 3000
2 C 1000
3 A 3000
3 B 0
3 C 0
9 rows have been selected.

Now filled with 3 rows of missing data, to achieve all products corresponding to the 2008 years ago March time series of dense report purposes, do you find that the traditional approach is more complex, here is a very simple one-dimensional missing data filling, if it is multidimensional missing data fill? SQL is often complex in practice, This sales table T may be an intermediate result of SQL, so this approach requires a Cartesian product to generate all combinations, performance may not be good, and SQL is more complex.

Here's a look at 10g's dedicated improvements to populating data, which are easier to use with partitioned OUTER joins and tend to perform better than traditional practices. By analyzing the partituoned OUTER join and the traditional method of data dense, using the partitioned OUTER join only needs to partition the product and then connect with all the time, the missing data can be complete as follows:

The M here is obtained from the original table, the actual application of the general structure or from other related tables obtained from the original table to obtain distinct, may not be efficient.
SELECT M.YEARS,M.MONTHS,T.PRODUCT_NAME,NVL (t.sales,0) Sales
From T
PARTITION by (T.product_name)
Right JOIN
(SELECT DISTINCT years,months from T) m
On t.years = M.years
and t.months = M.months
ORDER by;

Be sure to understand the two syntactic structures of the partitioned OUTER join, where the partition by is immediately after the table T, equivalent to the row and intermediate result m outside of each partition per Product_Name partition, so that the data can be mended, Equivalent to the union all result of each row divided by product_name and the M outer joins, this example provides a good understanding of the use of the partitioned OUTER join, so that you can rewrite it correctly in a variety of ways. The result of this statement is consistent with the above and is no longer listed. If you understand what is said above, you can use the left join to rewrite:

SELECT M.YEARS,M.MONTHS,T.PRODUCT_NAME,NVL (t.sales,0) Sales
From
(SELECT DISTINCT years,months from T) m
Left
JOIN
T PARTITION by (T.product_name)
On t.years = M.years
and t.months = M.months
ORDER by;

The result of the execution is exactly the same as the right join above, why does it become a LEFT join? The reason is that the T PARTITION by is now moved behind the join, and of course it's time to connect all the times to the missing data, so use the 2nd syntax structure. Below is a look at the execution plan for this statement:

Execution plan
----------------------------------------------------------
Plan Hash value:1692607762
-------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time |
-------------------------------------------------------------------------------------
| 0 |    SELECT STATEMENT |    | 1 | 46 | 10 (40) | 00:00:01 |
|  1 |    SORT ORDER by |    | 1 | 46 | 10 (40) | 00:00:01 |
| 2 |    VIEW |    | 1 |    46 | 9 (34) | 00:00:01 |
| 3 |    MERGE JOIN PARTITION outer|    | 1 |    72 | 9 (34) | 00:00:01 |
|    4 |    SORT JOIN |    | 6 |    156 | 5 (40) | 00:00:01 |
|    5 |    VIEW |    | 6 |    156 | 4 (25) | 00:00:01 |
|    6 |    HASH UNIQUE |    | 6 |    156 | 4 (25) | 00:00:01 |
|       7 | TABLE ACCESS Full |    T | 6 |    156 | 3 (0) | 00:00:01 |
|* 8 |    SORT PARTITION JOIN |    | 6 |    276 | 4 (25) | 00:00:01 |
|    9 | TABLE ACCESS Full |    T | 6 |    276 | 3 (0) | 00:00:01 |
-------------------------------------------------------------------------------------
predicate information (identified by Operation ID):
---------------------------------------------------
8-access ("T". " Years "=" M "." Years ")

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More