Partitioned views join horizontally partitioned data from a group of member tables, making the data appear to come from a single table. SQL Server 2000 distinguishes between local partitioned views and distributed partitioned views. In a local partitioned view, all member tables and the view reside on the same instance of SQL Server. In a distributed partitioned view, at least one of the member tables resides on a different (remote) server. Distributed partitioned views are not recommended for data warehouse applications.
Dimensional data warehouses are built around facts (measures) and dimensions. Physically, they are usually implemented as star schemas and snowflake schemas, and only rarely as fully denormalized flat tables that combine both facts and dimensions. Since the dimensional schema is the most common relational data warehouse structure, this article focuses on partitioning such schemas. The following suggestions also apply to other general data warehouse schemas.
Advantages of partitioning
Data pruning:
Many data warehouse administrators regularly archive stale data. For example, a clickstream data warehouse may keep detailed data online for only three to four months. Other common rules keep data online for 13 months, 37 months, or 10 years. When old data leaves the active window, it is archived and deleted from the database. This rolling window structure is common practice in large data warehouses.
Without a partitioned table, the process of deleting old data from the database requires a large DELETE statement, for example:
DELETE FROM fact_table WHERE date_key < 19990101
Executing this statement is very expensive and may take longer than the load process for the same table. Conversely, with a partitioned table, the administrator redefines the UNION ALL view to exclude the oldest member table, and then drops that table from the database (assuming it has been backed up). This process completes almost instantly.
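As a minimal sketch, using the example member tables defined later in this article and assuming the 1999 table is the oldest partition:

```sql
-- Take the oldest member table offline by redefining the view without it ...
ALTER VIEW [dbo].[sales_fact] AS
SELECT * FROM [dbo].[sales_fact_20000101]
GO
-- ... then drop the table itself (after verifying the backup)
DROP TABLE [dbo].[sales_fact_19990101]
GO
```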
As we will discuss later, maintaining partitioned tables also has a cost. If data pruning is the only reason for partitioning, designers should consider deleting old data from an unpartitioned table in small batches instead. A low-priority process can run a script that deletes 1000 rows at a time (using the SET ROWCOUNT 1000 command) until all the stale data is removed. This technique works effectively on large systems and is more straightforward than building the necessary partition management system. Depending on load and system usage, it suits some systems and should be benchmarked on yours.
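A minimal sketch of such a batched delete, assuming the fact_table and date_key names used elsewhere in this article:

```sql
-- Hypothetical nibbling delete: remove stale rows in small batches
-- so locks and log growth stay bounded; run as a low-priority job.
SET ROWCOUNT 1000
DELETE FROM fact_table WHERE date_key < 19990101
WHILE @@ROWCOUNT > 0
BEGIN
    DELETE FROM fact_table WHERE date_key < 19990101
END
SET ROWCOUNT 0
```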
Loading speed:
The fastest way to load data is to load it into an empty table or a table with no indexes. Loading into smaller partitioned tables greatly improves the efficiency of the incremental load process.
Maintainability:
Once a data warehouse staging application that supports partitioning has been built, the entire system becomes easier to maintain. Maintenance activities, including loading data and backing up and restoring tables, can be performed in parallel, which can greatly improve performance. The process of incrementally populating downstream cubes can also be accelerated and simplified.
Query speed:
Query speed should not be a reason to partition a data warehouse relational database. Query performance is similar for partitioned and unpartitioned fact tables. In a well-designed partitioned database, the relational engine includes in the query plan only the partitions needed to resolve the query. For example, if the database is partitioned by month and a query is constrained to January 2000, the query plan includes only the January 2000 partition. The query then performs against the partitioned table about as well as it would against a single table with a clustered index on the partition key.
Disadvantages of partitioning
Complexity:
The main disadvantage of partitioning is that it requires administrators to build an application to manage the partitions. A data warehouse that uses horizontal partitioning in the relational database should not go into production before the partition management application has been designed, tested, and piloted. One of the purposes of this article is to discuss the issues and design decisions involved in partition management applications.
Query design constraints:
To get the best query performance, all queries should place conditions directly on the filter key in the fact table. A query that constrains a secondary table instead (for example, the date dimension table) will include all partitions.
Factors to consider when designing:
Typically, the administrator of a dimensional data warehouse partitions only the fact table; there is little benefit in partitioning the dimension tables. In some cases, partitioning a very large dimension table containing more than 10 million members may be beneficial. Non-dimensional relational data warehouses can also be partitioned, and the general ideas in this article still apply.
An effective partitioning plan can be formulated only after fully considering the system architecture and the design goals. Even with the same schema design, a relational data warehouse used only to populate Analysis Services cubes might use a different partition structure than one queried directly by analysts. Systems with rolling windows must be partitioned by time; other systems need not be.
If the data warehouse includes Analysis Services cubes, Microsoft recommends that the partitions of the relational data warehouse and the Analysis Services database be parallel structures. This simplifies the maintenance application: it creates a new cube partition at the same time it creates a new table in the relational database, and the administrator needs to master only one partitioning strategy. Nevertheless, an application may have good reason to partition the two databases differently; the only cost is the added complexity of the database maintenance application.
Overview of partition design
Partitioned tables in a SQL Server database can use updatable or query-only (non-updatable) partitioned views. In both cases, the table partitions are created with CHECK constraints that guarantee each partition contains only the correct data. An updatable partitioned view supports INSERT (and UPDATE and DELETE) operations on the view and pushes each operation to the correct underlying member table. This is convenient, but data warehouse applications usually require bulk loads, which cannot be performed through a view. The following table summarizes the requirements, advantages, and disadvantages of updatable and query-only partitioned views.
Using Partitions in a SQL Server 2000 Data Warehouse
Microsoft recommends that you define primary keys and design the fact table as a local (single-server) partitioned union view. In most cases, this definition results in an updatable partitioned view, but the data warehouse maintenance application should be designed to bulk load most data directly into the member tables rather than through the view.
Syntax example:
The following code examples illustrate the syntax for defining the member tables and the union view, and for inserting data into the view:
Create the 1999 fact table:
CREATE TABLE [dbo].[sales_fact_19990101] (
    [date_key] [int] NOT NULL
        CHECK ([date_key] BETWEEN 19990101 AND 19991231),
    [product_key] [int] NOT NULL,
    [customer_key] [int] NOT NULL,
    [promotion_key] [int] NOT NULL,
    [store_key] [int] NOT NULL,
    [store_sales] [money] NULL,
    [store_cost] [money] NULL,
    [unit_sales] [float] NULL)
ALTER TABLE [sales_fact_19990101]
ADD PRIMARY KEY ([date_key], [product_key], [customer_key], [promotion_key], [store_key])
Create the 2000 fact table:
CREATE TABLE [dbo].[sales_fact_20000101] (
    [date_key] [int] NOT NULL
        CHECK ([date_key] BETWEEN 20000101 AND 20001231),
    [product_key] [int] NOT NULL,
    [customer_key] [int] NOT NULL,
    [promotion_key] [int] NOT NULL,
    [store_key] [int] NOT NULL,
    [store_sales] [money] NULL,
    [store_cost] [money] NULL,
    [unit_sales] [float] NULL)
ALTER TABLE [sales_fact_20000101]
ADD PRIMARY KEY ([date_key], [product_key], [customer_key], [promotion_key], [store_key])
Create a UNION ALL view:
CREATE VIEW [dbo].[sales_fact] AS
SELECT * FROM [dbo].[sales_fact_19990101]
UNION ALL
SELECT * FROM [dbo].[sales_fact_20000101]
Now insert a few rows of data, for example:
INSERT INTO [sales_fact] VALUES (19990125, 347, 8901, 0, 13, 5.3100, 1.8585, 3.0)
INSERT INTO [sales_fact] VALUES (19990324, 576, 7203, 0, 13, 2.1000, 0.9450, 3.0)
INSERT INTO [sales_fact] VALUES (19990604, 139, 7203, 0, 13, 5.3700, 2.2017, 3.0)
INSERT INTO [sales_fact] VALUES (20000914, 396, 8814, 0, 13, 6.4800, 2.0736, 2.0)
INSERT INTO [sales_fact] VALUES (20001113, 260, 8269, 0, 13, 5.5200, 2.4840, 3.0)
To verify that the partition is working properly, use Query Analyzer to display the query plan, for example:
SELECT TOP 2 * FROM sales_fact WHERE date_key = 19990324
You should see that only the 1999 table is included in the query plan. Compare this plan with the plan generated by the same query against the schema with the primary keys removed: the 2000 table is still excluded. Then compare these plans with the plan generated against the schema with the date_key constraints removed: with the constraints gone, both the 1999 and the 2000 tables are included in the query.
Note that, in general, it is good practice to use the TOP N syntax when running exploratory queries against large tables, because it returns results quickly while using minimal server resources. This is especially important when examining the query plan of a partitioned table, because the plan generated by a SELECT * statement is difficult to parse. To the casual observer, the query plan appears to include every component table of the UNION ALL view, although only the relevant tables are actually used during query execution.
Apply the condition directly to the fact table:
To get the best query performance, all queries should place conditions directly on the filter key in the fact table. A query that constrains a secondary table instead (such as the date dimension table) will include all partitions. A standard star join query against the UNION ALL fact table works well: place conditions on the attributes of any unpartitioned dimension tables, and build the star query's WHERE clause in the standard way, including the attributes of the partitioning dimension (date).
Designing queries against a partitioned dimensional schema is exactly like designing against an unpartitioned schema, except that date conditions are most effective when placed directly on the date key in the fact table.
If each partitioned table has a clustered index whose first column is the date key, the overhead of visiting all partitions to resolve a particular query is relatively modest. Predefined queries, such as those that generate standard reports or incrementally update downstream databases, should be written as efficiently as possible.
Selection of partition key
The fact table can be partitioned on multiple dimensions, but most implementations partition by date only. As described earlier, date partitioning supports simple "rolling window" management; older partitions can even be stored in a different location or carry fewer indexes. Moreover, most queries against the data warehouse filter by date.
For applications partitioned by date, the decision variables are:
How much data should be kept online? This decision is driven primarily by business requirements, tempered by the cost of keeping large volumes of data online.
How should the date key be designed? It is widely accepted that surrogate keys should be used for the dimension and fact tables of a data warehouse. For fact tables partitioned by date, the recommended practice is a "smart" integer surrogate key of the form yyyymmdd. As an integer, this key uses only 4 bytes, compared with 8 bytes for a datetime. Many data warehouses nevertheless use natural date keys of type datetime.
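For example, a yyyymmdd integer key can be derived from a datetime value using convert style 112, which formats a date as 'yyyymmdd':

```sql
-- Derive a yyyymmdd integer surrogate key from a datetime value
SELECT CONVERT(int, CONVERT(char(8), GETDATE(), 112)) AS date_key
```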
How should the partition size be determined? Although the example above uses yearly partitions, most systems use a finer granularity, such as month, week, or day. Although user queries commonly align with month or week boundaries, the most important factor is the overall size and manageability of the system. Recall that a SQL query can reference at most 256 tables.
For data warehouses that keep more than 256 days of data online, a UNION ALL view partitioned by day would exceed this limit. As a rule of thumb, if the fact table is partitioned only by date, partitioning by week is a good choice.
Define the scope of the partition
The BETWEEN syntax is the most direct, readable, and efficient. Take the following set of monthly partitions as an example:
date_key < 19990101
date_key BETWEEN 19990101 AND 19990131
date_key BETWEEN 19990201 AND 19990229
...
date_key BETWEEN 19991201 AND 19991231
date_key > 19991231
Note the first and last partitions: even if you believe no data will ever land in them, defining these partitions is good practice because it covers every possible date_key value. Also note that although 1999 is not a leap year, the February partition still covers February 29. This structure spares the application that creates partitions and constraints from having to determine whether a year is a leap year.
Should partitions be merged over time? To minimize the number of active partitions, the database administrator can design the partitioning application to merge daily partitions into weekly or monthly partitions. This approach is discussed in detail in the section below on populating and maintaining partitions.
The detailed discussion on how to partition by date also applies to the use of other possible partition keys.
Data loading: if incoming data aligns naturally with another dimension, for example if the data for each store or subsidiary is delivered by a different source system, that dimension is a natural partition key.
Cube data query: although there is no technical reason to partition the relational database and the Analysis Services cubes in the same way, doing so is common practice, and the maintenance application is simplified if this assumption holds. Thus, even if the relational database exists only to populate Analysis Services cubes, the general query patterns should be considered when choosing the partition key.
Naming convention: the names of the member tables of a horizontally partitioned fact table should follow naturally from the partition design. For maximum flexibility, use the full partition start date in the name: even for yearly partitions, [sales_fact_yyyymmdd] is better than [sales_fact_yyyy].
If the database supports multiple size partitions, the naming convention should reflect the time range of each partition. For example, the monthly partition uses sales_fact_20001101m and the daily partition uses sales_fact_20001101d.
Member table names are hidden from end users, who access the data through the view, so the names should be chosen for the convenience of the application that performs maintenance.
Downstream data cube partitioning
If the relational database exists only to support Analysis Services cubes, there is no need to define a UNION ALL view. In that case the application is not bound by the 256-table limit, but it is still recommended not to partition the relational data warehouse in a way that would make a UNION ALL view impossible to define.
Managing the partitioned fact table
The partitioned data warehouse should not go into production before partition management has been automated and tested. The partition management system is a simple application; its general requirements are discussed below.
The discussion that follows assumes partitioning by date.
Metadata
A robust partition management system should be driven by metadata. The metadata can be stored anywhere, as long as it can be accessed programmatically. Most data warehouse systems use custom metadata tables defined on the data warehouse SQL Server, or Microsoft SQL Server Meta Data Services.
Regardless of the storage mechanism of the metadata, the content of the metadata must include the following information for each partition:
1. The partition name;
2. The date the partition was created;
3. The date range of the data in the partition;
4. The date the partition came online (was added to the UNION ALL view);
5. The date the partition went offline (was dropped from the view);
6. The date the partition was dropped from the database.
Other metadata tables that are part of the overall management system of the data warehouse should track when and how much data is loaded into each partition.
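A minimal sketch of such a metadata table (the table and column names here are hypothetical, not part of any Microsoft schema):

```sql
-- Hypothetical metadata table tracking the life cycle of each partition
CREATE TABLE [dbo].[partition_metadata] (
    [partition_name]  sysname  NOT NULL PRIMARY KEY,
    [date_created]    datetime NOT NULL,
    [range_start_key] int      NOT NULL,  -- first date_key in the partition
    [range_end_key]   int      NOT NULL,  -- last date_key in the partition
    [date_online]     datetime NULL,      -- added to the UNION ALL view
    [date_offline]    datetime NULL,      -- removed from the view
    [date_dropped]    datetime NULL       -- dropped from the database
)
```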
Create new partition
The first task of the partition management system is to create new partitions. A periodically scheduled task should create the new table that will serve as the next partition.
There are many effective ways to perform this task. The recommended method is to use SQL-DMO (SQL Distributed Management Objects) to create a new table with the same structure and indexes as an existing partition, but with a new table name, index names, partition key constraint definition, filegroup, and so on:
1. Get the definition of a template table (usually the most recent partition);
2. Modify the Name properties of the table and indexes, the Text property of the check constraint, and other properties;
3. Instantiate the new table with the Add method.
With sensible naming conventions, this task takes only a few lines of code.
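Where SQL-DMO is not available, a minimal dynamic-SQL alternative can build the next member table; the sketch below hard-codes a hypothetical January 2001 range and reuses the example fact columns from earlier in this article:

```sql
-- Hypothetical dynamic-SQL sketch: build the next monthly member table
DECLARE @start char(8), @end char(8), @sql varchar(2000)
SELECT @start = '20010101', @end = '20010131'
SET @sql = 'CREATE TABLE [dbo].[sales_fact_' + @start + '] ('
    + '[date_key] int NOT NULL CHECK ([date_key] BETWEEN ' + @start + ' AND ' + @end + '),'
    + '[product_key] int NOT NULL, [customer_key] int NOT NULL,'
    + '[promotion_key] int NOT NULL, [store_key] int NOT NULL,'
    + '[store_sales] money NULL, [store_cost] money NULL, [unit_sales] float NULL,'
    + 'PRIMARY KEY ([date_key], [product_key], [customer_key], [promotion_key], [store_key]))'
EXEC (@sql)
```

In practice, SQL-DMO (or, in later versions of SQL Server, SMO) gives finer control over index and filegroup placement than string-built DDL.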
If the data warehouse system uses Analysis Services partitions for its cubes, as discussed later in this article, the scripts and programs that create partitions in the RDBMS can go on to create the corresponding cube partitions with Decision Support Objects (DSO).
Fill the partition
As mentioned earlier, data can be inserted through the UNION ALL view. In theory this is a major feature of the table partitioning structure, but in practice it is not recommended for data warehouse applications. Data cannot be bulk loaded through the UNION ALL view, and for a warehouse large enough that the table must be partitioned, loading any other way would be too slow.
Instead, the data warehouse application must be designed so that each cycle's data is loaded quickly into the correct target table. If the staging application is implemented with SQL Server Data Transformation Services (DTS), a Dynamic Properties task can easily change the destination table name of a Data Pump task or Bulk Insert task.
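A sketch of the load step itself, targeting the member table directly (the file path and field terminator here are hypothetical):

```sql
-- Bulk load the nightly extract directly into the current member
-- table, never through the UNION ALL view
BULK INSERT [dbo].[sales_fact_20000101]
FROM 'C:\load\sales_fact_20000101.txt'
WITH (FIELDTERMINATOR = '|', TABLOCK)
```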
Because the new partition has not yet been added to the UNION ALL view, the data can be loaded without system downtime.
The data warehouse staging application should be designed to handle new data that does not belong to the current partition. This situation can occur if the data warehouse load does not finish overnight. Other systems must handle old data that arrives late. The design must account for the possibility, frequency, and volume of these exceptions.
If old data arrives in sufficiently small volumes, the simplest design is to load all data that does not belong to the current partition through the updatable UNION ALL view.
Define UNION ALL view
Once the incremental load has completed successfully, the UNION ALL view must be revised. Again, SQL-DMO is the recommended way to perform this task: use the Alter method to change the Text property of the View object. The best approach is to generate the list of partitions to include in the view definition from the metadata tables described above.
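The resulting statement, however it is generated, looks like the following sketch (assuming a hypothetical third member table for January 2001 has just come online):

```sql
-- Regenerate the view to include the newly loaded member table
ALTER VIEW [dbo].[sales_fact] AS
SELECT * FROM [dbo].[sales_fact_19990101]
UNION ALL
SELECT * FROM [dbo].[sales_fact_20000101]
UNION ALL
SELECT * FROM [dbo].[sales_fact_20010101]
```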
Merge partition
On the surface, merging several partitions into a single larger partition may seem redundant. However, for a data warehouse with large daily loads and a small load window, the following steps can significantly improve load performance:
1. Create a text file containing the data to be loaded, sorted in clustered index order;
2. Bulk load it into an empty daily partition;
3. Create all nonclustered indexes.
Bring the new partition online by re-creating the UNION ALL view.
Each week, a new weekly partition is created and populated by inserting from the daily partitions, re-creating the indexes, and regenerating the UNION ALL view. The daily partitions can then be dropped.
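A sketch of that weekly fold, using hypothetical daily (d) and weekly (w) member table names that follow the naming convention described earlier:

```sql
-- Fold a week of daily partitions into one weekly member table
INSERT INTO [dbo].[sales_fact_20001106w]
SELECT * FROM [dbo].[sales_fact_20001106d]
INSERT INTO [dbo].[sales_fact_20001106w]
SELECT * FROM [dbo].[sales_fact_20001107d]
-- ... repeat for the remaining days of the week, rebuild the
-- nonclustered indexes, revise the UNION ALL view, then:
DROP TABLE [dbo].[sales_fact_20001106d]
DROP TABLE [dbo].[sales_fact_20001107d]
```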
As data ages, it can be moved into weekly or even monthly partitions in this way, allowing more data to be kept online in the UNION ALL view.