Build Data Infrastructure for BI solutions


Understanding requirements

Like all IT projects, the best way to start an ETL project is to understand the overall requirements of the BI solution you want to build, and then decide how to use the data to best meet those requirements. The first article in this series provides the background for the BI solution. The subject is Adventure Works, a fictitious company that needs a BI solution. That article lists the analysis requirements by describing several questions the company wants to answer. From these questions, we can see that Adventure Works needs to understand its product sales from the following perspectives: the distribution channel (resellers or Internet), the rate of return, the change in product demand over time, and the difference between actual and forecast sales by product, salesperson, geographic region, and sales type. Answering these questions helps Adventure Works decide which distribution channel to focus on to increase profits, how to adjust the production process to best meet demand, and how to change the sales strategy to reach its sales targets.

After SQL Server Reporting Services (SSRS) is added to the BI solution, you can see how the data supports answers to these business questions. Before designing a data mart for Adventure Works from these requirements, I want to model the information requirements from a business perspective. In other words, the data mart design is driven by how users ask questions, not by how the data is obtained from the data sources. To use the sample code in this article, you must first download the SQL Server 2008 Adventure Works OLTP sample database.

Use a Dimensional Model

Dimensional modeling is commonly used to build data marts, because it suits database schemas intended for analysis. (Kimballgroup.com is an important resource for learning dimensional modeling.) A dimensional model presents data in a way that is familiar to business users and helps you build a data structure suited to querying large volumes of data. It does so by denormalizing the data. Denormalization lets the database engine select and aggregate large volumes of data quickly at query time.

I will use two types of tables in the denormalized schema of the Adventure Works solution: dimension tables and fact tables. Dimension tables store information about business entities and objects, such as vendors or products. Fact tables store the sales values to be aggregated. A fact table contains the measures and the keys that relate the fact table to the dimension tables. I will cover the fact table in more detail later.

Two types of schema can be used to implement the tables of a dimensional model: the star schema and the snowflake schema. Put simply, a star schema uses one table per dimension, so a query needs only a single join between each dimension and the fact table. A snowflake schema uses two or more tables per dimension, so a query needs multiple joins to see all the data. Because of these cascading joins, queries against a snowflake schema are usually slower than queries against a star schema. For this article, I will use a star schema to keep the design simple.
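
As a rough illustration of the star schema idea, the following Python sketch uses an in-memory SQLite database as a stand-in for SQL Server. The columns and sample rows are simplified assumptions, not the article's actual data mart script. The point is only that the dimension is a single table, so the aggregate query needs one join to reach it.

```python
import sqlite3

def build_star_schema(conn):
    """Create one dimension table and one fact table (simplified, assumed columns)."""
    conn.executescript("""
        CREATE TABLE DimProduct (
            ProductKey INTEGER PRIMARY KEY,   -- surrogate key
            ProductName TEXT NOT NULL
        );
        CREATE TABLE FactInternetSales (
            ProductKey INTEGER NOT NULL REFERENCES DimProduct(ProductKey),
            OrderQuantity INTEGER NOT NULL,
            SalesAmount REAL NOT NULL
        );
    """)
    # Made-up sample rows, for illustration only.
    conn.executemany("INSERT INTO DimProduct VALUES (?, ?)",
                     [(1, "Road Bike"), (2, "Helmet")])
    conn.executemany("INSERT INTO FactInternetSales VALUES (?, ?, ?)",
                     [(1, 2, 3000.0), (1, 1, 1500.0), (2, 4, 140.0)])

def sales_by_product(conn):
    """Aggregate the fact table with a single join to the product dimension."""
    return conn.execute("""
        SELECT p.ProductName, SUM(f.OrderQuantity), SUM(f.SalesAmount)
        FROM FactInternetSales AS f
        JOIN DimProduct AS p ON p.ProductKey = f.ProductKey
        GROUP BY p.ProductName
        ORDER BY p.ProductName
    """).fetchall()

star_conn = sqlite3.connect(":memory:")
build_star_schema(star_conn)
print(sales_by_product(star_conn))
```

In a snowflake variant, DimProduct would itself reference further tables (for example a category table), and the same query would need an extra join for each of them.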

Create a Bus Matrix

The focus of the Adventure Works BI solution is sales. To determine which dimensions relate to sales, I create a bus matrix, which is one step in the dimensional modeling process. Remember that Adventure Works has two sales channels: wholesale to resellers and retail through the Internet. I also use the bus matrix to determine whether each dimension relates to both sales channels or to only one of them. Figure 1 shows a sample bus matrix for Adventure Works sales.
Figure 1: Adventure Works sales Bus Matrix

The next step is to identify the measures for the solution. A measure is a value needed for analysis. Measures can come directly from the data source, such as a sales amount or a product cost, or they can be calculated, such as multiplying a quantity by an amount to get an extended sales amount. In addition, you need to decide which attributes to include in each dimension. An attribute is a single element of a dimension (corresponding to a column in a table), such as the country/region in the sales territory dimension or the year in the date dimension. You can use attributes to group or filter data as the analysis requires. This article does not describe every identified measure or dimension attribute in detail, but note that this identification step is necessary.
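
To make the distinction concrete, here is a small Python sketch of a calculated measure (quantity multiplied by unit price) grouped by one attribute (the year). The order lines are made-up values and the function names are hypothetical; none of them come from the Adventure Works database.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical order lines: (order year, quantity, unit price).
order_lines = [
    (2007, 2, Decimal("24.99")),
    (2007, 1, Decimal("120.00")),
    (2008, 3, Decimal("24.99")),
]

def extended_amount(quantity, unit_price):
    """Calculated measure: quantity multiplied by unit price."""
    return quantity * unit_price

def sales_by_year(lines):
    """Group the calculated measure by the year attribute."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for year, quantity, unit_price in lines:
        totals[year] += extended_amount(quantity, unit_price)
    return dict(totals)

print(sales_by_year(order_lines))
```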

Create a data mapping

Before creating physical tables in the data mart, I need to do some more planning. Specifically, I need to build a data mapping document that maps each target column in the data mart schema to a column in the Adventure Works OLTP source system (you can download and install the AdventureWorks2008 database according to the content in Stacia Misner's article on page 1). You can use a variety of applications to create a data mapping. The content matters more than the format. I usually develop data mappings in Microsoft Office Excel. Figure 2 shows the DimProduct tab that I created in the data mapping. I have also created data mappings for DimCustomer and FactInternetSales. Each worksheet in the workbook represents one table in the data mart, and each worksheet has only two columns: a source column and a target column.

Figure 2: DimProduct data mapping tab
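
The two-column worksheet idea translates naturally into a lookup structure. The sketch below is only an illustration: the column names in DIM_PRODUCT_MAPPING are assumptions rather than the contents of Figure 2, and apply_mapping is a hypothetical helper.

```python
# Source column -> target column, one dictionary per data mart table.
# Column names are illustrative assumptions, not the contents of Figure 2.
DIM_PRODUCT_MAPPING = {
    "ProductID": "ProductAlternateKey",
    "Name": "ProductName",
    "ListPrice": "ListPrice",
}

def apply_mapping(source_row, mapping):
    """Rename source columns to target columns, dropping unmapped columns."""
    return {target: source_row[source] for source, target in mapping.items()}

# Made-up source row; ModifiedDate is not mapped, so it is left behind.
sample_source_row = {
    "ProductID": 1,
    "Name": "Sample product",
    "ListPrice": 9.99,
    "ModifiedDate": "2008-01-01",
}
print(apply_mapping(sample_source_row, DIM_PRODUCT_MAPPING))
```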
Each dimension table (except the date dimension table) has a primary key called a surrogate key (usually an identity column). One advantage of using a surrogate key is that duplicate keys do not occur when you merge data from multiple systems.
Each dimension table also has an alternate key column, which holds the natural key, sometimes called the business key. The natural key identifies the record in the source system. For example, the CustomerAlternateKey column in the customer dimension maps to the AccountNumber field of the Sales.Customer table in the Adventure Works OLTP database. By storing these keys in the dimension table, I can match existing dimension records with the records extracted from the data source each time I run the ETL process for that dimension.
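
One possible way to picture the matching step is the Python sketch below. It keeps an in-memory stand-in for a dimension table, assigns a new surrogate key the first time a natural key is seen, and reuses that key on later runs. The DimensionTable class, its upsert method, and the sample account numbers are hypothetical and are not part of the article's SSIS packages.

```python
import itertools

class DimensionTable:
    """In-memory stand-in for a dimension table with a surrogate key."""

    def __init__(self):
        self._next_key = itertools.count(1)  # mimics an auto-incrementing column
        self.rows_by_natural_key = {}

    def upsert(self, natural_key, attributes):
        """Update the row if the natural key exists; otherwise insert a new row."""
        row = self.rows_by_natural_key.get(natural_key)
        if row is None:
            row = {"surrogate_key": next(self._next_key)}
            self.rows_by_natural_key[natural_key] = row
        row.update(attributes)
        return row["surrogate_key"]

dim_customer = DimensionTable()
first = dim_customer.upsert("ACCT-0001", {"city": "Seattle"})
second = dim_customer.upsert("ACCT-0002", {"city": "Toronto"})
again = dim_customer.upsert("ACCT-0001", {"city": "Portland"})  # a later ETL run
print(first, second, again)
```

The third call reuses the surrogate key issued by the first one, because the natural key is the same.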
Almost every data mart contains a date dimension, because business analysis often compares how measures change by day, week, month, quarter, or year. Because the date dimension is rarely loaded from a source system, it is best not to use a SQL Server identity column as its key. Instead, I use a smart key, stored in a SQL Server integer column in YYYYMMDD format. A smart key is a key generated by logic or a script rather than by auto-incrementing (as an identity column in SQL Server does).
Note that the date dimension usually does not map to a source table. Therefore, I use a script to generate the data so that records can be loaded into the table.
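
The article's own date script is T-SQL and ships with the completed solution. Purely as an illustration of the smart key logic, the Python sketch below generates one row per calendar day with an integer key in YYYYMMDD form. The attribute names (DateKey, FullDate, Year, Quarter, Month) are assumptions.

```python
from datetime import date, timedelta

def smart_date_key(d):
    """Return the date as an integer in YYYYMMDD form."""
    return d.year * 10000 + d.month * 100 + d.day

def generate_date_rows(start, end):
    """Yield one row per calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield {
            "DateKey": smart_date_key(current),
            "FullDate": current.isoformat(),
            "Year": current.year,
            "Quarter": (current.month - 1) // 3 + 1,
            "Month": current.month,
        }
        current += timedelta(days=1)

date_rows = list(generate_date_rows(date(2008, 12, 30), date(2009, 1, 2)))
print([row["DateKey"] for row in date_rows])
```

Because the key is derived from the date itself, a fact row can compute its date key directly from the order date without querying the dimension table.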
Because the ETL process required for my small schema is quite simple, this level of data mapping is sufficient. In a real project, I would add comments to the data mapping to indicate where complex transformations are required.

Build a data mart

After completing the logical modeling, I need to create the physical tables that the ETL process will load, along with the database that hosts them. I use basic T-SQL scripts to create the database and its dimension and fact tables. You can find the full T-SQL script in the download item that comes with the BI solution sample in the 2009 Code download.
For this article, I build only a subset of the full sales data mart schema, which is enough to cover the entire ETL process in SSIS. In my simplified schema, the Internet sales fact table contains only the OrderQuantity and SalesAmount measures. The schema also includes simplified customer, product, and date dimension tables.

Deploy the data mart

To deploy the data mart, I just execute the T-SQL script written earlier to create the new tables on the SQL Server instance. To run the T-SQL, I click Start \ All Programs \ Microsoft SQL Server 2008 \ SQL Server Management Studio to start SQL Server Management Studio (SSMS). After SSMS starts, I type the name of the SQL Server instance in the connection dialog box and connect using Windows Authentication. Then I use SSMS to open the TECHNET_AW2008SalesDataMart.sql file and execute the script.

ETL Development Process

The next step in building the BI solution is to design and develop the ETL process. To review: in general, the ETL process in a BI solution extracts data from flat files and operational OLTP databases, transforms the data to fit a dimensional model (for example, a star schema), and finally loads the resulting data into the data mart.

Create an SSIS project in BIDS


The first step in ETL development is to create a project in Business Intelligence Development Studio (BIDS). BIDS ships with SQL Server 2008. To install it, select the "workstation components" option during setup. BIDS contains project templates for SSIS, SSAS, and SSRS. Like Visual Studio, BIDS also supports source control integration.
To start BIDS, click Start \ All Programs \ Microsoft SQL Server 2008 \ Business Intelligence Development Studio, and then select File > New Project. You will see the New Project templates shown in Figure 3.

Figure 3: "New Project" template in BIDS 2008
In the Templates pane, select Integration Services Project, type ssis_TECHNET_AW2008 in the Name text box, and click OK. BIDS should now display an open SSIS project.

Create shared data connections

Another great feature in SSIS 2008 is the ability to create data source connections outside an individual package. You can define a data source connection once and then reference it in one or more SSIS packages in the solution. To learn more about creating a data source in BIDS, see "How to: Define a Data Source Using the Data Source Wizard (Analysis Services)".
Create two new data source connections: one for the TECHNET_AW2008SalesDataMart database and one for the AdventureWorks2008 OLTP database. Name them AW_DM.ds and AW_OLTP.ds, respectively.

Develop the dimension ETL
The ETL for the product dimension is very easy to build. I need to extract data from the Production.Product table of Adventure Works and load it into the TECHNET_AW2008SalesDataMart database. First, I rename the default package that BIDS created for my SSIS project. (A package stores all the steps of a workflow that SSIS executes.) Right-click the default package in Solution Explorer and select Rename. Type DIM_PRODUCT.dtsx and press Enter.
Next, I need to create package-level connection managers from the data sources created earlier. Create two connection managers, each referencing one of those data sources.

Define a data flow to extract and load

A Data Flow task in SSIS encapsulates everything needed to implement the ETL for a simple dimension. I just drag a Data Flow task from the Toolbox onto the Control Flow design surface and rename it EL (for extract and load). Right-click the Data Flow task in the designer and select Edit. BIDS displays the Data Flow designer.
The extract portion of the product dimension package needs to query the AdventureWorks2008 Production.Product table. To set this up, I drag an OLE DB Source component from the Toolbox onto the Data Flow design surface and rename it AW_OLTP.
Next, I define the load portion of the package, which loads the data into the data mart. I drag a new instance of the OLE DB Destination component onto the Data Flow design surface and rename it AW_DM. Then I click the OLE DB Source (AW_OLTP) component and drag the green arrow that appears on it to the AW_DM OLE DB Destination component to connect the two components.
At this point, I have added the required components to the data flow, but I still need to configure each component so that SSIS knows how to extract and load the data. Right-click the AW_DM OLE DB Destination component and select Edit. When the OLE DB Destination Editor opens, make sure that AW_DM is selected as the OLE DB connection manager. Then expand the table name drop-down list and select the dbo.DimProduct table. Finally, click the Mappings tab to confirm that the column mappings are correct, and click OK. Having a data mapping document to refer to makes this step much simpler, especially for large tables. The ETL package for the product dimension is now complete.
You can easily execute this package in BIDS. To test the product dimension package, open the package and press F5.
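
For readers who prefer to see the data flow as code, the sketch below mimics the source-to-destination step in Python. Two in-memory SQLite databases stand in for the AW_OLTP and AW_DM connections. The table layouts, the ProductAlternateKey and ProductName column names, and the sample rows are assumptions made for this sketch, not the article's schema.

```python
import sqlite3

# Two in-memory SQLite databases stand in for the AW_OLTP source and AW_DM target.
aw_oltp = sqlite3.connect(":memory:")
aw_dm = sqlite3.connect(":memory:")

aw_oltp.execute("CREATE TABLE Product (ProductID INTEGER, Name TEXT)")
aw_oltp.executemany("INSERT INTO Product VALUES (?, ?)",
                    [(1, "Sample product A"), (2, "Sample product B")])
aw_dm.execute("""
    CREATE TABLE DimProduct (
        ProductKey INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        ProductAlternateKey INTEGER,                   -- natural key from the source
        ProductName TEXT
    )
""")

def extract_and_load(source, target):
    """Read every source row and insert it into the dimension table."""
    rows = source.execute(
        "SELECT ProductID, Name FROM Product ORDER BY ProductID").fetchall()
    target.executemany(
        "INSERT INTO DimProduct (ProductAlternateKey, ProductName) VALUES (?, ?)",
        rows)
    target.commit()
    return len(rows)

loaded = extract_and_load(aw_oltp, aw_dm)
print(loaded, aw_dm.execute("SELECT * FROM DimProduct ORDER BY ProductKey").fetchall())
```

The column list in the INSERT statement plays the role of the Mappings tab: it states which source column feeds which destination column.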

Develop other packages
I create the customer dimension package the same way I created the product package. I will not repeat the steps here; try building this package on your own. Note that this package uses an XML data type column (Person.Person.Demographics) in the data source, which requires you to parse out the individual demographic attributes. To extract a single value from a SQL Server XML data type column, you can use XQuery in the value() method that is built into the XML data type. Name the finished package DIM_CUSTOMER.dtsx.
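
The package itself does this parsing with XQuery inside SQL Server. As a loose analogy only, the Python sketch below pulls single values out of a small XML document with the standard library. The namespace URI, the element names, and the sample document are assumptions modeled on an individual survey; check them against the actual Demographics column before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for the survey document stored in the Demographics column.
SURVEY_NS = "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/IndividualSurvey"

# Made-up sample document with a few assumed element names.
SAMPLE_DEMOGRAPHICS = f"""
<IndividualSurvey xmlns="{SURVEY_NS}">
  <MaritalStatus>M</MaritalStatus>
  <Gender>F</Gender>
  <TotalChildren>2</TotalChildren>
</IndividualSurvey>
"""

def single_value(root, name, convert=str):
    """Return one child element's text, converted, or None if it is absent."""
    element = root.find(f"{{{SURVEY_NS}}}{name}")
    if element is None or element.text is None:
        return None
    return convert(element.text)

def parse_demographics(xml_text):
    """Flatten selected survey elements into one row of dimension attributes."""
    root = ET.fromstring(xml_text.strip())
    return {
        "MaritalStatus": single_value(root, "MaritalStatus"),
        "Gender": single_value(root, "Gender"),
        "TotalChildren": single_value(root, "TotalChildren", int),
        "YearlyIncome": single_value(root, "YearlyIncome"),  # absent in the sample
    }

print(parse_demographics(SAMPLE_DEMOGRAPHICS))
```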
Developing an SSIS package for the date dimension is optional. Because this dimension typically has no source data, the easiest way to load it is with a basic T-SQL script. You can find the script in the completed solution.

Develop the Internet sales fact table package
The Internet sales fact table package queries all Internet sales and returns sales by product, customer, and date (that is, the order date). Unlike a dimension package, a fact table package requires an additional step before the data is loaded: looking up the surrogate keys and smart keys in the corresponding dimension tables. Create a new package and name it FACT_INTERNET_SALES.dtsx.
The extract portion of this package queries the AdventureWorks2008 OLTP database using the T-SQL code shown in Figure 4.

[Code listing: the T-SQL query text was lost in extraction. The surviving fragments show that it derives a date key from H.OrderDate, aggregates D.OrderQty and D.LineTotal, and joins on D.SalesOrderID, D.ProductID, and H.CustomerID.]

Figure 4 T-SQL code for Internet sales by product, date, and customer

Create a new data flow in the control flow of this package. Open the Data Flow designer and create an OLE DB Source component. Name the component AW_OLTP and use the query in Figure 4 as its source. This query aggregates (SUM) the OrderQuantity and SalesAmount measures from the Adventure Works sales tables.
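
The sketch below is not the article's query. It is a rough approximation of the kind of source query described here, run against an in-memory SQLite database so that it can be tried without SQL Server. The table layouts are cut down to the columns the query needs, the sample rows are made up, the OnlineOrderFlag filter is an assumption about how Internet orders are identified, and strftime is SQLite's date function rather than T-SQL.

```python
import sqlite3

fact_conn = sqlite3.connect(":memory:")
fact_conn.executescript("""
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, AccountNumber TEXT);
    CREATE TABLE Product (ProductID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE SalesOrderHeader (
        SalesOrderID INTEGER PRIMARY KEY, OrderDate TEXT,
        CustomerID INTEGER, OnlineOrderFlag INTEGER);
    CREATE TABLE SalesOrderDetail (
        SalesOrderID INTEGER, ProductID INTEGER, OrderQty INTEGER, LineTotal REAL);

    -- Made-up sample rows: two online orders on the same day, one offline order.
    INSERT INTO Customer VALUES (10, 'ACCT-0010');
    INSERT INTO Product VALUES (1, 'Sample product');
    INSERT INTO SalesOrderHeader VALUES (100, '2008-07-04', 10, 1),
                                        (101, '2008-07-04', 10, 1),
                                        (102, '2008-07-05', 10, 0);
    INSERT INTO SalesOrderDetail VALUES (100, 1, 2, 50.0),
                                        (101, 1, 1, 25.0),
                                        (102, 1, 5, 125.0);
""")

FACT_QUERY = """
    SELECT d.ProductID,
           CAST(strftime('%Y%m%d', h.OrderDate) AS INTEGER) AS OrderDateKey,
           c.AccountNumber,
           SUM(d.OrderQty)  AS OrderQuantity,
           SUM(d.LineTotal) AS SalesAmount
    FROM SalesOrderDetail AS d
    JOIN SalesOrderHeader AS h ON d.SalesOrderID = h.SalesOrderID
    JOIN Product          AS p ON d.ProductID = p.ProductID
    JOIN Customer         AS c ON h.CustomerID = c.CustomerID
    WHERE h.OnlineOrderFlag = 1          -- assumption: Internet orders only
    GROUP BY d.ProductID, OrderDateKey, c.AccountNumber
"""

fact_source_rows = fact_conn.execute(FACT_QUERY).fetchall()
print(fact_source_rows)
```

Each output row carries the two natural keys (ProductID and AccountNumber) plus a date key in YYYYMMDD form, which is the shape the later lookup step expects.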

Now you need to configure the lookups. Drag two new instances of the Lookup transformation from the Toolbox onto the Data Flow design surface and rename them Product and Customer. Configure the first instance (Product) to look up the ProductKey in the product dimension table by joining the dimension table's alternate key to the ProductID field coming from the AW_OLTP source query.

Configure the second instance (Customer) to look up the CustomerKey in the customer dimension table by joining the dimension table's alternate key to the AccountNumber field coming from the AW_OLTP source query.
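
Conceptually, each lookup swaps a natural key for the surrogate key stored in the dimension table. The Python sketch below shows that idea with plain dictionaries. The lookup_keys function and the sample values are hypothetical, and setting unmatched rows aside is a choice made for this sketch, not a statement about how SSIS handles them by default.

```python
def lookup_keys(fact_rows, product_keys, customer_keys):
    """Swap natural keys for surrogate keys; rows with no match are set aside."""
    matched, unmatched = [], []
    for row in fact_rows:
        product_key = product_keys.get(row["ProductID"])
        customer_key = customer_keys.get(row["AccountNumber"])
        if product_key is None or customer_key is None:
            unmatched.append(row)
            continue
        matched.append({
            "ProductKey": product_key,
            "CustomerKey": customer_key,
            "OrderDateKey": row["OrderDateKey"],  # smart key, already computed
            "OrderQuantity": row["OrderQuantity"],
            "SalesAmount": row["SalesAmount"],
        })
    return matched, unmatched

# Natural key -> surrogate key, as it might be read from the dimension tables.
product_keys = {1: 501}
customer_keys = {"ACCT-0010": 9001}

incoming_rows = [
    {"ProductID": 1, "AccountNumber": "ACCT-0010", "OrderDateKey": 20080704,
     "OrderQuantity": 3, "SalesAmount": 75.0},
    {"ProductID": 2, "AccountNumber": "ACCT-0010", "OrderDateKey": 20080704,
     "OrderQuantity": 1, "SalesAmount": 10.0},  # product 2 is not in the dimension
]

matched_rows, unmatched_rows = lookup_keys(incoming_rows, product_keys, customer_keys)
print(matched_rows)
print(len(unmatched_rows))
```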

Last step

The last step is to load the data into the FactInternetSales fact table, replacing the natural key of each dimension with the surrogate key found by the Lookup transformations. Drag a new instance of the OLE DB Destination component onto the design surface and name it AW_DM. Edit the OLE DB Destination component and select the AW_DM connection manager. Then select the dbo.FactInternetSales table and click the Mappings tab. Make sure that the mappings match those shown in Figure 5. Click OK to complete the package logic.

Figure 5 OLE DB Destination mappings for the Internet sales fact table

To test the Internet sales fact package, open the package in BIDS and press F5.

You are now familiar with dimensional modeling and with using SSIS to build ETL packages. In the third article in this series, you will learn how to use the populated data mart to create dimensions and cubes in an SSAS database. After creating a cube, you can develop an SSIS package that updates these objects in the SSAS database each time new data is added to the data mart. When a single query cannot meet a report's requirements, SSIS can even prepare the data shown in an SSRS report. As you can see, SSIS can do a lot to help you manage a BI solution, beyond ETL processing.

Address: http://technet.microsoft.com/en-ca/magazine/2009.08.bipartii.aspx
