How data product testing differs from testing other products:
Users make decisions based on accurate and timely data analysis;
Data is consolidated from many sources, and retrieval happens far more frequently than storage (writes);
Data needs to be timely and accurate;
Need to maintain a large amount of historical data;
Retrieval performance must be acceptable;
Security of the data.
Barriers to data product testing:
Performance issues, outdated data, functional issues, extensibility issues;
The business focuses on the end reports; much of the work is integration and implementation and lacks professional documentation, so important business logic is hidden in a complex architecture; the importance of white-box testing is ignored; test parameters are missing from the design phase; knowledge sharing is lacking and the process is immature, which ultimately leads to customer churn.
The volume and complexity of data are constantly changing;
Upstream data changes directly affect the integration process and require modifying the existing models and transformation logic;
Quality problems in the upstream data;
Real-time analysis requires timely data;
Recommendations for the testing process:
Requirements & Analysis (analyze the data sources), Design & Coding (acquire the data, implement the business logic and dimensional modeling, build and populate multi-dimensional datasets), QA & Deployment (reporting)
The process of designing tests
a) Requirements Review & Inspection:
- Validating the data required and the availability of the data sources it can be acquired from.
- Data profiling (a sketch follows this list):
- Understanding the data: this exercise helps the test team understand the nature of the data, which is critical when assessing design choices.
- Finding issues early: discovering data issues/anomalies early avoids late project surprises. Finding data problems early in the project considerably reduces the cost of fixing them late in the cycle.
- Identifying realistic boundary value conditions: current data trends can be used to determine minimum and maximum values for the important business fields, to come up with realistic and good test scenarios.
- Redundancy analysis identifies overlapping values between tables. Example: redundancy analysis could tell the analyst that the ZIP field in table A contained the same values as the Zip_Code field in table B 80% of the time.
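A minimal data-profiling sketch in Python with pandas; the extracts customers_a.csv / customers_b.csv and the ZIP / Zip_Code columns are hypothetical stand-ins for the example above:

```python
# Minimal profiling sketch (pandas). File and column names are hypothetical.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Basic per-column profile: null count, distinct count, min and max."""
    return pd.DataFrame({
        "nulls": df.isna().sum(),
        "distinct": df.nunique(),
        "min": df.min(),
        "max": df.max(),
    })

def overlap_ratio(left: pd.Series, right: pd.Series) -> float:
    """Share of distinct left values that also appear in right (redundancy check)."""
    left_vals, right_vals = set(left.dropna()), set(right.dropna())
    return len(left_vals & right_vals) / max(len(left_vals), 1)

customers_a = pd.read_csv("customers_a.csv")   # hypothetical source extracts
customers_b = pd.read_csv("customers_b.csv")

print(profile(customers_a))
# Flag if the ZIP field overlaps Zip_Code less than expected (the ~80% in the example).
print(f"ZIP overlap: {overlap_ratio(customers_a['ZIP'], customers_b['Zip_Code']):.0%}")
```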
- Data quality and performance acceptance criteria:
- Data quality attributes (completeness, accuracy, validity, consistency, etc.), e.g. a customer expects at least 90% data accuracy and 85% data consistency.
- Performance benchmarking & SLAs (service level agreements), e.g. a report should be rendered within the agreed maximum number of seconds.
- Validation of business transformation rules:
A realistic example is a rule to acquire only the last 5 years of product sales data from the States for a company. This rule should be captured while designing the system, since it does not make sense to acquire all the data if the customer only wants to see reports based on the last 5 years of data from the States. (A sketch of such a rule check follows.)
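A hedged sketch of validating that transformation rule on an extract; the file extracted_sales.csv and the sale_date / country columns are assumptions for illustration:

```python
# Check the assumed acquisition rule: only the last 5 years of US sales.
from datetime import date, timedelta
import pandas as pd

extracted = pd.read_csv("extracted_sales.csv", parse_dates=["sale_date"])  # hypothetical extract

cutoff = pd.Timestamp(date.today() - timedelta(days=5 * 365))
too_old = extracted[extracted["sale_date"] < cutoff]
non_us = extracted[extracted["country"] != "US"]

assert too_old.empty, f"{len(too_old)} rows violate the 5-year acquisition rule"
assert non_us.empty, f"{len(non_us)} rows fall outside the US-only rule"
```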
- Test Planning
Every time data moves, the results have to be tested against the expected results. For every ETL process, the test conditions for the data are defined before or during the design and development phase itself (a reconciliation sketch follows this list).
Key areas to focus on:
- Scope of testing: functional & non-functional requirements such as performance testing, security testing, etc.
- Testing techniques and testing types to be used.
- Test data preparation: sampling of data from the data sources, or data generation.
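A minimal reconciliation sketch for one ETL movement, as a test condition of the kind described above. SQLite stands in for the real source and warehouse, and the table/column names (src_sales, dw_sales, amount) are hypothetical:

```python
# Reconcile row counts and amount totals between source and target.
import sqlite3
import pandas as pd

src = sqlite3.connect("source_extract.db")      # hypothetical local copies
tgt = sqlite3.connect("warehouse_extract.db")

src_stats = pd.read_sql("SELECT COUNT(*) AS row_count, SUM(amount) AS total FROM src_sales", src)
tgt_stats = pd.read_sql("SELECT COUNT(*) AS row_count, SUM(amount) AS total FROM dw_sales", tgt)

assert int(src_stats["row_count"][0]) == int(tgt_stats["row_count"][0]), "row counts do not reconcile"
assert abs(src_stats["total"][0] - tgt_stats["total"][0]) < 0.01, "amount totals drift between source and target"
```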
b) Design & Code Review/Inspection
- Reviewing Data Dictionary
Verifying the metadata, which includes constraints such as nulls, default values, PKs, check constraints, referential integrity (PK-FK relationships), surrogate keys/natural keys, cardinality (1:1, m:n), etc. (a referential-integrity sketch follows).
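A hedged referential-integrity sketch (PK-FK): every foreign key in the fact table must exist in the dimension. The parquet files and the customer_id column are assumptions:

```python
# Orphan and NULL checks for a PK-FK relationship.
import pandas as pd

fact_sales = pd.read_parquet("fact_sales.parquet")       # hypothetical extracts
dim_customer = pd.read_parquet("dim_customer.parquet")

orphans = fact_sales[~fact_sales["customer_id"].isin(dim_customer["customer_id"])]
null_keys = fact_sales["customer_id"].isna().sum()

assert orphans.empty, f"{len(orphans)} fact rows reference a customer missing from the dimension"
assert null_keys == 0, f"{null_keys} fact rows have a NULL customer_id"
```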
- Validating Source to Target Mapping (STM)
Ensuring traceability from the data sources through staging, the data warehouse, and the data marts (a mapping-check sketch follows).
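A small sketch of checking an STM: the mapping is kept here as a plain dict (real projects usually hold it in a spreadsheet or metadata repository), and all source/target names are illustrative:

```python
# Verify that every mapped target column actually exists in the loaded fact table.
import pandas as pd

stm = {
    # source column          -> target column in the warehouse
    "erp.orders.order_dt":      "dw.fact_sales.order_date",
    "erp.orders.cust_no":       "dw.fact_sales.customer_id",
    "erp.customers.cust_name":  "dw.dim_customer.customer_name",
}

fact_sales = pd.read_parquet("fact_sales.parquet")   # hypothetical target extract

missing = [tgt for tgt in stm.values()
           if tgt.startswith("dw.fact_sales.") and tgt.split(".")[-1] not in fact_sales.columns]
assert not missing, f"mapped target columns not found in fact_sales: {missing}"
```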
- Validation & Selection of Data Model (dimensional vs. normalized)
- Dimensional Modelling:
The dimensional approach enables a relational database to emulate the analytical functionality of a multidimensional database and makes the data warehouse easier for the user to understand and use. Retrieval of data from the data warehouse also tends to be very fast. In the dimensional approach, transaction data is partitioned into either "facts" or "dimensions".
For example, a sales transaction can be broken up into facts, such as the number of products ordered and the price paid for the products, and dimensions, such as the order date, customer name, product number, order ship-to and bill-to locations, and the salesperson responsible for receiving the order.
- Star Schema:
- Dimension tables have a simple primary key, while fact tables have a compound primary key consisting of the relevant dimension keys.
- Another reason for using a star schema is its simplicity from the user's point of view: queries are never complex, because the only joins and conditions involve a fact table and a single level of dimension tables, without the indirect dependencies on other tables that are possible in a more normalized snowflake schema (see the query sketch below).
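A sketch of the kind of simple query a star schema enables: the fact table joins directly to its dimensions with no indirect hops. SQLite is used only to keep the example self-contained, and all table names are illustrative:

```python
# One fact table, one level of dimension joins: the typical star-schema query shape.
import sqlite3

conn = sqlite3.connect("dw_sample.db")   # hypothetical warehouse sample
query = """
SELECT d.calendar_month,
       c.customer_segment,
       SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d     ON f.date_key = d.date_key
JOIN dim_customer c ON f.customer_key = c.customer_key
GROUP BY d.calendar_month, c.customer_segment;
"""
for row in conn.execute(query):
    print(row)
```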
- Snowflake Schema:
- The snowflake schema is a variation of the star schema, featuring normalization of the dimension tables.
- Closely related to the star schema, the snowflake schema is represented by centralized fact tables connected to multiple dimensions.
- Normalized approach:
In the normalized approach, the data in the data warehouse is stored following database normalization rules. Tables are grouped together by subject areas that reflect general data categories (e.g., data on customers, products, finance, etc.). The main advantage of this approach is that it is straightforward to add information to the database.
- Validation of BI/DW Architecture:
Ensuring that the design is scalable, robust, and meets the requirements. Choosing the best approach for designing the system:
- Bottom-up:
Data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data. These data marts can eventually be used together to create a comprehensive data warehouse.
- Top-down:
The data warehouse is defined as a centralized repository for the entire enterprise, and this approach designs the warehouse using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, is stored in the data warehouse.
- Archival/Purge Strategy
Deciding on the appropriate archival and purge policy based on the business needs, e.g. maintaining a data history of the last 5 years.
- Error logging/exception handling/recoverability
Ensuring appropriate data failure tracking & prevention (schema changes, source unavailability, etc.), as well as the ability to resume from the point of failure.
- Parallel Execution & Precedence
Data warehousing procedures can subdivide an ETL process into smaller pieces running sequentially or in parallel in a specific order. The chosen path has a direct impact on the performance and scalability of the system.
- ETL Pull Logic–full/incremental (a.k.a. Delta pull)
The entire dataset can be pulled from the source every time, or only the delta since the last run can be considered, to reduce the network movement of huge amounts of data on each run (a delta-pull sketch follows).
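A hedged sketch of incremental (delta) pull logic: pull only the rows whose updated_at is newer than the watermark recorded by the previous run. SQLite stands in for the source system; the watermark file and the orders table are assumptions:

```python
# Delta pull driven by a persisted watermark.
import json
import sqlite3
from pathlib import Path
import pandas as pd

WATERMARK = Path("last_run_watermark.json")
source = sqlite3.connect("source_extract.db")   # hypothetical source copy

last_run = "1970-01-01 00:00:00"
if WATERMARK.exists():
    last_run = json.loads(WATERMARK.read_text())["updated_at"]

delta = pd.read_sql(
    "SELECT * FROM orders WHERE updated_at > ? ORDER BY updated_at",
    source, params=[last_run],
)

if not delta.empty:
    # ...load `delta` into the staging area here...
    WATERMARK.write_text(json.dumps({"updated_at": str(delta["updated_at"].max())}))
```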
c) BI/DW Testing
- Test Data Preparation
- Test Data Selection
Identifying a subset of production data to be used as test data (ensuring that customers' confidential data is not used for such purposes). The selection can be made on the following parameters:
- On a percentage, fixed-number, or time basis, etc.
- Generate new test data from scratch
- Identify the source tables, the constraints and dependencies
- Understand the range of possible values for various fields (Include boundary values)
- Use data generation tools to generate data that respects the above rules (a generation sketch follows).
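A small rule-driven generation sketch using only the standard library; the field ranges (amounts 0 to 10,000, dates within the last 5 years) and boundary rows are illustrative assumptions:

```python
# Generate synthetic orders, then append explicit boundary-value rows.
import csv
import random
from datetime import date, timedelta

random.seed(42)  # reproducible test data

def random_order(order_id: int) -> dict:
    days_back = random.randint(0, 5 * 365)
    return {
        "order_id": order_id,
        "order_date": (date.today() - timedelta(days=days_back)).isoformat(),
        "order_amount": round(random.uniform(0.0, 10_000.0), 2),
        "country": random.choice(["US", "CA", "GB"]),
    }

rows = [random_order(i) for i in range(1, 999)]
# Boundary values for the amount field (assumed min and max of the valid range).
rows.append({"order_id": 999, "order_date": date.today().isoformat(), "order_amount": 0.0, "country": "US"})
rows.append({"order_id": 1000, "order_date": date.today().isoformat(), "order_amount": 10_000.0, "country": "US"})

with open("generated_orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```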
- Test case Design & execution
- ETL Testing:
- Validate the data extraction logic
- Validate the data transformation logic (including testing of the dimensional model: facts, dimensions, views, etc.)
- Validate the data loading
- Some data warehouses may overwrite existing information with cumulative, updated data every week, while other DWs (or even other parts of the same DW) may add new data in an incremental form, for example hourly.
- Data Validation
- Test the end-to-end data flow from source to mart to reports (including calculation logic and business rules)
- Data Quality Validation
- Check for accuracy, completeness (missing data, invalid data), and inconsistencies against the agreed thresholds (a sketch follows).
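A hedged sketch of checking data-quality acceptance criteria, reusing the thresholds from the requirements example earlier (at least 90% accuracy, 85% consistency; the 95% completeness figure and all file/column names are assumptions):

```python
# Compare the loaded dimension against a trusted source snapshot.
import pandas as pd

loaded = pd.read_parquet("dim_customer.parquet")         # data as loaded into the DW
reference = pd.read_parquet("source_customers.parquet")  # trusted source snapshot

merged = loaded.merge(reference, on="customer_id", suffixes=("_dw", "_src"))

completeness = 1 - loaded["customer_name"].isna().mean()
accuracy = (merged["customer_name_dw"] == merged["customer_name_src"]).mean()
consistency = (merged["country_dw"].str.upper() == merged["country_src"].str.upper()).mean()

assert completeness >= 0.95, f"completeness {completeness:.1%} below the assumed threshold"
assert accuracy >= 0.90, f"accuracy {accuracy:.1%} below the agreed 90%"
assert consistency >= 0.85, f"consistency {consistency:.1%} below the agreed 85%"
```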
- OLAP & Cube Testing:
- Check whether the data from the data warehouse/data mart is mapped and modelled correctly in the OLAP cube or reports.
- Validate all the measures and measure groups (including derived measures and aggregations).
- Validate all the dimensions (including slowly changing dimensions), attribute hierarchies, etc.
- Many OLAP tools provide on-the-fly computation features and support for customized SQL, both of which can be prone to error.
- Reports Testing (Drill-Down/Drill-Through)
- Verification of the layout format per the design mock-up, the style sheets, the prompts and filters, and the attributes and metrics on the report.
- Verification of the drilling, sorting, and export functions of the reports in the web environment.
- Verification of reports containing derived metrics (special focus should be paid to any subtotals or aggregates).
- Reports with "non-aggregate-able" metrics (e.g., inventory on hand) also need special attention on the subtotal row; it should not, for example, add up the inventory for each week and show that as the inventory for the month (see the sketch after this list).
- The test team should target the lowest granularity present in the data warehouse if it is higher than the transaction grain in the OLTP system.
- Understand each report and the linkage of every field displayed in the report with the star schema, and trace its origin back to the source system.
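A hedged sketch of the non-aggregate-able metric check above: the monthly inventory subtotal on the report should equal the closing (last week's) inventory, not the sum of the weekly values. The figures are made up for illustration:

```python
# Month subtotal of a non-additive metric must be the closing value, not a SUM.
import pandas as pd

weekly = pd.DataFrame({
    "month":     ["2024-01"] * 4,
    "week":      [1, 2, 3, 4],
    "inventory": [120, 110, 105, 98],   # on-hand inventory at each week's end
})

report_month_subtotal = 98              # value shown on the report under test

expected = weekly.sort_values("week").groupby("month")["inventory"].last().iloc[0]
naive_sum = weekly["inventory"].sum()   # what a SUM subtotal would wrongly show

assert report_month_subtotal == expected, "monthly inventory subtotal must be the closing value"
assert report_month_subtotal != naive_sum, "subtotal appears to be summing a non-additive metric"
```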
Adventures with testing BI/DW application