Introduction
The backend relational database of the OLTP system is used to store different types of data. Theoretically, the values of each column in the database have specific meanings, data should also be normalized before being stored in the database, such as the "Age" column, used to store the age of the person. The data type is set to int. The value stored in the database is 2000. Although there seems to be no problem, such "noisy" data will cause serious distortion in the data analysis process in combination with business rules, for example, the average age of a large scale is increased. In a real OLTP system, this type of data that should not exist often exists in large quantities for various reasons. Data similar to this type will always be before the data enters the data warehouse, perform cleaning in the pre-processing phase.
The simple example above is just a violation of business rules, but the actual situation varies according to the specific business. It is not a simple and unified process, the text lists some typical data cleansing scenarios and implementation methods in SSIs.
Why not use SQL statements for processing?
It is feasible to use SQL statements to query and handle such problems, but the use of SQL statements has its own limitations, such:
- What if the data source is not a relational database?
- If the business logic is very complex and complicated SQL statements are required, how should I handle debugging and logs during the processing?
SSIS can effectively solve the above problems. SSIS provides graph-based data streams and the ability to debug, log, and view converted data during processing, the SSIS package is more suitable for Team jobs than SQL.
Use SSIS for data cleansing
Below are some typical data cleanup jobs that use SSIS for processing.
Default Value missing
Missing default values are the most common condition that requires data cleansing. Missing default values refer to "null" in the database column. in data analysis, if this type is null, the data must be replaced with a value that complies with the business logic. For example, if null appears in the sales region column, it can be replaced with "other ", in the sales data column, you can replace it with "unknown.
In SSIs, we can use the "derived column" task to add default values to columns with missing default values. In this example, we add the default values to the person in the adventureworks sample database. replace the addressline2 column of the address database. The null value in this column is replaced with "no additional address", as shown in 1.
Figure 1. Replace null with a derived Column
In addition to replacing the "null" value with constants, we can also fill in the null value based on the data from other data sources according to the business rules. In the example in Figure 2, some enddate columns in the productcosthistory table are null, search for the modifieddate column of the product table to complete the values in the enddate column with null.
Figure 2. Use conditional splitting to determine whether the value of the enddate column is null
Semantic Conversion
Sometimes, the values of some columns need to be converted in semantics. For example, if the "Origin" column of a product is 5, it is hard to know what 5 represents during analysis, therefore, you can search for the corresponding table or file to replace this 5. For example, 5 represents Hebei. In SSIs, this type of conversion can be directly searched in the table as shown in figure 2, or a cache can be created for the search table (5 represents the table in Hebei), and then referenced for search through the cache.
To implement semantic conversion through caching, we first set two steps. The first step is to import the data in the query table to the cache, and the second step is to search for the cache to implement semantic conversion.
Figure 3 shows the overall chart.
Figure 3. semantic Conversion
First, import the data in the query table to the cache, as shown in figure 4.
Figure 4. Import Cache
Then update the data through the cache, as shown in Figure 5.
Figure 5. Use cache for data search
Data Type Conversion
When data comes from different data sources, an error may be reported if the data types of different data sources are incompatible. In SSIs, you can use "derived columns" to convert data. derived columns allow you to write custom formulas to implement complex conversions, you can also use a data conversion task to convert data types to ensure data compatibility between different data sources. 6.
Figure 6. Data Type Conversion
Data Truncation
In some cases, data may be truncated due to data length restrictions during transmission. The solution in SSIS is to import the truncated or erroneous data to an intermediate table for further processing, 7.
Figure 7. Data truncation error output to another source
As shown in oledb source configuration 8.
Figure 8. oledb data source settings
An error occurred while searching for matching.
In some cases, when data from multiple data sources is integrated, the data in the same semantics may be stored in different ways. For example, the value stored in some data sources is "Beijing ", other data sources are stored in "Beijing ". The solution to this type of problem is the "fuzzy search" task. 9.
Figure 9. Fuzzy search for Data
In Figure 9, we perform a fuzzy search for the data, set matching for matching with a matching degree greater than 0.5, and not for matching with a matching degree less than 0.5, and output them to different data sources respectively. This helps unify data.
Violation of business rules
When designing a data warehouse, it is important to specify the type of values in the column. For example, whether the value is within the business scope, or whether the combination of the data in the column and its related data is valid (for example, the end log cannot be smaller than the start log ). This type of data should be cleaned up before it is imported into the target table. There are many cleaning methods in SSIs, for example, you can use a conditional split task to remove or process data that does not comply with business specifications, or import the stage table. This depends on the business type.
Summary
This article briefly introduces the concept of data cleansing and implements some common data cleansing scenarios. Compared with other data cleansing methods, SSIS provides great flexibility and performance.
Use SSIS for data cleansing