Microsoft BI's SSIS Series - Detecting Data Source Data with the Data Profiling Task (Data Probing)


Many of us may never have really used the Data Profiling Task in SSIS, so what this control is for may not be well understood. Put it another way: suppose we need to analyze some of the data in a database table, for example to gather statistics on the actual length of the data in each column and the range of those lengths, the proportion of non-null values in each column, the number of rows in the table, duplicate values, and so on. Without this kind of data-source profiling, it is hard to know how to obtain this information quickly. Write SQL statements? That process is time-consuming and labor-intensive (a sketch of what the manual approach looks like follows at the end of this walkthrough). In fact, SSIS 2012 provides a control, the Data Profiling Task, that can do these things for us and analyze the source data quickly. It gives us a fast picture of the kind of data source we are dealing with, which is very helpful to BI developers during the actual ETL process. In addition, we can send the results of this analysis directly from SSIS to the database administrator, or to the data analysis and management staff, so that they understand the data more clearly and can spot anomalies in the data source earlier.

Initial experience with the Data Profiling Task

Create a new package. The test source can be any database; this time I will use the AdventureWorksLT2012 database. Note: an ADO.NET connection manager must be used here; this is a particular limitation, and only SQL Server is supported as the source. The account also needs read and write permissions on the tempdb database, because the task performs a large amount of calculation and aggregation there.

Drag a Data Profiling Task into the control flow and double-click to edit it. The first thing to do is specify an XML file that will hold the results of the Data Profiling Task's analysis of the specified data source. OverwriteDestination specifies that the previous old file is overwritten on each run; another approach is to generate a different file based on the execution date, as mentioned in many previous examples.

Select Quick Profile, choose the ADO.NET connection manager created earlier, and select the table or view. Here we first configure the Product table; once this configuration is complete, the same can be done for all subsequent tables.

When the above configuration is complete, the editor jumps to Profile Requests, where you need to review each profile and, for each table, check all columns (*) or a specific column as needed. Each profile type is configured separately; that is, different tables and different columns can be configured or left unconfigured, and by default the Product table is configured. The experience here is not very good: if I need to add a new table, I have to create a new profile request by clicking the blank area below Profile Type. We continue by adding a null-ratio profile for the AccountNumber column of the SalesOrderHeader table.

After saving, run the package and open the generated XML file. Obviously this XML file by itself does not show you much; the correct way to open it is with the Data Profile Viewer tool that ships with SQL Server 2012. After opening the Data Profile Viewer and loading the XML file the package produced, you can see the analysis results graphically.
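As promised above, this is roughly the kind of per-column query you would otherwise have to write by hand to get the same figures the task produces automatically. It is only a hedged sketch: the SalesLT.Product table and Color column are taken from the AdventureWorksLT2012 sample, and the query would need to be repeated (or generated dynamically) for every column of interest.

```sql
-- Manual, per-column profiling in T-SQL (the tedium the Data Profiling Task removes).
-- Assumes the SalesLT schema of the AdventureWorksLT2012 sample database.
SELECT
    COUNT(*)                                        AS TotalRows,
    SUM(CASE WHEN Color IS NULL THEN 1 ELSE 0 END)  AS NullCount,
    COUNT(DISTINCT Color)                           AS DistinctValues,
    MIN(LEN(Color))                                 AS MinLength,
    MAX(LEN(Color))                                 AS MaxLength
FROM SalesLT.Product;
```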
For example, in the Product table the profiled column has 50 NULL values, which accounts for 16.94% of the total number of rows. Verifying directly against the table shows that the result is consistent with the analysis above.

Here is a more detailed explanation of each of the profile types, starting with the single-column profiles.

Column Null Ratio Profile (null ratio statistics) is generally used to check the percentage of NULL values in a data column. If the analysis of a column shows that the null ratio is relatively high, perhaps higher than expected, the data may be abnormal: too many values are missing. In a customer relationship management system, for instance, mobile phone number, e-mail, and address are very important contact fields. Of these three, the mobile phone number should have the lowest percentage of NULLs, with address and e-mail next; a very small proportion of NULL phone numbers is normal. If the null ratio of the phone number, a value important for analysis, is very high, that indicates there is some problem with the source data and we need to pay attention. It may be that the business operation is not standardized, or that something went wrong when the data source prepared the data.

For example, for the columns Color, Size, and Weight, the specific NULL ratio of each can be obtained through the Data Profiling Task. You can see that in this data source the NULL ratios of Weight, Size, and Color run from high to low: Weight has 97 NULL values, which accounts for 32.88% of the rows. Whether this is reasonable has to be decided by the business people; in other words, when we design the data warehouse that will be the basis for data analysis, this information should be understood beforehand.

Column Length Distribution Profile (column length distribution statistics) deduplicates the lengths of the data in a column and reports, for each distinct length, the total number of rows and the proportion. This profile helps in two ways. First, it shows whether the length defined in the business data source is reasonable; for example, the longest value that has appeared in 5-10 years of history may be 35 characters while the column is defined with a length of 255. This becomes the basis for defining column lengths when we extract the data, because we know that no value in history is longer than 35. Second, it exposes exception data; for example, an identity card number is normally about 18 digits, so the chart makes it very easy to see whether there is abnormal data, such as a large number of values longer or shorter than 18 digits. In the example, the minimum length in the Name column is 5, the maximum is 32, and values with a length of 23 account for the highest proportion. The handling of spaces in length detection is also convenient: IgnoreLeadingSpaces controls whether leading spaces are ignored, and IgnoreTrailingSpaces controls whether trailing spaces are ignored.

Column Statistics Profile (column statistics) is very meaningful, especially for numeric and date columns, where you can see at a glance the maximum, minimum, mean, and standard deviation (for date/time columns only the minimum and maximum). These statistics show whether there are anomalies in the data, for example a negative minimum or maximum price, or a deviation from the mean that is outrageously large.
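To double-check figures like these outside the viewer, a couple of verification queries will reproduce the null ratios and the Name length distribution. This is a hedged sketch that again assumes the SalesLT schema of AdventureWorksLT2012:

```sql
-- Null counts and ratios for Color, Size and Weight in SalesLT.Product
-- (97 NULL Weight values out of 295 rows is approximately 32.88%).
SELECT
    COUNT(*)                                          AS TotalRows,
    SUM(CASE WHEN Color  IS NULL THEN 1 ELSE 0 END)   AS ColorNulls,
    SUM(CASE WHEN Size   IS NULL THEN 1 ELSE 0 END)   AS SizeNulls,
    SUM(CASE WHEN Weight IS NULL THEN 1 ELSE 0 END)   AS WeightNulls,
    CAST(100.0 * SUM(CASE WHEN Weight IS NULL THEN 1 ELSE 0 END)
               / COUNT(*) AS DECIMAL(5, 2))           AS WeightNullRatioPct
FROM SalesLT.Product;

-- Length distribution of the Name column: one row per distinct length,
-- with its share of the table (what the Column Length Distribution Profile reports).
SELECT
    LEN(Name) AS NameLength,
    COUNT(*)  AS RowCnt,
    CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS DECIMAL(5, 2)) AS Pct
FROM SalesLT.Product
GROUP BY LEN(Name)
ORDER BY NameLength;
```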
As for the mean and standard deviation, refer to a statistics reference for how the average and deviation are calculated.

Column Value Distribution Profile (column value distribution statistics) shows the distribution of a column's values after deduplication. On the one hand, this statistic helps identify which columns in the source data can be used for classification and which have a cardinality so large that they are not suitable for it. On the other hand, it also helps us judge the quality of the data in the column. For example, if a value distribution of China's provinces turns up hundreds of distinct values, the data is clearly problematic, and that is very easy to see here. You can see very clearly the distribution of values in individual columns: Color, for example, is well suited as an attribute of a dimension for analyzing the data in the fact table, while a column like ProductNumber has a cardinality too large to be used to analyze fact data. In the Color column you can also see at a glance which color of product is the most numerous and what proportion each color occupies.

Column Pattern Profile (regular expression distribution statistics) provides regular-expression statistics for string data, such as the regular expression for e-mail addresses; if you see other expressions appear, it may indicate that the earlier validation was wrong. The analysis also shows which expression occupies the larger proportion; the higher that proportion, the more likely that pattern is the normal one, and you can consider routing all validation along it. The remaining patterns either reflect legitimate changes in the business, or should be treated as abnormal data and handled when the data is extracted. For example, for the product Size the expression \d\d covers 84% of the data, which means most Size values are numeric, while some values such as M and L are character type. Data like this can be discovered before it is extracted to the data warehouse, so there is time to think about how this small amount of irregular data should be handled.

Multi-column and table-level profile types

Candidate Key Profile (candidate primary key detection) examines the selected columns to see whether they have the characteristics of a primary key, or only approximate one. For example, we may believe a column should be the primary key, but the final statistics show that, because duplicate values exist, it is only an approximate key, so be careful when choosing it. In the example, three columns contain no duplicates and are 100% key-like, so these columns can be considered primary key candidates. Of course, this analysis is only a reference for the final choice; we also need to consider factors such as the field's type and length, preferring integer data where possible and, if a string must be used, choosing the column with the shorter length. The reported strength is not always exactly 100% in line with reality, and it depends on how the Candidate Key Profile is displayed and configured: for example, when Specified is chosen you can list all the columns (*), and a non-duplication rate above 95% is treated as 100%; when None is chosen you need to select a fixed column, or a combination of two columns, yourself.
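The same checks can be spot-checked with plain queries; the sketch below again assumes the SalesLT.Product table and only illustrates the idea, not the task's internal algorithm:

```sql
-- Value distribution of Color: distinct values, row counts, and each value's share
-- of the table (what the Column Value Distribution Profile shows graphically).
SELECT
    Color,
    COUNT(*) AS RowCnt,
    CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS DECIMAL(5, 2)) AS Pct
FROM SalesLT.Product
GROUP BY Color
ORDER BY RowCnt DESC;

-- A rough equivalent of the \d\d pattern on Size: how many non-null values are two digits.
SELECT
    SUM(CASE WHEN Size LIKE '[0-9][0-9]' THEN 1 ELSE 0 END) AS TwoDigitSizes,
    COUNT(Size)                                             AS NonNullSizes
FROM SalesLT.Product;

-- Candidate-key strength of a column: the closer the distinct count is to the row
-- count, the closer the column is to an exact (rather than approximate) key.
SELECT
    COUNT(*)                                    AS TotalRows,
    COUNT(DISTINCT ProductNumber)               AS DistinctValues,
    CAST(100.0 * COUNT(DISTINCT ProductNumber)
               / COUNT(*) AS DECIMAL(5, 2))     AS KeyStrengthPct
FROM SalesLT.Product;
```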
Functional Dependency Profile (functional dependency statistics). To understand what this statistic does, consider sales data that contains countries, provinces, and cities: a city must belong to a province, and a province must belong to a country. If we find in the data that Hangzhou belongs to Zhejiang in one record but to the province of Jiangsu in another, the data is clearly problematic. Functional dependency detection measures this kind of dependency: for example, if among 1,000 records containing Hangzhou there are 999 in which Hangzhou corresponds to the province of Zhejiang and 1 in which it does not, then the dependency of the province on Hangzhou reaches 99.9%. In the example here, the degree to which CompanyName determines SalesPerson reaches 99.76%. In other words, as long as you know the CompanyName, there is a 99.76% chance of being able to determine who the SalesPerson is, just as knowing the city is Hangzhou gives some probability of knowing the province; of course, the determination between a city and its province ought to reach 100%. The Friendly Bike Shop has 4 rows, two of which determine that the SalesPerson is adventure-works\david8 and two of which do not, so the dependency rate for that value is 50%, while over all the data the overall dependency rate is 99.76%.

Value Inclusion Profile (value inclusion statistics) is used mainly to calculate the overlap of values between two columns, or between two combinations of columns. This statistic can determine, for example, whether a column is suitable for establishing a foreign key reference relationship between two tables. From another point of view, if a column of this table already references a column of another table, forming a theoretical foreign key association, the value inclusion statistic can confirm whether that association is correct, for example whether every value of a column in table B comes from a column of table A, or whether probing finds values in table B that are not found in table A. Because there is no particularly good example here, and tables that already have a foreign key relationship will naturally show no problem, we will simply demonstrate the use of the Value Inclusion Profile: look at how the ProductCategoryID values in the Product table are contained in the ProductCategoryID values of the ProductCategory table. As expected, the final result is essentially a 100% inclusion rate.

Summary

In a BI system the most important link is ETL, and the design and development of ETL are closely tied to the data source. Especially for data sources that are not well maintained, data quality will to a large extent affect the design of the ETL. Therefore, understanding the data source and analyzing its data plays a very important role in the later ETL work and in the design of the tables. We can use the Data Profiling Task to gain a general understanding of our data source, and the analysis methods described above can basically satisfy the majority of data detection needs. This is especially true for projects in which the application system and the BI system are both being upgraded: if the application system itself is not sound, the quality of its data may also be a big problem, which brings a lot of negative impact to our BI design.
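Both of these multi-column checks also have straightforward query equivalents. The sketch below assumes the SalesLT.Customer, SalesLT.Product, and SalesLT.ProductCategory tables of AdventureWorksLT2012 (the article does not name the table holding CompanyName and SalesPerson, so SalesLT.Customer is an assumption) and is meant only to illustrate the idea:

```sql
-- Approximate functional dependency strength of CompanyName -> SalesPerson:
-- for every CompanyName, count the rows that agree with its most frequent
-- SalesPerson, then divide by the total number of rows.
WITH PerPair AS (
    SELECT CompanyName, SalesPerson, COUNT(*) AS Cnt
    FROM SalesLT.Customer
    GROUP BY CompanyName, SalesPerson
),
PerCompany AS (
    SELECT CompanyName, MAX(Cnt) AS AgreeingRows, SUM(Cnt) AS AllRows
    FROM PerPair
    GROUP BY CompanyName
)
SELECT CAST(100.0 * SUM(AgreeingRows) / SUM(AllRows) AS DECIMAL(5, 2)) AS DependencyStrengthPct
FROM PerCompany;

-- Value inclusion check: ProductCategoryID values in Product that do not exist
-- in ProductCategory. An empty result means a 100% inclusion rate.
SELECT DISTINCT p.ProductCategoryID
FROM SalesLT.Product AS p
LEFT JOIN SalesLT.ProductCategory AS c
       ON p.ProductCategoryID = c.ProductCategoryID
WHERE c.ProductCategoryID IS NULL
  AND p.ProductCategoryID IS NOT NULL;
```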
Therefore, in some ETL designs we can even use SSIS to send this detection report periodically by e-mail to the BI system maintenance staff, doing periodic preventive inspection of the source data as far as possible and avoiding in advance the unintended consequences caused by irregular data.

One last thing worth doing: it is not practical to keep reading this XML by hand, because historical results cannot be presented centrally that way. The best approach is to parse the XML file, save the parsed results to a database, and present them in the form of a report. That way you can see not only the current results but also compare them with all previous historical results, turning data source detection into data source monitoring management; this is also part of BI monitoring management, and that part will not be elaborated in detail here.
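As one possible shape for that archive, here is a hedged T-SQL sketch: it stores each run's XML in a history table and shows one example of shredding a value back out for reporting. The file path, table name, namespace URI, and element names are all assumptions made for illustration and should be checked against the actual output of the Data Profiling Task before use.

```sql
-- History table for profiling runs (names are illustrative assumptions).
CREATE TABLE dbo.DataProfileHistory (
    ProfileId  INT IDENTITY(1, 1) PRIMARY KEY,
    LoadedAt   DATETIME2 NOT NULL DEFAULT SYSDATETIME(),
    ProfileXml XML       NOT NULL
);

-- Archive one run of the task's XML output (adjust the file path).
INSERT INTO dbo.DataProfileHistory (ProfileXml)
SELECT CAST(BulkColumn AS XML)
FROM OPENROWSET(BULK N'C:\ProfileOutput\ProductProfile.xml', SINGLE_BLOB) AS src;

-- Example of shredding one value out of the stored XML for a report. The
-- namespace URI and element path are assumptions about the DataProfile output
-- schema and may need adjusting to match your file.
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2008/DataDebugger/')
SELECT
    h.LoadedAt,
    p.node.value('(NullCount)[1]', 'INT') AS NullCount
FROM dbo.DataProfileHistory AS h
CROSS APPLY h.ProfileXml.nodes('//ColumnNullRatioProfile') AS p(node);
```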

For more BI articles, see the BI series essay list (SSIS, SSRS, SSAS, MDX, SQL Server). If you feel this article has helped you, please recommend it so that others can quickly find these articles in the Biwork blog recommendation bar.
