Website Data Analysis: The Premise of Analysis - Data Quality 3

Source: Internet
Author: User


The first two articles, The Premise of Analysis: Data Quality 1 and The Premise of Analysis: Data Quality 2, introduced the statistical information provided by data profiling and the use of data auditing to evaluate data quality problems from three aspects: completeness, accuracy, and consistency. This article covers the last piece: data correcting.

Data auditing helps us find the problems in the data, and many of these problems can then be corrected, improving the overall quality of the data. Data correcting is the task of making those fixes, and it can be approached from the following aspects.

Filling Missing Values

The easiest remedy for missing records is to retrieve the data from somewhere upstream. In general, lost statistical indicator data can be recalculated from the raw data, and lost raw data can be recovered from the extraction source or from backups. If the raw data is completely lost, there is basically nothing that can be done.
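As a minimal sketch of recomputing a lost aggregate from raw data, assuming a pandas access-log DataFrame whose file name and timestamp/visitor_id columns are hypothetical:

```python
import pandas as pd

# Hypothetical raw access log; the file name and columns are assumptions.
log = pd.read_csv("access_log.csv", parse_dates=["timestamp"])

# Recompute the lost daily statistical indicators from the raw records.
daily = log.groupby(log["timestamp"].dt.date).agg(
    visits=("visitor_id", "size"),
    unique_visitors=("visitor_id", "nunique"),
)
```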

For missing field values, many books on data processing introduce statistical methods of repair, which in essence predict or estimate the missing value: using the mean, the mode, the average of the preceding and following values, or fitting the trend with regression analysis and forecasting the missing point. These methods apply when the missing value cannot be retrieved or recalculated in any other way, and they work only when the metric follows a stable pattern; for example, when one day's value is lost, it can be estimated from the values of the surrounding days.

In web analytics, however, if the underlying log has a missing field, it is hard to predict the specific value, because the details of that visit leave almost no other trace. The easiest approach is therefore to discard the record, since keeping records with missing fields would obviously distort the calculation of some statistical metrics.

This direct filtering of records with missing values should only be applied to access logs and similar data that does not need to be perfectly accurate. For a site's operational or transactional data, where complete and accurate calculation must be guaranteed, records must never be discarded directly. Filtering missing or abnormal access-log records also needs a statistical basis: the general rule of thumb is that if records with missing or abnormal values in non-critical fields account for less than 1% (or, at most, 5%) of the total, they can be filtered out; if the ratio is higher, there is probably a problem with the logging itself that needs further troubleshooting.
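A minimal sketch of both approaches in pandas, with hypothetical values and column names:

```python
import pandas as pd

# Daily metric with one lost value; estimate it from its neighbors
# (the before-and-after average, which linear interpolation generalizes).
daily_visits = pd.Series(
    [1020, 998, None, 1005, 987],
    index=pd.date_range("2012-03-01", periods=5),
)
daily_visits = daily_visits.interpolate()  # fills the gap with (998 + 1005) / 2

# Raw access log: filter records missing a non-critical field,
# but only after checking the missing ratio against a threshold.
log = pd.DataFrame({"url": ["/a", "/b", None, "/c"], "referrer": ["x", None, "y", "z"]})
missing_ratio = log["referrer"].isna().mean()
if missing_ratio < 0.01:  # rule of thumb from the text: below 1% is safe to drop
    log = log.dropna(subset=["referrer"])
else:
    print(f"Missing ratio {missing_ratio:.1%} is too high; check the logging first.")
```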

Deleting Duplicate Records

The values of certain fields in a dataset must be unique, for example the date field in a daily metrics table or the user ID in a user information table. Uniqueness can be enforced by setting unique constraints on the database, but during ETL processing the constraints are sometimes ignored at first so that the loading process is not interrupted by constraint violations (the load can take a long time or be costly to redo, and ETL needs enough fault tolerance to keep the whole process from breaking); after the whole ETL process has finished, the fields that must be unique are then deduplicated.

These duplicate records can be found by the consistency audit in data profiling, which compares the number of unique values against the total number of records. The easiest fix is to keep one of the duplicate records and delete the others. This has to suit the actual situation, though; sometimes the right method is to aggregate the duplicates, adding their statistics together.
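A minimal sketch of the two fixes, assuming a hypothetical daily metrics table keyed by date:

```python
import pandas as pd

daily = pd.DataFrame({
    "date": ["2012-03-01", "2012-03-02", "2012-03-02"],
    "visits": [1020, 600, 410],
})

# Fix 1: keep only one record per key and drop the rest.
deduped = daily.drop_duplicates(subset=["date"], keep="first")

# Fix 2: when the duplicates are partial loads of the same day,
# aggregate them instead of discarding data.
aggregated = daily.groupby("date", as_index=False)["visits"].sum()
```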

Converting Inconsistent Records

Data conversion is the most common step in data warehouse extraction. Because of the "integration" characteristic of a data warehouse, data from multiple sources must be stored together, and different sources often encode the same field in different ways. Take the user ID: for the same user, system A's ID might be U1001, system B's 1001, and system C's 100100. The IDs from the three systems need to be unified, for example by removing the U prefix from system A's IDs and dividing system C's IDs by 100, then importing everything into the database in system B's encoding. Even records from the same log can be inconsistent: I once encountered a product whose early versions recorded the mobile operating system as "android", which an update later changed to "Android"; the old and new versions of the log were written together, and the data had to be converted. Inconsistent records like these undoubtedly increase the processing cost of ETL.
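A minimal sketch of the ID unification described above; the function name is hypothetical, and the rules are taken directly from the example:

```python
def unify_user_id(raw_id: str, system: str) -> int:
    """Convert a user ID from system A or C into system B's encoding."""
    if system == "A":       # A stores IDs like "U1001": strip the prefix.
        return int(raw_id.lstrip("U"))
    if system == "C":       # C stores IDs like "100100": divide by 100.
        return int(raw_id) // 100
    return int(raw_id)      # B is already the target encoding.

# The same user from all three systems maps to one ID.
assert unify_user_id("U1001", "A") == unify_user_id("100100", "C") == 1001
```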

The conversion rules in the examples above are relatively simple; ETL processing in a data warehouse can run into some truly perverse rules, and then the most critical thing is to be familiar enough with how each data source records its data, so that the data entering the warehouse is guaranteed to be consistent. The best practice is for the data warehouse engineers and the developers of the front-end systems to agree in advance on a unified set of data recording and encoding conventions, which reduces later coordination, communication, and conversion costs.

Handling Exception Data

Exception data is in most cases difficult to correct, for example garbled text caused by character-encoding problems, truncated strings, and abnormal numbers. If these anomalies follow no pattern, the data is almost impossible to restore and can only be filtered out directly.
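A minimal sketch of such filtering, assuming a hypothetical os column that should contain plain ASCII names:

```python
import pandas as pd

log = pd.DataFrame({"os": ["Android", "iPhone", "\ufffd\ufffd\ufffd", "Symbian"]})

# Drop records whose value contains characters outside the expected range;
# with no pattern to reverse the mojibake, filtering is the only option.
clean = log[log["os"].str.fullmatch(r"[\x20-\x7E]+")]
```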

Some exception data can be restored. When useless characters are mixed into a value, substring methods can extract the original string, and a TRIM function can remove leading and trailing whitespace. If a string has been truncated but the original complete value can be deduced from what remains, it can also be restored: mobile operating system records generally include Symbian, Android, iPhone, BlackBerry, and so on, so if some records read "And", they can be restored to "Android", because no other mobile operating system, once truncated, could leave such a record.

Values that are abnormally large or small in numeric records can be analyzed to see whether the deviation comes from a unit difference: grams and kilograms differ by a factor of 1000, US dollars and RMB differ by the exchange rate, time records may differ by time zone, a percentage may be stored either as a decimal below 1 or already multiplied by 100, and so on. These numeric anomalies can be handled by conversion (a unit difference can also be treated as an inconsistency in the data). Some values are simply magnified or shrunk by mistake, such as an extra 0 appended to a number, which likewise produces anomalous data.
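A minimal sketch of these repairs, assuming a fixed list of known OS names and hypothetical numeric data:

```python
import pandas as pd

KNOWN_OS = ["Symbian", "Android", "iPhone", "BlackBerry"]

def restore_truncated(value: str) -> str:
    """Restore a truncated OS name if exactly one known value matches its prefix."""
    value = value.strip()  # the TRIM step: drop surrounding whitespace
    matches = [os for os in KNOWN_OS if os.lower().startswith(value.lower())]
    return matches[0] if len(matches) == 1 else value

assert restore_truncated(" And ") == "Android"

# Unit repair: a weight column where some values were recorded in grams
# instead of kilograms, i.e. magnified by a factor of 1000.
weights = pd.Series([72.5, 68.0, 70500.0])      # the last value is 1000x too large
weights = weights.where(weights < 1000, weights / 1000)
```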

Finally, to sum up the prerequisites for correcting data: 1) the data quality problem can be discovered through the auditing process; 2) the problem must be traceable, meaning it can be retrieved from source data, predicted from a trend, or converted through some rule. If neither holds, the exception data can only be deleted directly, but before filtering, the proportion of abnormal records must be evaluated; when the ratio is too high, review whether there is a problem with how the original data is recorded.

This article is published under an attribution agreement; when reprinting, please indicate the source: Website Data Analysis, "The Premise of Analysis: Data Quality 3".
