The premise of website data analysis

Data quality is the foundation of valid and accurate analysis conclusions, and their most important prerequisite and guarantee. Data quality assurance (QA) is an important part of data warehouse architecture and a key component of ETL.

We usually filter out dirty data through data cleansing to ensure the validity and accuracy of the underlying data. Cleaning is generally done before the data enters the data warehouse: once data is in the warehouse, it has to be reliable, because the statistical aggregations above treat that batch as the base dataset and perform no further checks or filtering. Using a stable base dataset also guarantees that all higher-level and multidimensional aggregation results are strictly consistent. In practice, however, when we build a data warehouse we usually do not put every cleaning step before loading; part of the cleaning work is carried out after loading, mainly because the warehouse has its own processing strengths and some cleaning tasks are simpler and more efficient inside it. As long as the cleaning happens before the data is aggregated and summarized, we can still guarantee that what remains in the warehouse, and what we ultimately use, is "clean" base data.
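As a rough illustration of that ordering, here is a minimal Python sketch; the file name, column names, and filter rules are hypothetical, purely to show that cleaning runs after loading but before any aggregation touches the base dataset:

    import pandas as pd

    # Hypothetical staging data already loaded into the warehouse layer.
    raw = pd.read_csv("visits_staging.csv")

    # In-warehouse cleaning step: drop records missing a key,
    # then drop records with impossible visit durations.
    clean = raw.dropna(subset=["session_id"])
    clean = clean[clean["duration_sec"].between(0, 86400)]

    # Aggregation happens only on the cleaned base dataset, so every
    # higher-level and multidimensional aggregate stays strictly consistent.
    daily_visits = clean.groupby("visit_date")["session_id"].nunique()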

A while ago I discussed data quality assurance with some colleagues. I had also dealt with this topic in earlier data warehouse work, so I am taking the opportunity to organize it systematically here. When building a data warehouse on Oracle before, I used Oracle's warehouse builder, OWB (Oracle Warehouse Builder), which provides a fairly complete workflow for ensuring data quality, consisting mainly of three parts:

1. Data Profiling

2. Data Auditing

3. Data Correcting

Data Profiling

Data Profiling is a term for which I have not yet found a really fitting translation; Oracle renders it as "statistical profiling", but that does not quite capture the flavor of "profiling". Anyone who has watched the American TV series Criminal Minds knows that the FBI's Behavioral Analysis Unit (BAU) profiles the criminal in every episode, analyzing the offender's background, behavior patterns, and mental state, so profiling is above all an analytical process. Wikipedia's definition of data profiling reads:

Data profiling is the process of examining the data available in an existing data source and collecting statistics and information about that data.

Here we can see that data profiling requires a process of collecting statistics (in Criminal Minds, that is Garcia's job), so how do we obtain statistics about our data?

Anyone familiar with databases knows that a database analyzes each table: on the one hand so the optimizer can choose an appropriate execution plan, and on the other hand so that some queries, such as COUNT(*), can be answered directly from the collected statistics. That is already a simple form of data profiling. Oracle's warehouse building tool OWB goes further and provides comprehensive statistics for every field of each configured table, including:

the number of records, maximum, minimum, maximum length, minimum length, number of unique values, number of nulls, and the mean and median. OWB also reports a Six Sigma value from 1 to 6; the higher the value, the better the data quality, and a Six Sigma value of 7 can be taken to mean the data quality is near perfect. For each field, the statistics additionally give the frequency distribution of its distinct values, which is very useful for discovering outliers, as described in detail later.
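If an ETL tool like OWB is not at hand, these per-field statistics are straightforward to compute ourselves. Here is a minimal sketch in Python with pandas; the library choice and the example column are my own assumptions, not part of OWB:

    import pandas as pd

    def profile_column(series: pd.Series) -> dict:
        """Collect OWB-style profile statistics for a single column."""
        stats = {
            "records": len(series),
            "nulls":   int(series.isna().sum()),
            "uniques": int(series.nunique(dropna=True)),
        }
        if pd.api.types.is_numeric_dtype(series):
            stats.update({"min": series.min(), "max": series.max(),
                          "mean": series.mean(), "median": series.median()})
        else:
            lengths = series.dropna().astype(str).str.len()
            stats.update({"min_length": int(lengths.min()),
                          "max_length": int(lengths.max())})
        # Frequency distribution of distinct values, handy for spotting outliers.
        stats["value_freq"] = series.value_counts().head(10).to_dict()
        return stats

    # Hypothetical usage: profile one field of a visits table.
    # profile_column(visits["user_age"])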

Looking at these profile statistics, we naturally think of descriptive statistics, which uses summary statistics to describe the characteristics of a dataset or sample. Even without an ETL tool like OWB, we can apply basic statistical knowledge to do simple profiling of the data. Here it is worth mentioning a very useful chart: the box plot, also called a box-and-whisker diagram. We can try using a box plot to show the distribution characteristics of the data:

Box plots come in many variants; the figure above shows a common one. The top and bottom edges of the rectangular box are the upper quartile (75%, Q3) and the lower quartile (25%, Q1) of the dataset, and the horizontal line in the middle is the median (50%, Q2); some box plots also mark the mean with a "+". Two lines extend from the top and bottom of the box, and their ends (the "whiskers") are usually 1.5 IQR away from the box, where IQR = Q3 - Q1, the height of the box. So the upper whisker sits at Q3 + 1.5 IQR and the lower whisker at Q1 - 1.5 IQR. If the minimum of the dataset is greater than Q1 - 1.5 IQR, the minimum replaces it as the end of the lower whisker; likewise, if the maximum is less than Q3 + 1.5 IQR, the maximum is used as the end of the upper whisker. Values outside the range from Q1 - 1.5 IQR to Q3 + 1.5 IQR are called outliers and are plotted individually, which is what the points above the upper whisker in the figure are. Alternatively, we sometimes use the dataset's standard deviation σ and take a range of ±3σ, or use a confidence interval at the 95% confidence level, to determine the whisker ends.
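To make this concrete, here is a minimal Python sketch (numpy and matplotlib, with made-up sample data) that computes the quartiles and the 1.5 IQR whisker bounds and flags outliers the same way a box plot does:

    import numpy as np
    import matplotlib.pyplot as plt

    # Made-up sample: daily page views with one abnormal spike and one dip.
    data = np.array([112, 98, 105, 120, 99, 103, 117, 95, 108, 104, 310, 7])

    q1, q2, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1                                  # box height (Q3 - Q1)
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker bounds

    outliers = data[(data < lower) | (data > upper)]
    print("Q1=%.1f median=%.1f Q3=%.1f bounds=(%.1f, %.1f)"
          % (q1, q2, q3, lower, upper))
    print("outliers:", outliers)   # here, the spike at 310 and the dip at 7

    plt.boxplot(data)              # draws the same quartiles and whiskers
    plt.title("Box plot of daily page views")
    plt.show()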

In fact, a box plot does not show the full picture of a dataset, but by presenting a few key statistics graphically, it lets us see the overall distribution and dispersion of the data.

Now that data profiling gives us these statistics, how do we use them to audit data quality, find the possible anomalies and problems in the data, and correct, that is, clean, the data effectively so that we end up with "clean" data? That is the topic of the next article.
