Big data: Good friends with data quality?

If you want to bring together, on an Apache Hadoop cluster, datasets that cannot coexist in a single conventional database, and you expect to build a unified view across them, you may be in for a rude awakening. Quality issues are not uncommon when you start working with sources of information that have not been fully utilized in the past.

When you explore underutilized data, quality problems can turn into a rat's nest, and trying to anticipate every unpredictable problem up front is largely wasted effort. For example, a few years ago we launched a complex system-availability predictive analytics project and found that the system data supplied as a reference was highly prone to change and quite different from the characteristics described in the specification; the "standards" turned out to be little more than "recommendations". In such cases you need to trace back to where the core system data is generated and resolve the quality issues there. This is a fairly common situation because, by definition, when you work with underutilized sources of information, those sources are probably being used rigorously for the first time.

When you combine structured data with large volumes of new unstructured sources, the complexity of the problem will almost certainly rise to a new level, and in terms of systems of record the issue is rarely managed properly. In fact, when dealing with unstructured information, which is the most important new big data source, you should expect the data to be fuzzy, contradictory, and cluttered. More and more big data sources deliver non-transactional data (events, geospatial, behavioral, clickstream, social, sensor, and so on), and fuzziness, distortion, and noisy clutter are inherent features of such data. It is a good idea to establish formal standards and shared processing methods for this kind of data in a single system, as sketched below.
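To make the "single system, shared methods" idea concrete, here is a minimal sketch, assuming a hypothetical clickstream record layout (the field names and the `standardize_event` routine are illustrative, not from the article): every pipeline calls the same cleansing rule, so "clean" means the same thing everywhere.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical clickstream record layout; real non-transactional sources will differ.
REQUIRED_FIELDS = {"user_id", "event_type", "timestamp"}

def standardize_event(raw: dict) -> Optional[dict]:
    """Shared cleansing rule: return a normalized record, or None if it is unusable."""
    if not REQUIRED_FIELDS.issubset(raw):
        return None  # quarantine records missing mandatory fields
    try:
        ts = datetime.fromisoformat(str(raw["timestamp"])).astimezone(timezone.utc)
    except ValueError:
        return None  # garbled timestamps are expected in fuzzy, noisy data
    return {
        "user_id": str(raw["user_id"]).strip(),
        "event_type": str(raw["event_type"]).strip().lower(),
        "timestamp": ts.isoformat(),
    }

# Every downstream job applies the same routine instead of inventing its own rules.
events = [
    {"user_id": " 42 ", "event_type": "Click", "timestamp": "2023-05-01T12:00:00+00:00"},
    {"user_id": "43", "event_type": "view"},  # missing timestamp -> dropped
]
print([e for e in map(standardize_event, events) if e is not None])
```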

Big data may have more quality problems simply because there is more data

Discussions of big data usually mention volume, velocity, and variety. Those same traits mean you are likely to find far more bad data records than you would in a small dataset.

However, this is purely a consequence of dataset size; it does not mean the probability of quality problems is higher. In both absolute and operational terms, a 1% data fidelity problem across 1 billion records is much worse than a 1% problem across 1 million records, even though the overall ratio is unchanged and the impact on the analytic results is the same. Data cleanup may take far more effort in the former case, but as we said earlier, that is really a workload scaling issue, and big data platforms are very good at handling such problems.
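As a back-of-the-envelope illustration (the numbers and the `clean` rule below are hypothetical, not from the article's project), a fixed 1% error rate means ten thousand bad records at one million rows but ten million at one billion, and record-level cleanup is a per-record map that parallelizes well:

```python
from multiprocessing import Pool

ERROR_RATE = 0.01  # the same 1% fidelity problem at both scales

for total in (1_000_000, 1_000_000_000):
    print(f"{total:>13,} records -> {int(total * ERROR_RATE):>12,} bad records at 1%")

# Cleanup is a per-record operation, so the extra volume is a scaling problem,
# not a harder analytical problem: it can be farmed out to parallel workers.
def clean(record: str) -> str:
    return record.strip().lower()  # stand-in for a real record-level repair rule

if __name__ == "__main__":
    sample = ["  Widget-A ", "WIDGET-B", "  widget-c"]
    with Pool(processes=2) as pool:
        print(pool.map(clean, sample))
```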

Interestingly, big data is well suited to fixing a data quality issue that has long dogged the statistical analysis world: traditional methods build models on training samples rather than on the full set of data records. This point is important but receives too little attention. For a long time, the scalability limits of analytic data platforms forced modelers to give up record-level granularity in order to speed up model construction, execution, and scoring. When you cannot push all of the data through, you may ignore outlier records entirely, and the risk that they distort the analysis slips through the cracks.

When you cheerfully filter out sparse or outlier records, the result is not so much a data quality problem (the source data and the records in the sample may be completely correct and up to date) as a downstream data loss problem, but the consequences are no less real. Simply put, the noise risk carried by the entire dataset is smaller than the distortion or compression/artifact risk introduced by erroneous or restricted samples. We are not saying that sampling is bad; but when you have the option of removing the constraints that keep you from using all of the data, you should generally take it.
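A minimal sketch of the point in the last two paragraphs, using a made-up, heavily skewed dataset: a small random sample can easily contain almost none of the rare high-value records, so a tail statistic estimated from the sample is badly distorted even though every sampled record is individually "clean".

```python
import random

random.seed(7)

# Made-up population: 999,000 ordinary transactions plus 1,000 rare large ones.
population = [random.uniform(10, 100) for _ in range(999_000)]
population += [random.uniform(50_000, 100_000) for _ in range(1_000)]

full_max = max(population)
full_p999 = sorted(population)[int(len(population) * 0.999)]

# A 0.1% training sample, the kind of thing scalability limits used to force.
sample = random.sample(population, k=1_000)
rare_in_sample = sum(v > 10_000 for v in sample)

print(f"rare records captured by the sample: {rare_in_sample} of 1,000")
print(f"sample max {max(sample):,.0f} vs full-data max {full_max:,.0f}")
print(f"full-data 99.9th percentile: {full_p999:,.0f}")
```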

We are not saying that any of this is easy. Consider a specific case that causes confusion in the social listening arena. When you are dealing with general discussion about a topic, managing noisy or erroneous data is easy: given the sheer amount of activity, outliers can usually be treated as just that, outliers. But, as the name suggests, you need to listen to the customer. Data comes from all directions, so you may believe (though you should verify through sensitivity analysis) that missing or corrupted data will not lead to misleading conclusions. When, however, you are judging what a specific customer is saying and then deciding how to respond to that customer, the problems caused by missing or corrupted data are amplified. The fault may or may not lie in the system used to run the analysis, but in essence this poses a greater challenge: you need to understand the impact of data errors and design for it. We will return to this topic in later columns.
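To show what "verify through sensitivity analysis" might look like, here is a hedged sketch with made-up mentions and sentiment labels (the customer IDs and the 5% loss rate are assumptions for illustration): randomly dropping records barely moves the aggregate positive share, but the same loss rate can erase the only record tying a complaint to a specific customer.

```python
import random

random.seed(1)

# Made-up social mentions: (customer_id, sentiment); not real data.
mentions = [(f"cust{i % 2_000}", random.choice(["pos", "pos", "neg"]))
            for i in range(100_000)]
mentions.append(("cust_upset", "neg"))  # a single, specific complaint

def positive_share(records):
    return sum(1 for _, s in records if s == "pos") / len(records)

# Sensitivity check: drop 5% of records at random and compare the aggregate.
kept = [m for m in mentions if random.random() > 0.05]
print(f"aggregate positive share: full {positive_share(mentions):.3f} "
      f"vs degraded {positive_share(kept):.3f}")

# The per-customer view is far more fragile: one lost record, one lost customer.
print("cust_upset still visible after loss:",
      any(cid == "cust_upset" for cid, _ in kept))
```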

Big data can be a good friend of data quality, or at least an innocent bystander when the quality problems originate elsewhere. Do you agree?
