Big data: Good friends with data quality? Transactional sources

Source: Internet
Author: User
Keywords: big data, transactional sources, data issues

Many people labor under the misconception that there is an intrinsic trade-off between the volume of data an organization maintains and the quality of that data. The issue comes up frequently; it recently surfaced for Tom as a major topic of discussion at panels hosted by the Financial Services Information Sharing and Analysis Center (FS-ISAC) and other local groups.

According to this mindset, you cannot scale to the petabyte level without filling your Apache Hadoop clusters, massively parallel data warehouses, and other nodes with inconsistent, inaccurate, redundant, obsolete, or otherwise dubious junk data. We disagree with that view. Here is why we think the notion is too simplistic for the real situation.

Big data is not the source of most data problems; transactional systems are

In most organizations, data quality issues can usually be traced back to the source transaction systems, whether a customer relationship management (CRM) system, an accounting application, or some other program. These systems typically operate at the terabyte scale.

In that discussion, Jim rightly points out that any IT administrator who fails to keep the systems of record clean, consistent, and universally agreed upon has already lost half the battle. Of course, you can (to a certain extent) fix the problem downstream by aggregating, matching, merging, and cleansing the data in a staging database. But quality problems are closely tied to a lack of control over the transactional data sources; they have little to do with the absolute number of those sources.
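
To make that downstream work concrete, here is a minimal sketch in Python of the aggregate-match-merge-cleanse step against a staging set; the field names (name, email, phone, updated_at) and the survivorship rule are hypothetical, and a real deployment would rely on a dedicated tool rather than hand-written rules like these.

from collections import defaultdict

def standardize(record):
    # Normalize the fields we match on (hypothetical schema: name, email, phone).
    return {
        "name": record["name"].strip().lower(),
        "email": record["email"].strip().lower(),
        "phone": "".join(ch for ch in record["phone"] if ch.isdigit()),
    }

def match_and_merge(staging_records):
    # Group records that share an email address, then merge each group,
    # letting the most recently updated value win for every field.
    groups = defaultdict(list)
    for rec in staging_records:
        clean = standardize(rec)
        groups[clean["email"]].append({**rec, **clean})

    merged = []
    for recs in groups.values():
        recs.sort(key=lambda r: r["updated_at"])   # oldest first
        survivor = {}
        for rec in recs:                           # later records overwrite earlier ones
            survivor.update({k: v for k, v in rec.items() if v})
        merged.append(survivor)
    return merged

staging = [
    {"name": "Ann Lee ", "email": "ANN@example.com", "phone": "555-0100", "updated_at": 1},
    {"name": "Ann Lee", "email": "ann@example.com", "phone": "(555) 0100", "updated_at": 2},
]
print(match_and_merge(staging))   # one merged record instead of two duplicates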

By deploying IBM® InfoSphere® QualityStage® in a massively parallel configuration (or implementing comparable functionality on IBM BigInsights™), you can scale the data cleansing effort downstream of the problem sources, but you cannot "cure" the disease there; the fault lies with the source systems, not with big data itself.

Big data now aggregates new types of data sources that have never been cleansed before

In traditional data warehousing, the data quality problem is well understood (even if it remains a challenge), but the focus there is on maintaining the core systems of record: customers, finance, human resources, supply chain, and so on. What about the big data space?

Many big data programs drill into aggregated data sources such as social marketing intelligence, real-time sensor feeds, data pulled from external providers, browser clickstream sessions, IT system logs, and more. Historically, these sources were never linked to the official reference data of the transactional systems. They never needed to be cleansed, because the specialist teams that worked on them, often offline, tended to examine these issues in isolation and did not record the results back into the official systems of record. Cross-information-type analysis, which is common in big data environments, has changed that.
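
That change is easier to see in code: cross-type analysis only pays off if the new sources can be tied back to reference data, and any record that cannot be linked is itself a quality finding. A minimal sketch, assuming a hypothetical CRM lookup table and clickstream event layout:

# Hypothetical example: tie raw clickstream events back to CRM reference data.
# Events whose customer key has no match surface immediately as a quality gap.
crm_customers = {
    "C-1001": {"name": "Ann Lee", "segment": "retail"},
    "C-1002": {"name": "Bo Chan", "segment": "wholesale"},
}

clickstream = [
    {"customer_id": "C-1001", "page": "/checkout", "ts": "2013-04-01T10:02:00"},
    {"customer_id": "C-9999", "page": "/support", "ts": "2013-04-01T10:05:00"},   # unknown key
]

linked, orphaned = [], []
for event in clickstream:
    ref = crm_customers.get(event["customer_id"])
    (linked if ref else orphaned).append({**event, "customer": ref})

print(f"{len(linked)} linked events, {len(orphaned)} with no reference match")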

While an individual data point may have only marginal value in isolation, stitched together the pieces can be worth a great deal: they provide context for problems that have occurred (or are about to occur).

Unlike business reference data, these new sources rarely deliver data that must be loaded directly into the enterprise data warehouse or an offline archive, or retained for e-discovery. Instead, you dig into them to extract key patterns, trends, and root causes; once your core tactical goals are met, most of the raw data can be discarded. That usually takes a great deal of digging, slicing, and dicing.
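
As a minimal sketch of that dig-extract-discard pattern, assuming hypothetical IT system log records: slice out the errors, dice by service and cause, keep the short summary, and let the bulk of the raw events go.

from collections import Counter

# Hypothetical raw log events; in practice these arrive at far larger volume.
raw_events = [
    {"service": "payments", "level": "ERROR", "cause": "timeout"},
    {"service": "payments", "level": "ERROR", "cause": "timeout"},
    {"service": "search", "level": "ERROR", "cause": "bad-query"},
    {"service": "payments", "level": "INFO", "cause": None},
]

errors = [e for e in raw_events if e["level"] == "ERROR"]           # slice
root_causes = Counter((e["service"], e["cause"]) for e in errors)   # dice

summary = root_causes.most_common(3)   # the insight worth keeping
print(summary)                         # [(('payments', 'timeout'), 2), (('search', 'bad-query'), 1)]

del raw_events   # once the tactical goal is met, the bulk can be discarded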

In this situation, data quality concerns show up in two forms. First, you cannot lose the sources, actors, participants, or actions that must remain consistent with how the rest of your data is defined. Second, you cannot discard the lineage of how the transactions were handled: the who, what, when, where, and how of discovery and reproduction.
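
One way to keep that second requirement honest is to carry a small lineage record with every processing step. A minimal sketch, with hypothetical fields, of what such a record might hold:

from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LineageRecord:
    # Hypothetical lineage entry: who touched the data, what was done,
    # when and where it happened, and how it can be reproduced.
    who: str    # operator or pipeline identity
    what: str   # operation applied (e.g. "dedupe", "sessionize")
    where: str  # system or cluster the step ran on
    how: str    # script or rule-set version, for reproducibility
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

lineage = [
    LineageRecord(who="etl-batch", what="dedupe customer records",
                  where="staging-db", how="match_and_merge v1.3"),
]
print(lineage[0])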

As John McPherson, a colleague at IBM Research, put it: "Remember, much of the time when we talk about big data, we are talking about data that wasn't put to good use in the past, so we're usually trying to solve different problems. We're not trying to determine store profitability; we should already be doing that with high-quality data from the systems of record, with everything standardized and conformed in the data warehouse." In John's example, the job is instead to find the factors that would improve a store's profitability.

This discussion continues in part 2. In the meantime, please share your experience maintaining data quality at big data scale in the comments.
