Big data: Good friends with data quality? Transactional sources
Source: Internet
Author: User
Keywords: big data, transactional sources, data issues
Many people have the misconception that there is an intrinsic trade-off between the volume of data an organization maintains and the quality of that data. This issue comes up frequently; it recently became a major topic in Tom's discussions with the Financial Services Information Sharing and Analysis Center (FS-ISAC) and other industry panels.
Based on this mindset, you supposedly cannot scale to the petabyte level without filling your Apache Hadoop clusters, massively parallel data warehouses, and other nodes with inconsistent, inaccurate, redundant, obsolete, or dubious junk data. We disagree with that view. Here is why we think this notion oversimplifies the actual situation.
Big data is not the source of most data quality problems; transactional systems are
Data quality issues in most organizations can usually be traced back to the source transaction systems, whether a customer relationship management (CRM) system, an accounting application, or some other system of record. These systems typically operate at the terabyte scale.
In this discussion, Jim rightly points out that any IT administrator who fails to keep the systems of record clean, unified, and consistent has already half lost the battle. Of course, you can fix problems downstream (to a degree) by aggregating, matching, merging, and cleansing the data in a staging database. But quality problems are closely tied to a lack of control over transactional data sources; they have little to do with the sheer number or size of those sources.
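To make the aggregate-match-merge-cleanse step concrete, here is a minimal sketch in Python with pandas; it is not the author's tooling and not IBM QualityStage. It standardizes a few hypothetical customer records from a staging table, matches them on normalized keys, and keeps the most recently updated record in each match group as the survivor. All field names and values are illustrative.

```python
# A minimal sketch of downstream cleanup in a staging table:
# standardize, match, and merge duplicate customer records.
import pandas as pd

staging = pd.DataFrame([
    {"customer_id": 101, "name": "ACME Corp.", "email": "OPS@ACME.COM",   "updated": "2023-01-05"},
    {"customer_id": 102, "name": "Acme Corp",  "email": "ops@acme.com",   "updated": "2023-03-17"},
    {"customer_id": 103, "name": "Globex LLC", "email": "info@globex.io", "updated": "2023-02-11"},
])

# Standardize: normalize case and punctuation so equivalent values compare equal.
staging["name_key"] = (staging["name"].str.lower()
                                      .str.replace(r"[.,]", "", regex=True)
                                      .str.strip())
staging["email_key"] = staging["email"].str.lower().str.strip()

# Match and merge ("survivorship"): within each match key, keep the most
# recently updated record as the surviving golden record.
staging["updated"] = pd.to_datetime(staging["updated"])
golden = (staging.sort_values("updated")
                 .groupby(["name_key", "email_key"], as_index=False)
                 .last())

print(golden[["customer_id", "name", "email"]])
```

The point of the sketch is only that such repair happens downstream, after the duplicate or inconsistent records have already been created in the source system.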
By deploying IBM InfoSphere QualityStage in a massively parallel configuration (or using IBM BigInsights to approximate that capability), you can scale data-cleansing operations downstream of the problem source, but you cannot cure the disease at its origin; you are only treating the symptoms.
Big data now aggregates new kinds of data sources that were never cleansed before
In traditional data warehousing, people understand data quality problems quite well (even if they remain a challenge), but there the main concern is maintaining the core systems of record: customers, finance, human resources, supply chain, and so on. What about the big data space?
Many big data programs are used to drill into aggregated data sources such as social marketing intelligence, real-time sensor feeds, data extracted from external sources, browser clickstream sessions, IT system logs, and more. Historically, these sources were never linked to the official reference data of transactional systems. There was never any need to cleanse them, because the specialist teams that analyzed them, often offline, tended to look at them in isolation and did not record the results in the official systems of record. However, cross-information-type analysis, which is common in big data environments, has changed that.
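As a hedged illustration of this kind of cross-type analysis (not drawn from the article or from any specific IBM product), the Python sketch below joins hypothetical clickstream sessions to official customer reference data and keeps the sessions that cannot be reconciled visible as a quality signal, rather than silently dropping them. All identifiers are invented.

```python
# Link raw clickstream sessions to official customer reference data,
# surfacing sessions that cannot be reconciled as a data-quality signal.
import pandas as pd

reference = pd.DataFrame({
    "customer_id": ["C001", "C002"],
    "segment":     ["retail", "commercial"],
})

clickstream = pd.DataFrame({
    "session_id":   ["s-1", "s-2", "s-3"],
    "customer_id":  ["C001", "C002", "C999"],  # C999 has no reference record
    "pages_viewed": [12, 3, 7],
})

joined = clickstream.merge(reference, on="customer_id", how="left", indicator=True)

# Analyze the sessions that did link to reference data...
by_segment = joined[joined["_merge"] == "both"].groupby("segment")["pages_viewed"].mean()

# ...and keep the unmatched ones visible as a consistency issue.
orphans = joined[joined["_merge"] == "left_only"]["session_id"].tolist()
print(by_segment)
print("sessions with no reference match:", orphans)
```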
While an individual data point may have only marginal value in isolation, the value of the pieces stitched together can be considerable. They help provide context for problems that are occurring (or are about to occur).
Unlike business reference data, these new sources do not produce data that must be loaded directly into the enterprise data warehouse, archived offline, or retained for e-discovery. Instead, you dig into them to extract key patterns, trends, and root causes, and once you have met your core tactical goals, you can discard most of the data. That usually requires a great deal of digging, slicing, and dicing.
In this context, data quality problems take two forms. First, you cannot lose the sources, actors, participants, or actions that need to be reconciled with the definitions used across the rest of your data. Second, you cannot discard the lineage of how the data was handled: the who, what, when, where, and how of its discovery and replication.
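One way to make the second point concrete is to attach a small lineage record to every derived result, so the who, what, when, where, and how survive even after the raw source data is discarded. The sketch below is an illustrative assumption, not a prescribed design; all class and field names are hypothetical.

```python
# Keep lineage (who, what, when, how) attached to every derived result
# so it can still be reconciled after the raw source data is gone.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Lineage:
    source: str   # where the raw data came from (e.g. "web-clickstream")
    actor: str    # who or what produced the derivation (job, analyst, team)
    method: str   # how it was derived (query, model, manual review)
    extracted_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class DerivedMetric:
    name: str
    value: float
    lineage: Lineage

metric = DerivedMetric(
    name="avg_pages_per_session",
    value=7.3,
    lineage=Lineage(source="web-clickstream", actor="weekly-etl-job", method="groupby mean"),
)

print(metric)
```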
As John McPherson, a colleague at IBM Research, put it: "Remember, many times when we talk about big data, the data we're talking about is data that wasn't used well in the past, so we're usually trying to solve different problems. We are not trying to determine the profitability of our stores; we should already have done that with high-quality data from the systems of record, doing everything we can to standardize and reshape the data in the data warehouse." In John's example, what we are after instead is finding the factors that improve store profitability. This article continues in part 2. In the meantime, please tell us in the comments about your experiences maintaining big data quality.