Data Integration in the Era of Big Data and Backward ETL Technology
Source: Internet
Author: User
Keywords: big data, data integration, data integration techniques
Ensuring the consistency of business data is a very difficult problem for large enterprises. In a multinational company, for example, data about customers or products often comes from multiple sources. As a result, even the simplest questions can be hard to answer. Data integration can be a solution in such cases.
Data integration provides a unified view of data stored in multiple data sources, and extract, transform, load (ETL) technology was an early attempt at it.
With ETL, you extract data from multiple source transaction systems, transform it, and load it into a single location, such as a company data warehouse. The extraction and loading steps are relatively mechanical, but the transformation step is not so easy. To get it right, you need to define business rules that specify which transformations are valid.
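The three steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the two in-memory "source systems", their field names, the fixed exchange rate, and the `sales` table are all assumptions made up for the example, and SQLite stands in for the data warehouse.

```python
import sqlite3

# Hypothetical source systems: two in-memory lists standing in for
# separate transaction systems that use different field names.
EU_ORDERS = [{"order_id": 1, "customer": "ACME", "amount_eur": 100.0}]
US_ORDERS = [{"id": 7, "cust_name": "Acme ", "total_usd": 50.0}]

# Assumed business rule: report everything in USD at a fixed rate.
EUR_TO_USD = 1.10


def extract():
    """Pull raw rows from each source, tagging where each row came from."""
    for row in EU_ORDERS:
        yield "eu", row
    for row in US_ORDERS:
        yield "us", row


def transform(source, row):
    """Apply the business rules that map a source row to the warehouse schema."""
    if source == "eu":
        return (f"eu-{row['order_id']}",
                row["customer"].strip().upper(),
                round(row["amount_eur"] * EUR_TO_USD, 2))
    return (f"us-{row['id']}",
            row["cust_name"].strip().upper(),
            round(row["total_usd"], 2))


def load(conn, rows):
    """Write the transformed rows into a single warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales "
                 "(order_key TEXT PRIMARY KEY, customer TEXT, amount_usd REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    """Run extract, transform, and load in sequence."""
    load(conn, [transform(source, row) for source, row in extract()])


if __name__ == "__main__":
    warehouse = sqlite3.connect(":memory:")
    run_etl(warehouse)
    print(warehouse.execute(
        "SELECT customer, SUM(amount_usd) FROM sales GROUP BY customer").fetchall())
```

Note that `extract` and `load` are almost pure plumbing, while `transform` is where the business decisions live (currency, name normalization, key format), which is the point the paragraph above makes.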
A major difference between ETL and data integration is that data integration is the broader field. It can also cover data quality and the process of defining master reference data, such as company-wide definitions of customers, products, suppliers, and other key information involved in business transactions.
Data classification and consistency
Let's look at an example. A large operating company may need to classify products and customers at several levels in order to run marketing campaigns at each of those levels. For its smaller subsidiaries, a simpler product and customer classification hierarchy may be enough. In this case, the larger organization might classify a can of Coke as a carbonated beverage, which is part of beverages, which in turn is part of food and beverage sales. A smaller subsidiary, however, may classify the same can of Coke directly under food and beverage sales, with no intermediate levels. That is why classification consistency, or at least an understanding of the differences, is needed to get a global view of the company's overall sales.
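One way to reconcile the two hierarchies is to roll sales up to the most specific level that both classifications share. The following is a minimal sketch; the path lists, the function names `common_level` and `rollup`, and the sales figures are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical classification paths, most specific level first.
PARENT_PATH = ["carbonated beverages", "beverages", "food and beverage sales"]
SUBSIDIARY_PATH = ["food and beverage sales"]


def common_level(path_a, path_b):
    """Return the most specific category of path_a that also appears in path_b."""
    shared = set(path_b)
    for category in path_a:
        if category in shared:
            return category
    return None  # the two hierarchies have nothing in common


def rollup(sales, level):
    """Sum the amounts of all sales whose classification path contains `level`."""
    return sum(amount for path, amount in sales if level in path)


if __name__ == "__main__":
    # Made-up sales: one recorded by the parent, one by the subsidiary.
    sales = [(PARENT_PATH, 120.0), (SUBSIDIARY_PATH, 30.0)]
    level = common_level(PARENT_PATH, SUBSIDIARY_PATH)
    print(level, rollup(sales, level))
```

Rolling up at "carbonated beverages" instead would silently drop the subsidiary's sales, which is exactly the kind of difference that has to be understood before a global figure can be trusted.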
Unfortunately, it is not always easy to know who you are doing business with. Shell U.K., for example, is a subsidiary of the oil giant Royal Dutch Shell. Companies such as Aera Energy and Bonny Gas Transport are also Shell entities, some with other investors. Business transactions with these companies therefore need to be included in the global view of Shell as a customer, yet the relationship is not obvious from the company names alone.
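In practice this usually means maintaining an explicit mapping from legal entities to their corporate group and aggregating on the group. The sketch below assumes a hard-coded `PARENT_OF` dictionary built from the examples in the text; a real mapping would come from maintained master data, and attributing a jointly owned entity entirely to one parent is a simplification. The transaction amounts are invented.

```python
from collections import defaultdict

# Ownership links taken from the examples in the text. Hard-coding them is an
# assumption for illustration; joint ownership is deliberately ignored here.
PARENT_OF = {
    "Shell U.K.": "Royal Dutch Shell",
    "Aera Energy": "Royal Dutch Shell",
    "Bonny Gas Transport": "Royal Dutch Shell",
}


def ultimate_parent(name, parent_of=PARENT_OF):
    """Follow parent links until reaching an entity with no recorded parent."""
    seen = set()
    while name in parent_of and name not in seen:
        seen.add(name)  # guard against accidental cycles in the mapping
        name = parent_of[name]
    return name


def totals_by_group(transactions):
    """Aggregate (customer, amount) pairs by the customer's corporate group."""
    totals = defaultdict(float)
    for customer, amount in transactions:
        totals[ultimate_parent(customer)] += amount
    return dict(totals)


if __name__ == "__main__":
    # Made-up transaction amounts.
    print(totals_by_group([
        ("Shell U.K.", 10.0),
        ("Aera Energy", 5.0),
        ("Some Other Co", 1.0),
    ]))
```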
A vice president at a well-known investment bank once told the author that the bank did not know how much business it did worldwide with a counterparty such as Deutsche Bank, let alone whether that business was profitable. The answers to these questions were buried in the systems of the bank's various global departments.
Data quality issues
ETL technology was an early attempt to solve this problem. But to get the transformation step right, you need to define business rules that state which transformations are valid: for example, how to summarize sales transactions, or how to map a database field when one system uses "m" to denote male customers and another uses "male". Advances in tooling have made this process easier.
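A field mapping rule of this kind can be written down as a small lookup table. The sketch below is illustrative only: the canonical values, the `GENDER_RULES` table, and the choice to map unrecognized codes to "unknown" rather than raise an error are all assumptions, not rules from any particular system.

```python
# Hypothetical business rule: every source spelling maps to one canonical value.
GENDER_RULES = {
    "m": "male",
    "male": "male",
    "f": "female",
    "female": "female",
}


def normalize_gender(raw):
    """Map a source system's gender code to the canonical warehouse value."""
    if raw is None:
        return "unknown"
    # Ignore case and surrounding whitespace before looking up the rule.
    return GENDER_RULES.get(str(raw).strip().lower(), "unknown")


if __name__ == "__main__":
    for code in ["m", "Male", " F ", None, "x"]:
        print(repr(code), "->", normalize_gender(code))
```

Keeping the rules in a table rather than in scattered `if` statements makes them easy for business users to review, which matters because the rules, not the code, are the hard part.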
As noted, data integration is broader than ETL alone, and data quality is an important part of it. What if the customer or product master files contain duplicates? In one project the author took part in, 80% of the customer records were duplicates. That meant the company had only one-fifth as many business customers as it thought.
For materials master files, a duplication rate of 20% to 30% is typical. These anomalies should be eliminated before data is summarized for a company-wide overview.
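A duplication rate like the ones quoted above can be estimated by reducing each record to a matching key and counting how many distinct keys remain. The sketch below uses a deliberately crude key (lowercase, no punctuation, no common legal suffixes); the suffix list, the function names, and the sample names are assumptions, and real matching typically needs fuzzier comparison and human review.

```python
import re

# Assumed list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "ltd", "llc", "plc", "co", "corp"}


def match_key(name):
    """Build a crude matching key: lowercase, no punctuation, no legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(word for word in words if word not in LEGAL_SUFFIXES)


def unique_count(names):
    """Number of distinct customers after applying the matching key."""
    return len({match_key(name) for name in names})


def duplicate_rate(names):
    """Fraction of records that are redundant copies of another record."""
    if not names:
        return 0.0
    return 1 - unique_count(names) / len(names)


if __name__ == "__main__":
    # Made-up customer file: five records, one real customer.
    records = ["Acme Inc.", "ACME, Inc", "acme", "Acme Ltd", "Acme Corp"]
    print(unique_count(records), duplicate_rate(records))
```

With five records that collapse to a single key, four of the five are duplicates, which is the 80% case: the real customer count is one-fifth of the record count.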
Growing data volume
Although data integration has clear advantages for large companies, it is not without challenges. One of them is the continuous growth of the unstructured data that companies generate.
Moreover, data is stored in many different formats, including sensor data, web logs, call logs, documents, images, and videos. ETL tools can be ineffective in this environment because they were not designed with these formats in mind. They also run into difficulties when data volumes become very large, as they do with big data. Tools such as Apache Kafka try to solve this problem by streaming data in real time, which lets them overcome the limitations that earlier message bus approaches had for real-time data integration.
From early ETL to the present, the technologies and concepts of data integration have changed a great deal. But they still need to keep evolving to keep up with the changing needs of enterprises and the new challenges emerging in the era of big data.