Principles of Data Integration

Source: Internet
Author: User
Keywords: data integration, data integration techniques, data integration principles

Data integration is the logical or physical consolidation of data that comes from different sources and has different formats and characteristics, so that an enterprise can share its data comprehensively.

The house-building analogy below is used many times on Experian Data Quality, but only because it fits so well when the subject is standardizing reference data.


Data standardization is only one step in building a good data management strategy, but it is a basic step in making data actionable.

Why is it in your best interest to invest in data standardization when conducting data quality or data management projects? Just as a solid foundation is essential for a strong house, data standardization is necessary to build a strong data management plan. It is how organizations committed to being data-driven can make decisions quickly and efficiently.

What exactly is data standardization?

Data standardization is the process of converting or manipulating data into a consistent format. The data is likely to live in many different systems, and the storage rules and formats of those systems may differ slightly. These small differences can lead to misreadings and misunderstandings of the organization’s data, causing people who rely on the data to distrust it and to run multiple checks to confirm that the conclusions drawn from it are actually correct.
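
As a minimal sketch of the idea, the Python snippet below converts a phone number and a date that two systems store differently into one consistent format. The two systems, their formats, the sample records and the function names are all assumptions made up for illustration.

```python
import re
from datetime import datetime

# Date layouts the two hypothetical source systems are assumed to use.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def standardize_phone(raw: str) -> str:
    """Keep digits only, so '(404) 555-0100' and '404.555.0100' match."""
    return re.sub(r"\D", "", raw)


def standardize_date(raw: str) -> str:
    """Return the date in ISO 8601 form (YYYY-MM-DD)."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known layout
    raise ValueError(f"unrecognized date format: {raw!r}")


def standardize(record: dict) -> dict:
    """Apply every field-level rule to one record."""
    return {
        "phone": standardize_phone(record["phone"]),
        "signup": standardize_date(record["signup"]),
    }


# The same customer as each system might record it.
crm_record = {"phone": "(404) 555-0100", "signup": "03/15/2021"}
billing_record = {"phone": "404.555.0100", "signup": "2021-03-15"}

# After standardization the two records agree.
print(standardize(crm_record) == standardize(billing_record))
```

Raising an error on an unknown date layout, rather than guessing, is a deliberate choice here: a silent wrong conversion is exactly the kind of small difference that makes people distrust the data.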



For the average Joe, information technology (IT) is a mysterious world, full of incomprehensible programming languages and expensive hardware. Listening in on IT technicians is almost like hearing a foreign language. But despite this seemingly impenetrable language barrier, it is critical for decision makers in businesses and organizations to understand the IT world. One of the most important IT concepts is data integration.


On the surface, data integration sounds like a simple idea. Since many organizations store information in multiple databases, they need a way to retrieve data from different sources and assemble it in a unified manner. For example, imagine an electronics company that is preparing to launch a new mobile device. The marketing department may want to take customer information from the sales department’s database and compare it with information from the product department to create a targeted sales list. A good data integration system lets the marketing department view information from both sources in a unified way, leaving out any information that is not relevant to the search.


In fact, data integration is a complex discipline. There is no universal data integration method, and many of the technologies IT experts use are still evolving. Some methods will work better than others for a given organization, depending on its needs. We will take a closer look at some general strategies IT experts use to integrate multiple data sources, and at the world of database management.


Data integration basics


Data integration mainly focuses on databases. A database is an organized collection of data. It is similar to a file system, which is an organizational structure for files that makes them easy to find, access and manipulate.


There are different ways to classify databases. Some people like to classify them by the type of data they store. For example, if all the information stored in a database consists of video or sound files, it can be classified as a media database.


Another classification method looks at how the database organizes its data. The organization of a database is called its schema. A common organization technique is to use tables to show the relationships between different data points. Tables are like spreadsheets: columns define data categories, and rows are records. A database that uses this method is a relational database.
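
A small sketch may make the table idea concrete. It uses SQLite through Python's standard sqlite3 module; the customers table and its contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# The schema: each column defines one category of data.
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Each row is one record.
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "Atlanta"), (2, "Grace", "Boston")],
)

# PRAGMA table_info reports one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
rows = conn.execute("SELECT * FROM customers ORDER BY id").fetchall()

print(columns)
for row in rows:
    print(row)
```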


Object-oriented programming (OOP) databases organize data differently. OOP languages deviate from the traditional programming method, which follows the pattern of taking data, plugging it into a set of instructions and then generating output. The focus of an OOP language is to define data as objects, and then determine how different objects relate to and interact with each other.


To create an OOP database, you must first define all the objects you plan to store in it. Then you define how each object relates to every other object in the database. Once you have identified an object, you put it into a class, or group of objects. To define a class, you must determine what data each object in the class must have, and which logic sequences (called methods) will affect those objects. Objects in the system can communicate with you or with other objects using interfaces called messages.


An example makes this easier to understand. Suppose you are building a database containing information about American sports. You decide to start by defining a baseball team. Once you have created a baseball team definition, you can generalize it into a class in the database. The Atlanta Braves will be a specific instance of this class, also known as an object. The class of baseball teams belongs to the superclass of American sports teams, which also includes other classes such as football and soccer teams.
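
The same hierarchy can be sketched in an object-oriented language. The Python classes below only illustrate the class, superclass and instance relationship; they are not an object database, and the attribute and method names are assumptions.

```python
class SportsTeam:
    """Superclass: data and methods shared by every American sports team."""

    def __init__(self, name: str, city: str) -> None:
        self.name = name
        self.city = city

    def sport(self) -> str:
        # Each subclass is expected to say which sport it plays.
        raise NotImplementedError

    def describe(self) -> str:
        # A method: logic that operates on the object's own data.
        return f"{self.city} {self.name} ({self.sport()})"


class BaseballTeam(SportsTeam):
    def sport(self) -> str:
        return "baseball"


class FootballTeam(SportsTeam):
    def sport(self) -> str:
        return "football"


# The Atlanta Braves are one specific instance (object) of BaseballTeam.
braves = BaseballTeam(name="Braves", city="Atlanta")
print(braves.describe())
```

Calling braves.describe() plays the role of sending a message to the object: the caller asks for a result without touching the object's data directly.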


To access the information in a database (regardless of how it organizes the data), you use queries. A query is simply a request for information. Both people and applications can submit queries to a database. The database responds by sending back the data that matches the parameters of the request. Queries depend on special computer languages, such as Structured Query Language (SQL). If you have ever used an internet search engine, then you have submitted a query: your search term.
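
For illustration, here is a short sketch of an application submitting an SQL query, again using Python's sqlite3 module. The teams table and the teams_playing helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE teams (name TEXT, city TEXT, sport TEXT)")
conn.executemany(
    "INSERT INTO teams VALUES (?, ?, ?)",
    [
        ("Braves", "Atlanta", "baseball"),
        ("Falcons", "Atlanta", "football"),
        ("Red Sox", "Boston", "baseball"),
    ],
)


def teams_playing(sport: str) -> list[str]:
    """Submit a query; the database returns only the rows that match."""
    cursor = conn.execute(
        "SELECT name FROM teams WHERE sport = ? ORDER BY name", (sport,)
    )
    return [name for (name,) in cursor]


print(teams_playing("baseball"))
```

The ? placeholder passes the request parameter separately from the SQL text, which is the usual way to avoid building queries by string concatenation.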


Data integration methods


Based on the above, you may think that databases are quite complicated. That is a fair assumption, and it helps explain why data integration is still a developing discipline even though it is more than 30 years old. The goal of data integration is to collect data from different sources, combine it and present it in a way that appears to be a unified whole.


Suppose you are about to leave on a trip and want to check the traffic before deciding how to get out of town. Here is how different data integration methods would handle your query.


The manual integration method leaves all the work to you. First, you must know where to find the data: you need to know the actual location of the traffic reports and the town map. You then retrieve the traffic reports and map data directly from their respective databases, and compare the two sets of data with each other to find the best route out of the city.


If you use the common user interface method, you have less to do. You can submit your query through an interface such as the World Wide Web. The query results are displayed as views on the interface. You still need to compare the traffic report with the map to determine the best route, but at least the interface takes care of finding and retrieving the data.


Some integration methods rely on an application to do all the work for you. These applications are specialized computer programs that locate, retrieve and integrate information on your behalf. During the integration process, the application must manipulate the data so that information from one source is compatible with information from the others. In our example, you would submit a query to the application, which would display a view that combines a map of your town with data from the traffic reports. The problem with this approach is that as the number of data sources and formats grows, the applications become complex and difficult to program.
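
A toy sketch of such an application follows. Both data sources, their field names and their units are invented for illustration; the point is only that the application reconciles the formats before combining the data.

```python
# Two hypothetical sources that describe the same roads differently.
traffic_source = [  # delays in seconds, road names in upper case
    {"ROAD": "MAIN ST", "DELAY_S": 600},
    {"ROAD": "HIGHWAY 9", "DELAY_S": 120},
]
map_source = {  # route name -> normal travel time in minutes
    "Main St": 10,
    "Highway 9": 15,
}


def integrated_view() -> dict[str, float]:
    """Combine both sources into one view: total minutes per route."""
    # Make the traffic data compatible with the map data:
    # same key spelling (lower case) and same unit (minutes).
    delays = {r["ROAD"].lower(): r["DELAY_S"] / 60 for r in traffic_source}
    return {
        road: base + delays.get(road.lower(), 0.0)
        for road, base in map_source.items()
    }


view = integrated_view()
best_route = min(view, key=view.get)  # route with the lowest total time
print(view)
print(best_route)
```

Every additional source would need its own conversion rules inside integrated_view, which hints at why such applications grow complex as sources and formats multiply.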


Then there is the common data storage method, also known as a data warehouse. With this method, all the data from the various databases to be integrated is extracted, transformed and loaded (ETL). That is, the data warehouse first extracts all the data from the various data sources. It then transforms the data into a common format so that one set of data is compatible with another. Finally, it loads the new data into its own database. When you submit a query, the data warehouse looks up the data, retrieves it and presents it to you in an integrated view. In our example, the data warehouse would find its latest information on the town's traffic reports and maps, integrate the two and send the view to you. This system has several advantages and disadvantages, which we will introduce in the next section.
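
The extract, transform and load steps can be sketched as three small functions. This is an illustrative toy, not how any particular warehouse product works: the source systems, the routes table and the function names are assumptions, and SQLite stands in for the warehouse database.

```python
import sqlite3

# Hypothetical source systems with different formats.
traffic_db = [{"road": "MAIN ST", "delay_s": 600}]
map_db = [("Main St", 10)]  # (road, normal travel time in minutes)


def extract() -> tuple[list, list]:
    """Copy the raw rows out of each source."""
    return list(traffic_db), list(map_db)


def transform(traffic_rows, map_rows) -> list[tuple[str, float, float]]:
    """Convert both sources to one format: (road, normal_min, delay_min)."""
    delays = {r["road"].title(): r["delay_s"] / 60 for r in traffic_rows}
    return [(road, float(mins), delays.get(road, 0.0)) for road, mins in map_rows]


def load(warehouse: sqlite3.Connection, rows) -> None:
    """Store the converted rows in the warehouse's own database."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS routes "
        "(road TEXT PRIMARY KEY, normal_min REAL, delay_min REAL)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO routes VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(*extract()))

# Queries now run against the warehouse only, not against the sources.
result = warehouse.execute(
    "SELECT road, normal_min + delay_min FROM routes"
).fetchall()
print(result)
```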


Most data integration system designers believe that the ultimate goal is to leave as little work as possible for the end user, so they tend to focus on application and data warehouse technologies.


Data warehouses


As mentioned earlier, a data warehouse is a database that uses a common format to store information from other databases. That is about as specific as a description of data warehouses can get. There is no uniform definition of what a data warehouse is or how designers should build one. As a result, there are several different ways to create a data warehouse, and one data warehouse may look and behave differently from another.


Generally, queries to a data warehouse take very little time to resolve. That is because the data warehouse has already done the main work of extracting, transforming and combining the data. The user side of a data warehouse is called the front end, so from a front-end perspective, a data warehouse is an efficient way to obtain integrated data.


From a back-end perspective, it is a different story. Database administrators must put a lot of thought into a data warehouse system to make it effective and efficient. Converting data collected from different sources into a common format can be particularly difficult. The system needs a consistent method for describing and encoding data.


The warehouse must have a large enough database to store data collected from multiple sources. Some data warehouses include an additional step called a data mart. The data warehouse takes over the responsibility of aggregating data, and the data mart responds to user queries by retrieving and combining appropriate data from the warehouse.


One problem with data warehouses is that the information in them is not always up to date. This follows from the way data warehouses work: they extract information from other databases periodically. If the data in those databases changes between extractions, queries to the data warehouse will not produce an up-to-date and accurate view. If the data in the system rarely changes, this is not a big problem. For other applications, however, it is.
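
The staleness problem is easy to reproduce in a few lines. In this deliberately simplified sketch, plain dictionaries stand in for the source database and the warehouse, and run_extraction is a hypothetical periodic job.

```python
source = {"Main St": "clear"}  # live operational database (hypothetical)
warehouse = {}                 # the warehouse's own copy of the data


def run_extraction() -> None:
    """Periodic job: copy the current source contents into the warehouse."""
    warehouse.clear()
    warehouse.update(source)


run_extraction()
source["Main St"] = "accident"  # the source changes between extractions

stale = warehouse["Main St"]    # still reflects the earlier snapshot
run_extraction()
fresh = warehouse["Main St"]    # caught up after the next extraction

print(stale, fresh)
```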

