Data Integration Examples

Source: Internet
Author: User

The following example, shown in Figure 1-5, illustrates a complete data integration scenario.

Data sources and the mediated schema

In this example there are five data sources. The leftmost, S1, stores movie data: each movie's name, actors, director, and genre. The next three sources, S2 to S4, store showtime data. S2 covers cinemas across the entire country, while S3 and S4 cover only cinemas in New York and San Francisco, respectively. Note that although these three sources store the same type of data, they use different attribute names. The rightmost source, S5, stores movie reviews.

The mediated schema consists of four relations: Movie, Actors, Plays (showtimes), and Reviews. Note that Reviews in the mediated schema does not contain a date attribute, even though the data source S5 holds that information.

The semantic mappings in the data source descriptions describe the relationship between each data source and the mediated schema. For example, the mapping for S1 states that the source contains a table Movies, and that the attribute name in Movies corresponds to the attribute title of the relation Movie in the mediated schema. It also specifies that the relation Actors in the mediated schema is a projection of the Movies table in S1 onto its two columns name and actors.

Similarly, the semantic mappings specify that tuples of the Plays relation in the mediated schema can be obtained from S2, S3, or S4, and that the location attribute is New York for every tuple from S3 (and San Francisco for every tuple from S4).
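The mappings described above can be sketched in Python as plain functions from source rows to mediated-schema tuples. This is only an illustration: the sample rows, the S3 and S4 attribute names (film, time, movie_name, start), and the mediated attribute names other than title and location are assumptions made for the sketch, not part of the source descriptions.

```python
# Hypothetical rows from S1's Movies table: one row per (movie, actor) pair.
S1_MOVIES = [
    {"name": "Manhattan", "actors": "Woody Allen",
     "director": "Woody Allen", "genre": "comedy"},
    {"name": "Manhattan", "actors": "Diane Keaton",
     "director": "Woody Allen", "genre": "comedy"},
]
# Hypothetical showtime rows; S3 and S4 use different attribute names.
S3_ROWS = [{"film": "Manhattan", "time": "19:00"}]        # New York only
S4_ROWS = [{"movie_name": "Vertigo", "start": "20:30"}]   # San Francisco only


def movie_from_s1(rows):
    """Mediated Movie relation: S1's `name` becomes `title`."""
    seen, out = set(), []
    for r in rows:
        key = (r["name"], r["director"], r["genre"])
        if key not in seen:  # drop duplicates caused by one row per actor
            seen.add(key)
            out.append({"title": key[0], "director": key[1], "genre": key[2]})
    return out


def actors_from_s1(rows):
    """Mediated Actors relation: projection of S1 onto (name, actors)."""
    return [{"title": r["name"], "actor": r["actors"]} for r in rows]


def plays_from_s3(rows):
    """Every S3 tuple is a New York showtime."""
    return [{"movie": r["film"], "location": "New York",
             "startTime": r["time"]} for r in rows]


def plays_from_s4(rows):
    """Every S4 tuple is a San Francisco showtime."""
    return [{"movie": r["movie_name"], "location": "San Francisco",
             "startTime": r["start"]} for r in rows]
```

Because the location value is fixed by the mapping rather than stored in S3 or S4, a query that filters on location can rule out a whole source without contacting it.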

In addition to semantic mappings, a data source description records other information about the source. First, it indicates whether the source is complete. For example, S2 may not contain every movie showtime in the country, whereas S3 contains all showtimes in New York. Second, it can specify restrictions on how the source is accessed. For example, the description of S1 specifies that a query must supply a value for at least one attribute as a constraint. Similarly, the sources that provide showtimes require a movie name as input.
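One way to picture this extra information is as a small record per source. The sketch below is a hypothetical representation: the class name, its fields, and the S3 attribute names are assumptions. The text does not say whether S1 is complete, so it is treated here as not guaranteed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDescription:
    name: str
    attributes: frozenset
    complete: bool                          # does the source hold all tuples in its scope?
    required_any: frozenset = frozenset()   # at least one of these must be bound
    required_all: frozenset = frozenset()   # every one of these must be bound

    def can_answer(self, bound_attributes):
        """Check whether a query binding these attributes satisfies the access restrictions."""
        bound = set(bound_attributes)
        if self.required_any and not (bound & self.required_any):
            return False
        return self.required_all <= bound


S1_ATTRS = frozenset({"name", "actors", "director", "genre"})
# S1: at least one attribute value must be given as a constraint.
S1 = SourceDescription("S1", S1_ATTRS, complete=False, required_any=S1_ATTRS)
# S3: complete for New York, but a movie name must be supplied as input.
S3 = SourceDescription("S3", frozenset({"film", "time"}), complete=True,
                       required_all=frozenset({"film"}))
```

With these descriptions, `S1.can_answer({"director"})` holds, while `S3.can_answer(set())` does not, because no movie name is supplied.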

Query processing

We pose queries to the data integration system using the terms of the mediated schema. The example query asks for the New York showtimes of movies directed by Woody Allen.
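The query text itself is not reproduced here. A query of this kind might look like the SQL in the sketch below, which is an assumption rather than the original statement. The attribute names director, movie, and startTime are also assumed. In a real data integration system the mediated schema is virtual; the sketch loads a few made-up rows into an in-memory SQLite database only so that the query can be run.

```python
import sqlite3

# Hypothetical query over the mediated relations Movie and Plays.
QUERY = """
SELECT Movie.title, Plays.startTime
FROM Movie JOIN Plays ON Movie.title = Plays.movie
WHERE Movie.director = 'Woody Allen' AND Plays.location = 'New York'
ORDER BY Plays.startTime
"""


def run_example():
    """Materialize a tiny mediated schema and run the query against it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Movie (title TEXT, director TEXT, genre TEXT);
        CREATE TABLE Plays (movie TEXT, location TEXT, startTime TEXT);
        INSERT INTO Movie VALUES ('Manhattan', 'Woody Allen', 'comedy');
        INSERT INTO Movie VALUES ('Vertigo', 'Alfred Hitchcock', 'thriller');
        INSERT INTO Plays VALUES ('Manhattan', 'New York', '19:00');
        INSERT INTO Plays VALUES ('Manhattan', 'San Francisco', '20:00');
        INSERT INTO Plays VALUES ('Vertigo', 'New York', '21:00');
    """)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

On this sample data, `run_example()` returns only the New York showing of Manhattan.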

As shown in Figure 1-6, query processing proceeds in the following steps.

Query rewriting. As mentioned earlier, user queries are posed in terms of the mediated schema. The first step is therefore to rewrite the query into queries over the schemas of the data sources. To do this, the data integration system uses the data source descriptions. The result of the rewriting is a set of queries over the source schemas whose combined results yield the answer to the original query. We call this rewritten result a logical query plan.

The rewrite process is as follows:

Movie tuples can be obtained directly from S1, but the attribute title must be translated to name in S1.

Plays tuples can be obtained from S2 or S3. Because S3 is known to contain complete data for New York, we choose S3 rather than S2.

Because S3 requires a movie name as input, the query plan must first access S1 and then pass the movie names returned by S1 as inputs to S3.

Therefore, the first logical query plan produced by the query rewriting engine accesses S1 and then S3 to answer the query. A second logical query plan, which accesses S1 first and then S2, is also correct, although its results may be incomplete.
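The reasoning above can be sketched as a small plan enumerator. This is a simplified illustration, not how a real rewriting engine works: the function name, the dictionary fields, and the "nationwide" coverage label are hypothetical. The text does not state whether S4 is complete, so it is treated as not guaranteed.

```python
# Hypothetical summaries of the showtime sources.
SHOWTIME_SOURCES = [
    {"name": "S2", "covers": "nationwide", "complete": False},
    {"name": "S3", "covers": "New York", "complete": True},
    {"name": "S4", "covers": "San Francisco", "complete": False},
]


def logical_plans(location, showtime_sources):
    """Return candidate plans as ordered source lists, preferred plan first."""
    # Keep only sources that can contain showtimes for the requested location.
    relevant = [s for s in showtime_sources
                if s["covers"] in (location, "nationwide")]
    # Prefer complete sources; the sort is stable, so ties keep their order.
    relevant.sort(key=lambda s: not s["complete"])
    # S1 comes first in every plan because the showtime sources
    # require a movie name as input.
    return [["S1", s["name"]] for s in relevant]
```

For New York this yields the two plans described in the text, with the S1-then-S3 plan ranked ahead of the S1-then-S2 plan; S4 is excluded because it covers only San Francisco.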

Query optimization. As in traditional database systems, the next step in query processing is query optimization. Query optimization transforms a logical query plan into a physical query plan that specifies the exact order in which the data sources are accessed, which algorithms are used to combine the results (for example, joins between data sources), and how many resources are allocated to each operation. As mentioned earlier, the system must also address the challenges posed by the distribution of the data.

In our example, the optimizer decides which join algorithm to use to combine S1 and S3. For example, the join algorithm might feed movie names from S1 to S3 in a pipelined fashion, or it might buffer the names first and then send them to S3 as a single batch.
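The two strategies can be contrasted in a short sketch. The in-memory data and all function names below are hypothetical stand-ins for calls to the real sources, and the sketch assumes S3 could accept several movie names in one request.

```python
# Hypothetical stand-ins for the two sources.
S1_ROWS = [
    {"name": "Manhattan", "director": "Woody Allen"},
    {"name": "Annie Hall", "director": "Woody Allen"},
    {"name": "Vertigo", "director": "Alfred Hitchcock"},
]
S3_SHOWTIMES = {"Manhattan": ["19:00", "21:30"], "Annie Hall": ["18:00"]}


def s1_titles(director):
    """Stream the names of the director's movies from S1."""
    for row in S1_ROWS:
        if row["director"] == director:
            yield row["name"]


def s3_lookup(title):
    """One request to S3 for a single movie name."""
    return S3_SHOWTIMES.get(title, [])


def s3_lookup_many(titles):
    """One request to S3 for a whole batch of movie names."""
    return {t: S3_SHOWTIMES.get(t, []) for t in titles}


def pipelined_join(titles, lookup):
    """Send each name to S3 as soon as S1 produces it."""
    for title in titles:
        for start in lookup(title):
            yield (title, start)


def batched_join(titles, lookup_many):
    """Buffer every name from S1 first, then query S3 once."""
    batch = list(titles)
    return [(title, start)
            for title, starts in lookup_many(batch).items()
            for start in starts]
```

Both functions produce the same tuples. The pipelined version can emit its first result before S1 has finished, while the batched version trades that early output for a single round trip to S3.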

Query execution. Finally, the execution engine is responsible for actually executing the physical query plan. It dispatches queries to the individual data sources through their wrappers and then combines the returned results as the query plan specifies.

Another notable difference between a data integration system and a traditional database system concerns the execution engine. In a traditional database system, the engine simply executes the plan handed to it by the query optimizer. In a data integration system, the engine may ask the optimizer to reconsider the plan based on its monitoring of the plan's progress. In our example, the execution engine may find that S3 is unusually slow and therefore ask the optimizer for a plan that uses a different data source instead of S3.
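A very reduced sketch of this feedback loop follows. It assumes the engine has already measured a response time for each source and that the optimizer has supplied alternative plans in order of preference. The function name, the latency figures, and the fixed time budget are all hypothetical.

```python
def choose_plan(plans, observed_latency, budget_seconds):
    """Return the first plan whose sources all respond within the budget.

    `plans` is a list of source-name lists, preferred plan first;
    `observed_latency` maps a source name to its measured response
    time in seconds.
    """
    for plan in plans:
        if all(observed_latency(source) <= budget_seconds for source in plan):
            return plan
    raise RuntimeError("no plan met the latency budget")


# Made-up measurements: S3 is unusually slow.
LATENCIES = {"S1": 0.1, "S2": 0.4, "S3": 5.0}
plan = choose_plan([["S1", "S3"], ["S1", "S2"]], LATENCIES.get, 1.0)
```

Here the preferred plan is rejected because S3 exceeds the budget, so `plan` ends up as the S1-then-S2 alternative, at the cost of possibly incomplete results.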

Of course, another option is for the optimizer to build contingencies for unexpected events into the original plan. However, if many unexpected execution events must be anticipated, the original plan could become very large. An interesting technical challenge in designing the query processing engine is therefore how to balance the complexity of the plan against its ability to respond to unexpected execution events.
