Joins of data

Source: Internet
Author: User

SSIS Engine implements the inner join and outer join functions in SQL statements using the Merge Join transformation and Lookup transformation. Any data flow that flows through it is loaded into the memory data buffer, and the amount of data the data buffer can hold determines the performance of the transformation component. The Lookup transformation has an input, a reference table (or cache table, lookup table), and multiple outputs. For each data row in the input, if the corresponding key value can be found in the reference table, the data row matches successfully from the lookup match output path to the downstream component, and if the corresponding key value cannot be found in the reference table, the data row match fails from the lookup No Match Output "Path outputs to downstream components. The Lookup transformation blocks the data stream when it loads the cached data. When the cache load is complete, the data flow begins to execute.

The merge Join transformation has two ordered inputs (sorted using the sort component or using the ORDER BY clause in db) and one output, and the merge Join transformation uses the join condition to output the corresponding data, which can be converted downstream using the conditional split. Gets the data that the match succeeded or failed to match. The Merge Join does not cache data.

One, Lookup conversion

1, the characteristics of the stream

The Lookup transform is a non-blocking transformation that has the characteristics of a stream, can load data at the edge, and transforms the data while processing. When a new data row enters the transformation component, the data rows that have already been processed are passed to the downstream components without being intercepted.

When the lookup transformation is in full cache cache mode, the lookup transformation blocks the data flow when the cache table is loaded into memory, until all the lookup data is loaded into the buffer, so that data flow Task is started, so make sure that the small table is used as the cache table. Instead, the data from the large table is streamed to the Lookup transformation component.

When the lookup transformation is in the partial cache or no cache mode, the Looup conversion is recognized as a row-based conversion, which flows through the transformed rows and needs to interact with an external input to be processed individually. Because of the row-based processing, in most cases it is not possible to keep up with the speed of data flow processing, and the buffer will be intercepted until the transformation component finishes processing all the data in the buffer.

2,cache Mode

Lookup conversion has three cache mode:full cache,partial cache and No cache.

Full cache means that the Lookup transformation loads all the data in the cache table into the data buffer in memory, and each row of data in the other input flows through the buffer to perform the join operation.

No Cache refers to the lookup transformation of each data row in the upstream input, which executes the query once and whether the check data exists. In no Cache mode, when each input line flows through the Lookup transformation, the component sends a request to the reference table in the database to see if the key values match, which can be very inefficient and slow.

The Partial cache means that the Lookup transformation caches the most recently used data within the maxmemoryusage limit, and once the cache grows too large, the least-used cache data is discarded. If the reference table data volume is too large to load all of its data into the cache, you can select No cache.

When the package starts, as with no cache mode, the data is not preloaded into the lookup cache, and when each input row enters the component, the component uses the specified join key and the specified query to try to find the matching record. If a match is found, the lookup key value is added to the cache in a timely manner, and if the same key value is searched again, the matching key value can be obtained from the cache, saving the query time to access the external input source. If no matching key value is found in the cache, the component accesses the external input source, queries it, and if not in the external input, the key value does not match.

The MaxMemoryUsage property specifies the maximum memory that the lookup transformation uses in partial cache mode.

Two, Merge Join transformation

1,merge join is a semi-blocking conversion

Before passing data to downstream components, the data needs to be intercepted in the buffer for a period of time until the key-value match from two inputs succeeds, and the Merge join transformation passes the data rows to the downstream components.

2,merge join uses a small amount of memory

The Merge join uses less memory than the lookup transformation, and basically does not cache the data because it only needs to maintain the small amount of data needed to join two inputs in memory. When memory is limited or the data is too large, the Merge join is a very useful component, provided the input stream is ordered.

Third, Cache Connection Manager

The Lookup transformation is the only component that uses the CCM cache data, and the CCM is able to populate the lookup cache from any data source. If the lookup transformation is in full cache mode, using CCM to load the cache will improve conversion performance.

In the same package, if multiple lookup transformations use the same reference dataset, these lookup transformations can share the same cache, so that the same cached data is not loaded multiple times.

Joins of data

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.