Fuzzy Lookup Transformation Usage

Source: Internet
Author: User
Tags ssis

Fuzzy Lookup pre-loads a reference table. When executed, it reads each row of source data, fuzzily matches it against the rows of the reference table, and outputs two indicators of match quality: similarity and confidence. In simplified terms, the matching algorithm works as follows: a standard string in the reference table is split into multiple substrings (the relative position of each character within the string is preserved), and if the input string contains any one of these substrings, Fuzzy Lookup treats the reference row as a match candidate and outputs similarity and confidence according to fuzzy logic. In this respect, fuzzy matching loosely resembles LIKE '%substring%'. When Fuzzy Lookup indexes a token like committee, it also indexes the sub-token elements comm, ommi, mmit, mitt, itte, and ttee. This scheme helps to speed up retrieval and to recover from input errors.

Similarity describes how closely the input string resembles the matched reference value, while confidence describes how likely it is that the matched value is the best match among all candidates in the reference table.

Fuzzy logic is an approach to computing based on "degrees of truth" rather than the usual "true or false" (1 or 0) Boolean logic. When the reference table has a close match for an input tuple, the similarity is high. If there is a single record among all reference tuples that closely matches the input tuple, the confidence score is also high.

Column similarity is the similarity between the input string and the standard string in the reference table that it is directly compared with, not the similarity between the input string and a substring.

Fuzzy Lookup returns column-level similarity scores, which measure how similar the values in a particular column are for the input and the match result. The column-level scores can be used to fine-tune the match quality and for further downstream processing of match results.

One, Tabs

The Fuzzy Lookup transformation performs a lookup operation between an input dataset and a reference dataset using a best-match algorithm.

1, Reference Table tab

The reference table stores the standard data against which the input data is matched.

In the Generate new index and Use existing index options, the "index" is the error-tolerant index (ETI). If you select Store new index, the SSIS engine persists the ETI as a table; the default name is dbo.FuzzyLookupMatchIndex. Fuzzy Lookup uses the ETI to find matching rows in the reference table.

Understanding the Error-tolerant Index

Fuzzy Lookup uses the error-tolerant index (ETI) to find matching rows in the reference table. Each record in the reference table is broken up into words (also known as tokens), and the ETI keeps track of all the places in the reference table where a particular token occurs. In the address example, if your reference data contains 13831 N.E. 8th St, the ETI would contain entries for 13831, N, E, 8th, and St. In addition, Fuzzy Lookup indexes substrings, known as q-grams, so that it can better match records that contain errors. The more unique tokens and the more rows in the reference table, the more entries and the longer the occurrence lists in the ETI. The ETI will be roughly as big as your reference table.

The tokenization process is controlled by the Fuzzy Lookup custom property Delimiters. For example, if you want to index N.E. instead of N and E, remove the period from the list of delimiters. The consequence is that N.E. appears as a single token in the ETI and will be looked up as a unit at run time. Because the delimiters are applied globally, First.Avenue would also appear as a single token.

When Fuzzy Lookup indexes a token like committee, it also indexes the sub-token elements comm, ommi, mmit, mitt, itte, and ttee. This scheme helps to speed up retrieval and to recover from input errors.
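
To make the effect of the Delimiters property concrete, here is a minimal Python sketch of a token-to-rows index. It is not the actual ETI implementation: `tokenize` and `build_index` are hypothetical helpers, the second reference row is made up, and the delimiter sets are chosen only for illustration.

```python
import re
from collections import defaultdict

def tokenize(text, delimiters):
    """Split text on any of the delimiter characters, dropping empty pieces."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [piece for piece in re.split(pattern, text) if piece]

def build_index(rows, delimiters):
    """Map each lower-cased token to the set of row ids in which it occurs."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for token in tokenize(text, delimiters):
            index[token.lower()].add(row_id)
    return index

# Hypothetical reference rows; row 1 is the address from the text above.
rows = {1: "13831 N.E. 8th St", 2: "First.Avenue 8th St"}

with_period = tokenize(rows[1], " .")    # space and period are delimiters
without_period = tokenize(rows[1], " ")  # period removed from the delimiters
index = build_index(rows, " .")
```

With the period as a delimiter, the address splits into 13831, N, E, 8th, and St; without it, N.E. stays a single token, and so does First.Avenue in the second row.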

Because ETI construction for a very large reference table can take a non-trivial amount of time, Fuzzy Lookup offers the option of storing an ETI on the server and re-using it at a later date. This option takes a snapshot of the reference table and allows you to avoid re-building the ETI every time you run Fuzzy Lookup. If your ETI takes too long to re-build for every run, consider creating it once and re-using it in subsequent runs. To do this, select Store new index on the Reference Table tab, and then specify a table name.

If you would like to store your ETI, but your reference data changes from time to time, you can also enable Maintain stored index. This feature installs a trigger on the reference table that detects modifications and propagates them to the ETI, keeping it up to date. If you do not enable index maintenance, Fuzzy Lookup will match against a snapshot of the reference table as it existed when the ETI was created.

Note: Q-grams are a mainstream technique for string similarity queries. The technique does not change the relative position of the characters in a string, but splits a long string into multiple substrings. For example, committee is split into comm, ommi, mmit, mitt, itte, and ttee. If a string contains any one of these substrings, Fuzzy Lookup treats the string and committee as a candidate match and outputs similarity and confidence.
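
The decomposition in the note can be sketched in a few lines of Python. This is only an illustration: the window length of 4 is inferred from the committee example, and `qgrams` and `shares_qgram` are hypothetical helper names, not part of SSIS.

```python
def qgrams(token, q=4):
    """Return every substring of length q, in order of position."""
    token = token.lower()
    if len(token) < q:
        # A token shorter than q yields only itself.
        return [token]
    return [token[i:i + q] for i in range(len(token) - q + 1)]

def shares_qgram(a, b, q=4):
    """True if the two strings have at least one q-gram in common."""
    return bool(set(qgrams(a, q)) & set(qgrams(b, q)))

grams = qgrams("committee")
misspelled = shares_qgram("comittee", "committee")  # one 'm' missing
short_words = shares_qgram("cat", "cot")            # too short to overlap
```

Here `grams` holds the six substrings listed in the note. The misspelled input still shares mitt, itte, and ttee with the reference token, while the two three-letter words share nothing.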

2, Columns tab

Set the mapping between Available Input Columns (from the input data) and Available Lookup Columns (from the reference table); Fuzzy Lookup matches the values of each pair of mapped columns. The output of Fuzzy Lookup consists of two parts: the input columns checked as Pass Through, and the lookup columns whose checkboxes are selected.

3, Advanced tab

Configure the Maximum number of matches to output per lookup, Similarity threshold, and Token delimiters options.

Two, What happens at run time

At run time, Fuzzy Lookup takes an input row and tries to find the best match or matches in the reference table as efficiently as possible. By default, this is done by using the ETI to find candidate reference records that share tokens or q-grams with the input. The best candidates are retrieved from the reference table, and a more careful comparison is made between the two records. Once there are no more candidates that could be better than any match found so far, Fuzzy Lookup stops and moves on to the next input row.

For Fuzzy Lookup to find a match in the reference table using the ETI, the input and target record must share at least one token or q-gram that is stored in the ETI. For reference table records consisting of only a single short word, it is possible that Fuzzy Lookup will be unable to match a dirty input record that contains a spelling mistake, because there is no common token or q-gram stored in the ETI. Be aware, also, that Fuzzy Lookup indexes only a subset of all the possible q-grams in a given record for efficiency reasons. Fuzzy Lookup may fail to find a match due to this sampling process, although matches will be found with a high degree of probability if the records contain many q-grams.

For datasets that have attributes whose values are predominantly a single short token, one alternative, if Fuzzy Lookup is having trouble finding matches that you think it should find, is to set the Exhaustive component property to True. This causes Fuzzy Lookup to ignore the ETI and instead directly compare the input record to each and every record in the reference table. This approach is prohibitively time-consuming for large reference tables, so only attempt exhaustive search on small reference tables.
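
The trade-off between index-based candidate retrieval and the Exhaustive property can be illustrated with a small Python sketch. It is a simplification, not Fuzzy Lookup's implementation: `difflib.SequenceMatcher` stands in for the real similarity function, q is assumed to be 4, every q-gram is indexed (no sampling), and `build_index` and `lookup` are hypothetical names.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def qgrams(text, q=4):
    """Set of q-grams of text; a shorter string yields only itself."""
    text = text.lower()
    if len(text) < q:
        return {text}
    return {text[i:i + q] for i in range(len(text) - q + 1)}

def build_index(reference, q=4):
    """Map each q-gram to the ids of the reference rows that contain it."""
    index = defaultdict(set)
    for row_id, value in enumerate(reference):
        for gram in qgrams(value, q):
            index[gram].add(row_id)
    return index

def lookup(value, reference, index, exhaustive=False):
    """Return (score, reference value) of the best match, or (0.0, None)."""
    if exhaustive:
        candidates = range(len(reference))   # ignore the index entirely
    else:
        candidates = set()
        for gram in qgrams(value):
            candidates |= index.get(gram, set())
    scored = [
        (SequenceMatcher(None, value.lower(), reference[i].lower()).ratio(),
         reference[i])
        for i in candidates
    ]
    return max(scored, default=(0.0, None))

reference = ["Ohio", "Iowa", "Utah"]          # single short tokens
index = build_index(reference)
indexed = lookup("Ohoi", reference, index)    # typo shares no q-gram
full = lookup("Ohoi", reference, index, exhaustive=True)
```

The misspelled input has no q-gram in common with any reference value, so the indexed lookup returns no match, whereas the exhaustive comparison still finds Ohio at the cost of scanning every row.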

Each match returned by Fuzzy Lookup has multiple scores associated with it, which quantitatively describe how good a match the returned reference table record is to the input record. Each score is a number between 0.0 and 1.0. Perhaps the most important score is the record-level similarity score. This score measures the overall similarity between the records across all fuzzy match columns that were defined. A score of 1.0 means the records matched exactly on each of the match columns, while scores less than 1.0 indicate progressively more dissimilar matches. A record-level similarity score of 0.0 indicates that Fuzzy Lookup was unable to find a match in the reference table. As mentioned above, this could be because there were no common tokens or q-grams between the input and target reference table record that were indexed in the ETI.

In addition to the record-level similarity, Fuzzy Lookup returns column-level similarity scores, which measure how similar the values in a particular column are for the input and the match result. As described below, these column-level scores can be used to fine-tune the match quality and for further downstream processing of match results.

Fuzzy Lookup additionally returns an estimate of the confidence for each match returned. This can be used to decide whether or not to automatically accept or reject a match. For instance, the reference table might contain the values "E. Virginia" and "W. Virginia". If the dirty input were simply "Virginia", both reference records would have very high similarity, but the difference between their similarity values would be very small. This indicates that Fuzzy Lookup could not find a clear winner and that you may need to manually review the match results for this input record. Such decisions can be automated by adding a Conditional Split transformation after the Fuzzy Lookup, which makes an accept/reject decision based upon the values of similarity and confidence.
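
The accept/reject logic that a Conditional Split would apply can be sketched as a plain Python function. The two threshold values and the three-way "accept / review / reject" outcome are assumptions made for illustration; in a real package the same comparisons would be written as Conditional Split conditions over the similarity and confidence output columns.

```python
def classify(similarity, confidence, min_similarity=0.8, min_confidence=0.7):
    """Route a match by its scores; both thresholds are hypothetical."""
    if similarity < min_similarity:
        return "reject"   # too dissimilar to be useful
    if confidence < min_confidence:
        return "review"   # similar, but no clear winner
    return "accept"

# Made-up score pairs: a clear match, an ambiguous one, and a poor one.
clear_winner = classify(similarity=0.95, confidence=0.90)
ambiguous = classify(similarity=0.92, confidence=0.50)
poor = classify(similarity=0.30, confidence=0.90)
```

An input like "Virginia" with two near-identical candidates would fall into the ambiguous case: high similarity but low confidence, so it is routed to manual review rather than accepted automatically.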

In determining best matches, the most important parameter is the MinSimilarity threshold. You can set this custom property by using the Fuzzy Lookup UI. A reference tuple is returned only if its similarity is greater than or equal to the MinSimilarity threshold. If you set a high similarity requirement, Fuzzy Lookup will consider fewer candidates and, as a result, may not return any matches. If you set MinSimilarity low, Fuzzy Lookup will consider more candidates and may be more likely to find a match, but the search could take longer.

Note that you can set MinSimilarity at both the record level and the column level for each individual column. Any match result returned must meet the thresholds set at all levels and for all columns. For instance, you might set a record-level MinSimilarity of 0.5, but require that ZipCode has MinSimilarity 0.9 and Name has MinSimilarity 0.4. Fuzzy Lookup would only return results that meet all of those criteria.
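
How the record-level and column-level thresholds combine can be shown with a short Python sketch that uses the values from the example above. The dictionary layout and the `meets_thresholds` helper are hypothetical; they only mirror the rule that every threshold must be met.

```python
def meets_thresholds(match, record_min, column_mins):
    """True only if the record-level and every column-level minimum is met."""
    if match["similarity"] < record_min:
        return False
    return all(
        match["columns"].get(column, 0.0) >= minimum
        for column, minimum in column_mins.items()
    )

column_mins = {"ZipCode": 0.9, "Name": 0.4}

# Two made-up candidates with the same record-level similarity.
good = {"similarity": 0.72, "columns": {"ZipCode": 0.95, "Name": 0.55}}
bad_zip = {"similarity": 0.72, "columns": {"ZipCode": 0.60, "Name": 0.90}}

results = [meets_thresholds(m, 0.5, column_mins) for m in (good, bad_zip)]
```

The second candidate is dropped even though its record-level score and Name score are acceptable, because its ZipCode similarity is below 0.9.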

The factors that determine similarity scores include:

    • The number of token or character insertions, deletions, substitutions, and re-orderings that need to be made to the input tuple to match a given reference tuple. For example, the input 122 First Lane would likely be considered closer to the reference 122 First Ln than the input N.E. 1st Ln & Leary.
    • The token frequencies from the reference table. Highly frequent tokens are generally considered to give little information about the goodness of the match. Relatively rare tokens are considered characteristic of the row in which they occur.
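
The idea that rare tokens carry more information than frequent ones can be illustrated with an inverse-frequency weight. The formula below is a common textbook choice and an assumption here, not the weighting Fuzzy Lookup actually uses; the reference rows are made up.

```python
import math
from collections import Counter

def token_weights(reference_rows):
    """Weight each token by log(1 + N / df): rarer tokens weigh more."""
    total = len(reference_rows)
    row_counts = Counter()
    for row in reference_rows:
        # Count each token once per row, however often it repeats in the row.
        row_counts.update(set(row.lower().split()))
    return {
        token: math.log(1 + total / count)
        for token, count in row_counts.items()
    }

rows = ["122 First Ln", "45 Second Ln", "9 Leary Ln"]
weights = token_weights(rows)
```

In this toy table, Ln occurs in every row and receives the lowest weight, while Leary occurs once and receives a higher one, so a shared Leary says more about a match than a shared Ln.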

Setting the right threshold depends on the nature of your application and of your data. If you require a close match between your inputs and your reference, you should consider setting a high value for MinSimilarity, such as 0.90. If you are doing an exploratory project and are interested in examining weak matches as well as close matches, you should set MinSimilarity to a lower value, such as 0.1. There is no firm rule that you can use to determine this range, so it is recommended that you experiment with your data set. Looking at the output from several runs can suggest optimal values to set. For example, you perform a first run by using a threshold of 0.1. You observe that a certain input is matched with a certain output with similarity 0.2. If the tuples are too dissimilar for your application, you can set MinSimilarity to 0.3 for your next run and exclude the match as too dissimilar. Repeating this process for a few iterations on a small test set can help you determine what is appropriate for your application.

If you want to view more than the single best match for each input, set the MaxOutputMatchesPerInput property to a value larger than one. Fuzzy Lookup will then return that many matches for each input row. Note that increasing this value may increase the time it takes to process each input row.
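
The combined effect of MinSimilarity and MaxOutputMatchesPerInput can be sketched as a filtered top-k selection. This is an illustration only: `difflib.SequenceMatcher` is a stand-in for Fuzzy Lookup's similarity function, the reference values are made up, and `top_matches` is a hypothetical helper.

```python
import heapq
from difflib import SequenceMatcher

def top_matches(value, reference, max_output_matches=1, min_similarity=0.0):
    """Return up to max_output_matches (score, value) pairs, best first."""
    scored = (
        (SequenceMatcher(None, value.lower(), ref.lower()).ratio(), ref)
        for ref in reference
    )
    kept = [(score, ref) for score, ref in scored if score >= min_similarity]
    return heapq.nlargest(max_output_matches, kept)

reference = ["E. Virginia", "W. Virginia", "Vermont"]

best_only = top_matches("Virginia", reference)
best_two = top_matches("Virginia", reference, max_output_matches=2)
strict = top_matches("Virginia", reference, max_output_matches=3,
                     min_similarity=0.5)
```

Asking for two matches returns both Virginia rows with identical scores, which makes the ambiguity visible to downstream steps; the similarity floor removes the unrelated value even when three matches are requested.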

Three, Custom properties

1, WarmCaches

By default, the Fuzzy Lookup transformation loads the reference table into its cache. If the reference table is large and the input data is very small, loading the reference table consumes a lot of time and memory. You can set the custom property WarmCaches to False in the Advanced Editor so that Fuzzy Lookup does not preload the reference table into memory before the data flow starts processing rows.

By default, Fuzzy Lookup loads the ETI and reference table into available memory before starting to process rows. If you have only a few rows to process in a particular run, you can reduce this time by setting WarmCaches to False.

2, CopyReferenceTable

3, Delimiters

4, DropExistingMatchIndex

5, Exhaustive

6, MatchIndexOptions

