Python code for Fuzzy query

Last Update:2016-11-09 Source: Internet

Author: User



1. Introduction:

Fuzzy matching is a standard feature of modern editors and IDEs such as Eclipse. It guesses the file name the user wants from a few characters the user has typed, and offers a list of suggestions to choose from.

Examples are as follows:

Vim (Ctrl-P)

Sublime Text (Cmd-P)

Fuzzy matching is an extremely useful feature, and it is also very easy to implement.

2. Problem analysis:

We have a collection of strings (file names) that we want to filter according to the user's input, which may be only part of a string. Take the following collection as an example:

>>> collection = ['django_migrations.py',
...               'django_admin_log.py',
...               'main_generator.py',
...               'migrations.py',
...               'api_user.doc',
...               'user_group.doc',
...               'accounts.txt',
...               ]

When the user types 'djm', we want it to match 'django_migrations.py' and 'django_admin_log.py'. The simplest way to implement this is with regular expressions.
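The matching rule described above (the typed characters appear in the file name in the same order, not necessarily next to each other) can also be checked without regular expressions. The following is only a sketch to make the rule concrete; the helper name is_subsequence is an assumption for illustration and is not part of the article's solution, which uses regular expressions.

```python
def is_subsequence(user_input, name):
    """Return True if the characters of user_input occur in name, in order."""
    remaining = iter(name)
    # `ch in remaining` consumes the iterator up to and including the first hit,
    # so each later character is only searched for after the previous one.
    return all(ch in remaining for ch in user_input)


collection = ['django_migrations.py',
              'django_admin_log.py',
              'main_generator.py',
              'migrations.py',
              'api_user.doc',
              'user_group.doc',
              'accounts.txt']

matches = [name for name in collection if is_subsequence('djm', name)]
print(matches)  # ['django_migrations.py', 'django_admin_log.py']
```

This check only answers "does it match?". The regular expression approach is used in the rest of the article because it also reports where the match starts and how long it is.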

3. Solutions:

3.1 Plain regular expression matching

Convert 'djm' to 'd.*j.*m' and try to match this regular expression against each string in the collection. Every string that matches is listed as a candidate.

>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append(item)
...     return suggestions
...
>>> print(fuzzyfinder('djm', collection))
['django_migrations.py', 'django_admin_log.py']
>>> print(fuzzyfinder('mig', collection))
['django_migrations.py', 'django_admin_log.py', 'main_generator.py', 'migrations.py']
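One detail the pattern construction above does not handle is input that contains regular expression metacharacters. A '.' typed by the user acts as a wildcard, and an input such as 'a(' does not compile as a pattern at all. A possible fix, sketched here as an assumption rather than as part of the article's code, is to pass each character through re.escape before joining. The helper name build_pattern is hypothetical.

```python
import re


def build_pattern(user_input):
    # Escape every character so that '.', '(' and similar are matched literally.
    return '.*'.join(re.escape(ch) for ch in user_input)


unescaped = re.compile('.*'.join('a.t'))    # 'a.*..*t': the '.' is a wildcard
escaped = re.compile(build_pattern('a.t'))  # 'a.*\..*t': the '.' is literal

print(bool(unescaped.search('main_generator.py')))  # True (unwanted match)
print(bool(escaped.search('main_generator.py')))    # False
print(bool(escaped.search('accounts.txt')))         # True
```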

Here we get a list of suggestions based on the user's input, but the suggestions are not ranked in any way: they simply appear in the order of the collection. The best match may therefore end up last.

That is exactly what happens here: when the user types 'mig', the best option, 'migrations.py', is placed at the end.

3.2 Ranking the matches by position

Here we sort the results by the first occurrence of the match.

'main_generator.py'     - 0
'migrations.py'         - 0
'django_migrations.py'  - 7
'django_admin_log.py'   - 9

Here is the relevant code:

>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((match.start(), item))
...     return [x for _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('mig', collection))
['main_generator.py', 'migrations.py', 'django_migrations.py', 'django_admin_log.py']

This time each element of the list is a 2-tuple: the first value is the position where the match starts, and the second value is the corresponding file name. A list comprehension then sorts the tuples by match position and returns only the file names.

Now we are close to the end result, but it is not perfect: the user wants 'migrations.py', but we put 'main_generator.py' first.
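The ranking relies on how Python compares tuples: element by element, from left to right. The short sketch below uses the start positions listed earlier in this section to show the effect in isolation. It also shows why 'main_generator.py' comes first: both names start matching at position 0, so the tie is broken by comparing the file names alphabetically.

```python
# (start position of the match, file name), in the order the loop finds them
suggestions = [
    (7, 'django_migrations.py'),
    (9, 'django_admin_log.py'),
    (0, 'main_generator.py'),
    (0, 'migrations.py'),
]

# Tuples sort by their first element; equal first elements fall back to the second.
ranked = sorted(suggestions)
names = [name for _, name in ranked]
print(names)
# ['main_generator.py', 'migrations.py', 'django_migrations.py', 'django_admin_log.py']
```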

3.3 Sorting according to the compactness of the match

When users start typing, they tend to type consecutive characters of the name they are looking for. For example, a user who types 'mig' is more likely looking for 'migrations.py' or 'django_migrations.py' than for 'main_generator.py'. The change we make here is therefore to prefer the most compact match.

This is easy to do in Python: when a regular expression matches, the matched text is available from match.group(). Assuming the input is 'mig', the matches against the 'collection' defined earlier are as follows:

regex = '(m.*i.*g)'

'main_generator.py'    ->  'main_g'
'migrations.py'        ->  'mig'
'django_migrations.py' ->  'mig'
'django_admin_log.py'  ->  'min_log'
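The matched text listed above can be reproduced directly with re.search. This is a small sketch that prints the matched text, its length and its start position for each name; the dictionary name results is only used for illustration.

```python
import re

regex = re.compile('m.*i.*g')
names = ['main_generator.py',
         'migrations.py',
         'django_migrations.py',
         'django_admin_log.py']

results = {}
for name in names:
    match = regex.search(name)
    # match.group() is the text the pattern actually covered.
    results[name] = (match.group(), match.start())
    print(name, '->', match.group(), len(match.group()), match.start())
```

The length of match.group() is the compactness measure used for ranking: 3 for the two 'mig' matches, 6 for 'main_g' and 7 for 'min_log'.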

Here each element of the suggestion list is a 3-tuple: the first value is the length of the matched text, the second value is the position where the match starts, and the third value is the corresponding file name. The list is then sorted by match length and start position before the file names are returned.

>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((len(match.group()), match.start(), item))
...     return [x for _, _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('mig', collection))
['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']

For our input, the results are now nearly perfect, but not quite.

3.4 Non-greedy matching

This subtle problem was discovered by Daniel Rocco. When the collection contains the two elements ['api_user', 'user_group'], the expected result for the input 'user' (in relative order) is ['user_group', 'api_user'], but the actual result is:

>>> print(fuzzyfinder('user', collection))
['api_user.doc', 'user_group.doc']

In the result above, 'api_user' is ranked ahead of 'user_group'. Looking closer, this is because the input 'user' is expanded to 'u.*s.*e.*r'. Since 'user_group' contains two 'r' characters, the greedy pattern matches 'user_gr' instead of the expected 'user'. The longer match lowers the item's rank, which defeats the compactness ranking. The problem is easy to solve: change the regular expression to use non-greedy matching.

>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*?'.join(user_input)  # Converts 'djm' to 'd.*?j.*?m'
...     regex = re.compile(pattern)       # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)    # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((len(match.group()), match.start(), item))
...     return [x for _, _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('user', collection))
['user_group.doc', 'api_user.doc']
>>> print(fuzzyfinder('mig', collection))
['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']

Now fuzzyfinder works (for the cases above), and we have written just 10 lines of code to implement a fuzzy finder.
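To see the difference between the greedy and the non-greedy pattern in isolation, the sketch below runs both against 'user_group.doc'. The last example uses a made-up string, 'mxig_mig', to illustrate a remaining limitation that follows from how re.search works: the non-greedy match is the shortest one from the leftmost possible start, not necessarily the shortest one in the whole string.

```python
import re

name = 'user_group.doc'

greedy = re.search('u.*s.*e.*r', name).group()
lazy = re.search('u.*?s.*?e.*?r', name).group()
print(greedy)  # user_gr  (the last '.*' runs on to the second 'r')
print(lazy)    # user     (each '.*?' stops as early as possible)

# Hypothetical name: the search starts at the first 'm', so the
# tighter 'mig' at the end of the string is never considered.
leftmost = re.search('m.*?i.*?g', 'mxig_mig').group()
print(leftmost)  # mxig
```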

3.5 Conclusion:

This is a record of how I designed and implemented fuzzy matching for my pgcli project (a PostgreSQL command-line client with auto-completion).

I have extracted fuzzyfinder into a standalone Python package, which you can install with 'pip install fuzzyfinder' and use in your own project.

Thanks to Micah Zoltu and Daniel Rocco for checking the algorithm and fixing the problem.

If you are interested in this, you can reach me on Twitter.

4. Closing note:

When I first considered implementing fuzzy matching in Python, I already knew of a good library called fuzzywuzzy. Its approach is different from the one described here: it uses the Levenshtein distance (edit distance) to find the best-matching string in a collection. Levenshtein distance is a great technique for correcting spelling errors automatically, but it does not perform well when matching long file names from a few typed characters, so it is not used here.
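The weakness with short input can be illustrated with a plain edit-distance function. The sketch below is a textbook dynamic-programming implementation written for this illustration; it is not fuzzywuzzy's code, and fuzzywuzzy's actual scoring functions are not reproduced here. Because every character missing from the input counts as one edit, the distance is dominated by the length of the file name.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ''
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


for name in ['migrations.py', 'accounts.txt', 'django_migrations.py']:
    print(name, levenshtein('mig', name))
# migrations.py 10
# accounts.txt 12
# django_migrations.py 17
```

Ranked by this distance alone, 'accounts.txt', which shares no character with 'mig', would come before 'django_migrations.py', which contains 'mig' verbatim.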


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
