Python code for Fuzzy query

Source: Internet
Author: User
Tags: sublime text

1. Introduction:

Fuzzy matching is practically a required feature of modern editors and IDEs such as Eclipse. It guesses the file name the user wants from a few characters they type and offers a list of suggestions to choose from.

Examples are as follows:

    • Vim (CTRL-P)

    • Sublime Text (cmd-p)

Fuzzy matching is an extremely useful feature, and it is also very easy to implement.

2. Problem analysis:

We have a collection of strings (file names) that we want to filter by the user's input, which may be only part of a string. Let's take the following collection as an example:

>>> collection = ['django_migrations.py',
...               'django_admin_log.py',
...               'main_generator.py',
...               'migrations.py',
...               'api_user.doc',
...               'user_group.doc',
...               'accounts.txt',
...               ]

When the user enters 'djm', we expect it to match 'django_migrations.py' and 'django_admin_log.py'. The simplest way to implement this is with regular expressions.

3. Solutions:

3.1 Simple regex match

Convert 'djm' to 'd.*j.*m' and try to match this regex against each string in the collection. Every string that matches is listed as a candidate.

>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append(item)
...     return suggestions
...
>>> print(fuzzyfinder('djm', collection))
['django_migrations.py', 'django_admin_log.py']
>>> print(fuzzyfinder('mig', collection))
['django_migrations.py', 'django_admin_log.py', 'main_generator.py', 'migrations.py']

This gives us a list of suggestions based on the user's input, but the suggestions are not ranked: they simply come back in collection order, so the best match may end up last.

That is exactly what happens here: when the user enters 'mig', the best choice, 'migrations.py', comes last.
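One caveat that the code in this section does not address: the characters of the input go into the pattern unchanged, so an input such as '.py' treats '.' as a wildcard. The following is a minimal sketch, assuming every typed character should be matched literally. The helper name build_pattern is hypothetical and is not part of the original code.

```python
import re

def build_pattern(user_input):
    # Hypothetical helper: escape each typed character so that regex
    # metacharacters such as '.' or '*' are matched literally.
    return '.*'.join(re.escape(ch) for ch in user_input)

# Plain letters are unaffected by escaping.
print(build_pattern('djm'))  # d.*j.*m

# Unescaped, the leading '.' matches any character, so the match starts at 0.
print(re.search('.*'.join('.py'), 'migrations.py').start())  # 0

# Escaped, only the literal '.py' suffix is matched.
print(re.search(build_pattern('.py'), 'migrations.py').group())  # .py
```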

3.2 Ranking matches by position

Here we sort the results by the position at which the match starts. For the input 'mig', the start positions are:

'main_generator.py'     - 0
'migrations.py'         - 0
'django_migrations.py'  - 7
'django_admin_log.py'   - 9
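The start positions listed above can be checked with a short sketch. It assumes the same collection and the greedy pattern 'm.*i.*g'; the helper name start_positions is hypothetical.

```python
import re

collection = ['django_migrations.py', 'django_admin_log.py',
              'main_generator.py', 'migrations.py',
              'api_user.doc', 'user_group.doc', 'accounts.txt']

def start_positions(user_input, collection):
    # Hypothetical helper: map each matching item to the index
    # at which the regex match begins.
    regex = re.compile('.*'.join(user_input))
    positions = {}
    for item in collection:
        match = regex.search(item)
        if match:
            positions[item] = match.start()
    return positions

for item, start in start_positions('mig', collection).items():
    print(item, start)
```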

Here is the relevant code:

>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((match.start(), item))
...     return [x for _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('mig', collection))
['main_generator.py', 'migrations.py', 'django_migrations.py', 'django_admin_log.py']

This time we build a list of 2-tuples: the first value of each tuple is the position where the match starts, and the second is the corresponding file name. A list comprehension then returns the file names sorted by match position.

Now we are close to the final result, but it is not perfect yet: the user wants 'migrations.py', but we offer 'main_generator.py' as the first suggestion.
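The sorting step relies on how Python compares tuples: element by element, from left to right. A minimal sketch with the start positions hard-coded for the input 'mig' shows that ties on the start position fall back to alphabetical order of the file name, which is why 'main_generator.py' ends up ahead of 'migrations.py'.

```python
# (start position, file name) pairs for the input 'mig'
suggestions = [(7, 'django_migrations.py'),
               (9, 'django_admin_log.py'),
               (0, 'main_generator.py'),
               (0, 'migrations.py')]

# Tuples sort by their first element, then by the second on ties.
ranked = sorted(suggestions)

# Drop the sort key and keep only the file names.
names = [name for _, name in ranked]
print(names)
```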

3.3 Sorting according to the compactness of the match

When users start typing, they tend to type consecutive characters of the name they are looking for. For example, a user who types 'mig' is more likely looking for 'migrations.py' or 'django_migrations.py' than for 'main_generator.py'. So the change we make here is to prefer the most compact match.

This is no problem in Python: when we match a string with a regular expression, the matched text is available from match.group(). Assuming the input is 'mig', the matches against the 'collection' defined earlier are as follows:

regex = '(m.*i.*g)'
'main_generator.py'    ->  'main_g'
'migrations.py'        ->  'mig'
'django_migrations.py' ->  'mig'
'django_admin_log.py'  ->  'min_log'
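The matched text shown above can be reproduced with a short sketch, again assuming the same collection and the greedy pattern. The helper name matched_texts is hypothetical.

```python
import re

collection = ['django_migrations.py', 'django_admin_log.py',
              'main_generator.py', 'migrations.py',
              'api_user.doc', 'user_group.doc', 'accounts.txt']

def matched_texts(user_input, collection):
    # Hypothetical helper: map each matching item to the exact
    # substring that the (greedy) regex matched.
    regex = re.compile('.*'.join(user_input))
    result = {}
    for item in collection:
        match = regex.search(item)
        if match:
            result[item] = match.group()
    return result

for item, text in matched_texts('mig', collection).items():
    # A shorter matched text means a more compact match.
    print(item, repr(text), len(text))
```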

Here we build a list of 3-tuples: the first value is the length of the matched text, the second is the position where the match starts, and the third is the corresponding file name. The list is then sorted by match length and start position before the file names are returned.

>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((len(match.group()), match.start(), item))
...     return [x for _, _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('mig', collection))
['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']

For our input the results are now nearly perfect, but not quite.

3.4 Non-greedy matching

This subtle problem was discovered by Daniel Rocco. When the collection contains the two elements ['api_user', 'user_group'] and the input is 'user', the expected result (relative order) is ['user_group', 'api_user'], but the actual result is:

>>> print(fuzzyfinder('user', collection))
['api_user.doc', 'user_group.doc']

In the result above, 'api_user' comes before 'user_group'. Looking closer, the input 'user' is expanded to 'u.*s.*e.*r'. Because 'user_group' contains two 'r' characters, the greedy pattern matches 'user_gr' instead of the expected 'user'. The longer match lowers the rank of 'user_group' when sorting by match length, which is not what we want. The problem is easy to solve: change the regex to use non-greedy matching.
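The difference between the two quantifiers can be seen in isolation with a minimal sketch that runs both patterns against 'user_group.doc'.

```python
import re

name = 'user_group.doc'

# Greedy: each '.*' grabs as much as possible, so the final 'r'
# is the last 'r' in the string.
greedy = re.search('u.*s.*e.*r', name).group()

# Non-greedy: each '.*?' grabs as little as possible, so the match
# stops at the first 'r' that completes the pattern.
lazy = re.search('u.*?s.*?e.*?r', name).group()

print(greedy, len(greedy))  # user_gr 7
print(lazy, len(lazy))      # user 4
```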

>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*?'.join(user_input)  # Converts 'djm' to 'd.*?j.*?m'
...     regex = re.compile(pattern)       # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)    # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((len(match.group()), match.start(), item))
...     return [x for _, _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('user', collection))
['user_group.doc', 'api_user.doc']
>>> print(fuzzyfinder('mig', collection))
['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']

Now fuzzyfinder works properly (for the cases above), and we have written only about 10 lines of code to implement a fuzzy finder.

3.5 Summary:

This is a record of how I designed and implemented fuzzy matching in my pgcli project (a PostgreSQL command-line client with auto-completion).

I have extracted fuzzyfinder into a standalone Python package, which you can install with 'pip install fuzzyfinder' and use in your own project.

Thanks to Micah Zoltu and Daniel Rocco for checking the algorithm and fixing the problem.

If you are interested in this, you can find me on Twitter.

4. Conclusion:

When I first considered implementing fuzzy matching in Python, I knew of a good library called fuzzywuzzy. Its approach is different from the one here: it uses Levenshtein distance (edit distance) to find the closest string in a collection. Levenshtein distance is a great technique for correcting spelling errors automatically, but it does not perform well when matching long file names from partial input, so it is not used here.
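To illustrate the point about partial input, here is a minimal sketch of plain edit distance written from scratch. It is not fuzzywuzzy's actual scoring, which is built on top of edit distance, and the function name levenshtein is only illustrative. With a short input such as 'mig', the distance is dominated by the difference in length, so an unrelated short name can rank above a longer name that contains the input.

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, computed row by row.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

for name in ['migrations.py', 'accounts.txt', 'django_migrations.py']:
    print(name, levenshtein('mig', name))
# migrations.py 10
# accounts.txt 12
# django_migrations.py 17
```

By this measure 'accounts.txt' is closer to 'mig' than 'django_migrations.py' is, even though it shares no characters with the input.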
