1. Introductory Remarks:
Fuzzy matching is practically a must-have feature in modern editors and IDEs such as Eclipse. It guesses which file the user wants from a few typed characters and offers a list of suggestions to choose from.
Fuzzy matching is an extremely useful feature, and it is also very easy to implement.
2. Problem Analysis:
We have a collection of strings (file names) that we want to filter based on the user's input, which may be only part of a file name. Let's take the following collection as an example:
>>> collection = ['django_migrations.py',
...               'django_admin_log.py',
...               'main_generator.py',
...               'migrations.py',
...               'api_user.doc',
...               'user_group.doc',
...               'accounts.txt',
...               ]
When the user enters 'djm', we expect it to match 'django_migrations.py' and 'django_admin_log.py', since both contain the letters d, j and m in that order. The simplest way to implement this is with regular expressions.
3. Solution:
3.1 Simple regex match
Convert 'djm' to 'd.*j.*m' and try to match this regex against each string in the collection. Every string that matches is listed as a candidate.
>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append(item)
...     return suggestions
...
>>> print(fuzzyfinder('djm', collection))
['django_migrations.py', 'django_admin_log.py']
>>> print(fuzzyfinder('mig', collection))
['django_migrations.py', 'django_admin_log.py', 'main_generator.py', 'migrations.py']
This gives us a list of suggestions based on the user's input, but the suggestions are not ranked in any way, so the best match may well end up at the bottom of the list.
That is exactly what happens here: when the user enters 'mig', the best option, 'migrations.py', comes last.
3.2 Ranking the matching list
Here we sort the results by the position at which the match first occurs.
'main_generator.py' - 0
'migrations.py' - 0
'django_migrations.py' - 7
'django_admin_log.py' - 9
Here is the relevant code:
>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((match.start(), item))
...     return [x for _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('mig', collection))
['main_generator.py', 'migrations.py', 'django_migrations.py', 'django_admin_log.py']
This time we build a list of 2-tuples: the first element of each tuple is the position where the match starts, and the second is the file name. Sorting the list orders the tuples by start position (ties are broken by file name), and a list comprehension then returns just the file names.
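To see why sorting the tuples is enough, here is a small sketch of the ordering step on its own. The (start, name) pairs are the positions listed earlier for the input 'mig', written out by hand for illustration. The variable names are chosen for this example only.

```python
# Hand-written (start position, file name) pairs for the input 'mig'.
suggestions = [
    (7, 'django_migrations.py'),
    (9, 'django_admin_log.py'),
    (0, 'main_generator.py'),
    (0, 'migrations.py'),
]

# Tuples compare element by element: first by start position,
# then by file name when two start positions are equal.
ranked = [name for _, name in sorted(suggestions)]
print(ranked)
# ['main_generator.py', 'migrations.py', 'django_migrations.py', 'django_admin_log.py']
```

Both 'main_generator.py' and 'migrations.py' start at position 0, so the alphabetical tie-break decides between them, which is why 'main_generator.py' ends up first.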
We are now close to the end result, but it is not perfect yet: the user wants 'migrations.py', but we offer 'main_generator.py' as the first suggestion.
3.3 Sorting by the compactness of the match
When users start typing, they tend to type consecutive characters, aiming for an exact match. For example, a user who types 'mig' is far more likely to be looking for 'migrations.py' or 'django_migrations.py' than for 'main_generator.py'. So the change we make here is to prefer the most compact match.
This is no problem in Python, because when we match a string with a regex, the matched text is available from match.group(). Assuming the input is 'mig', the matches against the 'collection' defined earlier are as follows:
regex = '(m.*i.*g)'
'main_generator.py' -> 'main_g'
'migrations.py' -> 'mig'
'django_migrations.py' -> 'mig'
'django_admin_log.py' -> 'min_log'
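The matches listed above can be reproduced with a few lines. This is only an illustrative sketch: the `names` list repeats the four matching file names from the collection, and `report` is a name chosen here for the example.

```python
import re

names = ['main_generator.py', 'migrations.py',
         'django_migrations.py', 'django_admin_log.py']

regex = re.compile('m.*i.*g')  # greedy pattern built from the input 'mig'

# Map each file name to (matched text, match length, start position).
report = {}
for name in names:
    match = regex.search(name)
    report[name] = (match.group(), len(match.group()), match.start())

for name, details in report.items():
    print(name, details)
# main_generator.py ('main_g', 6, 0)
# migrations.py ('mig', 3, 0)
# django_migrations.py ('mig', 3, 7)
# django_admin_log.py ('min_log', 7, 9)
```

The match length is the measure of compactness: 3 for the two 'mig' matches, 6 for 'main_g', and 7 for 'min_log'.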
Here we build a list of 3-tuples: the first element is the length of the matched text, the second is the position where the match starts, and the third is the file name. The list is then sorted by match length and start position before the file names are returned.
>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*'.join(user_input)  # Converts 'djm' to 'd.*j.*m'
...     regex = re.compile(pattern)      # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)   # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((len(match.group()), match.start(), item))
...     return [x for _, _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('mig', collection))
['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']
For this input the results now look almost perfect, but we are not quite done.
3.4 Non-greedy matching
This subtle problem was discovered by Daniel Rocco: when the collection contains the two items ['api_user', 'user_group'], the expected result (in relative order) for the input 'user' is ['user_group', 'api_user'], but the actual result is:
>>> print(fuzzyfinder('user', collection))
['api_user.doc', 'user_group.doc']
The input 'user' expands to 'u.*s.*e.*r'. Because 'user_group' contains two 'r's, the greedy pattern matches 'user_gr' instead of the expected 'user'. The longer match pushes the item down the ranking, which is the opposite of what we want. The problem is easy to solve, though: change the regex to use non-greedy matching ('.*?' instead of '.*').
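The difference between the two quantifiers can be seen directly on 'user_group.doc'. This is a minimal sketch using only the standard `re` module. The variable names are chosen for this example.

```python
import re

name = 'user_group.doc'

# '.*' is greedy: each gap stretches as far as it can, so the final 'r'
# that gets matched is the last one in the string.
greedy = re.search('u.*s.*e.*r', name).group()

# '.*?' is non-greedy: each gap stays as short as possible.
lazy = re.search('u.*?s.*?e.*?r', name).group()

print(greedy)  # user_gr
print(lazy)    # user
```

With the greedy pattern the match length is 7, which ranks 'user_group.doc' below 'api_user.doc' (length 4). With the non-greedy pattern both lengths are 4, and the earlier start position puts 'user_group.doc' first.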
>>> import re
>>> def fuzzyfinder(user_input, collection):
...     suggestions = []
...     pattern = '.*?'.join(user_input)  # Converts 'djm' to 'd.*?j.*?m'
...     regex = re.compile(pattern)       # Compiles a regex.
...     for item in collection:
...         match = regex.search(item)    # Checks if the current item matches the regex.
...         if match:
...             suggestions.append((len(match.group()), match.start(), item))
...     return [x for _, _, x in sorted(suggestions)]
...
>>> print(fuzzyfinder('user', collection))
['user_group.doc', 'api_user.doc']
>>> print(fuzzyfinder('mig', collection))
['migrations.py', 'django_migrations.py', 'main_generator.py', 'django_admin_log.py']
Now fuzzyfinder works (for the cases above), and we have written only about 10 lines of code to implement a fuzzy finder.
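One case the walkthrough does not cover is input that contains regex metacharacters, such as the dot in a file extension. The sketch below is an assumption-laden variant, not part of the original write-up: `fuzzyfinder_escaped` is a hypothetical name, and the only change is that each typed character is passed through `re.escape` before the pattern is joined.

```python
import re

def fuzzyfinder_escaped(user_input, collection):
    """Hypothetical variant that treats every typed character literally."""
    # re.escape turns '.' into '\.', so a typed dot matches only a real dot.
    pattern = '.*?'.join(re.escape(char) for char in user_input)
    regex = re.compile(pattern)
    suggestions = []
    for item in collection:
        match = regex.search(item)
        if match:
            suggestions.append((len(match.group()), match.start(), item))
    return [x for _, _, x in sorted(suggestions)]

collection = ['django_migrations.py', 'django_admin_log.py',
              'main_generator.py', 'migrations.py',
              'api_user.doc', 'user_group.doc', 'accounts.txt']

print(fuzzyfinder_escaped('a.t', collection))   # ['accounts.txt']
print(fuzzyfinder_escaped('user', collection))  # ['user_group.doc', 'api_user.doc']
```

Without escaping, the input 'a.t' would also match 'main_generator.py', for example, because the typed dot would stand for any character. For plain letters the behaviour is unchanged.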
3.5 Conclusion:
The above is a record of how I designed and implemented fuzzy matching in my pgcli project (a PostgreSQL command-line client with auto-completion).
I have extracted fuzzyfinder into a separate Python package. You can install it with 'pip install fuzzyfinder' and use it in your project.
Thanks to Micah Zoltu and Daniel Rocco for reviewing the algorithm and fixing the problems.
If you are interested in this, you can find me on Twitter.
4. Epilogue:
When I first considered implementing fuzzy matching in Python, I knew of a good library called fuzzywuzzy. But fuzzywuzzy takes a different approach from the one here: it uses the Levenshtein distance (edit distance) to find the closest-matching string in a collection. Levenshtein distance is a great technique for automatically correcting spelling errors, but it does not perform well when matching long file names from partial substrings, so it is not used here.
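A rough sketch of why a raw edit distance fits this task poorly. The `levenshtein` function below is a textbook dynamic-programming implementation written for this illustration. It is not fuzzywuzzy's code, and fuzzywuzzy's own scoring may differ from a raw distance.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from '' to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ''
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

for name in ['migrations.py', 'api_user.doc', 'django_migrations.py']:
    print(name, levenshtein('mig', name))
# migrations.py 10
# api_user.doc 11
# django_migrations.py 17
```

The distance is dominated by the length of the file name: 'api_user.doc', which does not contain 'mig' at all, scores closer (11) than 'django_migrations.py' (17), which does.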
References:
[1] FuzzyFinder - in 10 lines of Python
http://blog.amjith.com/fuzzyfinder-in-10-lines-of-python
[2] mycli: MySQL client that supports auto-completion and syntax highlighting
http://hao.jobbole.com/mycli-mysql/
https://github.com/dbcli/mycli
[3] pgcli: Postgres CLI with autocompletion and syntax highlighting
https://github.com/dbcli/pgcli