What is stem extraction?
In linguistic morphology and information retrieval, stemming is the process of stripping affixes from a word to obtain its most general form, the stem. The stem does not need to be identical to the morphological root of the word; it is usually enough that related words map to the same stem, even when that stem is not a valid root on its own. Stemming algorithms have appeared in computer science since 1968. Many search engines treat words with the same stem as synonyms when expanding queries, a process called conflation.
For example, an English stemmer should identify the strings "cats", "catlike", and "catty" as based on the root "cat", and "stemmer", "stemming", and "stemmed" as based on the stem "stem". A stemming algorithm can reduce the words "fishing", "fished", "fish", and "fisher" to the same root, "fish".
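To make this concrete, here is a minimal sketch using NLTK's PorterStemmer, the same stemmer used later in this article. Note that the classic Porter algorithm leaves some related forms untouched (it keeps "fisher" as "fisher", for instance), so the sketch uses "fishes" instead:

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# Reduce several inflections to a common stem.
[stemmer.stem(w) for w in ['fishing', 'fished', 'fishes', 'fish']]
# -> ['fish', 'fish', 'fish', 'fish']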
Selection of technical solutions
Python and R are the two dominant languages for data analysis. For beginners who come from a programming background, and especially for programmers who already know Python, Python is the more approachable of the two. We therefore chose Python with the NLTK library (Natural Language Toolkit) as the basic framework for text processing. We also need a tool for holding and examining the data: for a data analyst, the cumbersome cycle of installing a database, connecting to it, and creating tables gets in the way of quick analysis, so we use pandas for structured data handling and analysis.
Environment construction
We are working on Mac OS X, which comes with Python 2.7 pre-installed.
Installing NLTK
sudo pip install nltk
Installing Pandas
sudo pip install pandas
For data analysis, what matters most is the results. IPython Notebook is an indispensable tool here: it saves code output, such as data tables, alongside the code, so the next time you open the document you can view the results without re-running anything.
Installing IPython Notebook
sudo pip install ipython
Create a working directory and start IPython Notebook from inside it; the server will open a page (by default http://127.0.0.1:8888) and save the code documents you create under the working directory.
mkdir codes
cd codes
ipython notebook
Text Processing
Data table creation
We use pandas to create a data table, loading the sample data into a DataFrame, pandas' two-dimensional data structure with support for rows and columns.
from pandas import DataFrame
import pandas as pd

d = ['pets insurance', 'pets insure', 'pet insurance', 'pet insur', 'pet insurance', 'pet insu']
df = DataFrame(d)
df.columns = ['Words']
df
The result looks like this:
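The original screenshot is not reproduced here; assuming the six sample phrases above, pandas' default display should look roughly like this:

            Words
0  pets insurance
1     pets insure
2   pet insurance
3       pet insur
4   pet insurance
5        pet insu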
Introduction to the NLTK tokenizer and stemmer
RegexpTokenizer: a regular-expression tokenizer that splits text with a regex; it needs little further introduction.
PorterStemmer: an implementation of the Porter stemming algorithm; the principle is described here: http://snowball.tartarus.org/algorithms/english/stemmer.html
As a first step, we create a regular-expression tokenizer that strips out special characters such as punctuation:
import nltk
tokenizer = nltk.RegexpTokenizer(r'\w+')
Next we prepare the data table: add a column to hold the extracted stems, plus a count column with a default value of 1:
df["Stemming Words"] = "" df["Count"] = 1
Then we read the Words column of the table and run each word through the Porter stemmer to extract its stem:
j = 0
while (j <= 5):
    for word in tokenizer.tokenize(df["Words"][j]):
        # stem_word was removed in NLTK 3; use stem() there instead
        df["Stemming Words"][j] = df["Stemming Words"][j] + " " + nltk.PorterStemmer().stem_word(word)
    j += 1
df
Good! With this step, the text processing is basically done. The results are shown below:
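The original screenshot is again not reproduced here; assuming Porter stems "pets" to "pet" and "insurance"/"insure"/"insur" to "insur", the table should look roughly like this (leading spaces from the string concatenation trimmed for readability):

            Words Stemming Words  Count
0  pets insurance      pet insur      1
1     pets insure      pet insur      1
2   pet insurance      pet insur      1
3       pet insur      pet insur      1
4   pet insurance      pet insur      1
5        pet insu       pet insu      1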
Group statistics
We group and aggregate in pandas, saving the statistics into a new DataFrame, uniqueWords:
uniqueWords = df.groupby(['Stemming Words'], as_index=False).sum().sort(['Count'])  # sort() became sort_values() in newer pandas
uniqueWords
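If the stems above are correct, uniqueWords should contain two groups, sorted by Count in ascending order, roughly:

  Stemming Words  Count
0       pet insu      1
1      pet insur      5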
Did you notice? One "pet insu" still has not been merged.
Spell check
Faced with user spelling errors, the first thing that comes to mind is a spell checker; in Python we can use PyEnchant:
sudo pip install pyenchant
We use enchant to check the spelling and obtain suggested corrections:
import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        # Keep words the dictionary already accepts.
        if self.spell_dict.check(word):
            return word
        # Otherwise take the first suggestion, if it is close enough.
        suggestions = self.spell_dict.suggest(word)
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

replacer = SpellingReplacer()
replacer.replace('insu')
# -> 'insu'
However, the result is still not the "insur" we expected. Can we come up with another idea?
Algorithm specificity
A very important peculiarity of user input comes from the industry and the usage scenario. Checking spelling against a general-purpose English dictionary is clearly not feasible here; moreover, some words are spelled correctly yet should really be a different word. But how do we connect this background knowledge with the data analysis?
After some thought, I realized that the most important reference corpus is sitting right in the analysis results. Let's look at them again:
The five existing "pet insur" rows in fact already give us a data reference: we can cluster the data against it and remove the remaining noise.
Calculation of similarity
We run a similarity calculation over the existing results and merge the entries that fall within a minimum deviation into the same cluster:
import Levenshtein

minDistance = 0.8
distance = -1
j = 0
while (j < 1):
    distance = Levenshtein.ratio(uniqueWords["Stemming Words"][j], uniqueWords["Stemming Words"][j + 1])
    if (distance > minDistance):
        uniqueWords["Stemming Words"][j] = uniqueWords["Stemming Words"][j + 1]
    j += 1
uniqueWords
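Levenshtein.ratio('pet insu', 'pet insur') is roughly 0.94, above our 0.8 threshold, so the first stem should be rewritten and the two rows now share one stem, roughly:

  Stemming Words  Count
0      pet insur      1
1      pet insur      5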
Looking at the results, the match has succeeded!
The final step is to regroup the data:
uniqueWords = uniqueWords.groupby(['Stemming Words'], as_index=False).sum()
uniqueWords
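Grouping again should collapse the two rows into one, roughly:

  Stemming Words  Count
0      pet insur      6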
With this, we have completed the preliminary text processing.