Using the NLTK library in Python to extract word stems

Last Update:2015-04-09 Source: Internet

Author: User

Tags: install, pandas, nltk

What is stem extraction?

In linguistic morphology and information retrieval, stem extraction (stemming) is the process of removing affixes from a word to obtain its stem, and it is the most common way of normalizing words. The stem does not need to be identical to the morphological root of the word; it is usually enough that related words map to the same stem, even if that stem is not itself a valid root. Stemming algorithms have been studied in computer science since 1968. Many search engines treat words with the same stem as synonyms for query expansion, a process called conflation.

Consider a stemmer for English. It should identify the strings "cats", "catlike", and "catty" as based on the root "cat", and "stemmer", "stemming", and "stemmed" as based on the root "stem". A stemming algorithm can reduce the words "fishing", "fished", "fish", and "fisher" to the same root, "fish".
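The idea of mapping related words to one stem can be sketched with a deliberately naive suffix stripper. This is not the Porter algorithm and not what NLTK does; `SUFFIXES`, `naive_stem`, and `conflate` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny suffix list; real stemmers use many more rules.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

def conflate(words):
    """Group words that share the same stem."""
    groups = defaultdict(list)
    for word in words:
        groups[naive_stem(word)].append(word)
    return dict(groups)

groups = conflate(["fishing", "fished", "fish", "fisher", "cats"])
print(groups)
```

A rule list this small breaks quickly on real vocabulary, which is one reason to rely on an established stemmer instead.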
Choosing a technical solution

Python and R are the two major data analysis languages. Compared with R, Python is better suited to data analysis beginners who already have a substantial programming background, especially programmers who already know Python. So we chose Python and the NLTK library (Natural Language Toolkit) as the basic framework for text processing. We also need a tool for displaying data. For a data analyst, installing and connecting to a database, creating tables, and similar operations get in the way of fast data analysis, so we use Pandas as the tool for structured data and analysis.

Environment setup

We use Mac OS X, which comes with Python 2.7 pre-installed.

Install NLTK

sudo pip install nltk

Install Pandas

sudo pip install pandas

In data analysis, what matters most is the analysis results. iPython notebook is an essential tool here: it saves the output of executed code, such as data tables, so you do not need to run the code again the next time you open the notebook.

Install iPython notebook

sudo pip install ipython
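Before continuing, it can help to confirm that the packages are importable. The following is a minimal sketch that assumes Python 3 (the article itself uses Python 2.7, where `importlib.util.find_spec` is not available); `check_packages` is a hypothetical helper name.

```python
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Import names of the three packages installed above.
status = check_packages(["nltk", "pandas", "IPython"])
for name, installed in status.items():
    print(name, "OK" if installed else "missing")
```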

Create a working directory and start iPython notebook inside it. The server serves the page at http://127.0.0.1:8080 and saves the code documents you create in the working directory.

mkdir Codes
cd Codes
ipython notebook

Text Processing

Create a data table

Create a data table using Pandas. DataFrame is the two-dimensional Pandas data structure that supports rows and columns.

from pandas import DataFrame
import pandas as pd

d = ['pets insurance', 'pets insure', 'pet insurance', 'pet insur', 'pet insurance"', 'pet insu']
df = DataFrame(d)
df.columns = ['Words']
df

Display result

Introduction to the NLTK tokenizer and stemmer

RegexpTokenizer: a regular expression tokenizer that uses regular expressions to split text.
PorterStemmer: a stemmer that implements the Porter stemming algorithm. The principle is described here: http://snowball.tartarus.org/algorithms/english/stemmer.html

Step 1: Create a regular expression tokenizer that removes special characters such as punctuation marks:

import nltk
# \w+ matches runs of word characters, so punctuation is dropped
tokenizer = nltk.RegexpTokenizer(r'\w+')
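To see what this kind of tokenizer keeps and drops, the standard library `re` module is enough. The sketch below stands in for the NLTK tokenizer rather than reproducing it, and it assumes the intended pattern is `\w+` (runs of word characters); `tokenize` is a hypothetical helper name.

```python
import re

def tokenize(text):
    """Return the runs of word characters in text, dropping punctuation and spaces."""
    return re.findall(r"\w+", text)

print(tokenize('pet insurance"'))   # ['pet', 'insurance']
print(tokenize("pets, insure!"))    # ['pets', 'insure']
```

The stray quotation mark in the sample data disappears at this stage, which is why that entry later stems like the clean ones.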

Next, prepare the data table: add a column to hold the stems and a count column whose default value is 1:

df["Stemming Words"] = ""
df["Count"] = 1

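Read the Words column in the data table and use the Porter stemmer to obtain the stem of each word:

stemmer = nltk.PorterStemmer()
# Stem every token of each phrase and join the stems back into one string.
# stem() is used here because the older stem_word() is not available in recent NLTK releases.
for j in range(len(df)):
    stems = [stemmer.stem(word) for word in tokenizer.tokenize(df.loc[j, "Words"])]
    df.loc[j, "Stemming Words"] = " ".join(stems)
df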
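The same step can also be written without an explicit row loop. This is a sketch under two assumptions: a recent NLTK release, where the stemmer method is `stem()`, and a tokenizer pattern of `\w+`. `stem_phrase` is a hypothetical helper name.

```python
import nltk
import pandas as pd

tokenizer = nltk.RegexpTokenizer(r"\w+")
stemmer = nltk.PorterStemmer()

def stem_phrase(phrase):
    """Tokenize a phrase and join the Porter stems of its tokens."""
    return " ".join(stemmer.stem(token) for token in tokenizer.tokenize(phrase))

df = pd.DataFrame({"Words": ["pets insurance", "pets insure", "pet insurance",
                             "pet insur", 'pet insurance"', "pet insu"]})
# apply() calls stem_phrase once per row of the Words column.
df["Stemming Words"] = df["Words"].apply(stem_phrase)
df["Count"] = 1
print(df)
```

Using `apply` avoids hard-coding the number of rows, so the code keeps working when the table grows.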

Good! At this point the basic text processing is in place, and the results are as follows:

Group statistics

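Group the rows by stem in Pandas, total the counts, and save the resulting table in a new DataFrame, uniqueWords:

# sort_values() is used because the sort() method was removed from newer Pandas versions
uniqueWords = df.groupby('Stemming Words', as_index=False)['Count'].sum().sort_values('Count')
uniqueWords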
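What the grouping computes can be cross-checked without Pandas. The sketch below tallies a hand-written list of stems with `collections.Counter`; the list mirrors the sample data (five "pet insur" and one "pet insu") and is typed in directly rather than read from the DataFrame.

```python
from collections import Counter

# Stems for the six sample phrases: five "pet insur" and one "pet insu".
stems = ["pet insur", "pet insur", "pet insur", "pet insur", "pet insur", "pet insu"]

counts = Counter(stems)
# Sort ascending by count, mirroring the sort on the Count column.
for stem, count in sorted(counts.items(), key=lambda item: item[1]):
    print(stem, count)
```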

Have you noticed? One entry, "pet insu", has still not been merged with the others.

Spelling check

For misspelled words, the first thing that comes to mind is a spelling check. In Python, we can use enchant, which is installed through the pyenchant package:

sudo pip install pyenchant

Use enchant to check for spelling errors and get the suggested replacement:

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        # Correctly spelled words are returned unchanged
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        # Accept the top suggestion only if it is close enough to the input
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

# The class is defined above, so no import from a separate replacers module is needed
replacer = SpellingReplacer()
replacer.replace('insu')
# Output: 'insu'

However, the result is still not the "insur" we expected. Can we approach the problem differently?

Special characteristics of the algorithm

The special characteristics of user input come from the industry and the application scenario. Spell checking against a general English dictionary is clearly not workable here, and some words are spelled correctly but should have been a different word. So how can we bring this background information into the data analysis?

After some thought, I believe the most important reference data is already in the existing analysis results. Let's take a look:

The five existing "pet insur" entries give us a data reference. We can cluster the data to remove further noise.

Similarity Calculation

Calculate the similarity between the existing results, and merge entries whose similarity exceeds the minimum threshold into the same group:

import Levenshtein

minDistance = 0.8  # minimum similarity ratio (0 to 1) for two stems to be merged
# Compare each stem with the next one and merge the pair when they are similar enough
for j in range(len(uniqueWords) - 1):
    current = uniqueWords.loc[j, "Stemming Words"]
    following = uniqueWords.loc[j + 1, "Stemming Words"]
    if Levenshtein.ratio(current, following) > minDistance:
        uniqueWords.loc[j, "Stemming Words"] = following
uniqueWords
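To make the similarity measure less of a black box, here is a pure-Python sketch of the Levenshtein edit distance with a simple normalization. The `similarity` function is an assumption made for illustration and is not the formula used by `Levenshtein.ratio`, so the two can give different numbers for the same pair.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for completely different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

print(edit_distance("pet insu", "pet insur"))      # 1
print(similarity("pet insu", "pet insur") > 0.8)   # True
```

The two stems differ by a single inserted character, which is why they clear a 0.8 threshold comfortably.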

Check the result. The matching is successful!

In the last step, perform grouping statistics on the data results again:

uniqueWords = uniqueWords.groupby(['Stemming Words'], as_index = False).sum()
uniqueWords
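The merge and regroup steps can be combined and generalized to more than two stems. The sketch below swaps in `difflib.SequenceMatcher` from the standard library in place of the Levenshtein package, so its ratios are not guaranteed to match. `merge_similar` is a hypothetical helper, and the "car loan" entry is invented test data that is not part of the article's data set.

```python
from collections import Counter
from difflib import SequenceMatcher

def merge_similar(counts, min_ratio=0.8):
    """Fold each stem into the first sufficiently similar stem already kept, then re-total."""
    merged = Counter()
    # Visit stems from most to least frequent so frequent stems become the canonical forms.
    for stem, count in counts.most_common():
        target = next((kept for kept in merged
                       if SequenceMatcher(None, stem, kept).ratio() > min_ratio), stem)
        merged[target] += count
    return merged

# "car loan" is made-up data showing that unrelated stems stay separate.
merged = merge_similar(Counter({"pet insur": 5, "pet insu": 1, "car loan": 2}))
print(merged)
```

Processing the most frequent stems first reflects the article's idea that the existing majority spelling is the best reference for cleaning up the rare variants.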

At this point, we have completed the preliminary text processing.

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.