A tutorial on using the NLTK library in Python to extract word stems

Source: Internet
Author: User
Tags: install, pandas, nltk

What is stemming?

In linguistic morphology and information retrieval, stemming is the process of removing affixes from a word to obtain its most general form, the stem. The stem does not need to be identical to the word's morphological root; it is usually enough that related words map to the same stem, even if that stem is not itself a valid root. Stemming algorithms have appeared in computer science since 1968. Many search engines treat words with the same stem as synonyms, a form of query expansion; this process is called conflation.

For example, an English stemmer should identify the strings "cats", "catlike", and "catty" as based on the root "cat", and "stemmer", "stemming", and "stemmed" as based on the root "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the same root, "fish".
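As a quick illustration (this snippet is an addition, not part of the original article's code), NLTK's Porter stemmer can be tried directly on such words; note that the stems it produces do not always match the idealized roots exactly:

import nltk

# stem_word is the NLTK 2.x method used later in this article;
# newer NLTK releases name it stem()
stemmer = nltk.PorterStemmer()
for word in ['fishing', 'fished', 'fish', 'cats', 'stemming', 'stemmed']:
    print word, '->', stemmer.stem_word(word)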
Selection of the technical solution

Python and R are the two main languages for data analysis. Compared with R, Python is more suitable for data analysis beginners who have a programming background, especially programmers who have already mastered Python. So we chose Python and the NLTK library (Natural Language Toolkit) as the basic framework for text processing. In addition, we need a data presentation tool. For a data analyst, setting up a database, connections, and tables is too heavyweight for quick data analysis, so we use pandas as the structured-data analysis and presentation tool.
Environment Construction

We are using Mac OS X, which comes with Python 2.7 pre-installed.

Install NLTK

sudo pip install nltk

Install Pandas

sudo pip install pandas

For data analysis, what matters most is the result of the analysis. IPython Notebook is an essential tool here: it saves the results of code execution, such as data tables, so the next time you open the notebook you can view them without rerunning the code.

Install IPython Notebook

sudo pip install ipython

Create a working directory and start IPython Notebook inside it; the server will open the page http://127.0.0.1:8080, and the code documents you create are saved under the working directory.

mkdir codes
cd codes
ipython notebook

Text Processing

Data table creation

We use pandas to create a data table from the sampled data, building a DataFrame, the 2D data structure in pandas that supports rows and columns.

from pandas import DataFrame
import pandas as pd

d = ['pets insurance', 'pets insure', 'pet insurance', 'pet insur', 'pet insurance', 'pet insu']
df = DataFrame(d)
df.columns = ['Words']
df

The displayed result:
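(The original screenshot is not preserved; the table below is reconstructed from the code above, so treat it as approximate.)

            Words
0  pets insurance
1     pets insure
2   pet insurance
3       pet insur
4   pet insurance
5        pet insu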

Introduction to NLTK tokenizers

RegexpTokenizer: a regular-expression tokenizer that uses regular expressions to split the text; we will not go into further detail here.
PorterStemmer: a stemmer implementing the Porter stemming algorithm; the principle is described here: http://snowball.tartarus.org/algorithms/english/stemmer.html

As the first step, we create a regular-expression tokenizer that removes special characters such as punctuation marks:

import nltk
tokenizer = nltk.RegexpTokenizer(r'\w+')
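A quick check, added here for illustration, shows how this tokenizer splits a phrase and silently drops punctuation:

print tokenizer.tokenize("pet's insurance!")
# ['pet', 's', 'insurance']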

Next, prepare the data table by adding the column that will hold the stemmed words, plus a Count column whose default value is 1:

df["Stemming Words"] = ""
df["Count" = 1

Read the Words column of the data table and use the Porter stemmer to obtain the stems:

j = 0
while (j <= 5):
    for word in tokenizer.tokenize(df["Words"][j]):
        df["Stemming Words"][j] = df["Stemming Words"][j] + " " + nltk.PorterStemmer().stem_word(word)
    j += 1
df

Good! With this step we have basically completed the text processing. The results are as follows:
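(The original screenshot is missing; the table below is reconstructed from the code above, so treat it as approximate. Each stem string also carries a leading space from the concatenation in the loop.)

            Words Stemming Words  Count
0  pets insurance      pet insur      1
1     pets insure      pet insur      1
2   pet insurance      pet insur      1
3       pet insur      pet insur      1
4   pet insurance      pet insur      1
5        pet insu       pet insu      1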

Group statistics

In pandas, group and sum, saving the statistics to a new DataFrame, uniquewords:

# DataFrame.sort() is the pandas API of that era; newer versions use sort_values()
uniquewords = df.groupby(['Stemming Words'], as_index=False).sum().sort(['Count'])
uniquewords
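Reconstructing the missing screenshot, the grouped and sorted result should be approximately:

  Stemming Words  Count
0       pet insu      1
1      pet insur      5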

Have you noticed? There is still one entry, "pet insu", that failed to be processed correctly.

Spell check

The first thing that comes to mind is a spell checker. In Python we can use the enchant library for misspelled words (the package on PyPI is named pyenchant):

sudo pip install pyenchant

Use enchant to check for spelling errors and obtain suggested words:

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

# save the class above as replacers.py, then:
from replacers import SpellingReplacer
replacer = SpellingReplacer()
replacer.replace('insu')

'insu'

However, the result is still not the "insur" we expected: "insur" is a stem rather than a dictionary word, so a general-purpose dictionary cannot suggest it. Can we approach the problem from a different angle?
Algorithm specificity

User input is highly specific to the industry and usage scenario. Checking spelling against a general English dictionary is clearly not enough, and some words are spelled perfectly correctly yet should really be treated as a different word. But how do we bring this background knowledge into the data analysis?

After some thought, I realized that the most valuable reference corpus lies precisely in the existing analysis results. Let us look at them again:

The five existing occurrences of "pet insur" in fact already provide a reference within the data itself; we can cluster the data against it to remove the remaining noise, as the quick check below suggests.
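As a quick sanity check (this snippet is an addition for illustration; it assumes the python-Levenshtein package is installed, e.g. via sudo pip install python-Levenshtein), the similarity between the two stems is comfortably above the 0.8 threshold used below:

import Levenshtein

# ratio() returns a similarity score between 0 and 1
print Levenshtein.ratio('pet insu', 'pet insur')
# roughly 0.94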

Calculation of similarity

Compute the similarity between the existing results, and merge entries whose deviation is within the threshold into the same cluster:

import Levenshtein

mindistance = 0.8  # minimum similarity ratio for two stems to be merged
distance = -1
lastword = ""
j = 0
while (j < 1):
    lastword = uniquewords["Stemming Words"][j]
    distance = Levenshtein.ratio(uniquewords["Stemming Words"][j], uniquewords["Stemming Words"][j + 1])
    if (distance > mindistance):
        uniquewords["Stemming Words"][j] = uniquewords["Stemming Words"][j + 1]
    j += 1
uniquewords
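Reconstructing the missing output once more, both rows should now carry the same stem:

  Stemming Words  Count
0      pet insur      1
1      pet insur      5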

Looking at the results, the merge has succeeded!

As the final step, group the results and compute the statistics once more:

uniquewords = uniquewords.groupby(['Stemming Words'], as_index=False).sum()
uniquewords
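The final result should collapse into a single row (again a reconstruction, since the original screenshot is not preserved):

  Stemming Words  Count
0      pet insur      6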

With this, we have completed the preliminary text processing.
