Tutorial on using the NLTK library for stemming in Python

Source: Internet
Author: User
Tags: install, pandas, nltk
What is stemming?

In linguistic morphology and information retrieval, stemming is the process of stripping affixes from a word to obtain its stem, the most general form of the word. The stem need not be identical to the word's morphological root; it is usually enough that related words map to the same stem, even when that stem is not itself a valid root. Stemming algorithms have been studied in computer science since 1968. Many search engines treat words with the same stem as synonyms when expanding queries, a process called conflation.

An English stemmer, for example, should identify the strings "cats", "catlike", and "catty" as based on the root "cat", and "stemmer", "stemming", and "stemmed" as based on the stem "stem". A stemming algorithm can reduce the words "fishing", "fished", "fish", and "fisher" to the same root, "fish".
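This behaviour can be checked directly with NLTK's Porter stemmer. A minimal sketch using the NLTK 3 API (where the method is stem(); older NLTK versions used stem_word()):

```python
# Quick check of the Porter stemmer on the example words above.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["cats", "fishing", "fished", "fish", "stemming", "stemmed"]:
    print(word, "->", stemmer.stem(word))
# e.g. cats -> cat, fishing -> fish, stemmed -> stem
```

Note that the Porter algorithm is heuristic: not every related word collapses to the same stem in practice, which is exactly the kind of residue this tutorial deals with later.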
Selecting a technical solution

Python and R are the two main languages for data analysis. Compared with R, Python is better suited to analysts with a programming background, especially programmers who already know the language. We therefore chose Python with the NLTK library (Natural Language Toolkit) as the basic framework for text processing. We also need a tool for presenting data; for a data analyst, the cumbersome work of installing a database, connecting to it, and creating tables is not suited to quick data analysis, so we use pandas as our structured-data and analysis tool.
Environment construction

We are using Mac OS X, which comes with Python 2.7 preinstalled.

Installing NLTK

sudo pip install nltk

Installing Pandas

sudo pip install pandas

For data analysis, the results matter most, and IPython Notebook is an essential tool: it saves code output, such as data tables, alongside the code, so the next time you open the document you can review the results without rerunning anything.

Installing IPython Notebook

sudo pip install ipython

Create a working directory and start IPython Notebook in it. The server will open a page at http://127.0.0.1:8080, and the code documents you create are saved under the working directory.

mkdir codes
cd codes
ipython notebook

Text Processing

Data table creation

Using pandas, we build the sample data into a DataFrame, pandas' two-dimensional data structure that supports rows and columns.

from pandas import DataFrame
import pandas as pd
d = ['pets insurance', 'pets insure', 'pet insurance', 'pet insur', 'pet insurance', 'pet insu']
df = DataFrame(d)
df.columns = ['Words']
df

Show results

Introduction to NLTK tokenizers

RegexpTokenizer: a tokenizer driven by regular expressions; it splits text according to a regular expression and needs little further introduction.
PorterStemmer: a stemmer implementing the Porter stemming algorithm; the principle is described here: http://snowball.tartarus.org/algorithms/english/stemmer.html
As a first step, we create a regular-expression tokenizer that strips out special characters such as punctuation:

import nltk
tokenizer = nltk.RegexpTokenizer(r'\w+')

Next, prepare the data table for processing: add a column to hold the stemmed words, and a count column with a default value of 1:

df["Stemming Words"] = ""
df["Count"] = 1

Read the Words column of the data table and use the Porter stemmer to obtain the stems:

j = 0
while (j <= 5):
    for word in tokenizer.tokenize(df["Words"][j]):
        df["Stemming Words"][j] = df["Stemming Words"][j] + " " + nltk.PorterStemmer().stem_word(word)
    j += 1
df

Good! With this step the text processing is basically done; the results are shown below:
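As an aside, the row-by-row while loop can be written more idiomatically with DataFrame.apply. A sketch under the modern pandas/NLTK 3 APIs (where the stemmer method is stem() rather than the deprecated stem_word()):

```python
import nltk
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
stemmer = nltk.PorterStemmer()

def stem_phrase(phrase):
    # Tokenize the phrase, stem each token, and rejoin with spaces.
    return " ".join(stemmer.stem(word) for word in tokenizer.tokenize(phrase))

# Equivalent to the while loop above:
# df["Stemming Words"] = df["Words"].apply(stem_phrase)
print(stem_phrase("pets insurance"))  # pet insur
```

This avoids chained indexed assignment, which newer pandas versions warn about.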

Grouped statistics

Group and aggregate in pandas, saving the statistics in a new DataFrame, uniqueWords:

uniqueWords = df.groupby(['Stemming Words'], as_index=False).sum().sort(['Count'])
uniqueWords
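Note that DataFrame.sort() belongs to an old pandas API; in current pandas the equivalent call is sort_values(). A self-contained sketch of the same group-and-count step, with illustrative sample values:

```python
import pandas as pd

# Illustrative stemmed data in the shape produced above.
df = pd.DataFrame({
    "Stemming Words": ["pet insur", "pet insur", "pet insu"],
    "Count": [1, 1, 1],
})

# groupby + sum collapses identical stems; sort_values replaces the
# long-removed DataFrame.sort from the original snippet.
uniqueWords = df.groupby(["Stemming Words"], as_index=False).sum().sort_values(["Count"])
print(uniqueWords)
```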

Have you noticed? One "pet insu" still was not processed successfully.

Spell check

For user spelling errors, the first thing that comes to mind is a spell checker. In Python we can use the enchant library:

sudo pip install pyenchant

Use enchant to check for spelling errors and obtain suggested words:

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer(object):
    def __init__(self, dict_name='en', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        self.max_dist = max_dist

    def replace(self, word):
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)
        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

replacer = SpellingReplacer()
replacer.replace('insu')
# 'insu'

However, the result is still not the expected "insur". Can we come up with another idea?
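The acceptance test in replace() relies on NLTK's edit_distance, the Levenshtein distance: a suggestion is taken only if it lies within max_dist edits of the input. Note also that "insur" is not an English dictionary word at all, so no general dictionary can suggest it. A quick illustration of the distance function:

```python
from nltk.metrics import edit_distance

# Levenshtein distance: minimum number of insertions, deletions,
# and substitutions needed to turn one string into the other.
print(edit_distance("insu", "insur"))      # 1 (one insertion)
print(edit_distance("insu", "insurance"))  # 5
```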
Algorithm specificity

The special character of user input comes from the industry and the usage scenario. Using a general-purpose English dictionary to check spelling is clearly not feasible here; moreover, some words are spelled correctly but should really be treated as another word. How, then, do we fold this background knowledge into the data analysis?

After some thought, I believe the most important reference library lies in the analysis results themselves. Let us look again:

The five existing "pet insur" entries already give us a data reference: we can cluster against this data to remove further noise.

Calculation of similarity

Perform a similarity calculation on the existing results, and merge entries whose deviation is within the minimum threshold into the same cluster:

import Levenshtein
minDistance = 0.8
distance = -1
lastWord = ""
j = 0
while (j < 1):
    lastWord = uniqueWords["Stemming Words"][j]
    distance = Levenshtein.ratio(uniqueWords["Stemming Words"][j], uniqueWords["Stemming Words"][j + 1])
    if (distance > minDistance):
        uniqueWords["Stemming Words"][j] = uniqueWords["Stemming Words"][j + 1]
    j += 1
uniqueWords
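Levenshtein.ratio above comes from the third-party python-Levenshtein package. If that package is not available, a rough stand-in from the standard library is difflib.SequenceMatcher; its ratio is computed differently, but it serves the same thresholding purpose:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Ratio in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

print(similarity("pet insu", "pet insur"))  # ~0.94, above the 0.8 threshold
```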

Check the results: the rows now match successfully!

The final step is to group the data again:

uniqueWords = uniqueWords.groupby(['Stemming Words'], as_index=False).sum()
uniqueWords

With this, we have completed the preliminary text processing.
