Tokenizing Words and Sentences with NLTK

Source: Internet
Author: User
Tags: nltk


Welcome to a natural language processing tutorial series, using the Natural Language Toolkit, or NLTK, module with Python.

The NLTK module is a massive toolkit, aimed at helping you with the entire natural language processing (NLP) methodology. NLTK will aid you with everything from splitting sentences out of paragraphs, splitting up words, and recognizing the part of speech of those words, to highlighting the main subjects, and then even helping your machine to understand what the text is all about. In this series, we're going to tackle the field of opinion mining, or sentiment analysis.

In our path to learning how to do sentiment analysis with NLTK, we're going to learn the following:

    • Tokenizing - splitting sentences and words from the body of text.
    • Part of speech tagging
    • Machine learning with the Naive Bayes classifier
    • How to tie in Scikit-learn (sklearn) with NLTK
    • Training classifiers with datasets
    • Performing live, streaming sentiment analysis with Twitter.
    • ... and much more.

In order to get started, you are going to need the NLTK module, as well as Python.

If you don't have Python yet, go download the latest version of Python if you are on Windows. If you are on Mac or Linux, you should be able to run apt-get install python3

Next, you're going to need NLTK 3. The easiest method to install the NLTK module is going to be with pip.

For all users, that is done by opening up cmd.exe, bash, or whatever shell you use and typing:
pip install nltk

Next, we need to install some of the components for NLTK. Open Python via whatever means you normally do, and type:

import nltk
nltk.download()

Unless you are operating headless, a GUI will pop up like this, only probably with red instead of green:

Choose to download "all" for all packages, and then click 'Download.' This will give you all of the tokenizers, chunkers, other algorithms, and all of the corpora. If space is an issue, you can elect to selectively download packages manually. The NLTK module will take up about 7 MB, and the entire nltk_data directory will take up about 1.8 GB, which includes your chunkers, parsers, and the corpora.

If you are operating headless, like on a VPS, you can install everything by running Python and doing:

import nltk
nltk.download()

d (for download)

all (for download everything)

That'll download everything for you headlessly.

Now that you have all the things that you need, let's knock out some quick vocabulary:

    • Corpus - Body of text, singular. Corpora is the plural of this. Example: a collection of medical journals.
    • Lexicon - Words and their meanings. Example: an English dictionary. Consider, however, that various fields will have different lexicons. For example: to a financial investor, the first meaning for the word "bull" is someone who is confident about the market, as compared to the common English lexicon, where the first meaning for the word "bull" is an animal. As such, there is a special lexicon for financial investors, doctors, children, mechanics, and so on.
    • Token - Each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenized the sentences out of a paragraph.

These are the words you'll most commonly hear upon entering the natural language processing (NLP) space, but there are many more that we'll be covering in time. With this, let's show an example of how one might actually tokenize something into tokens with the NLTK module.

from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))

At first, you may think tokenizing by things like words or sentences is a rather trivial enterprise. For many sentences it can be. The first step would likely be doing a simple .split('. '), or splitting by period followed by a space. Then maybe you would bring in some regular expressions to split by period, space, and then a capital letter. The problem is that things like "Mr. Smith" would cause you trouble, and many other things would too. Splitting by word is also a challenge, especially when considering contractions like "we" and "are" to "we're". NLTK is going to go ahead and just save you a ton of time with this seemingly simple, yet very complex, operation.
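To see why the naive approach falls apart, here is a quick sketch (illustrative only; the sentence and expected result are made up for this example, and sent_tokenize handles these cases for you):

```python
# A naive sentence splitter: split on a period followed by a space.
text = "Hello Mr. Smith, how are you doing today? The weather is great."

naive = text.split(". ")
print(naive)
# The abbreviation "Mr." triggers a bogus split, and the "?" sentence
# boundary is missed entirely:
# ['Hello Mr', 'Smith, how are you doing today? The weather is great.']
```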

The above code will output the sentences, split up into a list of sentences, which you can do things like iterate through with a for loop.
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]
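Iterating over the sentence tokens is then a plain for loop. A minimal sketch, with the list above hard-coded so it stands alone:

```python
# Sentence tokens, as produced by sent_tokenize above.
sentences = [
    'Hello Mr. Smith, how are you doing today?',
    'The weather is great, and Python is awesome.',
    'The sky is pinkish-blue.',
    "You shouldn't eat cardboard.",
]

# Iterate through the list of sentence tokens, one per line.
for i, sentence in enumerate(sentences, start=1):
    print(i, sentence)
```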

So there, we have created tokens, which are sentences. Let's tokenize by word instead this time:

print(word_tokenize(EXAMPLE_TEXT))
Now our output is:
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']

There are a few things to note here. First, notice that punctuation is treated as a separate token. Also, notice the separation of the word "shouldn't" into "should" and "n't". Finally, notice that "pinkish-blue" is indeed treated like the "one word" it was meant to be. Pretty cool!
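Because punctuation comes back as separate tokens, it is easy to drop when you only want words. A small sketch (the token list is a shortened copy of the output above, and the filtering rule is just one reasonable choice):

```python
import string

# Word tokens, as produced by word_tokenize above (truncated).
tokens = ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing',
          'today', '?']

# Keep any token containing at least one non-punctuation character;
# this keeps "Mr." and "pinkish-blue" but drops "," and "?".
words = [t for t in tokens if any(ch not in string.punctuation for ch in t)]
print(words)
# → ['Hello', 'Mr.', 'Smith', 'how', 'are', 'you', 'doing', 'today']
```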

Now, looking at these tokenized words, we have to begin thinking about what our next step might be. We start to ponder how we might derive meaning by looking at these words. We can clearly think of ways to put value on many words, but we also see a few words that are basically worthless. These are a form of "stop words," which we can also handle. That's what we're going to be talking about in the next tutorial.
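As a tiny preview of that idea, stop-word filtering is just membership testing against a stop list. The hand-picked set below is an assumption for this sketch; NLTK ships a much fuller list, covered in the next tutorial:

```python
# A tiny, hand-picked stop list (NLTK's real one is far more complete).
stop_words = {'is', 'the', 'and', 'you', 'are', 'how'}

tokens = ['how', 'are', 'you', 'doing', 'today', 'the', 'weather',
          'is', 'great', 'and', 'python', 'is', 'awesome']

# Keep only the tokens that carry some value.
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# → ['doing', 'today', 'weather', 'great', 'python', 'awesome']
```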

