Natural Language 20: The Corpora with NLTK

Source: Internet
Author: User
Tags: nltk

https://www.pythonprogramming.net/nltk-corpus-corpora-tutorial/?completed=/lemmatizing-nltk-tutorial/

The corpora with NLTK




In this part of the tutorial, I want us to take a moment to peek into the corpora we all downloaded. The NLTK corpus is a massive dump of all kinds of natural language data sets, and it is definitely worth taking a look at.

Almost all of the files in the NLTK corpus follow the same rules for accessing them through the NLTK module, and there is nothing magical about them. These files are plain text files for the most part; some are XML and some are other formats, but they are all accessible by you manually, or via the module and Python. Let's talk about viewing them manually.

Depending on your installation, your nltk_data directory might be hiding in a multitude of locations. To figure out where it is, head to your Python directory, where the NLTK module is. If you don't know where that is, use the following code:

import nltk
print(nltk.__file__)

Run that, and the output will be the location of the NLTK module's __init__.py. Head into the NLTK directory, and then look for the data.py file.
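If you would rather not browse for the file by hand, the same lookup can be sketched with the standard library. The helper name `sibling_module_file` is hypothetical (it is not part of NLTK). It assumes the package is installed in the current environment and returns None otherwise.

```python
import importlib.util
from pathlib import Path


def sibling_module_file(package, filename):
    """Return the path of `filename` inside `package`'s directory, or None.

    Hypothetical helper: it locates the package without importing it.
    """
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package is not installed in this environment
    candidate = Path(spec.origin).parent / filename
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    # Prints the location of nltk/data.py, or None if NLTK is missing.
    print(sibling_module_file("nltk", "data.py"))
```

Open the path it prints in any text editor to read the search locations yourself.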

The important blurb of code is:

if sys.platform.startswith('win'):
    # Common locations on Windows:
    path += [
        str(r'C:\nltk_data'), str(r'D:\nltk_data'), str(r'E:\nltk_data'),
        os.path.join(sys.prefix, str('nltk_data')),
        os.path.join(sys.prefix, str('lib'), str('nltk_data')),
        os.path.join(os.environ.get(str('APPDATA'), str('C:\\')), str('nltk_data'))
    ]
else:
    # Common locations on UNIX & OS X:
    path += [
        str('/usr/share/nltk_data'),
        str('/usr/local/share/nltk_data'),
        str('/usr/lib/nltk_data'),
        str('/usr/local/lib/nltk_data')
    ]

There, you can see the various possible directories for nltk_data. If you're on Windows, chances are it's in your AppData, in the local directory. To get there, open your file browser, go to the top, and type in %appdata%

Next, click on Roaming, and then find the nltk_data directory. In there, you'll have your corpora directory. The full path is something like:
C:\Users\yourname\AppData\Roaming\nltk_data\corpora

Within here, you have all of the available corpora, including things like books, chat logs, movie reviews, and a whole lot more.
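To check this from Python rather than the file browser, a minimal sketch along these lines lists what sits in a corpora folder. The function name `list_corpora` is hypothetical, and the APPDATA-based location is only an assumption taken from the Windows example path; substitute whichever nltk_data location applies on your machine.

```python
from pathlib import Path


def list_corpora(nltk_data_dir):
    """Return the sorted entry names found under <nltk_data_dir>/corpora.

    Hypothetical helper: returns an empty list if the folder does not exist.
    """
    corpora_dir = Path(nltk_data_dir) / "corpora"
    if not corpora_dir.is_dir():
        return []
    return sorted(entry.name for entry in corpora_dir.iterdir())


if __name__ == "__main__":
    import os

    # Assumed Windows-style location; adjust for your system.
    appdata = os.environ.get("APPDATA", "")
    for name in list_corpora(Path(appdata) / "nltk_data"):
        print(name)
```

An empty result usually just means the guess at the directory was wrong, so try the next candidate location.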

Now, we're going to talk about accessing these documents via NLTK. As you can see, these are mostly text documents, so you could just use normal Python code to open and read documents. That said, the NLTK module has a few nice methods for handling the corpus, so you may find it useful to use their methodology. Here's an example of us opening the King James Bible from the Gutenberg corpus and printing the first few sentences:

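from nltk.tokenize import sent_tokenize
from nltk.corpus import gutenberg

# sample text: the raw contents of the King James Bible as one string
sample = gutenberg.raw("bible-kjv.txt")

# split the raw text into sentences
tok = sent_tokenize(sample)

# print the first five sentences
for x in range(5):
    print(tok[x])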
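For comparison, the plain-Python route mentioned above needs nothing from NLTK. This is a rough sketch rather than the tutorial's own code: `first_lines` is a hypothetical helper, and the file path you pass in depends on where nltk_data lives on your machine.

```python
from itertools import islice


def first_lines(path, n=5, encoding="utf-8"):
    """Return the first n non-blank lines of a text file, stripped."""
    with open(path, encoding=encoding, errors="replace") as handle:
        # Skip blank lines lazily so only the start of the file is read.
        non_blank = (line.strip() for line in handle if line.strip())
        return list(islice(non_blank, n))
```

You would call it with the path to a corpus file, for example bible-kjv.txt inside a gutenberg folder under corpora (that folder name is an assumption about how the download unpacks). Note that this yields lines, not sentences, which is one reason the sentence tokenizer is handy.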

One of the more advanced data sets in here is WordNet. WordNet is a collection of words, definitions, examples of their use, synonyms, antonyms, and more. We'll dive into using WordNet next.
