https://www.pythonprogramming.net/nltk-corpus-corpora-tutorial/?completed=/lemmatizing-nltk-tutorial/
The corpora with NLTK
In this part of the tutorial, I want us to take a moment to peek into the corpora we all downloaded! The NLTK corpus is a massive dump of all kinds of natural language data sets that is definitely worth taking a look at.
Almost all of the files in the NLTK corpus follow the same rules for accessing them through the NLTK module, but nothing is magical about them. These files are plain text files for the most part, some are XML and some are other formats, but they are all accessible by you manually, or via the module and Python. Let's talk about viewing them manually.
Depending on your installation, your nltk_data directory might be hiding in a multitude of locations. To figure out where it is, head to your Python directory, where the NLTK module is. If you don't know where that is, use the following code:
import nltk
print(nltk.__file__)
Run that, and the output will be the location of the NLTK module's __init__.py. Head into the NLTK directory, and then look for the data.py file.
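If you'd rather not browse there by hand, a small helper can build that path for you. This is only a sketch using the standard library's importlib; the function name find_module_file is made up for this example, and it simply returns None when the package (or the file) isn't there.

```python
import importlib.util
import os


def find_module_file(package, filename):
    """Return the path to `filename` inside `package`'s directory, or None."""
    try:
        # find_spec locates a top-level package without importing it.
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        spec = None
    if spec is None or spec.origin is None:
        return None
    # For a package, spec.origin points at its __init__.py.
    candidate = os.path.join(os.path.dirname(spec.origin), filename)
    return candidate if os.path.isfile(candidate) else None


if __name__ == "__main__":
    # Prints the full path to NLTK's data.py, or None if NLTK isn't installed.
    print(find_module_file("nltk", "data.py"))
```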
The important blurb of code is:
if sys.platform.startswith('win'):
    # Common locations on Windows:
    path += [
        str(r'C:\nltk_data'), str(r'D:\nltk_data'), str(r'E:\nltk_data'),
        os.path.join(sys.prefix, str('nltk_data')),
        os.path.join(sys.prefix, str('lib'), str('nltk_data')),
        os.path.join(os.environ.get(str('APPDATA'), str('C:\\')), str('nltk_data'))
    ]
else:
    # Common locations on UNIX & OS X:
    path += [
        str('/usr/share/nltk_data'),
        str('/usr/local/share/nltk_data'),
        str('/usr/lib/nltk_data'),
        str('/usr/local/lib/nltk_data')
    ]
There, you can see the various possible directories for the nltk_data. If you're on Windows, chances are it's in your AppData directory. To get there, you'll want to open your file browser, go to the top, and type in %appdata%
Next, click on Roaming (if %appdata% has not already dropped you there), and then find the nltk_data directory. In there, you'll have your corpora folder. The full path is something like:
C:\Users\yourname\AppData\Roaming\nltk_data\corpora
Within here, you have all of the available corpora, including things like books, chat logs, movie reviews, and a whole lot more.
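The same look around can be done from Python. The sketch below takes whatever candidate directories you hand it and lists what sits inside each one's corpora folder. The helper name list_corpora is made up for this example, and the APPDATA default is just the Windows location described above; if you have NLTK installed, nltk.data.path holds NLTK's own list of search directories and could be passed in instead.

```python
import os


def list_corpora(data_dirs):
    """Map each existing nltk_data directory to the entries in its corpora folder."""
    found = {}
    for data_dir in data_dirs:
        corpora_dir = os.path.join(data_dir, "corpora")
        # Skip candidates that don't exist on this machine.
        if os.path.isdir(corpora_dir):
            found[data_dir] = sorted(os.listdir(corpora_dir))
    return found


if __name__ == "__main__":
    # Assumption: the default Windows location; swap in your own paths.
    candidates = [os.path.join(os.environ.get("APPDATA", "C:\\"), "nltk_data")]
    for data_dir, names in list_corpora(candidates).items():
        print(data_dir)
        for name in names:
            print("   ", name)
```

Directories that don't exist are quietly skipped, so it should be safe to pass in every location from the data.py list and see which one turns up.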
Now, we're going to talk about accessing these documents via NLTK. As you can see, these are mostly text documents, so you could just use normal Python code to open and read documents. That said, the NLTK module has a few nice methods for handling the corpus, so you may find it useful to use their methodology. Here's an example of us opening the Gutenberg Bible, and reading the first few lines:
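from nltk.tokenize import sent_tokenize, PunktSentenceTokenizer
from nltk.corpus import gutenberg

# sample text: the raw contents of the King James Bible
sample = gutenberg.raw("bible-kjv.txt")

# split the raw text into sentences
tok = sent_tokenize(sample)

# print the first five sentences
for x in range(5):
    print(tok[x])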
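For comparison, the "normal Python" route mentioned above might look something like this. It's only a sketch: read_first_lines is a made-up helper, and the file path in the usage part is an assumption based on the Windows layout shown earlier, so adjust it to wherever your nltk_data actually lives.

```python
import os


def read_first_lines(path, count=5, encoding="utf-8"):
    """Return the first `count` non-empty lines of a text file."""
    lines = []
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            if len(lines) >= count:
                break
            line = line.strip()
            if line:
                lines.append(line)
    return lines


if __name__ == "__main__":
    # Assumed location of the Gutenberg Bible inside nltk_data on Windows.
    bible = os.path.join(os.environ.get("APPDATA", "C:\\"), "nltk_data",
                         "corpora", "gutenberg", "bible-kjv.txt")
    if os.path.exists(bible):
        for line in read_first_lines(bible):
            print(line)
```

Note that this hands you lines rather than sentences, which is a good part of why the NLTK approach with sent_tokenize tends to be the more convenient one.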
One of the more advanced data sets in here is "WordNet." WordNet is a collection of words, definitions, examples of their use, synonyms, antonyms, and more. We'll dive into using WordNet next.