Natural Language 27: Converting Words to Features with NLTK


https://www.pythonprogramming.net/words-as-features-nltk-tutorial/

Converting Words to Features with NLTK




In this tutorial, we're going to be building off the previous video and compiling feature lists of words from positive reviews and words from the negative reviews to hopefully see trends in specific types of words in positive or negative reviews.

To start, our code:

    import nltk
    import random
    from nltk.corpus import movie_reviews

    # One (word list, category) pair per review file
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]

    random.shuffle(documents)

    all_words = []

    for w in movie_reviews.words():
        all_words.append(w.lower())

    all_words = nltk.FreqDist(all_words)

    # In NLTK 3, keys() is not ordered by frequency, so ask for the
    # 3,000 most common words explicitly
    word_features = [w for (w, count) in all_words.most_common(3000)]
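To see the vocabulary-selection step in isolation, here is a minimal sketch that runs without downloading the corpus. It assumes a tiny made-up token list in place of movie_reviews.words() and uses collections.Counter, which nltk.FreqDist subclasses in NLTK 3, so most_common behaves the same way there. The names tokens and top_n are illustrative only.

```python
from collections import Counter

# Toy token stream standing in for movie_reviews.words() (hypothetical data).
tokens = ["The", "movie", "was", "great", "the", "plot", "was", "thin", "THE", "end"]

# Lowercase first so "The", "the" and "THE" are counted as one word.
counts = Counter(t.lower() for t in tokens)

# most_common(n) returns (word, count) pairs sorted by descending count.
top_n = 2
word_features = [word for word, _ in counts.most_common(top_n)]

print(word_features)  # ['the', 'was']
```

Taking the first n keys of the counter instead would return the first n distinct words encountered, which is not necessarily the n most frequent ones.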

This is mostly the same code as before, now with a new variable, word_features, which contains the 3,000 most common words. Next, we're going to build a quick function that finds these 3,000 words in our positive and negative documents, marking each word's presence as True or False:

    def find_features(document):
        words = set(document)
        features = {}
        for w in word_features:
            features[w] = (w in words)

        return features
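As a quick illustration of what this function returns, here is a small sketch that assumes a hypothetical three-word vocabulary instead of the real 3,000-word list, and a made-up review. Every vocabulary word gets a Boolean, whether or not it occurs in the document.

```python
# Hypothetical three-word vocabulary standing in for the 3,000-word list.
word_features = ["great", "boring", "plot"]

def find_features(document):
    # A set gives fast membership tests and ignores repeated words.
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

# Made-up review tokens, for illustration only.
review = ["the", "plot", "was", "great", "great", "fun"]
features = find_features(review)

print(features)  # {'great': True, 'boring': False, 'plot': True}
```

Note that the repeated "great" still yields a single True: the features record presence, not counts.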

Next, we can print one feature set like:

    print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))

Then we can do this for all of our documents, saving the feature existence Booleans and their respective positive or negative categories by doing:

    featuresets = [(find_features(rev), category) for (rev, category) in documents]
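The shape of the resulting list is easier to see on a toy input. The sketch below assumes a two-word vocabulary and two made-up documents. It is not the real corpus, just an illustration of the (feature dictionary, label) pairs that come out.

```python
# Hypothetical vocabulary and documents, for illustration only.
word_features = ["great", "boring"]

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

documents = [
    (["a", "great", "film"], "pos"),
    (["boring", "and", "slow"], "neg"),
]

# Each entry pairs a feature dictionary with the document's category.
featuresets = [(find_features(rev), category) for (rev, category) in documents]

for features, category in featuresets:
    print(category, features)
```

Each element is a two-item tuple, which is the input format NLTK's classifiers expect for labeled training data.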

Awesome, now that we have our features and labels, what is next? Typically the next step is to go ahead and train an algorithm, then test it. So, let's go ahead and do this, starting with the Naive Bayes classifier in the next tutorial!
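Before training and testing, the labeled feature sets are usually divided into two parts. The following is only a sketch of one way to do that: split_featuresets, test_fraction and seed are hypothetical names introduced here, and the toy data stands in for the real featuresets list.

```python
import random

def split_featuresets(featuresets, test_fraction=0.1, seed=0):
    """Shuffle a copy of featuresets and return (training, testing) lists."""
    shuffled = list(featuresets)
    # A seeded Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled data standing in for the real featuresets (hypothetical).
toy = [({"great": i % 2 == 0}, "pos" if i % 2 == 0 else "neg") for i in range(20)]

training_set, testing_set = split_featuresets(toy, test_fraction=0.25)
print(len(training_set), len(testing_set))  # 15 5
```

Keeping the test portion out of training is what makes the later accuracy figure meaningful, since the classifier is then scored on reviews it has not seen.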

    # -*- coding: utf-8 -*-
    """
    Created on Sun Dec 4 09:27:48 2016

    @author: daxiong
    """
    import nltk
    import random
    from nltk.corpus import movie_reviews

    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]

    random.shuffle(documents)

    all_words = []

    for w in movie_reviews.words():
        all_words.append(w.lower())

    # dict_allWords is a dictionary that stores the frequency distribution of all words in the text
    dict_allWords = nltk.FreqDist(all_words)
    # keys() lists all words; [:3000] takes the first 3,000 words listed
    word_features = list(dict_allWords.keys())[:3000]
    '''
    ['combating', 'mouthing', 'markings', 'directon', 'ppk', 'vanishing', 'victories', 'huddleston', ...]
    '''

    def find_features(document):
        words = set(document)
        features = {}
        for w in word_features:
            features[w] = (w in words)

        return features

    words = movie_reviews.words('neg/cv000_29416.txt')
    '''
    Out[78]: ['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]

    type(words)
    Out[65]: nltk.corpus.reader.util.StreamBackedCorpusView
    '''
    # Remove duplicates; words1 is a set
    words1 = set(words)
    '''
    words1
    {'!', '"', '&', "'", '(', ')', ..., 'witch', 'with', 'world', 'would', 'wrapped', 'write', 'years', 'i', 'your'}
    '''
    features = {}
    # The word 'victories' is not in words1, so this outputs False
    ('victories' in words1)
    '''
    Out[73]: False
    '''
    features['victories'] = ('victories' in words1)
    '''
    features
    Out[75]: {'victories': False}
    '''

    print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))
    '''
    {'schwarz': False, 'supervisors': False, 'geyser': False, 'site': False, 'fevered': False, 'acknowledged': False, 'ronald': False, 'wroth': False, 'degredation': False, ...}
    '''

    featuresets = [(find_features(rev), category) for (rev, category) in documents]

featuresets holds 2,000 entries in total, one per review file. Each entry is a tuple containing the feature dictionary (for example, {'glory': False, ...}) and the neg/pos category.
