Gensim Corpora and dictionary use (i)


A corpus is a basic concept in Gensim: it is the representation of a document collection and the basis for all further processing.

Lib:

from gensim import corpora
from collections import defaultdict
Data:

Documents = ["Human Machine interface for lab ABC computer Applications",
             "A Survey of user opinion of computer" system Response Time ",
             " the EPS user Interface Management system ",
             " system and Human system engineering of EPS ", "
             Relation of user perceived response time to error measurement",
             "generation of random binary ",
             " the intersection graph of paths in trees ",
             " graph minors IV widths of trees and very quasi ordering ","
             Gra ph Minors A Survey "]
# stop words to remove
stoplist = set('for a of the and to in'.split())
Filter stop words:

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
print(texts)
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]
Create a word-frequency counter:

frequency = defaultdict(int)
Count token frequencies:

for text in texts:
    for token in text:
        frequency[token] += 1
# remove words that appear only once
texts = [[token for token in text if frequency[token] > 1] for text in texts]
print(texts)
[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
# Store the tokens in a Dictionary; the Dictionary object has many useful attributes and methods:
dictionary = corpora.Dictionary(texts)

print(dictionary.id2token)  # id -> word (note: id2token is populated lazily, so access an item such as dictionary[0] first or this may print {})
{0: 'human', 1: 'interface', 2: 'computer', 3: 'survey', 4: 'user', 5: 'system', 6: 'response', 7: 'time', 8: 'eps', 9: 'trees', 10: 'graph', 11: 'minors'}

print(dictionary.token2id)  # word -> id
{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

print(dictionary.dfs)  # document frequencies: how many documents each token id appears in
{0: 2, 1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 2}

corpus = [dictionary.doc2bow(text) for text in texts]  # the corpus; doc2bow also accepts allow_update=True (default is False)
print(corpus)  # each document becomes a list of (token id, count) pairs
[[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(1, 1), (4, 1), (5, 1), (8, 1)], [(0, 1), (5, 2), (8, 1)], [(4, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(3, 1), (10, 1), (11, 1)]]
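As the comment above notes, doc2bow accepts allow_update (False by default). A minimal sketch, assuming the dictionary built above and a made-up new document, shows what happens to words the dictionary has never seen:

new_doc = "human computer interaction"           # hypothetical extra document, not part of the example data
print(dictionary.doc2bow(new_doc.lower().split()))
# [(0, 1), (2, 1)] under the id mapping above -- 'interaction' is unknown, so it is silently dropped
# with dictionary.doc2bow(..., allow_update=True) the unseen word would instead be added to the
# dictionary and given a new id (this mutates the dictionary in place)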

Saving and loading:

# dictionary.save('/tmp/deerwester.dict')
# dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
# corpora.MmCorpus.serialize('vector.mm', corpus)
# load the .mm file
# corpus = corpora.MmCorpus('vector.mm')
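Assuming the commented calls above have been run, a minimal sketch of loading everything back and iterating the serialized corpus (the paths are the same placeholders):

loaded_dict = corpora.Dictionary.load('/tmp/deerwester.dict')
loaded_corpus = corpora.MmCorpus('vector.mm')    # streamed lazily from disk rather than held in memory
for doc in loaded_corpus:
    print(doc)                                   # each document is a list of (token id, count) pairs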


Other uses of Dictionary


The Dictionary class has several other useful methods; a few of them are listed below.

dictionary.filter_n_most_frequent(N)
Removes the N most frequently occurring words from the dictionary.
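A minimal sketch, assuming the dictionary built above (the copy and the choice of N=2 are only for illustration):

import copy
d = copy.deepcopy(dictionary)   # work on a copy so the original dictionary stays intact
d.filter_n_most_frequent(2)     # drops the 2 tokens with the highest document frequency
print(d.token2id)               # the dropped tokens no longer appear in the mapping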

dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
1. Removes tokens that appear in fewer than no_below documents.
2. Removes tokens that appear in more than no_above of the documents; note that no_above is a fraction of the corpus size, not an absolute count.
3. After applying 1 and 2, keeps only the keep_n most frequent remaining tokens.
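A hedged sketch on the same toy dictionary; the threshold values here are illustrative, not recommendations:

import copy
d = copy.deepcopy(dictionary)
d.filter_extremes(no_below=2,   # keep tokens that appear in at least 2 documents
                  no_above=0.5, # ...and in at most 50% of all documents
                  keep_n=10)    # of the survivors, keep only the 10 most frequent
print(d.token2id)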

dictionary.filter_tokens(bad_ids=None, good_ids=None)
This has two uses: remove the tokens whose ids are listed in bad_ids, or keep only the tokens whose ids are listed in good_ids and remove everything else. Note that bad_ids and good_ids are both lists of token ids.
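A short sketch of both modes; the token ids are looked up from token2id above and are only examples:

import copy
d = copy.deepcopy(dictionary)
d.filter_tokens(bad_ids=[d.token2id['trees']])   # drop specific tokens by id
# or, keep only a whitelist of ids and drop everything else:
# d.filter_tokens(good_ids=[d.token2id['graph'], d.token2id['minors']])
print(d.token2id)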

dictionary.compactify()
After the filtering operations above, there may be gaps in the sequence of token ids; this method reassigns the ids so they are contiguous again.
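A minimal sketch; note that recent gensim versions already call compactify inside the filter methods, so an explicit call is mostly a safety net:

import copy
d = copy.deepcopy(dictionary)
d.filter_n_most_frequent(2)     # removing tokens can leave gaps in the id sequence (in older versions)
d.compactify()                  # reassign ids so they run contiguously from 0 again
print(d.token2id)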
