The corpus is a basic concept in Gensim: it is the representation of a document set and the basis for all further processing.
Imports:
from gensim import corpora
from collections import defaultdict
Data:
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
# stop words
stoplist = set('for a of the and to in'.split())
Filter stop words:
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
print(texts)
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]
Create a word-frequency dictionary:
frequency = defaultdict(int)
Count the words:
for text in texts:
    for token in text:
        frequency[token] += 1
# remove words that appear only once
texts = [[token for token in text if frequency[token] > 1] for text in texts]
print(texts)
[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]
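The counting and filtering step can also be written with collections.Counter from the standard library. This is only a minimal sketch on a made-up, already tokenized mini-corpus (sample_texts is hypothetical, not the data above); it shows the same idea of dropping words that occur once in the whole corpus.

```python
from collections import Counter

# Hypothetical mini-corpus, already tokenized (not the article's data).
sample_texts = [
    ["human", "interface", "computer"],
    ["user", "interface", "system"],
    ["graph", "trees"],
    ["graph", "minors", "trees"],
]

# Count every token across all documents in a single pass.
frequency = Counter(token for text in sample_texts for token in text)

# Keep only tokens that occur more than once in the whole corpus.
filtered = [[token for token in text if frequency[token] > 1] for text in sample_texts]
print(filtered)
```

Counter behaves like defaultdict(int) here, so the list comprehension is unchanged.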
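# Store the documents in a dictionary; the dictionary offers many functions:
dictionary = corpora.Dictionary(texts)
dictionary[0]  # id2token is filled lazily, on the first lookup by id
print(dictionary.id2token)  # id -> word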
{0: 'human', 1: 'interface', 2: 'computer', 3: 'survey', 4: 'user', 5: 'system', 6: 'response', 7: 'time', 8: 'eps', 9: 'trees', 10: 'graph', 11: 'minors'}
print(dictionary.token2id)  # word -> id
{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
print(dictionary.dfs)  # document frequency: id -> number of documents containing the word
{0: 2, 1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 2}
corpus = [dictionary.doc2bow(text) for text in texts]  # the corpus; doc2bow also accepts allow_update=True to add new words to the dictionary (default is False)
print(corpus)  # (word id, count) pairs for each document
[[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(1, 1), (4, 1), (5, 1), (8, 1)], [(0, 1), (5, 2), (8, 1)], [(4, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(3, 1), (10, 1), (11, 1)]]
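To make the bag-of-words format concrete, here is a rough plain-Python sketch of what the conversion does conceptually. It is not Gensim's implementation: to_bow is a hypothetical helper, and the token2id mapping is a hand-written subset of the ids printed above. Words missing from the mapping are ignored, which mirrors the default allow_update=False behavior.

```python
from collections import Counter

# Hypothetical token-to-id mapping, standing in for dictionary.token2id.
token2id = {"human": 0, "interface": 1, "computer": 2, "system": 5, "eps": 8}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs; unknown tokens are dropped."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# "unseen" is not in the mapping, so it does not appear in the result.
bow = to_bow(["system", "human", "system", "eps", "unseen"], token2id)
print(bow)
```

Each pair reads as (word id, number of times the word occurs in this document), so 'system' (id 5) is counted twice.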
Other:
#dictionary.save('/tmp/deerwester.dict')
#dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
#corpora.MmCorpus.serialize('vector.mm', corpus)
# load the mm file
#corpus = corpora.MmCorpus('vector.mm')
Some other uses of the dictionary
The dictionary has some other uses; a few of them are listed here.
dictionary.filter_n_most_frequent(N)
Removes the N most frequent words.
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
1. Remove words that appear in fewer than no_below documents.
2. Remove words that appear in more than no_above of the documents. Note that this decimal is a fraction of the total number of documents (0.5 means 50%), not an absolute count.
3. After steps 1 and 2, keep only the keep_n most frequent of the remaining words.
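The three rules can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: filter_by_document_frequency is a hypothetical helper, and the dfs values are invented document frequencies for a 10-document corpus.

```python
def filter_by_document_frequency(dfs, num_docs, no_below, no_above, keep_n):
    """Return the token ids that survive the three rules, given {id: doc_freq}."""
    # Rules 1 and 2: keep ids whose document frequency lies within both bounds.
    kept = [tid for tid, df in dfs.items() if no_below <= df <= no_above * num_docs]
    # Rule 3: of those, keep the keep_n ids with the highest document frequency.
    kept.sort(key=lambda tid: dfs[tid], reverse=True)
    return set(kept[:keep_n])

# Hypothetical document frequencies for a corpus of 10 documents.
dfs = {0: 1, 1: 3, 2: 4, 3: 9, 4: 5}
survivors = filter_by_document_frequency(dfs, num_docs=10, no_below=2, no_above=0.5, keep_n=2)
print(survivors)
```

Here id 0 is too rare, id 3 is too common, and of the remaining ids only the two with the highest document frequency are kept.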
dictionary.filter_tokens(bad_ids=None, good_ids=None)
There are two uses: one removes the words corresponding to bad_ids, the other keeps the words corresponding to good_ids and removes all other words. Note that bad_ids and good_ids are both lists of word ids.
dictionary.compactify()
After the preceding filtering, the word ids may have gaps; this function reassigns the ids so that they are consecutive again.
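What removing ids and then closing the gaps means can be shown with a small plain-Python sketch. The helper remove_and_compact and the four-word mapping are hypothetical and only illustrate the idea; they are not the library's implementation.

```python
def remove_and_compact(token2id, bad_ids):
    """Drop tokens whose id is in bad_ids, then renumber the rest from 0."""
    remaining = {tok: tid for tok, tid in token2id.items() if tid not in bad_ids}
    # Assign new consecutive ids in the order of the old ids, closing any gaps.
    ordered = sorted(remaining, key=remaining.get)
    return {tok: new_id for new_id, tok in enumerate(ordered)}

# Hypothetical mapping; removing id 1 would leave a gap between 0 and 2.
token2id = {"human": 0, "interface": 1, "computer": 2, "survey": 3}
compacted = remove_and_compact(token2id, bad_ids=[1])
print(compacted)
```

Because ids change after this step, any bag-of-words vectors built with the old ids would need to be rebuilt.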