Gensim Corpora and Dictionary Usage (I)

Last Update: 2018-07-28 Source: Internet

Author: User


The corpus is a basic concept in Gensim: it is the representation of a document collection and the basis for all further processing.

Libraries:

from gensim import corpora
from collections import defaultdict

Data:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# stop words to remove
stoplist = set('for a of the and to in'.split())

Filter out the stop words:

# lowercase each document, split it on whitespace, and drop the stop words
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]

print(texts)
[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'], ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'management', 'system'], ['system', 'human', 'system', 'engineering', 'testing', 'eps'], ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'], ['generation', 'random', 'binary', 'unordered', 'trees'], ['intersection', 'graph', 'paths', 'trees'], ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'], ['graph', 'minors', 'survey']]

Create a dictionary for counting word frequencies:

frequency = defaultdict(int)

Count the tokens:

for text in texts:
    for token in text:
        frequency[token] += 1

# remove words that appear only once
texts = [[token for token in text if frequency[token] > 1] for text in texts]

print(texts)
[['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ['graph', 'trees'], ['graph', 'minors', 'trees'], ['graph', 'minors', 'survey']]

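# Store the documents in a dictionary; the dictionary offers many functions:
dictionary = corpora.Dictionary(texts)

# id -> word (filled lazily, so it can be empty until an id has been looked up, e.g. dictionary[0])
print(dictionary.id2token)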

{0: 'human', 1: 'interface', 2: 'computer', 3: 'survey', 4: 'user', 5: 'system', 6: 'response', 7: 'time', 8: 'eps', 9: 'trees', 10: 'graph', 11: 'minors'}

print(dictionary.token2id)  # word -> id

{'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4, 'system': 5, 'response': 6, 'time': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}

print(dictionary.dfs)  # document frequency: id -> number of documents that contain the word

{0: 2, 1: 2, 2: 2, 3: 2, 4: 3, 5: 3, 6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 2}
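
Note that dfs counts the number of documents a word occurs in, not its total number of occurrences. The following is a minimal pure-Python sketch of that distinction, not Gensim's own code; the names term_freq and doc_freq are made up for illustration, and texts repeats the filtered token lists printed earlier.

```python
from collections import Counter

# The filtered token lists from the article.
texts = [
    ['human', 'interface', 'computer'],
    ['survey', 'user', 'computer', 'system', 'response', 'time'],
    ['eps', 'user', 'interface', 'system'],
    ['system', 'human', 'system', 'eps'],
    ['user', 'response', 'time'],
    ['trees'],
    ['graph', 'trees'],
    ['graph', 'minors', 'trees'],
    ['graph', 'minors', 'survey'],
]

# Term frequency: every occurrence counts.
term_freq = Counter(token for text in texts for token in text)

# Document frequency: each document counts a word at most once.
doc_freq = Counter(token for text in texts for token in set(text))

print(term_freq['system'], doc_freq['system'])  # prints: 4 3
```

The word 'system' occurs four times overall but in only three documents, which is why its entry in the dfs output is 5: 3.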

# Build the corpus as bag-of-words vectors. doc2bow also accepts allow_update=True,
# which adds unseen words to the dictionary; the default is False.
corpus = [dictionary.doc2bow(text) for text in texts]

print(corpus)  # per document: (word id, count) pairs

[[(0, 1), (1, 1), (2, 1)], [(2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)], [(1, 1), (4, 1), (5, 1), (8, 1)], [(0, 1), (5, 2), (8, 1)], [(4, 1), (6, 1), (7, 1)], [(9, 1)], [(9, 1), (10, 1)], [(9, 1), (10, 1), (11, 1)], [(3, 1), (10, 1), (11, 1)]]
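
To illustrate what a bag-of-words vector contains, here is a small pure-Python sketch of the idea behind doc2bow. It is not Gensim's implementation; to_bow is a hypothetical helper, and the mapping is copied from the token2id output above. Words missing from the mapping are simply skipped, which corresponds to the default allow_update=False behavior.

```python
from collections import Counter

# Word -> id mapping as printed by the article (token2id).
token2id = {'human': 0, 'interface': 1, 'computer': 2, 'survey': 3,
            'user': 4, 'system': 5, 'response': 6, 'time': 7,
            'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}


def to_bow(tokens, token2id):
    """Hypothetical helper: return sorted (id, count) pairs for known words."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


print(to_bow(['system', 'human', 'system', 'eps'], token2id))
# prints: [(0, 1), (5, 2), (8, 1)]
print(to_bow(['human', 'computer', 'interaction'], token2id))
# prints: [(0, 1), (2, 1)]  ('interaction' is unknown and dropped)
```

The first call reproduces the fourth entry of the corpus printed above.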

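Other:

# save the dictionary to disk and load it again
#dictionary.save('/tmp/deerwester.dict')
#dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
# write the corpus in Matrix Market format
#corpora.MmCorpus.serialize('vector.mm', corpus)
# load the .mm file
#corpus = corpora.MmCorpus('vector.mm')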

Some other uses of the dictionary

The dictionary has a number of other uses; some of them are listed here.

dictionary.filter_n_most_frequent(remove_n)
Removes the remove_n most frequent words.

dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)
1. Remove words that appear in fewer than no_below documents.
2. Remove words that appear in more than no_above of the documents. Note that no_above is a fraction of the total number of documents, not an absolute count.
3. Of the words remaining after 1 and 2, keep only the keep_n most frequent.
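
The three rules can be sketched in plain Python as follows. This is only an illustration of the logic described above, not Gensim's code: keep_after_filter is a hypothetical helper, the document frequencies are taken from the dfs and token2id output earlier in the article, and details such as rounding and the order of ties may differ in Gensim itself.

```python
# Document frequency per word, from the article's dfs and token2id output.
doc_freq = {'human': 2, 'interface': 2, 'computer': 2, 'survey': 2,
            'user': 3, 'system': 3, 'response': 2, 'time': 2,
            'eps': 2, 'trees': 3, 'graph': 3, 'minors': 2}
num_docs = 9


def keep_after_filter(doc_freq, num_docs, no_below=5, no_above=0.5, keep_n=100000):
    """Hypothetical helper applying the three rules described above."""
    max_docs = no_above * num_docs  # fraction -> absolute document count
    kept = [w for w, df in doc_freq.items() if no_below <= df <= max_docs]
    # Most frequent first; sorted() is stable, so ties keep their original order.
    kept = sorted(kept, key=lambda w: doc_freq[w], reverse=True)
    return kept[:keep_n]


print(keep_after_filter(doc_freq, num_docs, no_below=3, no_above=0.5))
# prints: ['user', 'system', 'trees', 'graph']
print(keep_after_filter(doc_freq, num_docs, no_below=3, no_above=0.5, keep_n=2))
# prints: ['user', 'system']
print(keep_after_filter(doc_freq, num_docs))
# prints: []  (no word appears in at least 5 of the 9 documents)
```

With a corpus this small, the default no_below=5 removes every word, so the thresholds usually need to be lowered for toy examples.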

dictionary.filter_tokens(bad_ids=None, good_ids=None)
There are two uses: one removes the words corresponding to bad_ids, the other keeps the words corresponding to good_ids and removes all others. Note that bad_ids and good_ids are given as lists.

dictionary.compactify()
After the preceding filtering, the word ids may have gaps; this function reassigns the ids so that the gaps are removed.
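
A minimal pure-Python sketch of the idea: removing words by id leaves holes in the numbering, and compacting renumbers the remaining words consecutively. The helpers remove_ids and close_gaps are hypothetical and only mimic the behavior described here; they are not Gensim functions.

```python
# A small word -> id mapping, taken from the start of the article's token2id.
token2id = {'human': 0, 'interface': 1, 'computer': 2, 'survey': 3, 'user': 4}


def remove_ids(token2id, bad_ids):
    """Hypothetical helper: drop the words whose ids are listed in bad_ids."""
    bad = set(bad_ids)
    return {w: i for w, i in token2id.items() if i not in bad}


def close_gaps(token2id):
    """Hypothetical helper: renumber ids to 0..n-1, preserving their order."""
    by_id = sorted(token2id, key=token2id.get)
    return {w: new_id for new_id, w in enumerate(by_id)}


filtered = remove_ids(token2id, bad_ids=[1, 3])
print(filtered)              # prints: {'human': 0, 'computer': 2, 'user': 4}
print(close_gaps(filtered))  # prints: {'human': 0, 'computer': 1, 'user': 2}
```

Because the ids change, bag-of-words vectors built before the renumbering no longer match the dictionary and should be rebuilt.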

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
