Recently I have often heard colleagues mention correlation analysis, and I happened to come across this open-source library from Google (word2vec, used here through gensim), so I recorded the relevant operations and debugging results.
The novel collection can be found by searching Baidu for the complete set of Jin Yong's 14 novels in TXT format and downloading it.
The name lists need to be tidied into a consistent format: each sect name and each martial-arts name must sit on its own line, and any trailing blank lines should be deleted (see the example below).
After downloading, you can adjust the files with your own tools or scripts. Because the corpus is too long and the blog does not allow "flooding", it is not pasted here; contact me if you need it.
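For reference, the parsing code below expects names.txt to alternate a line holding a novel title with a line of space-separated character names; the entries shown here are only an illustration of the layout:

天龙八部
乔峰 段誉 虚竹 王语嫣 阿朱 阿紫
射雕英雄传
郭靖 黄蓉 洪七公 周伯通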
import jieba
import gensim

with open('names.txt', encoding='utf-8') as f:
    data = [line.strip() for line in f.readlines()]

novels = data[::2]    # even lines hold the novel titles
names = data[1::2]    # odd lines hold the space-separated character names
novel_names = {k: v.split() for k, v in zip(novels, names)}
# novel_names can be printed here to check that everything was read
Start word segmentation and loading:

# Register every character name with jieba so the segmenter keeps them whole
for _, names in novel_names.items():  # .iteritems() in Python 2
    for name in names:
        jieba.add_word(name)

# Do the same for the martial-arts and sect/gang name lists
with open('kongfu.txt', encoding='utf-8') as f:
    kungfu_names = [line.strip() for line in f.readlines()]
with open('bangs.txt', encoding='utf-8') as f:
    bang_names = [line.strip() for line in f.readlines()]

for name in kungfu_names:
    jieba.add_word(name)
for name in bang_names:
    jieba.add_word(name)
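To see why registering these words matters, here is a quick illustrative check; the exact default segmentation may differ across jieba versions and dictionaries:

import jieba

# Without the custom entry, jieba may split the move name apart
print(list(jieba.cut("黄蓉使出打狗棒法")))  # e.g. [..., '打狗', '棒法']
jieba.add_word("打狗棒法")                  # Dog-Beating Staff Technique
print(list(jieba.cut("黄蓉使出打狗棒法")))  # now kept as one token: '打狗棒法'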
Novels = ["Book and Sword]"Tian Long Eight", "mild feeling Jian","The more female sword","Flying fox rumor","Knight Line","Heroic biography of the Eagle","The Statue of God, ""Liancheng tactic","Mandarin Duck knife","The Day of the Dragon Slayer","The White Horse squealing westerly wind","laughing and proud of the lake","Snow Mountain Flying fox","Deer Ding kee"]
As fans put it: "飞雪连天射白鹿，笑书神侠倚碧鸳", the couplet formed from the first characters of fourteen of the titles (越女剑 / Sword of the Yue Maiden is the one left out).
sentences = []
for novel in novels:
    with open('{}.txt'.format(novel), encoding='utf-8') as f:
        data = [line.strip() for line in f.readlines() if line.strip()]
    for line in data:
        words = list(jieba.cut(line))
        sentences.append(words)

model = gensim.models.Word2Vec(sentences,
                               size=200,
                               window=5,
                               min_count=5,
                               workers=4)
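The call above targets gensim versions before 4.0. On gensim 4.0 and later the same training looks like the following sketch, since a few parameters were renamed:

import gensim

# gensim >= 4.0 renamed size -> vector_size and iter -> epochs
model = gensim.models.Word2Vec(sentences,
                               vector_size=200,  # dimensionality of the word vectors
                               window=5,         # context window size
                               min_count=5,      # drop words rarer than this
                               workers=4)        # parallel worker threads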
A handy helper function:
def get_gongfu(a, b, c):
    # Vector analogy a : b ~ c : ?, i.e. rank words by similarity to (b - a + c)
    d, _ = model.most_similar(positive=[c, b], negative=[a])[0]
    print(c, d)
# Usage examples
print("------------- If Huang Rong uses the Dog-Beating Staff Technique, what would Guo Jing use? -------------")
get_gongfu("黄蓉", "打狗棒法", "郭靖")  # Huang Rong, Dog-Beating Staff Technique, Guo Jing

print("------------- If Huang Rong holds the Dog-Beating Staff, what would Guo Jing hold? -------------")
get_gongfu("黄蓉", "打狗棒", "郭靖")   # Huang Rong, Dog-Beating Staff, Guo Jing
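The first three blocks of the output are plain relevance queries for single words. The original post does not show those calls, but they would look roughly like this sketch (the query tokens come from the Chinese corpus):

# Presumed relevance queries behind the first three output blocks
for word in ["乔峰", "阿朱", "降龙十八掌"]:  # Qiao Feng, A'Zhu, Eighteen Dragon-Subduing Palms
    print("------------- Relevance: {} -------------".format(word))
    for similar, score in model.most_similar(word):
        print(similar, score)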
Output Result:
------------- Relevance: Qiao Feng -------------
Xuzhu 0.822662174701691
Murong Fu 0.809000551700592
Duan Zhengchun 0.808856725692749
Lunkhead 0.789826631546021
Child basking 0.788126051425934
Elite 0.7863771915435791
Quan Guanqing 0.776110172271729
Cigarette smoker 0.7738543748855591
Houlian boat 0.7663788199424744
Lu Feiqing 0.7651679515838623
------------- Relevance: A'Zhu -------------
A'Zi 0.8502078056335449
Wang 0.8323276042938232
Lunkhead 0.8188427090644836
Fang 0.81195068359375
Angry 0.8042664527893066
Miriam Lynn 0.7905520796775818
Qingqing 0.7837553024291992
Fragrant Princess 0.7774882316589355
Yingying 0.7765697836875916
Mrs. Ma 0.7628135681152344
------------- Relevance: Eighteen Dragon-Subduing Palms -------------
Dog-Beating Staff Technique 0.9099119901657104
Taijiquan 0.8792168498039246
Kongming Fist 0.8742830157279968
Trick 0.864672064781189
One Yang Finger 0.8576483726501465
Toad Skill 0.8443030714988708
Centroid 0.8419612646102905
Staff technique 0.840523362159729
Arhat Fist 0.838168740272522
Small Grip 0.8356980085372925
------------- If Huang Rong uses the Dog-Beating Staff Technique, what would Guo Jing use? -------------
Guo Jing  Eighteen Dragon-Subduing Palms
------------- If Huang Rong holds the Dog-Beating Staff, what would Guo Jing hold? -------------
Guo Jing  Signal flag
Model parameters:
sentences: can be a list; for a large corpus, consider a streaming iterator such as BrownCorpus, Text8Corpus, or LineSentence instead (see the sketch after this list).
sg: selects the training algorithm. The default, 0, corresponds to CBOW; sg=1 selects skip-gram.
size: the dimensionality of the feature vectors; defaults to 100. Larger sizes need more training data but can give better results. Typical values range from the tens to the hundreds.
window: the maximum distance between the current word and the predicted word within a sentence.
alpha: the initial learning rate.
seed: seeds the random number generator, which affects word-vector initialization.
min_count: truncates the vocabulary; words occurring fewer than min_count times are discarded. The default is 5.
max_vocab_size: caps RAM usage while building the vocabulary; if there are more unique words than this, the least frequent ones are pruned. Roughly 1 GB of RAM is needed per 10 million word types. Set it to None (the default) for no limit.
sample: the threshold for randomly downsampling high-frequency words; the default is 1e-3, and the useful range is (0, 1e-5).
workers: the number of parallel training threads.
hs: if 1, hierarchical softmax is used; if 0 (the default), negative sampling is used instead.
negative: if > 0, negative sampling is used, and the value sets how many noise words are drawn.
cbow_mean: if 0, use the sum of the context word vectors; if 1 (the default), use their mean. Only applies when CBOW is used.
hashfxn: the hash function used to initialize the weights; defaults to Python's built-in hash function.
iter: the number of iterations (epochs) over the corpus; the default is 5.
trim_rule: sets the vocabulary trimming rule, i.e. which words are kept and which are discarded. It can be None (min_count is used), or a callable that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP, or utils.RULE_DEFAULT.
sorted_vocab: if 1 (the default), sort the vocabulary by descending frequency before assigning word indexes.
batch_words: the number of words passed to worker threads per batch; the default is 10000.
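As an illustration of the streaming option mentioned under sentences above, here is a minimal sketch using LineSentence with the pre-4.0 gensim parameter names used in this post; the file name corpus_segmented.txt is hypothetical:

import gensim
from gensim.models.word2vec import LineSentence

# corpus_segmented.txt (hypothetical): one sentence per line, tokens
# separated by spaces, e.g. written out by jieba in a previous step
sentences = LineSentence('corpus_segmented.txt')
model = gensim.models.Word2Vec(sentences,
                               size=200,     # vector dimensionality
                               sg=1,         # 1 = skip-gram, 0 = CBOW
                               negative=10,  # negative sampling with 10 noise words
                               min_count=5,
                               workers=4)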
Note: first download the complete set of Jin Yong's 14 novels (TXT) via Baidu, then read in their content. Also note that the model above is retrained every time the script runs, so the results will vary slightly from run to run.
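Because of that retraining cost and run-to-run variation, one option (not shown in the original post) is to train once and persist the model; the file name jinyong_word2vec.model is hypothetical:

# Train once, save, and reload later instead of retraining on every run
model.save('jinyong_word2vec.model')  # hypothetical file name
model = gensim.models.Word2Vec.load('jinyong_word2vec.model')

# For more repeatable training, fix the seed and use a single worker thread:
# gensim.models.Word2Vec(sentences, size=200, seed=42, workers=1)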