1. The code for training the word vectors is as follows:
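# Train a model that represents each word as a vector
from gensim.models import Word2Vec

def w2v_train(self):
    # Use the text of every question as the corpus for training one w2v model
    ques = self.cu.execute('select question from activity')
    da_all = []
    for d in ques:
        da_all.append(d[0])
    sentences = self.get_text(da_all)
    model = Word2Vec()
    model.build_vocab(sentences)
    # model.iter is the gensim 3.x attribute; gensim 4 renamed it to model.epochs
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.save("./tmp/user_w2corpus")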
The result of training is one vector per word.
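To make "one vector per word" concrete, here is a minimal sketch that uses only the standard library. The `word_vectors` dictionary and its 3-dimensional values are made up for illustration and stand in for what a trained model would return; `cosine` is a helper defined here, not part of gensim.

```python
import math

# Hypothetical 3-dimensional vectors standing in for a trained model's output;
# a real w2v model uses 100 dimensions by default.
word_vectors = {
    "python": [0.9, 0.1, 0.0],
    "java": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Words used in similar contexts should end up with similar vectors
sim_close = cosine(word_vectors["python"], word_vectors["java"])
sim_far = cosine(word_vectors["python"], word_vectors["banana"])
print(sim_close > sim_far)
```

Because every word maps to a vector of the same length, the vectors can be compared, summed, or passed to a clustering algorithm, which is what step 2 relies on.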
2. Fetch each of a user's questions again, segment them into words, then cluster:
def simmetric_topic_a(self, clust_num, userid):
    from sklearn.cluster import KMeans
    from sklearn.externals import joblib  # on newer scikit-learn: import joblib
    texts = self.get_dict(userid)[1]  # vocabulary
    texts_len = len(texts)
    model = gensim.models.Word2Vec.load('./tmp/user_w2corpus')
    texts_vec = []  # meant to store each sentence's vector once computed, i.e. the returned sentence vectors
    X = []
    for text in texts:  # loop over every sentence
        # The default w2v vector dimension is 100, so initialize a 100-dimensional
        # zero vector. Otherwise, if a sentence contains a single word that was not
        # trained, its dimension cannot be aligned with the previous vectors.
        text_vec = np.zeros((100,))
        for t in text:  # each word in the sentence (originally summed into text_vec)
            try:
                # text_vec += model[t]  # accumulate to get the sentence vector
                # Add the word's vector to X; a word that appears in more than
                # one document is added multiple times (model.wv[t] in gensim 4)
                X.append(model[t])
            except Exception as e:
                print("the trained vocabulary does not contain this word")
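The listing above stops before the clustering step. It also appends individual word vectors to `X` and leaves the sentence-vector sum commented out. The following is a rough sketch of how the remaining steps might look, not the author's code. It assumes sentence vectors are built by averaging the known word vectors, and it uses a small pure-Python k-means in place of `sklearn.cluster.KMeans`. The names `sentence_vector`, `kmeans`, `toy_vectors`, and `questions` are all hypothetical.

```python
def sentence_vector(words, vectors, dim):
    """Average the vectors of known words; return all zeros if none are known."""
    total = [0.0] * dim
    known = 0
    for w in words:
        vec = vectors.get(w)
        if vec is None:  # word was not in the training vocabulary
            continue
        total = [a + b for a, b in zip(total, vec)]
        known += 1
    return [a / known for a in total] if known else total

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (an arbitrary choice)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            labels[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-dimensional word vectors and already-segmented questions
toy_vectors = {"python": [1.0, 0.0], "java": [0.9, 0.1],
               "apple": [0.0, 1.0], "pear": [0.1, 0.9]}
questions = [["python", "java"], ["apple", "pear"], ["java", "unknown"], ["pear"]]

X = [sentence_vector(q, toy_vectors, dim=2) for q in questions]
labels, centroids = kmeans(X, k=2)
print(labels)
```

Words missing from the vocabulary are skipped rather than raising an error, so every sentence vector keeps the same dimension, which is the alignment concern raised in the comment on the zero vector. With scikit-learn installed, `KMeans(n_clusters=clust_num).fit(X)` would be the usual replacement for the hand-written `kmeans`.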
Open problem: clustering based on w2v word vectors (still to be solved)