Writing and Applying Paragraph Vectors in Gensim and TensorFlow

Source: Internet
Author: User

The previous post discussed building and comparing Word2vec models in TensorFlow and Gensim. In this post, let's look at another of Mikolov's models, the paragraph vector model. Mikolov and Bengio's recent paper, "Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews", uses this model to analyze user reviews of films and television works. At the same time, many sources online point out that the model does not perform as well as its predecessor, Word2vec. Here we will not debate whether the model is good or bad; we will only discuss how to build it. First I will introduce the Gensim approach, and then, following the Gensim model, we will try to write the same model in TensorFlow. Before we start coding, a little background on the model is in order.


Model background:

The model grows out of the CBOW and Skip-gram models in Word2vec. Structurally it is almost identical to CBOW or Skip-gram; the biggest difference is that, alongside the word vectors, a new vector is added that represents the sentence, the paragraph, or the article. What this vector stands for depends on what the user of the model needs: sentence classification, paragraph classification, or article classification. The new vector lives in a space separate from the word vectors, so be careful not to confuse the two. The model is trained in the same way as Word2vec. Its purpose is to attach the meaning of a longer sequence to the words, and to obtain a Word2vec-like effect for the unsupervised classification of sentences, paragraphs, or articles. For a detailed description, see the links referenced in the original post.
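As a minimal sketch of this idea (not the code from this post), the NumPy snippet below keeps word vectors and paragraph vectors in two separate tables and averages one paragraph vector with its context word vectors to form the input that predicts the center word. The toy sizes, the names `word_vectors`, `paragraph_vectors` and `output_weights`, and the random initialization are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

vocabulary_size = 10   # assumed toy vocabulary
paragraph_count = 4    # assumed number of sentences/paragraphs
embedding_size = 5     # both tables share this width so they can be averaged

# Two separate lookup tables: one row per word, one row per paragraph.
word_vectors = rng.uniform(-1.0, 1.0, (vocabulary_size, embedding_size))
paragraph_vectors = rng.uniform(-1.0, 1.0, (paragraph_count, embedding_size))

# One training window: context word ids around a center word, plus the id
# of the paragraph the window was taken from (made-up values).
context_ids = np.array([2, 7, 1, 3])
paragraph_id = 1

# Average the context word vectors together with the paragraph vector.
stacked = np.vstack([word_vectors[context_ids],
                     paragraph_vectors[paragraph_id][np.newaxis, :]])
hidden = stacked.mean(axis=0)

# Score every vocabulary word as the candidate center word. A full softmax
# is used here for clarity; cheaper approximations are used in practice.
output_weights = rng.uniform(-1.0, 1.0, (vocabulary_size, embedding_size))
scores = output_weights.dot(hidden)
probabilities = np.exp(scores - scores.max())
probabilities /= probabilities.sum()

print(hidden.shape, probabilities.shape)
```

During training, both tables would be updated so that the true center word receives a high probability; the paragraph row is shared by every window drawn from the same paragraph.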

Model Code:

1. Gensim Code:

First, let's take a look at how the Gensim code is expressed:

    from gensim.models import doc2vec
    from collections import namedtuple
    import csv
    import re
    import string

    # Use Wikipedia as input: read part of a Wikipedia CSV file.
    reader = csv.reader(open("Wikipedia.csv"))
    count = 0
    data = ''
    for row in reader:
        count = count + 1
        if count > 301:
            break
        else:
            data += row[1]

    # Split into sentences on periods, question marks and exclamation points.
    # This rule is not rigorous (for example, "Mr.Wang" is split into two
    # sentences), but the code is only a demonstration, so we do not insist
    # on strict sentence splitting here.
    sentenceEnders = re.compile('[.?!]')
    data_list = sentenceEnders.split(data)

    # Build a namedtuple to hold the labeled input.
    LabelDoc = namedtuple('LabelDoc', 'words tags')
    exclude = set(string.punctuation)
    all_docs = []
    count = 0
    for sen in data_list:
        word_list = sen.split()
        # A sentence with fewer than three words is treated as incomplete
        # and dropped to keep the input clean.
        if len(word_list) < 3:
            continue
        tag = ['SEN_' + str(count)]
        count += 1
        sen = ''.join(ch for ch in sen if ch not in exclude)
        all_docs.append(LabelDoc(sen.split(), tag))

    # Print a sample to inspect the shape of all_docs.
    print(all_docs[0:10])

    # The Gensim documentation notes that the best results come either from
    # shuffling the input sentences or from manually lowering the learning
    # rate alpha across training passes; here we use the latter.
    model = doc2vec.Doc2Vec(alpha=0.025, min_alpha=0.025)  # use fixed learning rate
    model.build_vocab(all_docs)
    # The number of passes is missing from the source text; 10 is an
    # assumption that keeps alpha positive with a step of 0.002.
    for epoch in range(10):
        model.train(all_docs)
        model.alpha -= 0.002  # decrease the learning rate
        model.min_alpha = model.alpha  # fix the learning rate, no decay

    # Save the model.
    model.save('my_model.doc2vec')
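The manual learning-rate decay in the training loop above is easy to inspect on its own. The short sketch below only tabulates the rate used on each pass; it does not train anything. The starting value 0.025 and the step 0.002 come from the code above, while the number of passes (`passes = 10`) is an assumption.

```python
alpha = 0.025   # starting learning rate, as in the training code
step = 0.002    # amount subtracted after each pass
passes = 10     # assumed number of passes

schedule = []
for epoch in range(passes):
    schedule.append(alpha)  # rate used for this pass
    alpha -= step           # lower it before the next pass

for epoch, rate in enumerate(schedule):
    print("pass %d: alpha = %.3f" % (epoch, rate))
```

Setting `min_alpha` equal to `alpha` inside the loop is what keeps the rate constant within a pass, so the only decay is the explicit subtraction between passes.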

As you can see, once the input has been prepared, the training procedure is easy to follow; the only part that needs some design is the manual reduction of the learning rate alpha. To test the model, run the following code:

    import numpy as np

    # Pick an arbitrary sentence id.
    doc_id = np.random.randint(model.docvecs.count)
    print(doc_id)

    # Use docvecs.most_similar to rank the similar sentences by id,
    # then print the first 8 in turn.
    sims = model.docvecs.most_similar(doc_id, topn=model.docvecs.count)
    print('TARGET', all_docs[doc_id].words)
    count = 0
    for i in sims:
        if count >= 8:
            break
        # Recover the integer position from a tag such as 'SEN_12'.
        pid = int(i[0].replace("SEN_", ""))
        print(i[0], ": ", all_docs[pid].words)
        count += 1

The results of the operation are as follows:

Clearly, when the target sentence is about Maldonado, the closest sentence is also about him. Likewise, the target sentence is about notable victories, and the second-closest sentence is on the same topic. This shows that the system has indeed learned some relatedness. But so far we have only used a black box; how does it work inside? Below we will try to reproduce this logic in TensorFlow.
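One part of the black box is simple to illustrate: similarity queries of this kind are commonly computed as cosine similarity between document vectors. The sketch below ranks a toy set of vectors that way. The array `doc_vectors` and the helper `most_similar` are hypothetical stand-ins written for this illustration, not Gensim's implementation.

```python
import numpy as np

def most_similar(doc_vectors, doc_id, topn=3):
    """Rank documents by cosine similarity to doc_vectors[doc_id]."""
    # Normalize rows to unit length so that a dot product equals the cosine.
    norms = np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    unit = doc_vectors / norms
    sims = unit.dot(unit[doc_id])
    # Sort from most to least similar and drop the query document itself.
    order = [i for i in np.argsort(-sims) if i != doc_id]
    return [(int(i), float(sims[i])) for i in order[:topn]]

# Toy document vectors (made-up numbers): rows 0 and 1 point in similar
# directions, row 2 is orthogonal to row 0, row 3 points the opposite way.
doc_vectors = np.array([[1.0, 0.0, 0.0],
                        [0.9, 0.1, 0.0],
                        [0.0, 1.0, 0.0],
                        [-1.0, 0.0, 0.0]])

ranking = most_similar(doc_vectors, 0)
for doc, score in ranking:
    print(doc, round(score, 3))
```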

2. TensorFlow Code:

My May 19 post introduced a TensorFlow implementation of the Word2vec CBOW model; see that post for details. Starting from that model, we will work out how to modify it to obtain PV-DM, the paragraph vector counterpart of CBOW.

First, we need to prepare the input. The method is the same as in the Gensim code above and is not repeated here. Note, however, that the original Wikipedia.csv document has been preprocessed into namedtuple structs, each holding a word list and its sentence ID. To accept this structure, we need to change the build_dataset function so that it assembles the dictionary and the input data correctly. The goal is to keep the original count, dictionary, and reverse dictionary, but to transform the input data itself: for each namedtuple, the words in its list are replaced by their indices in the dictionary. The following code does this:

    import collections

    def build_dataset(input_data, min_cut_freq):
        # Flatten input_data into a single word list, as in the CBOW model,
        # so that collections.Counter can be used.
        words = []
        for i in input_data:
            for j in i.words:
                words.append(j)
        count_org = [['UNK', -1]]
        count_org.extend(collections.Counter(words).most_common())
        count = [['UNK', -1]]
        for word, c in count_org:
            word_tuple = [word, c]
            if word == 'UNK':
                count[0][1] = c
                continue
            # Keep only words that occur more often than the cut-off.
            if c > min_cut_freq:
                count.append(word_tuple)
        dictionary = dict()
        for word, _ in count:
            dictionary[word] = len(dictionary)
        data = []
        unk_count = 0
        for tup in input_data:
            word_data = []
            for word in tup.words:
                if word in dictionary:
                    index = dictionary[word]
                else:
                    index = 0  # dictionary['UNK']
                    unk_count += 1
                word_data.append(index)
            # Same tags, but words replaced by their dictionary indices.
            data.append(LabelDoc(word_data, tup.tags))
        count[0][1] = unk_count
        reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
        return data, count, dictionary, reverse_dictionary
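To make the transformation concrete, here is a condensed, self-contained sketch of the same idea on two toy sentences. It is a simplification rather than the build_dataset function itself; the sample sentences and the cut-off `min_cut_freq = 1` are made up for illustration.

```python
import collections
from collections import namedtuple

LabelDoc = namedtuple('LabelDoc', 'words tags')

docs = [LabelDoc(['the', 'cat', 'sat'], ['SEN_0']),
        LabelDoc(['the', 'dog', 'sat', 'down'], ['SEN_1'])]
min_cut_freq = 1  # keep only words seen more than once

# Count every word across all documents.
counts = collections.Counter(w for doc in docs for w in doc.words)

# Index 0 is reserved for 'UNK'; frequent words get the following indices.
dictionary = {'UNK': 0}
for word, c in counts.most_common():
    if c > min_cut_freq:
        dictionary[word] = len(dictionary)

# Replace each word by its index and leave the tags untouched.
data = [LabelDoc([dictionary.get(w, 0) for w in doc.words], doc.tags)
        for doc in docs]

print(dictionary)
print(data)
```

Words below the cut-off ('cat', 'dog', 'down') all collapse to index 0, while each sentence keeps its own tag, which is what later lets a batch carry a paragraph label.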

With the code above we obtain the input we need. So how do we build the model? Before building it, we need to change the generate_batch function so that it keeps the original batch and labels outputs and adds a paragraph label for each label.

    import collections
    import random
    import numpy as np

    def generate_dm_batch(batch_size, num_skips, skip_window):
        global word_index
        global sentence_index
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window
        batch = np.ndarray(shape=(batch_size, num_skips), dtype=np.int32)
        labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
        para_labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # paragraph labels
        span = 2 * skip_window + 1  # [ skip_window target skip_window ]
        buffer = collections.deque(maxlen=span)
        # Fill the sliding window with the first `span` words.
        for _ in range(span):
            buffer.append(data[sentence_index].words[word_index])
            sen_len = len(data[sentence_index].words)
            if sen_len - 1 == word_index:  # reaching the end of a sentence
                word_index = 0
                sentence_index = (sentence_index + 1) % len(data)
            else:  # increase the word_index by 1
                word_index += 1
        for i in range(batch_size):
            target = skip_window  # target label at the center of the buffer
            targets_to_avoid = [skip_window]
            batch_temp = np.ndarray(shape=(num_skips), dtype=np.int32)
            # Sample num_skips distinct context positions from the window.
            for j in range(num_skips):
                while target in targets_to_avoid:
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                batch_temp[j] = buffer[target]
            batch[i] = batch_temp
            labels[i, 0] = buffer[skip_window]
            para_labels[i, 0] = sentence_index
            # Slide the window forward by one word.
            buffer.append(data[sentence_index].words[word_index])
            sen_len = len(data[sentence_index].words)
            if sen_len - 1 == word_index:  # reaching the end of a sentence
                word_index = 0
                sentence_index = (sentence_index + 1) % len(data)
            else:  # increase the word_index by 1
                word_index += 1
        return batch, labels, para_labels
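The function above depends on a cursor made of two global indices that walk through the sentences word by word and wrap around at the end of the data. The following self-contained sketch isolates just that cursor. The helper `next_word` and the toy `data` are hypothetical, and unlike the function above it captures the sentence index together with the word before advancing.

```python
from collections import namedtuple

LabelDoc = namedtuple('LabelDoc', 'words tags')

# Toy data: words are already replaced by integer ids (made-up values).
data = [LabelDoc([11, 12, 13], ['SEN_0']),
        LabelDoc([21, 22], ['SEN_1'])]

word_index = 0
sentence_index = 0

def next_word():
    """Return (word id, sentence index) at the cursor, then advance the cursor."""
    global word_index, sentence_index
    current = (data[sentence_index].words[word_index], sentence_index)
    if word_index == len(data[sentence_index].words) - 1:
        # Last word of the sentence: restart at the next sentence, wrapping around.
        word_index = 0
        sentence_index = (sentence_index + 1) % len(data)
    else:
        word_index += 1
    return current

stream = [next_word() for _ in range(7)]
print(stream)
```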

Here we maintain two global variables, word_index and sentence_index. The former records which word of the sentence the previous batch read up to, and the latter records which sentence it was reading. Both start at 0. If the word just read is the last word of the sentence, i.e. sen_len - 1 == word_index, we reset word_index and move sentence_index to the next sentence. In this way we keep the original batch and labels, and additionally define for each input window the para_label it corresponds to. Now that the materials are ready, how do we use them to build paragraph vectors?

    import math
    import tensorflow as tf

    # Note: the argument order of tf.concat and tf.nn.nce_loss below follows
    # the pre-1.0 TensorFlow API used in this post.
    with graph.as_default():
        # Input data.
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size, skip_window * 2])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        # Paragraph vector placeholder.
        train_para_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

        # Ops and variables pinned to the CPU because of missing GPU implementation.
        with tf.device('/cpu:0'):
            # Look up embeddings for inputs.
            embeddings = tf.Variable(
                tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
            embed_word = tf.nn.embedding_lookup(embeddings, train_inputs)

            # Look up embeddings for paragraph inputs.
            para_embeddings = tf.Variable(
                tf.random_uniform([paragraph_size, embedding_size], -1.0, 1.0))
            embed_para = tf.nn.embedding_lookup(para_embeddings, train_para_labels)

            # Concat them and average them.
            embed = tf.concat(1, [embed_word, embed_para])
            reduced_embed = tf.div(tf.reduce_sum(embed, 1), skip_window * 2 + 1)

            # Construct the variables for the NCE loss.
            nce_weights = tf.Variable(
                tf.truncated_normal([vocabulary_size, embedding_size],
                                    stddev=1.0 / math.sqrt(embedding_size)))
            nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

        # Compute the average NCE loss for the batch.
        # tf.nn.nce_loss automatically draws a new sample of the negative
        # labels each time we evaluate the loss.
        loss = tf.reduce_mean(
            tf.nn.nce_loss(nce_weights, nce_biases, reduced_embed, train_labels,
                           num_sampled, vocabulary_size))
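The concat-and-average step in the graph above can be checked with plain NumPy, which makes the tensor shapes explicit. This is only a shape sketch under assumed toy sizes; the constant-valued arrays stand in for the results of the two embedding lookups.

```python
import numpy as np

batch_size, skip_window, embedding_size = 3, 2, 4  # assumed toy sizes

# Stand-ins for the two lookups: 2*skip_window word rows and 1 paragraph
# row per example.
embed_word = np.ones((batch_size, 2 * skip_window, embedding_size))
embed_para = np.full((batch_size, 1, embedding_size), 6.0)

# Join along axis 1 so each example holds the word rows plus the paragraph row.
embed = np.concatenate([embed_word, embed_para], axis=1)

# Sum over that axis and divide by the number of rows to get the average.
reduced_embed = embed.sum(axis=1) / (2 * skip_window + 1)

print(embed.shape, reduced_embed.shape)
```

Because the paragraph lookup has shape [batch_size, 1, embedding_size], it lines up with the word lookup along axis 1, and the divisor 2 * skip_window + 1 counts the paragraph row as one extra context item.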

Here we first keep the original word embedding graph. On top of it we add the train_para_labels placeholder and define the paragraph vector embeddings. After concatenating the word and paragraph embeddings and averaging them, we train the model with the NCE loss. Finally, we run the session:

    with tf.Session(graph=graph) as session:
        # We must initialize all variables before we use them.
        init.run()
        print("Initialized")
        average_loss = 0
        for step in xrange(num_steps):
            batch_inputs, batch_labels, batch_para_labels = generate_dm_batch(
                batch_size, num_skips, skip_window)
            feed_dict = {train_inputs: batch_inputs,
                         train_labels: batch_labels,
                         train_para_labels: batch_para_labels}

In the session, we call our generate_dm_batch function and feed the batch, labels, and paragraph labels to the model. The model did not perform well in this run, and because of time constraints I have not optimized it. Next I will try the suggestions from the Gensim documentation: shuffling the input sentences or lowering the learning rate alpha. If you find errors in my code, please point them out. Thank you for your participation!
