This article is "Attention-over-attention neural Networks for Reading comprehension" reading notes. The task to be dealt with in this paper is to read and understand the cloze problem. Its model architecture is built on the "Text Understanding with the Attention Sum Reader Network", the thesis is supreme. Firstly, this paper puts forward the task of using attention for cloze, and this paper adds an additional attention layer on the basis of it, which can eliminate the heuristic algorithm and some super-parameter adjustment problems. We are going to combine two papers to introduce them. Data Set
First, the datasets. The main large-scale datasets at the moment are CNN/Daily Mail and the Children's Book Test (CBTest). The first two are news datasets: the full news article is used as the cloze text (document), a word removed from its summary forms the query, and the removed word is the answer. Named entities in the document are replaced with identifiers such as @entity1, @entity2, and so on. In a raw data file, for example, the first line is the web page URL (not used), the third line is the document, and the seventh line is the query together with the answer, with all named entities replaced.
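A made-up miniature sample in this format might look like the following (the entity identifiers and text are invented purely for illustration; real documents are far longer):
# Toy example only: a real CNN/Daily Mail sample has a much longer document
# and many more anonymized entities.
document = ("@entity1 beat @entity3 2-0 in the final on sunday , "
            "handing @entity1 its third @entity5 title in a row .")
query = "@placeholder wins a third straight title"   # word removed from the summary
answer = "@entity1"
sample = (document, query, answer)                    # the <D, Q, A> triple described below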
The CBT dataset is built from children's books. Since there are no summaries, it is constructed differently: 20 consecutive sentences are used as the document, and a word is removed from the following sentence to form the query. It is then divided into four subsets according to the part of speech of the answer: named entities (NE), common nouns (CN), verbs, and prepositions. Because the latter two kinds of answers do not depend closely on the text (people can often fill in the blank without reading the passage), only the first two subsets are commonly used.
Finally, each data sample is represented as a triple:
<D, Q, A>
Model
First, let's look at the model architecture presented in "Text Understanding with the Attention Sum Reader Network", shown in the figure below:
As the figure shows, the model first obtains the word embedding e(w) of every word in the document and the query through an embedding matrix V. Next, two encoder networks compute the contextual embedding of each word in the document and a single representation vector for the query; both encoders are bidirectional GRU recurrent networks. The query vector is then combined with the contextual embedding of each document word using a dot product, and the result can be viewed as the weight of each word with respect to the query, i.e., an attention score. Finally, a softmax function turns these weights into a normalized probability distribution, and the candidate with the highest probability is taken as the answer to the query.
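To make this concrete, here is a minimal NumPy sketch of the scoring step (my own illustration with invented names, not code from the paper): it computes dot-product attention between the query vector and each document word, normalizes with softmax, and sums the probabilities of repeated candidate words, which is the "attention sum" of the title.
import numpy as np

def as_reader_scores(h_doc, q, doc_word_ids, vocab_size):
    # h_doc: [doc_len, 2h] contextual embeddings, q: [2h] query vector,
    # doc_word_ids: [doc_len] integer ids of the document tokens.
    logits = h_doc @ q                       # dot-product attention scores, [doc_len]
    att = np.exp(logits - logits.max())
    att /= att.sum()                         # softmax over document positions
    scores = np.zeros(vocab_size)
    np.add.at(scores, doc_word_ids, att)     # sum attention of repeated candidate words
    return scores                            # predicted answer: scores.argmax()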
Next, let's look at the model architecture proposed in this paper, shown in the figure below:
The first half of the model is exactly the same as above. The difference is that this paper introduces an "attention over attention" mechanism: after the contextual embeddings of the document and the query are obtained, the query words are not merged into a single vector; instead, the document matrix and the query matrix are multiplied directly to form a pair-wise matching matrix. Softmax is then applied to this matrix along its two dimensions (columns and rows) to obtain a document-level attention for each query word and a query-level attention for each document word. The query-level attention rows are averaged to produce one weight per query word, and these weights are applied to the document-level attention matrix via a dot product to yield the final attention over the document.
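Here is the corresponding minimal NumPy sketch of the attention-over-attention step (again my own illustration, not the authors' code), reusing the same attention-sum idea as above:
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aoa_scores(h_doc, h_query, doc_word_ids, vocab_size):
    # h_doc: [doc_len, 2h], h_query: [query_len, 2h] contextual embeddings.
    M = h_doc @ h_query.T               # pair-wise matching matrix, [doc_len, query_len]
    alpha = softmax(M, axis=0)          # column-wise: document-level attention per query word
    beta = softmax(M, axis=1)           # row-wise: query-level attention per document word
    beta_avg = beta.mean(axis=0)        # average the rows -> one weight per query word
    s = alpha @ beta_avg                # attended document-level attention, [doc_len]
    scores = np.zeros(vocab_size)
    np.add.at(scores, doc_word_ids, s)  # attention sum over repeated candidates
    return scores
Compared with the AS Reader sketch above, the only change is how the per-position attention s over the document is computed; the final attention-sum step is identical.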
Code Implementation of the Model
In fact, implementing the model in TensorFlow is quite simple: the recurrent layers are just GRUCells from tf.contrib.rnn; the real difficulty lies in processing and reading the data. Two implementations on GitHub are worth referring to: OlavHN and marshmelloX. The first uses TensorFlow's built-in data-reading API and the code is very concise and clear; when I find time I will study how it works and write it up in a separate post. The second uses a more traditional data-processing approach and is also worth a look; in addition, preprocessing code for the CNN dataset and others can be found on GitHub and studied together. However, both implementations target older versions of TensorFlow, so with TF 1.0 and above some functions are incompatible. I made some modifications based on the first implementation so that it runs on version 1.0; the code will go up on my GitHub later, and you are welcome to take a look. Training has been running on the server for four or five days and has not finished yet. The figure below is a screenshot of the results so far:
The four values printed on each line are the step, the loss, the accuracy, and the elapsed time, respectively. You can see that the accuracy is not very stable, but it basically reaches the level reported in the paper. Below is the model code after my modifications; the modeling part in particular is quite simple and takes only a few lines:
import os
import time
import random

import numpy as np
import tensorflow as tf
from tensorflow.python.ops import sparse_ops

from util import softmax, orthogonal_initializer

flags = tf.app.flags
FLAGS = flags.FLAGS

flags.DEFINE_integer('vocab_size', 119662, 'Vocabulary size')
flags.DEFINE_integer('embedding_size', 384, 'Embedding dimension')
flags.DEFINE_integer('hidden_size', 256, 'Hidden units')
flags.DEFINE_integer('batch_size', 32, 'Batch size')
flags.DEFINE_integer('epochs', 2, 'Number of epochs to train/test')
flags.DEFINE_boolean('training', True, 'Training or testing a model')
flags.DEFINE_string('name', 'lc_model', 'Model name (used for statistics and model path)')
flags.DEFINE_float('dropout_keep_prob', 0.9, 'Keep prob for embedding dropout')
flags.DEFINE_float('l2_reg', 0.0001, 'l2 regularization for embeddings')

model_path = 'models/' + FLAGS.name

if not os.path.exists(model_path):
    os.makedirs(model_path)


def read_records(index=0):
    # Read <document, query, answer> examples from the TFRecord files.
    train_queue = tf.train.string_input_producer(['training.tfrecords'], num_epochs=FLAGS.epochs)
    validation_queue = tf.train.string_input_producer(['validation.tfrecords'], num_epochs=FLAGS.epochs)
    test_queue = tf.train.string_input_producer(['test.tfrecords'], num_epochs=FLAGS.epochs)

    queue = tf.QueueBase.from_list(index, [train_queue, validation_queue, test_queue])
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(queue)
    features = tf.parse_single_example(serialized_example, features={
        'document': tf.VarLenFeature(tf.int64),
        'query': tf.VarLenFeature(tf.int64),
        'answer': tf.FixedLenFeature([], tf.int64)
    })

    document = sparse_ops.serialize_sparse(features['document'])
    query = sparse_ops.serialize_sparse(features['query'])
    answer = features['answer']

    document_batch_serialized, query_batch_serialized, answer_batch = tf.train.shuffle_batch(
        [document, query, answer],
        batch_size=FLAGS.batch_size,
        capacity=2000,
        min_after_dequeue=1000)

    sparse_document_batch = sparse_ops.deserialize_many_sparse(document_batch_serialized, dtype=tf.int64)
    sparse_query_batch = sparse_ops.deserialize_many_sparse(query_batch_serialized, dtype=tf.int64)

    document_batch = tf.sparse_tensor_to_dense(sparse_document_batch)
    document_weights = tf.sparse_to_dense(sparse_document_batch.indices, sparse_document_batch.dense_shape, 1)

    query_batch = tf.sparse_tensor_to_dense(sparse_query_batch)
    query_weights = tf.sparse_to_dense(sparse_query_batch.indices, sparse_query_batch.dense_shape, 1)

    return document_batch, document_weights, query_batch, query_weights, answer_batch


def inference(documents, doc_mask, query, query_mask):
    embedding = tf.get_variable('embedding',
                                [FLAGS.vocab_size, FLAGS.embedding_size],
                                initializer=tf.random_uniform_initializer(minval=-0.05, maxval=0.05))

    regularizer = tf.nn.l2_loss(embedding)

    doc_emb = tf.nn.dropout(tf.nn.embedding_lookup(embedding, documents), FLAGS.dropout_keep_prob)
    doc_emb.set_shape([None, None, FLAGS.embedding_size])

    query_emb = tf.nn.dropout(tf.nn.embedding_lookup(embedding, query), FLAGS.dropout_keep_prob)
    query_emb.set_shape([None, None, FLAGS.embedding_size])

    # Bidirectional GRU encoders for the document and the query.
    with tf.variable_scope('document', initializer=orthogonal_initializer()):
        fwd_cell = tf.contrib.rnn.GRUCell(FLAGS.hidden_size)
        back_cell = tf.contrib.rnn.GRUCell(FLAGS.hidden_size)

        doc_len = tf.reduce_sum(doc_mask, reduction_indices=1)
        h, _ = tf.nn.bidirectional_dynamic_rnn(
            fwd_cell, back_cell, doc_emb, sequence_length=tf.to_int64(doc_len), dtype=tf.float32)
        # h_doc = tf.nn.dropout(tf.concat(2, h), FLAGS.dropout_keep_prob)
        h_doc = tf.concat(h, 2)

    with tf.variable_scope('query', initializer=orthogonal_initializer()):
        fwd_cell = tf.contrib.rnn.GRUCell(FLAGS.hidden_size)
        back_cell = tf.contrib.rnn.GRUCell(FLAGS.hidden_size)

        query_len = tf.reduce_sum(query_mask, reduction_indices=1)
        h, _ = tf.nn.bidirectional_dynamic_rnn(
            fwd_cell, back_cell, query_emb, sequence_length=tf.to_int64(query_len), dtype=tf.float32)
        # h_query = tf.nn.dropout(tf.concat(2, h), FLAGS.dropout_keep_prob)
        h_query = tf.concat(h, 2)

    # Pair-wise matching matrix and the two masked softmax attentions.
    M = tf.matmul(h_doc, h_query, adjoint_b=True)
    M_mask = tf.to_float(tf.matmul(tf.expand_dims(doc_mask, -1), tf.expand_dims(query_mask, 1)))

    alpha = softmax(M, 1, M_mask)
    beta = softmax(M, 2, M_mask)

    # query_importance = tf.expand_dims(tf.reduce_mean(beta, reduction_indices=1), -1)
    query_importance = tf.expand_dims(tf.reduce_sum(beta, 1) / tf.to_float(tf.expand_dims(doc_len, -1)), -1)

    s = tf.squeeze(tf.matmul(alpha, query_importance), [2])

    # Attention sum: merge the scores of repeated candidate words.
    unpacked_s = zip(tf.unstack(s, FLAGS.batch_size), tf.unstack(documents, FLAGS.batch_size))
    y_hat = tf.stack([tf.unsorted_segment_sum(attentions, sentence_ids, FLAGS.vocab_size)
                      for (attentions, sentence_ids) in unpacked_s])

    return y_hat, regularizer


def train(y_hat, regularizer, document, doc_weight, answer):
    # Trick while we wait for tf.gather_nd - https://github.com/tensorflow/tensorflow/issues/206
    # This unfortunately causes us to expand a sparse tensor into the full vocabulary
    index = tf.range(0, FLAGS.batch_size) * FLAGS.vocab_size + tf.to_int32(answer)
    flat = tf.reshape(y_hat, [-1])
    relevant = tf.gather(flat, index)

    # mean cause reg is independent of batch size
    loss = -tf.reduce_mean(tf.log(relevant)) + FLAGS.l2_reg * regularizer

    global_step = tf.Variable(0, name="global_step", trainable=False)

    accuracy = tf.reduce_mean(tf.to_float(tf.equal(tf.argmax(y_hat, 1), answer)))

    optimizer = tf.train.AdamOptimizer()
    grads_and_vars = optimizer.compute_gradients(loss)
    capped_grads_and_vars = [(tf.clip_by_value(grad, -5, 5), var) for (grad, var) in grads_and_vars]
    train_op = optimizer.apply_gradients(capped_grads_and_vars, global_step=global_step)

    tf.summary.scalar('loss', loss)
    tf.summary.scalar('accuracy', accuracy)

    return loss, train_op, global_step, accuracy


def main():
    dataset = tf.placeholder_with_default(0, [])
    document_batch, document_weights, query_batch, query_weights, answer_batch = read_records(dataset)

    y_hat, reg = inference(document_batch, document_weights, query_batch, query_weights)
    loss, train_op, global_step, accuracy = train(y_hat, reg, document_batch, document_weights, answer_batch)
    summary_op = tf.summary.merge_all()

    with tf.Session() as sess:
        summary_writer = tf.summary.FileWriter(model_path, sess.graph)
        saver_variables = tf.all_variables()
        if not FLAGS.training:
            saver_variables = filter(lambda var: var.name != 'input_producer/limit_epochs/epochs:0', saver_variables)
            saver_variables = filter(lambda var: var.name != 'smooth_acc:0', saver_variables)
            saver_variables = filter(lambda var: var.name != 'avg_acc:0', saver_variables)
        saver = tf.train.Saver(saver_variables)

        sess.run([tf.initialize_all_variables(), tf.initialize_local_variables()])
        model = tf.train.latest_checkpoint(model_path)
        if model:
            print('Restoring ' + model)
            saver.restore(sess, model)

        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)

        start_time = time.time()
        accumulated_accuracy = 0
        try:
            if FLAGS.training:
                while not coord.should_stop():
                    loss_t, _, step, acc = sess.run([loss, train_op, global_step, accuracy], feed_dict={dataset: 0})
                    elapsed_time, start_time = time.time() - start_time, time.time()
                    print(step, loss_t, acc, elapsed_time)
                    if step % 100 == 0:
                        summary_str = sess.run(summary_op)
                        summary_writer.add_summary(summary_str, step)
                    if step % 1000 == 0:
                        saver.save(sess, model_path + '/aoa', global_step=step)
            else:
                step = 0
                while not coord.should_stop():
                    acc = sess.run(accuracy, feed_dict={dataset: 2})
                    step += 1
                    accumulated_accuracy += (acc - accumulated_accuracy) / step
                    elapsed_time, start_time = time.time() - start_time, time.time()
                    print(accumulated_accuracy, acc, elapsed_time)
        except tf.errors.OutOfRangeError:
            print('Done!')
        finally:
            coord.request_stop()
            coord.join(threads)


'''
import pickle
with open('counter.pickle', 'r') as f:
    counter = pickle.load(f)
word, _ = zip(*counter.most_common())
'''

if __name__ == "__main__":
    main()
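The code above imports softmax and orthogonal_initializer from util, which is not listed in this post. As a rough sketch (based only on how they are called in the model, so the real util.py may differ), they could look like this: a softmax along a given axis that zeroes out padded positions with a mask, and an SVD-based orthogonal weight initializer.
import numpy as np
import tensorflow as tf

def softmax(target, axis, mask, epsilon=1e-12):
    # Numerically stable softmax over `axis`; `mask` zeroes out padded positions.
    max_axis = tf.reduce_max(target, axis, keep_dims=True)
    target_exp = tf.exp(target - max_axis) * mask
    normalize = tf.reduce_sum(target_exp, axis, keep_dims=True)
    return target_exp / (normalize + epsilon)

def orthogonal_initializer(scale=1.1):
    # Orthogonal initialization via SVD of a random Gaussian matrix,
    # often used for recurrent weight matrices.
    def _initializer(shape, dtype=tf.float32, partition_info=None):
        flat_shape = (shape[0], int(np.prod(shape[1:])))
        a = np.random.normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        q = u if u.shape == flat_shape else v
        return tf.constant(scale * q.reshape(shape), dtype=dtype)
    return _initializer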