The principle and implementation of the attention-over-attention neural network model for reading comprehension


This article contains my reading notes on "Attention-over-Attention Neural Networks for Reading Comprehension". The task addressed in the paper is cloze-style reading comprehension. Its model architecture builds on "Text Understanding with the Attention Sum Reader Network": that earlier paper first proposed applying attention to the cloze task, and this paper adds an extra attention layer on top of it, removing the need for heuristics and some of the hyperparameter tuning. I will introduce the two papers together below.

Data Set

First, the datasets. The current large-scale datasets are mainly CNN/Daily Mail and the Children's Book Test (CBTest). The first two are news datasets: an entire news article serves as the cloze text (document), a word is removed from its news summary to form the query, and the removed word is the answer. The named entities in the document are replaced with identifiers @entity1, @entity2, and so on. In each sample file, the first line is the web page URL (not used), the third line is the document, the fifth line is the query, and the seventh line is the answer, with the named entities replaced:
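For illustration only (a made-up toy, not a real corpus entry), such a file looks roughly like this:

http://...   (the URL line)

@entity0 met @entity1 at the @entity2 summit on tuesday , promising closer ties between the two countries

@entity0 met @placeholder at the @entity2 summit

@entity1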

The CBT dataset is built from children's books. Because there are no summaries to draw on, it is constructed differently: the first 20 consecutive sentences are taken as the document and the 21st sentence as the query. It is then divided into four subsets according to the part of speech of the answer: named entities (NE), common nouns (CN), verbs, and prepositions. However, the latter two kinds of answers do not depend very closely on the text (people can often fill in the blank without reading the passage at all), so usually only the first two subsets are used.
Finally, each piece of data is constructed as a triple:

<D, Q, A>
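After tokenization you can picture each example as three integer arrays; a hand-made toy (the ids are invented for illustration, and the implementation below actually stores the triples as TFRecords):

d = [52, 7, 613, 9, 52, 88]   # document D as word ids
q = [613, 4, 0]               # query Q as word ids, with 0 standing for the blank (an assumed convention)
a = 52                        # answer A: the id of the removed word; note it must occur in d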
Model

First we can look at the model architecture presented in the paper "Text Understanding with the Attention Sum Reader Network", as shown in the following figure:

As the figure shows, the model first obtains a word vector e(w) for each word in the document and the query through an embedding matrix V. Next, two encoder networks compute a contextual embedding for each word of the text and a single representation vector for the query; both encoders are bidirectional GRU recurrent networks. The query vector is then combined with the contextual embedding of each word by a dot product, and the result can be regarded as the weight of each word with respect to the query, i.e. an attention score. Finally, a softmax converts the weights into normalized probabilities, and the most probable word is taken as the answer to the query. Because a word can occur in the document several times, the probabilities of all of its occurrences are summed; this "attention sum" is what gives the model its name.
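As a toy sketch of this scoring step (my own illustration in numpy, not code from the paper; names and shapes are assumptions):

import numpy as np

def as_reader_answer(h_doc, q, doc_ids, vocab_size):
    # h_doc:   (doc_len, 2*hidden) contextual embedding of each document word
    # q:       (2*hidden,)         representation vector of the whole query
    # doc_ids: (doc_len,)          vocabulary id of each document word
    logits = h_doc @ q                   # dot product: one score per position
    attn = np.exp(logits - logits.max())
    attn /= attn.sum()                   # softmax over document positions
    probs = np.zeros(vocab_size)
    np.add.at(probs, doc_ids, attn)      # attention sum over repeated words
    return probs.argmax()                # the most probable word is the answer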
Next, let's look at the model architecture presented in this paper, as shown in the following illustration:

The first half of the model is exactly the same as above. The difference is that this paper proposes an "attention over attention" mechanism: after the contextual vectors of the document and the query are obtained, the query words are not merged into a single vector; instead, the query matrix is multiplied directly with the document matrix, giving a pairwise matching matrix. Softmax is then applied to this matrix along its two dimensions, yielding a document-level attention for each query word (column-wise) and a query-level attention for each document word (row-wise). The query-level attentions are averaged into a single weight vector, and the document-level attention matrix is dot-multiplied with it, so the final answer distribution combines the individual document-level attentions weighted by the query-level attention.

Code Implementation of the Model
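A minimal numpy sketch of this computation (again my own illustration, with assumed names): let h_doc be the document states with shape (N, 2h) and h_query the query states with shape (M, 2h).

import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aoa_attention(h_doc, h_query):
    M = h_doc @ h_query.T          # (N, M) pairwise matching scores
    alpha = softmax(M, axis=0)     # column-wise: document-level attention per query word
    beta = softmax(M, axis=1)      # row-wise: query-level attention per document word
    beta_avg = beta.mean(axis=0)   # (M,) averaged query-level attention
    s = alpha @ beta_avg           # (N,) final attention over document words
    return s                       # fed into the same attention-sum step as before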

In fact, implementing the model with TensorFlow is quite simple: the model itself just calls GRUCell under tf.contrib.rnn; the difficulty lies in the data processing and reading. Two implementations on GitHub are worth consulting: OlavHN and marshmelloX. The first uses TF's built-in data-reading APIs, and the code is very concise and clear; when I have time I will study how it works and write it up in a separate post. The second uses traditional data processing, which is also worth a look, and the preprocessing code for CNN and the other datasets can be found on GitHub to study alongside it. However, both implementations target older TF versions, so with TF 1.0 and above some functions are incompatible. I made some modifications based on the first implementation so that it runs on version 1.0. The code will go up on my GitHub later; feel free to take a look. Training takes four or five days on the server and has not finished yet; the screenshot below shows the intermediate results:


The four values printed at each step are the step count, loss, accuracy, and elapsed time. You can see that the accuracy is not yet very stable, but it basically reaches the level reported in the paper. Now let's look at the model code after my modifications; the modeling part in particular is quite simple, implemented in only a few lines:

import os
import time
import random

import numpy as np
import tensorflow as tf
from tensorflow.python.ops import sparse_ops
from util import softmax, orthogonal_initializer

flags = tf.app.flags
FLAGS = flags.FLAGS

flags.DEFINE_integer('vocab_size', 119662, 'Vocabulary size')
flags.DEFINE_integer('embedding_size', 384, 'Embedding dimension')
flags.DEFINE_integer('hidden_size', 256, 'Hidden units')  # value garbled in the original post; 256 matches the paper
flags.DEFINE_integer('batch_size', 32, 'Batch size')      # value garbled in the original post; 32 is an assumption
flags.DEFINE_integer('epochs', 2, 'Number of epochs to train/test')
flags.DEFINE_boolean('training', True, 'Training or testing a model')
flags.DEFINE_string('name', 'lc_model', 'Model name (used for statistics and model path)')
flags.DEFINE_float('dropout_keep_prob', 0.9, 'Keep prob for embedding dropout')
flags.DEFINE_float('l2_reg', 0.0001, 'l2 regularization for embeddings')

model_path = 'models/' + FLAGS.name

if not os.path.exists(model_path):
  os.makedirs(model_path)

def read_records(index=0):
  # one queue per split; `index` selects which split this run reads from
  train_queue = tf.train.string_input_producer(['training.tfrecords'], num_epochs=FLAGS.epochs)
  validation_queue = tf.train.string_input_producer(['validation.tfrecords'], num_epochs=FLAGS.epochs)
  test_queue = tf.train.string_input_producer(['test.tfrecords'], num_epochs=FLAGS.epochs)

  queue = tf.QueueBase.from_list(index, [train_queue, validation_queue, test_queue])
  reader = tf.TFRecordReader()
  _, serialized_example = reader.read(queue)
  features = tf.parse_single_example(serialized_example, features={
      'document': tf.VarLenFeature(tf.int64),
      'query': tf.VarLenFeature(tf.int64),
      'answer': tf.FixedLenFeature([], tf.int64)
  })

  document = sparse_ops.serialize_sparse(features['document'])
  query = sparse_ops.serialize_sparse(features['query'])
  answer = features['answer']

  document_batch_serialized, query_batch_serialized, answer_batch = tf.train.shuffle_batch(
      [document, query, answer],
      batch_size=FLAGS.batch_size,
      capacity=2000,
      min_after_dequeue=1000)

  sparse_document_batch = sparse_ops.deserialize_many_sparse(document_batch_serialized, dtype=tf.int64)
  sparse_query_batch = sparse_ops.deserialize_many_sparse(query_batch_serialized, dtype=tf.int64)

  document_batch = tf.sparse_tensor_to_dense(sparse_document_batch)
  document_weights = tf.sparse_to_dense(sparse_document_batch.indices, sparse_document_batch.dense_shape, 1)

  query_batch = tf.sparse_tensor_to_dense(sparse_query_batch)
  query_weights = tf.sparse_to_dense(sparse_query_batch.indices, sparse_query_batch.dense_shape, 1)

  return document_batch, document_weights, query_batch, query_weights, answer_batch

def inference(documents, doc_mask, query, query_mask):
  embedding = tf.get_variable('embedding',
      [FLAGS.vocab_size, FLAGS.embedding_size],
      initializer=tf.random_uniform_initializer(minval=-0.05, maxval=0.05))

  regularizer = tf.nn.l2_loss(embedding)

  doc_emb = tf.nn.dropout(tf.nn.embedding_lookup(embedding, documents), FLAGS.dropout_keep_prob)
  doc_emb.set_shape([None, None, FLAGS.embedding_size])

  query_emb = tf.nn.dropout(tf.nn.embedding_lookup(embedding, query), FLAGS.dropout_keep_prob)
  query_emb.set_shape([None, None, FLAGS.embedding_size])

  # contextual embeddings: bidirectional GRU over the document
  with tf.variable_scope('document', initializer=orthogonal_initializer()):
    fwd_cell = tf.contrib.rnn.GRUCell(FLAGS.hidden_size)
    back_cell = tf.contrib.rnn.GRUCell(FLAGS.hidden_size)

    doc_len = tf.reduce_sum(doc_mask, reduction_indices=1)
    h, _ = tf.nn.bidirectional_dynamic_rnn(
        fwd_cell, back_cell, doc_emb, sequence_length=tf.to_int64(doc_len), dtype=tf.float32)
    #h_doc = tf.nn.dropout(tf.concat(2, h), FLAGS.dropout_keep_prob)
    h_doc = tf.concat(h, 2)

  # contextual embeddings: bidirectional GRU over the query
  with tf.variable_scope('query', initializer=orthogonal_initializer()):
    fwd_cell = tf.contrib.rnn.GRUCell(FLAGS.hidden_size)
    back_cell = tf.contrib.rnn.GRUCell(FLAGS.hidden_size)

    query_len = tf.reduce_sum(query_mask, reduction_indices=1)
    h, _ = tf.nn.bidirectional_dynamic_rnn(
        fwd_cell, back_cell, query_emb, sequence_length=tf.to_int64(query_len), dtype=tf.float32)
    #h_query = tf.nn.dropout(tf.concat(2, h), FLAGS.dropout_keep_prob)
    h_query = tf.concat(h, 2)

  # pairwise matching matrix and the two attentions (attention over attention)
  M = tf.matmul(h_doc, h_query, adjoint_b=True)
  M_mask = tf.to_float(tf.matmul(tf.expand_dims(doc_mask, -1), tf.expand_dims(query_mask, 1)))

  alpha = softmax(M, 1, M_mask)   # document-level attention, per query word
  beta = softmax(M, 2, M_mask)    # query-level attention, per document word

  #query_importance = tf.expand_dims(tf.reduce_mean(beta, reduction_indices=1), -1)
  query_importance = tf.expand_dims(tf.reduce_sum(beta, 1) / tf.to_float(tf.expand_dims(doc_len, -1)), -1)

  s = tf.squeeze(tf.matmul(alpha, query_importance), [2])

  # attention sum: accumulate the scores of repeated words over the vocabulary
  unpacked_s = zip(tf.unstack(s, FLAGS.batch_size), tf.unstack(documents, FLAGS.batch_size))
  y_hat = tf.stack([tf.unsorted_segment_sum(attentions, sentence_ids, FLAGS.vocab_size)
                    for (attentions, sentence_ids) in unpacked_s])

  return y_hat, regularizer

def train(y_hat, regularizer, document, doc_weight, answer):
  # Trick while we wait for tf.gather_nd - https://github.com/tensorflow/tensorflow/issues/206
  # This unfortunately causes us to expand a sparse tensor into the full vocabulary
  index = tf.range(0, FLAGS.batch_size) * FLAGS.vocab_size + tf.to_int32(answer)
  flat = tf.reshape(y_hat, [-1])
  relevant = tf.gather(flat, index)

  # mean cause reg is independent of batch size
  loss = -tf.reduce_mean(tf.log(relevant)) + FLAGS.l2_reg * regularizer

  global_step = tf.Variable(0, name='global_step', trainable=False)

  accuracy = tf.reduce_mean(tf.to_float(tf.equal(tf.argmax(y_hat, 1), answer)))

  optimizer = tf.train.AdamOptimizer()
  grads_and_vars = optimizer.compute_gradients(loss)
  capped_grads_and_vars = [(tf.clip_by_value(grad, -5, 5), var) for (grad, var) in grads_and_vars]
  train_op = optimizer.apply_gradients(capped_grads_and_vars, global_step=global_step)

  tf.summary.scalar('loss', loss)
  tf.summary.scalar('accuracy', accuracy)

  return loss, train_op, global_step, accuracy

def main():
  dataset = tf.placeholder_with_default(0, [])
  document_batch, document_weights, query_batch, query_weights, answer_batch = read_records(dataset)

  y_hat, reg = inference(document_batch, document_weights, query_batch, query_weights)
  loss, train_op, global_step, accuracy = train(y_hat, reg, document_batch, document_weights, answer_batch)
  summary_op = tf.summary.merge_all()

  with tf.Session() as sess:
    summary_writer = tf.summary.FileWriter(model_path, sess.graph)
    saver_variables = tf.all_variables()
    if not FLAGS.training:
      saver_variables = filter(lambda var: var.name != 'input_producer/limit_epochs/epochs:0', saver_variables)
      saver_variables = filter(lambda var: var.name != 'smooth_acc:0', saver_variables)
      saver_variables = filter(lambda var: var.name != 'avg_acc:0', saver_variables)
    saver = tf.train.Saver(saver_variables)

    sess.run([
        tf.initialize_all_variables(),
        tf.initialize_local_variables()])
    model = tf.train.latest_checkpoint(model_path)
    if model:
      print('Restoring ' + model)
      saver.restore(sess, model)

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    start_time = time.time()
    accumulated_accuracy = 0
    try:
      if FLAGS.training:
        while not coord.should_stop():
          loss_t, _, step, acc = sess.run([loss, train_op, global_step, accuracy], feed_dict={dataset: 0})
          elapsed_time, start_time = time.time() - start_time, time.time()
          print(step, loss_t, acc, elapsed_time)
          if step % 100 == 0:  # interval garbled in the original post; 100 is an assumption
            summary_str = sess.run(summary_op)
            summary_writer.add_summary(summary_str, step)
          if step % 1000 == 0:
            saver.save(sess, model_path + '/aoa', global_step=step)
      else:
        step = 0
        while not coord.should_stop():
          acc = sess.run(accuracy, feed_dict={dataset: 2})
          step += 1
          accumulated_accuracy += (acc - accumulated_accuracy) / step
          elapsed_time, start_time = time.time() - start_time, time.time()
          print(accumulated_accuracy, acc, elapsed_time)
    except tf.errors.OutOfRangeError:
      print('Done!')
    finally:
      coord.request_stop()

    coord.join(threads)

'''
import pickle
with open('counter.pickle', 'r') as f:
  counter = pickle.load(f)
word, _ = zip(*counter.most_common())
'''

if __name__ == '__main__':
  main()
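The code above imports softmax and orthogonal_initializer from a util module that is not shown. Judging from how softmax is called, with (M, axis, M_mask), it must be a masked, numerically stable softmax along the given axis; a minimal sketch under that assumption (the actual util.py may differ):

def softmax(target, axis, mask, epsilon=1e-12):
    # subtract the max for stability, zero out padded positions, renormalize
    max_axis = tf.reduce_max(target, axis, keep_dims=True)
    target_exp = tf.exp(target - max_axis) * mask
    normalize = tf.reduce_sum(target_exp, axis, keep_dims=True)
    return target_exp / (normalize + epsilon)

With the flags defined at the top of the script, training starts with python model.py and evaluation with python model.py --training=False (assuming the script is saved as model.py).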
