Learning notes TF019: sequence classification, IMDB movie review classification
Sequence classification predicts a category label for an entire input sequence. Sentiment analysis, for example, predicts a user's attitude toward the topic they write about; similar models can predict election results or product and movie ratings.
We use the movie review dataset from the Internet Movie Database. The target value is binary: positive or negative. Reviews contain plenty of negation, slang, and ambiguous phrasing, so you cannot simply check whether certain words appear. Instead, we build a recurrent network over word vectors that reads each review word by word and train a sentiment classifier on the whole review.
The IMDB review dataset comes from the Stanford University AI Lab: http://ai.stanford.edu/~amaas/data/sentiment/. Extract the tar archive; the positive and negative reviews sit in two folders of text files. Use a regular expression to extract plain-text tokens and convert all letters to lowercase.
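To make the tokenization step concrete, here is a minimal sketch of the regular-expression approach on a single made-up review (the sample string is illustrative; the same pattern appears in the full code further down):

import re

# Keep runs of letters and common punctuation marks as separate tokens.
TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

review = "This movie was NOT good, I expected so much more!"
tokens = [token.lower() for token in TOKEN_REGEX.findall(review)]
print(tokens)
# ['this', 'movie', 'was', 'not', 'good', ',', 'i', 'expected', 'so', 'much', 'more', '!']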
Word embeddings give a richer semantic representation than one-hot encoding. The vocabulary maps each word to an index, which is used to look up the right word vector. All sequences are padded to the same length, and multiple reviews are fed to the network in batches.
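A minimal sketch of the lookup-and-pad step, using a toy vocabulary and a random embedding matrix as stand-ins (the real notes load both from the Wikipedia vocabulary and embeddings referenced in the code below):

import numpy as np

embedding_dim = 4
vocabulary = {'<unknown>': 0, 'this': 1, 'movie': 2, 'was': 3, 'great': 4}
embedding_matrix = np.random.rand(len(vocabulary), embedding_dim)  # toy stand-in

def embed_sequence(tokens, max_length):
    # Unknown words fall back to index 0; short reviews stay zero-padded at the end.
    data = np.zeros((max_length, embedding_dim))
    indices = [vocabulary.get(token, 0) for token in tokens]
    data[:len(tokens)] = embedding_matrix[indices]
    return data

batch = np.stack([
    embed_sequence(['this', 'movie', 'was', 'great'], max_length=6),
    embed_sequence(['great', 'movie'], max_length=6),
])
print(batch.shape)  # (2, 6, 4): reviews x time steps x word vector dimensions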
Sequence classification model. Two placeholders serve as inputs: one for the input data (the sequence) and one for the target value (the sentiment). A params object carries the configuration, including the optimizer.
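A minimal sketch of these inputs (TensorFlow 1.x, as in the rest of these notes; the shapes and values here are illustrative and mirror the training script further down):

import tensorflow as tf

class AttrDict(dict):
    # Tiny stand-in for the helpers.AttrDict used by the full code below.
    __getattr__ = dict.__getitem__

max_length, embedding_dimensions = 500, 300  # illustrative values

# One placeholder for the padded word-vector sequences, one for the sentiment.
data = tf.placeholder(tf.float32, [None, max_length, embedding_dimensions])
target = tf.placeholder(tf.float32, [None, 2])  # one-hot: positive / negative

params = AttrDict(
    rnn_cell=tf.contrib.rnn.GRUCell,
    rnn_hidden=300,
    optimizer=tf.train.RMSPropOptimizer(0.002),
    batch_size=20,
)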
The sequence length of the current batch is computed dynamically. The data arrives as a single tensor, with every sequence zero-padded to the length of the longest review. We take the maximum absolute value of each word vector: a zero (padding) vector reduces to the scalar 0, while a real word vector reduces to a positive real number. tf.sign() then discretizes these to 0 or 1, and summing along the time axis gives each sequence's length. The resulting tensor has one entry per batch element, each scalar being that sequence's length.
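Here is the length trick in isolation, as a minimal TensorFlow 1.x sketch with an illustrative toy batch:

import numpy as np
import tensorflow as tf

data = tf.placeholder(tf.float32, [None, 5, 3])  # batch x time_steps x word_vectors

# Padded steps are all-zero vectors, so their maximum absolute value is 0;
# tf.sign() maps real word vectors to 1 and padding steps to 0.
used = tf.sign(tf.reduce_max(tf.abs(data), reduction_indices=2))
length = tf.cast(tf.reduce_sum(used, reduction_indices=1), tf.int32)

batch = np.zeros((2, 5, 3))
batch[0, :4] = 0.5  # first review: 4 real time steps
batch[1, :2] = 0.5  # second review: 2 real time steps
with tf.Session() as sess:
    print(sess.run(length, {data: batch}))  # [4 2]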
The params object defines the cell type and the number of hidden units. The sequence_length argument tells the RNN how many time steps of each batch element to unroll at most. We take the final activation of each sequence and feed it to a softmax layer. Because every review has a different length, the last relevant output of each sequence in the batch sits at a different index. The RNN output has shape sequences x time_steps x output_size. We flatten the first two dimensions, then tf.gather() indexes along the resulting first dimension; adding length - 1 to each sequence's offset selects its last valid time step.
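The same indexing trick in isolation, sketched on a small dummy output tensor (TensorFlow 1.x; this mirrors the _last_relevant helper in the full code below):

import numpy as np
import tensorflow as tf

def last_relevant(output, length):
    # output: batch x time_steps x output_size; length: valid steps per sequence.
    batch_size = tf.shape(output)[0]
    max_length = int(output.get_shape()[1])
    output_size = int(output.get_shape()[2])
    # Flatten to (batch * time_steps) x output_size, then pick row length - 1
    # within each sequence's block of rows.
    index = tf.range(0, batch_size) * max_length + (length - 1)
    flat = tf.reshape(output, [-1, output_size])
    return tf.gather(flat, index)

output = tf.constant(np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3))
length = tf.constant([2, 4])
with tf.Session() as sess:
    print(sess.run(last_relevant(output, length)))
    # time step 1 of the first sequence and time step 3 of the second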
Gradient clipping keeps gradient values within a reasonable range. The cross-entropy cost used here is meaningful for any classification problem where the model outputs a probability distribution over all classes. Adding gradient clipping improves learning by limiting the maximum weight update. RNNs are hard to train: if the hyperparameters are not well matched, the weights easily diverge.
TensorFlow supports this through the optimizer instance's compute_gradients function; you can then modify the gradients and apply the weight changes with apply_gradients. If a gradient component is smaller than -limit, set it to -limit; if it is larger than limit, set it to limit. A TensorFlow gradient can also be None, meaning the variable has no relation to the cost function. Mathematically it should be a zero vector, but None allows internal performance optimizations, so we only need to pass the None value back unchanged.
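A minimal sketch of that clipping pattern on a toy loss, assuming TensorFlow 1.x and an illustrative limit of 1.0:

import tensorflow as tf

x = tf.Variable(5.0)
loss = tf.square(x)
optimizer = tf.train.RMSPropOptimizer(0.002)

limit = 1.0  # illustrative clipping bound
gradients = optimizer.compute_gradients(loss)
# Clip every gradient component into [-limit, limit]; pass None gradients
# through untouched, since None marks variables unrelated to the cost.
gradients = [
    (tf.clip_by_value(g, -limit, limit), v) if g is not None else (None, v)
    for g, v in gradients]
train_op = optimizer.apply_gradients(gradients)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)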
Movie reviews are fed to the recurrent network word by word, so each time step is a batch of word vectors. A batching helper looks up the word vectors and pads all sequences to the same length. To train the model: define the hyperparameters, load the dataset and word vectors, and run the model on the preprocessed training batches. Successful training depends on the network structure, the hyperparameters, and the quality of the word vectors. Pre-trained word vectors can be loaded from the skip-gram word2vec project (https://code.google.com/archive/p/word2vec/) or from the Stanford NLP group's GloVe model (https://nlp.stanford.edu/projects/glove/).
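The training script below imports a preprocess_batched helper that these notes do not show. Here is a minimal sketch of what such a helper might look like, under the assumption that it embeds each review, one-hot encodes the label, and yields fixed-size batches (the behavior is a reconstruction, not taken from the source):

import numpy as np

def preprocess_batched(reviews, length, embedding, batch_size):
    # Hypothetical reconstruction: pair embedded reviews with one-hot labels.
    iterator = iter(reviews)
    while True:
        data = np.zeros((batch_size, length, embedding.dimensions))
        target = np.zeros((batch_size, 2))
        for index in range(batch_size):
            try:
                text, positive = next(iterator)
            except StopIteration:
                return  # drop the final partial batch for simplicity
            data[index] = embedding(text)
            target[index] = [1, 0] if positive else [0, 1]
        yield data, target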
Kaggle runs an open learning competition (https://kaggle.com/c/word2vec-nlp-tutorial) on the same IMDB review data, so you can compare your prediction results with other people's.
import tarfile
import re

from helpers import download


class ImdbMovieReviews:

    DEFAULT_URL = \
        'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    def __init__(self, cache_dir, url=None):
        self._cache_dir = cache_dir
        self._url = url or type(self).DEFAULT_URL

    def __iter__(self):
        filepath = download(self._url, self._cache_dir)
        with tarfile.open(filepath) as archive:
            for filename in archive.getnames():
                if filename.startswith('aclImdb/train/pos/'):
                    yield self._read(archive, filename), True
                elif filename.startswith('aclImdb/train/neg/'):
                    yield self._read(archive, filename), False

    def _read(self, archive, filename):
        with archive.extractfile(filename) as file_:
            data = file_.read().decode('utf-8')
            data = type(self).TOKEN_REGEX.findall(data)
            data = [x.lower() for x in data]
            return data


import bz2
import numpy as np


class Embedding:

    def __init__(self, vocabulary_path, embedding_path, length):
        self._embedding = np.load(embedding_path)
        with bz2.open(vocabulary_path, 'rt') as file_:
            self._vocabulary = {k.strip(): i for i, k in enumerate(file_)}
        self._length = length

    def __call__(self, sequence):
        data = np.zeros((self._length, self._embedding.shape[1]))
        indices = [self._vocabulary.get(x, 0) for x in sequence]
        embedded = self._embedding[indices]
        data[:len(sequence)] = embedded
        return data

    @property
    def dimensions(self):
        return self._embedding.shape[1]


import tensorflow as tf

from helpers import lazy_property


class SequenceClassificationModel:

    def __init__(self, data, target, params):
        self.data = data
        self.target = target
        self.params = params
        self.prediction
        self.cost
        self.error
        self.optimize

    @lazy_property
    def length(self):
        used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
        length = tf.reduce_sum(used, reduction_indices=1)
        length = tf.cast(length, tf.int32)
        return length

    @lazy_property
    def prediction(self):
        # Recurrent network.
        output, _ = tf.nn.dynamic_rnn(
            self.params.rnn_cell(self.params.rnn_hidden),
            self.data,
            dtype=tf.float32,
            sequence_length=self.length,
        )
        last = self._last_relevant(output, self.length)
        # Softmax layer.
        num_classes = int(self.target.get_shape()[1])
        weight = tf.Variable(tf.truncated_normal(
            [self.params.rnn_hidden, num_classes], stddev=0.01))
        bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
        prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
        return prediction

    @lazy_property
    def cost(self):
        cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
        return cross_entropy

    @lazy_property
    def error(self):
        mistakes = tf.not_equal(
            tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
        return tf.reduce_mean(tf.cast(mistakes, tf.float32))

    @lazy_property
    def optimize(self):
        gradient = self.params.optimizer.compute_gradients(self.cost)
        try:
            limit = self.params.gradient_clipping
            gradient = [
                (tf.clip_by_value(g, -limit, limit), v)
                if g is not None else (None, v)
                for g, v in gradient]
        except AttributeError:
            print('No gradient clipping parameter specified.')
        optimize = self.params.optimizer.apply_gradients(gradient)
        return optimize

    @staticmethod
    def _last_relevant(output, length):
        batch_size = tf.shape(output)[0]
        max_length = int(output.get_shape()[1])
        output_size = int(output.get_shape()[2])
        index = tf.range(0, batch_size) * max_length + (length - 1)
        flat = tf.reshape(output, [-1, output_size])
        relevant = tf.gather(flat, index)
        return relevant


import tensorflow as tf

from helpers import AttrDict
from Embedding import Embedding
from ImdbMovieReviews import ImdbMovieReviews
from preprocess_batched import preprocess_batched
from SequenceClassificationModel import SequenceClassificationModel

IMDB_DOWNLOAD_DIR = './imdb'
WIKI_VOCAB_DIR = '../01_wikipedia/wikipedia'
WIKI_EMBED_DIR = '../01_wikipedia/wikipedia'

params = AttrDict(
    rnn_cell=tf.contrib.rnn.GRUCell,
    rnn_hidden=300,
    optimizer=tf.train.RMSPropOptimizer(0.002),
    batch_size=20,
)

reviews = ImdbMovieReviews(IMDB_DOWNLOAD_DIR)
length = max(len(x[0]) for x in reviews)
embedding = Embedding(
    WIKI_VOCAB_DIR + '/vocabulary.bz2',
    WIKI_EMBED_DIR + '/embeddings.npy',
    length)
batches = preprocess_batched(reviews, length, embedding, params.batch_size)

data = tf.placeholder(tf.float32, [None, length, embedding.dimensions])
target = tf.placeholder(tf.float32, [None, 2])
model = SequenceClassificationModel(data, target, params)

sess = tf.Session()
sess.run(tf.initialize_all_variables())
for index, batch in enumerate(batches):
    feed = {data: batch[0], target: batch[1]}
    error, _ = sess.run([model.error, model.optimize], feed)
    print('{}: {:3.1f}%'.format(index + 1, 100 * error))
References:
TensorFlow for Machine Intelligence