Learning Notes TF019: Sequence Classification, IMDB Movie Review Classification, tf019imdb

Source: Internet
Author: User
Tags: stanford nlp


Sequence classification predicts a category label for an entire input sequence. Sentiment analysis predicts a user's attitude toward the topic of a piece of text; the same idea can be used to predict election results or product and movie ratings.

The dataset consists of movie reviews from the Internet Movie Database (IMDB). The target value is binary: positive or negative. The language is full of negation, slang, and ambiguous expressions, so you cannot simply check whether individual words appear. Instead, build a recurrent network over word vectors, feed each review in word by word, and train a classifier that predicts the sentiment of the whole review.

The IMDB review dataset comes from the Stanford University Artificial Intelligence Laboratory: http://ai.stanford.edu/~amaas/data/sentiment/. It ships as a compressed tar archive, with positive and negative reviews stored as text files in two separate folders. Extract the plain text with a regular expression and convert all letters to lowercase.
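As a rough illustration of this tokenization step, here is a minimal sketch that reuses the regular expression and lowercasing from the listing at the end of these notes (the sample review string is made up):

    import re

    # keep words and a few punctuation marks; everything else is dropped
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    review = "A surprisingly good film. NOT what I expected!"
    tokens = [x.lower() for x in TOKEN_REGEX.findall(review)]
    print(tokens)  # ['a', 'surprisingly', 'good', 'film', '.', 'not', 'what', 'i', 'expected', '!']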

Word-vector (embedding) representations capture the semantics of a word more richly than one-hot encoding. The vocabulary maps each word to an index, which is used to look up the correct word vector. Sequences are padded to the same length so that multiple reviews can be sent to the network in batches.
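A minimal sketch of the lookup-and-pad step, assuming a toy vocabulary and a random embedding matrix (both invented here purely for illustration; unknown words fall back to index 0, as in the Embedding class below):

    import numpy as np

    # toy vocabulary and embedding matrix, assumed for illustration only
    vocabulary = {'<unk>': 0, 'good': 1, 'film': 2}
    embedding_matrix = np.random.rand(3, 4).astype(np.float32)  # one 4-dimensional vector per word

    def embed(tokens, max_length):
        data = np.zeros((max_length, embedding_matrix.shape[1]), np.float32)
        indices = [vocabulary.get(t, 0) for t in tokens]      # unknown words map to index 0
        data[:len(tokens)] = embedding_matrix[indices]        # trailing rows remain zero padding
        return data

    print(embed(['good', 'film'], max_length=5).shape)        # (5, 4)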

The sequence classification model takes two placeholders as input: one for the input data (the sequence) and one for the target value (the sentiment). A params object passes in configuration such as the RNN cell type and the optimizer.
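For reference, the two placeholders look like this; the maximum review length and word-vector size below are assumed values, whereas in the training script at the end they are derived from the data and the embedding:

    import tensorflow as tf

    max_length = 500        # assumed longest review length, in tokens
    embedding_size = 300    # assumed word-vector dimensionality

    # batch of padded word-vector sequences and one-hot sentiment targets
    data = tf.placeholder(tf.float32, [None, max_length, embedding_size])
    target = tf.placeholder(tf.float32, [None, 2])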

The sequence length of the current batch is computed dynamically. The batch is a single tensor in which each sequence is padded with zeros up to the length of the longest review. Take the maximum of the absolute values of each word vector: for a zero (padding) vector this is the scalar 0, for a real word vector it is a real number greater than 0. tf.sign() then discretizes these values to 0 or 1, and summing the result along the time axis gives the sequence length. The resulting tensor has one entry per sequence in the batch, each a scalar holding that sequence's length.
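A minimal sketch of that length computation (the reduction_indices arguments follow the older TensorFlow API used throughout these notes; the word-vector size of 300 is an assumption):

    import tensorflow as tf

    # batch_size x max_length x word_vector_size; padded time steps are all-zero vectors
    data = tf.placeholder(tf.float32, [None, None, 300])

    # 1 for any time step whose word vector has a non-zero component, 0 for padding
    used = tf.sign(tf.reduce_max(tf.abs(data), reduction_indices=2))
    # summing over the time axis counts the valid time steps of each sequence
    length = tf.cast(tf.reduce_sum(used, reduction_indices=1), tf.int32)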

The params object defines the RNN cell type and the number of hidden units. The length property tells the RNN how many time steps of each batch row to unroll. The final activation of each sequence is extracted and fed to a softmax layer. Because every review has a different length, the last relevant output of each sequence in the batch sits at a different index. An index is therefore built over the time-step dimension (the batch has shape sequences x time_steps x word_vectors). tf.gather() only indexes along the first dimension, so the output (shape sequences x time_steps x word_vectors) is flattened across its first two dimensions, the sequence offsets are added, and length - 1 selects the last valid time step, as in the sketch below.
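This indexing trick as a standalone sketch; it mirrors the _last_relevant helper in the full listing at the end:

    import tensorflow as tf

    def last_relevant(output, length):
        # output: batch_size x max_length x output_size, length: one scalar per sequence
        batch_size = tf.shape(output)[0]
        max_length = int(output.get_shape()[1])
        output_size = int(output.get_shape()[2])
        # flatten batch and time, then pick row sequence * max_length + (length - 1)
        index = tf.range(0, batch_size) * max_length + (length - 1)
        flat = tf.reshape(output, [-1, output_size])
        return tf.gather(flat, index)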

Gradient clipping keeps gradient values within a reasonable range. Any cost function that makes sense for classification can be used, because the model outputs a probability distribution over the available classes. Adding gradient clipping improves learning by limiting the maximum weight update. RNNs are hard to train: if the hyperparameters are not well matched, the weights diverge easily.

TensorFlow supports this through the optimizer instance's compute_gradients function for derivation, manual modification of the gradients, and the apply_gradients function for applying the weight changes. If a gradient component is smaller than -limit it is set to -limit; if it is larger than limit it is set to limit. A TensorFlow gradient can also be None, meaning the variable has no relation to the cost function; mathematically it should be a zero vector, but returning None is better for internal performance optimization, so the None value is simply passed through.
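A minimal sketch of this clipping step, assuming an existing optimizer instance, a cost tensor, and a clipping limit (the helper name clipped_optimize is invented for illustration):

    import tensorflow as tf

    def clipped_optimize(optimizer, cost, limit):
        gradients = optimizer.compute_gradients(cost)
        # clip every gradient component into [-limit, limit]; pass None gradients through unchanged
        gradients = [(tf.clip_by_value(g, -limit, limit), v) if g is not None else (None, v)
                     for g, v in gradients]
        return optimizer.apply_gradients(gradients)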

Movie reviews are fed to the recurrent network one by one, with each time step being a word vector, and grouped into batches. A batching function looks up the word vectors and pads all sequences to the same length. To train the model: define the hyperparameters, load the dataset and word vectors, and run the model on the preprocessed batches. Successful training depends on the network structure, the hyperparameters, and the quality of the word vectors. Pre-trained word vectors can be loaded from the skip-gram word2vec project (https://code.google.com/archive/p/word2vec/) or from the Stanford NLP Group's GloVe model (https://nlp.stanford.edu/projects/glove/).
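The preprocess_batched helper imported in the training script is not reproduced in these notes; a plausible sketch, whose exact signature and one-hot encoding here are assumptions, could look like this:

    import numpy as np

    def preprocess_batched(reviews, length, embedding, batch_size):
        # hypothetical helper: turn (tokens, label) pairs into padded, embedded batches
        batch_data, batch_target = [], []
        for tokens, label in reviews:
            batch_data.append(embedding(tokens))                # pad and look up word vectors
            batch_target.append([1, 0] if label else [0, 1])    # one-hot sentiment target
            if len(batch_data) == batch_size:
                yield np.array(batch_data), np.array(batch_target)
                batch_data, batch_target = [], []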

The Kaggle open learning competition (https://kaggle.com/c/word2vec-nlp-tutorial) uses the same IMDB review data, so you can compare your predictions with other people's results.

    # ImdbMovieReviews: download the archive and yield (tokens, label) pairs.
    import tarfile
    import re

    from helpers import download


    class ImdbMovieReviews:

        DEFAULT_URL = \
            'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
        TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

        def __init__(self, cache_dir, url=None):
            self._cache_dir = cache_dir
            self._url = url or type(self).DEFAULT_URL

        def __iter__(self):
            filepath = download(self._url, self._cache_dir)
            with tarfile.open(filepath) as archive:
                for filename in archive.getnames():
                    if filename.startswith('aclImdb/train/pos/'):
                        yield self._read(archive, filename), True
                    elif filename.startswith('aclImdb/train/neg/'):
                        yield self._read(archive, filename), False

        def _read(self, archive, filename):
            with archive.extractfile(filename) as file_:
                data = file_.read().decode('utf-8')
                data = type(self).TOKEN_REGEX.findall(data)
                data = [x.lower() for x in data]
                return data


    # Embedding: look up word vectors and pad every sequence to a fixed length.
    import bz2
    import numpy as np


    class Embedding:

        def __init__(self, vocabulary_path, embedding_path, length):
            self._embedding = np.load(embedding_path)
            with bz2.open(vocabulary_path, 'rt') as file_:
                self._vocabulary = {k.strip(): i for i, k in enumerate(file_)}
            self._length = length

        def __call__(self, sequence):
            data = np.zeros((self._length, self._embedding.shape[1]))
            indices = [self._vocabulary.get(x, 0) for x in sequence]
            embedded = self._embedding[indices]
            data[:len(sequence)] = embedded
            return data

        @property
        def dimensions(self):
            return self._embedding.shape[1]


    # SequenceClassificationModel: RNN over word vectors plus a softmax layer.
    import tensorflow as tf

    from helpers import lazy_property


    class SequenceClassificationModel:

        def __init__(self, data, target, params):
            self.data = data
            self.target = target
            self.params = params
            self.prediction
            self.cost
            self.error
            self.optimize

        @lazy_property
        def length(self):
            used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
            length = tf.reduce_sum(used, reduction_indices=1)
            length = tf.cast(length, tf.int32)
            return length

        @lazy_property
        def prediction(self):
            # Recurrent network.
            output, _ = tf.nn.dynamic_rnn(
                self.params.rnn_cell(self.params.rnn_hidden),
                self.data,
                dtype=tf.float32,
                sequence_length=self.length,
            )
            last = self._last_relevant(output, self.length)
            # Softmax layer.
            num_classes = int(self.target.get_shape()[1])
            weight = tf.Variable(tf.truncated_normal(
                [self.params.rnn_hidden, num_classes], stddev=0.01))
            bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
            prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
            return prediction

        @lazy_property
        def cost(self):
            cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
            return cross_entropy

        @lazy_property
        def error(self):
            mistakes = tf.not_equal(
                tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
            return tf.reduce_mean(tf.cast(mistakes, tf.float32))

        @lazy_property
        def optimize(self):
            gradient = self.params.optimizer.compute_gradients(self.cost)
            try:
                limit = self.params.gradient_clipping
                gradient = [
                    (tf.clip_by_value(g, -limit, limit), v)
                    if g is not None else (None, v)
                    for g, v in gradient]
            except AttributeError:
                print('No gradient clipping parameter specified.')
            optimize = self.params.optimizer.apply_gradients(gradient)
            return optimize

        @staticmethod
        def _last_relevant(output, length):
            batch_size = tf.shape(output)[0]
            max_length = int(output.get_shape()[1])
            output_size = int(output.get_shape()[2])
            index = tf.range(0, batch_size) * max_length + (length - 1)
            flat = tf.reshape(output, [-1, output_size])
            relevant = tf.gather(flat, index)
            return relevant


    # Training script: load data and word vectors, build the model, train on batches.
    import tensorflow as tf

    from helpers import AttrDict
    from Embedding import Embedding
    from ImdbMovieReviews import ImdbMovieReviews
    from preprocess_batched import preprocess_batched
    from SequenceClassificationModel import SequenceClassificationModel

    IMDB_DOWNLOAD_DIR = './imdb'
    WIKI_VOCAB_DIR = '../01_wikipedia/wikipedia'
    WIKI_EMBED_DIR = '../01_wikipedia/wikipedia'

    params = AttrDict(
        rnn_cell=tf.contrib.rnn.GRUCell,
        rnn_hidden=300,
        optimizer=tf.train.RMSPropOptimizer(0.002),
        batch_size=20,
    )

    reviews = ImdbMovieReviews(IMDB_DOWNLOAD_DIR)
    length = max(len(x[0]) for x in reviews)

    embedding = Embedding(
        WIKI_VOCAB_DIR + '/vocabulary.bz2',
        WIKI_EMBED_DIR + '/embeddings.npy', length)
    batches = preprocess_batched(reviews, length, embedding, params.batch_size)

    data = tf.placeholder(tf.float32, [None, length, embedding.dimensions])
    target = tf.placeholder(tf.float32, [None, 2])
    model = SequenceClassificationModel(data, target, params)

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    for index, batch in enumerate(batches):
        feed = {data: batch[0], target: batch[1]}
        error, _ = sess.run([model.error, model.optimize], feed)
        print('{}: {:3.1f}%'.format(index + 1, 100 * error))

 

References:
TensorFlow for Machine Intelligence

Welcome to join me: qingxingfengzi
My public account: qingxingfengzigz
My wife Zhang Xingqing's public account: qingqingfeifangz
