Learning notes TF019: sequence classification, IMDB movie review classification


Sequence classification predicts a category label for an entire input sequence. Sentiment analysis, for example, predicts a user's attitude toward the topic of a text they wrote; other uses include predicting election results or product and movie ratings.

The movie review dataset comes from the Internet Movie Database (IMDB). The target value is binary: positive or negative. The reviews are full of negation, slang, and ambiguous phrasing, so simply checking whether individual words appear is not enough. Instead, build a recurrent network over word vectors, feed it each review word by word, and train a classifier that predicts the sentiment of the entire review.

The IMDB review dataset is published by the Stanford University AI Lab: http://ai.stanford.edu/~amaas/data/sentiment/. Unpack the tar archive and read the positive and negative reviews from the text files in the two folders. Extract plain text with a regular expression and convert all letters to lowercase.

A word-vector embedding is semantically richer than a one-hot encoding of the word. The vocabulary maps each word to an index, which is used to look up the correct word vector. Sequences are padded to a common length so that multiple reviews can be sent to the network in batches.
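
To make the lookup-and-pad step concrete, here is a minimal sketch with a made-up five-word vocabulary and random word vectors; the Embedding class in the code further below is the version actually used.

    import numpy as np

    # Hypothetical toy vocabulary and embedding matrix (5 words, 3 dimensions).
    vocabulary = {'<unk>': 0, 'great': 1, 'movie': 2, 'boring': 3, 'plot': 4}
    embeddings = np.random.rand(5, 3)

    def embed(sequence, max_length):
        # Look up the index of each word, falling back to index 0 for unknown words.
        indices = [vocabulary.get(word, 0) for word in sequence]
        # Zero-pad to a common length so reviews can be stacked into one batch tensor.
        data = np.zeros((max_length, embeddings.shape[1]))
        data[:len(indices)] = embeddings[indices]
        return data

    batch = np.stack([embed(['great', 'movie'], 6),
                      embed(['boring', 'plot', 'movie'], 6)])
    print(batch.shape)  # (2, 6, 3): sequences x time_steps x word_vectors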

The sequence classification model takes two placeholders: one for the input data (the sequences) and one for the target values (the sentiments). Configuration parameters such as the optimizer are passed in through a params object.

The sequence length of the current batch is computed dynamically. The data arrives as a single tensor, each sequence zero-padded to the length of the longest review. Take the maximum of the absolute values of the word vectors along the embedding dimension: for a zero (padding) vector the result is the scalar 0, for a real word vector it is a real number greater than 0. tf.sign() discretizes this to 0 or 1, and summing along the time steps gives the sequence length. The resulting tensor has one entry per sequence in the batch, each scalar being that sequence's length.
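
The same trick in isolation on a toy zero-padded batch (a minimal sketch, assuming TensorFlow 1.x as in the rest of these notes; the length property of the model below is the version actually used):

    import numpy as np
    import tensorflow as tf

    # Toy batch: 2 sequences, 4 time steps, 3-dimensional word vectors.
    # The first sequence has 3 real steps, the second only 2; the rest is padding.
    batch = np.zeros((2, 4, 3), dtype=np.float32)
    batch[0, :3] = 1.0
    batch[1, :2] = 1.0

    data = tf.constant(batch)
    # 1 for a real word vector, 0 for an all-zero padding vector.
    used = tf.sign(tf.reduce_max(tf.abs(data), reduction_indices=2))
    # Summing along the time axis yields one length per sequence.
    length = tf.cast(tf.reduce_sum(used, reduction_indices=1), tf.int32)

    with tf.Session() as sess:
        print(sess.run(length))  # [3 2]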

Use the params object to define the cell type and the number of hidden units. The length property tells the RNN how many time steps of each batch row to process. To classify, fetch the last activation of each sequence and feed it into a softmax layer. Because every review has a different length, the last relevant output of each sequence in the batch sits at a different index along the time-step dimension (the batch has shape sequences x time_steps x word_vectors). tf.gather() only indexes along the first dimension, so first flatten the first two dimensions of the output, then add each sequence's offset: range(batch_size) * max_length plus length - 1 selects the last valid time step of every sequence.
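
A toy illustration of the flatten-and-gather indexing (a minimal sketch, again assuming TensorFlow 1.x; the _last_relevant method in the code below is the real version):

    import numpy as np
    import tensorflow as tf

    # Toy RNN output: 2 sequences, 4 time steps, 5 output units.
    output = tf.constant(
        np.arange(2 * 4 * 5, dtype=np.float32).reshape(2, 4, 5))
    length = tf.constant([3, 2])  # valid time steps per sequence

    batch_size = tf.shape(output)[0]
    max_length = int(output.get_shape()[1])
    output_size = int(output.get_shape()[2])
    # Flatten the first two dimensions, then pick row (length - 1) of each sequence.
    index = tf.range(0, batch_size) * max_length + (length - 1)
    flat = tf.reshape(output, [-1, output_size])
    last = tf.gather(flat, index)

    with tf.Session() as sess:
        print(sess.run(last))  # last valid output of each sequence (steps 2 and 1)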

Gradient clipping keeps the gradient values within a reasonable range. Any cost function that is meaningful for classification can be used, since the model outputs a probability distribution over all classes. Adding gradient clipping further improves learning by limiting the maximum weight update. RNNs are hard to train; when the hyperparameters are poorly matched, the weights easily diverge.

TensorFlow supports this through the optimizer instance's compute_gradients function; the gradients can be modified before being applied with the apply_gradients function. If a gradient component is smaller than -limit, set it to -limit; if it is larger than limit, set it to limit. A TensorFlow gradient may be None, which means the variable has no influence on the cost function; mathematically it would be a zero vector, but passing None back enables internal performance optimizations, so we only need to return the None value as-is.
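
The clipping step in isolation, as a minimal sketch (the optimize property of the model below does the same thing, wrapped in a try/except so the gradient_clipping parameter stays optional):

    import tensorflow as tf

    def clipped_optimize(optimizer, cost, limit):
        # Compute the gradients, clip each component into [-limit, limit],
        # and pass None gradients through unchanged.
        gradients = optimizer.compute_gradients(cost)
        clipped = [
            (tf.clip_by_value(g, -limit, limit), v) if g is not None else (None, v)
            for g, v in gradients]
        return optimizer.apply_gradients(clipped)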

Movie reviews are fed into the recurrent network one at a time, each time step consisting of one word vector, and these form the batch data. The batched preprocessing function looks up the word vectors and pads all sequences to the same length (a minimal sketch of such a helper follows below). Training the model means defining the hyperparameters, loading the dataset and word vectors, and running the model on the preprocessed training batches. Successful training depends on the network structure, the hyperparameters, and the quality of the word vectors. Pre-trained word vectors can be loaded from the skip-gram model of the word2vec project (https://code.google.com/archive/p/word2vec/) or from the Stanford NLP group's GloVe model (https://nlp.stanford.edu/projects/glove/).
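
The preprocess_batched helper is imported in the code below but not shown there. Here is a minimal sketch of what such a function could look like, assuming it embeds each review with the Embedding class and produces one-hot sentiment targets; the details are assumptions, not the book's exact code.

    import numpy as np

    def preprocess_batched(iterator, length, embedding, batch_size):
        # Collect (tokens, label) pairs into batches of embedded sequences and
        # one-hot targets; an incomplete final batch is dropped. The length is
        # already baked into the Embedding instance and only kept to match the
        # call site.
        batch = []
        for text, label in iterator:
            batch.append((text, label))
            if len(batch) == batch_size:
                data = np.array([embedding(text) for text, _ in batch])
                target = np.array(
                    [[1, 0] if label else [0, 1] for _, label in batch])
                yield data, target
                batch = []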

Kaggle hosts an open learning competition on this IMDB review data (https://kaggle.com/c/word2vec-nlp-tutorial), so you can compare your prediction results with others.

    import tarfile
    import re

    from helpers import download


    class ImdbMovieReviews:

        DEFAULT_URL = \
            'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
        TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

        def __init__(self, cache_dir, url=None):
            self._cache_dir = cache_dir
            self._url = url or type(self).DEFAULT_URL

        def __iter__(self):
            filepath = download(self._url, self._cache_dir)
            with tarfile.open(filepath) as archive:
                for filename in archive.getnames():
                    if filename.startswith('aclImdb/train/pos/'):
                        yield self._read(archive, filename), True
                    elif filename.startswith('aclImdb/train/neg/'):
                        yield self._read(archive, filename), False

        def _read(self, archive, filename):
            with archive.extractfile(filename) as file_:
                data = file_.read().decode('utf-8')
                data = type(self).TOKEN_REGEX.findall(data)
                data = [x.lower() for x in data]
                return data


    import bz2
    import numpy as np


    class Embedding:

        def __init__(self, vocabulary_path, embedding_path, length):
            self._embedding = np.load(embedding_path)
            with bz2.open(vocabulary_path, 'rt') as file_:
                self._vocabulary = {k.strip(): i for i, k in enumerate(file_)}
            self._length = length

        def __call__(self, sequence):
            data = np.zeros((self._length, self._embedding.shape[1]))
            indices = [self._vocabulary.get(x, 0) for x in sequence]
            embedded = self._embedding[indices]
            data[:len(sequence)] = embedded
            return data

        @property
        def dimensions(self):
            return self._embedding.shape[1]


    import tensorflow as tf

    from helpers import lazy_property


    class SequenceClassificationModel:

        def __init__(self, data, target, params):
            self.data = data
            self.target = target
            self.params = params
            self.prediction
            self.cost
            self.error
            self.optimize

        @lazy_property
        def length(self):
            used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
            length = tf.reduce_sum(used, reduction_indices=1)
            length = tf.cast(length, tf.int32)
            return length

        @lazy_property
        def prediction(self):
            # Recurrent network.
            output, _ = tf.nn.dynamic_rnn(
                self.params.rnn_cell(self.params.rnn_hidden),
                self.data,
                dtype=tf.float32,
                sequence_length=self.length,
            )
            last = self._last_relevant(output, self.length)
            # Softmax layer.
            num_classes = int(self.target.get_shape()[1])
            weight = tf.Variable(tf.truncated_normal(
                [self.params.rnn_hidden, num_classes], stddev=0.01))
            bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
            prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
            return prediction

        @lazy_property
        def cost(self):
            cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
            return cross_entropy

        @lazy_property
        def error(self):
            mistakes = tf.not_equal(
                tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
            return tf.reduce_mean(tf.cast(mistakes, tf.float32))

        @lazy_property
        def optimize(self):
            gradient = self.params.optimizer.compute_gradients(self.cost)
            try:
                limit = self.params.gradient_clipping
                gradient = [
                    (tf.clip_by_value(g, -limit, limit), v)
                    if g is not None else (None, v)
                    for g, v in gradient]
            except AttributeError:
                print('No gradient clipping parameter specified.')
            optimize = self.params.optimizer.apply_gradients(gradient)
            return optimize

        @staticmethod
        def _last_relevant(output, length):
            batch_size = tf.shape(output)[0]
            max_length = int(output.get_shape()[1])
            output_size = int(output.get_shape()[2])
            index = tf.range(0, batch_size) * max_length + (length - 1)
            flat = tf.reshape(output, [-1, output_size])
            relevant = tf.gather(flat, index)
            return relevant


    import tensorflow as tf

    from helpers import AttrDict
    from Embedding import Embedding
    from ImdbMovieReviews import ImdbMovieReviews
    from preprocess_batched import preprocess_batched
    from SequenceClassificationModel import SequenceClassificationModel

    IMDB_DOWNLOAD_DIR = './imdb'
    WIKI_VOCAB_DIR = '../01_wikipedia/wikipedia'
    WIKI_EMBED_DIR = '../01_wikipedia/wikipedia'

    params = AttrDict(
        rnn_cell=tf.contrib.rnn.GRUCell,
        rnn_hidden=300,
        optimizer=tf.train.RMSPropOptimizer(0.002),
        batch_size=20,
    )

    reviews = ImdbMovieReviews(IMDB_DOWNLOAD_DIR)
    length = max(len(x[0]) for x in reviews)
    embedding = Embedding(
        WIKI_VOCAB_DIR + '/vocabulary.bz2',
        WIKI_EMBED_DIR + '/embeddings.npy', length)
    batches = preprocess_batched(reviews, length, embedding, params.batch_size)

    data = tf.placeholder(tf.float32, [None, length, embedding.dimensions])
    target = tf.placeholder(tf.float32, [None, 2])
    model = SequenceClassificationModel(data, target, params)

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    for index, batch in enumerate(batches):
        feed = {data: batch[0], target: batch[1]}
        error, _ = sess.run([model.error, model.optimize], feed)
        print('{}: {:3.1f}%'.format(index + 1, 100 * error))
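
The helpers module (download, lazy_property, AttrDict) is imported above but not shown either. A minimal sketch of plausible implementations, assuming the usual behaviour (a caching property decorator, attribute-style dictionary access, and a cached download); these are assumptions rather than the book's exact code.

    import functools
    import os
    import urllib.request

    def lazy_property(function):
        # Evaluate the wrapped property once and cache the result on the instance,
        # so the graph nodes are only constructed a single time.
        attribute = '_lazy_' + function.__name__
        @property
        @functools.wraps(function)
        def wrapper(self):
            if not hasattr(self, attribute):
                setattr(self, attribute, function(self))
            return getattr(self, attribute)
        return wrapper

    class AttrDict(dict):
        # Dictionary whose keys can also be read as attributes (params.rnn_hidden).
        def __getattr__(self, key):
            try:
                return self[key]
            except KeyError:
                raise AttributeError(key)

    def download(url, cache_dir):
        # Download url into cache_dir unless the file is already cached.
        os.makedirs(cache_dir, exist_ok=True)
        filepath = os.path.join(cache_dir, url.split('/')[-1])
        if not os.path.isfile(filepath):
            urllib.request.urlretrieve(url, filepath)
        return filepath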

 

References:
TensorFlow for Machine Intelligence

You are welcome to connect with me: qingxingfengzi
My public account: qingxingfengzigz
My wife Zhang Xingqing's public account: qingqingfeifangz
