Learning notes TF019: Sequence classification, IMDB movie review classification

Source: Internet
Author: User
Tags: bz2, scalar, Stanford NLP

Sequence classification predicts a single category label for an entire input sequence. Sentiment analysis is one example: it predicts the attitude a user expresses toward the topic of a text, and the same idea can be used to predict election results or product and movie ratings.

We use the International Movie Database (IMDB) movie review dataset. The target value is binary: positive or negative. The language contains a lot of negation, irony, and ambiguity, so it is not enough to check whether individual words appear. Instead, we build a recurrent network over word vectors, feed each review into it word by word, and train a classifier on the activation of the last time step to predict the sentiment of the whole review.

IMDB movie review dataset: http://ai.stanford.edu/~amaas/data/sentiment/ from the AI Lab at Stanford University. It ships as a compressed tar archive; positive and negative reviews are text files in two separate folders. Use regular expressions to extract plain text and convert all letters to lowercase.
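As a minimal sketch of that preprocessing step (the full ImdbMovieReviews class appears in the listing at the end of these notes), the tokenization could look like this:

    import re

    # Extract words and basic punctuation, then lowercase everything; the
    # pattern mirrors TOKEN_REGEX from the full listing below.
    TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

    def tokenize(review_text):
        tokens = TOKEN_REGEX.findall(review_text)
        return [token.lower() for token in tokens]

    print(tokenize('This movie was NOT good, was it?'))
    # ['this', 'movie', 'was', 'not', 'good', ',', 'was', 'it', '?']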

Word vector embeddings carry richer meaning than one-hot encoded words. The vocabulary determines the index of each word, which is used to look up the correct word vector. Sequences are padded to the same length so that multiple movie reviews can be fed into the network as batch data.
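A small sketch of the lookup-and-pad step, using a toy vocabulary and embedding matrix rather than the real pre-trained embeddings assumed by the listing below:

    import numpy as np

    # Toy vocabulary and embedding matrix; unknown words map to index 0.
    vocabulary = {'<unknown>': 0, 'good': 1, 'movie': 2}
    embedding_matrix = np.random.rand(3, 4)   # 3 words, 4-dimensional vectors
    max_length = 5

    def embed_sequence(tokens):
        # Look up the vector of every word and zero-pad up to max_length.
        data = np.zeros((max_length, embedding_matrix.shape[1]))
        indices = [vocabulary.get(token, 0) for token in tokens]
        data[:len(indices)] = embedding_matrix[indices]
        return data   # shape (max_length, dimensions), zero-padded at the end

    print(embed_sequence(['good', 'movie']).shape)   # (5, 4)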

The sequence labelling model takes two placeholders: one for the input data (the sequences) and one for the target values (the sentiments). It also receives a params object carrying configuration parameters such as the optimizer.
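A minimal sketch of that wiring, assuming AttrDict is any simple attribute-style dictionary helper (the real one is imported from helpers in the listing below) and using example values for the sequence length and embedding size:

    import tensorflow as tf

    class AttrDict(dict):
        # Hypothetical stand-in for the AttrDict helper from the listing;
        # attribute access falls back to dictionary keys.
        def __getattr__(self, name):
            try:
                return self[name]
            except KeyError:
                raise AttributeError(name)

    params = AttrDict(
        rnn_cell=tf.contrib.rnn.GRUCell,
        rnn_hidden=300,
        optimizer=tf.train.RMSPropOptimizer(0.002),
        batch_size=20,
    )

    max_length, embedding_dimensions = 500, 300   # example values only

    # Two placeholders: padded word-vector sequences and one-hot sentiment targets.
    data = tf.placeholder(tf.float32, [None, max_length, embedding_dimensions])
    target = tf.placeholder(tf.float32, [None, 2])
    # data, target, and params are then passed to the model constructor.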

The model dynamically computes the lengths of the sequences in the current batch. The data arrives as a single tensor, with each sequence zero-padded to the length of the longest movie review. Take the absolute maximum over each word vector: padding vectors reduce to the scalar 0, while real word vectors reduce to a real scalar greater than 0. tf.sign() then maps these to the discrete values 0 or 1. Summing the result along the time-step dimension gives the sequence lengths: a tensor with the same size as the batch, where each scalar is one sequence length.
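A toy check of this length trick in TF 1.x style (the batch values and padding here are made up for illustration):

    import numpy as np
    import tensorflow as tf

    # Two sequences of 2-dimensional word vectors, padded to 3 time steps.
    data = tf.constant(np.array([
        [[0.5, -0.2], [0.1, 0.3], [0.0, 0.0]],   # real length 2
        [[0.9,  0.1], [0.0, 0.0], [0.0, 0.0]],   # real length 1
    ], dtype=np.float32))

    # 1 for real time steps, 0 for padding; sum over time gives the length.
    used = tf.sign(tf.reduce_max(tf.abs(data), reduction_indices=2))
    length = tf.cast(tf.reduce_sum(used, reduction_indices=1), tf.int32)

    with tf.Session() as sess:
        print(sess.run(length))   # [2 1]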

The params object defines the cell type and number of hidden units. The length property limits the number of time steps the RNN unrolls for the batch data. We take the last activation of each sequence and feed it into a softmax layer. Because every movie review has a different length, the last relevant RNN output sits at a different index for each sequence in the batch. The index is built along the time-step dimension (the batch data has shape sequences × time_steps × word_vectors). tf.gather() indexes along the first dimension, so we flatten the first two dimensions of the output activations and add length - 1 to each sequence's offset to select its last available time step.
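A toy demonstration of this indexing (the tensor values and lengths are made up; the same logic appears as _last_relevant in the listing below):

    import numpy as np
    import tensorflow as tf

    # Fake RNN output of shape (sequences=2, time_steps=3, output_size=4).
    output = tf.constant(np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4))
    length = tf.constant([2, 3])          # true lengths of the two sequences

    batch_size = tf.shape(output)[0]
    max_length = int(output.get_shape()[1])
    output_size = int(output.get_shape()[2])
    # Flatten the first two dimensions and gather row sequence * time_steps + (length - 1).
    index = tf.range(0, batch_size) * max_length + (length - 1)
    flat = tf.reshape(output, [-1, output_size])
    relevant = tf.gather(flat, index)

    with tf.Session() as sess:
        # Rows for time step 1 of sequence 0 and time step 2 of sequence 1.
        print(sess.run(relevant))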

Gradient clipping keeps the gradient values within a reasonable range. The cost is the standard cross-entropy, which is meaningful for any classification problem where the model output is a probability distribution over all classes. Adding gradient clipping improves learning by limiting the maximum weight update. RNNs are difficult to train, and with poorly chosen hyper-parameters the weights diverge very easily.

TensorFlow optimizer instances provide a compute_gradients function to derive the gradients, which we can then modify, and an apply_gradients function to apply the resulting weight changes. Gradient components smaller than -limit are set to -limit; components larger than limit are set to limit. A TensorFlow derivative can be None, which means the variable has no relation to the cost function; mathematically it should be a zero vector, but returning None helps internal performance optimizations, so we simply pass the None value back unchanged.
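A minimal, self-contained sketch of this clipping pattern on a toy cost; the limit value here is just an example:

    import tensorflow as tf

    weight = tf.Variable(3.0)
    cost = tf.square(weight)                  # toy cost, gradient = 2 * weight
    limit = 1.0                               # example clipping limit
    optimizer = tf.train.RMSPropOptimizer(0.002)

    gradients = optimizer.compute_gradients(cost)
    # Clip each gradient component to [-limit, limit]; keep None gradients as-is.
    clipped = [
        (tf.clip_by_value(g, -limit, limit), v) if g is not None else (None, v)
        for g, v in gradients]
    train_op = optimizer.apply_gradients(clipped)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op)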

The movie reviews are fed into the recurrent network one word at a time, so each time step consists of a batch of word vectors. The batched function looks up the word vectors and pads all sequences to the same length. To train the model, define the hyper-parameters, load the dataset and word vectors, and run the model on the preprocessed training batches. Successful training depends on the network structure, the hyper-parameters, and the quality of the word vectors. Pre-trained word vectors can be loaded from the skip-gram model of the word2vec project (https://code.google.com/archive/p/word2vec/) or from the Stanford NLP group's GloVe model (https://nlp.stanford.edu/projects/glove).
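GloVe vectors ship as a plain-text file with one word and its vector per line. A hedged sketch of loading them into a vocabulary dictionary and a NumPy matrix (the file name is an example, not part of the original listing):

    import numpy as np

    def load_glove(path):
        # Each line: word followed by its space-separated vector components.
        vocabulary, vectors = {}, []
        with open(path, encoding='utf-8') as file_:
            for index, line in enumerate(file_):
                parts = line.rstrip().split(' ')
                vocabulary[parts[0]] = index
                vectors.append([float(x) for x in parts[1:]])
        return vocabulary, np.array(vectors, dtype=np.float32)

    # Example usage with a downloaded GloVe file:
    # vocabulary, embeddings = load_glove('glove.6B.300d.txt')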

The Kaggle open learning contest (https://kaggle.com/c/word2vec-nlp-tutorial) uses this IMDB movie review data, so you can compare your predictions with those of others.

Full code listing (ImdbMovieReviews, Embedding, SequenceClassificationModel, and the training script):

    import tarfile
    import re

    from helpers import download


    class ImdbMovieReviews:

        DEFAULT_URL = \
            'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
        TOKEN_REGEX = re.compile(r'[A-Za-z]+|[!?.:,()]')

        def __init__(self, cache_dir, url=None):
            self._cache_dir = cache_dir
            self._url = url or type(self).DEFAULT_URL

        def __iter__(self):
            # Download the archive and yield (tokens, label) pairs.
            filepath = download(self._url, self._cache_dir)
            with tarfile.open(filepath) as archive:
                for filename in archive.getnames():
                    if filename.startswith('aclImdb/train/pos/'):
                        yield self._read(archive, filename), True
                    elif filename.startswith('aclImdb/train/neg/'):
                        yield self._read(archive, filename), False

        def _read(self, archive, filename):
            with archive.extractfile(filename) as file_:
                data = file_.read().decode('utf-8')
                data = type(self).TOKEN_REGEX.findall(data)
                data = [x.lower() for x in data]
                return data


    import bz2
    import numpy as np


    class Embedding:

        def __init__(self, vocabulary_path, embedding_path, length):
            self._embedding = np.load(embedding_path)
            with bz2.open(vocabulary_path, 'rt') as file_:
                self._vocabulary = {k.strip(): i for i, k in enumerate(file_)}
            self._length = length

        def __call__(self, sequence):
            # Look up the vector of every word and zero-pad to the fixed length.
            data = np.zeros((self._length, self._embedding.shape[1]))
            indices = [self._vocabulary.get(x, 0) for x in sequence]
            embedded = self._embedding[indices]
            data[:len(sequence)] = embedded
            return data

        @property
        def dimensions(self):
            return self._embedding.shape[1]


    import tensorflow as tf

    from helpers import lazy_property


    class SequenceClassificationModel:

        def __init__(self, data, target, params):
            self.data = data
            self.target = target
            self.params = params
            self.prediction
            self.cost
            self.error
            self.optimize

        @lazy_property
        def length(self):
            # Padding time steps are all-zero word vectors, so their absolute
            # maximum is 0; real time steps yield 1 after tf.sign().
            used = tf.sign(tf.reduce_max(tf.abs(self.data), reduction_indices=2))
            length = tf.reduce_sum(used, reduction_indices=1)
            length = tf.cast(length, tf.int32)
            return length

        @lazy_property
        def prediction(self):
            # Recurrent network.
            output, _ = tf.nn.dynamic_rnn(
                self.params.rnn_cell(self.params.rnn_hidden),
                self.data,
                dtype=tf.float32,
                sequence_length=self.length,
            )
            last = self._last_relevant(output, self.length)
            # Softmax layer.
            num_classes = int(self.target.get_shape()[1])
            weight = tf.Variable(tf.truncated_normal(
                [self.params.rnn_hidden, num_classes], stddev=0.01))
            bias = tf.Variable(tf.constant(0.1, shape=[num_classes]))
            prediction = tf.nn.softmax(tf.matmul(last, weight) + bias)
            return prediction

        @lazy_property
        def cost(self):
            cross_entropy = -tf.reduce_sum(self.target * tf.log(self.prediction))
            return cross_entropy

        @lazy_property
        def error(self):
            mistakes = tf.not_equal(
                tf.argmax(self.target, 1), tf.argmax(self.prediction, 1))
            return tf.reduce_mean(tf.cast(mistakes, tf.float32))

        @lazy_property
        def optimize(self):
            gradient = self.params.optimizer.compute_gradients(self.cost)
            try:
                limit = self.params.gradient_clipping
                gradient = [
                    (tf.clip_by_value(g, -limit, limit), v)
                    if g is not None else (None, v)
                    for g, v in gradient]
            except AttributeError:
                print('No gradient clipping parameter specified.')
            optimize = self.params.optimizer.apply_gradients(gradient)
            return optimize

        @staticmethod
        def _last_relevant(output, length):
            # Flatten (sequences, time_steps, output_size) and gather the
            # activation at time step length - 1 of every sequence.
            batch_size = tf.shape(output)[0]
            max_length = int(output.get_shape()[1])
            output_size = int(output.get_shape()[2])
            index = tf.range(0, batch_size) * max_length + (length - 1)
            flat = tf.reshape(output, [-1, output_size])
            relevant = tf.gather(flat, index)
            return relevant


    import tensorflow as tf

    from helpers import AttrDict
    from Embedding import Embedding
    from ImdbMovieReviews import ImdbMovieReviews
    from preprocess_batched import preprocess_batched
    from SequenceClassificationModel import SequenceClassificationModel

    IMDB_DOWNLOAD_DIR = './imdb'
    WIKI_VOCAB_DIR = '../01_wikipedia/wikipedia'
    WIKI_EMBED_DIR = '../01_wikipedia/wikipedia'

    params = AttrDict(
        rnn_cell=tf.contrib.rnn.GRUCell,
        rnn_hidden=300,
        optimizer=tf.train.RMSPropOptimizer(0.002),
        batch_size=20,
    )

    reviews = ImdbMovieReviews(IMDB_DOWNLOAD_DIR)
    length = max(len(x[0]) for x in reviews)
    embedding = Embedding(
        WIKI_VOCAB_DIR + '/vocabulary.bz2',
        WIKI_EMBED_DIR + '/embeddings.npy', length)
    batches = preprocess_batched(reviews, length, embedding, params.batch_size)

    data = tf.placeholder(tf.float32, [None, length, embedding.dimensions])
    target = tf.placeholder(tf.float32, [None, 2])
    model = SequenceClassificationModel(data, target, params)

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    for index, batch in enumerate(batches):
        feed = {data: batch[0], target: batch[1]}
        error, _ = sess.run([model.error, model.optimize], feed)
        print('{}: {:3.1f}%'.format(index + 1, 100 * error))

Resources:
"TensorFlow Practice for Machine Intelligence"

Welcome to add me for discussion: Qingxingfengzi
My WeChat official account: Qingxingfengzigz
My wife Zhang Yuqing's WeChat official account: Qingqingfeifangz

