TensorFlow's Word2vec Demo analysis

Source: Internet
Author: User

The code for the basic demo is in tensorflow\tensorflow\g3doc\tutorials\word2vec\word2vec_basic.py

The idea behind the skip-gram model

http://tensorflow.org/tutorials/word2vec/index.md

See also the CS224d course materials.

The window is set to one word on each side of the center word.

The skip-gram model uses one word to predict its surrounding words (the CBOW model instead takes a series of context words as input and predicts the center word).

For example, in "the quick brown" the center word quick yields the pairs quick -> the and quick -> brown.

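The skip-gram training objective is to maximize the log-likelihood of the context word given the input word, with the probability defined by a softmax over the whole vocabulary.

However, this is too expensive: every training step costs O(vocabulary_size).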
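For reference, the full-softmax objective that this cost refers to is usually written as below. The symbols (w_I for the input word, w_O for the context word, score for the compatibility function, V for the vocabulary) are notation assumed here, not taken from the demo code.

```latex
P(w_O \mid w_I) = \frac{\exp\{\mathrm{score}(w_O, w_I)\}}{\sum_{w' \in V} \exp\{\mathrm{score}(w', w_I)\}}

J_{\mathrm{ML}} = \log P(w_O \mid w_I) = \mathrm{score}(w_O, w_I) - \log \sum_{w' \in V} \exp\{\mathrm{score}(w', w_I)\}
```

The sum over every word in V is the term that scales with the vocabulary size.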

So the demo uses the NCE (noise-contrastive estimation) approach, which relies on negative sampling: words are drawn at random to serve as negative samples. For example, in the pair quick -> sheep, sheep is a negative sample (here assuming a single negative sample is drawn).
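As a rough illustration of why sampling is cheaper, the sketch below computes a negative-sampling style loss for one (input, label) pair plus a few sampled noise words, so it reads only num_sampled + 1 output vectors instead of the whole vocabulary. It is a simplified stand-in, not TensorFlow's nce_loss (which also corrects for the noise distribution). The function names, the toy vectors, and the uniform sampler are all assumptions made for the example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(input_vec, output_vecs, label_id, num_sampled, rng):
    """Loss for one positive pair and `num_sampled` uniformly drawn noise words.

    Hypothetical helper: only num_sampled + 1 rows of output_vecs are read.
    """
    # Positive term: push the score of the true label up.
    loss = -math.log(sigmoid(dot(input_vec, output_vecs[label_id])))
    # Draw noise ids uniformly (with replacement), skipping the true label.
    noise_ids = []
    while len(noise_ids) < num_sampled:
        candidate = rng.randrange(len(output_vecs))
        if candidate != label_id:
            noise_ids.append(candidate)
    for noise_id in noise_ids:
        # Negative term: push the score of each noise word down.
        loss -= math.log(sigmoid(-dot(input_vec, output_vecs[noise_id])))
    return loss


rng = random.Random(0)
vocab_size, dim = 10, 4
output_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
input_vec = [rng.uniform(-1, 1) for _ in range(dim)]
loss = negative_sampling_loss(input_vec, output_vecs, label_id=3, num_sampled=2, rng=rng)
print(loss)
```

The cost per pair depends on num_sampled rather than on the vocabulary size, which is the point of the approach.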


    1. The input data is text that has already been split into words.
    2. Read the words into a list.
    3. Count word frequencies. Index 0 is reserved for unknown words; the dictionary size has a default (for example 50000), and any word outside the most frequent ones is mapped to unknown.

       Build a bidirectional index map: key->id and id->key.

    4. Generate a training batch.
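Steps 1 to 3 can be sketched with the standard library as follows. This is a minimal illustration rather than the demo's exact code: the function name build_vocab, the "UNK" token string, and the toy sentence are assumptions made for the example.

```python
import collections


def build_vocab(words, vocabulary_size):
    """Map the most frequent words to ids; everything else maps to id 0 (UNK)."""
    # Reserve id 0 for unknown words, then keep the top (vocabulary_size - 1) words.
    counts = [("UNK", 0)]
    counts.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: idx for idx, (word, _) in enumerate(counts)}
    # Encode the corpus as ids, falling back to 0 for out-of-vocabulary words.
    data = [dictionary.get(word, 0) for word in words]
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return data, dictionary, reverse_dictionary


words = "the quick brown fox jumps over the lazy dog the fox".split()
data, dictionary, reverse_dictionary = build_vocab(words, vocabulary_size=3)
print(data)
```

Here dictionary plays the role of the key->id map and reverse_dictionary the id->key map.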

    batch_size = 128
    embedding_size = 128  # Dimension of the embedding vector.
    skip_window = 1       # How many words to consider left and right.
    num_skips = 2         # How many times to reuse an input to generate a label.

batch_size is the number of examples processed per SGD training step, embedding_size is the dimension of the word vectors, and skip_window is the size of the window on each side of the center word.

num_skips = 2 is the number of times an input word is reused to generate a label.

The demo default is 2; setting it to 1 makes a useful comparison.

With the default of 2:

    batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
    for i in range(8):
        print(batch[i], '->', labels[i, 0])
        print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])

Sample data: [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]

    3084 -> 5239
    originated -> anarchism
    3084 -> 12
    originated -> as
    12 -> 6
    as -> a
    12 -> 3084
    as -> originated
    6 -> 195
    a -> term
    6 -> 12
    a -> as
    195 -> 2
    term -> of
    195 -> 6
    term -> a

The input 3084 appears 2 times, once for the word on each side of it (window of 1).

With num_skips set to 1:

    batch, labels = generate_batch(batch_size=8, num_skips=1, skip_window=1)
    for i in range(8):
        print(batch[i], '->', labels[i, 0])
        print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])

Sample data: [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]

    3084 -> 12
    originated -> as
    12 -> 3084
    as -> originated
    6 -> 12
    a -> as
    195 -> 2
    term -> of
    2 -> 3137
    of -> abuse
    3137 -> 46
    abuse -> first
    46 -> 59
    first -> used
    59 -> 156
    used -> against

The input 3084 appears only 1 time.

    import collections
    import random

    import numpy as np

    # data (list of word ids) and data_index (int, initially 0) are
    # module-level globals built in the earlier steps.

    # Step 4: Function to generate a training batch for the skip-gram model.
    def generate_batch(batch_size, num_skips, skip_window):
        global data_index
        assert batch_size % num_skips == 0
        assert num_skips <= 2 * skip_window
        batch = np.ndarray(shape=(batch_size), dtype=np.int32)
        labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
        span = 2 * skip_window + 1  # [ skip_window target skip_window ]
        buffer = collections.deque(maxlen=span)
        # Fill the sliding window with the first span words.
        for _ in range(span):
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)
        for i in range(batch_size // num_skips):
            target = skip_window  # target label at the center of the buffer
            targets_to_avoid = [skip_window]
            for j in range(num_skips):
                # Pick a context position that has not been used yet.
                while target in targets_to_avoid:
                    target = random.randint(0, span - 1)
                targets_to_avoid.append(target)
                batch[i * num_skips + j] = buffer[skip_window]
                labels[i * num_skips + j, 0] = buffer[target]
            # Slide the window forward by one word.
            buffer.append(data[data_index])
            data_index = (data_index + 1) % len(data)
        return batch, labels

    batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
    for i in range(8):
        print(batch[i], '->', labels[i, 0])
        print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])

For each center word, num_skips words are picked at random from within the window, producing a series of (input_id, output_id) pairs, each used as a (batch_instance, label).

These are all positive samples.
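To make the pairing concrete, here is a self-contained sketch of the same idea in plain Python, run on the sample ids shown earlier. It mirrors the logic of generate_batch but is not the demo code: skipgram_pairs is a hypothetical name, it takes the data as an argument instead of using a global index, and it picks context positions with random.sample.

```python
import random


def skipgram_pairs(data, num_centers, num_skips, skip_window, rng):
    """Return (input_id, label_id) pairs for the first `num_centers` center words."""
    assert num_skips <= 2 * skip_window
    pairs = []
    for center in range(skip_window, skip_window + num_centers):
        # Offsets of every context position in the window, excluding the center.
        offsets = [o for o in range(-skip_window, skip_window + 1) if o != 0]
        # Pick num_skips distinct context positions at random.
        for offset in rng.sample(offsets, num_skips):
            pairs.append((data[center], data[center + offset]))
    return pairs


sample = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
rng = random.Random(0)
for num_skips in (2, 1):
    pairs = skipgram_pairs(sample, 8 // num_skips, num_skips, skip_window=1, rng=rng)
    print(num_skips, pairs)
```

With num_skips=2 each center word contributes both of its neighbors; with num_skips=1 it contributes a single neighbor chosen at random, so 8 pairs cover 8 different center words.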

Training preparation:

Input embedding W (the embeddings variable in the code)

Output embedding W^ (the nce_weights variable in the code)

The following code is easier to follow. TensorFlow provides nce_loss, which handles this automatically: each time the loss is evaluated it draws a new random set of negative samples.

    num_sampled = 64  # Number of negative examples to sample.

    import math

    import tensorflow as tf

    graph = tf.Graph()

    with graph.as_default():

        # Input data.
        train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
        train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
        valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

        # Construct the variables.
        embeddings = tf.Variable(
            tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
        nce_weights = tf.Variable(
            tf.truncated_normal([vocabulary_size, embedding_size],
                                stddev=1.0 / math.sqrt(embedding_size)))
        nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

        # Look up embeddings for inputs.
        embed = tf.nn.embedding_lookup(embeddings, train_inputs)

        # Compute the average NCE loss for the batch.
        # tf.nn.nce_loss automatically draws a new sample of the negative labels each
        # time we evaluate the loss.
        # Keyword arguments are used because the positional order of inputs and
        # labels differs between TensorFlow releases.
        loss = tf.reduce_mean(
            tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                           inputs=embed, labels=train_labels,
                           num_sampled=num_sampled, num_classes=vocabulary_size))

        # Construct the SGD optimizer using a learning rate of 1.0.
        optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)
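Conceptually, the embedding lookup in the graph selects one row of the embedding matrix per input id. A plain-Python sketch with a made-up 4 x 3 matrix follows; the values and the helper name lookup_rows are illustrative only and stand in for tf.nn.embedding_lookup.

```python
def lookup_rows(embeddings, ids):
    """Return the embedding row for each id, in order."""
    return [embeddings[i] for i in ids]


# Toy matrix: vocabulary_size = 4, embedding_size = 3.
embeddings = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
embed = lookup_rows(embeddings, [2, 0, 2])
print(embed)
```

A batch of batch_size ids therefore yields a batch_size x embedding_size block of vectors.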

During training, the demo uses a matrix multiplication on the normalized embedding matrix to compute the cosine similarity between word vectors, and for several high-frequency words it displays the nearest words by that similarity.
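The nearest-word lookup can be sketched without TensorFlow: once each vector is scaled to unit length, the dot product between two rows is their cosine similarity, which is what a matrix multiplication on normalized embeddings computes. The toy vectors, the word list, and the nearest function are assumptions for illustration.

```python
import math


def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest(word_id, embeddings, top_k):
    """Return the ids of the top_k most cosine-similar words, excluding word_id."""
    unit = [normalize(v) for v in embeddings]
    # Dot products of unit vectors are cosine similarities.
    sims = [sum(a * b for a, b in zip(unit[word_id], other)) for other in unit]
    ranked = sorted(range(len(unit)), key=lambda i: -sims[i])
    return [i for i in ranked if i != word_id][:top_k]


words = ["king", "queen", "apple", "pear"]
embeddings = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],
    [0.2, 0.8],
]
print([words[i] for i in nearest(0, embeddings, top_k=2)])
```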

Finally, scikit-learn's TSNE module reduces the embeddings to 2 dimensions, and the result is plotted.
