TensorFlow's Word2vec Demo analysis

Last Update:2015-11-19 Source: Internet

Author: User


The code for the simple demo is at tensorflow\tensorflow\g3doc\tutorials\word2vec\word2vec_basic.py

The idea behind the skip-gram model

http://tensorflow.org/tutorials/word2vec/index.md

You can also refer to the CS224d course slides.

Suppose the window is set to 1 word on the left and 1 word on the right.

The skip-gram model uses a word to predict its surrounding words (the CBOW model, by contrast, takes a series of context words as input and predicts the center word).

For example, from "the quick brown": quick -> the, quick -> brown
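The pairing rule can be sketched in a few lines of plain Python. This is only an illustration of the idea, not code from the demo; the helper name skipgram_pairs and the example sentence are assumptions made for this sketch.

```python
def skipgram_pairs(words, window=1):
    """Return every (center, context) pair within `window` words of each center."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((center, words[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Words at the edge of the sentence simply get fewer context words, which is why "the" and "fox" each contribute a single pair here.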

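The skip-gram training objective is the full-softmax log-likelihood of each context word given its center word.

But computing it is too time-consuming: every training step costs O(vocabulary_size), because the softmax normalizes over every word in the vocabulary.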
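For reference, the full-softmax skip-gram objective is usually written as follows. This is the standard textbook form, not a formula taken from the demo code; the symbols T (corpus length), c (window size), V (vocabulary size), and W and \hat{W} (input and output embeddings) are notational assumptions.

```latex
J = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log P(w_{t+j} \mid w_t),
\qquad
P(w_O \mid w_I) = \frac{\exp\left(\hat{W}_{w_O}^{\top} W_{w_I}\right)}{\sum_{v=1}^{V} \exp\left(\hat{W}_{v}^{\top} W_{w_I}\right)}
```

The sum over all V words in the denominator is the term that makes each step cost O(vocabulary_size).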

So the demo uses NCE (noise-contrastive estimation), a negative-sampling approach: words are generated at random in some way and used as negative samples. For example, in the pair quick -> sheep, sheep is a negative sample (here we assume a single negative sample is drawn).
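To make the saving concrete, here is a minimal sketch of a negative-sampling style loss in plain Python. It is deliberately simplified: it uses the plain logistic negative-sampling objective rather than the exact NCE formula, and it draws negatives uniformly, whereas TensorFlow uses its own candidate sampler. All names and the toy vectors are assumptions made for this illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sample_negatives(vocab_size, positive_id, k, rng):
    """Draw k word ids uniformly at random, skipping the true label."""
    negatives = []
    while len(negatives) < k:
        candidate = rng.randrange(vocab_size)
        if candidate != positive_id:
            negatives.append(candidate)
    return negatives


def negative_sampling_loss(center_vec, output_vecs, positive_id, negative_ids):
    """Logistic loss over one positive word and a few sampled negatives."""
    loss = -math.log(sigmoid(dot(center_vec, output_vecs[positive_id])))
    for neg in negative_ids:
        loss -= math.log(sigmoid(-dot(center_vec, output_vecs[neg])))
    return loss


# Toy vocabulary (hypothetical): ids 0..3 = quick, the, brown, sheep.
output_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
center_vec = [2.0, 0.0]  # input vector for "quick"
positive_id = 1          # "the" is a true context word

rng = random.Random(0)
negatives = sample_negatives(len(output_vecs), positive_id, k=1, rng=rng)
loss = negative_sampling_loss(center_vec, output_vecs, positive_id, negatives)
print(negatives, loss)
```

The loss touches only 1 + k output vectors per training pair, so the per-step cost no longer grows with the vocabulary size.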

The input data here is text that has already been split into words.
Read the words into a list.
Count word frequencies. Index 0 is reserved for unknown: given a preset dictionary size, for example 50000, every word that does not make it into the dictionary is mapped to unknown.
Build a bidirectional index: a key->id map and an id->key map.
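These preprocessing steps can be sketched as follows. The sketch is modeled on the description above rather than copied from the demo; the function name, the "UNK" token spelling, and the toy vocabulary size of 3 are assumptions made for illustration.

```python
import collections


def build_dataset(words, vocabulary_size):
    """Map words to integer ids, sending rare words to id 0 (unknown)."""
    # Slot 0 is reserved for the unknown token; the remaining slots go to
    # the most frequent words.
    count = [("UNK", -1)]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = {word: idx for idx, (word, _) in enumerate(count)}
    # Words outside the dictionary fall back to id 0.
    data = [dictionary.get(word, 0) for word in words]
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return data, dictionary, reverse_dictionary


words = "a b a c a b d".split()
data, dictionary, reverse_dictionary = build_dataset(words, vocabulary_size=3)
print(data)
print(reverse_dictionary)
```

With a dictionary of size 3, only the two most frequent words keep their own ids, and "c" and "d" both collapse to unknown.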

4. Generating a training batch

batch_size = 128

embedding_size = 128  # Dimension of the embedding vector.

skip_window = 1  # How many words to consider left and right.

num_skips = 2  # How many times to reuse an input to generate a label.

batch_size is the number of examples processed in each SGD step, embedding_size is the dimension of the word vectors, and skip_window is the window size on each side of the center word.

num_skips = 2 is the number of times one input word is reused to generate a label, that is, the number of (input, label) pairs produced per center word.

The demo defaults to 2; setting it to 1 gives a useful comparison.

With the default value of 2:

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], '->', labels[i, 0])
    print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])

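Sample data: [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]

3084 -> 5239
originated -> anarchism
3084 -> 12
originated -> as
12 -> 3084
as -> originated
12 -> 6
as -> a
6 -> 195
a -> term
6 -> 12
a -> as
195 -> 2
term -> of
195 -> 6
term -> a

3084 appears on the left (as the input) twice, once for each of the two words in its window of 1 word on either side.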

With num_skips set to 1:

batch, labels = generate_batch(batch_size=8, num_skips=1, skip_window=1)
for i in range(8):
    print(batch[i], '->', labels[i, 0])
    print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])

Sample data: [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]

3084 -> 12
originated -> as
12 -> 3084
as -> originated
6 -> 12
a -> as
195 -> 2
term -> of
2 -> 3137
of -> abuse
3137 -> 46
abuse -> first
46 -> 59
first -> used
59 -> 156
used -> against

This time 3084 appears on the left only once.
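The two experiments can be reproduced without TensorFlow or NumPy using a simplified stand-in for the batch generator. This is a sketch, not the demo's function: it takes the data and start position as explicit arguments instead of using a global cursor, and it picks context positions with random.sample instead of a rejection loop. The name generate_pairs is hypothetical.

```python
import random


def generate_pairs(data, start, batch_size, num_skips, skip_window, rng):
    """Return batch_size (input_id, label_id) pairs, num_skips per center word."""
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    span = 2 * skip_window + 1  # [ skip_window center skip_window ]
    pairs = []
    for n in range(batch_size // num_skips):
        # Sliding window over the data, wrapping around at the end.
        window = [data[(start + n + k) % len(data)] for k in range(span)]
        context_positions = [p for p in range(span) if p != skip_window]
        # Pick num_skips distinct context positions for this center word.
        for p in rng.sample(context_positions, num_skips):
            pairs.append((window[skip_window], window[p]))
    return pairs


sample = [5239, 3084, 12, 6, 195, 2, 3137, 46, 59, 156]
rng = random.Random(0)

two = generate_pairs(sample, 0, batch_size=8, num_skips=2, skip_window=1, rng=rng)
one = generate_pairs(sample, 0, batch_size=8, num_skips=1, skip_window=1, rng=rng)
print(two)
print(one)
```

With num_skips=2 a batch of 8 covers only 4 center words, each paired with both neighbors; with num_skips=1 the same batch size covers 8 center words, each paired with one randomly chosen neighbor.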

# Step 4: Function to generate a training batch for the skip-gram model.
import collections
import random

import numpy as np

# `data` is the list of word ids built in the earlier steps;
# data_index is the read cursor into it, shared across calls.
data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            # Re-draw until we hit a position not used yet for this center.
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        # Slide the window one word to the right.
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

batch, labels = generate_batch(batch_size=8, num_skips=2, skip_window=1)
for i in range(8):
    print(batch[i], '->', labels[i, 0])
    print(reverse_dictionary[batch[i]], '->', reverse_dictionary[labels[i, 0]])

For each center word, the function randomly picks num_skips words within the window, producing a series of

(input_id, output_id) pairs that serve as (batch_instance, label).

These are all positive samples.

Preparing for training:

Input embedding W (embeddings in the code)

Output embedding W^ (nce_weights in the code)

The code below is fairly easy to follow. TF defines nce_loss, which handles this automatically: every time the loss is evaluated, it adds freshly drawn random negative samples.

num_sampled = 64  # Number of negative examples to sample.

import math

import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    # Input data.
    train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

    # Construct the variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    nce_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

    # Look up embeddings for inputs.
    embed = tf.nn.embedding_lookup(embeddings, train_inputs)

    # Compute the average NCE loss for the batch.
    # tf.nn.nce_loss automatically draws a new sample of the negative labels
    # each time we evaluate the loss.
    # The argument order below matches the TensorFlow release this demo
    # shipped with; later releases expect labels before inputs.
    loss = tf.reduce_mean(
        tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,
                       num_sampled, vocabulary_size))

    # Construct the SGD optimizer using a learning rate of 1.0.
    optimizer = tf.train.GradientDescentOptimizer(1.0).minimize(loss)

During training, the demo multiplies the embedding matrix by itself to measure how close different word vectors are. The embeddings are L2-normalized first, so the product gives cosine similarity rather than Euclidean distance. For a handful of high-frequency words, it then prints the nearest words according to that similarity.
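The nearest-word lookup can be sketched in plain Python as below. The demo does this with matrix operations inside the TensorFlow graph; this version only illustrates the computation, and the function names and toy embeddings are assumptions.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nearest(word_id, embeddings, top_k):
    """Return the ids of the top_k words most similar to word_id."""
    unit = [normalize(v) for v in embeddings]
    sims = [sum(a * b for a, b in zip(unit[word_id], other)) for other in unit]
    ranked = sorted(range(len(unit)), key=lambda i: -sims[i])
    # The word itself always has similarity 1.0, so leave it out.
    return [i for i in ranked if i != word_id][:top_k]


# Toy 2-dimensional embeddings (hypothetical values).
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest(0, embeddings, top_k=2))
```

Because every row is normalized before the dot product, vector length does not affect the ranking; only direction does.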

Finally, the demo calls scikit-learn's TSNE module to reduce the embeddings to 2 dimensions and plots the result.


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
