Word2vec code interpretation under Python TensorFlow

Last Update:2016-05-19 Source: Internet

Author: User


Objective:

As a serious deep learning enthusiast, after studying various theories I have been looking for a hands-on project through which to learn a deep learning framework and apply that knowledge in practice. The wish is good, but the opportunity is not easy to find. Recently a project gave me exactly that chance, and along the way I noticed that people in the major machine learning and TensorFlow groups were running into similar problems. So I would like to use this project to share some of my understanding of the process and my experience with it. I hope it will be of some use to everyone's work, and I welcome criticism and corrections from professionals in the field. Thank you for your support!

I will split the first chapter of this blog into two parts. This part covers the basic structure of the official TensorFlow version of Word2vec and how to build a CBOW model to make up for the model architecture that this version lacks. In the next part, I will focus on comparing the results of three versions of Word2vec: the basic TensorFlow version, the optimized TensorFlow version, and Gensim.

Code parsing:

First, the basic tutorial provided by TensorFlow already explains what Word2vec is and how TensorFlow builds this network for training. The address of the tutorial can be seen here. In addition, the basic version of the code can be found here.

The structure of the code looks confusing at first, but it is actually straightforward. First, line 61 limits the demo to learning a total of 50,000 distinct words. Then, in the build_dataset(words) function, line 65 shows off the strength of Python by organizing the entire input in a single line. count starts with an entry for UNK (the unknown word, that is, any rare word that does not make it into the vocabulary), and extend then appends the vocabulary_size - 1 most frequent words, ordered from high frequency to low. As a result, every word outside the top 49,999 is simply dropped from count.

dictionary is then built from count, which is already sorted by word frequency: each word becomes a key of the dict, and its rank in that ordering becomes the value, which serves as the word's code. After that, the input words are converted into their codes in the dictionary. Every input word that is not in the dictionary is encoded as 0, the code of UNK, and the UNK count is increased accordingly. In this way build_dataset re-encodes the input data and produces the lookup tables between codes and words: data is used to train the model, while dictionary and reverse_dictionary are used at the end to look up the relationship between words and their vectors.

What if you do not want to cap vocabulary_size in the dictionary? The answer is simple. According to Mikolov, it is fine to remove words with a frequency below roughly 3 to 10, so we can modify the function as follows:

import collections

def build_dataset(words, min_cut_freq):
    count_org = [['UNK', -1]]
    # Collect the frequency of every word in the input.
    count_org.extend(collections.Counter(words).most_common())
    count = [['UNK', -1]]
    for word, c in count_org:
        word_tuple = [word, c]
        if word == 'UNK':
            # Keep the UNK slot at position 0 for later use.
            count[0][1] = c
            continue
        if c > min_cut_freq:
            # min_cut_freq is a parameter: words that occur
            # min_cut_freq times or fewer are dropped.
            count.append(word_tuple)
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary
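To make the effect of the frequency cutoff concrete, here is a minimal sketch on a toy corpus. It restates the same idea in condensed form so that the block runs on its own; the toy sentence and the cutoff value of 1 are assumptions chosen for illustration, and the condensed function assumes the corpus does not contain the literal token 'UNK'.

```python
import collections

def build_dataset(words, min_cut_freq):
    # Condensed restatement of the frequency-cutoff idea, for illustration only.
    # Assumes the literal token 'UNK' does not occur in `words`.
    count = [['UNK', -1]]
    for word, c in collections.Counter(words).most_common():
        if c > min_cut_freq:
            count.append([word, c])
    # Each kept word is coded by its rank; UNK always gets code 0.
    dictionary = {word: index for index, (word, _) in enumerate(count)}
    data = [dictionary.get(word, 0) for word in words]
    count[0][1] = data.count(0)  # number of words that fell back to UNK
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return data, count, dictionary, reverse_dictionary

# Toy corpus (hypothetical): only 'the' and 'cat' occur more than once.
words = "the cat sat on the mat the cat ran".split()
data, count, dictionary, reverse_dictionary = build_dataset(words, min_cut_freq=1)

print(data)        # [1, 2, 0, 0, 1, 0, 1, 2, 0]
print(count)       # [['UNK', 4], ['the', 3], ['cat', 2]]
print(dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2}
```

With this cutoff the vocabulary size is no longer fixed in advance: it is determined by how many words clear the frequency threshold.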

After that, the generate_batch function at line 91 of the source code is the real entry point for building the Skip-gram model, not the graph that follows with graph.as_default() at line 137. The code after line 137 only creates a simple MLP model for the tensors to flow through. What actually defines the model is the form of the input tensors and their targets.

If you read the code carefully, you will see how this works on an input such as "Batman defeated Superman, Captain America was beaten by Iron Man". After conversion by build_dataset, Batman might be replaced by its dictionary code 3, "defeated" by 90, Superman by 600, Captain America by 58, "was" by 77, Iron Man by 888, and "beaten" by 965. Taking the tokens in that order, the sentence becomes [3, 90, 600, 58, 77, 888, 965]. Suppose the window spans 3 words, that is, one context word on each side of the center word. Since the model here is Skip-gram, generate_batch starts from 90: the output batch is [90, 90, 600, 600, 58, 58, 77, 77, 888, 888] and the output target is [3, 600, 90, 58, 600, 77, 58, 888, 77, 965].

So how do we build a CBOW model? It is very simple. Note that the input and the prediction of the CBOW model are the opposite of Skip-gram, so all we have to do is swap the batch assignment on line 109 with the labels assignment on line 110. The specific code is as follows:

import collections
import random

import numpy as np

# data and data_index are the module-level globals defined in the original source file.
def generate_cbow_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            # batch and labels for the Skip-gram model:
            # batch[i * num_skips + j] = buffer[skip_window]
            # labels[i * num_skips + j, 0] = buffer[target]
            # batch and labels for the CBOW model: simply swap
            # the two Skip-gram lines above.
            batch[i * num_skips + j] = buffer[target]
            labels[i * num_skips + j, 0] = buffer[skip_window]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels
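The Batman example can be checked with a small deterministic sketch. The helper window_pairs below is hypothetical and is not part of the TensorFlow source: it lists the context words of each center word from left to right, whereas the real generate_batch picks them in random order, so only the ordering within each window may differ.

```python
def window_pairs(data, skip_window):
    """Return (center, context) pairs for every full window, left to right."""
    pairs = []
    for center in range(skip_window, len(data) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset != 0:
                pairs.append((data[center], data[center + offset]))
    return pairs

# Encoded example sentence from the text; one context word on each side.
sentence = [3, 90, 600, 58, 77, 888, 965]
pairs = window_pairs(sentence, skip_window=1)

# Skip-gram: the center word is the input, the context word is the target.
skipgram_batch = [center for center, _ in pairs]
skipgram_labels = [context for _, context in pairs]

# The swap described in this article: the context word is the input,
# the center word is the target.
cbow_batch, cbow_labels = skipgram_labels, skipgram_batch

print(skipgram_batch)   # [90, 90, 600, 600, 58, 58, 77, 77, 888, 888]
print(skipgram_labels)  # [3, 600, 90, 58, 600, 77, 58, 888, 77, 965]
```

The Skip-gram lists reproduce the batch and target from the worked example, and the swapped variant simply exchanges the two lists.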

Thus, in the training code that follows, we only need to replace the call batch_inputs, batch_labels = generate_batch(batch_size, num_skips, skip_window) with a call to the CBOW function, generate_cbow_batch, and that is all.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.