When training a deep network on a text task, the first step is usually to convert the text into word vectors. The quality of word vectors depends on the size of the training corpus, and when the corpus for the task at hand is too small to support the experiment, we can fall back on word vectors pre-trained on large public corpora from the Internet.
1. Download
Publicly available pre-trained word vectors can be downloaded from: https://github.com/xgli/word2vec-api
The GloVe project describes how to use its pre-trained word vectors, which are formatted as follows: each line is a word followed by the components of its vector, separated by spaces.
GloVe word vectors: plain-text (non-binary) files
word2vec word vectors: plain-text (non-binary) files
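For a quick sanity check of the format, you can peek at the first line of the file (a minimal sketch; the file name assumes the 50-dimensional 6B GloVe download):

with open('glove.6B.50d.txt') as f:
    print(f.readline())
# e.g.: the 0.418 0.24968 -0.41242 ...  (a word followed by 50 floats)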
2. Load
Loading the GloVe word vectors:
import numpy as np

filename = 'glove.6B.50d.txt'
emb_size = 50  # vector dimensionality; must match the file (this may need to be specified by hand)

def loadGloVe(filename):
    vocab = []
    embd = []
    vocab.append('UNK')          # placeholder for unrecognized words
    embd.append([0] * emb_size)  # zero vector for 'UNK'
    file = open(filename, 'r')
    for line in file.readlines():
        row = line.strip().split(' ')
        vocab.append(row[0])
        embd.append(row[1:])
    print('Loaded GloVe!')
    file.close()
    return vocab, embd

vocab, embd = loadGloVe(filename)
vocab_size = len(vocab)
embedding_dim = len(embd[0])
embedding = np.asarray(embd)
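A pitfall worth noting (my observation, not from the original post): since row[1:] comes from splitting a text line, embd contains strings, so np.asarray(embd) yields a string array. Casting explicitly avoids surprises when the matrix is later fed to a float32 placeholder:

embedding = np.asarray(embd, dtype=np.float32)  # convert the string fields to real floats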
Loading the word2vec word vectors:
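Unlike GloVe, the word2vec text format begins with a header line giving the vocabulary size and the vector dimensionality, which is why the loader below reads one line before the loop. Illustratively (the numbers are made up):

3000000 300
the 0.0803 -0.0512 ...  (300 floats)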
def loadWord2Vec(filename):
    vocab = []
    embd = []
    fr = open(filename, 'r', encoding='utf-8')
    line = fr.readline().strip()   # header line: "<vocab_size> <vector_dim>"
    word_dim = int(line.split(' ')[1])
    vocab.append('UNK')            # placeholder for unrecognized words
    embd.append([0] * word_dim)    # zero vector for 'UNK'
    for line in fr:
        row = line.strip().split(' ')
        vocab.append(row[0])
        embd.append(row[1:])
    print('Loaded word2vec!')
    fr.close()
    return vocab, embd

vocab, embd = loadWord2Vec(filename)
vocab_size = len(vocab)
embedding_dim = len(embd[0])
embedding = np.asarray(embd)

Here vocab is the vocabulary list and embd holds the corresponding word vectors.
3. Word vector layer
Constructing the word-vector (embedding) layer of the network:
import tensorflow as tf

W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
This declares the word-vector matrix W in the network structure; trainable=False keeps the pre-trained vectors fixed during training. The actual values are assigned at run time:
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
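For context, a minimal sketch of where this call sits in a TF 1.x program (the session setup is my addition; the original post shows only the single line above):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # overwrite the zero-initialized W with the pre-trained matrix
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

Assigning through a placeholder, rather than initializing W directly from a tf.constant holding the whole matrix, keeps the large embedding out of the serialized graph definition.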
In this way the pre-trained embedding matrix is fed into the network once, at startup.
4. Vocabulary
Note that this step does not suit certain tasks, such as dialogue or sequence labeling, because the built-in function automatically filters out punctuation, and punctuation carries information that those tasks need.
tf.nn.embedding_lookup(W, input_x)
This call maps the input to word vectors, but input_x must already hold word IDs. So we first need to map the raw input text to a sequence of word IDs:
from tensorflow.contrib import learn

# init vocab processor (max_document_length is the padded sequence length, set elsewhere)
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit the vocab from glove
pretrain = vocab_processor.fit(vocab)
# transform inputs (your_raw_input is an iterable of raw text strings)
input_x = np.array(list(vocab_processor.transform(your_raw_input)))
This uses TensorFlow's own text-processing API to map words to word IDs; note that it also filters out punctuation in the process.
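Putting the pieces together, the ID matrix produced above is what gets fed to the lookup. A minimal sketch (the placeholder name input_ids and its shape are my assumptions about the surrounding training code):

input_ids = tf.placeholder(tf.int32, [None, max_document_length], name="input_ids")
embedded = tf.nn.embedding_lookup(W, input_ids)
# embedded has shape [batch_size, max_document_length, embedding_dim]

# at run time, feed in the ID matrix from the vocab processor:
# sess.run(embedded, feed_dict={input_ids: input_x})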
That is all for now. I fell into quite a few pits when I first worked through this, and the write-up is not very detailed, so if anything is unclear, feel free to discuss in the comments or email me (email gets a faster reply).
Note that the code in the original post is flawed: it does not account for the 'UNK' (unknown word) case. Please keep this in mind.
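For completeness, here is one way to route out-of-vocabulary words to the 'UNK' row at index 0 that the loaders above add. word2id and to_ids are hypothetical helpers, not part of the original post, and padding with the UNK index is my assumption:

# hypothetical helper: map tokens to IDs, falling back to 'UNK' (index 0)
word2id = {w: i for i, w in enumerate(vocab)}

def to_ids(tokens, max_len):
    ids = [word2id.get(t, 0) for t in tokens[:max_len]]  # 0 == 'UNK'
    return ids + [0] * (max_len - len(ids))              # pad with the UNK index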
Thanks to the author: https://ireneli.eu/2017/01/17/tensorflow-07-word-embeddings-2-loading-pre-trained-vectors/
Reposted from https://blog.csdn.net/lxg0807/article/details/72518962