TensorFlow: using pre-trained word vectors


When training a deep network on a text task, the first step is to convert the text into word vectors. However, the quality of word vectors depends on the size of the training corpus, and the corpus available for the task at hand is often too small to train good vectors, so we instead use word vectors pre-trained on large public corpora from the Internet.

1. Download

Publicly available pre-trained word vectors can be downloaded from: https://github.com/xgli/word2vec-api
The GloVe project's documentation describes how to use its pre-trained word vectors. The file format is simple: each line contains a word followed by its vector components, separated by spaces.
The GloVe word vectors come as plain-text (non-binary) files.

The word2vec word vectors used here are also plain-text (non-binary) files.
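As a quick sanity check, you can peek at the first line of each file to confirm the format. This is a minimal sketch; glove.6B.50d.txt is the file used below, while word2vec.txt stands in for whatever text-format word2vec file you downloaded.

import io

# GloVe: every line is "<word> v1 v2 ... vD", with no header line
with io.open('glove.6B.50d.txt', 'r', encoding='utf-8') as f:
    parts = f.readline().rstrip().split(' ')
    print(parts[0], len(parts) - 1)   # first word and its dimension (50 for this file)

# word2vec text format: the first line is a header "<vocab_size> <dim>"
with io.open('word2vec.txt', 'r', encoding='utf-8') as f:
    print(f.readline().rstrip())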
2. Load

Loading the GloVe word vectors

import numpy as np

filename = 'glove.6B.50d.txt'

def loadGloVe(filename):
    vocab = []
    embd = []
    vocab.append('UNK')  # reserve index 0 for unknown words
    embd.append([0] * emb_size)  # emb_size must be set to the embedding dimension beforehand (50 here)
    file = open(filename, 'r')
    for line in file.readlines():
        row = line.strip().split(' ')
        vocab.append(row[0])
        embd.append(row[1:])
    print('Loaded GloVe!')
    file.close()
    return vocab, embd

vocab, embd = loadGloVe(filename)
vocab_size = len(vocab)
embedding_dim = len(embd[0])
embedding = np.asarray(embd)
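Before wiring this into TensorFlow, it also helps to cast the vectors to float32 (matching the tf.float32 placeholder used in step 3) and to build a word-to-ID dictionary for later lookups. This is a small additional sketch, not part of the original post.

# Cast to float32 so it matches the tf.float32 placeholder defined in step 3
embedding = np.asarray(embd, dtype=np.float32)

# Map each word to its row index in the embedding matrix; index 0 is 'UNK'
word2id = {w: i for i, w in enumerate(vocab)}
print(word2id.get('the', 0))   # unseen words fall back to the UNK index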
Loading the word2vec word vectors
def loadWord2Vec(filename):
    vocab = []
    embd = []
    cnt = 0
    fr = open(filename, 'r')
    line = fr.readline().decode('utf-8').strip()  # Python 2; drop .decode('utf-8') under Python 3
    # the first line of the word2vec text format is the header "<vocab_size> <dim>"
    word_dim = int(line.split(' ')[1])
    vocab.append("UNK")
    embd.append([0] * word_dim)
    for line in fr:
        row = line.strip().split(' ')
        vocab.append(row[0])
        embd.append(row[1:])
    print("Loaded word2vec!")
    fr.close()
    return vocab, embd

vocab, embd = loadWord2Vec(filename)  # filename should now point at the word2vec text file
vocab_size = len(vocab)
embedding_dim = len(embd[0])
embedding = np.asarray(embd)

Here vocab holds the vocabulary and embd holds the corresponding word vectors.
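If the word2vec vectors come as a binary .bin file instead of plain text, one common alternative is to load them with gensim rather than parsing the file by hand. This is an assumption on my part, not something from the original post; the Google News file name is only an example.

from gensim.models import KeyedVectors

# load_word2vec_format handles both the text and the binary word2vec formats
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print(model.vector_size)   # embedding dimension
print(model['dog'][:5])    # first few components of one word's vector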

3. Word vector layer

Building the word-vector layer when constructing the network:

import tensorflow as tf

W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
This declares the word-embedding matrix W in the network graph; trainable=False keeps the pre-trained vectors fixed during training.
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
This feeds the pre-trained embedding into the graph and assigns it to W.
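A minimal sketch of the run order, assuming TensorFlow 1.x and the embedding array built in step 2: because W is created as all zeros, embedding_init has to run after the usual variable initialization, otherwise the pre-trained values never make it into W.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())   # W is initialized to zeros here
    sess.run(embedding_init,
             feed_dict={embedding_placeholder: embedding})   # overwrite W with the pre-trained vectors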

4. Vocabulary mapping

Note that this step is not suitable for some tasks, such as dialogue or sequence labeling: the built-in function below automatically filters out punctuation, and punctuation is exactly the information those tasks need.

tf.nn.embedding_lookup(W, input_x)

This call maps the input to word vectors, but input_x must already contain word IDs, so the raw input text first has to be mapped to sequences of word IDs.
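For example (a small illustration; input_ids is a hypothetical placeholder name, not from the original post, and max_document_length is the fixed sequence length used below), looking up a batch of ID sequences returns a 3-D tensor of vectors:

input_ids = tf.placeholder(tf.int32, [None, max_document_length])   # word IDs, not raw text
embedded = tf.nn.embedding_lookup(W, input_ids)   # shape: [batch_size, max_document_length, embedding_dim]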

from tensorflow.contrib import learn

# init vocab processor
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit the vocab from glove
pretrain = vocab_processor.fit(vocab)
# transform inputs
input_x = np.array(list(vocab_processor.transform(your_raw_input)))
This uses TensorFlow's own text-preprocessing API to map words to word IDs; note that it also filters out punctuation.
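If punctuation has to be preserved (the caveat mentioned above), one workaround is to tokenize the text yourself and map tokens to IDs with a plain dictionary instead of VocabularyProcessor. This is only a sketch, reusing the word2id dictionary built in step 2; a real setup would usually reserve a separate padding token instead of padding with UNK.

def to_ids(tokens, max_len):
    # map a list of tokens (punctuation included) to a fixed-length ID sequence
    ids = [word2id.get(t, 0) for t in tokens]                  # unknown words map to the UNK index
    ids = ids[:max_len] + [0] * max(0, max_len - len(ids))     # truncate or pad with the UNK index
    return ids

input_x = np.array([to_ids(text.split(' '), max_document_length) for text in your_raw_input])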

That is all for now. I fell into quite a few pits when I first worked through this, and this write-up is not very detailed, so if anything is unclear, feel free to comment or email me (email gets a quicker response).

Note that the original author's code is flawed: it does not consider the 'UNK' case. Please keep this in mind.

Thanks to the author: https://ireneli.eu/2017/01/17/tensorflow-07-word-embeddings-2-loading-pre-trained-vectors/

Reposted from https://blog.csdn.net/lxg0807/article/details/72518962
