When training a deep network on a text task, the first step is usually to convert the text into word vectors. The quality of word vectors depends on the size of the training corpus, and when the corpus for the task at hand is too small to support the experiment, we can fall back on word vectors pre-trained on large public corpora from the Internet.
1. Download
Publicly available pre-trained word vectors can be downloaded from: https://github.com/xgli/word2vec-api
The GloVe project describes how to use its pre-trained word vectors, which are formatted as follows: each line is a word followed by the components of its vector, separated by spaces.
GloVe word vectors: plain-text (non-binary) files
word2vec word vectors: plain-text (non-binary) files
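For a quick sanity check of the format, you can peek at the first line of the file (a minimal sketch; the file name assumes the 50-dimensional 6B GloVe download):

with open('glove.6B.50d.txt') as f:
    print(f.readline())
# e.g.: the 0.418 0.24968 -0.41242 ...  (a word followed by 50 floats)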
2. Load
Loading the GloVe word vectors:
import numpy as np

filename = 'glove.6B.50d.txt'
emb_size = 50  # vector dimensionality; must match the file (this may need to be specified by hand)

def loadGloVe(filename):
    vocab = []
    embd = []
    vocab.append('UNK')          # placeholder for unrecognized words
    embd.append([0] * emb_size)  # zero vector for 'UNK'
    file = open(filename, 'r')
    for line in file.readlines():
        row = line.strip().split(' ')
        vocab.append(row[0])
        embd.append(row[1:])
    print('Loaded GloVe!')
    file.close()
    return vocab, embd

vocab, embd = loadGloVe(filename)
vocab_size = len(vocab)
embedding_dim = len(embd[0])
embedding = np.asarray(embd)
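A pitfall worth noting (my observation, not from the original post): since row[1:] comes from splitting a text line, embd contains strings, so np.asarray(embd) yields a string array. Casting explicitly avoids surprises when the matrix is later fed to a float32 placeholder:

embedding = np.asarray(embd, dtype=np.float32)  # convert the string fields to real floats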
Loading the word2vec word vectors:
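Unlike GloVe, the word2vec text format begins with a header line giving the vocabulary size and the vector dimensionality, which is why the loader below reads one line before the loop. Illustratively (the numbers are made up):

3000000 300
the 0.0803 -0.0512 ...  (300 floats)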
def loadWord2Vec(filename):
    vocab = []
    embd = []
    fr = open(filename, 'r', encoding='utf-8')
    line = fr.readline().strip()   # header line: "<vocab_size> <vector_dim>"
    word_dim = int(line.split(' ')[1])
    vocab.append('UNK')            # placeholder for unrecognized words
    embd.append([0] * word_dim)    # zero vector for 'UNK'
    for line in fr:
        row = line.strip().split(' ')
        vocab.append(row[0])
        embd.append(row[1:])
    print('Loaded word2vec!')
    fr.close()
    return vocab, embd

vocab, embd = loadWord2Vec(filename)
vocab_size = len(vocab)
embedding_dim = len(embd[0])
embedding = np.asarray(embd)

Here vocab is the vocabulary list and embd holds the corresponding word vectors.
3. Word vector layer
Constructing the word-vector (embedding) layer of the network:
import tensorflow as tf

W = tf.Variable(tf.constant(0.0, shape=[vocab_size, embedding_dim]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocab_size, embedding_dim])
embedding_init = W.assign(embedding_placeholder)
This declares the word-vector matrix W in the network structure; trainable=False keeps the pre-trained vectors fixed during training. The actual values are assigned at run time:
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})
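For context, a minimal sketch of where this call sits in a TF 1.x program (the session setup is my addition; the original post shows only the single line above):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # overwrite the zero-initialized W with the pre-trained matrix
    sess.run(embedding_init, feed_dict={embedding_placeholder: embedding})

Assigning through a placeholder, rather than initializing W directly from a tf.constant holding the whole matrix, keeps the large embedding out of the serialized graph definition.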
In this way the pre-trained embedding matrix is fed into the network once, at startup.
4. Vocabulary
Note that this step does not suit certain tasks, such as dialogue or sequence labeling, because the built-in function automatically filters out punctuation, and punctuation carries information that those tasks need.
tf.nn.embedding_lookup(W, input_x)
This call maps the input to word vectors, but input_x must already hold word IDs. So we first need to map the raw input text to a sequence of word IDs:
from tensorflow.contrib import learn

# init vocab processor (max_document_length is the padded sequence length, set elsewhere)
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
# fit the vocab from glove
pretrain = vocab_processor.fit(vocab)
# transform inputs (your_raw_input is an iterable of raw text strings)
input_x = np.array(list(vocab_processor.transform(your_raw_input)))
This uses TensorFlow's own text-processing API to map words to word IDs; note that it also filters out punctuation in the process.
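Putting the pieces together, the ID matrix produced above is what gets fed to the lookup. A minimal sketch (the placeholder name input_ids and its shape are my assumptions about the surrounding training code):

input_ids = tf.placeholder(tf.int32, [None, max_document_length], name="input_ids")
embedded = tf.nn.embedding_lookup(W, input_ids)
# embedded has shape [batch_size, max_document_length, embedding_dim]

# at run time, feed in the ID matrix from the vocab processor:
# sess.run(embedded, feed_dict={input_ids: input_x})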
That is all for now. I fell into quite a few pits when I first worked through this, and the write-up is not very detailed, so if anything is unclear, feel free to discuss in the comments or email me (email gets a faster reply).
Note that the code in the original post is flawed: it does not account for the 'UNK' (unknown word) case. Please keep this in mind.
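For completeness, here is one way to route out-of-vocabulary words to the 'UNK' row at index 0 that the loaders above add. word2id and to_ids are hypothetical helpers, not part of the original post, and padding with the UNK index is my assumption:

# hypothetical helper: map tokens to IDs, falling back to 'UNK' (index 0)
word2id = {w: i for i, w in enumerate(vocab)}

def to_ids(tokens, max_len):
    ids = [word2id.get(t, 0) for t in tokens[:max_len]]  # 0 == 'UNK'
    return ids + [0] * (max_len - len(ids))              # pad with the UNK index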
Thanks to the author: https://ireneli.eu/2017/01/17/tensorflow-07-word-embeddings-2-loading-pre-trained-vectors/
Reposted from https://blog.csdn.net/lxg0807/article/details/72518962