Word2vec in Practice and Keyword Clustering

Last Update:2018-05-31 Source: Internet

Author: User


Query processing is becoming more and more important in search, and classification is an important part of it. Query classification is difficult because a query is generally short and carries very little information (low entropy), which makes it hard to classify directly. The common approach is to expand the query, for example by fetching search engine results for it or by mapping the query to its corresponding documents, and then to classify those documents. Documents are easier to classify, and the accuracy is relatively high. word2vec has recently become very popular. It uses unsupervised machine learning, so no labeled data is needed, and I studied it to check whether its results can be used to expand queries for classification.

Where to get word2vec

https://code.google.com/p/word2vec/

You can download the code from the address above and compile it to build the analysis tools. The C code there is written in a rather terse, "abstract" style; the following C++ version looks more intuitive.

https://github.com/jdeng/word2vec

Training corpus acquisition

Some news data can be obtained from Sogou Labs. Although it is old, it is still usable. In fact, Weibo may have better data: first, the data volume is large; second, the news corpus can be ...

The download page is http://www.sogou.com/labs/dl/ca.html. To fetch the corpus over FTP:

1. Run ftp ftp.labs.sogou.com in a cmd window.

2. Enter the user name generated at registration.

3. Enter the password generated at registration to connect to the FTP server.

4. cd to the corresponding directory and run dir or ls to list the files.

5. Run get news_tensite_xml.full.tar.gz to download the file to your personal documents directory.

Corpus processing and word segmentation

The corpus is stored as XML, so the news content has to be extracted and cleaned first.

cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" | sed 's#<content>##' | sed 's#</content>##' > news.txt
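The same cleaning step can also be sketched in Python, which may be easier to adapt than a shell pipeline. This is a minimal sketch, not the tool used in this article: it assumes each article body sits on a single line wrapped in <content>...</content> tags and that the raw file is GBK-encoded. extract_content and clean_corpus are hypothetical helper names. Unlike the pipeline, it skips articles whose body is empty.

```python
import re

# Assumption: one article body per line, wrapped in <content>...</content>.
CONTENT_RE = re.compile(r"<content>(.*?)</content>")


def extract_content(raw_line):
    """Return the article body found in one GBK-encoded line, or None."""
    # errors="ignore" drops undecodable bytes, similar to `iconv -c`.
    text = raw_line.decode("gbk", errors="ignore")
    match = CONTENT_RE.search(text)
    if match is None:
        return None
    body = match.group(1).strip()
    return body or None  # treat an empty body as "no article"


def clean_corpus(src_path, dst_path):
    """Write one article per line to dst_path; return the article count."""
    count = 0
    # Read bytes line by line so the multi-gigabyte file is never held in memory.
    with open(src_path, "rb") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for raw_line in src:
            body = extract_content(raw_line)
            if body is not None:
                dst.write(body + "\n")
                count += 1
    return count
```

For example, clean_corpus("news_tensite_xml.dat", "news.txt") would write one article per line, ready for word segmentation.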

This extracts the news content, one article per line. The next step is word segmentation of the corpus. I looked at several open source word segmenters. The Java versions are awkward to use, and garbled-text (encoding) problems can cost half a day. Here I use ICTCLAS, the C++ word segmenter from the Chinese Academy of Sciences, which is easy to run on Linux. I have written a word segmentation program and put it on CSDN; the download includes everything needed: libraries, word segmentation dictionaries, binary programs, and word segmentation tools. More information about the ICTCLAS word segmenter is available at http://hi.baidu.com/drkevinzhang/

The corpus contains 1,143,394 articles in total, and the data file after word segmentation is 2.2 GB.

[Figure: the corpus after word segmentation]

Run word2vec for analysis

./word2vec -train out.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1

Training may take some time. When it finishes, the file vectors.bin is generated, and you can then use the bundled cosine distance tool to look up related keywords.
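For readers who want to inspect vectors.bin outside the bundled tools, the following Python sketch reads the binary layout as far as the word2vec source shows it: an ASCII header line with the vocabulary size and vector size, then for each word its text, a space, and the vector as 4-byte floats. Little-endian floats and UTF-8 words are assumptions here, and read_word2vec_binary is a hypothetical helper name.

```python
import struct


def read_word2vec_binary(path):
    """Load {word: [float, ...]} from a file written with `-binary 1`."""
    vectors = {}
    with open(path, "rb") as f:
        # Header: "<vocab_size> <vector_size>\n" in ASCII.
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            # The word is stored as raw bytes terminated by one space.
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline that ends the previous entry
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="replace")
            # The vector follows as `dim` 4-byte floats (little-endian assumed).
            data = f.read(4 * dim)
            if len(data) < 4 * dim:
                break  # truncated file: stop instead of raising
            vectors[word] = list(struct.unpack("<%df" % dim, data))
    return vectors
```

With 200-dimensional vectors and a large vocabulary, the resulting dictionary can take a lot of memory, so this is mainly useful for spot checks.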

Run ./distance vectors.bin and enter the query word you want to inspect to see the result.
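The distance tool ranks words by their cosine similarity to the query word. The idea can be sketched in a few lines of Python. The toy vectors below are made up for illustration and are not output from the trained model; cosine and nearest are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest(vectors, query, topn=10):
    """Return up to topn (word, similarity) pairs, most similar first."""
    if query not in vectors:
        return []  # out-of-vocabulary query word
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 3-dimensional vectors, only to show the ranking.
toy = {
    "beijing": [0.9, 0.1, 0.0],
    "shanghai": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}
print(nearest(toy, "beijing", topn=2))
```

In this toy data "shanghai" ranks above "apple" for the query "beijing", because its vector points in nearly the same direction.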

The results for object names turn out to be quite reliable, and they would probably be better still with some preprocessing of the corpus.

You can use

./word2vec -train out.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500

to cluster the resulting word vectors into 500 classes, which can then be used for query classification.

[Figure: clustering results]

After the words are removed, the results are quite impressive.
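The -classes option clusters the word vectors with k-means. A rough Python sketch of such a clustering follows, assuming unit-normalised centroids and assignment by largest dot product. It illustrates the technique and is not guaranteed to match the tool's internals; kmeans_classes and the toy vectors are hypothetical.

```python
import math


def kmeans_classes(vectors, k, iterations=10):
    """Assign each word in {word: vector} to one of k classes."""
    words = list(vectors)
    dim = len(vectors[words[0]])
    # Simple deterministic start: spread the words round-robin over the classes.
    assign = [i % k for i in range(len(words))]
    for _ in range(iterations):
        # Sum the member vectors of each class.
        centroids = [[0.0] * dim for _ in range(k)]
        for word, c in zip(words, assign):
            for d, x in enumerate(vectors[word]):
                centroids[c][d] += x
        # Normalise each centroid to unit length (empty classes stay at zero).
        for c in range(k):
            norm = math.sqrt(sum(x * x for x in centroids[c]))
            if norm > 0.0:
                centroids[c] = [x / norm for x in centroids[c]]
        # Reassign every word to the centroid with the largest dot product.
        for i, word in enumerate(words):
            scores = [
                sum(a * b for a, b in zip(vectors[word], cen))
                for cen in centroids
            ]
            assign[i] = scores.index(max(scores))
    return dict(zip(words, assign))


# Made-up 2-dimensional vectors forming two obvious groups.
toy_vectors = {
    "a": [1.0, 0.1],
    "b": [0.1, 1.0],
    "c": [0.9, 0.2],
    "d": [0.2, 0.9],
}
classes = kmeans_classes(toy_vectors, k=2)
```

Like any k-means, the outcome depends on the starting assignment, so an unlucky start can leave clusters merged. For query classification, the class ids of the words in a query could serve as coarse features.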

References:

http://blog.csdn.net/zhaoxinfan/article/details/11069485

https://code.google.com/p/word2vec/


