Word2vec practice and keyword clustering

Source: Internet
Author: User

Query processing is becoming more and more important in search, and classification is a key part of it. Query classification is hard because a query is generally short and carries very little information (low entropy). The common approach is to expand the query, for example by fetching the search engine results for it, or by mapping the query directly to its corresponding documents, and then to classify those documents. Documents are easy to classify, and the accuracy is relatively high. Word2vec has been very popular recently. It uses unsupervised learning, so no labeled data is needed, and I tried it out to see whether its results can be used to expand queries for classification.

Where to get word2vec

https://code.google.com/p/word2vec/

You can download the code there, compile it, and build the analysis tools. The C code is written rather "abstractly"; the C++ version below looks more intuitive.

https://github.com/jdeng/word2vec

Training corpus acquisition

Some news data is available from Sogou Labs. It is old, but usable. Weibo would probably be a better source: first, the data volume is large, second, the news corpus can be

The download page is http://www.sogou.com/labs/dl/ca.html.



1. Run ftp ftp.labs.sogou.com in a cmd window.

2. Enter the user name you received when registering.

3. Enter the password you received when registering to connect to the FTP server.

4. Use cd to change to the corresponding directory, and dir or ls to list the files.

5. Run get news_tensite_xml.full.tar.gz to download the file to your personal documents directory.

Process corpus and Word Segmentation

The corpus is in XML format, so the news content has to be extracted and cleaned first.

cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" | sed 's/<content>//' | sed 's/<\/content>//' > news.txt
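As a quick check of the cleaning step, here is a minimal sketch run on a hand-made sample. It assumes each article body sits on a single line between `<content>` and `</content>` tags. The function name `extract_content` and the sample text are made up for illustration, and the sample is already UTF-8, so the iconv step is left out.

```shell
# Keep only the <content> lines and strip the tags: one article per line.
# Assumption: each article body is on a single <content>...</content> line.
extract_content() {
  grep '<content>' | sed -e 's#<content>##' -e 's#</content>##'
}

# Hypothetical two-article sample; <contenttitle> lines are not matched
# because the pattern requires the closing ">" right after "content".
printf '%s\n' \
  '<doc>' \
  '<contenttitle>title one</contenttitle>' \
  '<content>first article text</content>' \
  '</doc>' \
  '<doc>' \
  '<contenttitle>title two</contenttitle>' \
  '<content>second article text</content>' \
  '</doc>' | extract_content
```

On the real corpus the same function would sit after the iconv step, in place of the grep and sed stages.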
 

This extracts the news content, one article per line. The next step is word segmentation of the corpus. I looked at several open source segmenters; the Java versions were awkward to use, and you can lose half a day sorting out garbled text. Here I use ICTCLAS, the Chinese Academy of Sciences segmenter, in its C++ version, which runs easily on Linux. I have written a segmentation program and put it on CSDN; the download contains everything needed, including the libraries, the segmentation dictionaries, the binaries, and the segmentation tool. For more information on ICTCLAS, see http://hi.baidu.com/drkevinzhang/


The corpus contains 1,143,394 articles in total, and the data file is 2.2 GB after word segmentation. The segmented text looks as follows:



Run word2vec for analysis

./word2vec -train out.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
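The flags in the training command can be annotated one by one. The sketch below builds the same argument list in a helper (the name `word2vec_args` is made up) and only prints the command instead of running it. The comments reflect my understanding of the options of the original C tool.

```shell
# word2vec_args CORPUS OUTPUT: print the training flags, one annotation per flag.
word2vec_args() {
  set -- -train "$1" -output "$2"   # segmented corpus in, vectors out
  set -- "$@" -cbow 0               # 0 = skip-gram model, 1 = CBOW
  set -- "$@" -size 200             # dimensionality of the word vectors
  set -- "$@" -window 5             # maximum context window around a word
  set -- "$@" -negative 0           # 0 = no negative sampling
  set -- "$@" -hs 1                 # use hierarchical softmax instead
  set -- "$@" -sample 1e-3          # down-sampling threshold for frequent words
  set -- "$@" -threads 12           # number of training threads
  set -- "$@" -binary 1             # write vectors in binary format
  printf '%s\n' "$*"
}

# Dry run: show the full command line without starting a training job.
echo "./word2vec $(word2vec_args out.txt vectors.bin)"
```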


Training takes a while. When it finishes, the vectors.bin file is generated, and you can use the bundled cosine distance tool to look up related keywords.

Run ./distance vectors.bin and enter the query word you want to inspect to see the results.







We can see that the results for object names are quite reliable, and they would be better still with some preprocessing of the corpus.
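The distance tool ranks words by the cosine similarity of their vectors. To make that measure concrete, here is a rough sketch in awk. It assumes the vectors were written in text format (-binary 0), where the first line holds the vocabulary size and dimension and every other line is a word followed by its components. The function name `cosine` and the toy vectors are invented for illustration.

```shell
# cosine WORD1 WORD2 FILE: cosine similarity of two rows in a text-format vector file.
cosine() {
  awk -v a="$1" -v b="$2" '
    NR == 1 { next }                                   # skip the "vocab_size dimension" header
    $1 == a { for (i = 2; i <= NF; i++) va[i] = $i; na = NF }
    $1 == b { for (i = 2; i <= NF; i++) vb[i] = $i; nb = NF }
    END {
      if (!na || !nb) { print "word not found" > "/dev/stderr"; exit 1 }
      for (i = 2; i <= na; i++) {
        dot += va[i] * vb[i]                           # dot product
        la  += va[i] * va[i]                           # squared length of the first vector
        lb  += vb[i] * vb[i]                           # squared length of the second vector
      }
      printf "%.4f\n", dot / (sqrt(la) * sqrt(lb))
    }' "$3"
}

# Toy 2-dimensional vectors, made up for this example.
demo=$(mktemp)
printf '%s\n' '3 2' 'king 1 0' 'queen 1 0' 'apple 0 1' > "$demo"
cosine king queen "$demo"    # identical directions
cosine king apple "$demo"    # orthogonal directions
rm -f "$demo"
```

Values near 1 mean the two words occur in similar contexts, which is what the ranked list printed by ./distance reflects.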

You can also run

./word2vec -train out.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500

to cluster the words into 500 classes; the clusters can then be used for query classification. The results are as follows:


After the words are removed, the results are quite impressive.
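To read the clusters more easily, the output can be regrouped by class. This is a small sketch, assuming classes.txt holds one "word class_id" pair per line, which is what the -classes option writes as far as I know. The helper names `group_classes` and `class_of` and the sample words are made up.

```shell
# group_classes FILE: print one line per cluster id, followed by its member words.
group_classes() {
  sort -k2,2n -k1,1 "$1" | awk '
    NR == 1 || $2 != prev { if (NR > 1) printf "\n"; printf "%s:", $2; prev = $2 }
    { printf " %s", $1 }
    END { if (NR > 0) printf "\n" }'
}

# class_of WORD FILE: print the cluster id of a single word, e.g. a query term.
class_of() {
  awk -v w="$1" '$1 == w { print $2; exit }' "$2"
}

# Hypothetical miniature classes file.
demo=$(mktemp)
printf '%s\n' 'pear 1' 'apple 0' 'plum 1' 'fig 0' > "$demo"
group_classes "$demo"
class_of plum "$demo"
rm -f "$demo"
```

For query classification, one simple option is to look up the cluster id of each query term with `class_of` and use those ids as features.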


References:

http://blog.csdn.net/zhaoxinfan/article/details/11069485

https://code.google.com/p/word2vec/



