Query processing is becoming more and more important in search, and classification is an important part of it. Query classification is hard because a query is usually short and carries very little information (entropy). The common approach is to expand the query, for example by fetching the search engine results for it, or by mapping the query directly to its corresponding documents, and then to classify those documents instead. Documents are easy to classify, and the accuracy is relatively high. word2vec has recently become very popular. It uses unsupervised machine learning, so no labeled data is needed, and I tried it out to see whether its results can be used to expand queries for classification.
Where is word2vec?
https://code.google.com/p/word2vec/
The code can be downloaded from the address above and compiled to produce the analysis tools. The C code above is written in a rather "abstract" way; there is also a C++ version, linked below, which looks more intuitive.
https://github.com/jdeng/word2vec
Training corpus acquisition
Some news data can be obtained from Sogou Labs. It is old, but it is good enough to use. Weibo data might actually be better. First, the data volume is large, second, the news corpus can be
The data is listed at http://www.sogou.com/labs/dl/ca.html and is downloaded over FTP:
1. Run ftp ftp.labs.sogou.com in a cmd window.
2. Enter the user name generated when you registered.
3. Enter the password generated when you registered to connect to the FTP server.
4. cd to the corresponding directory and run dir or ls to list the files.
5. Run get news_tensite_xml.full.tar.gz to download the file to your personal documents directory.
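The same steps can also be scripted. The following is a minimal sketch using Python's standard ftplib module, not the procedure the post itself uses. The function name download_corpus and the remote_dir argument are hypothetical: the post does not name the directory that holds the file, so it is left as a parameter.

```python
from ftplib import FTP


def download_corpus(ftp, remote_dir, filename, local_path):
    """Fetch one file over an FTP connection that is already logged in."""
    ftp.cwd(remote_dir)  # step 4: change to the directory holding the file
    with open(local_path, "wb") as out:
        # step 5: the scripted equivalent of `get <filename>`
        ftp.retrbinary("RETR " + filename, out.write)
    return local_path


def fetch_from_sogou(user, password, remote_dir):
    """Hypothetical wrapper: user, password and remote_dir are assumptions
    that must come from your own registration and the server's listing."""
    name = "news_tensite_xml.full.tar.gz"
    with FTP("ftp.labs.sogou.com") as ftp:  # step 1: connect
        ftp.login(user, password)           # steps 2-3: credentials
        return download_corpus(ftp, remote_dir, name, name)
```

Passing the connection into download_corpus keeps the transfer logic separate from the login, so it can be tried against any FTP-like object before real credentials are used.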
Corpus processing and word segmentation
The corpus is stored as XML, so the news content has to be extracted and cleaned first.
cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" | sed 's|<content>||' | sed 's|</content>||' > news.txt
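For readers who prefer a script to a shell pipeline, here is a rough Python sketch of the same cleaning step. It assumes that each article body sits on a single line wrapped in <content> tags and that the file is GBK-encoded; the function names extract_content and clean_file are made up for illustration. Unlike a plain grep, this sketch also skips articles whose body is empty.

```python
import re

# Assumption: the article body is wrapped in <content>...</content> on one line.
CONTENT_RE = re.compile(r"<content>(.*?)</content>")


def extract_content(line):
    """Return the text inside <content> tags, or None if absent or empty."""
    match = CONTENT_RE.search(line)
    if not match:
        return None
    body = match.group(1).strip()
    return body or None


def clean_file(src_path, dst_path):
    """Write one article per line to dst_path and return the article count."""
    count = 0
    # errors="ignore" plays the role of iconv's -c flag: bad bytes are dropped.
    with open(src_path, encoding="gbk", errors="ignore") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # stream line by line; the corpus is large
            body = extract_content(line)
            if body:
                dst.write(body + "\n")
                count += 1
    return count
```

A call such as clean_file("news_tensite_xml.dat", "news.txt") would then produce a UTF-8 file with one article per line.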
This cleans the news content, leaving one article per line. The next step is word segmentation of the corpus. I looked at several open-source segmenters; the Java versions are awkward to use, and you can lose half a day to garbled-text (encoding) problems. Here I use ICTCLAS, the Chinese Academy of Sciences word segmenter, in its C++ version, which runs easily on Linux. I have already written a segmentation program and put it on CSDN. The download contains everything required, including libraries, segmentation dictionaries, binary programs, and word segmentation tools. For information on the ICTCLAS segmenter, see http://hi.baidu.com/drkevinzhang/
The corpus contains 1143394 articles in total, and the data file after word segmentation is 2.2 GB. The segmented output looks like this:
Run word2vec for analysis
./word2vec -train out.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1
This step takes some time. When it finishes, the file vectors.bin has been generated, and you can use the bundled cosine distance tool to look up keywords.
Run ./distance vectors.bin and enter the query word you want to inspect to see the effect.
The results for object names are quite reliable, and they would be better still with some preprocessing of the corpus.
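To make the lookup less of a black box, the sketch below reads a vectors.bin file and ranks words by cosine similarity, which is roughly what the distance tool does. It is an illustration under assumptions, not the tool's actual code: it assumes the binary layout written with -binary 1 (a text header with vocabulary size and dimension, then each word followed by a space and its float32 values), little-endian floats, and UTF-8 words. The function names are made up.

```python
import math
import struct


def load_vectors(path):
    """Read word vectors from a word2vec binary file into a dict."""
    vectors = {}
    with open(path, "rb") as f:
        vocab_size, dim = (int(x) for x in f.readline().split())
        for _ in range(vocab_size):
            chars = []
            while True:
                ch = f.read(1)
                if ch == b" " or ch == b"":
                    break
                if ch != b"\n":  # skip the newline ending the previous record
                    chars.append(ch)
            word = b"".join(chars).decode("utf-8", errors="ignore")
            # Assumption: little-endian 4-byte floats.
            vectors[word] = struct.unpack("<%df" % dim, f.read(4 * dim))
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(vectors, query, topn=10):
    """Return up to topn (similarity, word) pairs closest to the query word."""
    target = vectors[query]
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word != query]
    return sorted(scored, reverse=True)[:topn]
```

Pure Python is slow on a full vocabulary with 200-dimensional vectors, so this is only meant for inspecting how the ranking works; nearest(load_vectors("vectors.bin"), word) raises KeyError if the word is not in the vocabulary.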
You can also cluster the analysis results for use in query classification:
./word2vec -train out.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500
The results are as follows:
After the words are removed, the results are quite impressive.
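One possible way to use the clusters for query classification is sketched below. It assumes classes.txt holds one "word class_id" pair per line, and it labels a segmented query by a majority vote over the clusters of its words. The voting rule and the function names are assumptions made for illustration, not something the post prescribes.

```python
from collections import Counter


def load_classes(path):
    """Parse the -classes output: one "word class_id" pair per line."""
    word_to_class = {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:  # ignore malformed lines
                word_to_class[parts[0]] = int(parts[1])
    return word_to_class


def classify_query(tokens, word_to_class):
    """Return the most common cluster id among known tokens, else None."""
    votes = Counter(word_to_class[t] for t in tokens if t in word_to_class)
    return votes.most_common(1)[0][0] if votes else None
```

The query has to be segmented with the same segmenter as the training corpus, so that its tokens match the vocabulary in classes.txt. Words that never appeared in the corpus simply do not vote.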
References:
http://blog.csdn.net/zhaoxinfan/article/details/11069485
https://code.google.com/p/word2vec/
Please pay attention to my blog word2vec practices and keyword Clustering