K-means text clustering series (completed)

Source: Internet
Author: User

(Note: Download myProgramAfter the source code, you may need to download a new ictclas3.0 package from the network, and then overwrite the ictclas3.0 component in the original project file, probably because ictclas3.0 license, an ictclas3.0 package can only be used on one computer. At present, some netizens have encountered similar problems .)

Author: finallyliuyu reprinted and used. Please specify the source.

 

1. How to create a word Bag Model

2. DF Feature Word Selection Method

3. VSM model

4. Obtain the cluster center from WEKA to complete text clustering

5. AllCodeAnd resource download

6. How to use this open-source framework: How to Use the preprocess class

 

Preface:

Many people in the gardenAlgorithmIf you are interested, the top ones in Google search are:

Frog frog recommendation: frog teaches you to use text clustering. This version of code is written in C #.

And the K-means algorithm implemented by the C ++ language based on the previous frog blog. Kmeans-based text clustering

Readers will wonder if I have any need to chew on the hacker that someone else has chewed here to present a poor C ++ cainiao code...

 

Then let me talk about the differences between K-means and the above two versions.

The above two versions focus on the implementation of the kmeans clustering algorithm itself, that is, the frog and the Dongting sangren implemented the kmeans algorithm in the C # And C ++ languages respectively, rather than true text clustering: 1. They didn't embed the word segmentation component, but roughly wrote a word segmentation function using spaces as separators. 2. They have no real corpus. Their text to be clustered is a few words written by themselves.

My version of kmeans does not focus on the implementation of the kmeans algorithm itself, but also relies on well-known open-source components in the field of data mining.WEKATo implement the clustering algorithm. My focus is on implementationCommon text preprocessing module.The so-called text preprocessing includes Word Segmentation-"Remove deprecated word =" build the word bag model = "feature word selection =" build the document vector model (VSM) model. Finally, write the VSM model of the test text into the ARFF data format required by WEKA.. What I emphasize is to provideOpen-source frameworkYou can use this framework to complete text preprocessing, convert the training and testing document set to ARFF data format, and then call WEKA, use WEKA to complete text clustering. Finally, obtain the cluster center calculated by WEKA. For each article in the test sample setArticleCalculate the distance from the cluster center to complete clustering.

Secondly, I will also provide authentic text, which contains three categories: "Entertainment", "legal", and "education" for sixteen pieces of news as a test corpus (note: these sixty-six news articles are collected by personal web page text extraction software. If you need corpus, you can download my configuration program and download it by yourself. For more information, see 《News webpage Text Extraction series blog).We will give you a demonstration. At the same time, I will upload the source code and news to my blog for you to download,Research and learning.

First, declare that in my framework, the test text must be stored in the database (MSSQL Server2000 ). At present, there are still many imperfections in this framework. Let's look at it and hope that the high people in the garden will give some advice.

Let's take a look at the clustering effect first.

Test corpus (Part ):

 

 

Results After clustering:

 

In order to make readers more clearly observe the situation before and after the test sample set clustering, the corpus database storage is provided.

This series of blog posts will be expanded as follows. First, we will introduce the meaning of the program code of each module, then the classes that encapsulate the functions of each module, and the usage instructions of this framework, and how to use WEKA clustering, and provide corpus andSource codeDownload.

The code meaning of each module program code has been described in the following two parts:

Statement: This is my first project program written in C ++. I hope you can read the code style and accept it in a critical way. Of course, you are welcome to give more comments, help me improve programming skills. For example, I have been struggling to get a good name for how to send letters, variables, and so on. I have been familiar with C, C ++, C #, Java, and python, but I have read many naming styles, schools, and Hungary naming rules, what is the shorthand, what is the first word in lowercase, and so on... I hope you can give me some tips on function naming.

At the same time, I would like to thank you.Zookeeper AndGalacticAThank you for your selfless and timely help in writing the C ++ program!

 

 

 

 

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.