K-means text clustering series (completed)

Last Update: 2018-12-07 Source: Internet

Author: User

(Note: After downloading the source code of my program, you may need to download a fresh ictclas3.0 package from the Internet and use it to overwrite the ictclas3.0 component in the original project files. This is probably because of the ictclas3.0 license: one ictclas3.0 package can only be used on one computer. Some readers have already run into this problem.)

Author: finallyliuyu. If you reprint or use this article, please cite the source.

1. How to build a bag-of-words model

2. DF feature word selection method

3. VSM model

4. Obtain the cluster centers from WEKA to complete text clustering

5. All code and resource downloads

6. How to use this open-source framework: how to use the preprocess class

Preface:

Many people in the garden are interested in this algorithm. The top results in a Google search are:

Frog's recommendation: "Frog teaches you to use text clustering". This version of the code is written in C#.

And a K-means implementation in C++ based on that earlier frog blog post: "Kmeans-based text clustering".

Readers may wonder why I need to chew over something that others have already chewed, only to present some poor C++ beginner's code...

So let me explain how my K-means version differs from the two versions above.

The two versions above focus on implementing the kmeans clustering algorithm itself. That is, the frog and Dongting sangren implemented the kmeans algorithm in C# and C++ respectively, rather than real text clustering: (1) they did not embed a word segmentation component, and instead wrote a crude segmentation function that uses spaces as separators; (2) they had no real corpus, and the texts they clustered were a few sentences they wrote themselves.
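For illustration, a space-delimited segmentation function of the kind described above might look like the following minimal sketch. It is not the code from either of those posts; the name `splitOnSpaces` is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split text into tokens wherever whitespace occurs.
// Hypothetical helper, shown only to illustrate "segmentation by spaces".
std::vector<std::string> splitOnSpaces(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string token;
    while (in >> token) {  // operator>> skips runs of whitespace
        tokens.push_back(token);
    }
    return tokens;
}
```

This works for text whose words are already separated by spaces. Chinese text is written without spaces between words, so a whole sentence would come back as a single token, which is why a real word segmentation component is needed.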

My version does not focus on implementing the kmeans algorithm itself. Instead, it relies on WEKA, a well-known open-source component in the data mining field, to run the clustering algorithm. My focus is on implementing a common text preprocessing module. Text preprocessing here means: word segmentation -> stop-word removal -> building the bag-of-words model -> feature word selection -> building the document vector space model (VSM). Finally, the VSM of the test texts is written out in the ARFF data format that WEKA requires. What I want to emphasize is that I provide an open-source framework: you can use it to complete text preprocessing, convert the training and test document sets to the ARFF data format, and then call WEKA to perform the text clustering. Finally, you obtain the cluster centers computed by WEKA and, for each article in the test sample set, compute its distance to each cluster center to complete the clustering.
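The pipeline described above can be sketched roughly as follows. This is a simplified illustration, not the framework's actual code: it assumes the documents are already segmented into words and held in memory, uses raw term counts as VSM weights, and names its ARFF attributes `f0`, `f1`, and so on. All function names and the `minDf` threshold are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

using Document = std::vector<std::string>;  // one already-segmented document

// Stop-word removal: drop every word that appears in the stop-word set.
Document removeStopWords(const Document& doc, const std::set<std::string>& stopWords) {
    Document kept;
    for (const std::string& w : doc)
        if (stopWords.count(w) == 0) kept.push_back(w);
    return kept;
}

// Bag of words with document frequency (DF): for each word,
// the number of documents that contain it at least once.
std::map<std::string, int> documentFrequency(const std::vector<Document>& docs) {
    std::map<std::string, int> df;
    for (const Document& doc : docs) {
        std::set<std::string> unique(doc.begin(), doc.end());
        for (const std::string& w : unique) ++df[w];
    }
    return df;
}

// DF feature selection: keep words whose DF is at least minDf (assumed threshold).
std::vector<std::string> selectFeatures(const std::map<std::string, int>& df, int minDf) {
    std::vector<std::string> features;
    for (const auto& kv : df)
        if (kv.second >= minDf) features.push_back(kv.first);
    return features;  // in key order, because std::map iterates sorted
}

// VSM: one term-count vector per document over the selected feature words.
std::vector<double> toVector(const Document& doc, const std::vector<std::string>& features) {
    std::vector<double> v(features.size(), 0.0);
    for (std::size_t i = 0; i < features.size(); ++i)
        for (const std::string& w : doc)
            if (w == features[i]) v[i] += 1.0;
    return v;
}

// Serialize the vectors as ARFF text: header, numeric attributes, then data rows.
std::string toArff(const std::vector<std::vector<double>>& vectors, std::size_t dims) {
    std::ostringstream out;
    out << "@RELATION documents\n\n";
    for (std::size_t i = 0; i < dims; ++i) out << "@ATTRIBUTE f" << i << " NUMERIC\n";
    out << "\n@DATA\n";
    for (const auto& v : vectors) {
        for (std::size_t i = 0; i < v.size(); ++i) out << (i ? "," : "") << v[i];
        out << "\n";
    }
    return out.str();
}

// Assign a document vector to the closest cluster center (squared Euclidean distance).
// Assumes every center has the same dimension as v.
std::size_t nearestCenter(const std::vector<double>& v,
                          const std::vector<std::vector<double>>& centers) {
    std::size_t best = 0;
    double bestDist = -1.0;
    for (std::size_t c = 0; c < centers.size(); ++c) {
        double d = 0.0;
        for (std::size_t i = 0; i < v.size(); ++i)
            d += (v[i] - centers[c][i]) * (v[i] - centers[c][i]);
        if (bestDist < 0.0 || d < bestDist) { bestDist = d; best = c; }
    }
    return best;
}
```

In this sketch the clustering itself is left to WEKA: the string from `toArff` would be saved to a file and loaded in WEKA, and the cluster centers WEKA reports would then be passed to `nearestCenter` for each test document.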

Secondly, I also provide real texts: sixty-six news articles in three categories, "Entertainment", "Legal", and "Education", as the test corpus. (Note: these sixty-six news articles were collected with my own web page text extraction software. If you need a corpus, you can download my program, configure it, and collect one yourself. For more information, see the "News Web Page Text Extraction" blog series.) I will use them to give you a demonstration. I will also upload the source code and the news corpus to my blog for you to download, study, and learn from.

First, I should state that in my framework the test texts must be stored in a database (MS SQL Server 2000). The framework still has many imperfections. Please take a look, and I hope the experts in the garden will offer some advice.

Let's look at the clustering results first.

Test corpus (partial):

Results after clustering:

So that readers can compare the test sample set before and after clustering more clearly, the database storage of the corpus is also shown.

This series of blog posts is organized as follows. First, I explain the meaning of the program code in each module, then the classes that encapsulate each module's functions, then how to use this framework and how to cluster with WEKA. Finally, I provide the corpus and source code for download.

The meaning of each module's program code is described in the following two parts:

Statement: This is my first project written in C++. I hope you will read the code style with a critical eye. Of course, you are welcome to give more comments and help me improve my programming skills. For example, I have long struggled with how to choose good names for functions, variables, and so on. I have worked with C, C++, C#, Java, and Python, and I have seen many naming styles and schools: Hungarian notation, abbreviations, lowercase first words, and so on... I hope you can give me some tips on function naming.

At the same time, I would like to thank Zookeeper and GalacticA for their selfless and timely help while I was writing the C++ program!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
