Using k-means for text clustering in Mahout: an example walkthrough

Source: Internet
Author: User
Tags idf

Mahout in Action provides a text clustering example together with the raw input data.

Text clustering is one of the main application scenarios for clustering algorithms, and modeling the text is a common problem in it. Information retrieval already offers a well-established modeling approach: the vector space model, which is the most widely used model in that field.

Term frequency-inverse document frequency (TF-IDF) is an enhancement of the plain TF method. The weight of a word increases in proportion to the number of times it appears in a document and decreases as the number of documents in the collection that contain it grows. High-frequency but meaningless words (stop words), for example, appear in almost every text, so their weights are greatly reduced, which lets the model describe the distinguishing features of each text more accurately. In information retrieval, TF-IDF is the most common weighting method for text modeling.
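To make the weighting concrete, here is a minimal sketch of the textbook formula tf * ln(N / df) on a tiny hand-made corpus. The class name TfIdfSketch, the sample documents, and the exact formula are assumptions chosen for illustration; the variant Mahout applies may differ in smoothing and normalization details.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {
    // Raw term frequency: number of times `term` occurs in one tokenized document.
    static long termFrequency(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    // Document frequency: number of documents that contain `term` at least once.
    static long documentFrequency(List<List<String>> corpus, String term) {
        return corpus.stream().filter(d -> d.contains(term)).count();
    }

    // Textbook weight tf * ln(N / df); returns 0 when the term is absent from the corpus.
    static double tfIdf(List<List<String>> corpus, List<String> doc, String term) {
        long df = documentFrequency(corpus, term);
        if (df == 0) {
            return 0.0;
        }
        return termFrequency(doc, term) * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        // Hypothetical three-document corpus, already tokenized.
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("the", "oil", "price", "rose"),
            Arrays.asList("the", "bank", "cut", "the", "rate"),
            Arrays.asList("the", "oil", "export", "fell"));
        List<String> doc = corpus.get(0);
        for (String term : doc) {
            System.out.println(term + ": " + tfIdf(corpus, doc, term));
        }
    }
}
```

In this corpus "the" occurs in every document, so its weight drops to zero, while "price", which occurs in only one document, gets the largest weight in the first document.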

Mahout provides utility classes for vectorizing text. They analyze the text with Lucene and then create text vectors. The example below uses the news dataset provided by Reuters. After downloading the dataset, place it in the "src/test/input" directory. Dataset: http://www.daviddlewis.com/resources/testcollections/reuters21578/

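1. Extract the Reuters data. Mahout provides a dedicated class for this.

import java.io.File;
// Assumption: ExtractReuters lives in this package; it may differ between versions.
import org.apache.lucene.benchmark.utils.ExtractReuters;

// Directory holding the downloaded Reuters files
File inputFolder = new File("src/test/input");
// Directory that receives the extracted plain-text articles
File outputFolder = new File("src/test/input-extracted");
ExtractReuters extractor = new ExtractReuters(inputFolder, outputFolder);
extractor.extract();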

2. Store the data as a SequenceFile

Mahout provides the seqdirectory command, which converts a directory of text files into a SequenceFile. Run bin/mahout seqdirectory -h to see the help for this command and to set the input and output parameters. The input here is the text extracted in the previous step, that is, the directory "src/test/input-extracted".

3. Vectorize the data in the SequenceFile using Lucene tools

Mahout provides the seq2sparse command for vectorization. Run bin/mahout seq2sparse -h to see the help for this command. The input is the output of step 2.

The vectorized output has the following directory structure:

  • df-count directory: stores the document frequency of each term
  • tf-vectors directory: stores the text vectors weighted by TF
  • tfidf-vectors directory: stores the text vectors weighted by TF-IDF
  • tokenized-documents directory: stores the text after tokenization
  • wordcount directory: stores the number of times each word appears across the whole collection
  • dictionary.file-0: stores the vocabulary of the text collection
  • frequency.file-0: stores the frequency information for the entries in the vocabulary
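The roles of the dictionary, the document frequencies, and the TF vectors can be illustrated with a small in-memory sketch. This is not Mahout's implementation: the class VectorizeSketch and its method names are hypothetical, and plain Java maps stand in for the files that seq2sparse writes.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class VectorizeSketch {
    // Term -> integer index, assigned in order of first appearance
    // (plays the role of the dictionary).
    static Map<String, Integer> buildDictionary(List<List<String>> docs) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        for (List<String> doc : docs) {
            for (String term : doc) {
                dict.putIfAbsent(term, dict.size());
            }
        }
        return dict;
    }

    // Term index -> number of documents that contain the term.
    static Map<Integer, Integer> documentFrequencies(List<List<String>> docs,
                                                     Map<String, Integer> dict) {
        Map<Integer, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set ensures each term is counted once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(dict.get(term), 1, Integer::sum);
            }
        }
        return df;
    }

    // Sparse TF vector for one document: term index -> occurrence count.
    static Map<Integer, Integer> tfVector(List<String> doc, Map<String, Integer> dict) {
        Map<Integer, Integer> tf = new TreeMap<>();
        for (String term : doc) {
            tf.merge(dict.get(term), 1, Integer::sum);
        }
        return tf;
    }
}
```

Storing only the non-zero entries keeps each document vector small even when the vocabulary is large, which is why the vectors are kept in sparse form.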

Next, run mahout kmeans to perform the clustering. The input is the vectors in the tf-vectors directory. If the whole process runs correctly, the output directory contains clusters-n directories, where n is the iteration number.
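For readers who want to see what the k-means iterations do, the following is a minimal single-machine sketch of Lloyd's algorithm. It assumes dense double arrays, Euclidean distance, and initial centers supplied by the caller; the class KMeansSketch is hypothetical and is not how Mahout's distributed job is implemented.

```java
import java.util.Arrays;

class KMeansSketch {
    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the center closest to point p.
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(p, centers[c]) < squaredDistance(p, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Alternates assignment and center update until assignments stop changing
    // or maxIter is reached. Updates `centers` in place and returns the
    // cluster index of each point.
    static int[] cluster(double[][] points, double[][] centers, int maxIter) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {
                int c = nearest(points[i], centers);
                if (c != assignment[i]) {
                    assignment[i] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            for (int c = 0; c < centers.length; c++) {
                double[] sum = new double[centers[c].length];
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) {
                        count++;
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += points[i][d];
                        }
                    }
                }
                // An empty cluster keeps its previous center.
                for (int d = 0; count > 0 && d < sum.length; d++) {
                    centers[c][d] = sum[d] / count;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centers = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, centers, 10)));
        System.out.println(Arrays.deepToString(centers));
    }
}
```

Each pass over the data in this sketch corresponds loosely to one iteration of the Mahout job, which is why the job writes one clusters-n directory per iteration.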

Finally, use the mahout clusterdump command that Mahout provides to inspect and analyze the clustering results.
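A common way to read a text cluster is to list the terms that carry the most weight in its centroid. The sketch below shows that idea on plain arrays; the class TopTermsSketch, the sample weights, and the sample terms are made up for illustration and do not reproduce the output format of clusterdump.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TopTermsSketch {
    // Returns the n terms with the largest centroid weights, heaviest first.
    // dictionary[i] is the term whose weight is stored in centroid[i].
    static List<String> topTerms(double[] centroid, String[] dictionary, int n) {
        Integer[] order = new Integer[centroid.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort term indices by descending weight.
        Arrays.sort(order, (a, b) -> Double.compare(centroid[b], centroid[a]));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, order.length); i++) {
            result.add(dictionary[order[i]]);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] centroid = {0.1, 2.5, 0.0, 1.2};
        String[] dictionary = {"said", "oil", "the", "crude"};
        System.out.println(topTerms(centroid, dictionary, 2));
    }
}
```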
