Mallet Instructions for use

Source: Internet
Author: User

Mallet: Natural Language Processing Toolkit


Mallet is a Java-based natural language processing toolkit. It covers document classification, clustering, topic modeling, information extraction, and other machine learning applications to text. Although it is aimed at text, its techniques can also be applied to other media, for example in machine vision.

Mallet includes a range of algorithms for text classification, such as Naïve Bayes, Maximum Entropy, and decision trees, as well as tools for feature extraction. The code is also optimized for efficiency.

Mallet also contains sequence tagging tools for applications such as information extraction. The algorithms include Hidden Markov Models, Maximum Entropy Markov Models, and Conditional Random Fields.

Mallet also includes a topic modeling toolkit, which contains efficient, sampling-based implementations of Latent Dirichlet Allocation, Pachinko Allocation, and hierarchical LDA.

Mallet has other powerful features as well. Links to the API documentation and a PDF tutorial: [API] [tutorial].

============== About Mallet installation and configuration ==============

Below is a description of the installation and configuration, which I have reproduced here:

Mallet Instructions for use

Mallet is a Java-based software package for machine learning. It supports natural language processing, text classification, topic modeling, text clustering, information extraction, and so on. The following describes how to configure the Mallet environment and how to use Mallet.

One. Environment configuration

1. Download and install the JDK and set the environment variables correctly

You need to set three environment variables:

- JAVA_HOME: the directory where the JDK is installed.

For example: C:\Program Files\Java\jdk1.6.0_10

- PATH: the list of directories searched for executable files.

Add this value: %JAVA_HOME%\bin

- CLASSPATH: the list of paths searched for the Java classes needed at compile time or run time. The value of this environment variable is: %JAVA_HOME%\lib\tools.jar;%JAVA_HOME%\lib\dt.jar

To test the configuration:

- Open a command prompt (Run → cmd).

- Enter javac and then java. If neither reports an error, the configuration was successful.

2. Download Apache Ant. No installation is required; just set the environment variables correctly.

Apache Ant is a Java-based build tool.

- Download it from http://ant.apache.org/bindownload.cgi, unzip it, and place it in a directory of your choice.

- Configuration

ANT_HOME: the directory where Apache Ant was extracted, for example: C:\server\apache-ant-1.8.0
CLASSPATH: %ANT_HOME%\lib

PATH: %ANT_HOME%\bin

- Testing

- Open a command prompt (Run → cmd).

- Enter ant

If the following output appears, the configuration was successful. Ant has started correctly and is only reporting that the current directory contains no build file:

Buildfile: build.xml does not exist!
Build failed

3. Download the latest version of Mallet, mallet-2.0.5, from http://mallet.cs.umass.edu/download.php

Configure the environment variables:

- MALLET_HOME: the directory where Mallet was extracted, such as C:\mallet

- Add %MALLET_HOME%\bin to PATH

- CLASSPATH: %MALLET_HOME%\class;%MALLET_HOME%\lib;%MALLET_HOME%\lib\mallet-deps.jar

- Open a command prompt and go to the Mallet directory.

- Enter ant

If the words BUILD SUCCESSFUL appear, the configuration succeeded.
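Before running ant, it can help to confirm that the three home variables from the steps above are actually visible to Java. The following is a minimal sketch under that assumption; the class EnvCheck and its method are hypothetical helpers written for this guide, not part of the JDK, Ant, or Mallet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper for this guide; not part of the JDK, Ant, or Mallet.
class EnvCheck {
    // The three "home" variables that the configuration steps ask you to define.
    static final String[] REQUIRED = {"JAVA_HOME", "ANT_HOME", "MALLET_HOME"};

    // Returns the names of required variables that are absent or blank.
    static List<String> missing(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // System.getenv() exposes the environment of the current process.
        List<String> absent = missing(System.getenv());
        System.out.println(absent.isEmpty()
                ? "All required variables are set."
                : "Missing variables: " + absent);
    }
}
```

The check only tells you whether the variables exist; it does not verify that the directories they point to are correct.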

Two. Brief description of Mallet

Full name: MAchine Learning for LanguagE Toolkit

Mallet is a Java software package designed for statistical natural language processing, text classification, topic modeling, information extraction, and other machine learning applications involving text.

a) Text classification: The basic idea is to train a classifier on a large number of training samples, test its performance on some test samples, and then save the trained classifier model. When you feed text of an unknown category into the trained model, it outputs the probability of each class to which the sample may belong.

b) Topic modeling: Topic modeling is used to analyze large amounts of unlabeled text (category unknown). Analyzing the text yields a number of topics (the number can be specified or left at the default), and each topic consists of words that often appear together. You can save the trained topic model in order to infer which topics an unseen text belongs to.

c) Mallet can convert text into a numerical representation, which makes machine learning on text more efficient. This process is implemented through the pipe system, which can be used for tokenization, stop word removal, conversion of sequences to vectors, and so on. The code can be found in mallet\src\cc\mallet\pipe.
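The pipe idea in item c) can be sketched as a chain of small steps, each taking the output of the previous one. The code below is only an illustration: PipeSketch, its methods, and the tiny stop word list are assumptions made for this example and are not the classes found in mallet\src\cc\mallet\pipe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for a pipe chain; these are not Mallet's own classes.
class PipeSketch {
    // Assumed stop word list, kept tiny for illustration.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "is"));

    // Step 1: lowercase the text and split it into alphabetic tokens.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Step 2: drop every token that is in the stop word list.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Step 3: turn the token sequence into word counts (a sparse vector).
    static Map<String, Integer> toCounts(List<String> tokens) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // The whole chain: each step consumes the previous step's output.
    static Map<String, Integer> run(String text) {
        return toCounts(removeStopwords(tokenize(text)));
    }
}
```

For the input "The soup is the soup of the day", run returns the counts soup=2 and day=1, since the other tokens are all in the assumed stop word list.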

Three. Steps for using Mallet

Text classification:

1. C:\mallet>mallet import-dir --input sample-data\classify-input\* --output classify-input.mallet

This command is equivalent to:

C:\mallet>java cc.mallet.classify.tui.Text2Vectors --input sample-data\classify-input\* --output classify-input.vectors

This command converts all the data in the subdirectories of the classify-input folder into feature vectors (the folder name can be changed to suit your needs; I named mine classify-input). Mallet can then use the converted data to train and test classifiers.

Note: Here there are three folders under classify-input: sport, science, and food. After this command runs, the data is automatically divided into three categories named sport, science, and food, and the data in each folder is assigned the category named after that folder.
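The folder-to-category rule in the note can be illustrated with a few lines of Java. This is a sketch of the naming convention only, not of how import-dir is implemented; the class LabelFromPath and the file name doc1.txt are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of "the folder name is the category name".
class LabelFromPath {
    // The label is the name of the directory that directly contains the file.
    static String labelOf(Path file) {
        Path parent = file.getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException("File has no parent directory: " + file);
        }
        return parent.getFileName().toString();
    }

    public static void main(String[] args) {
        // doc1.txt is a made-up file name used only for this example.
        Path example = Paths.get("sample-data", "classify-input", "sport", "doc1.txt");
        System.out.println(example + " is labeled " + labelOf(example));
    }
}
```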

2. C:\mallet>mallet train-classifier --input classify-input.mallet --trainer NaiveBayes --training-portion 0.8 --output-classifier classifier1.classifier

This command is equivalent to:

C:\mallet>java cc.mallet.classify.tui.Vectors2Classify --input classify-input.vectors --trainer NaiveBayes --training-portion 0.8 --output-classifier classifier1.classifier

This command trains and tests a classifier. The value of the --input parameter, classify-input.mallet, is the feature vector file generated in step 1. The value of the --trainer parameter, NaiveBayes, is the algorithm used to train the classifier; other algorithms, such as MaxEnt, can be specified instead. The value of the --training-portion parameter, here 0.8, can be set as needed: 0.8 means that a random 80% of the data in classify-input.mallet is used as training data and the rest as test data, which is used to measure the accuracy and other performance indicators of the trained classifier. The value of the --output-classifier parameter, classifier1.classifier, is the name of the file in which the trained classifier is saved.
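The effect of --training-portion 0.8 can be pictured as shuffling the instances and cutting the list at 80%. The sketch below shows that idea under stated assumptions: SplitSketch is a hypothetical class, and the seed and rounding rule are choices made for this example, not a description of Mallet's internals.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of a random train/test split.
class SplitSketch {
    // Shuffles a copy of the items and returns two lists: [training, testing].
    static <T> List<List<T>> split(List<T> items, double trainingPortion, long seed) {
        if (trainingPortion < 0.0 || trainingPortion > 1.0) {
            throw new IllegalArgumentException("Portion must be between 0 and 1");
        }
        List<T> shuffled = new ArrayList<>(items);
        // A fixed seed makes the split repeatable (an assumption of this sketch).
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainingPortion);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }
}
```

With ten instances and a portion of 0.8, the first list holds eight instances for training and the second holds the remaining two for testing.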

3. C:\mallet>java cc.mallet.classify.tui.Text2Classify --input sample-data\data\classify-test.txt --output - --classifier classifier1.classifier

This command classifies text of an unknown category with the trained classifier. The --input parameter value, sample-data\data\classify-test.txt, is the location of the text to be classified. The value "-" after --output means that the probability of each category is printed directly on the command line. The value of the --classifier parameter is the name of the classifier to use (that is, the trained classifier).

Note: The text to be classified needs no preprocessing. Supply the raw text directly, with each line of the file representing one instance to classify.
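To make the train-then-classify flow more concrete, here is a toy multinomial Naive Bayes classifier that outputs a probability for every category. It is a sketch with add-one smoothing written for this guide; TinyNaiveBayes is a hypothetical class and is not Mallet's NaiveBayes implementation.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy classifier; not Mallet's NaiveBayes implementation.
class TinyNaiveBayes {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Adds one labeled training document, given as a list of tokens.
    void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            tokensPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    // Returns P(label | tokens) for every label seen during training.
    Map<String, Double> classify(List<String> tokens) {
        Map<String, Double> logScores = new HashMap<>();
        double max = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Log prior: fraction of training documents with this label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = tokensPerLabel.getOrDefault(label, 0);
            for (String t : tokens) {
                if (!vocabulary.contains(t)) {
                    continue; // words never seen in training are ignored
                }
                // Add-one (Laplace) smoothing avoids zero probabilities.
                score += Math.log((counts.getOrDefault(t, 0) + 1.0) / (total + vocabulary.size()));
            }
            logScores.put(label, score);
            max = Math.max(max, score);
        }
        // Normalize the log scores into probabilities that sum to 1.
        double sum = 0.0;
        for (double s : logScores.values()) {
            sum += Math.exp(s - max);
        }
        Map<String, Double> probs = new HashMap<>();
        for (Map.Entry<String, Double> e : logScores.entrySet()) {
            probs.put(e.getKey(), Math.exp(e.getValue() - max) / sum);
        }
        return probs;
    }
}
```

Trained on one sport document (ball, goal, team) and one food document (bread, cheese, soup), the sketch assigns the input (goal, team) a probability of 0.8 for sport and 0.2 for food.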

Topic modeling:

1. C:\mallet>mallet import-dir --input sample-data\topic-input --output topic-input.mallet --keep-sequence --remove-stopwords

This command converts all the text under the topic-input directory into feature sequences. The --keep-sequence parameter is required; without it an error occurs, because topic modeling takes feature sequences rather than feature vectors as its input, so this parameter must be used to fix the format of the converted data. The --remove-stopwords parameter removes stop words.
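The difference between a feature sequence and a feature vector can be shown with a small sketch. A sequence keeps one entry per token in document order, which a sampling-based topic model needs because it assigns a topic to each token occurrence; a vector keeps only counts. SequenceVsVector and its methods are hypothetical names for this example, not Mallet classes.

```java
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the two data formats; not Mallet's classes.
class SequenceVsVector {
    // Feature sequence: one alphabet index per token, in the original order.
    // New words are added to the alphabet as they are first seen.
    static int[] toSequence(List<String> tokens, Map<String, Integer> alphabet) {
        int[] sequence = new int[tokens.size()];
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            Integer index = alphabet.get(token);
            if (index == null) {
                index = alphabet.size();
                alphabet.put(token, index);
            }
            sequence[i] = index;
        }
        return sequence;
    }

    // Feature vector: one count per alphabet index; token order is lost.
    static int[] toVector(int[] sequence, int alphabetSize) {
        int[] counts = new int[alphabetSize];
        for (int index : sequence) {
            counts[index]++;
        }
        return counts;
    }
}
```

For the tokens cat, sat, cat, mat the sequence is 0, 1, 0, 2, while the vector is 2, 1, 1. The vector can be computed from the sequence, but not the other way around.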

2. C:\mallet>mallet train-topics --input topic-input.mallet --num-topics 2 --output-doc-topics docstopics --inferencer-filename infer1.inferencer

This command trains a topic model on the data from step 1. The value 2 for --num-topics limits the number of topics to 2; you can set other values as needed, and the default number of topics is 10. The --output-doc-topics parameter writes the document-topic matrix, which is stored in the docstopics file. The --inferencer-filename parameter saves the trained topic model for later use; here the model is stored in infer1.inferencer, and the name can be chosen freely.

3. C:\mallet>mallet import-dir --input sample-data\data --output topic-test.mallet --keep-sequence --remove-stopwords

Same as step 1.

4. C:\mallet>mallet infer-topics --input topic-test.mallet --inferencer infer1.inferencer --output-doc-topics testdocstopics

This command uses the trained topic model to infer the topics of the unlabeled text in topic-test.mallet. The --inferencer parameter names the trained topic model, infer1.inferencer, that is used to infer the topics of the unseen text. The --output-doc-topics parameter writes the document-topic matrix, which is stored in the testdocstopics file.
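Each row of a document-topic matrix gives one document's proportions over the topics, so a common follow-up is to pick the topic with the largest proportion. The sketch below assumes the proportions have already been read into an array; DominantTopic is a hypothetical class, the numbers in main are made up, and the layout of the file Mallet writes is not modeled here.

```java
// Hypothetical post-processing step; the file format Mallet writes is not modeled.
class DominantTopic {
    // Returns the index of the topic with the largest proportion in one row.
    static int dominantTopic(double[] proportions) {
        if (proportions.length == 0) {
            throw new IllegalArgumentException("Row has no topics");
        }
        int best = 0;
        for (int k = 1; k < proportions.length; k++) {
            if (proportions[k] > proportions[best]) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up proportions for three documents and two topics.
        double[][] matrix = {{0.9, 0.1}, {0.25, 0.75}, {0.6, 0.4}};
        for (int d = 0; d < matrix.length; d++) {
            System.out.println("document " + d + " -> topic " + dominantTopic(matrix[d]));
        }
    }
}
```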

Note:

- For text classification, the unseen text must be supplied as a single file in which each line represents one instance to classify. Topic modeling, by contrast, can work either on all the documents in a directory, imported with the import-dir command as in step 3 of topic modeling, or on a single file, imported with import-file:

C:\mallet>mallet import-file --input sample-data\data\topic-test.txt --output topic-test.mallet --keep-sequence --remove-stopwords

- To list the available commands, such as import-file, import-dir, train-topics, infer-topics, and train-classifier, run:

C:\mallet>mallet

The parameters of each command can be listed with the --help option.

Example: C:\mallet>mallet import-dir --help

You can choose the parameters according to your own needs.

Reference: http://blog.csdn.net/xianggelilaling/article/details/5634815/
