LDA Topic Modeling Based on JGibbLDA


I recently started learning and using LDA (Latent Dirichlet Allocation) while working on LDA-based text classification. Since my code is in Java, the open-source LDA tool I chose is JGibbLDA, a Java implementation of LDA. It can be downloaded from http://jgibblda.sourceforge.net/; the latest version at the time of writing is v1.0. The corresponding C++ version is GibbsLDA++, available at http://gibbslda.sourceforge.net/.

Thanks to Phantom (BYR ID) for guidance and help while I was learning and using the tool.

First, download and unzip the package. The extracted directory contains four folders: bin holds the compiled class files, lib contains args4j-2.0.6.jar, models holds a pre-built example topic model, and src holds the source files. When I used it, I added the jar file to my project and copied the jgibblda package under src directly into the project.

1. Input File Format

The input file format is as follows: the first line gives the number of documents in the training corpus; each subsequent line is one document, and the content of each line is the words of that document, separated by whitespace.

2. Output Files
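To make the format concrete, here is a small helper (my own illustration, not part of JGibbLDA) that serializes a list of documents into this layout; the file name newdocs.dat is just an example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Builds a corpus file in JGibbLDA's input format:
// first line = number of documents, then one whitespace-separated document per line.
public class CorpusWriter {
    public static String toCorpus(List<String> docs) {
        StringBuilder sb = new StringBuilder();
        sb.append(docs.size()).append('\n');
        for (String doc : docs) {
            sb.append(doc).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        List<String> docs = List.of(
            "lda topic model inference",
            "gibbs sampling topic model");
        Files.writeString(Path.of("newdocs.dat"), toCorpus(docs));
        System.out.print(toCorpus(docs));
    }
}
```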

The main output files are <model_name>.others, <model_name>.phi, <model_name>.theta, <model_name>.tassign, and <model_name>.twords; a wordmap.txt file is also generated. <model_name> is named after the sampling iteration at which the model was saved, e.g. model-00800; the model saved at the last iteration is named model-final.

The .others file stores the LDA model parameters, such as alpha, beta, and so on.

The .phi file stores the topic-word distribution: each row is a topic, and each column is the probability of a word given that topic.

The .theta file stores the document-topic distribution: each row is a document, and each column is the probability of a topic for that document.
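As an illustration of reading this distribution (my own sketch, not JGibbLDA code), the snippet below parses one .theta row and reports the document's most probable topic; the three-topic row is made up.

```java
import java.util.Arrays;

// Each row of <model_name>.theta holds p(topic | document) as
// whitespace-separated probabilities. This sketch parses one row
// and reports the index of the most probable topic.
public class ThetaRow {
    public static double[] parse(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                     .mapToDouble(Double::parseDouble)
                     .toArray();
    }

    public static int argmax(double[] p) {
        int best = 0;
        for (int k = 1; k < p.length; k++) {
            if (p[k] > p[best]) best = k;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] theta = parse("0.1 0.7 0.2"); // one document over 3 topics
        System.out.println(argmax(theta));     // prints 1
    }
}
```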

The .tassign file stores the topic assignment of each word in the training corpus: each line is one document, given as word:topic pairs. The .twords file lists the most likely words of each topic, and wordmap.txt stores the mapping between words and their integer IDs.

3. How to Use

JGibbLDA can be run from the command line or called from code.

3.1 Command-Line Invocation

Training a model from scratch:

$ java [-mx512m] -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -est [-alpha <double>] [-beta <double>] [-ntopics <int>] [-niters <int>] [-savestep <int>] [-twords <int>] -dir <string> -dfile <string>
-est: estimate an LDA model from the training corpus

-alpha <double>: the alpha hyperparameter of the LDA model; default is 50/K (K is the number of topics)

-beta <double>: the beta hyperparameter of the LDA model; default is 0.1

-ntopics <int>: the number of topics; default is 100

-niters <int>: the number of Gibbs sampling iterations; default is 2000

-savestep <int>: the model is saved to disk every this many iterations

-twords <int>: the number of most likely words to print for each topic

-dir <string>: the directory containing the training corpus

-dfile <string>: the file name of the training corpus

Note that -dfile should be just the corpus file name, while -dir specifies the directory containing it; the model files generated during training are stored under -dir by default. For example, if the absolute path of the corpus is /usr/java/models/newdoc.dat, set -dir=/usr/java/models and -dfile=newdoc.dat, and the model files produced during training will be stored under /usr/java/models.

Continuing estimation from a previously saved model:

$ java [-mx512m] -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -estc -dir <string> -model <string> [-niters <int>] [-savestep <int>] [-twords <int>]
The meanings of the individual parameters can be found on the official website and are not repeated here.

Inferring topics for a new corpus with an existing LDA model:

$ java [-mx512m] -cp bin:lib/args4j-2.0.6.jar jgibblda.LDA -inf -dir <string> -model <string> [-niters <int>] [-twords <int>] -dfile <string>
3.2 Code Invocation

First, train the model:

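The original code listing was lost when this page was extracted. As a sketch of what a programmatic call looks like, the snippet below uses JGibbLDA's LDACmdOption, Estimator, and Inferencer classes as they appear in the v1.0 source; the field and method names should be verified against your copy, and the directory, corpus, and model names are placeholders.

```java
import jgibblda.Estimator;
import jgibblda.Inferencer;
import jgibblda.LDACmdOption;
import jgibblda.Model;

// Sketch: train a model (-est), then infer topics for new documents (-inf).
// Requires JGibbLDA's classes and args4j on the classpath.
public class LdaDemo {
    public static void main(String[] args) {
        // --- training, equivalent to the -est command line ---
        LDACmdOption option = new LDACmdOption();
        option.est = true;                // estimate a new model
        option.dir = "models/casestudy";  // directory holding the corpus (placeholder)
        option.dfile = "newdocs.dat";     // corpus file name (placeholder)
        option.K = 100;                   // number of topics
        option.alpha = 0.5;               // 50 / K
        option.beta = 0.1;
        option.niters = 1000;
        option.savestep = 200;            // save the model every 200 iterations
        option.twords = 20;               // top words printed per topic

        Estimator estimator = new Estimator();
        estimator.init(option);
        estimator.estimate();

        // --- inference, equivalent to the -inf command line ---
        LDACmdOption infOption = new LDACmdOption();
        infOption.inf = true;
        infOption.dir = "models/casestudy";
        infOption.modelName = "model-final"; // model saved by the estimator
        infOption.dfile = "newdocs.dat";     // new documents to infer (placeholder)
        infOption.niters = 30;

        Inferencer inferencer = new Inferencer();
        inferencer.init(infOption);
        Model newModel = inferencer.inference();
    }
}
```

This mirrors the command-line options one-to-one, so the parameter notes in section 3.1 apply unchanged.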
