LingPipe-based text tendency analysis: LingPipe learning notes

Text Tendency Analysis

Text tendency analysis (sentiment analysis) classifies users' opinions as "positive" or "negative", and sometimes also "neutral". An intuitive application is tracking users' opinions and preferences about something, for example analyzing the reviews of a movie on Douban. This is why sentiment analysis is also called opinion mining.

LingPipe

LingPipe is a natural language processing toolkit developed by Alias-i. It consists of more than 10 modules, including topic classification, sentence detection, and character language modeling. The documentation is thorough, and the algorithms are described well enough to serve as a reference. Even better, it supports Chinese.

Official site: http://alias-i.com/lingpipe/

Download: http://alias-i.com/lingpipe/web/download.html

The download comes in two major parts: the core LingPipe library and the LingPipe models. To support Chinese, also download the Chinese word segmentation model.

Prepare a corpus

In linguistics, a corpus is a large collection of texts, usually curated and given a fixed format and annotations.

An important step in sentiment analysis is usually collecting opinions and organizing them, and how this is done depends on the application. For convenience, a ready-made corpus is used here: the polarity dataset v2.0 from the Movie Review Data collection, which contains 1000 positive reviews and 1000 negative reviews. Note that this corpus is in English.
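Before training, it can help to confirm that the corpus was unpacked as expected. The following is a minimal sketch, assuming the layout used later in these notes: a txt_sentoken directory with one subdirectory per category. The class name CorpusLayoutSketch and the method countByCategory are hypothetical helpers, not part of LingPipe.

```java
import java.io.File;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for checking the corpus layout; not part of LingPipe.
class CorpusLayoutSketch {

    // Returns category name -> number of regular files in that category's directory.
    static Map<String, Integer> countByCategory(File corpusDir) {
        File[] dirs = corpusDir.listFiles(File::isDirectory);
        if (dirs == null) {
            throw new IllegalArgumentException("Not a directory: " + corpusDir);
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (File dir : dirs) {
            File[] files = dir.listFiles(File::isFile);
            counts.put(dir.getName(), files == null ? 0 : files.length);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumption: the first argument is the directory that contains txt_sentoken.
        File corpusDir = new File(args.length > 0 ? args[0] : ".", "txt_sentoken");
        if (corpusDir.isDirectory()) {
            System.out.println(countByCategory(corpusDir));
        } else {
            System.out.println("Corpus directory not found: " + corpusDir);
        }
    }
}
```

For the dataset described above, each category should report 1000 files. Filtering on directories also avoids treating stray files in txt_sentoken as categories.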

Basic Polarity Analysis

Basic polarity is the overall tendency of opinion about something, judged from a sample. For example, "users' sentiment toward this book is positive" is a basic polarity assertion.

For basic polarity analysis, use LingPipe's DynamicLMClassifier.

There are two steps: training first, then analysis.

Create a class named PolarityBasic. Its constructor locates the corpus directory, reads the category names, and creates the classifier.

 
public PolarityBasic(String basePath) {
    pdir = new File(basePath, "txt_sentoken"); // directory holding the corpus
    categories = pdir.list(); // one subdirectory name per category
    int nGram = 4; // maximum character n-gram length
    classifer = DynamicLMClassifier.createNGramProcess(categories, nGram); // create the dynamic classifier
}
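To give a feel for what a character n-gram language model classifier does, here is a toy sketch in plain Java. It is not LingPipe's implementation: the class CharNGramSketch, its method names, the add-one smoothing, and the assumed alphabet size of 256 are all illustrative choices. The idea is to keep n-gram counts per category and pick the category under which the text is most probable.

```java
import java.util.HashMap;
import java.util.Map;

// Toy character n-gram classifier; an illustration only, not LingPipe's model.
class CharNGramSketch {
    private static final int ALPHABET = 256; // assumed alphabet size for add-one smoothing
    private final int n;
    // category -> counts of n-grams ("g:" keys) and their (n-1)-char contexts ("c:" keys)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    CharNGramSketch(String[] categories, int n) {
        this.n = n;
        for (String category : categories) {
            counts.put(category, new HashMap<>());
        }
    }

    // Training: count every n-gram of the text under the given category.
    void handle(String text, String category) {
        Map<String, Integer> m = counts.get(category);
        if (m == null) {
            throw new IllegalArgumentException("Unknown category: " + category);
        }
        String padded = pad(text);
        for (int i = 0; i + n <= padded.length(); ++i) {
            m.merge("g:" + padded.substring(i, i + n), 1, Integer::sum);
            m.merge("c:" + padded.substring(i, i + n - 1), 1, Integer::sum);
        }
    }

    // Log probability of the text under one category's smoothed n-gram model.
    double logProb(String text, String category) {
        Map<String, Integer> m = counts.get(category);
        String padded = pad(text);
        double sum = 0.0;
        for (int i = 0; i + n <= padded.length(); ++i) {
            int gram = m.getOrDefault("g:" + padded.substring(i, i + n), 0);
            int context = m.getOrDefault("c:" + padded.substring(i, i + n - 1), 0);
            sum += Math.log((gram + 1.0) / (context + ALPHABET));
        }
        return sum;
    }

    // Classification: the category whose model gives the text the highest score.
    String bestCategory(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : counts.keySet()) {
            double score = logProb(text, category);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    // Prefix n-1 spaces so the first characters also have a context.
    private String pad(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n - 1; ++i) {
            sb.append(' ');
        }
        return sb.append(text).toString();
    }

    public static void main(String[] args) {
        CharNGramSketch sketch = new CharNGramSketch(new String[] {"pos", "neg"}, 4);
        sketch.handle("great wonderful great fun", "pos");
        sketch.handle("awful boring awful dull", "neg");
        System.out.println(sketch.bestCategory("great fun"));
    }
}
```

A real classifier smooths far more carefully than add-one counting, so this sketch should only be read as an explanation of the principle behind the nGram parameter.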

Next, the training step.

public void train() throws IOException {
    for (int i = 0; i < categories.length; ++i) {
        String category = categories[i];
        Classification classification = new Classification(category); // label for this category
        File dir = new File(pdir, categories[i]);
        File[] trainFiles = dir.listFiles();
        for (int j = 0; j < trainFiles.length; ++j) {
            File trainFile = trainFiles[j];
            if (isTrainingFile(trainFile)) { // part of the data is held out as the test set
                // Files here is com.aliasi.util.Files
                String review = Files.readFromFile(trainFile, "ISO-8859-1");
                Classified<CharSequence> classified = new Classified<CharSequence>(review, classification); // pair the text with its category
                classifer.handle(classified); // train on this example
            }
        }
    }
}

The isTrainingFile method deserves an explanation. Both a training set and a test set are needed, but there is only one corpus, so it has to be split by hand. I first picked files with a random number on every run, but that slowed things down. Instead, the file name is used as the criterion: the character at index 2 of the name (its third character) decides which set a file belongs to.

 
boolean isTrainingFile(File file) {
    return file.getName().charAt(2) != '1'; // a '1' at index 2 marks the file as part of the test set
}
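The effect of this rule is easiest to see on a few file names. The sketch below assumes names of the form cvNNN_MMMMM.txt, so that index 2 holds the first digit after the "cv" prefix; the names used here are hypothetical stand-ins, and the class FoldSplitSketch is only an illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration of the file-name rule; the file names are hypothetical stand-ins.
class FoldSplitSketch {

    // Same rule as isTrainingFile, applied to a plain file name.
    static boolean isTrainingName(String fileName) {
        return fileName.charAt(2) != '1';
    }

    // Collects the names that the rule assigns to the test set.
    static List<String> testNames(String[] names) {
        List<String> test = new ArrayList<>();
        for (String name : names) {
            if (!isTrainingName(name)) {
                test.add(name);
            }
        }
        return test;
    }

    public static void main(String[] args) {
        String[] names = {
            "cv000_00000.txt", "cv100_00000.txt", "cv199_00000.txt",
            "cv210_00000.txt", "cv911_00000.txt"
        };
        System.out.println("test set: " + testNames(names));
    }
}
```

Only the character at index 2 matters: cv210 and cv911 contain a '1' further along in the name but still count as training files.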

After training, use the trained classifier to perform polarity analysis on the held-out files.

public void evaluate() throws IOException {
    int numTests = 0;
    int numCorrect = 0;
    for (int i = 0; i < categories.length; ++i) {
        String category = categories[i];
        File file = new File(pdir, categories[i]);
        File[] testFiles = file.listFiles();
        for (int j = 0; j < testFiles.length; ++j) {
            File testFile = testFiles[j];
            if (!isTrainingFile(testFile)) { // only files held out from training
                String review = Files.readFromFile(testFile, "ISO-8859-1");
                ++numTests;
                Classification classification = classifer.classify(review);
                String resultCategory = classification.bestCategory();
                if (resultCategory.equals(category))
                    ++numCorrect;
            }
        }
    }
    System.out.println("Total number of tests: " + numTests);
    System.out.println("Correct count: " + numCorrect);
    System.out.println("Accuracy: " + ((double) numCorrect / (double) numTests));
}
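The accuracy arithmetic at the end of evaluate can be isolated into a small helper. This is a minimal sketch with a hypothetical class AccuracySketch that compares expected and predicted labels; it does not involve LingPipe. The cast to double matters, because dividing two int values in Java would truncate the result to 0 for anything below 100% accuracy.

```java
// Hypothetical helper that mirrors the accuracy computation in evaluate().
class AccuracySketch {

    // Fraction of positions where the predicted label equals the expected one.
    static double accuracy(String[] expected, String[] predicted) {
        if (expected.length == 0 || expected.length != predicted.length) {
            throw new IllegalArgumentException("Label arrays must be non-empty and of equal length");
        }
        int numCorrect = 0;
        for (int i = 0; i < expected.length; ++i) {
            if (expected[i].equals(predicted[i])) {
                ++numCorrect;
            }
        }
        return (double) numCorrect / expected.length; // cast avoids integer division
    }

    public static void main(String[] args) {
        String[] expected = {"pos", "pos", "neg", "neg"};
        String[] predicted = {"pos", "neg", "neg", "neg"};
        System.out.println("Accuracy: " + accuracy(expected, predicted));
    }
}
```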

Result (the output screenshot from the original post is not reproduced here):

Now modify isTrainingFile so that a different part of the corpus is held out.

boolean isTrainingFile(File file) {
    return file.getName().charAt(2) != '2'; // a '2' at index 2 marks the file as part of the test set
}

In terms of accuracy, the way the corpus is split into a training set and a test set makes little difference.

The corpus can also be split like this, holding out both groups as the test set:

boolean isTrainingFile(File file) {
    return (file.getName().charAt(2) != '2') && (file.getName().charAt(2) != '1');
}
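The three variants above differ only in which digits are held out, so the rule can be parameterized. The sketch below is one possible way to do that, assuming 1000 files per category named cv000 to cv999 (hypothetical stand-in names); the class HeldOutFoldsSketch and the testFolds parameter are illustrative and do not appear in the original code.

```java
import java.util.Locale;

// Illustrative generalization of isTrainingFile; names and parameters are hypothetical.
class HeldOutFoldsSketch {

    // testFolds lists the digits held out for testing, for example "1" or "12".
    static boolean isTrainingName(String fileName, String testFolds) {
        return testFolds.indexOf(fileName.charAt(2)) < 0;
    }

    // Counts how many of the names cv000..cv999 remain in the training set.
    static int countTraining(String testFolds) {
        int train = 0;
        for (int k = 0; k < 1000; ++k) {
            String name = String.format(Locale.ROOT, "cv%03d_00000.txt", k);
            if (isTrainingName(name, testFolds)) {
                ++train;
            }
        }
        return train;
    }

    public static void main(String[] args) {
        for (String folds : new String[] {"1", "2", "12"}) {
            int train = countTraining(folds);
            System.out.println("held out " + folds + ": train=" + train + " test=" + (1000 - train));
        }
    }
}
```

Under these assumptions, holding out one digit leaves 900 training files per category, and holding out two leaves 800, so the second kind of split trades training data for a larger test set.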
Extension

Basic polarity analysis is only the simplest part of text tendency analysis. For deeper analysis, LingPipe also supports subjectivity analysis and hierarchical polarity analysis.

To support Chinese, download the words-zh-as.CompiledSpellChecker model.

Finally, three references:

    • Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. EMNLP Proceedings.
    • Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. ACL Proceedings.
    • Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. ACL Proceedings.
