VSM (vector space model) for text classification and a simple implementation

Source: Internet
Author: User

1: Whatever method is used for text classification, we first need to build a mathematical model of the text. Here the vector space model (VSM) is used. The model is based on the features of the text: if a text has 10 features (typically each feature is a keyword that represents the text), the text vector has 10 dimensions, and each component is the weight of the corresponding feature. (There are many ways to compute the weights; here word frequency is used.) To classify a test text, build its vector over the same features and compare it with the feature vector of each sample. The comparison measures the angle between the two vectors, expressed as its cosine: a large angle (small cosine) means the text is far from the sample, and a small angle (cosine close to 1) means it is close. (Angles greater than 90° are not considered here; with non-negative word-frequency weights they cannot occur.)
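The comparison described above can be sketched in a few lines of Java. This is a minimal illustration, not the author's code: the class name `CosineSketch`, its helper methods, the three feature keywords, and the test sentence are all assumptions made up for the example. It builds a word-frequency vector for a text and computes its cosine against a sample vector.

```java
import java.util.Arrays;
import java.util.List;

public class CosineSketch {

    // Count non-overlapping occurrences of a keyword in the text.
    static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(keyword, from)) != -1) {
            count++;
            from += keyword.length();
        }
        return count;
    }

    // Build the word-frequency vector of the text over the feature keywords.
    static double[] toVector(String text, List<String> features) {
        double[] v = new double[features.size()];
        for (int i = 0; i < v.length; i++) {
            v[i] = countOccurrences(text, features.get(i));
        }
        return v;
    }

    // Cosine of the angle between two vectors of equal length.
    // Returns 0 when either vector is all zeros (the angle is undefined).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical features and text, chosen only for illustration.
        List<String> features = Arrays.asList("stock", "market", "price");
        double[] sample = {1, 1, 1}; // every feature weighted 1 in the sample
        double[] text = toVector("stock price rises as stock market opens", features);
        System.out.println(cosine(sample, text));
    }
}
```

A value close to 1 means the text vector points in nearly the same direction as the sample vector; a value of 0 means none of the sample's features occur in the text.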

2: This example is preparation for my later work with SVM and serves as an introduction to this kind of method. I think the quality of the results depends heavily on the features of the input samples. Weights could also be set separately for different sub-categories within one category, such as stocks.

3: The Java source code is as follows:

package com.baseframework.sort;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Vector;

public class VsmMain {

    public static void main(String[] args) {
        VsmMain vsm = new VsmMain();
        // substring(6) strips the leading "file:/" from the classpath URL;
        // this assumes a Windows-style URL such as file:/C:/...
        String basePath = vsm.getClass().getClassLoader().getResource("").toString().substring(6);
        String content = vsm.getContent(basePath + "article.txt");
        Vector<Vector<String>> samples = vsm.loadSample(basePath + "sort.txt");
        vsm.samilarity(content, samples);
    }

    /**
     * Calculate the cosine between the document and each sample.
     *
     * @param content text of the document to classify
     * @param samples one list of feature words per category
     */
    public void samilarity(String content, Vector<Vector<String>> samples) {
        for (int i = 0; i < samples.size(); i++) {
            Vector<String> single = samples.get(i);
            // Number of times each sample word occurs in the text being compared
            Vector<Integer> wordCount = new Vector<Integer>();
            for (int j = 0; j < single.size(); j++) {
                String word = single.get(j);
                int count = getCharInStringCount(content, word);
                wordCount.add(j, count);
            }
            // Accumulate the terms of the cosine
            int sampleLength = 0; // squared norm of the sample vector
            int textLength = 0;   // squared norm of the text vector
            int totalLength = 0;  // dot product of the two vectors
            for (int j = 0; j < single.size(); j++) {
                // Every component of the sample vector is 1
                sampleLength += 1;
                textLength += wordCount.get(j) * wordCount.get(j);
                totalLength += 1 * wordCount.get(j);
            }
            // Divide the dot product by the product of the norms (square roots)
            double value = 0.00;
            if (sampleLength > 0 && textLength > 0) {
                value = (double) totalLength / (Math.sqrt(sampleLength) * Math.sqrt(textLength));
            }
            System.out.println(single.get(0) + "," + sampleLength + "," + textLength + ","
                    + totalLength + "," + value);
        }
    }

    /**
     * Calculate the number of times a word appears in the content.
     *
     * @param content text to search
     * @param word    word to count
     * @return number of occurrences
     */
    public int getCharInStringCount(String content, String word) {
        if (word.isEmpty()) {
            return 0; // avoid division by zero on empty entries
        }
        // replace() treats the word literally; replaceAll() would interpret it as a regex
        String str = content.replace(word, "");
        return (content.length() - str.length()) / word.length();
    }

    /**
     * Load the samples: one comma-separated list of words per line.
     *
     * @param path path of the sample file
     * @return list of samples
     */
    public Vector<Vector<String>> loadSample(String path) {
        Vector<Vector<String>> vector = new Vector<Vector<String>>();
        try (BufferedReader bufferReader = new BufferedReader(new FileReader(new File(path)))) {
            String hasRead = "";
            while ((hasRead = bufferReader.readLine()) != null) {
                String[] info = hasRead.split(",");
                Vector<String> single = new Vector<String>();
                for (int i = 0; i < info.length; i++) {
                    single.add(info[i]);
                }
                vector.add(single);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return vector;
    }

    /**
     * Read the content of the file at the given path.
     *
     * @param path path of the file
     * @return file content with line breaks removed
     */
    public String getContent(String path) {
        StringBuffer buffer = new StringBuffer();
        try (BufferedReader bufferReader = new BufferedReader(new FileReader(new File(path)))) {
            String hasRead = "";
            while ((hasRead = bufferReader.readLine()) != null) {
                buffer.append(hasRead);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return buffer.toString();
    }
}

In this example, sort.txt is a manually maintained file of category features: each line describes one category, with its features separated by commas. article.txt is the text to be tested.
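Because every feature in a sample is given weight 1, the cosine computed in the listing above takes a simple form. Writing $c_j$ for the number of times the $j$-th feature occurs in the test text and $n$ for the number of features in the sample, a worked example looks like this (the counts $c = (2, 0, 1)$ are hypothetical, chosen only to show the arithmetic):

```latex
\cos\theta
  = \frac{\sum_{j=1}^{n} 1 \cdot c_j}
         {\sqrt{\sum_{j=1}^{n} 1^2}\,\sqrt{\sum_{j=1}^{n} c_j^2}}
  = \frac{\sum_{j} c_j}{\sqrt{n}\,\sqrt{\sum_{j} c_j^2}}

% Hypothetical counts c = (2, 0, 1), so n = 3:
\cos\theta
  = \frac{2 + 0 + 1}{\sqrt{3}\,\sqrt{2^2 + 0^2 + 1^2}}
  = \frac{3}{\sqrt{15}}
  \approx 0.775
```

For these counts the three intermediate quantities are 3 (squared norm of the sample), 5 (squared norm of the text), and 3 (dot product), which correspond to the three integer columns the program prints before the cosine.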

That covers the basics. The results seem acceptable. Next, I will implement SVM and extract the sample features automatically.
