VSM vector space model for text classification and simple implementation

Last Update:2018-12-04 Source: Internet

Author: User


1: Whatever advanced method is used for text classification, a mathematical model has to be established first. Here the vector space model (VSM) is used. It represents a text by its features: if a text has 10 features (in general, each feature is a keyword that characterizes the text), the text vector has 10 dimensions, and each component is the weight of one feature. (There are many ways to compute the weight; here word frequency is used.) The test text is then read and turned into a vector over the same features as each sample, and the angle between the two vectors is computed and expressed as its cosine. A large angle (small cosine) means the test text is far from the sample; a small angle (cosine close to 1) means it is close. (Angles greater than 90° are not considered here; with non-negative word-frequency weights they cannot occur.)
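The cosine comparison described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name CosineSketch, the method name cosine, and the example vectors are hypothetical and are not taken from the article's source code.

```java
final class CosineSketch {

    /**
     * Cosine of the angle between two weight vectors of equal length.
     * Returns 0 when either vector is all zeros, since the angle is undefined.
     */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double dot = 0.0;   // sum of a[i] * b[i]
        double normA = 0.0; // sum of a[i]^2
        double normB = 0.0; // sum of b[i]^2
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical data: a sample with three features of weight 1,
        // and a test text in which those features occur 2, 0 and 1 times.
        double[] sample = {1, 1, 1};
        double[] text = {2, 0, 1};
        System.out.println(cosine(sample, text));
    }
}
```

A value near 1 means the test text points in almost the same direction as the sample; a value near 0 means they share almost no features.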

2: This example is preparation for my later work with SVM and serves as an introduction to this kind of classification. I think the quality of the result depends heavily on the sample features that are supplied. Weights could also be set separately for different categories of the same type, such as stocks.

3: The Java source code is as follows:

package com.baseframework.sort;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.Vector;

public class VsmMain {

    public static void main(String[] args) {
        VsmMain vsm = new VsmMain();
        // Strip the leading "file:/" from the classpath URL
        // (this assumes a Windows-style URL such as file:/C:/...)
        String basePath = vsm.getClass().getClassLoader().getResource("").toString().substring(6);
        String content = vsm.getContent(basePath + "article.txt");
        Vector<Vector<String>> samples = vsm.loadSample(basePath + "sort.txt");
        vsm.similarity(content, samples);
    }

    /**
     * Compute and print the cosine between the document and each sample.
     *
     * @param content text of the document to classify
     * @param samples feature words, one inner vector per sample
     */
    public void similarity(String content, Vector<Vector<String>> samples) {
        for (int i = 0; i < samples.size(); i++) {
            Vector<String> single = samples.get(i);
            // Number of times each word of the sample occurs in the text being compared
            Vector<Integer> wordCount = new Vector<Integer>();
            for (int j = 0; j < single.size(); j++) {
                String word = single.get(j);
                int count = getCharInStringCount(content, word);
                wordCount.add(j, count);
            }
            // Accumulate the terms of the cosine
            int sampleLength = 0;
            int textLength = 0;
            int totalLength = 0;
            for (int j = 0; j < single.size(); j++) {
                // Every component of the sample vector has the value 1
                sampleLength += 1;
                textLength += wordCount.get(j) * wordCount.get(j);
                totalLength += 1 * wordCount.get(j);
            }
            // Take the square roots and divide
            double value = 0.00;
            if (sampleLength > 0 && textLength > 0) {
                value = (double) totalLength / (Math.sqrt(sampleLength) * Math.sqrt(textLength));
            }
            System.out.println(single.get(0) + "," + sampleLength + "," + textLength + ","
                    + totalLength + "," + value);
        }
    }

    /**
     * Count the number of times a word appears in the content.
     *
     * @param content text to search
     * @param word    word to count
     * @return number of non-overlapping occurrences
     */
    public int getCharInStringCount(String content, String word) {
        if (word.isEmpty()) {
            return 0; // avoid dividing by zero for empty fields
        }
        // replace() treats the word literally; replaceAll() would read it as a regex
        String str = content.replace(word, "");
        return (content.length() - str.length()) / word.length();
    }

    /**
     * Load the samples: one sample per line, fields separated by commas.
     *
     * @param path path of the sample file
     * @return the fields of each line
     */
    public Vector<Vector<String>> loadSample(String path) {
        Vector<Vector<String>> vector = new Vector<Vector<String>>();
        try (BufferedReader bufferReader = new BufferedReader(new FileReader(new File(path)))) {
            String hasRead;
            while ((hasRead = bufferReader.readLine()) != null) {
                String[] info = hasRead.split(",");
                Vector<String> single = new Vector<String>();
                for (int i = 0; i < info.length; i++) {
                    single.add(info[i]);
                }
                vector.add(single);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return vector;
    }

    /**
     * Read the content of the file at the given path.
     *
     * @param path path of the file
     * @return the file content with line breaks removed
     */
    public String getContent(String path) {
        StringBuffer buffer = new StringBuffer();
        try (BufferedReader bufferReader = new BufferedReader(new FileReader(new File(path)))) {
            String hasRead;
            while ((hasRead = bufferReader.readLine()) != null) {
                buffer.append(hasRead);
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return buffer.toString();
    }
}
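Because every component of the sample vector is fixed at 1, the cosine that the code prints reduces to a simple expression. The symbols here are notation introduced for this note, not names from the source: n is the number of features in a sample and c_j is the number of times feature j occurs in the test text.

```latex
\cos\theta
  = \frac{\sum_{j=1}^{n} 1 \cdot c_j}
         {\sqrt{\sum_{j=1}^{n} 1^2}\;\sqrt{\sum_{j=1}^{n} c_j^2}}
  = \frac{\sum_{j=1}^{n} c_j}
         {\sqrt{n}\;\sqrt{\sum_{j=1}^{n} c_j^2}}
```

In terms of the three accumulators in the code, n is the sample length, the sum of c_j^2 is the text length, and the sum of c_j is the total length.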

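In this example, sort.txt is a manually maintained file of class features: each line describes one class, with its features separated by commas, and the first field of the line is also printed as the label of the result. article.txt is the text I entered to be tested.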
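To make the input format concrete, the sketch below splits one feature line on commas and counts how often each feature occurs in a piece of text. The feature line and the article text are invented for illustration and are not the author's actual files; the class and method names (SampleLineSketch, featureCounts, countOccurrences) are likewise hypothetical.

```java
import java.util.Arrays;

final class SampleLineSketch {

    /** Counts non-overlapping occurrences of word in content; 0 for an empty word. */
    static int countOccurrences(String content, String word) {
        if (word.isEmpty()) {
            return 0;
        }
        int count = 0;
        int from = 0;
        while ((from = content.indexOf(word, from)) != -1) {
            count++;
            from += word.length(); // continue after this match
        }
        return count;
    }

    /** Splits a comma-separated feature line and counts each feature in content. */
    static int[] featureCounts(String sampleLine, String content) {
        String[] features = sampleLine.split(",");
        int[] counts = new int[features.length];
        for (int i = 0; i < features.length; i++) {
            counts[i] = countOccurrences(content, features[i].trim());
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical line of a feature file and hypothetical article text
        String sampleLine = "stock,share,market,dividend";
        String article = "The stock market rose as the stock paid a dividend.";
        System.out.println(Arrays.toString(featureCounts(sampleLine, article)));
    }
}
```

The resulting counts are the word-frequency weights of the test text, one per feature of the sample.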

That is a start. The results seem acceptable. Next, I will implement SVM and generate the sample features automatically.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
