The business scenario: when a customer handles a transaction, they must submit a list of materials, and those materials are stored in a material library. The next time the customer comes in, entering their ID card number loads their materials from the library, matched by the similarity of the material names, so nothing has to be uploaded manually again. (You first need to download the supporting jar, IKAnalyzer2012FF_u1.jar.)
1. The following is the core algorithm for comparing two segmented word lists
package com.ikanalyzer;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Vector;

/**
 * Description: similarity as a percentage
 * @author Administrator
 * @date 2015-1-22 1:20:34 PM
 * @version 1.0
 */
public class IKAnalyzerUtil {

    // Threshold: the score given to a word that is missing from one of the lists
    public static double YUZHI = 0.2;

    /**
     * Returns the similarity percentage
     * @author Administrator
     * @date January 22, 2015
     * @param T1 words of the first string
     * @param T2 words of the second string
     * @return similarity of the two word lists
     */
    public static double getSimilarity(Vector<String> T1, Vector<String> T2) throws Exception {
        int size = 0, size2 = 0;
        if (T1 != null && (size = T1.size()) > 0 && T2 != null && (size2 = T2.size()) > 0) {
            // T is the union of T1 and T2
            Map<String, double[]> T = new HashMap<String, double[]>();
            String index = null;
            for (int i = 0; i < size; i++) {
                index = T1.get(i);
                if (index != null) {
                    double[] c = new double[2];
                    c[0] = 1;     // semantic score of the word in T1
                    c[1] = YUZHI; // semantic score of the word in T2
                    T.put(index, c);
                }
            }
            for (int i = 0; i < size2; i++) {
                index = T2.get(i);
                if (index != null) {
                    double[] c = T.get(index);
                    if (c != null && c.length == 2) {
                        c[1] = 1; // the word from T1 also exists in T2, so its score is 1
                    } else {
                        c = new double[2];
                        c[0] = YUZHI; // semantic score of the word in T1
                        c[1] = 1;     // semantic score of the word in T2
                        T.put(index, c);
                    }
                }
            }
            // Start the calculation
            Iterator<String> it = T.keySet().iterator();
            double s1 = 0, s2 = 0, Ssum = 0;
            while (it.hasNext()) {
                double[] c = T.get(it.next());
                Ssum += c[0] * c[1];
                s1 += c[0] * c[0];
                s2 += c[1] * c[1];
            }
            // Percentage
            return Ssum / Math.sqrt(s1 * s2);
        } else {
            throw new Exception("There is a problem with the incoming parameter!");
        }
    }
}
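To make the scoring concrete, here is a minimal standalone sketch of the same calculation: each word in the union of the two lists gets a pair of scores (1 if the list contains it, the 0.2 threshold if it does not), and the result is the cosine of the two score vectors. The class name `CosineSketch` and the sample words are made up for illustration, and unlike the listing above the sketch assumes non-empty lists with no null entries.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CosineSketch {
    // Same value as YUZHI in the article's listing.
    static final double THRESHOLD = 0.2;

    /** Cosine of the two score vectors built over the union of both word lists. */
    static double similarity(List<String> t1, List<String> t2) {
        Map<String, double[]> scores = new HashMap<>();
        for (String word : t1) {
            // Present in t1; assumed absent from t2 until seen there.
            scores.put(word, new double[] {1.0, THRESHOLD});
        }
        for (String word : t2) {
            double[] c = scores.get(word);
            if (c != null) {
                c[1] = 1.0; // word occurs in both lists
            } else {
                scores.put(word, new double[] {THRESHOLD, 1.0});
            }
        }
        double dot = 0, norm1 = 0, norm2 = 0;
        for (double[] c : scores.values()) {
            dot += c[0] * c[1];
            norm1 += c[0] * c[0];
            norm2 += c[1] * c[1];
        }
        return dot / Math.sqrt(norm1 * norm2);
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("identity", "proof");
        List<String> b = Arrays.asList("identity", "proof", "copy");
        System.out.println(similarity(a, a)); // identical lists
        System.out.println(similarity(a, b)); // one extra word in b
        System.out.println(similarity(Arrays.asList("x"), Arrays.asList("y"))); // no shared word
    }
}
```

One consequence of the threshold is that the score never drops to 0: two single-word lists with nothing in common still score 0.4 / 1.04, about 0.38, so any cutoff for "similar enough" has to sit above that floor.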
2. The following calls the word segmenter and returns the similarity of two strings
package com.ikanalyzer;

import java.io.IOException;
import java.io.StringReader;
import java.util.Vector;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

public class CheckTheSame {

    /**
     * Word segmentation
     * @author Administrator
     * @date March 5, 2016 15:10:47
     * @param str the string to segment
     * @return the words of str, or null if segmentation produced none
     */
    public static Vector<String> participle(String str) {
        Vector<String> str1 = new Vector<String>(); // holds the segmented input
        try {
            StringReader reader = new StringReader(str);
            // When the second argument is true, the segmenter uses smart segmentation
            IKSegmenter ik = new IKSegmenter(reader, false);
            Lexeme lexeme = null;
            while ((lexeme = ik.next()) != null) {
                str1.add(lexeme.getLexemeText());
            }
            if (str1.size() == 0) {
                return null;
            }
            // System.out.println("str after segmentation: " + str1);
        } catch (IOException e1) {
            // System.out.println();
        }
        return str1;
    }

    /**
     * Returns the similarity of the two strings compared
     * @param strOne first string
     * @param strTwo second string
     * @return the similarity as a string
     */
    public String getSemblance(String strOne, String strTwo) {
        String semblanceString = "0.0000";
        // Segment both strings
        Vector<String> strs1 = participle(strOne);
        Vector<String> strs2 = participle(strTwo);
        // Compute the similarity from the segmented words
        double same = 0;
        try {
            same = IKAnalyzerUtil.getSimilarity(strs1, strs2);
        } catch (Exception e) {
            // System.out.println(e.getMessage());
        }
        semblanceString = String.valueOf(same);
        // System.out.println("Similarity: " + same);
        return semblanceString;
    }

    public static void main(String[] args) {
        // Segment both strings
        Vector<String> strs1 = participle("Proof of Identity");
        Vector<String> strs2 = participle("Copy of personal identification certificate");
        // Compute the similarity from the segmented words
        double same = 0;
        try {
            same = IKAnalyzerUtil.getSimilarity(strs1, strs2);
        } catch (Exception e) {
            System.out.println(e.getMessage());
        }
        System.out.println("Similarity: " + same);
    }
}
The above is the specific implementation.
IKAnalyzer also has many other algorithms for similarity matching; I will set those aside for now and study them later.
In short, word segmentation is used to match two strings and produce a similarity ratio.
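For the material-library scenario described at the start, the similarity ratio still has to be turned into a decision about which stored material a submitted name refers to. The sketch below shows one possible way to do that: score the name against every library entry and keep the best one above a cutoff. Everything in it is an assumption for illustration: the class `MaterialMatcher`, the `bestMatch` helper, the sample library, and the 0.5 cutoff are not part of the original code. The `wordOverlap` scorer is a plain word-overlap (Jaccard) placeholder so that the example runs without the IKAnalyzer jar; in the real setup the scorer would be the segmentation-based similarity from step 2.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToDoubleBiFunction;

public class MaterialMatcher {

    /** Returns the library entry with the highest score, or null if none reaches minScore. */
    static String bestMatch(String name, List<String> library,
                            ToDoubleBiFunction<String, String> scorer, double minScore) {
        String best = null;
        double bestScore = minScore;
        for (String candidate : library) {
            double s = scorer.applyAsDouble(name, candidate);
            if (s >= bestScore) {
                best = candidate;
                bestScore = s;
            }
        }
        return best;
    }

    /** Placeholder scorer (assumption): Jaccard overlap of lowercase, whitespace-separated words. */
    static double wordOverlap(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> common = new HashSet<>(wordsA);
        common.retainAll(wordsB);
        Set<String> all = new HashSet<>(wordsA);
        all.addAll(wordsB);
        return (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        // Hypothetical material names already stored for a customer.
        List<String> library = Arrays.asList(
                "Household register",
                "Copy of identity certificate",
                "Marriage certificate");
        // Prints the stored name that best matches the submitted one.
        System.out.println(bestMatch("Identity certificate", library,
                MaterialMatcher::wordOverlap, 0.5));
    }
}
```

Passing the scorer as a parameter keeps the lookup independent of how similarity is computed, so the placeholder can be swapped for the IKAnalyzer-based score without touching `bestMatch`. The cutoff would need to be tuned for that scorer.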