By: finallyliuyu
The specific implementation is as follows:
1. First, create the class MyEWordEntity in SnowballAnalyzer.cs. This class can be seen as the interface to Snowball.cs: the main program's ultimate goal in calling Snowball.cs is to obtain such an "entity" for each word.
// Vocabulary entity
public class MyEWordEntity
{
    public string TxtWord;     // text of the word
    public string StemRoot;    // stemmed root of the word
    public string PosWord;     // part of speech of the word
    public int token_begin;    // start position in the article
    public int token_end;      // end position in the article

    public MyEWordEntity()
    {
        TxtWord = string.Empty;
        PosWord = string.Empty;
        StemRoot = string.Empty;
        token_begin = 0;
        token_end = 0;
    }
}
2. Create a class Stemmer in SnowballAnalyzer.cs. This class restores a word to its root (for the stemming code itself, see the link in Section 2).
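The stemming code itself is at the link above; the rest of this article only relies on the Stemmer class exposing roughly the following surface (method names are taken from the calls in step 5 below; the bodies here are placeholders, not the real Porter implementation):

```csharp
// Sketch of the Stemmer surface assumed by AfterStemmed (step 5).
// The actual Porter-stemming logic lives in the linked code.
public class Stemmer
{
    private char[] buffer;
    private int length;

    // Load the word's characters into the stemmer's working buffer.
    public void Add(char[] word, int wordLen)
    {
        buffer = word;
        length = wordLen;
    }

    // Run the Porter stemming steps over the buffer (elided here).
    public void Stem()
    {
        // ... Porter steps operate on buffer/length in the real code ...
    }

    // Return the stemmed word as a string.
    public string StemerToString()
    {
        return new string(buffer, 0, length);
    }
}
```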
3. In SnowballAnalyzer.cs, make the following changes to the class SnowballAnalyzer : Analyzer:
1. Fields:

private System.String name;
private System.Collections.Hashtable stopSet;   // stop-word set
private string mModelPath;                      // location of the part-of-speech tagging model files

2. Constructor:

/// <summary>Builds the named analyzer with no stop words.</summary>
public SnowballAnalyzer(System.String name)
{
    // Get the location of the part-of-speech tagging model files.
    // The model files are normally placed under this project's directory.
    mModelPath = System.IO.Path.GetDirectoryName(
        System.Reflection.Assembly.GetExecutingAssembly().GetName().CodeBase);
    mModelPath = new System.Uri(mModelPath).LocalPath + @"\models\";
    this.name = name;
}
/// <summary>Builds the named analyzer with the given stop words.</summary>
public SnowballAnalyzer(System.String name, System.String[] stopWords)
    : this(name)
{
    stopSet = StopFilter.MakeStopSet(stopWords);
}
3. Override the TokenStream method:

public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
{
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    if (stopSet != null)
        result = new StopFilter(result, stopSet);
    // Tokens are extracted from this TokenStream later to determine part of speech,
    // so the SnowballFilter step is disabled here:
    // result = new SnowballFilter(result, name);
    return result;
}
4. The main method of the modified class: it obtains each word and its position from the TokenStream, then tags its part of speech.

public List<MyEWordEntity> TokenStreamToEntityList(System.String fieldName, System.IO.TextReader reader)
{
    TokenStream result = TokenStream(fieldName, reader);
    // TokenStream result2 = TokenStream(fieldName, reader);
    List<MyEWordEntity> wordEnList = new List<MyEWordEntity>();
    while (true)
    {
        Token token = result.Next();
        if (token == null)
            break;
        MyEWordEntity entity = new MyEWordEntity();
        entity.token_begin = token.StartOffset();
        entity.token_end = token.EndOffset();
        entity.TxtWord = token.TermText();              // get the word text
        entity.StemRoot = AfterStemmed(entity.TxtWord);
        wordEnList.Add(entity);
    }
    ArrayList myPosList = new ArrayList();
    foreach (MyEWordEntity entity in wordEnList)
    {
        myPosList.Add(entity.TxtWord);
    }
    EnglishMaximumEntropyPosTagger mTagger = new EnglishMaximumEntropyPosTagger(
        mModelPath + "EnglishPOS.nbin", mModelPath + @"\Parser\tagdict");
    myPosList = mTagger.Tag(myPosList);
    for (int i = 0; i < myPosList.Count; i++)
    {
        wordEnList[i].PosWord = myPosList[i].ToString();
    }
    /* Alternative: restore the root of each word with SnowballFilter
    result2 = new SnowballFilter(result2, name);
    int k = 0;  // working index
    while (true)
    {
        Token token = result2.Next();
        if (token == null)
            break;
        wordEnList[k].StemRoot = token.TermText();
        k++;
    }
    */
    return wordEnList;
}
5. Root restoration:

public string AfterStemmed(string input)
{
    Stemmer s = new Stemmer();
    input = input.ToLower();
    char[] inputChar = input.ToCharArray();
    s.Add(inputChar, inputChar.Length);
    s.Stem();
    string u = s.StemerToString();
    return u;
}