2. NLPIR (ICTCLAS2016) word segmenter: adding the user dictionary feature

Source: Internet
Author: User

Note: Windows 7 64-bit, programmed in NetBeans.

For the basic code framework, see my other article on the NLPIR word segmentation function.

Code implementation:

    package cwordseg;

    import java.io.UnsupportedEncodingException;
    // import utils.SystemParas;
    import com.sun.jna.Library;
    import com.sun.jna.Native;

    /**
     * Function: add/remove user-defined words and dictionaries
     * Last updated: March 15, 2016 14:09:49
     */
    public class CWordSeg {

        public interface CLibrary extends Library {
            CLibrary Instance = (CLibrary) Native.loadLibrary("D:\\netbeansprojects\\cwordseg\\file\\win64\\nlpir", CLibrary.class);
            public int NLPIR_Init(String sDataPath, int encoding, String sLicenceCode);
            public String NLPIR_ParagraphProcess(String sSrc, int bPOSTagged);
            // Add a user word
            public int NLPIR_AddUserWord(String sWord);
            // Delete a user word
            public int NLPIR_DelUsrWord(String sWord);
            // Save the user words to the user dictionary
            public int NLPIR_SaveTheUsrDic();
            // Import a user-defined dictionary: sFileName is the dictionary path;
            // bOverwrite=true replaces the current custom dictionary, false appends to it
            public int NLPIR_ImportUserDict(String sFileName, boolean bOverwrite);
            public String NLPIR_GetLastErrorMsg();
            public void NLPIR_Exit();
        }

        public static String transString(String aidString, String ori_encoding, String new_encoding) {
            try {
                return new String(aidString.getBytes(ori_encoding), new_encoding);
            } catch (UnsupportedEncodingException e) {
                e.printStackTrace();
            }
            return null;
        }

        public static void main(String[] args) throws Exception {
            String argu = "D:\\netbeansprojects\\cwordseg\\file";
            // String system_charset = "UTF-8";
            int charset_type = 1;
            int init_flag = CLibrary.Instance.NLPIR_Init(argu, charset_type, "0");
            String nativeBytes;

            // Report initialization failure
            if (0 == init_flag) {
                nativeBytes = CLibrary.Instance.NLPIR_GetLastErrorMsg();
                System.err.println("Initialization failed! Reason: " + nativeBytes);
                return;
            }

            String sInput = "This is a book about information retrieval, the author is Nanjing University.";
            try {
                // Segmentation function; the second argument says whether to tag parts of speech
                nativeBytes = CLibrary.Instance.NLPIR_ParagraphProcess(sInput, 1);
                System.out.println("The original segmentation result is: " + nativeBytes);

                // Add two user words, one call per word
                CLibrary.Instance.NLPIR_AddUserWord("Information Retrieval n"); // n is the part of speech
                CLibrary.Instance.NLPIR_AddUserWord("Nanjing University n");
                nativeBytes = CLibrary.Instance.NLPIR_ParagraphProcess(sInput, 1);
                System.out.println("The result after adding user words is: " + nativeBytes);

                CLibrary.Instance.NLPIR_DelUsrWord("Nanjing University"); // delete one of the words
                nativeBytes = CLibrary.Instance.NLPIR_ParagraphProcess(sInput, 1);
                System.out.println("The result after deleting a word is: " + nativeBytes);

                // CLibrary.Instance.NLPIR_SaveTheUsrDic(); // save the user-defined words; not recommended

                int nCount = CLibrary.Instance.NLPIR_ImportUserDict("D:\\netbeansprojects\\cwordseg\\file\\adduserdict.txt", true);
                System.out.println(String.format("Imported %d user words", nCount));
                nativeBytes = CLibrary.Instance.NLPIR_ParagraphProcess(sInput, 1);
                System.out.println("The result after importing the dictionary is: " + nativeBytes);

                CLibrary.Instance.NLPIR_Exit(); // exit

            } catch (Exception ex) {
                // TODO Auto-generated catch block
                ex.printStackTrace();
            }
        }
    }


Description of the user vocabulary functions:

User-defined words take priority during word segmentation.

1.

    public int NLPIR_AddUserWord(String sWord);

Function: adds a small number of words, one at a time.

Parameters: sWord is the word to add, in the form "custom word + space + part of speech". Several spaces may be used, or a tab instead.

Note: words added with this function are temporary and only valid for the current run of the program. The function does not modify the dictionary data in the Data folder.
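The entry format described above can be sketched with a small helper. This is only an illustration: UserWordEntry and its format method are hypothetical names, not part of NLPIR, and the checks simply follow the "word + space + part of speech" form given in this article.

```java
// Hypothetical helper (not part of NLPIR): builds the "word + space + part of speech"
// string that the article says NLPIR_AddUserWord expects.
final class UserWordEntry {
    private UserWordEntry() {}

    /** Joins a word and its part-of-speech tag with a single space. */
    static String format(String word, String pos) {
        String w = word == null ? "" : word.trim();
        String p = pos == null ? "" : pos.trim();
        if (w.isEmpty() || p.isEmpty()) {
            throw new IllegalArgumentException("word and part of speech must both be non-empty");
        }
        if (p.matches(".*\\s.*")) {
            throw new IllegalArgumentException("part-of-speech tag must not contain whitespace");
        }
        return w + " " + p;
    }

    public static void main(String[] args) {
        // The resulting string would be passed to CLibrary.Instance.NLPIR_AddUserWord(...)
        System.out.println(format("Nanjing University", "n"));
    }
}
```

Building the string in one place makes it harder to forget the part of speech or to pass a malformed entry to the native call.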

2.

    public int NLPIR_DelUsrWord(String sWord);

Function: deletes a small number of words, one at a time.

Parameters: sWord is the word to delete, in the form "custom word", without a part of speech.

Note: I personally do not quite see the point of this function. To remove a temporary user word, you can simply comment out or delete the NLPIR_AddUserWord() call that added it. The function can delete neither the user words saved by NLPIR_SaveTheUsrDic() (described below) nor the words imported in bulk by NLPIR_ImportUserDict(), because it does not modify the files in the Data folder.
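Since the add call takes "word + part of speech" while the delete call takes the bare word, a small conversion helper can keep the two in sync. The sketch below is an assumption-laden illustration: UserWordKey and wordOnly are hypothetical names, and the code assumes every entry ends with a part-of-speech tag that contains no whitespace.

```java
// Hypothetical helper (not part of NLPIR): strips the trailing part-of-speech tag
// from an entry such as "Nanjing University n" to get the bare word for deletion.
final class UserWordKey {
    private UserWordKey() {}

    /** Returns the entry without its last whitespace-separated token (the assumed tag). */
    static String wordOnly(String entry) {
        String trimmed = entry.trim();
        int i = trimmed.length() - 1;
        // Walk back to the last whitespace character, which separates word and tag
        while (i >= 0 && !Character.isWhitespace(trimmed.charAt(i))) {
            i--;
        }
        if (i < 0) {
            return trimmed; // no separator found: treat the whole entry as the word
        }
        return trimmed.substring(0, i).trim();
    }

    public static void main(String[] args) {
        // The result would be passed to CLibrary.Instance.NLPIR_DelUsrWord(...)
        System.out.println(wordOnly("Nanjing University n"));
    }
}
```

Note that an entry with no tag but with internal spaces would lose its last token, so this only fits entries built in the "word + space + part of speech" form.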

3.

    public int NLPIR_SaveTheUsrDic();

Function: saves the user words to the user dictionary.

Parameters: none. The return value is 1 if the save succeeded and 0 otherwise.

Notes:

(1) It saves all previously added user words (excluding any that were deleted) to the user dictionary;

(2) User words saved by this function are permanent: the Userdict.pdat file in the Data folder is modified, and all later segmentation runs will use the saved words;

(3) Only words added by NLPIR_AddUserWord() can be saved; words imported by NLPIR_ImportUserDict() cannot.

Deactivation: because the effect is permanent, use one of the following methods to turn it off.
Method (1): Open the Configure.xml file in the Data folder and change the UserDict parameter from on to off;
Method (2): Replace the current Userdict.pdat file with the original one.
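Method (2) only works if a copy of the original file exists. A minimal sketch of taking such a copy before saving, and restoring it afterwards, is shown here. UserDictBackup is a hypothetical helper, not part of NLPIR; the ".bak" suffix and the location of Userdict.pdat in your installation are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of NLPIR): keeps a copy of the user dictionary file
// so that the original can be put back after NLPIR_SaveTheUsrDic() has changed it.
final class UserDictBackup {
    private UserDictBackup() {}

    /** Copies dictFile to "<name>.bak" in the same folder and returns the backup path. */
    static Path backup(Path dictFile) {
        Path copy = dictFile.resolveSibling(dictFile.getFileName() + ".bak");
        try {
            return Files.copy(dictFile, copy, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Overwrites dictFile with the contents of backupFile. */
    static void restore(Path backupFile, Path dictFile) {
        try {
            Files.copy(backupFile, dictFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: UserDictBackup <path to Userdict.pdat>");
            return;
        }
        System.out.println("Backup written to " + backup(Paths.get(args[0])));
    }
}
```

Calling backup(...) before the first save and restore(...) when the saved words are no longer wanted corresponds to Method (2) above.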

4.

    public int NLPIR_ImportUserDict(String sFileName, boolean bOverwrite);

Function: imports user words in bulk from a dictionary text file. The return value is the number of words added.

Parameters:

sFileName is the path of the dictionary text file, for example: D:\\netbeansprojects\\cwordseg\\file\\adduserdict.txt

bOverwrite=true means the newly imported data overwrites the existing user-defined dictionary;

bOverwrite=false means the newly imported data is appended to the existing user-defined dictionary.
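The difference between the two flag values can be pictured with plain Java collections. The following is a conceptual model only: it does not call NLPIR and says nothing about how the library stores its data. ImportSemantics and importWords are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Conceptual model only (no NLPIR calls): mimics the overwrite/append behaviour
// described for the bOverwrite flag, using ordinary sets of entries.
final class ImportSemantics {
    private ImportSemantics() {}

    /** Returns the dictionary contents after an import; the input set is not modified. */
    static Set<String> importWords(Set<String> current, List<String> imported, boolean overwrite) {
        Set<String> result = new LinkedHashSet<String>();
        if (!overwrite) {
            result.addAll(current); // append mode keeps the existing entries
        }
        result.addAll(imported);
        return result;
    }

    public static void main(String[] args) {
        Set<String> current = new LinkedHashSet<String>(Arrays.asList("Information retrieval n"));
        List<String> imported = Arrays.asList("Nanjing University n");
        System.out.println("overwrite=true:  " + importWords(current, imported, true));
        System.out.println("overwrite=false: " + importWords(current, imported, false));
    }
}
```

With overwrite set to true only the imported entries remain; with false the result contains both the old and the new entries.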

Text dictionary format: one entry per line, in the form "word + space + part of speech", for example:

    Information retrieval n
    Nanjing University n
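A dictionary file in this format can also be generated from code. The sketch below is one possible way to do it: UserDictFileWriter is a hypothetical helper, not part of NLPIR, and writing the file as UTF-8 is an assumption that has to match the encoding passed to NLPIR_Init.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of NLPIR): writes a text dictionary with one
// "word + space + part of speech" entry per line.
final class UserDictFileWriter {
    private UserDictFileWriter() {}

    /** Turns a word -> part-of-speech map into dictionary lines, keeping map order. */
    static List<String> toLines(Map<String, String> wordToPos) {
        List<String> lines = new ArrayList<String>();
        for (Map.Entry<String, String> e : wordToPos.entrySet()) {
            lines.add(e.getKey().trim() + " " + e.getValue().trim());
        }
        return lines;
    }

    /** Writes the entries to target, creating or replacing the file (UTF-8 assumed). */
    static void write(Path target, Map<String, String> wordToPos) {
        try {
            Files.write(target, toLines(wordToPos), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> words = new LinkedHashMap<String, String>();
        words.put("Information retrieval", "n");
        words.put("Nanjing University", "n");
        Path target = Files.createTempFile("adduserdict", ".txt");
        write(target, words);
        // The path of the written file would be passed to NLPIR_ImportUserDict(...)
        System.out.println("Dictionary written to " + target);
    }
}
```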

Notes:

(1) Importing user words with this function modifies the Fielddict.pdat and Fielddict.pos files in the Data folder and creates a new Userdefineddict.lst file, but it does not modify the Userdict.pdat file. Words imported this way can therefore be replaced by importing a new user dictionary (bOverwrite=true) or extended by importing additional words (bOverwrite=false).

(2) The user words that have been added are recorded in the Userdefineddict.lst file.

(3) If NLPIR_ImportUserDict is called with bOverwrite=false, the newly imported data does not overwrite the existing data. In that case you can edit the contents of Userdefineddict.lst (the existing words) and add new words at the same time;
(4) If NLPIR_ImportUserDict is called with bOverwrite=true, the newly imported data overwrites the existing data. Any edits made to Userdefineddict.lst are then overwritten as well, and only the newly imported words remain.

(5) User words imported this way are also permanent, and their effect on segmentation persists.

Deactivation:
Method 1: Open the Configure.xml file in the Data folder and change the Fielddict parameter from on to off;
Method 2: Import an empty text dictionary (the imported dictionary may be empty);
Method 3: Replace the current Fielddict.pdat and Fielddict.pos files with the original ones; Userdefineddict.lst can be deleted.
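For Method 2, the empty text dictionary can be created from code before it is imported. The sketch below only creates the file. EmptyUserDict and the default file name are hypothetical, and the import call itself is left as a comment because it needs the native library.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of NLPIR): creates an empty text dictionary that can
// be imported to clear previously imported user words.
final class EmptyUserDict {
    private EmptyUserDict() {}

    /** Creates target as an empty file, truncating it if it already exists. */
    static Path create(Path target) {
        try {
            return Files.write(target, new byte[0]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path target = Paths.get(args.length > 0 ? args[0] : "empty_userdict.txt");
        System.out.println("Created " + create(target).toAbsolutePath());
        // Next step (requires the native library):
        // CLibrary.Instance.NLPIR_ImportUserDict(target.toString(), true);
    }
}
```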
