Full-text indexes: a comparison of two Chinese lexical analyzers (chinese_vgram_lexer and chinese_lexer)

First, let's compare the two Chinese lexical analyzers. The test procedure is as follows:
Create two tables:

create table test (str varchar2(100));
create table test1 (str varchar2(100));
Insert the test data; the sample string is '中华人民共和国' ("the People's Republic of China"):

insert into test values ('中华人民共和国');
insert into test1 values ('中华人民共和国');
Create one lexer preference for each Chinese analyzer:

exec ctx_ddl.create_preference('my_lexer','CHINESE_VGRAM_LEXER');
exec ctx_ddl.create_preference('my_lexer1','CHINESE_LEXER');
Create the full-text indexes:

CREATE INDEX test_idx ON test(str) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS('LEXER my_lexer');
CREATE INDEX test1_idx ON test1(str) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS('LEXER my_lexer1');
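Once both indexes exist, they can be queried through the contains operator. A quick sanity check (a sketch; the query word 人民 assumes the sample row inserted above):

```sql
-- Both indexes should be able to find the sample row via an indexed token.
select str from test  where contains(str, '人民') > 0;
select str from test1 where contains(str, '人民') > 0;
```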
View the token tables (DR$...$I) generated by the full-text indexes.
For chinese_vgram_lexer:

Dexter@STARTREK> select token_text, token_type, token_first, token_last, token_count from DR$TEST_IDX$I;

TOKEN_TEXT  TOKEN_TYPE  TOKEN_FIRST  TOKEN_LAST  TOKEN_COUNT
----------  ----------  -----------  ----------  -----------
共和                 0            1           1            1
国                   0            1           1            1
和国                 0            1           1            1
华人                 0            1           1            1
人民                 0            1           1            1
民共                 0            1           1            1
中华                 0            1           1            1

7 rows selected.
For chinese_lexer:

Dexter@STARTREK> select token_text, token_type, token_first, token_last, token_count from DR$TEST1_IDX$I;

TOKEN_TEXT  TOKEN_TYPE  TOKEN_FIRST  TOKEN_LAST  TOKEN_COUNT
----------  ----------  -----------  ----------  -----------
共和国               0            1           1            1
人民                 0            1           1            1
中华                 0            1           1            1

3 rows selected.
Word-splitting results side by side (共和国 = "republic", 人民 = "the people", 中华 = "China"):

chinese_lexer | chinese_vgram_lexer
--------------+--------------------
共和国        | 共和
人民          | 国
中华          | 和国
              | 华人
              | 人民
              | 民共
              | 中华
For chinese_vgram_lexer, the official documentation provides the following description:
The CHINESE_VGRAM_LEXER type identifies tokens in Chinese text for creating Text indexes.
The experiment shows that this lexer builds its tokens from every pair of adjacent characters (overlapping bigrams), which clearly does not match the way Chinese text is actually segmented into words.
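The difference is easy to demonstrate with a query. A sketch (assuming the tables and indexes above): since the vgram index stores every adjacent character pair, one would expect even a meaningless fragment such as 民共 to find the row there, while the word-based chinese_lexer index never stored such a token:

```sql
-- 民共 is not a Chinese word, only an accidental character pair.
-- Expected: a hit against the vgram index ...
select str from test where contains(str, '民共') > 0;
-- ... but not against the word-based chinese_lexer index.
select str from test1 where contains(str, '民共') > 0;
```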
For chinese_lexer, the behavior is much more user-friendly:
The CHINESE_LEXER type identifies tokens in traditional and simplified Chinese text for creating Oracle Text indexes.
The experiment shows that chinese_lexer already segments the text into real words rather than generating an excessive token list, which keeps the full-text index compact. In addition, chinese_lexer supports stopwords and custom word lists, which can further speed up full-text retrieval.
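As a taste of that, here is a minimal sketch of attaching a custom stoplist (the name my_stoplist and the stopword are illustrative assumptions; ctx_ddl.create_stoplist and ctx_ddl.add_stopword are the standard Oracle Text calls):

```sql
-- Build a stoplist and add a stopword (illustrative example).
exec ctx_ddl.create_stoplist('my_stoplist', 'BASIC_STOPLIST');
exec ctx_ddl.add_stopword('my_stoplist', '的');

-- A column can carry only one Text index, so this assumes the earlier
-- test1_idx was dropped before being recreated with the stoplist attached.
CREATE INDEX test1_idx ON test1(str) INDEXTYPE IS ctxsys.CONTEXT
  PARAMETERS('LEXER my_lexer1 STOPLIST my_stoplist');
```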
The following describes how to customize the lexical analyzer.