Full-text index: comparison of two Chinese lexical analyzers (chinese_vgram_lexer vs. chinese_lexer)


First, let's compare the two Chinese lexical analyzers. The test process is as follows:

Create two test tables

create table test (str varchar2(100));
create table test1 (str varchar2(100));



Insert data (the test string is 中华人民共和国, "People's Republic of China")

insert into test values ('中华人民共和国');
insert into test1 values ('中华人民共和国');


Create two Chinese lexer preferences

exec ctx_ddl.create_preference('my_lexer','CHINESE_VGRAM_LEXER');
exec ctx_ddl.create_preference('my_lexer1','CHINESE_LEXER');




Create the full-text indexes

CREATE INDEX test1_idx ON test1(str) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS('LEXER my_lexer1');
CREATE INDEX test_idx ON test(str) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS('LEXER my_lexer');




View the token tables generated by the full-text indexes


Chinese_vgram_lexer

Dexter@STARTREK> select * from DR$TEST_IDX$I;

TOKEN_TEXT           TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
-------------------- ---------- ----------- ---------- ----------- ----------
共和                          0           1          1           1 008805
国                            0           1          1           1 008807
和国                          0           1          1           1 008806
华人                          0           1          1           1 008802
人民                          0           1          1           1 008803
民共                          0           1          1           1 008804
中华                          0           1          1           1 008801





Chinese_lexer


Dexter@STARTREK> select * from DR$TEST1_IDX$I;

TOKEN_TEXT           TOKEN_TYPE TOKEN_FIRST TOKEN_LAST TOKEN_COUNT TOKEN_INFO
-------------------- ---------- ----------- ---------- ----------- ----------
共和国                        0           1          1           1 008803
人民                          0           1          1           1 008802
中华                          0           1          1           1 008801
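To quantify the difference, one can simply count the rows in the two token tables (a quick check assuming the index names from this test):

-- 7 bigram tokens for CHINESE_VGRAM_LEXER vs. 3 word tokens for CHINESE_LEXER on this sample row
select count(*) from DR$TEST_IDX$I;
select count(*) from DR$TEST1_IDX$I;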





Word splitting effect:

Chinese_lexer        Chinese_vgram_lexer
-------------        -------------------
共和国                共和
人民                  国
中华                  和国
                      华人
                      人民
                      民共
                      中华


For chinese_vgram_lexer, the official documentation provides the following description:

The CHINESE_VGRAM_LEXER type identifies tokens in Chinese text for creating Text indexes.

The experiment shows that the tokens are built from every pair of adjacent characters (overlapping bigrams), which clearly does not match the way Chinese words are normally used.
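Both indexes can still be queried with CONTAINS. For example, 人民 appears as a token in both token tables above, so queries like the following (a minimal sketch against the test tables created here) should return the sample row from either table:

-- 人民 was produced as a token by both lexers, so both queries match the sample row
select str from test where contains(str, '人民') > 0;   -- CHINESE_VGRAM_LEXER index
select str from test1 where contains(str, '人民') > 0;  -- CHINESE_LEXER index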




For chinese_lexer, the behavior is much more user-friendly. The official documentation describes it as follows:

The CHINESE_LEXER type identifies tokens in traditional and simplified Chinese text for creating Oracle Text indexes.

The experiment shows that CHINESE_LEXER segments the text into real words and generates far fewer tokens, which keeps the full-text index smaller and is a meaningful optimization. In addition, chinese_lexer supports stopwords and custom word lists, which can further speed up full-text retrieval.
The following describes how to customize the lexical analyzer.
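As a rough sketch of one such customization (the stoplist name my_stoplist below is hypothetical; the preference my_lexer1 and table test1 come from the test above), frequently occurring words can be declared as stopwords so they are not indexed at all:

-- create a basic stoplist and register a stopword (的 is just an example word)
exec ctx_ddl.create_stoplist('my_stoplist','BASIC_STOPLIST');
exec ctx_ddl.add_stopword('my_stoplist','的');

-- rebuild the CHINESE_LEXER index so it uses both the lexer preference and the stoplist
drop index test1_idx;
CREATE INDEX test1_idx ON test1(str) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS('LEXER my_lexer1 STOPLIST my_stoplist');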



