Full-Text Indexing: Comparing Two Chinese Lexers (chinese_vgram_lexer vs. chinese_lexer)

Last updated: 2014-10-20. Source: Internet

Uploaded by: User

Let's start by comparing the two Chinese lexers. The test procedure is as follows.

Create the tables

create table test  (str varchar2(100));
create table test1 (str varchar2(100));

Insert data

insert into test  values ('中華人員共和國');
insert into test1 values ('中華人員共和國');

Create two Chinese lexer preferences

exec ctx_ddl.create_preference('my_lexer', 'CHINESE_VGRAM_LEXER');
exec ctx_ddl.create_preference('my_lexer1', 'CHINESE_LEXER');

Create the full-text indexes

CREATE INDEX test_idx  ON test(str)  INDEXTYPE IS ctxsys.CONTEXT PARAMETERS('LEXER my_lexer');
CREATE INDEX test1_idx ON test1(str) INDEXTYPE IS ctxsys.CONTEXT PARAMETERS('LEXER my_lexer1');

Inspect the token tables generated by the full-text indexes

chinese_vgram_lexer

dexter@STARTREK> select * from DR$TEST_IDX$I;

TOKEN_TEXT  TOKEN_TYPE  TOKEN_FIRST  TOKEN_LAST  TOKEN_COUNT  TOKEN_INFO
----------  ----------  -----------  ----------  -----------  ----------
共和        0           1            1           1            008805
國          0           1            1           1            008807
和國        0           1            1           1            008806
華人        0           1            1           1            008802
人員        0           1            1           1            008803
員共        0           1            1           1            008804
中華        0           1            1           1            008801

chinese_lexer

dexter@STARTREK> select * from DR$TEST1_IDX$I;

TOKEN_TEXT  TOKEN_TYPE  TOKEN_FIRST  TOKEN_LAST  TOKEN_COUNT  TOKEN_INFO
----------  ----------  -----------  ----------  -----------  ----------
共和國      0           1            1           1            008803
人員        0           1            1           1            008802
中華        0           1            1           1            008801

Tokenization results:

chinese_lexer	chinese_vgram_lexer
共和國	共和
人員	國
中華	和國
	華人
	人員
	員共
	中華

For chinese_vgram_lexer, the official documentation gives this description:

The CHINESE_VGRAM_LEXER type identifies tokens in Chinese text for creating Text indexes.

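The experiment shows that it simply builds the index from every pair of adjacent characters: the seven-character test string yields six overlapping two-character tokens plus the trailing single character. That clearly does not fit the way Chinese text is normally used and searched in practice.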
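The adjacent-character behaviour observed in the token table can be sketched in a few lines of Python. This is only an emulation that reproduces the seven tokens listed for this one test string; it is not Oracle's implementation, and the function name `vgram_tokens` and the window size `n=2` are assumptions made for illustration.

```python
def vgram_tokens(text: str, n: int = 2) -> list[str]:
    """Emulate overlapping n-character tokens (hypothetical helper).

    Assumption: a window of n characters slides one character at a time,
    and the final window, which is shorter than n, is kept as a token.
    This matches the DR$TEST_IDX$I rows shown in this article only.
    """
    tokens = []
    for start in range(len(text)):
        # Slicing past the end of the string simply yields a shorter token.
        tokens.append(text[start:start + n])
    return tokens


tokens = vgram_tokens("中華人員共和國")
print(tokens)
# ['中華', '華人', '人員', '員共', '共和', '和國', '國']
```

The token order also lines up with the last two hex digits of TOKEN_INFO in the output above (01 for 中華 through 07 for 國), which suggests those digits record each token's position in the string.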

chinese_lexer, by contrast, is noticeably more user-friendly:

The CHINESE_LEXER type identifies tokens in traditional and simplified Chinese text for creating Oracle Text indexes.

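The experiment shows that it is better optimized: it does not generate an excessive number of tokens (three here, versus seven for the same string with chinese_vgram_lexer), which matters when tuning a full-text index. chinese_lexer also allows a custom lexicon; stopwords and a custom lexicon can further speed up full-text searches.
The next part explains how to customize the lexer.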
