--oracle Full-text indexing with Case study tuning

Source: Internet
Author: User
Tags create index lexer

--oracle Full-text indexing with Case study tuning

Full-Text Search (Oracle text)

Oracle text enables oracle9i to have powerful text retrieval capabilities and intelligent text management capabilities, Oracle text is a new name used by Oracle9i, known in oracle8/8i as Oracle intermedia text, Oracle8 was previously Oracle context cartridge. The indexing and lookup capabilities of Oracle Text are not limited to data stored in the database. It can retrieve and find documents stored in the file system, and can retrieve more than 150 document types, including Microsoft Word, PDF, and XML. The Oracle text lookup features include fuzzy lookups, stemming (search mice and find mouse), wildcard characters, proximity and other search methods, as well as result grading and keyword highlighting. You can even add a dictionary to find the collocation Word and find the document that contains the collocation word.

Oracle text needs to be indexed for retrievable data items, and the user is able to search for content, and the indexing process is modeled according to the pipeline, where the data passes through a series of transformations, adding its keywords to the index. The indexing process is divided into several stages, such as

650) this.width=650; "src=" http://s3.51cto.com/wyfs02/M01/53/C2/wKiom1Ru76bSaOvkAAC8Gbpz0iA684.jpg "title=" t1.png "alt=" Wkiom1ru76bsaovkaac8gbpz0ia684.jpg "/>

1. Data retrieval (Datastore): Simply remove the data from the data store (such as a Web page, database large object, or local file system) and send it to the next stage as a data stream.

2. Filter: The filter is responsible for converting data in various file formats into plain text format, and other components in the index pipeline can only handle plain text data and cannot recognize file formats such as Ms Word or Excel.

3. Segmentation (Sectioner): The segment adds metadata about the structure of the original data item.

4. Lexical analysis (Lexer): divides a character stream into several words based on the language of the data item. 5. Index: The last phase adds keywords to the actual index.

The difference between full-text search and general search

There are also many ways to search for text in an Oracle database, such as the InStr function and the like operation, without using the Oracle text feature:

1, SELECT *from mytext WHERE INSTR (thetext, ' Oracle ') > 0;

2. SELECT * from MyText WHERE thetext like '%oracle% ';

There are many times when it is ideal to use InStr and like, especially when searching across very small tables. However, these text positioning methods will result in a full table scan, the resource consumption is more expensive, and the implementation of the search function is very limited, so for large amounts of text data search, we recommend the use of Oralce provides full-text search functionality.

Attached: Here incidentally record instr and like:

In Oracle, you can use the INSTR function to judge a string to determine whether it contains the specified character. Its syntax is: INSTR (string, substring, position, occurrence).

String: Represents the source string (the Write field represents the contents of this field).

SUBSTRING: Represents a substring that you want to find from the source string.

Position: Represents the start position of the lookup, which is optional and defaults to 1.

Occurrence: The representative wants to find out the first occurrence of the substring from the source character, which is also optional and defaults to 1.

The value of the position is a negative number, then the representation is looked up from right to left.

Performance comparison of InStr and like

In fact, from an efficiency point of view, who can use the index, the query speed will be fast.

Like can sometimes be used to index, for example: name as ' Li% ', and when the following conditions the index will be invalidated: name '% Lee '. So generally when we look for Chinese similar to '% character% ', the index will be invalidated. Unlike other databases, Oracle supports function indexing. For example, to build a InStr index on the Name field, queries are faster, which is why InStr is more efficient than like.

Note: InStr (title, ' manual ') >0 equivalent to like '% manual% '

InStr (title, ' manual ') =0 equivalent to not-like '% manual% '

Oracle Text Indexing principle

The Oracle text index converts all the characters in the text into tokens (tokens), such as www.taobao.com, which translates into tokens such as www,taobao,com.
Four types of indexes are supported in oracle10g:

Context

Ctxcat

Ctxrule

Ctxxpath  

650) this.width=650; "src=" http://s3.51cto.com/wyfs02/M00/53/C1/wKioL1Ru8E2yfosBAAIoUrpDBrU710.jpg "title=" t2.png "alt=" Wkiol1ru8e2yfosbaaiourpdbru710.jpg "/>

CONTEXT

Used to retrieve a large amount of contiguous text data. Supports many data formats such as Word, HTML, XML, text, and more. The Support Range (range) partition, which supports the index type of the parallel creation index (Parallel indexing). Supported types: VARCHAR2, CLOB, BLOB, CHAR, BFILE, XMLType, and Uritype.

After DML operation, CTX_DDL is required. Sync_index Manual Sync Index If a query contains more than one word, it is separated directly by a space (such as Oracle Itpub)

Case Analysis:

Set up full-text search

Step Step One: Check and set the database role

First check the database for Ctxsys user and Ctxapp foot color. Without this user and role, it means that your database was created without the Intermedia feature installed (the 10G default installation has this user and role). You must modify the database to install this feature. By default, Ctxsys users are locked out, so Ctxsys users are enabled first.

11:53:13 [email protected] prod >select username,account_status from dba _users where username like  ' ctx% '; username                        account_status------------------------------ ------------------------ --------ctxsys                          EXPIRED & LOCKED11:54:17 [email  Protected] prod >alter user ctxsys identified by oracle account  unlock; User altered.11:55:07 [email protected] prod >select username,account_status  from dba_users where username like  ' ctx% '; Username              &nbSp;        account_status------------------------------ --------- -----------------------ctxsys                          OPEN12:00:13 [email  protected] prod >select role from dba_roles12:00:23   2    where role like  ' ctx% '; ROLE------------------------------Ctxapp

Step Two: Empower

Under Ctxsys users, grant the test user Scott the following permissions:

[[Email protected] ~]$ cat t.sqlgrant resource, connect, ctxapp to  scott; grant execute on ctxsys.ctx_cls to scott; grant execute on ctxsys.ctx_ddl to scott; grant execute on ctxsys.ctx_doc to scott; grant execute on ctxsys.ctx_output to scott; grant execute on ctxsys.ctx_query to scott; grant execute on ctxsys.ctx_report to scott; grant execute on ctxsys.ctx_thes to scott; Grant execute on ctxsys.ctx_ulexer to scott;11:58:04 [email protected]  prod >@/home/oracle/t.sqlgrant succeeded. Elapsed: 00:00:00.15grant succeeded. Elapsed: 00:00:00.21grant succeeded. Elapsed: 00:00:00.09grant succeeded. Elapsed: 00:00:00.09grant succeeded. Elapsed: 00:00:00.13grant succeeded. elapsed: 00: 00:00.07grant succeeded. Elapsed: 00:00:00.09grant succeeded. Elapsed: 00:00:00.10grant succeeded. elapsed: 00:00:00.07

Step three: Set up the lexical analyzer (lexer)

Oracle implements full-text retrieval, and its mechanism is simple. That is, through the Oracle patented Lexical Analyzer (lexer), all the ideographic units in the article (Oracle called term) are found, recorded in a set of tables beginning with dr$, and note the position, number of times, hash value and other information. When retrieved, Oracle looks for the appropriate term from this set of tables and calculates its frequency, based on an algorithm that calculates the score for each document (score), the so-called ' match rate '. And Lexer is the core of the mechanism, it determines the efficiency of full-text search. Oracle offers different lexer for different languages, and we typically have three of them:

basic_lexer: for English. It can separate the English words from the sentences according to the space and punctuation, and can automatically take some words that appear too high frequency to have lost the meaning of retrieval as ' garbage ' treatment, such as if,is, with high processing efficiency. However, the lexer applied to Chinese has many problems, because it only recognize spaces and punctuation, and Chinese words usually do not have spaces, so it will be the whole sentence as a term, in fact, the loss of retrieval ability. With the words ' Chinese people stand up ' as an example, the result of basic_lexer analysis is only one term, that is ' Chinese people stand up '. If you retrieve ' China ' at this point, the content will not be retrieved. Chinese_vgram_lexer: A specialized Chinese parser that supports all Chinese character sets (zhs16cgb231280 zhs16gbk zht32euc zht16big5  zht32tris zht16mswin950 zht16hkscs utf8 ). The analyzer analyzes Chinese sentences in terms of the word unit. ' The Chinese people stand up ' this sentence will be analyzed by it into the following term: ' Chinese ', ' China ', ' countrymen ', ' People ', ' min zhan ', ' Stand Up ', ' up ', ' Come ', ' up '. It can be seen that this method of analysis, the implementation of the algorithm is very simple, and can achieve ' clean sweep ', but the efficiency is passable. Chinese_lexer: This is a new Chinese parser that supports only the UTF8 character set. As has been seen above, chinese vgram lexer this analyzer because do not know the common Chinese vocabulary, so the analysis of the unit is very mechanical, like the above ' Min station ', ' Standing up ' does not appear alone in Chinese, so this term is meaningless, but it affects efficiency. Chinese_lexer's biggest improvement is that the analyzer can recognize most commonly used Chinese words, so it can more efficiently analyze sentences, like the above two foolish units will not appear again, greatly improved efficiency. But it only supports UTF8, if your database is the ZHS16GBK character set, you can only use the stupid chinese vgram lexer. If you do not make any settings, Oracle uses the Basic_lexer parser by default. 
12:05:01 [email protected] prod >select userenv (' language ')  from dual; USERENV (' LANGUAGE ')----------------------------------------------------American_america. Zhs16gbk12:08:05 [email protected] prod >desc ctx_ddlprocedure create_ preference argument name                   Type                     In/Out Default? ------------------------------  ----------------------- ------ -------- PREFERENCE_NAME                 VARCHAR2                 IN OBJECT_NAME                     VARCHAR2                 in 12:12:25 [email protected] prod >exec ctx_ ddl.create_preference  (' my_lexer ',  ' chinese_vgram_lexer ');P l/sql procedure successfully  completed. CREATE TABLE 12:13:15 [email protected] prod >create table textdemo ( 12:15:47   2       id number not null  primary key,12:15:47   3       book_author  VARCHAR2 (+),--author 12:15:47   4       publish_time date, --Release date 12:15:47   5       title varchar2,--heading 12 : 15:47   6       book_abstract varchar2 (+),--abstract 12 : 15:47   7       PATH VARCHAR2 (200)--path 12:15:47   8  ); Table created.

Inserting data insert INTO Textdemo VALUES (1, ' Gang ', to_date (' 2008-10-07 ', ' yyyy-mm-dd '), ' Moving Castle ', ' The story takes place in Europe at the end of 19th century, The kind lovable Sophie is cursed by the vicious witch, from the 18-year-old girl to 90-year-old mother-in-law, lonely helpless she accidentally walked into the town outside of the mobile castle, it is said that its owner HAL to absorb the girl's soul for joy, but things are not so scary people legend, the eccentric Hal actually took Sophie,   Two people in the four feet of the movement of the castle began a wonderful common life, a piece of love and pain, music and sad love story in the war quietly unfold ', ' E:\textsearch\moveingcastle.doc '); INSERT into Textdemo VALUES (2, ' Mobekmanbetov ', to_date (' 2008-10-07 ', ' yyyy-mm-dd '), ' Turn around ', ' this film was directed by Russian director Mobekmanbetov from 6 Since the end of the month in North America, more than $300 million in box office revenue has been achieved worldwide. In Asia, after the release of Japan, Korea and other place to win the box office title. Although many netizens have been in the past through various channels to contact the film, but believe that the film with the big screen shows the super Cool audio-visual effect, still can attract a large number of fans to the cinema to join.   ', ' E:\textsearch\catch.pdf '); INSERT into Textdemo VALUES (3, ' Yuan Quan ', to_date (' 2008-10-07 ', ' yyyy-mm-dd '), ' starring Daniel Wu and Yuan Quan ', ' movie Dream ' filmed in Shanghai Fun Square, starring Daniel Wu and Yuan Quan. Because it is late-night shooting, so there is not too many fans noticed, gave the crew a very clean shooting environment, standing in the street Yuan Quan Low head, in the cold night looks really some like female ghosts, creepy. ', ' E:\textsearch\dream.txt ');

Step four: Build the index in the Book_abstract field using the oratext_lexer:chinese_vgram_lexer you just set as the parser.

CREATE INDEX demo_abstract on Textdemo (book_abstract) indextype is ctxsys.context parameters (' lexer oratext_lexer '); commit; After many more tables and indexes that begin with dr$, the system creates four related tables: dr$demo_abstract$i (token table after word breaker) dr$demo_abstract$kdr$demo_abstract$n dr$ Demo_abstract$r

The following statement can see if an error occurred during index creation:

SELECT * from ctx_user_index_errors: for indexed types (for example, Ctxsys.context), there are four types: Context,ctxcat,ctxrule,ctxxpath. Context is used to retrieve a large number of contiguous text data. Supports many data formats such as Word, HTML, XML, text, and more. The Support Range (range) partition, which supports the index type of the parallel creation index (Parallel indexing). Supported types: VARCHAR2, CLOB, BLOB, CHAR, BFILE, XMLType, and Uritype.dml. After operation, CTX_DDL is required. Sync_index Manual Sync Index If a query contains more than one word, it is separated directly by a space (such as Oracle itpub). The query identifier CONTAINSCTXCAT applies to mixed query statements (such as query criteria including Product ID, price, description, etc.). Suitable for small queries with a certain structure of the text segment. Have transactional. After DML operations, the indexes are automatically synchronized. Operator: and,or,&gt,;<, =,between,in query identifier Catsearchctxrule query identifier matches. Ctxxpath (These two indexes do not go to more search related content) Generally we establish a context type index (contains to query).

Step Five: Query test

--query or select Score, t.* from Textdemo t WHERE contains (book_abstract, ' Move Castle or Russia ', >0; Select Score, t.* from Textdemo t WHERE contains (book_abstract, ' Move Castle or Europe ', ') >0;--basic Query Select score, t.* from T Extdemo t where contains (book_abstract, ' Moving Castle ') >0;--query contains multiple words and tests by Select Score (), t.* from Textdemo t WHERE Contains (Book_abstract, ' Moving Castle and Europe ') >0; test pass.

Create a full-text index on multiple fields

There are many times when you need to query a record that satisfies a condition from multiple text fields, you need to create a full-text index for multiple fields, such as a full-text search from the Pmhsubjects (thematic table) subjectname (topic name) and Briefintro (introduction). You need to follow these steps:

Establish a multi-field index of preference, log in as Ctxsys, and execute:

BEGIN

Ctx_ddl.create_preference (' Ctx_demo_abstract_title ', ' multi_column_datastore ');

END;

Establish the preference corresponding field value (log in Ctxsys) corresponding to the title path Book_abstract three fields to index:


BEGIN

Ctx_ddl.set_attribute (' ctx_demo_abstract_title ', ' columns ', ' title,path ');

END;

To create a full-text index:

CREATE INDEX demo_abstract_title on Textdemo (book_abstract) indextype is ctxsys.context parameters (' DATASTORE ctxsys. Ct X_demo_ abstract_title lexer oratext_lexer ');

Commit

Test

SELECT score, t.* from Textdemo t WHERE contains (book_abstract, ' Move Castle or Russia ', >0;

Search tests on large print segments

CREATE TABLE mytable (ID number PRIMARY KEY, Docs CLOB);

INSERT into MyTable VALUES (111555, ' This text'll be indexed ');

INSERT into MyTable VALUES (111556, ' This is a direct_datastore example ');

Commit;

CREATE INDEX Myindex on MyTable (Docs)

Indextype is Ctxsys.context

Parameters (' datastore ctxsys.default_datastore ');

SELECT * FROM MyTable WHERE contains (docs, ' text ') > 0;


This article is from the "Tianya blog," Please make sure to keep this source http://tiany.blog.51cto.com/513694/1580942

--oracle Full-text indexing with Case study tuning

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.