Full-text search

Last Update:2018-05-27 Source: Internet

Author: User



http://www.ll19.com/#/oracle_text.html

1. The difference between full-text search and ordinary search

Even without the Oracle Text feature, there are several ways to search text in an Oracle database, such as the INSTR function and the LIKE operator:

    SELECT * FROM mytext WHERE INSTR(thetext, 'Oracle') > 0;
    SELECT * FROM mytext WHERE thetext LIKE '%Oracle%';

In many cases INSTR and LIKE are perfectly adequate, especially when searching small tables. However, these text-locating methods lead to full table scans, which are expensive, and their search capabilities are very limited. For large volumes of text data, the full-text search feature provided by Oracle is the better choice.

A note on INSTR and LIKE. In Oracle, the INSTR function tells you whether a string contains a given substring. It returns the position of the match, or 0 if there is none. Its syntax is:

    INSTR(string, substring, position, occurrence)

string: the source string (if a column name is given, the content of that column).
substring: the substring to look for in the source string.
position: where the search starts. Optional; the default is 1. A negative value means the search runs from right to left.
occurrence: which occurrence of the substring to find. Optional; the default is 1.

Performance comparison between INSTR and LIKE. In terms of efficiency, whichever one can use an index will be faster. LIKE can sometimes use an index, for example name LIKE 'Li%'. With a leading wildcard, as in name LIKE '%Li', the index cannot be used, so a pattern of the form '%word%' will not benefit from an ordinary index. Unlike some other databases, Oracle supports function-based indexes.
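As a rough illustration of the INSTR arguments and the function-based index idea described above, the following sketch may help. The sample string is arbitrary, the index name mytext_instr_idx is hypothetical, and the index statement assumes that mytext.thetext is a VARCHAR2 column; whether the optimizer actually uses the index depends on statistics and on the query matching the indexed expression.

```sql
-- Default position (1) and occurrence (1): the first 'OR' starts at character 2.
SELECT INSTR('CORPORATE FLOOR', 'OR') FROM dual;          -- 2

-- Start at character 3 and look for the 2nd occurrence from there.
SELECT INSTR('CORPORATE FLOOR', 'OR', 3, 2) FROM dual;    -- 14

-- Negative position: start 3 characters from the end and search right to left.
SELECT INSTR('CORPORATE FLOOR', 'OR', -3, 2) FROM dual;   -- 2

-- No match returns 0, which is why "> 0" means "contains".
SELECT INSTR('CORPORATE FLOOR', 'XY') FROM dual;          -- 0

-- Hypothetical function-based index on the exact expression used in the query.
-- Assumption: thetext is VARCHAR2; the index name is made up for this example.
CREATE INDEX mytext_instr_idx ON mytext (INSTR(thetext, 'Oracle'));

-- A query that repeats the indexed expression is a candidate for using the index.
SELECT * FROM mytext WHERE INSTR(thetext, 'Oracle') > 0;
```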
With a function-based index on INSTR over the name column, for example, such a query can run faster, which is the usual argument for INSTR being more efficient than LIKE. Such an index only helps queries that use exactly the indexed expression.

Note:

    INSTR(title, 'manual') > 0 is equivalent to title LIKE '%manual%'
    INSTR(title, 'manual') = 0 is equivalent to title NOT LIKE '%manual%'

2. Setup

Step 1: Check and set the database roles. First check whether the database has the CTXSYS user and the CTXAPP role. If they are missing, the interMedia feature was not installed when the database was created (Oracle 10g installs it, with this user and role, by default), and you must modify the database to install it. The ctxsys user is locked by default, so you must unlock it first.

Step 2: As the ctxsys user, grant the following privileges to the test user oratext:

    GRANT resource, connect, ctxapp TO oratext;
    GRANT execute ON ctxsys.ctx_cls TO oratext;
    GRANT execute ON ctxsys.ctx_ddl TO oratext;
    GRANT execute ON ctxsys.ctx_doc TO oratext;
    GRANT execute ON ctxsys.ctx_output TO oratext;
    GRANT execute ON ctxsys.ctx_query TO oratext;
    GRANT execute ON ctxsys.ctx_report TO oratext;
    GRANT execute ON ctxsys.ctx_thes TO oratext;
    GRANT execute ON ctxsys.ctx_ulexer TO oratext;

Step 3: Set the lexical analyzer (lexer). The mechanism behind Oracle full-text retrieval is quite simple. Oracle's patented lexical analyzer (lexer) finds all the meaningful units (Oracle calls them terms) in a document and records them in a group of tables whose names start with dr$, together with each term's position, number of occurrences, and hash value. At query time, Oracle looks up the corresponding terms in these tables, computes how often they occur, and uses an algorithm to compute a score for each document, the so-called "match rate". The lexer is the core of this mechanism and determines the efficiency of full-text retrieval.
Oracle provides different lexers for different languages. Three of them are commonly used:

basic_lexer: for English. It splits sentences into words on spaces and punctuation, and automatically treats very frequent words that have lost any retrieval value, such as "if" and "is", as noise words, so processing is efficient. Used on Chinese, however, this lexer has serious problems. It only recognizes spaces and punctuation, and a Chinese sentence normally contains no spaces, so it treats the whole sentence as a single term and retrieval effectively stops working. Take the sentence 中国人民站起来了 ("the Chinese people have stood up") as an example: basic_lexer produces only one term, the entire sentence. A search for 中国 ("China") then finds nothing.

chinese_vgram_lexer: a dedicated Chinese analyzer that supports all Chinese character sets (ZHS16CGB231280, ZHS16GBK, ZHT32EUC, ZHT16BIG5, ZHT32TRIS, ZHT16MSWIN950, ZHT16HKSCS, UTF8). It analyzes Chinese sentences character by character. The sentence 中国人民站起来了 is split into the following terms: '中', '中国', '国人', '人民', '民站', '站起', '起来', '来了', '了'. This approach is easy to implement and misses nothing, but its efficiency is unsatisfactory.

chinese_lexer: a newer Chinese analyzer that only supports the UTF8 character set. As shown above, chinese_vgram_lexer does not know common Chinese words, so its units are purely mechanical. Terms such as '民站' and '站起' never occur on their own in Chinese, so they are meaningless and only hurt efficiency. The biggest improvement in chinese_lexer is that it recognizes most common Chinese vocabulary and can therefore analyze sentences more efficiently. The two meaningless units above no longer appear, which greatly improves efficiency.
However, it only supports UTF8. If your database uses the ZHS16GBK character set, you can only use the cruder chinese_vgram_lexer.

If nothing is configured, Oracle uses basic_lexer by default. To specify which lexer to use:

1. Create a preference:

    exec ctx_ddl.create_preference('my_lexer', 'chinese_vgram_lexer');

2. Specify the lexer when creating the full-text index:

    CREATE INDEX myindex ON mytable (mycolumn) INDEXTYPE IS ctxsys.context PARAMETERS ('lexer my_lexer');

The index then uses chinese_vgram_lexer as its analyzer.

3. Testing full-text search

The test user is oratext. Creating this user and its tablespace is not covered here.

Step 1: Authorization. Log in as ctxsys and grant privileges to the oratext user:

    GRANT resource, connect, ctxapp TO oratext;
    GRANT execute ON ctxsys.ctx_cls TO oratext;
    GRANT execute ON ctxsys.ctx_ddl TO oratext;
    GRANT execute ON ctxsys.ctx_doc TO oratext;
    GRANT execute ON ctxsys.ctx_output TO oratext;
    GRANT execute ON ctxsys.ctx_query TO oratext;
    GRANT execute ON ctxsys.ctx_report TO oratext;
    GRANT execute ON ctxsys.ctx_thes TO oratext;
    GRANT execute ON ctxsys.ctx_ulexer TO oratext;

Step 2: Set the lexical analyzer, using chinese_vgram_lexer as the analyzer:

    BEGIN
      -- set the lexical analyzer
      ctx_ddl.create_preference('oratext_lexer', 'chinese_vgram_lexer');
    END;
    /

The following statement lists the Oracle Text preferences, both the system defaults and the ones you have set:

    SELECT pre_name, pre_object FROM ctx_preferences;

The result includes the lexer preference oratext_lexer that was just created (MY_LEXER from the earlier example is listed as well).
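Since the choice between chinese_vgram_lexer and chinese_lexer depends on the database character set, a quick check before creating the preference can save a failed index build. The following is a minimal sketch: the preference name oratext_cn_lexer is hypothetical, and the assumption that chinese_lexer requires a UTF8 database comes from the discussion above (the supported character sets may differ between Oracle versions).

```sql
-- Check which character set the database uses.
SELECT value
  FROM nls_database_parameters
 WHERE parameter = 'NLS_CHARACTERSET';

-- Assumption: the query above reports a UTF8 character set.
-- In that case a preference based on chinese_lexer can be created instead.
BEGIN
  -- 'oratext_cn_lexer' is a made-up name for this example
  ctx_ddl.create_preference('oratext_cn_lexer', 'chinese_lexer');
END;
/

-- Confirm that the preference exists for the current user.
SELECT pre_name, pre_object
  FROM ctx_user_preferences
 WHERE pre_name = 'ORATEXT_CN_LEXER';
```

An index would then reference it with PARAMETERS ('lexer oratext_cn_lexer') in the same way as the other lexer preferences.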
Step 3: Create a test table and insert test data:

    CREATE TABLE textdemo (
      id            NUMBER NOT NULL PRIMARY KEY,
      book_author   VARCHAR2(20),   -- author
      publish_time  DATE,           -- release date
      title         VARCHAR2(400),  -- title
      book_abstract VARCHAR2(2000), -- abstract
      path          VARCHAR2(200)   -- path
    );

    INSERT INTO textdemo VALUES (1, 'gong qijun', to_date('1970-10-07', 'yyyy-mm-dd'), 'mobile castle',
      'The story takes place in Europe at the end of the 19th century. Kind and lovely Sophie is cursed by a vicious witch and turned from an 18-year-old girl into a 90-year-old woman. '
      || 'Alone and helpless, she wanders out of town and stumbles into the mobile castle. Its master, Hal, is said to feed on girls'' souls, but things are not as terrible as people say: '
      || 'the eccentric Hal takes Sophie in, and the two begin a wonderful life together in the four-legged mobile castle. A love story woven from love and pain, joy and sorrow quietly unfolds amid the war.',
      'E:\textsearch\moveingcastle.doc');

    INSERT INTO textdemo VALUES (2, 'mo Beckman beot', to_date('2017-10-07', 'yyyy-mm-dd'), 'bullet turns',
      'The film, directed by Russian director Tim Beckman batov, has earned over $0.3 billion in box office revenue globally since its release in North America in late June. '
      || 'After its release in Asia, it also won the box office championship in Japan, South Korea and other places. '
      || 'Although many netizens have already seen the film through various channels, its striking audiovisual effects on the big screen should still draw a large number of fans to the cinema.',
      'E:\textsearch\catchance');

    INSERT INTO textdemo VALUES (3, 'yuanquance', to_date('2017-10-07', 'yyyy-mm-dd'), 'The stars Wu yanzu and Yuan Quan appeared.',
      'The movie "ruimeng" was filmed at Shanghai tonglefang, with stars Wu yanzu and Yuan Quan on set. '
      || 'Because it was shot late at night, fans did not notice, which gave the cast a quiet shooting environment. '
      || 'Yuan Quan, standing in the street with head bowed on the cold night, looked a little eerie.',
      'E:\textsearch\dream.txt');

    COMMIT;

Step 4: Create an index on the book_abstract column, using ORATEXT_LEXER (chinese_vgram_lexer) as the analyzer:

    CREATE INDEX demo_abstract ON textdemo (book_abstract) INDEXTYPE IS ctxsys.context PARAMETERS ('lexer ORATEXT_LEXER');

As described above, a number of additional tables and indexes whose names start with DR$ now exist. The system creates four related tables:

    DR$DEMO_ABSTRACT$I (the token table produced by word segmentation)
    DR$DEMO_ABSTRACT$K
    DR$DEMO_ABSTRACT$N
    DR$DEMO_ABSTRACT$R

The following statement checks whether any errors occurred during index creation:

    SELECT * FROM ctx_user_index_errors;

Appendix: index types. The index type given at creation (for example ctxsys.context) can be context, ctxcat, ctxrule, or ctxxpath.

CONTEXT: used to retrieve large amounts of continuous text. Supports many data formats such as Word, HTML, XML, and plain text. Supports range partitioning and parallel index creation. Supported column types: VARCHAR2, CLOB, BLOB, CHAR, BFILE, XMLType, and URIType. After DML operations the index must be synchronized manually with CTX_DDL.SYNC_INDEX. If a query contains multiple words, separate them with spaces (such as oracle itpub). Query operator: CONTAINS.

CTXCAT: suited to mixed queries (such as product id, price, and description) and to small text fragments with some structure. It is transactional: the index is synchronized automatically after DML operations. Operators: and, or, >, <, =, between, in. Query operator: CATSEARCH.

CTXRULE: query operator MATCHES.

CTXXPATH: not covered here (neither CTXRULE nor CTXXPATH is discussed further in this article).

In general, create a CONTEXT index and query it with CONTAINS.
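Because a CONTEXT index is not updated automatically after DML, newly inserted rows are normally not searchable until the index is synchronized. The sketch below illustrates this against the textdemo table and the demo_abstract index from Step 4. The row with id 4 and its text are made up for this example, and the behavior described in the comments assumes the index was created without any automatic sync setting.

```sql
-- Hypothetical extra row, added only to demonstrate synchronization.
INSERT INTO textdemo (id, book_author, publish_time, title, book_abstract, path)
VALUES (4, 'test author', SYSDATE, 'sync demo',
        'a short abstract about index synchronization', NULL);
COMMIT;

-- Before the sync, the new row is not expected to be found.
SELECT id FROM textdemo WHERE contains(book_abstract, 'synchronization') > 0;

-- Bring the CONTEXT index up to date with the pending DML.
BEGIN
  ctx_ddl.sync_index('demo_abstract');
END;
/

-- After the sync, the same query should return the row with id 4.
SELECT id FROM textdemo WHERE contains(book_abstract, 'synchronization') > 0;

-- Optional: look at a few of the terms the lexer stored in the token table.
-- DR$...$I is an internal table, so its layout may vary between versions.
SELECT token_text, token_count
  FROM dr$demo_abstract$i
 WHERE ROWNUM <= 10;
```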
Step 5: Test queries:

    -- OR query
    SELECT score(20), t.* FROM textdemo t WHERE contains(book_abstract, 'mobile castle OR Russian', 20) > 0;
    SELECT score(20), t.* FROM textdemo t WHERE contains(book_abstract, 'mobile castle OR Europe', 20) > 0;
    -- basic query
    SELECT score(20), t.* FROM textdemo t WHERE contains(book_abstract, 'mobile castle', 20) > 0;
    -- query combining multiple terms with AND
    SELECT score(20), t.* FROM textdemo t WHERE contains(book_abstract, 'mobile castle AND Europe', 20) > 0;

The tests pass.

4. Creating a full-text index over multiple columns (still being explored)

You often need to find records that match a condition in any of several text columns, for example to run a full-text search over subjectname (topic name) and briefintro (introduction) of the pmhsubjects (topic) table. In that case you need a full-text index that covers multiple columns. Follow these steps.

Create the preference for the multi-column index. Log in as ctxsys and run:

    BEGIN
      ctx_ddl.create_preference('ctx_demo_abstract_title', 'multi_column_datastore');
    END;
    /

Set the columns of the preference (logged in as ctxsys). The index covers the title, path, and book_abstract columns:

    BEGIN
      ctx_ddl.set_attribute('ctx_demo_abstract_title', 'columns', 'title, path, book_abstract');
    END;
    /

Create the full-text index. If the demo_abstract index from Step 4 still exists on book_abstract, drop it first, because Oracle does not allow two domain indexes of the same index type on one column:

    CREATE INDEX demo_abstract_title ON textdemo (book_abstract) INDEXTYPE IS ctxsys.context PARAMETERS ('datastore ctxsys.ctx_demo_abstract_title lexer ORATEXT_LEXER');

Test:

    SELECT score(20), t.* FROM textdemo t WHERE contains(book_abstract, 'mobile castle OR Russian', 20) > 0;

5. Searching large fields

    CREATE TABLE mytable (id NUMBER PRIMARY KEY, docs CLOB);
    INSERT INTO mytable VALUES (111555, 'this text will be indexed');
    INSERT INTO mytable VALUES (111556, 'This is a direct_datastore example');
    COMMIT;
    CREATE INDEX myindex ON mytable (docs) INDEXTYPE IS ctxsys.context PARAMETERS ('datastore ctxsys.default_datastore');
    SELECT * FROM mytable WHERE contains(docs, 'text') > 0;
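The score computed for each document is usually most useful for ranking results. As a small sketch against the mytable example above, the label passed as the third argument of CONTAINS ties the condition to the matching SCORE call, so the results can be ordered by relevance. The label value 1 and the column alias relevance are arbitrary choices for this example.

```sql
-- Rank matching documents by their Oracle Text score, best match first.
-- The label (1) links contains(...) to score(1); any number works as long as both agree.
SELECT score(1) AS relevance, id
  FROM mytable
 WHERE contains(docs, 'text OR example', 1) > 0
 ORDER BY score(1) DESC;
```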
