FTS5 and DIY

Source: Internet
Author: User

This article has been published by the author Wang Rongtao authorized NetEase Cloud community.

Welcome to NetEase Cloud Community, learn more about NetEase Technology product operation experience.

FTS5 Introduction

The previous article has introduced FTS3/FTS4, which focuses on their successor FTS5.

FTS5 was introduced in SQLite 3.9.0, but unfortunately many of the current OS or applications have not yet started using this version or the newer 3.10.x.

Note: A very exciting version of SQLite 3.9.0, in addition to the introduction of FTS5, introduces the Json1 extension, from which we can use the specific set of functions that it provides directly in the SQL-level manipulation of the JSON in the column without the need for "deserialization---and modification-and-serialization".

different from the FTS3/FTS4Build table
    • Column cannot be type decorated when table is created

    • MATCHINFO=FTS3 is removed and replaced with columnsize=0

    • Notindexed= is removed and replaced with the unindexed keyword

    • ICU word breaker is removed. Do not know whether the future will support ...

    • Compress=, uncompress=, and languageid= options are removed and there are no alternative features available

SELECT statement
    • The query syntax on the right side of the match operator is more explicit, eliminating ambiguity

    • DocId alias support is canceled and can now be used with ROWID

    • The left side of the match operator must be a table name and no longer support column names when full-text indexing

    • FTS5 supports order by rank. Rank is a special column whose value is BM25 () value when the full-text index is queried

Changes in built-in auxiliary functions
    • The Matchinfo () and offsets () functions are removed, and the function of the snippet () function is weakened.

    • Support for custom helper functions, the API can be used to build out the functions of several removed functions, and even build a more powerful

    • Built-in auxiliary functions will be further improved in the future

Other
    • There is currently no function equivalent to the Fts4aux table

    • FTS3/4 "Merge=x,y" replaced by FTS5 merge command

    • FTS3/4 "Automerge=x" is replaced by FTS5 automerge option

The difference of the underlying implementation
    • The way the inverted table is stored is changed, and the list of documents (instances) that refer to a single word can be stored separately so that it can support progressive loading at query time and save a lot of memory in some cases

    • Optimization of merging methods of index tree

The following focuses on the query syntax and custom functions, word breakers are described in detail.

Inquire

FTS5 the match query syntax is improved with respect to FTS3/FTS4. The BNF paradigm of the match right query condition can be described as follows:

< phrase>: = string [*]
< phrase>: = < Phrase> + < phrase>
< neargroup>: = near (< Phrase> < phrase> ... [, N])
< query>: = [< colspec>:] < phrase>
< query>: = [< colspec>:] < neargroup>
< query>: = (< query>)
< query>: = < query> and < query>
< query>: = < query> OR < query>
< query>: = < query> not < query>
< colspec>: = ColName
< colspec>: = {colname1 colname2 ...}

Where string can be a string or a bare word enclosed in double quotation marks. The bare words are made up of consecutive characters of the following:

Non-ASCII characters
Uppercase and lowercase English letters
Digital
Underline
Replace (Ascii/unicode code point is 26)

FTS5 match query syntax is more rigorous, more sensitive to punctuation, and its improvement brings a benefit of reducing ambiguity. At the same time, query syntax within multiple column ranges has become relatively straightforward. The focus of this article is not to explain the query syntax, so this is no longer open, interested students can from the end of the link to see this part of the details.

The built-in function of "simple match"

The current version of the FTS5 module provides the BM23 (), highlight (), snippet () three built-in functions. The BM25 () function reduces the threshold used by the Okapi BM25 function as an "add-on", while the snippet () function in FTS3/FTS4 is now renamed Highlight () and the new function is less powerful than the original, which is the "simple match" of real deal. The snippet () function under the FTS5 provides an extract of the word sequence fragments around the hit target, which is also "simple", as this can basically be obtained by a combination of parameters from the original.

For our project, the Matchinfo () and the offsets () function "Subtract" are the most influential! This also hardened our resolve to use custom word breakers and custom helper functions.

FTS5 Extension

The SQLite team took the Matchinfo () and the offsets () to be considered and did come up with a practical solution to the problem, the more developed C API. First, SQLite provides three APIs to create a custom word breaker, find the currently registered word breaker, and create a custom SQL function.

typedef struct fts5_api fts5_api;struct fts5_api {    int  iversion;  /*  Current value for 2 */    /*  create custom word breaker  */     int  (*xcreatetokenizer) (        fts5_api *pApi,         const char *zName,         void *pcontext,        fts5_tokenizer * ptokenizer,        void  (*xdestroy) (void*)      );    /*  Find the currently registered word breaker  */    int  (*xfindtokenizer) (         fts5_api *pApi,         const char *zName,        void **ppContext,         fts5_tokenizer *ptokenizer    );    /*  Create custom SQL functions  */    int  (*xcreatefunction) (         fts5_api *papi,        const char *zname, /*  The Zname parameter specifies the name of the custom SQL function  */        void *pContext,         fts5_extension_function xFunction,         void  (*xdestroy) (void*)     );};
Custom word Breakers

To implement a custom word breaker, you need to implement three functions xcreate, Xdelete, and Xtokenize.

Typedef struct fts5tokenizer fts5tokenizer;typedef struct fts5_tokenizer fts5_ tokenizer;struct fts5_tokenizer {int  (*xcreate) (void*, const char **azarg,  Int narg, fts5tokenizer **ppout);void  (*xdelete) (fts5tokenizer*);int  (*xTokenize) ( fts5tokenizer*,     void *pctx,    int flags,               /*  some with Fts5_tokenize_ A constant flag bit for the prefix that indicates the source of the call  */    const char *pText, int nText,      int  (*xtoken) (        void *pctx,          /* xtokenize () function pointer copy of the second argument  */         int tflags,         /*   Some constant flag bits prefixed with Fts5_token_ to identify whether to open a synonymDon't  */        const char *pToken, /*  Point to buffer */        int ntoken,    containing tokens       /* token size, in bytes  */         int istart,         /* token the byte offset in the input text  */         int iEnd             /* token the last character in the input text offset +1 */    );};

In practice, we tend to generate contexts in xcreate, destroy contexts in Xdelete, and xtokenize are the core logic for real word segmentation. When Xtokenize is called, SQLite gives us a few parameters:

Flags are used to specify the source of the call, whether from creating a document or full-text search
Ptext used to make input text
ntext to indicate the size of the input text
The Xtoken is a back-off function. When the word breaker determines a word, it needs to call this back to tell the FTS5 drive frame the word's message.

Seemingly simple process, actually implies some pits, you need to distinguish between the byte or the word number, and these FTS5 official documents are not significantly described. I was in the writing mmfts5, was a pit, and later by reading SQLite source code combined with certain reasoning to determine the details.

The following code gives an example of a very simple (and even rudimentary) word breaker that shows the implementation of the Xtokenize:

Int mytokenize (Fts5tokenizer*,     void *pctx,    int  flags,    const char *pText, int nText,      int  (*xtoken) (Void *, int, const char *, int, int, int))  {    int rc;    int start = -1;     int end;    for  (end = 0; end < ntext;  end++)  {        if  (Isspace (ptext[end))  {             if  (start != -1)  {                 rc =  Xtoken (pctx, 0, ptext, ntext, start, end);                 start = -1;                 if  (RC&NBSP;!=&NBSP;SQLITE_OK)  {                     return rc;                 }             }        }  else {            if  (start = =&NBSP;-1)  {                 start = end;            }         }    }    if  (Start &NBSP;!=&NBSP;-1)  {         return xtoken (Pctx, 0, ptext, ntext, start ,  end); &NBSP;&NBSP;&NBSP;&NBSP;}&NBSP;&NBSP;&NBSP;&NBSP;RETURN&NBSP;SQLITE_OK;}

It is important to note that if Xtoken returns non-SQLITE_OK, the word breaker needs to be terminated immediately, and our MMFTS5 uses this in query mode to stop the participle after the target is found to avoid unnecessary performance loss.

Custom SQL functions

We can create and register custom SQL functions by Fts5_api->xcreatefunction, as described above. The custom SQL function is defined as follows:

 Typedef struct fts5extensionapi fts5extensionapi;typedef struct fts5context  Fts5Context;typedef struct Fts5PhraseIter Fts5PhraseIter;typedef void  (*fts5_ Extension_function) (    const Fts5ExtensionApi *pApi,   /*  API object itself  */    Fts5Context *pFts,               /* fts Context  */    sqlite3_context *pctx,           /*  return Value Context  */    int  nVal,                        /* apval Number of parameters  */    sqlite3_value  **apval           /*  parameter list pointer  */); 

As an example, we write one of the simplest custom functions, as follows:

void Mysqlfunc (const FTS5EXTENSIONAPI *PAPI, Fts5context *pfts, Sqlite3_context *pctx, int nval, Sqlite3_value **apval) {sqlite3_result_int (pCtx, 0);}

This function returns the integer 0 forever, if the function is registered as "MyFunc", which means that you query the message table with a SQL statement similar to the following to get an empty set or a series of rows containing only 0.

SELECT MyFunc (message) from message WHERE message MATCH ' in '

To make a difference in custom SQL functions, the Fts5extensionapi object provides a rich set of APIs that are sufficient to combine more powerful features than Matchinfo (), offsets (). Here is a brief description of them using annotations:

struct fts5extensionapi {    int iversion;                    /*  Current fixed value is 1 */     /*  Gets the context of the custom function, which is given in the third argument of Xcreatefunction  */    void  * (*xuserdata) (fts5context*);    /*  gets the total number of columns for the table  */    int   (*xcolumncount) (fts5context*);    /*  gets the total number of rows in the table  */     int  (*xrowcount) (Fts5context*, sqlite3_int64 *pnrow);    /*  Gets the number of words in column icol and returns the number of words token if icol is negative  */    int  (*xcolumntotalsize) (fts5context*,  Int icol, sqlite3_int64 *pntoken);    /*  Word segmentation for the specified text  */     int  (*xtokenize) (fts5context*,         const  char *ptext, int ntext,        void *pctx,         int  (*xtoken) (void*, int, const char*, int, int, int)      );    /*  Returns the number of phrases in the current query expression (phrase, see previous)  */     int  (*xphrasecount) (fts5context*);    /*  returns the number of words in the iphrase phrase iphrase based on 0 */     int  (*xphrasesize) (fts5context*, int iphrase);     / *  returns the number of hit phrases in the current line  */    int  (*xinstcount) (fts5context*, int *pninst ;    /*  Query the details of the IIDX hit of the current row. Piphrase, PiCol, Pioff return hit phrase number, hit column, column offset  */    int  (*xinst) (fts5context*,  Int iidx, int *piphrase, int *picol, int *pioff);     /*   Current line rowid */    sqlite3_int64  (*xrowid) (Fts5context*);    /*  get data for the current row icol column  */    int  (* Xcolumntext) (FTS5CONTEXT*,&NBSP;INT&NBSP;ICOL,&NBSP;CONST&NBSP;CHAR&NBSP;**PZ,&NBSP;INT&NBSP;*PN);     /*  gets the number of words for a column of rows, and returns the number of words for the entire row if Icol is negative  */    int  (*xcolumnsize) ( Fts5context*, int icol, int *pntoken);    /*  rowID in ascending order to traverse all lines that hit the iphrase phrase  */    int  (*xqueryphrase) (fts5context*, int  iphrase, void *puserdata,        int (*) (const  fts5extensionapi*,fts5context*,void*)     );    /*  The following two for custom secondary data operations For data transfer between single or multiple APIs, Xdelete is responsible for destroying data */    int  (*xsetauxdata) (fts5context*, void  *paux, void (*xdelete) (void*))     void * (*xgetauxdata) (Fts5Context*, int  bclear);    /*  The following two iterations for the wordThe iphrase phrase is efficient, convenient but not easy to traverse through all phrases  */    void  (*xphrasefirst) ( fts5context*, int iphrase, fts5phraseiter*, int*, int*);     void   (*xphrasenext) (Fts5context*, fts5phraseiter*, int *picol, int *pioff);};

As an example, we will then modify the previous Mysqlfunc to the following custom function and register as "Myrowid":

void Myrowidfunc (const FTS5EXTENSIONAPI *PAPI, Fts5context *pfts, Sqlite3_context *pctx, int nval, Sqlite3_value **apval) {Sqlite3_result_int64 (PCtx, Papi->xrowid (PCTX));}

Then execute the following SQL statement to return the empty set or a series of rowid that contain "medium" data rows:

SELECT MyFunc (message) from message WHERE message MATCH ' in '

In short, FTS5 is very customizable, it is time to thoroughly and FTS3/FTS4 say goodbye ~

Reference

Http://sqlite.org/fts5.html


NetEase Cloud Free Experience Pavilion, 0 cost experience 20+ Cloud products!

More NetEase technology, products, operating experience sharing please click.


Related articles:
"Recommended" VUE framework core Data hijacking
"Recommended" from the suspect China residence group 493 million open room information disclosure See how individuals Prevent information disclosure

FTS5 and DIY

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.