This article has been published by the author Wang Rongtao authorized NetEase Cloud community.
Welcome to NetEase Cloud Community, learn more about NetEase Technology product operation experience.
FTS5 Introduction
The previous article has introduced FTS3/FTS4, which focuses on their successor FTS5.
FTS5 was introduced in SQLite 3.9.0, but unfortunately many of the current OS or applications have not yet started using this version or the newer 3.10.x.
Note: A very exciting version of SQLite 3.9.0, in addition to the introduction of FTS5, introduces the Json1 extension, from which we can use the specific set of functions that it provides directly in the SQL-level manipulation of the JSON in the column without the need for "deserialization---and modification-and-serialization".
different from the FTS3/FTS4Build table
Column cannot be type decorated when table is created
MATCHINFO=FTS3 is removed and replaced with columnsize=0
Notindexed= is removed and replaced with the unindexed keyword
ICU word breaker is removed. Do not know whether the future will support ...
Compress=, uncompress=, and languageid= options are removed and there are no alternative features available
SELECT statement
The query syntax on the right side of the match operator is more explicit, eliminating ambiguity
DocId alias support is canceled and can now be used with ROWID
The left side of the match operator must be a table name and no longer support column names when full-text indexing
FTS5 supports order by rank. Rank is a special column whose value is BM25 () value when the full-text index is queried
Changes in built-in auxiliary functions
The Matchinfo () and offsets () functions are removed, and the function of the snippet () function is weakened.
Support for custom helper functions, the API can be used to build out the functions of several removed functions, and even build a more powerful
Built-in auxiliary functions will be further improved in the future
Other
There is currently no function equivalent to the Fts4aux table
FTS3/4 "Merge=x,y" replaced by FTS5 merge command
FTS3/4 "Automerge=x" is replaced by FTS5 automerge option
The difference of the underlying implementation
The way the inverted table is stored is changed, and the list of documents (instances) that refer to a single word can be stored separately so that it can support progressive loading at query time and save a lot of memory in some cases
Optimization of merging methods of index tree
The following focuses on the query syntax and custom functions, word breakers are described in detail.
Inquire
FTS5 the match query syntax is improved with respect to FTS3/FTS4. The BNF paradigm of the match right query condition can be described as follows:
< phrase>: = string [*]
< phrase>: = < Phrase> + < phrase>
< neargroup>: = near (< Phrase> < phrase> ... [, N])
< query>: = [< colspec>:] < phrase>
< query>: = [< colspec>:] < neargroup>
< query>: = (< query>)
< query>: = < query> and < query>
< query>: = < query> OR < query>
< query>: = < query> not < query>
< colspec>: = ColName
< colspec>: = {colname1 colname2 ...}
Where string can be a string or a bare word enclosed in double quotation marks. The bare words are made up of consecutive characters of the following:
Non-ASCII characters
Uppercase and lowercase English letters
Digital
Underline
Replace (Ascii/unicode code point is 26)
FTS5 match query syntax is more rigorous, more sensitive to punctuation, and its improvement brings a benefit of reducing ambiguity. At the same time, query syntax within multiple column ranges has become relatively straightforward. The focus of this article is not to explain the query syntax, so this is no longer open, interested students can from the end of the link to see this part of the details.
The built-in function of "simple match"
The current version of the FTS5 module provides the BM23 (), highlight (), snippet () three built-in functions. The BM25 () function reduces the threshold used by the Okapi BM25 function as an "add-on", while the snippet () function in FTS3/FTS4 is now renamed Highlight () and the new function is less powerful than the original, which is the "simple match" of real deal. The snippet () function under the FTS5 provides an extract of the word sequence fragments around the hit target, which is also "simple", as this can basically be obtained by a combination of parameters from the original.
For our project, the Matchinfo () and the offsets () function "Subtract" are the most influential! This also hardened our resolve to use custom word breakers and custom helper functions.
FTS5 Extension
The SQLite team took the Matchinfo () and the offsets () to be considered and did come up with a practical solution to the problem, the more developed C API. First, SQLite provides three APIs to create a custom word breaker, find the currently registered word breaker, and create a custom SQL function.
typedef struct fts5_api fts5_api;struct fts5_api { int iversion; /* Current value for 2 */ /* create custom word breaker */ int (*xcreatetokenizer) ( fts5_api *pApi, const char *zName, void *pcontext, fts5_tokenizer * ptokenizer, void (*xdestroy) (void*) ); /* Find the currently registered word breaker */ int (*xfindtokenizer) ( fts5_api *pApi, const char *zName, void **ppContext, fts5_tokenizer *ptokenizer ); /* Create custom SQL functions */ int (*xcreatefunction) ( fts5_api *papi, const char *zname, /* The Zname parameter specifies the name of the custom SQL function */ void *pContext, fts5_extension_function xFunction, void (*xdestroy) (void*) );};
Custom word Breakers
To implement a custom word breaker, you need to implement three functions xcreate, Xdelete, and Xtokenize.
Typedef struct fts5tokenizer fts5tokenizer;typedef struct fts5_tokenizer fts5_ tokenizer;struct fts5_tokenizer {int (*xcreate) (void*, const char **azarg, Int narg, fts5tokenizer **ppout);void (*xdelete) (fts5tokenizer*);int (*xTokenize) ( fts5tokenizer*, void *pctx, int flags, /* some with Fts5_tokenize_ A constant flag bit for the prefix that indicates the source of the call */ const char *pText, int nText, int (*xtoken) ( void *pctx, /* xtokenize () function pointer copy of the second argument */ int tflags, /* Some constant flag bits prefixed with Fts5_token_ to identify whether to open a synonymDon't */ const char *pToken, /* Point to buffer */ int ntoken, containing tokens /* token size, in bytes */ int istart, /* token the byte offset in the input text */ int iEnd /* token the last character in the input text offset +1 */ );};
In practice, we tend to generate contexts in xcreate, destroy contexts in Xdelete, and xtokenize are the core logic for real word segmentation. When Xtokenize is called, SQLite gives us a few parameters:
Flags are used to specify the source of the call, whether from creating a document or full-text search
Ptext used to make input text
ntext to indicate the size of the input text
The Xtoken is a back-off function. When the word breaker determines a word, it needs to call this back to tell the FTS5 drive frame the word's message.
Seemingly simple process, actually implies some pits, you need to distinguish between the byte or the word number, and these FTS5 official documents are not significantly described. I was in the writing mmfts5, was a pit, and later by reading SQLite source code combined with certain reasoning to determine the details.
The following code gives an example of a very simple (and even rudimentary) word breaker that shows the implementation of the Xtokenize:
Int mytokenize (Fts5tokenizer*, void *pctx, int flags, const char *pText, int nText, int (*xtoken) (Void *, int, const char *, int, int, int)) { int rc; int start = -1; int end; for (end = 0; end < ntext; end++) { if (Isspace (ptext[end)) { if (start != -1) { rc = Xtoken (pctx, 0, ptext, ntext, start, end); start = -1; if (RC&NBSP;!=&NBSP;SQLITE_OK) { return rc; } } } else { if (start = =&NBSP;-1) { start = end; } } } if (Start &NBSP;!=&NBSP;-1) { return xtoken (Pctx, 0, ptext, ntext, start , end); &NBSP;&NBSP;&NBSP;&NBSP;}&NBSP;&NBSP;&NBSP;&NBSP;RETURN&NBSP;SQLITE_OK;}
It is important to note that if Xtoken returns non-SQLITE_OK, the word breaker needs to be terminated immediately, and our MMFTS5 uses this in query mode to stop the participle after the target is found to avoid unnecessary performance loss.
Custom SQL functions
We can create and register custom SQL functions by Fts5_api->xcreatefunction, as described above. The custom SQL function is defined as follows:
Typedef struct fts5extensionapi fts5extensionapi;typedef struct fts5context Fts5Context;typedef struct Fts5PhraseIter Fts5PhraseIter;typedef void (*fts5_ Extension_function) ( const Fts5ExtensionApi *pApi, /* API object itself */ Fts5Context *pFts, /* fts Context */ sqlite3_context *pctx, /* return Value Context */ int nVal, /* apval Number of parameters */ sqlite3_value **apval /* parameter list pointer */);
As an example, we write one of the simplest custom functions, as follows:
void Mysqlfunc (const FTS5EXTENSIONAPI *PAPI, Fts5context *pfts, Sqlite3_context *pctx, int nval, Sqlite3_value **apval) {sqlite3_result_int (pCtx, 0);}
This function returns the integer 0 forever, if the function is registered as "MyFunc", which means that you query the message table with a SQL statement similar to the following to get an empty set or a series of rows containing only 0.
SELECT MyFunc (message) from message WHERE message MATCH ' in '
To make a difference in custom SQL functions, the Fts5extensionapi object provides a rich set of APIs that are sufficient to combine more powerful features than Matchinfo (), offsets (). Here is a brief description of them using annotations:
struct fts5extensionapi { int iversion; /* Current fixed value is 1 */ /* Gets the context of the custom function, which is given in the third argument of Xcreatefunction */ void * (*xuserdata) (fts5context*); /* gets the total number of columns for the table */ int (*xcolumncount) (fts5context*); /* gets the total number of rows in the table */ int (*xrowcount) (Fts5context*, sqlite3_int64 *pnrow); /* Gets the number of words in column icol and returns the number of words token if icol is negative */ int (*xcolumntotalsize) (fts5context*, Int icol, sqlite3_int64 *pntoken); /* Word segmentation for the specified text */ int (*xtokenize) (fts5context*, const char *ptext, int ntext, void *pctx, int (*xtoken) (void*, int, const char*, int, int, int) ); /* Returns the number of phrases in the current query expression (phrase, see previous) */ int (*xphrasecount) (fts5context*); /* returns the number of words in the iphrase phrase iphrase based on 0 */ int (*xphrasesize) (fts5context*, int iphrase); / * returns the number of hit phrases in the current line */ int (*xinstcount) (fts5context*, int *pninst ; /* Query the details of the IIDX hit of the current row. Piphrase, PiCol, Pioff return hit phrase number, hit column, column offset */ int (*xinst) (fts5context*, Int iidx, int *piphrase, int *picol, int *pioff); /* Current line rowid */ sqlite3_int64 (*xrowid) (Fts5context*); /* get data for the current row icol column */ int (* Xcolumntext) (FTS5CONTEXT*,&NBSP;INT&NBSP;ICOL,&NBSP;CONST&NBSP;CHAR&NBSP;**PZ,&NBSP;INT&NBSP;*PN); /* gets the number of words for a column of rows, and returns the number of words for the entire row if Icol is negative */ int (*xcolumnsize) ( Fts5context*, int icol, int *pntoken); /* rowID in ascending order to traverse all lines that hit the iphrase phrase */ int (*xqueryphrase) (fts5context*, int iphrase, void *puserdata, int (*) (const fts5extensionapi*,fts5context*,void*) ); /* The following two for custom secondary data operations For data transfer between single or multiple APIs, Xdelete is responsible for destroying data */ int (*xsetauxdata) (fts5context*, void *paux, void (*xdelete) (void*)) void * (*xgetauxdata) (Fts5Context*, int bclear); /* The following two iterations for the wordThe iphrase phrase is efficient, convenient but not easy to traverse through all phrases */ void (*xphrasefirst) ( fts5context*, int iphrase, fts5phraseiter*, int*, int*); void (*xphrasenext) (Fts5context*, fts5phraseiter*, int *picol, int *pioff);};
As an example, we will then modify the previous Mysqlfunc to the following custom function and register as "Myrowid":
void Myrowidfunc (const FTS5EXTENSIONAPI *PAPI, Fts5context *pfts, Sqlite3_context *pctx, int nval, Sqlite3_value **apval) {Sqlite3_result_int64 (PCtx, Papi->xrowid (PCTX));}
Then execute the following SQL statement to return the empty set or a series of rowid that contain "medium" data rows:
SELECT MyFunc (message) from message WHERE message MATCH ' in '
In short, FTS5 is very customizable, it is time to thoroughly and FTS3/FTS4 say goodbye ~
Reference
Http://sqlite.org/fts5.html
NetEase Cloud Free Experience Pavilion, 0 cost experience 20+ Cloud products!
More NetEase technology, products, operating experience sharing please click.
Related articles:
"Recommended" VUE framework core Data hijacking
"Recommended" from the suspect China residence group 493 million open room information disclosure See how individuals Prevent information disclosure
FTS5 and DIY