Using the Wukong full-text search engine


A recent project of ours needs full-text indexing and full-text search. Nobody on the team has experience in this area, so finding an open source full-text search engine that meets our needs became a priority. Our requirements are not complicated: the engine must support Chinese word segmentation, indexing and searching, and must let us implement the feature quickly. In full-text search, Elasticsearch is the undisputed leader. Built on Apache Lucene, it leads the industry with its powerful distributed capabilities, support for very large data sets, friendly RESTful APIs and near-real-time search performance; its developer community is also quite active, and documentation is plentiful. But precisely because it is so heavyweight, we did not choose Elasticsearch for this project and instead picked an engine with a much quieter reputation: Wukong.

I. Introduction of Wukong

Wukong is a high-performance full-text search engine written in Golang that supports Chinese word segmentation. In my view its biggest selling point is not that it is as large and feature-rich as Elasticsearch, but that it can be quickly integrated into your application or service as a library; this is probably the main reason we chose it at this stage. Of course, the fact that it belongs to the Golang technology stack is another reason I favor it :).

I first learned about Wukong at this year's GopherChina conference, where its author, Chen Hui, was the first speaker and gave the talk "Go and AI". In that talk he described the Wukong search engine in detail, along with several related open source projects such as sego.

In the Golang world, Wukong is not the only option for full-text search. Another, better-known one is Bleve, but Bleve does not support Chinese word segmentation and search by default; it has to be combined with a Chinese segmenter such as gojieba.

Wukong is essentially a one-man project by Chen Hui. During his time at Alibaba he used it in some internal projects, but overall Wukong has few users and little documentation, most of which is concentrated on its GitHub site. For an analysis of the Wukong source, I did find one article on a foreign site: "Code Reading: wukong full-text search engine".

This article focuses on using the Wukong engine rather than analyzing its code.

II. Full-text indexing and retrieval

1. The simplest example

Let's take a look at one of the simplest examples written using the Wukong engine:

// example1.go
package main

import (
    "fmt"

    "github.com/huichen/wukong/engine"
    "github.com/huichen/wukong/types"
)

var (
    searcher = engine.Engine{}
    docId    uint64
)

// The sample documents were originally Chinese sentences; they are shown
// here in English translation. The queries below use the Chinese keywords.
const (
    text1 = `At the FIFA Awards ceremony in Zurich, Barcelona star, Argentina captain Lionel Messi won the 5th Golden Globe Award, continued to create a new record of football`
    text2 = `December 6, online The appearance of the photo shows that the rear wing of the domestic fifth generation fighter -20 has been coated with a five-digit force number`
)

func main() {
    // Keep only document IDs in the index and load the sego dictionaries.
    searcher.Init(types.EngineInitOptions{
        IndexerInitOptions: &types.IndexerInitOptions{
            IndexType: types.DocIdsIndex,
        },
        SegmenterDictionaries: "./dict/dictionary.txt",
        StopTokenFile:         "./dict/stop_tokens.txt",
    })
    defer searcher.Close()

    // Index the two documents under distinct IDs.
    docId++
    searcher.IndexDocument(docId, types.DocumentIndexData{Content: text1}, false)
    docId++
    searcher.IndexDocument(docId, types.DocumentIndexData{Content: text2}, false)

    // Wait until indexing has finished before searching.
    searcher.FlushIndex()

    fmt.Printf("%#v\n", searcher.Search(types.SearchRequest{Text: "巴萨 梅西"}))
    fmt.Printf("%#v\n", searcher.Search(types.SearchRequest{Text: "战斗机 金球奖"}))
}

In this example we use the Wukong engine to index two documents, text1 and text2. Once the index is built, we run keyword queries against the engine. Here are the results:

$ go run example1.go
2016/12/06 21:40:04 载入sego词典 ./dict/dictionary.txt
2016/12/06 21:40:08 sego词典载入完毕
types.SearchResponse{Tokens:[]string{"巴萨", "梅西"}, Docs:[]types.ScoredDocument{types.ScoredDocument{DocId:0x1, Scores:[]float32{0}, TokenSnippetLocations:[]int(nil), TokenLocations:[][]int(nil)}}, Timeout:false, NumDocs:1}
types.SearchResponse{Tokens:[]string{"战斗机", "金球奖"}, Docs:[]types.ScoredDocument{}, Timeout:false, NumDocs:0}

As you can see, for the query "巴萨 梅西" (Barça, Messi) the engine correctly matched the first document (DocId:0x1). The second keyword combination, "战斗机 金球奖" (fighter jet, Ballon d'Or), matched no document. This example also shows that Wukong treats a multi-keyword query as an AND query: a document matches only if it contains all of the keywords. This is currently the only way of combining keywords that the Wukong engine supports.
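To make the AND semantics concrete, here is a minimal sketch of a toy inverted index with an intersection-style lookup. It is not Wukong's implementation; the names invertedIndex, add and searchAND are made up for illustration, and the token lists are assumed to be already segmented.

```go
package main

import "fmt"

// invertedIndex maps a token to the set of document IDs that contain it.
// Hypothetical toy structure, not Wukong's internal index.
type invertedIndex map[string]map[uint64]bool

// add records that document id contains each of the given tokens.
func (idx invertedIndex) add(id uint64, tokens []string) {
	for _, t := range tokens {
		if idx[t] == nil {
			idx[t] = make(map[uint64]bool)
		}
		idx[t][id] = true
	}
}

// searchAND returns the IDs of documents that contain every query token.
func (idx invertedIndex) searchAND(tokens []string) []uint64 {
	var result []uint64
	if len(tokens) == 0 {
		return result
	}
	// Start from the documents of the first token and keep only those
	// that also appear under every remaining token.
	for id := range idx[tokens[0]] {
		matched := true
		for _, t := range tokens[1:] {
			if !idx[t][id] {
				matched = false
				break
			}
		}
		if matched {
			result = append(result, id)
		}
	}
	return result
}

func main() {
	idx := invertedIndex{}
	idx.add(1, []string{"巴萨", "梅西", "金球奖"})
	idx.add(2, []string{"战斗机", "尾翼"})

	fmt.Println(idx.searchAND([]string{"巴萨", "梅西"}))   // document 1 contains both tokens
	fmt.Println(idx.searchAND([]string{"战斗机", "金球奖"})) // no document contains both
}
```

The second query finds nothing because each of its tokens occurs in a different document, which mirrors the behavior seen in the engine output.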

The index key of the Wukong engine is a uint64 value. We must make sure this value is unique for each document; otherwise the existing index entry will be overwritten.
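One possible way to keep document IDs unique when several goroutines index documents is an atomic counter. The sketch below is only an illustration; idAllocator is a hypothetical helper and not part of Wukong.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// idAllocator hands out strictly increasing uint64 document IDs.
// Hypothetical helper for illustration; Wukong does not provide it.
type idAllocator struct {
	last uint64
}

// newIDAllocator starts numbering after lastUsed, for example the number
// of documents already present in a restored index.
func newIDAllocator(lastUsed uint64) *idAllocator {
	return &idAllocator{last: lastUsed}
}

// next returns a fresh ID and is safe for concurrent use.
func (a *idAllocator) next() uint64 {
	return atomic.AddUint64(&a.last, 1)
}

func main() {
	alloc := newIDAllocator(3)
	seen := make(map[uint64]bool)
	var mu sync.Mutex
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			id := alloc.next()
			mu.Lock()
			seen[id] = true
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println("distinct IDs:", len(seen))
}
```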

Also note that when initializing IndexerInitOptions we pass types.DocIdsIndex as the IndexType. This tells the engine to keep only the matching DocId information in the index it builds and in its search results, which minimizes the engine's memory footprint.

If you do not set StopTokenFile when initializing EngineInitOptions, a search for "巴萨 梅西" is split into three keywords: "巴萨", " " (the space) and "梅西". The engine searches for each of them and merges the results, and both queries now return no documents:

$ go run example1.go
2016/12/06 21:57:47 载入sego词典 ./dict/dictionary.txt
2016/12/06 21:57:51 sego词典载入完毕
types.SearchResponse{Tokens:[]string{"巴萨", " ", "梅西"}, Docs:[]types.ScoredDocument{}, Timeout:false, NumDocs:0}
types.SearchResponse{Tokens:[]string{"战斗机", " ", "金球奖"}, Docs:[]types.ScoredDocument{}, Timeout:false, NumDocs:0}
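The role of the stop token list can be sketched as a simple filter over the segmented query. This is an assumption-laden illustration rather than Wukong's code: filterStopTokens is a made-up name, and the stop list is assumed to contain the single-space token.

```go
package main

import "fmt"

// filterStopTokens drops every token that appears in the stop set.
// Hypothetical helper for illustration only.
func filterStopTokens(tokens []string, stop map[string]bool) []string {
	kept := make([]string, 0, len(tokens))
	for _, t := range tokens {
		if !stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	// Tokens as reported by the engine when no stop token file is set.
	tokens := []string{"巴萨", " ", "梅西"}
	// Assumption: the stop token list contains the space token.
	stop := map[string]bool{" ": true}
	fmt.Printf("%q\n", filterStopTokens(tokens, stop))
}
```

With the space removed, an AND query only has to match the two real keywords.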

2. FrequenciesIndex and LocationsIndex

The Wukong engine supports two other IndexType values, FrequenciesIndex and LocationsIndex, which additionally retain the term frequency and the locations of the keywords in the document, respectively. Memory consumption rises from one type to the next, since each retains more information:

When IndexType = FrequenciesIndex:

$ go run example1.go
2016/12/06 22:03:47 载入sego词典 ./dict/dictionary.txt
2016/12/06 22:03:51 sego词典载入完毕
types.SearchResponse{Tokens:[]string{"巴萨", "梅西"}, Docs:[]types.ScoredDocument{types.ScoredDocument{DocId:0x1, Scores:[]float32{3.0480049}, TokenSnippetLocations:[]int(nil), TokenLocations:[][]int(nil)}}, Timeout:false, NumDocs:1}
types.SearchResponse{Tokens:[]string{"战斗机", "金球奖"}, Docs:[]types.ScoredDocument{}, Timeout:false, NumDocs:0}

When IndexType = LocationsIndex:

$ go run example1.go
2016/12/06 22:04:31 载入sego词典 ./dict/dictionary.txt
2016/12/06 22:04:38 sego词典载入完毕
types.SearchResponse{Tokens:[]string{"巴萨", "梅西"}, Docs:[]types.ScoredDocument{types.ScoredDocument{DocId:0x1, Scores:[]float32{3.0480049}, TokenSnippetLocations:[]int{37, 76}, TokenLocations:[][]int{[]int{37}, []int{76}}}}, Timeout:false, NumDocs:1}
types.SearchResponse{Tokens:[]string{"战斗机", "金球奖"}, Docs:[]types.ScoredDocument{}, Timeout:false, NumDocs:0}
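The difference between the three index types can be pictured as how much of a per-document record is kept. The sketch below is a toy model, not Wukong's data structures: the posting type and buildPosting are made-up names, and locations are assumed to be byte offsets into the document text.

```go
package main

import (
	"fmt"
	"strings"
)

// posting is what a toy index could keep per (token, document) pair.
// Hypothetical structure for illustration only.
type posting struct {
	DocID     uint64
	Frequency int   // kept at the "frequencies" level and above
	Locations []int // byte offsets, kept only at the "locations" level
}

// buildPosting scans text for non-overlapping occurrences of token.
func buildPosting(docID uint64, text, token string) posting {
	p := posting{DocID: docID}
	if token == "" {
		return p
	}
	offset := 0
	for {
		i := strings.Index(text[offset:], token)
		if i < 0 {
			break
		}
		p.Locations = append(p.Locations, offset+i)
		p.Frequency++
		offset += i + len(token)
	}
	return p
}

func main() {
	p := buildPosting(1, "梅西传球,梅西射门", "梅西")
	fmt.Println("doc IDs only:  ", p.DocID)
	fmt.Println("with frequency:", p.DocID, p.Frequency)
	fmt.Println("with locations:", p.DocID, p.Frequency, p.Locations)
}
```

Each level carries everything the previous one has plus more, which is why memory use grows from one index type to the next.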

3. The effect of word segmentation on the results

Earlier, when we left StopTokenFile unset, we got a first look at how word segmentation affects search results. Wukong's Chinese word segmentation is based entirely on sego, another open source project by the same author. Segmentation accuracy directly affects both index construction and keyword search results. The sego dictionary and the stop token file come from the web; if you need more accurate segmentation, you need to update dictionary.txt and stop_tokens.txt regularly.

For example, suppose the source document is "你们很感兴趣的 .NET Core 1.1 来了哦" ("The .NET Core 1.1 you are all so interested in is here") and the search keyword is "兴趣" ("interest"). You would expect the search to find this document, but the actual output is:

types.SearchResponse{Tokens:[]string{"兴趣"}, Docs:[]types.ScoredDocument{}, Timeout:false, NumDocs:0}

The reason is that sego segments the sentence "你们很感兴趣的 .NET Core 1.1 来了哦" as follows:

你们/r 很感兴趣/l 的/uj  /x ./x net/x  /x core/x  /x 1/x ./x 1/x  /x 来/v 了/ul 哦/zg

Sego does not split out "兴趣" on its own; it keeps the four characters "很感兴趣" ("very interested") together as one token. The Wukong engine therefore never creates an index entry for "兴趣" alone, which explains why the search finds nothing. You can use sego directly to check how the Wukong engine segments text, which helps you understand how documents get indexed.
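The effect can be reproduced with a toy segmenter. The sketch below uses greedy longest-match against a small dictionary; this is a deliberately simple stand-in and not sego's actual algorithm, and the segment function and the three-word dictionary are assumptions made for illustration.

```go
package main

import "fmt"

// segment splits text greedily, always taking the longest dictionary word
// that starts at the current position; unknown runes become single tokens.
// Toy algorithm for illustration, not how sego works internally.
func segment(text string, dict map[string]bool, maxWordRunes int) []string {
	if maxWordRunes < 1 {
		maxWordRunes = 1
	}
	runes := []rune(text)
	var tokens []string
	for i := 0; i < len(runes); {
		n := maxWordRunes
		if i+n > len(runes) {
			n = len(runes) - i
		}
		// Shrink the candidate until it is a dictionary word or one rune.
		for ; n > 1; n-- {
			if dict[string(runes[i:i+n])] {
				break
			}
		}
		tokens = append(tokens, string(runes[i:i+n]))
		i += n
	}
	return tokens
}

func main() {
	// The dictionary knows both the long phrase and the short word.
	dict := map[string]bool{"你们": true, "很感兴趣": true, "兴趣": true}
	tokens := segment("你们很感兴趣的", dict, 4)
	fmt.Printf("%q\n", tokens)

	found := false
	for _, t := range tokens {
		if t == "兴趣" {
			found = true
		}
	}
	fmt.Println("index would contain the token 兴趣:", found)
}
```

Because the longer phrase wins, the shorter word never becomes a token of its own, so an index built from these tokens cannot answer a query for it.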

III. Persistent indexing and startup recovery

In the examples above, the document index created by the Wukong engine lives in memory, and the data disappears when the program exits. Rebuilding the index from the source documents on every start is clearly unwise. Wukong can persist the index to disk files and restore the index data from those files when the program restarts, ready for subsequent keyword searches. Under the hood Wukong supports two persistence engines, boltdb and cznic/kv; boltdb is the default.

Let's look at an example of a persistent index (some code is omitted to keep the article short):

// example2_index_create.go
... ...
func main() {
    // Same options as before, plus persistent storage under ./index.
    searcher.Init(types.EngineInitOptions{
        IndexerInitOptions: &types.IndexerInitOptions{
            IndexType: types.DocIdsIndex,
        },
        UsePersistentStorage:    true,
        PersistentStorageFolder: "./index",
        SegmenterDictionaries:   "./dict/dictionary.txt",
        StopTokenFile:           "./dict/stop_tokens.txt",
    })
    defer searcher.Close()

    os.MkdirAll("./index", 0777)

    docId++
    searcher.IndexDocument(docId, types.DocumentIndexData{Content: text1}, false)
    docId++
    searcher.IndexDocument(docId, types.DocumentIndexData{Content: text2}, false)
    docId++
    searcher.IndexDocument(docId, types.DocumentIndexData{Content: text3}, false)

    // Wait until indexing has finished, then report the document count.
    searcher.FlushIndex()
    log.Println("Created index number:", searcher.NumDocumentsIndexed())
}

This source file creates a persistent index. As you can see, to persist the index you only need to set UsePersistentStorage to true in the engine's Init call and set PersistentStorageFolder to the path where the index files will be stored. Run the source file:

$ go run example2_index_create.go
2016/12/06 22:41:49 载入sego词典 ./dict/dictionary.txt
2016/12/06 22:41:53 sego词典载入完毕
2016/12/06 22:41:53 Created index number: 3

After execution, we can see the persisted index data files under the ./index path:

$ tree index
index
├── wukong.0
├── wukong.1
├── wukong.2
├── wukong.3
├── wukong.4
├── wukong.5
├── wukong.6
└── wukong.7

0 directories, 8 files

Now let's build a program that recovers indexes from persistent index data into memory and gives search results for search keywords:

// example2_index_search.go
... ...
var (
    searcher = engine.Engine{}
)

func main() {
    searcher.Init(types.EngineInitOptions{
        IndexerInitOptions: &types.IndexerInitOptions{
            IndexType: types.DocIdsIndex,
        },
        UsePersistentStorage:    true,
        PersistentStorageFolder: "./index",
        SegmenterDictionaries:   "./dict/dictionary.txt",
        StopTokenFile:           "./dict/stop_tokens.txt",
    })
    defer searcher.Close()

    searcher.FlushIndex()
    log.Println("recover index number:", searcher.NumDocumentsIndexed())

    fmt.Printf("%#v\n", searcher.Search(types.SearchRequest{Text: "巴萨 梅西"}))
}

Execute this program:

$ go run example2_index_search.go
2016/12/06 22:48:37 载入sego词典 ./dict/dictionary.txt
2016/12/06 22:48:41 sego词典载入完毕
2016/12/06 22:48:42 recover index number: 3
types.SearchResponse{Tokens:[]string{"巴萨", "梅西"}, Docs:[]types.ScoredDocument{types.ScoredDocument{DocId:0x1, Scores:[]float32{0}, TokenSnippetLocations:[]int(nil), TokenLocations:[][]int(nil)}}, Timeout:false, NumDocs:1}

The program successfully restores the index data created by the previous program and returns the correct result for the search request.

Note that boltdb uses flock to guarantee mutually exclusive access to the underlying data file. If one program has the boltdb file open and another program tries to open the same file, the second program blocks in the call that opens the database.

IV. Dynamic addition and deletion of indexes

The Wukong engine supports adding and deleting index entries dynamically at runtime, and the changes affect search results in real time.

Building on the persistent index created in the previous section, we start a program that supports dynamic index updates:

// example3.go
func main() {
    searcher.Init(types.EngineInitOptions{
        IndexerInitOptions: &types.IndexerInitOptions{
            IndexType: types.DocIdsIndex,
        },
        UsePersistentStorage:    true,
        PersistentStorageFolder: "./index",
        PersistentStorageShards: 8,
        SegmenterDictionaries:   "./dict/dictionary.txt",
        StopTokenFile:           "./dict/stop_tokens.txt",
    })
    defer searcher.Close()

    // Restore the persisted index and continue numbering after it.
    searcher.FlushIndex()
    log.Println("recover index number:", searcher.NumDocumentsIndexed())
    docId = searcher.NumDocumentsIndexed()

    os.MkdirAll("./source", 0777)

    go func() {
        for {
            var paths []string

            // update index dynamically
            // pollInterval is a placeholder: the multiplier of time.Second
            // is missing from the source listing.
            time.Sleep(pollInterval)
            var path = "./source"
            err := filepath.Walk(path, func(path string, f os.FileInfo, err error) error {
                if f == nil {
                    return err
                }
                if f.IsDir() {
                    return nil
                }

                fc, err := ioutil.ReadFile(path)
                if err != nil {
                    fmt.Println("read file:", path, "error:", err)
                    return nil // skip unreadable files instead of indexing empty content
                }

                docId++
                fmt.Println("indexing file:", path, "... ...")
                searcher.IndexDocument(docId, types.DocumentIndexData{Content: string(fc)}, true)
                fmt.Println("indexed file:", path, " ok")
                paths = append(paths, path)

                return nil
            })
            if err != nil {
                fmt.Printf("filepath.Walk() returned %v\n", err)
                return
            }

            // Remove source files that have been indexed.
            for _, p := range paths {
                err := os.Remove(p)
                if err != nil {
                    fmt.Println("remove file:", p, " error:", err)
                    continue
                }
                fmt.Println("remove file:", p, " ok!")
            }

            if len(paths) != 0 {
                // wait for the index to finish refreshing
                fmt.Println("flush index....")
                searcher.FlushIndex()
                fmt.Println("flush index ok")
            }
        }
    }()

    // Read keywords from standard input until the user types "exit".
    for {
        var s string
        fmt.Println("Please input your search keywords:")
        fmt.Scanf("%s", &s)
        if s == "exit" {
            break
        }

        fmt.Printf("%#v\n", searcher.Search(types.SearchRequest{Text: s}))
    }
}

The example3 program starts a goroutine that periodically reads new source documents to be indexed and updates the index in real time. The main routine waits for the user to enter keywords and returns the engine's search results. Let's run the program:

$ go run example3.go
2016/12/06 23:07:17 载入sego词典 ./dict/dictionary.txt
2016/12/06 23:07:21 sego词典载入完毕
2016/12/06 23:07:21 recover index number: 3
Please input your search keywords:
梅西
types.SearchResponse{Tokens:[]string{"梅西"}, Docs:[]types.ScoredDocument{types.ScoredDocument{DocId:0x1, Scores:[]float32{0}, TokenSnippetLocations:[]int(nil), TokenLocations:[][]int(nil)}}, Timeout:false, NumDocs:1}
Please input your search keywords:
战斗机
types.SearchResponse{Tokens:[]string{"战斗机"}, Docs:[]types.ScoredDocument{types.ScoredDocument{DocId:0x2, Scores:[]float32{0}, TokenSnippetLocations:[]int(nil), TokenLocations:[][]int(nil)}}, Timeout:false, NumDocs:1}
Please input your search keywords:

As you can see, with the restored index we can correctly find the documents containing keywords such as "梅西" (Messi) and "战斗机" (fighter jet).

When we enter "球王" ("king of football"), we get the following search result:

Please input your search keywords:
球王
types.SearchResponse{Tokens:[]string{"球王"}, Docs:[]types.ScoredDocument{}, Timeout:false, NumDocs:0}

No documents were matched.

No problem. Let's now add a document that contains the keyword "球王". We create a document named soccerking.txt with the following content:

《球王马拉多纳》是一部讲述世界上被公认为现代足球坛上最伟大的传奇足球明星迭戈·马拉多纳的影片。他出身于清贫家庭,九岁展露过人才华,十一岁加入阿根廷足球青少年队,十六岁便成为阿根廷甲级联赛最年轻的球员。1986年世界杯,他为阿根廷队射入足球史上最佳入球,并带领队伍勇夺金杯。他的一生充满争议、大起大落,球迷与人们对他的热爱却从未减少过,生命力旺盛的他多次从人生谷底重生。

After you move soccerking.txt into the source directory, the program prints the following log a few moments later:

indexing file: source/soccerking.txt ... ...
indexed file: source/soccerking.txt  ok
remove file: source/soccerking.txt  ok!
flush index....
flush index ok

Now let's search for keywords such as "球王" and "马拉多纳" (Maradona):

Please input your search keywords:
球王
types.SearchResponse{Tokens:[]string{"球王"}, Docs:[]types.ScoredDocument{types.ScoredDocument{DocId:0x4, Scores:[]float32{0}, TokenSnippetLocations:[]int(nil), TokenLocations:[][]int(nil)}}, Timeout:false, NumDocs:1}
Please input your search keywords:
马拉多纳
types.SearchResponse{Tokens:[]string{"马拉多纳"}, Docs:[]types.ScoredDocument{types.ScoredDocument{DocId:0x4, Scores:[]float32{0}, TokenSnippetLocations:[]int(nil), TokenLocations:[][]int(nil)}}, Timeout:false, NumDocs:1}

As you can see, the engine correctly searches for the corresponding doc.

V. Distributed indexing and search

From the previous sections we have a general idea of how Wukong works. Wukong stores its index in boltdb; each Wukong instance has exclusive access to its data and cannot share it with other Wukong instances. When a single node does not have enough memory for the data volume, the Wukong engine needs to be deployed in a distributed fashion for indexing and searching. The Wukong project describes a scheme for this:

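The principle of distributed search is as follows: when there are too many documents to index in the memory of a single machine, the documents can be split (sharded) by the hash of their text content, with different shards indexed by different servers. At query time the same request is dispatched to all shard servers, and the results returned by all servers are merged and re-ranked to produce the final search output. To keep the shards evenly sized, the Murmur3 hash function implemented in Go is recommended: https://github.com/huichen/murmur. Following this principle it is easy to implement distributed search with the Wukong engine (each shard server runs one Wukong engine), but such distributed systems are mostly highly customized; for example, task scheduling depends on the distributed environment, and sometimes an extra layer of servers is needed to balance the load.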

In essence, this is sharding applied to both indexing and searching. At this stage our project does not need a distributed Wukong, so I have no hands-on experience to share.
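The sharding scheme described above can be sketched in a few lines. This is only an outline under stated assumptions: it uses the standard library's FNV-1a hash as a stand-in for the recommended Murmur3, and scoredDoc, shardFor and mergeResults are hypothetical names that are not part of Wukong.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// scoredDoc is a simplified search hit (hypothetical type).
type scoredDoc struct {
	DocID uint64
	Score float32
}

// shardFor picks the shard for a document from the hash of its content.
// FNV-1a is used here in place of Murmur3; numShards must be positive.
func shardFor(content string, numShards int) int {
	h := fnv.New32a()
	h.Write([]byte(content))
	return int(h.Sum32() % uint32(numShards))
}

// mergeResults combines per-shard hits and re-ranks them by descending score.
func mergeResults(perShard ...[]scoredDoc) []scoredDoc {
	var all []scoredDoc
	for _, hits := range perShard {
		all = append(all, hits...)
	}
	sort.SliceStable(all, func(i, j int) bool { return all[i].Score > all[j].Score })
	return all
}

func main() {
	// Indexing side: route each document to a shard by content hash.
	docs := []string{"first document", "second document", "third document"}
	for _, d := range docs {
		fmt.Printf("%q -> shard %d\n", d, shardFor(d, 2))
	}

	// Query side: every shard answers the same request; merge and re-rank.
	merged := mergeResults(
		[]scoredDoc{{DocID: 1, Score: 3.0}},
		[]scoredDoc{{DocID: 4, Score: 5.5}, {DocID: 2, Score: 1.2}},
	)
	fmt.Println(merged)
}
```

In a real deployment each shard would run its own engine instance behind a network service, and the merge step would sit in whatever component fans the query out.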

VI. Limitations of the Wukong engine

With the description above, you should be able to get started with the Wukong engine. But before choosing it, it's important to understand some of its limitations:

1. Development is not active, documentation is scarce, and the community is small

The Wukong engine is essentially a one-man project; community participation is low and documentation is scarce. In addition, the author is running a startup and is busy building other wheels ^_^, so the Wukong project is not updated often.

2. Lack of plans and vision

The author does not seem to have the intention or momentum to keep improving and promoting the Wukong engine, and no new features have been added. In this respect it lags well behind Bleve.

3. Query functionality is simple; only keyword AND queries are supported

If you want to support flexible full-text search queries, the current version of Wukong is probably not for you.

4. Search accuracy depends on the coverage of dictionary.txt

As mentioned earlier, Wukong's indexing and search accuracy depends in part on how accurately the segmenter splits words, so how comprehensive the dictionary.txt file is becomes an important factor in search accuracy.

5. Missing plug-in support for storing indexes in relational db

The current Wukong engine can only persist its index to files; it cannot store and query the index in a database such as MySQL.

In short, Wukong is not a perfect full-text search engine. Whether to choose it depends on your context.

VII. Summary

Our choice of the Wukong engine has everything to do with the context our project is currently in: we need to quickly implement a simple yet usable full-text search service. If later versions bring further requirements on query capabilities or data scale, we may consider replacing the engine; Bleve and Elasticsearch would then be candidates.

The code for this article is available for download here.

Bigwhite. All rights reserved.
