Differences between Coreseek, sphinx-for-chinese, and Sphinx + SCWS
Sphinx is an SQL-based full-text search engine that is widely used by many websites.
Sphinx has the following features:
A) High-speed indexing (peak throughput of up to 10 MB/s on contemporary CPUs);
B) High-performance search (average response time under 0.1 seconds per query on 2-4 GB of text data);
C) Massive data handling (known to handle over 100 GB of text data, and up to 100 million documents on a single-CPU system).
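To put these figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the peak indexing rate is sustained for the whole run, which real workloads rarely achieve, so the result is only a lower bound. The function name is made up for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted in the feature list.
PEAK_INDEX_RATE_MB_S = 10   # peak indexing rate (feature A)
CORPUS_GB = 100             # corpus size (feature C)


def min_indexing_hours(corpus_gb: float, rate_mb_s: float) -> float:
    """Lower bound on indexing time, assuming the peak rate is sustained."""
    corpus_mb = corpus_gb * 1024
    seconds = corpus_mb / rate_mb_s
    return seconds / 3600


hours = min_indexing_hours(CORPUS_GB, PEAK_INDEX_RATE_MB_S)
print(f"At least {hours:.1f} hours to index {CORPUS_GB} GB")
```

In other words, even at peak speed a full rebuild of a 100 GB corpus takes hours, which is why indexing speed appears among the trade-offs of each solution.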
Sphinx itself does not support Chinese well.
The problem lies mainly in word segmentation. English text can simply be split on spaces, but Chinese is written without delimiters between words, so segmentation is much harder.
Word segmentation is used in two places:
1. At index time, to segment the raw data being indexed
2. At search time, to segment the user's query before it is looked up in the index
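The two uses of segmentation can be sketched as follows. This is a minimal illustration, not the algorithm used by mmseg or SCWS: it uses simple forward maximum matching against a toy dictionary, and the dictionary contents and function names are assumptions made for the example.

```python
def segment_english(text):
    """English: splitting on whitespace is enough."""
    return text.split()


def segment_chinese_fmm(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word; fall back to a single character if none matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical); a real one holds many thousands of words.
TOY_DICT = {"全文", "检索", "引擎"}

# 1. Index time: segment the raw document text.
doc_tokens = segment_chinese_fmm("全文检索引擎", TOY_DICT)

# 2. Search time: segment the user's query with the same segmenter.
query_tokens = segment_chinese_fmm("检索引擎", TOY_DICT)

print(segment_english("full text search"))
print(doc_tokens)
print(query_tokens)
```

Using the same segmenter on both sides matters: a query token can only match if the indexer produced the same token from the document.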
Currently, the three most common solutions are Coreseek, sphinx-for-chinese, and Sphinx + SCWS.
1. Coreseek is a product built on Sphinx by Chinese developers. Its most stable version is currently based on the classic Sphinx 0.9.9 release.
Advantages: Mature documentation and an active community. Its mmseg segmenter is currently among the most practical Chinese word segmenters, and it can be used for both index-time and search-time segmentation.
Disadvantages: Slow development and version updates; slow indexing.
Policy: A dictionary-management back end is used to maintain the dictionary, and dictionaries are regenerated on a regular basis. The suite then performs word segmentation and indexing automatically.
Applicable scenarios: "ordinary youth" (everyday users), similar searches, and common websites.
2. sphinx-for-chinese is a Chinese-language extension developed on top of the classic Sphinx 0.9.9 release.
Advantages: Easy to deploy and operate. Word segmentation is built in and can be used for both index-time and search-time segmentation.
Disadvantages: Version updates are slow, word segmentation is weak, and indexing is slow.
Policy: Same as Coreseek.
Applicable scenarios: "ordinary youth" who want to build a search site quickly.
3. Sphinx + SCWS are two independent systems deployed separately. This follows the principle of high cohesion and low coupling, and is strongly recommended.
Advantages: The two systems are relatively independent, each with its own server. The word segmentation service can be reused for other purposes, and version updates are faster.
Disadvantages: Deployment and use are both somewhat more complicated. Index-time segmentation can only use unigram (one-character) segmentation, which produces a large volume of data.
Policy: Call the word segmentation service before running the search.
Applicable scenarios: "literary youth", that is, those who want to build a decent search.
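The Sphinx + SCWS flow can be sketched as follows. This is only an illustration under stated assumptions: the index is assumed to hold one token per character (for example via Sphinx's n-gram settings), `fake_segment_service` is a hypothetical stand-in for a call to a separate SCWS server, and the other function names are made up for the example. The only Sphinx feature relied on is the double-quote phrase operator of its extended query syntax.

```python
def unigram_tokens(text):
    """One token per non-space character, as in one-character
    (unigram) index-time segmentation."""
    return [ch for ch in text if not ch.isspace()]


def fake_segment_service(query):
    """Hypothetical stand-in for a request to a separate segmentation
    server; returns canned results for the demo."""
    canned = {"全文检索": ["全文", "检索"]}
    return canned.get(query, unigram_tokens(query))


def build_phrase_query(words):
    """Wrap each segmented word in double quotes so that its characters
    must appear adjacent and in order in the unigram index."""
    cleaned = (w.replace('"', "") for w in words)
    return " ".join(f'"{w}"' for w in cleaned if w)


# Index time: every character becomes its own token.
print(unigram_tokens("全文检索"))

# Search time: segment first, then build the query string for Sphinx.
words = fake_segment_service("全文检索")
print(build_phrase_query(words))
```

The unigram index stores a posting for every character, which is why this approach produces more index data than dictionary-based segmentation. The benefit is that the segmenter stays outside Sphinx and can be upgraded or reused independently.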