Using postgreSQL + bamboo to build a full-text search that is N times more convenient than lucene

Source: Internet
Author: User
Using postgreSQL + bamboo to build a full-text search that is N times more convenient than lucene, all the packages used include: Batch

Use postgreSQL + bamboo build N times more convenient than lucene full text search all used to package: cmake-2.6.4.tar.gz (for programming nlpbamboo) CRF++-0.53.tar.gz (for Word Segmentation) nlpbamboo-1.1.1.tar.bz2 (for indexing) install pgsql tar-zxvf postgreSQL-8.3.3.tar.gz

Using postgreSQL + bamboo to build a full-text search that is N times more convenient than lucene

All packages used include:

Cmake-2.6.4.tar.gz (used to compile nlpbamboo)

Crf1_+53.tar.gz (same as above)

Nlpbamboo-1.1.1.tar.bz2 (Word Segmentation)

PostgreSQL-8.3.3.tar.gz (for indexing)

Install pgsql

Tar-zxvf postgreSQL-8.3.3.tar.gz

Cd postgre-8.3.3

./Configure-prefix =/opt/pgsql


Make install

Useradd postgre

Chown-R postgre. postgre/opt/pgsql


Vi ~ Postgre/. bash_profile


Export PATH

PGLIB =/opt/pgsql/lib

PGDATA =/data/PGSearch

PATH = $ PATH:/opt/pgsql/bin

MANPATH = $ MANPATH:/opt/pgsql/man


# Mkdir-p/data/PGSearch

# Chown-R postgre. postgre/data/PGSearch

# Chown-R postgre. postgre/opt/pgsql

# Sudo-u postgre/opt/pgsql/bin/initdb-locale = zh_CN.UTF-8-encoding = utf8-D/data/PGSearch

# Sudo-u postgre/opt/pgsql/bin/postmaster-I-D/data/PGSearch & // allow Network Access

# Sudo-u postgre/opt/pgsql/bin/createdb kxgroup

# Vim/data/PGSearch/pg_mirror.conf Add the following accessible machines:

Host all trust

# Su-postgre

$ Pg_ctl stop

$ Postmaster-I-D/data/PGSearch &

Install Chinese word segmentation (Cmake CRF ++ bamboo)

Cmake is used to compile bamboo, and CRF ++ is dependent on bamboo.

Tar-zxvf cmake-2.6.4.tar.gz

Cd cmake-2.6.4



Make install

Tar-zxvf crf1_+53.tar.gz

Cd CRF ++-0.53



Make install

Tar-jxvf nlpbamboo-1.1.1.tar.bz2

Cd nlpbamboo-1.1.1

Mkdir build

Cd build/

Cmake...-DCMAKE_BUILD_TYPE = release

Make all

Make install

Cp index.tar.bz2/opt/bamboo/


Tar-jxvf index.tar.bz2



ERROR: libcrfpp. so.0: cannot open shared object file: No such file or directory


Ln-s/usr/local/lib/libcrfpp. so. */usr/lib/


Added Chinese Word Segmentation and extended to pgsql

# Vim/root/. bash_profile is also added:

PGLIB =/opt/pgsql/lib

PGDATA =/data/PGSearch

PATH = $ PATH:/opt/pgsql/bin

MANPATH = $ MANPATH:/opt/pgsql/man


# Source ~ /. Bash_profile



Make install




Psql kxgroup

\ I chinese_parser. SQL Import

Run the following SQL statement to split a word:

SELECT to_tsvector ('chinesecfg ',' the result is only known when bamboo is executed in the command line ');

Here, the next section describes how to index and query TEXT fields and build a complete search engine.

I. Basics

This time starts with an SQL statement:

Select * from dbname where field_name @ 'aa | bb 'order by rank (field_name, 'aa | BB ');

Explain from this SQL literal: query the words field_name matches aa or bb in the dbname table and sort them by their matched RANK.

After understanding the above section, we will learn four concepts: tsvector, tsquery, @, and gin.

1. tsvector:

PostgreSQL 8.3 supports full-text retrieval. in earlier versions, you must install and configure tsearch2. It provides two data types (tsvector and tsquery), and dynamically searches the collection of natural language documents to locate the most matched query results. tsvector is one of them.

The value of a tsvector is a list of Unique Word Segmentation categories. A word is formatted as a different word entry. During word segmentation, tsvector automatically removes duplicate words in word segmentation, load data in a certain order. For example

SELECT 'a fat cat sat on a mat and ate a fat rat ': tsvector;



'A ''on ''and ''ate ''cat'' 'fat' 'Mat ''rat ''sat'

The tsvector is used to split a string by space. This can sort the words after word segmentation into a row based on the number of occurrences (and the word length ).

For full-text searches in English and Chinese, we also need to look at the following SQL:

SELECT to_tsvector ('English ', 'the Fat Rats ');



'Fat': 2 'rat ': 3

To_tsvector is normalized by tsvector, where you can specify the word segmentation used.

2. tsquery:

As the name suggests, tsquery indicates query-related information. tsquery stores the entries used for retrieval. you can also use the boolean operator to join, & (AND), | (OR), and! (NOT). brackets () can be forcibly divided into one group.

At the same time, tsquery can also use weights when searching, and each word can use one or more weight tags, so that information with the same weight will be matched during retrieval. like the preceding tsvector, tsquery also has a to_tsquery function.

3 .@@:

In postgresql, full-text search and matching operations use the @ operator. If

If tsvector (document) matches tsquery (query), true is returned.

Let's look at a simple example:

SELECT 'a fat cat sat on a mat and ate a fat rat ': tsvector @ 'cat & rat': tsquery;

? Column?



When processing indexes, we still need to use their functions as follows:

SELECT to_tsvector ('fat cats ate fat rats ') @ to_tsquery ('fat & rat ');

? Column?



And the operator @ Can use text as tsvector and tsquery. The following operator can be used

Tsvector @ tsquery

Tsquery @ tsvector

Text @ tsquery

Text @ text

We have already used the first two types, but the last two types,

Text @ tsquery is equivalent to to_tsvector (x) @ y.

Text @ text is equivalent to to_tsvector (x) @ plainto_tsquery (y ).(~) Plainto_tsquery will be discussed later...

4. gin:

Gin is an index name, which is used for full-text indexing.

You can create gin indexes to accelerate the retrieval speed. For example

Create index pgweb_idx ON pgweb USING gin (to_tsvector ('inc', body ));

You can create an index in multiple ways. You can create an index or even connect two columns:

Create index pgweb_idx ON pgweb USING gin (to_tsvector ('inc', title | body ));

Ii. Improvement

After learning the basic knowledge, we should get started. To achieve full-text search, we need to create a document in the tsvector format and use tsquery to perform user queries, in the query, we return a query result sorted by importance.

First look at a to_tsquery SQL:

SELECT to_tsquery ('English ', 'fat | Rats: AB ');



'Fat' | 'rat ': AB

It can be seen that when the to_tsquery function processes the query text, the logical operators (& (AND), | (OR) and must be used between a single word in the query text! (NOT) join (or use parentheses ).

If you execute the following SQL statement, an error occurs:

SELECT to_tsquery ('English ', 'fat Rats ');

The plainto_tsquery function provides a standard tsquery. In the preceding example, plainto_tsquery automatically adds the logic & operator.

SELECT plainto_tsquery ('English ', 'fat Rats ');



'Fat' & 'rat'

However, the plainto_tsquery function cannot recognize logical operators and weight tags.

SELECT plainto_tsquery ('English ', 'the Fat & Rats: C ');



'Fat' & 'rat '& 'C'

Iii. Final

After reading the above pile, a thousand words are merged into one sentence. This article mainly talks about an SQL statement. After adding the extension described in the first part, the following SQL statement is used, search a sentence from a field and sort it out:

Select * from tabname where to_tsvector ('chinesecfg ', textname) @ plainto_tsquery ('search for Who') order by ts_rank (to_tsvector ('chinesecfg', textname ), plainto_tsquery ('search dos') limit 10;

The previous create table create index will not be written. It is critical to teach people to fish.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.