Using postgreSQL + bamboo to build a full-text search that is N times more convenient than lucene, all the packages used include: Batch
Use postgreSQL + bamboo build N times more convenient than lucene full text search all used to package: cmake-2.6.4.tar.gz (for programming nlpbamboo) CRF++-0.53.tar.gz (for Word Segmentation) nlpbamboo-1.1.1.tar.bz2 (for indexing) install pgsql tar-zxvf postgreSQL-8.3.3.tar.gz
Using postgreSQL + bamboo to build a full-text search that is N times more convenient than lucene
All packages used include:
Cmake-2.6.4.tar.gz (used to compile nlpbamboo)
Crf1_+53.tar.gz (same as above)
Nlpbamboo-1.1.1.tar.bz2 (Word Segmentation)
PostgreSQL-8.3.3.tar.gz (for indexing)
Install pgsql
Tar-zxvf postgreSQL-8.3.3.tar.gz
Cd postgre-8.3.3
./Configure-prefix =/opt/pgsql
Make
Make install
Useradd postgre
Chown-R postgre. postgre/opt/pgsql
Su-postgre
Vi ~ Postgre/. bash_profile
Add
Export PATH
PGLIB =/opt/pgsql/lib
PGDATA =/data/PGSearch
PATH = $ PATH:/opt/pgsql/bin
MANPATH = $ MANPATH:/opt/pgsql/man
Export PGLIB PGDATA PATH MANPATH
# Mkdir-p/data/PGSearch
# Chown-R postgre. postgre/data/PGSearch
# Chown-R postgre. postgre/opt/pgsql
# Sudo-u postgre/opt/pgsql/bin/initdb-locale = zh_CN.UTF-8-encoding = utf8-D/data/PGSearch
# Sudo-u postgre/opt/pgsql/bin/postmaster-I-D/data/PGSearch & // allow Network Access
# Sudo-u postgre/opt/pgsql/bin/createdb kxgroup
# Vim/data/PGSearch/pg_mirror.conf Add the following accessible machines: www.2cto.com
Host all 10.2.19.178 255.255.255.0 trust
# Su-postgre
$ Pg_ctl stop
$ Postmaster-I-D/data/PGSearch &
Install Chinese word segmentation (Cmake CRF ++ bamboo)
Cmake is used to compile bamboo, and CRF ++ is dependent on bamboo.
Tar-zxvf cmake-2.6.4.tar.gz
Cd cmake-2.6.4
./Configure
Gmake
Make install
Tar-zxvf crf1_+53.tar.gz
Cd CRF ++-0.53
./Configure
Make
Make install
Tar-jxvf nlpbamboo-1.1.1.tar.bz2
Cd nlpbamboo-1.1.1
Mkdir build
Cd build/
Cmake...-DCMAKE_BUILD_TYPE = release
Make all
Make install
Cp index.tar.bz2/opt/bamboo/
Cd/opt/bamboo/
Tar-jxvf index.tar.bz2
#/Opt/bamboo/bin/bamboo
If:
ERROR: libcrfpp. so.0: cannot open shared object file: No such file or directory
Run:
Ln-s/usr/local/lib/libcrfpp. so. */usr/lib/
Ldconfig
Added Chinese Word Segmentation and extended to pgsql
# Vim/root/. bash_profile is also added:
PGLIB =/opt/pgsql/lib
PGDATA =/data/PGSearch
PATH = $ PATH:/opt/pgsql/bin
MANPATH = $ MANPATH:/opt/pgsql/man
Export PGLIB PGDATA PATH MANPATH
# Source ~ /. Bash_profile
Cd/opt/bamboo/exts/postgres/chinese_parser/
Make
Make install
Su-postgre
Cd/opt/pgsql/share/contrib/
Touch/opt/pgsql/share/tsearch_data/chinese_utf8.stop
Psql kxgroup
\ I chinese_parser. SQL Import
Run the following SQL statement to split a word:
SELECT to_tsvector ('chinesecfg ',' the result is only known when bamboo is executed in the command line ');
Here, the next section describes how to index and query TEXT fields and build a complete search engine.
Www.2cto.com
I. Basics
This time starts with an SQL statement:
Select * from dbname where field_name @ 'aa | bb 'order by rank (field_name, 'aa | BB ');
Explain from this SQL literal: query the words field_name matches aa or bb in the dbname table and sort them by their matched RANK.
After understanding the above section, we will learn four concepts: tsvector, tsquery, @, and gin.
1. tsvector:
PostgreSQL 8.3 supports full-text retrieval. in earlier versions, you must install and configure tsearch2. It provides two data types (tsvector and tsquery), and dynamically searches the collection of natural language documents to locate the most matched query results. tsvector is one of them.
The value of a tsvector is a list of Unique Word Segmentation categories. A word is formatted as a different word entry. During word segmentation, tsvector automatically removes duplicate words in word segmentation, load data in a certain order. For example
SELECT 'a fat cat sat on a mat and ate a fat rat ': tsvector;
Tsvector
------------------
'A ''on ''and ''ate ''cat'' 'fat' 'Mat ''rat ''sat'
The tsvector is used to split a string by space. This can sort the words after word segmentation into a row based on the number of occurrences (and the word length ).
For full-text searches in English and Chinese, we also need to look at the following SQL:
SELECT to_tsvector ('English ', 'the Fat Rats ');
To_tsvector
------
'Fat': 2 'rat ': 3
To_tsvector is normalized by tsvector, where you can specify the word segmentation used.
2. tsquery:
As the name suggests, tsquery indicates query-related information. tsquery stores the entries used for retrieval. you can also use the boolean operator to join, & (AND), | (OR), and! (NOT). brackets () can be forcibly divided into one group.
At the same time, tsquery can also use weights when searching, and each word can use one or more weight tags, so that information with the same weight will be matched during retrieval. like the preceding tsvector, tsquery also has a to_tsquery function.
3 .@@:
In postgresql, full-text search and matching operations use the @ operator. If
If tsvector (document) matches tsquery (query), true is returned.
Www.2cto.com
Let's look at a simple example:
SELECT 'a fat cat sat on a mat and ate a fat rat ': tsvector @ 'cat & rat': tsquery;
? Column?
----
T
When processing indexes, we still need to use their functions as follows:
SELECT to_tsvector ('fat cats ate fat rats ') @ to_tsquery ('fat & rat ');
? Column?
----
T
And the operator @ Can use text as tsvector and tsquery. The following operator can be used
Tsvector @ tsquery
Tsquery @ tsvector
Text @ tsquery
Text @ text
We have already used the first two types, but the last two types,
Text @ tsquery is equivalent to to_tsvector (x) @ y.
Text @ text is equivalent to to_tsvector (x) @ plainto_tsquery (y ).(~) Plainto_tsquery will be discussed later...
4. gin:
Gin is an index name, which is used for full-text indexing.
You can create gin indexes to accelerate the retrieval speed. For example
Create index pgweb_idx ON pgweb USING gin (to_tsvector ('inc', body ));
You can create an index in multiple ways. You can create an index or even connect two columns:
Create index pgweb_idx ON pgweb USING gin (to_tsvector ('inc', title | body ));
Www.2cto.com
Ii. Improvement
After learning the basic knowledge, we should get started. To achieve full-text search, we need to create a document in the tsvector format and use tsquery to perform user queries, in the query, we return a query result sorted by importance.
First look at a to_tsquery SQL:
SELECT to_tsquery ('English ', 'fat | Rats: AB ');
To_tsquery
------
'Fat' | 'rat ': AB
It can be seen that when the to_tsquery function processes the query text, the logical operators (& (AND), | (OR) and must be used between a single word in the query text! (NOT) join (or use parentheses ).
If you execute the following SQL statement, an error occurs:
SELECT to_tsquery ('English ', 'fat Rats ');
The plainto_tsquery function provides a standard tsquery. In the preceding example, plainto_tsquery automatically adds the logic & operator.
SELECT plainto_tsquery ('English ', 'fat Rats ');
Plainto_tsquery
------
'Fat' & 'rat'
However, the plainto_tsquery function cannot recognize logical operators and weight tags.
SELECT plainto_tsquery ('English ', 'the Fat & Rats: C ');
Plainto_tsquery
-------
'Fat' & 'rat '& 'C'
Www.2cto.com
Iii. Final
After reading the above pile, a thousand words are merged into one sentence. This article mainly talks about an SQL statement. After adding the extension described in the first part, the following SQL statement is used, search a sentence from a field and sort it out:
Select * from tabname where to_tsvector ('chinesecfg ', textname) @ plainto_tsquery ('search for Who') order by ts_rank (to_tsvector ('chinesecfg', textname ), plainto_tsquery ('search dos') limit 10;
The previous create table create index will not be written. It is critical to teach people to fish.