A detailed explanation of full-text search in PostgreSQL


When developing web applications, you often have to add search. Sometimes we don't even know what will be searched when we first draw a magnifying glass into the mockups.

Search is a very important feature, which is why Lucene-based tools like Elasticsearch and SOLR have become popular. They are great, but before reaching for these massive "killer" search weapons, you may want a lightweight tool that is good enough.

By "good enough", I mean a search engine with the following features:

    • Stemming
    • Ranking / boosting
    • Support for multiple languages
    • Fuzzy search for misspellings
    • Accent support

Fortunately, PostgreSQL fully supports all of these features.

This article is aimed at readers who:

    • Use PostgreSQL and do not want to install and maintain a separate search engine.
    • Use another database (such as MySQL) and need better full-text search capabilities.

In this article we will illustrate PostgreSQL's full-text search capabilities with the following tables and data.

CREATE TABLE author (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL);
CREATE TABLE post (
  id SERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  content TEXT NOT NULL,
  author_id INT NOT NULL REFERENCES author (id));
CREATE TABLE tag (
  id SERIAL PRIMARY KEY,
  name TEXT NOT NULL);
CREATE TABLE posts_tags (
  post_id INT NOT NULL REFERENCES post (id),
  tag_id INT NOT NULL REFERENCES tag (id));

INSERT INTO author (id, name) VALUES
  (1, 'Pete Graham'),
  (2, 'Rachid Belaid'),
  (3, 'Robert Berry');
INSERT INTO tag (id, name) VALUES
  (1, 'SciFi'),
  (2, 'Politics'),
  (3, 'science');
INSERT INTO post (id, title, content, author_id) VALUES
  (1, 'Endangered species', 'Pandas are an endangered species', 1),
  (2, 'Freedom of Speech', 'Freedom of speech is a necessary right missing in many countries', 2),
  (3, 'Star Wars vs Star Trek', 'Few words from a big fan', 3);
INSERT INTO posts_tags (post_id, tag_id) VALUES (1, 3), (2, 2), (3, 1);

This is a blog-like application: a post table with title and content fields, post associated to author through a foreign key, and each post having multiple tags.

What is Full-text search

First, let's look at the definition:

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.


This definition introduces the concept of the document, which is important. When you search, you are looking for meaningful entities, and these are your documents. The PostgreSQL documentation explains it well:

A document is the unit of searching in a full text search system; for example, a magazine article or email message.

--Postgres documentation

A document can span multiple tables and represents the logical entity we want to search for.

Building our documents

In the previous section, we introduced the concept of a document. A document is unrelated to the table schema; it is about data, combining fields into one meaningful entity. Based on our table schema, our document consists of:

    • post.title
    • post.content
    • the name of the post's author
    • all tag names associated with the post

Generating the document from these requirements gives this SQL query:

SELECT post.title || ' ' ||
       post.content || ' ' ||
       author.name || ' ' ||
       coalesce(string_agg(tag.name, ' '), '') AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

 Endangered species Pandas are an endangered species Pete Graham science
 Freedom of Speech Freedom of speech is a necessary right missing in many countries Rachid Belaid Politics
 Star Wars vs Star Trek Few words from a big fan Robert Berry SciFi
(3 rows)

Because multiple tags can be associated with one post, we group by post and author and aggregate the tag names with string_agg(). Even though author is a foreign key and a post cannot have more than one author, author must either be wrapped in an aggregate function or added to the GROUP BY.

We also used coalesce(). When a value can be NULL, coalesce() is a good idea; otherwise the whole concatenated string would be NULL.

At this point, our document is just a long string, which is not directly useful. We need to convert it to the right format with to_tsvector().

SELECT to_tsvector(post.title) ||
       to_tsvector(post.content) ||
       to_tsvector(author.name) ||
       to_tsvector(coalesce(string_agg(tag.name, ' '), '')) AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

 'endang':1,6 'graham':9 'panda':3 'pete':8 'scienc':10 'speci':2,7
 'belaid':16 'countri':14 'freedom':1,4 'mani':13 'miss':11 'necessari':9 'polit':17 'rachid':15 'right':10 'speech':3,6
 'berri':13 'big':10 'fan':11 'robert':12 'scifi':14 'star':1,4 'trek':5 'vs':3 'war':2 'word':7
(3 rows)

This query returns the documents in tsvector format, suitable for full-text search. Let's try converting a simple string into a tsvector.

SELECT to_tsvector('Try not to become a man of success, but rather try to become a man of value');

This query will return the following result:

 'becom':4,13 'man':6,15 'rather':10 'success':8 'tri':1,11 'valu':17
(1 row)

Something strange happened: there are fewer words than in the original sentence, some words changed (try became tri), and they are all followed by numbers. What's going on?

A tsvector is a sorted list of distinct lexemes, where a lexeme is a word normalized so that different variants of the same word are merged.

Normalization almost always folds uppercase letters to lowercase, and usually removes suffixes (such as s, es and ing in English). This lets you match variations of the same word without tediously entering every possible variant.

The numbers represent the positions of the lexeme in the original string; for example, "man" appears at positions 6 and 15. You can count for yourself.

By default, to_tsvector in Postgres uses the "english" text search configuration, which ignores English stopwords (am, is, are, a ...).

This explains why the tsvector has fewer entries than the sentence has words. Later we will see more about languages and text search configurations.


Querying

We know how to build documents, but our goal is to search them. For that we use the @@ operator, which tests whether a tsvector matches a query, as shown below. Let's look at a few example queries.

> SELECT to_tsvector('If you can dream it, you can do it') @@ 'dream';
 ?column?
----------
 t
(1 row)

> SELECT to_tsvector('It''s kind of fun to do the impossible') @@ 'impossible';
 ?column?
----------
 f
(1 row)

The second query returns false because to get a match we need to build a tsquery: when we use the @@ operator directly with a string, the string is cast to tsquery without normalization. The difference between this cast and to_tsquery() is shown below.

SELECT 'impossible'::tsquery, to_tsquery('impossible');
   tsquery    | to_tsquery
--------------+------------
 'impossible' | 'imposs'
(1 row)

But the lexeme of "dream" is "dream" itself:

SELECT 'dream'::tsquery, to_tsquery('dream');
 tsquery | to_tsquery
---------+------------
 'dream' | 'dream'
(1 row)

From now on we will use to_tsquery to query the documents.

SELECT to_tsvector('It''s kind of fun to do the impossible') @@ to_tsquery('impossible');
 ?column?
----------
 t
(1 row)

A tsquery stores the lexemes to search for, and supports the logical operators & (AND), | (OR) and ! (NOT). You can use parentheses to group operators.

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('!fact');
 ?column?
----------
 f
(1 row)

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theory & !fact');
 ?column?
----------
 f
(1 row)

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('fiction | theory');
 ?column?
----------
 t
(1 row)

We can also use :* to match words that start with a given prefix.

> SELECT to_tsvector('If the facts don''t fit the theory, change the facts') @@ to_tsquery('theo:*');
 ?column?
----------
 t
(1 row)

Now that we know how to write full-text search queries, let's return to our original table schema and query our documents.

SELECT pid, p_title
FROM (SELECT post.id AS pid,
             post.title AS p_title,
             to_tsvector(post.title) ||
             to_tsvector(post.content) ||
             to_tsvector(author.name) ||
             to_tsvector(coalesce(string_agg(tag.name, ' '), '')) AS document
      FROM post
      JOIN author ON author.id = post.author_id
      JOIN posts_tags ON posts_tags.post_id = post.id
      JOIN tag ON tag.id = posts_tags.tag_id
      GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('Endangered & Species');

 pid |      p_title
-----+--------------------
   1 | Endangered species
(1 row)

This query finds documents that contain the words endangered and species, or close variants of them.

Language support

The Postgres built-in text search feature supports multiple languages: Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

SELECT to_tsvector('english', 'We are running');
 to_tsvector
-------------
 'run':3
(1 row)

SELECT to_tsvector('french', 'We are running');
        to_tsvector
----------------------------
 'are':2 'running':3 'we':1
(1 row)

Based on our initial schema, a column can be used to select the text search configuration. Suppose the post table contains content in different languages and stores each row's language in a column:

ALTER TABLE post ADD language text NOT NULL DEFAULT 'english';

To make use of the language column, we now rebuild the document:

SELECT to_tsvector(post.language::regconfig, post.title) ||
       to_tsvector(post.language::regconfig, post.content) ||
       to_tsvector('simple', author.name) ||
       to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')) AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

Without the explicit ::regconfig cast, the query fails with:

ERROR:  function to_tsvector(text, text) does not exist

regconfig is the object identifier type that represents a Postgres text search configuration.

Now the document is built using the text search configuration named in post.language.

We also used simple, another text search configuration shipped with Postgres. simple does not ignore stopwords and does not try to find the stem of a word: with simple, every group of characters separated by spaces becomes a lexeme. The simple configuration is useful for data like a person's name, where we may not want to stem anything.

SELECT to_tsvector('simple', 'We are running');
        to_tsvector
----------------------------
 'are':2 'running':3 'we':1
(1 row)

Accented characters

When building a search engine that supports many languages, you also need to consider accents. In many languages accents are significant and can change the meaning of a word. Postgres ships with the unaccent extension, which is useful for removing accents from content.

CREATE EXTENSION unaccent;
SELECT unaccent('èéêë');
 unaccent
----------
 eeee
(1 row)

Let's add some accented content to our post table:

INSERT INTO post (id, title, content, author_id, language)
VALUES (4, 'il était une fois', 'il était une fois un hôtel ...', 2, 'french');

If we want to ignore accents when building our documents, we can simply do the following:

SELECT to_tsvector(post.language::regconfig, unaccent(post.title)) ||
       to_tsvector(post.language::regconfig, unaccent(post.content)) ||
       to_tsvector('simple', unaccent(author.name)) ||
       to_tsvector('simple', unaccent(coalesce(string_agg(tag.name, ' '), ''))) AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

This works, but it is a bit of a hassle and leaves more room for error. A better option is to create a new text search configuration that handles unaccented characters:

CREATE TEXT SEARCH CONFIGURATION fr (COPY = french);
ALTER TEXT SEARCH CONFIGURATION fr
  ALTER MAPPING FOR hword, hword_part, word
  WITH unaccent, french_stem;

When we use this new text search configuration, we can see the extra lexemes:

SELECT to_tsvector('french', 'il était une fois');
 to_tsvector
-------------
 'fois':4
(1 row)

SELECT to_tsvector('fr', 'il était une fois');
    to_tsvector
--------------------
 'etait':2 'fois':4
(1 row)

This gives us the same result as applying unaccent first and building the tsvector from the result:

SELECT to_tsvector('french', unaccent('il était une fois'));
    to_tsvector
--------------------
 'etait':2 'fois':4
(1 row)

The number of lexemes differs because il, était and une are French stopwords. Is it a problem to keep etait in our document? I don't think so: etait is not really a stopword, it is a misspelling of était.

SELECT to_tsvector('fr', 'Hôtel') @@ to_tsquery('hotels') AS result;
 result
--------
 t
(1 row)

If we create an accent-free search configuration for each language our posts can be written in, and keep that configuration name in post.language, then we can keep our earlier document query unchanged:

SELECT to_tsvector(post.language::regconfig, post.title) ||
       to_tsvector(post.language::regconfig, post.content) ||
       to_tsvector('simple', author.name) ||
       to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')) AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

If you need to create an accent-free text search configuration for every language supported by Postgres, you can use this gist.
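For illustration, the per-language configurations could also be generated in a loop. This is only a sketch: the configuration naming scheme (language name plus "_unaccent") and the three-language list are assumptions, and it relies on the unaccent extension being installed.

```sql
-- Create an unaccented text search configuration for a few languages.
-- Each copies the built-in configuration and remaps word tokens to run
-- through unaccent before the language's stemming dictionary.
DO $$
DECLARE
  lang text;
BEGIN
  FOREACH lang IN ARRAY ARRAY['french', 'german', 'spanish'] LOOP
    EXECUTE format('CREATE TEXT SEARCH CONFIGURATION %I (COPY = %I)',
                   lang || '_unaccent', lang);
    EXECUTE format('ALTER TEXT SEARCH CONFIGURATION %I
                      ALTER MAPPING FOR hword, hword_part, word
                      WITH unaccent, %I',
                   lang || '_unaccent', lang || '_stem');
  END LOOP;
END $$;
```

The snowball dictionaries french_stem, german_stem and spanish_stem ship with Postgres, so the format() calls resolve to valid dictionary names for these three languages.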

Our document size may grow because it can now include unaccented lexemes, and we can answer queries typed without accents. This can be useful when someone searches French content with an English keyboard.


Ranking

When you create a search engine, you want results sorted by relevance. Ranking can be based on many factors, which the Postgres documentation explains roughly as follows:

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important is the part of the document where they occur.

--PostgreSQL documentation

PostgreSQL provides a few functions to get the relevance results we want; in our case we will use two of them: ts_rank() and setweight().

The function setweight lets us assign a weight, i.e. an importance, to the entries of a tsvector; the weight must be 'A', 'B', 'C' or 'D'.

SELECT pid, p_title
FROM (SELECT post.id AS pid,
             post.title AS p_title,
             setweight(to_tsvector(post.language::regconfig, post.title), 'A') ||
             setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
             setweight(to_tsvector('simple', author.name), 'C') ||
             setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'B') AS document
      FROM post
      JOIN author ON author.id = post.author_id
      JOIN posts_tags ON posts_tags.post_id = post.id
      JOIN tag ON tag.id = posts_tags.tag_id
      GROUP BY post.id, author.id) p_search
WHERE p_search.document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(p_search.document, to_tsquery('english', 'Endangered & Species')) DESC;

In the query above we assign different weights to the different parts of the document: post.title is more important than post.content, which is as important as the tag names. The least important part is author.name.

This means that if we searched for the keyword "Alice", documents containing it in the title would rank above documents containing it only in the content, which in turn would rank above documents containing it only in the author's name.

Given the weights assigned to the parts of the document, ts_rank() returns a floating-point number representing the document's relevance to the query:

SELECT ts_rank(to_tsvector('This is an example of document'),
               to_tsquery('example | document')) AS relevancy;
SELECT ts_rank(to_tsvector('This is an example of document'),
               to_tsquery('example')) AS relevancy;
SELECT ts_rank(to_tsvector('This is an example of document'),
               to_tsquery('example | unknown')) AS relevancy;
SELECT ts_rank(to_tsvector('This is an example of document'),
               to_tsquery('example & document')) AS relevancy;
SELECT ts_rank(to_tsvector('This is an example of document'),
               to_tsquery('example & unknown')) AS relevancy;

The relevancy score is higher the more of the query's terms the document matches; the last query, which requires a term that never appears, scores close to zero.

However, relevance is a vague concept and is application-specific. Different applications may require extra information for ranking, such as the document's modification time. The built-in ranking functions such as ts_rank are only examples. You can write your own ranking functions and/or combine their results with other factors to fit your specific needs.

For example, if we want newer articles to rank higher than old ones, we can divide the ts_rank value by the age of the document (+ 1, to prevent division by zero).
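A sketch of that idea, with the caveat that the schema above has no timestamp: the created_at column is an assumption added for illustration, and the one-day age unit is an arbitrary choice.

```sql
-- Hypothetical recency boost: divide the text rank by the post's age
-- in days (+1 avoids dividing by zero for posts created today).
-- Assumes post has a created_at timestamptz column, which the schema
-- in this article does not define.
SELECT post.id,
       post.title,
       ts_rank(setweight(to_tsvector(post.title), 'A') ||
               setweight(to_tsvector(post.content), 'B'),
               to_tsquery('Endangered & Species'))
         / (extract(epoch FROM now() - post.created_at) / 86400 + 1) AS score
FROM post
WHERE setweight(to_tsvector(post.title), 'A') ||
      setweight(to_tsvector(post.content), 'B')
      @@ to_tsquery('Endangered & Species')
ORDER BY score DESC;
```

Any monotonic decay (logarithmic, exponential) would work in place of the plain division; the point is only that ts_rank output composes with ordinary arithmetic in ORDER BY.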

Optimization and indexing

Optimizing search over a single table is straightforward: PostgreSQL supports indexes on expressions, so you can easily create a GIN index over a tsvector expression.

CREATE INDEX idx_fts_post ON post
USING gin((setweight(to_tsvector(language::regconfig, title), 'A') ||
           setweight(to_tsvector(language::regconfig, content), 'B')));

GIN or GiST index? These two kinds of index could be the subject of blog posts of their own. GiST can produce false matches, which then require an extra heap fetch to verify the match. GIN, on the other hand, is faster to query but bigger and slower to build.

As a rule of thumb, GIN indexes are best for static data because lookups are faster. For dynamic data, GiST indexes are faster to update. Specifically, GiST indexes are very good for dynamic data and fast if the number of unique words (lexemes) is under 100,000, while GIN indexes will handle 100,000+ lexemes better but are slower to update.

--Postgres documentation: Chapter 12, Full Text Search

In our example we choose GIN, but the choice is not clear-cut; make the decision based on your own data.

Our schema example has one problem: the document is spread across several tables with different weights. For better performance, the data needs to be denormalized with triggers or a materialized view.

You don't always need denormalization, and sometimes an index on an expression, as above, is enough. Alternatively, you can denormalize data within the same table using the Postgres trigger functions tsvector_update_trigger(...) or tsvector_update_trigger_column(...). See the Postgres documentation for more details.
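As a minimal sketch of the trigger approach: the tsv column name is an assumption, and the built-in helper covers only title and content on the post table itself (no weights, no joined author or tag data).

```sql
-- Add a tsvector column to post and keep it up to date with the
-- built-in trigger function on every insert or update.
ALTER TABLE post ADD COLUMN tsv tsvector;

CREATE TRIGGER post_tsv_update
BEFORE INSERT OR UPDATE ON post
FOR EACH ROW
EXECUTE PROCEDURE tsvector_update_trigger(tsv, 'pg_catalog.english', title, content);

-- The denormalized column can then be indexed and queried directly:
CREATE INDEX idx_post_tsv ON post USING gin(tsv);
```

Existing rows keep a NULL tsv until they are updated, so a one-off `UPDATE post SET title = title;` (or a manual backfill) is needed after adding the trigger.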

If some delay before results show up is acceptable for your application, a materialized view is a good way to store an extra index table.

CREATE MATERIALIZED VIEW search_index AS
SELECT post.id,
       post.title,
       setweight(to_tsvector(post.language::regconfig, post.title), 'A') ||
       setweight(to_tsvector(post.language::regconfig, post.content), 'B') ||
       setweight(to_tsvector('simple', author.name), 'C') ||
       setweight(to_tsvector('simple', coalesce(string_agg(tag.name, ' '), '')), 'A') AS document
FROM post
JOIN author ON author.id = post.author_id
JOIN posts_tags ON posts_tags.post_id = post.id
JOIN tag ON tag.id = posts_tags.tag_id
GROUP BY post.id, author.id;

After that, reindexing the search engine is as simple as periodically running REFRESH MATERIALIZED VIEW search_index;.

Now we can add an index to the materialized view.

CREATE INDEX idx_fts_search ON search_index USING gin(document);

Querying becomes just as simple:

SELECT id AS post_id, title
FROM search_index
WHERE document @@ to_tsquery('english', 'Endangered & Species')
ORDER BY ts_rank(document, to_tsquery('english', 'Endangered & Species')) DESC;

If the delay becomes unacceptable, you should look at the trigger-based alternative.

There is no single right way to store your documents; it depends on your situation: single table, multiple tables, multiple languages, amount of data... I recommend reading the article "Implementing Multi-Table Full Text Search with Postgres in Rails".

Spelling errors

PostgreSQL provides a very useful extension, pg_trgm. See the pg_trgm documentation.


pg_trgm works with trigrams: n-grams with n == 3. N-grams are useful because they let you find similar strings, and that, in effect, is what a misspelling is: a similar but incorrect word.

SELECT similarity('Something', 'something');
-- 1: identical once case is folded
SELECT similarity('Something', 'samething');
-- high: only one letter differs
SELECT similarity('Something', 'unrelated');
-- 0: no trigrams in common
SELECT similarity('Something', 'everything');
-- low
SELECT similarity('Something', 'omething');
-- high: only one missing letter

As the examples above show, the similarity function returns a floating-point value representing how similar two strings are. Detecting misspellings is then a matter of collecting the lexemes used in the documents and comparing the input text against them. I found 0.5 to be a good similarity threshold for catching misspellings. First, we need to create the unique list of lexemes used in our documents:

CREATE MATERIALIZED VIEW unique_lexeme AS
SELECT word FROM ts_stat(
  'SELECT to_tsvector(''simple'', post.title) ||
          to_tsvector(''simple'', post.content) ||
          to_tsvector(''simple'', author.name) ||
          to_tsvector(''simple'', coalesce(string_agg(tag.name, '' ''), ''''))
   FROM post
   JOIN author ON author.id = post.author_id
   JOIN posts_tags ON posts_tags.post_id = post.id
   JOIN tag ON tag.id = posts_tags.tag_id
   GROUP BY post.id, author.id');

The script above creates a view with a single word column containing every lexeme used in the documents. We use the simple configuration so the view can contain text in any language. Once this materialized view is created, we add an index to make similarity queries faster:

CREATE INDEX words_idx ON unique_lexeme USING gin (word gin_trgm_ops);

Fortunately, the list of unique lexemes used in a search engine does not change quickly, so we only occasionally need to refresh the materialized view:

REFRESH materialized VIEW Unique_lexeme;

Once we have this view, finding the closest match is easy:

SELECT word
FROM unique_lexeme
WHERE similarity(word, 'samething') > 0.5
ORDER BY word <-> 'samething'
LIMIT 1;

This query returns the lexeme that is similar enough (> 0.5) to the misspelled input samething and closest to it. The operator <-> returns the "distance" between its arguments, which is 1 minus similarity().

When you decide to handle misspellings in search, you don't want to run this check on every query. Instead, run it only when a search returns no results, and use its output to suggest alternatives to the user. If your data comes from informal communication, such as a social network, it may itself contain misspellings; in that case you can get better results by appending the similar lexemes to your tsquery.
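A hedged sketch of that fallback, reusing the search_index view and the unique_lexeme view from above (the hard-coded input 'samething' and the 0.5 threshold are just the running example, not fixed values):

```sql
-- Hypothetical second-chance query: OR the closest known lexeme into
-- the tsquery and search again. If unique_lexeme has no word above the
-- threshold, the concatenation yields NULL and the query returns no rows.
SELECT id AS post_id, title
FROM search_index
WHERE document @@ to_tsquery(
  'samething | ' ||
  (SELECT word
   FROM unique_lexeme
   WHERE similarity(word, 'samething') > 0.5
   ORDER BY word <-> 'samething'
   LIMIT 1));
```

In an application you would typically run this from the backend only after the primary query came back empty, and label the results as "did you mean ...?" suggestions.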

"Super Fuzzy Searching on PostgreSQL" is a good reference article about using trigrams for misspellings and search in Postgres.

In my example, the table of unique lexemes stays under 2000 rows. My understanding is that if your texts use more than about 1M unique lexemes, this method may run into performance problems.

About MySQL and RDS (Relational Database Service)

Does this work on Postgres RDS?

All of the examples above can be run on RDS. As far as I know, the only search-related limitation on RDS is that features requiring access to the file system, such as custom dictionaries, spell checkers, synonyms and thesauruses, are not supported. See the Amazon AWS forums for details.

I am using MySQL. Can I use its built-in full-text search?

I wouldn't. Without starting an argument, MySQL's full-text search is very limited: by default it does not support stemming in any language. I did stumble upon a stemming function that can be installed, but MySQL does not support indexes on functions.

So what can you do? Given everything discussed above, if Postgres fits all of your use cases, consider switching databases. Migration can be done easily with tools such as py-mysql2pgsql. Or you can look at more advanced solutions such as SOLR (a Lucene-based full-text search server) and Elasticsearch (an open source, distributed, RESTful search engine based on Lucene).


Conclusion

We've learned how to build a decent multi-language search engine based on documents. This article is only an overview, but it should give you enough background and examples to start building your own. I may have made some mistakes in this article; I would appreciate it if you reported them to me.

Postgres's full-text search is very good, and fast enough. It lets the data in your application grow without depending on another tool to process it. Is Postgres's search a silver bullet? Probably not, if your core business revolves around search.

Some features are missing, but in most use cases you won't need them. There is no doubt that you need to analyze and understand your requirements carefully to determine which search approach to use.

Personally, I want Postgres full-text search to continue to improve, and would like to see features like these added:

    • Additional built-in languages: Chinese, Japanese ...
    • A foreign data wrapper around Lucene. Lucene is still the best tool for full-text search, and integrating it with Postgres would bring many benefits.
    • More ranking and scoring features built in. Elasticsearch and SOLR already provide advanced solutions here.
    • Fuzzy matching in tsquery without relying on trigrams. Elasticsearch provides a very simple way to express fuzzy search queries.
    • The ability to create and edit dictionaries, synonyms and thesauruses with SQL, instead of adding files to the file system.

Postgres is not as advanced as Elasticsearch and SOLR; after all, they are dedicated full-text search tools, while full-text search is just one feature of PostgreSQL, and a good one.
