PostgreSQL region settings
PostgreSQL region settings
For Chinese users, we recommend that you set the encoding to UTF8 unconditionally in PostgreSQL. for simplified and unified regions (loacle), we recommend that you set the encoding to C, however, Collate and Ctype may affect the performance or functions.
Environment
- Rhel 6.3 x64 VM (4C/8G/300g hdd)
- PostgreSQL 9.6.2
Database
en_US=# \l+ List of databases Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description -----------+----------+----------+------------+------------+-----------------------+---------+------------+-------------------------------------------- en_US | postgres | UTF8 | en_US.UTF8 | en_US.UTF8 | | 7343 kB | pg_default | postgres | postgres | UTF8 | C | C | | 414 MB | pg_default | default administrative connection database template0 | postgres | UTF8 | C | C | =c/postgres +| 7225 kB | pg_default | unmodifiable empty database | | | | | postgres=CTc/postgres | | | template1 | postgres | UTF8 | C | C | =c/postgres +| 7225 kB | pg_default | default template for new databases | | | | | postgres=CTc/postgres | | | zh_CN | postgres | UTF8 | zh_CN.UTF8 | zh_CN.UTF8 | | 7225 kB | pg_default | (5 rows)
Impact of Collate on Functions
Collate will affect the sorting of Chinese characters. In the zh_CN region, Chinese characters are sorted by pinyin, and other regions are sorted by character encoding.
Postgres = # select * from (values ('King'), ('marw'), ('west'), ('yang') a order by; column1 --------- Yang wangxi mink (4 rows) postgres = # \ c en_USYou are now connected to database "en_US" as user "postgres ". en_US = # select * from (values ('King'), ('marw'), ('west'), ('yang') a order by; column1 --------- Yang wangxi mink (4 rows) en_US = # \ c zh_CNYou are now connected to database "zh_CN" as user "postgres ". zh_CN = # select * from (values ('King'), ('marw'), ('west'), ('yang') a order by; column1 --------- mink Wang Xiyang (4 rows)
Test Method for the impact of Collate on Performance
postgres=# create table tb1(c1 text);CREATE TABLETime: 5.653 mspostgres=# insert into tb1 select md5(generate_series(1,1000000)::text);INSERT 0 1000000Time: 2671.929 mspostgres=# vacuum ANALYZE tb1;VACUUMTime: 398.817 mspostgres=# select * from tb1 order by c1 limit 1; c1 ---------------------------------- 0000104cd168386a335ba6bf6e32219d(1 row)Time: 176.779 mspostgres=# create index idx1 on tb1(c1);CREATE INDEXTime: 1549.436 ms
Test Results
Collate/Ctype C en_US.UTF8 zh_CN.UTF8insert 2671 2613 2670vacuum ANALYZE 398 250 396order by 176 388 401create index 1549 7492 7904insert(with index) 11199 15621 16128
Impact of Ctype
Ctype affects the results of pg_trgm and some regular expression matching. For example, if the Ctype is 'C', pg_trgm cannot support Chinese characters.
Postgres = # select show_trgm ('aaabbbc to x'); show_trgm then {"a", "x", "aa", "x", aaa, aab, abb, bbb, bbc, "bc"} (1 row) en_US = # select show_trgm ('aaabbbc to x'); show_trgm done {"a", "aa", 0x27bdf1, 0x30bd19, 0x4624bc, aaa, aab, abb, bbb, bbc, 0x6a2ad5} (1 row) zh_CN = # select show_trgm ('aaabbbc to x '); show_trgm resume {"a", "aa", 0x27bdf1, 0x30bd19, 0x4624bc, aaa, aab, abb, bbb, bbc, 0x6a2ad5} (1 row)
Conclusion
For scenarios with low performance requirements, we recommend that you set both Collate and Ctype to zh_CN.UTF8 and other regions to C.
initdb -E UTF8 --locale=C --lc-collate=zh_CN.UTF8 --lc-ctype=zh_CN.UTF8 ...
For scenarios with high performance requirements, we recommend that you set Ctype to zh_CN.UTF8 and other regions to C. If some queries need to be sorted by pinyin, you can specify Collate as zh_CN in the column definition and SQL expression.
initdb -E UTF8 --locale=C --lc-ctype=zh_CN.UTF8 ...
Reference
- Regions and codes in PostgreSQL