PostgreSQL Advantages
PostgreSQL is a free, open-source object-relational database management system. It is very powerful and supports one of the richest sets of data types of any database, including network address (IP) types and geometric types, among others.
Many readers have asked the same question: when planning to choose a free, open-source database for a project, should one pick MySQL or PostgreSQL? In response, we interviewed Wang Yang, database technology director at Ping An Technology and a speaker at the Postgres Conference China 2016, and asked him to share some experience and advice from a practitioner's perspective.
InfoQ: What are the advantages of using PostgreSQL versus MySQL for applications in the financial sector?
Wang Yang: This question really comes back to why Ping An Technology chose PostgreSQL as a replacement for Oracle.
It is not that we didn't use MySQL; in fact we introduced MySQL at the end of 2013, more than a year before PostgreSQL. But after evaluation and practical testing, we found that PostgreSQL is better suited to most financial applications, so we chose PostgreSQL as Ping An Technology's mainstream open-source database.
Many factors went into this. Payment scenarios are only one focused corner of the financial world. Ping An Group is a comprehensive financial services business covering insurance, funds, investment, securities, and so on. The insurance business is the clearest example: its business logic is extremely complex and involves accessing and computing over large amounts of data, even in an OLTP system. As a result, much of the business logic is implemented inside the database through packages or stored procedures, and this is something PostgreSQL handles very well.
I think of MySQL as a "thin" database, while Oracle and PostgreSQL are "thick" databases. The thin-database philosophy favors a lightweight database: the database only stores data and provides simple query access, while complex business logic is pushed forward to the application server. MySQL itself is not very feature-rich. For example, the InnoDB storage engine only provides index-organized tables as its data storage format, which limits its usage scenarios to some extent. Its support for triggers and stored procedures is weak, and their use is generally discouraged. Application CRUD operations should go through the primary key wherever possible; secondary indexes are supported, but accessing data through them costs performance. For table joins, which are essential in a relational database, MySQL supports only the nested-loop join method, lacking sort-merge join and hash join. And when a query joins more than two tables, the MySQL optimizer sometimes generates suboptimal execution plans, degrading performance.
Because of this, we developed usage specifications for MySQL: keep tables below a certain size, write simple queries wherever possible, access data through primary keys, avoid SQL that joins more than two tables, and so on. MySQL is better suited to OLTP applications with simple business logic. PostgreSQL, by contrast, can support business logic simple or complex, and OLTP or OLAP loads alike, with very mature products built on it; many well-known OLAP database products, such as Greenplum and AWS Redshift, were developed on a PostgreSQL base. PostgreSQL's query optimizer is very powerful and supports all three join methods (nested loop, sort-merge join, and hash join), so optimizing complex SQL statements is not a problem. This is exactly what the insurance business, with its complex logic, needs.
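For illustration, here is a minimal sketch (the tables, columns, and the plan fragment shown are hypothetical and typical, not taken from Ping An's systems) of PostgreSQL's planner choosing a hash join for a two-table join, a method MySQL's optimizer lacked at the time:

```sql
-- Hypothetical schema: policies joined to claims on policy_id.
EXPLAIN
SELECT p.policy_no, c.claim_amount
FROM   policies p
JOIN   claims   c ON c.policy_id = p.policy_id
WHERE  c.claim_date >= DATE '2016-01-01';

-- A typical plan fragment (actual plan depends on data and statistics):
--   Hash Join
--     Hash Cond: (c.policy_id = p.policy_id)
--     ->  Seq Scan on claims c
--     ->  Hash
--           ->  Seq Scan on policies p
```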
PostgreSQL also provides strong support for functions inside the database, which can be written in multiple languages. Complex business-logic computation and large-volume data access can be implemented entirely inside the database, greatly reducing network round-trip costs and thereby improving overall application performance.
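As a sketch of what this looks like (the function and table are hypothetical), a PL/pgSQL function can aggregate a large volume of rows in a single server-side call instead of shipping the rows to the application:

```sql
-- Hypothetical example: compute a policy's total claims server-side.
CREATE OR REPLACE FUNCTION total_claims(p_policy_id bigint)
RETURNS numeric AS $$
DECLARE
    v_total numeric;
BEGIN
    SELECT COALESCE(SUM(claim_amount), 0)
      INTO v_total
      FROM claims
     WHERE policy_id = p_policy_id;
    RETURN v_total;
END;
$$ LANGUAGE plpgsql;

-- One round trip from the application:
--   SELECT total_claims(42);
```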
Therefore, some people say that PostgreSQL is not only a database but also a powerful development platform; its feature richness is a real strength.
I also think PostgreSQL's process-based architecture, as opposed to MySQL's thread-based architecture, is better suited to vertical scaling. We all talk about distributed architectures, scale-out, and share-nothing designs; looked at microscopically, the process and thread architectures differ in exactly this way. A process-based architecture is itself a distributed architecture within the operating system: each process has its own address space and shares nothing, communicating through IPC mechanisms (shared memory and semaphores) only when interaction is needed, and most of the time handling its own work without interference. A thread-based architecture shares the address space, effectively share-everything, and must manage the relationships between threads and control access to shared resources. Creating a process therefore costs much more than creating a thread, but processes scale more easily. For an enterprise of Ping An's scale, system load is very heavy; scale-out matters, but vertical scaling capability is unquestionably a factor to consider. Moreover, both PostgreSQL's and MySQL's scale-out capabilities are implemented by a middle tier or proxy layer rather than by the database kernel, so there is no difference there, at least for now; the kernel's ability to scale vertically is what matters more. As for scale-out technology, MySQL has Fabric and TDDL, and PostgreSQL's distributed development has also moved quickly in recent years, with products such as Postgres-XL, pg_shard, and Postgres-XZ.
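The process-per-connection model is easy to observe directly; a small sketch (works on 9.2 or later, output values will vary):

```sql
-- Each client connection is served by its own backend process.
SELECT pg_backend_pid();   -- OS process ID of this session's backend

-- All backend processes currently serving connections:
SELECT pid, usename, state
FROM   pg_stat_activity;
```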
There is another important point, concerning TCO (total cost of ownership). At first glance there is no difference between PostgreSQL and MySQL, since both are open-source products. But in fact TCO, beyond any license costs, also includes learning costs and operational costs. The financial sector has traditionally run on commercial databases such as Oracle and DB2, which have decades of development behind them and are very feature-rich. PostgreSQL is likewise a thick database, and it is no wonder it is known as the Oracle of the open-source world. Ask yourself: for developers and DBAs accustomed to commercial databases, which database feels more familiar? Which has the lower learning cost? Which lets years of experience transfer and be reused more easily? Which has the lower operational cost? I think the answer is obvious, and this was one of the important considerations in Ping An's choice of PostgreSQL.
InfoQ: When the volume of data is very large, how does Ping An Technology use PostgreSQL to achieve offsite backup?
Wang Yang: Ping An attaches great importance to system high availability and data security. Combining years of accumulated database operations experience, we have built an MAA (Maximum Availability Architecture) for PostgreSQL. I take "backup" here in the broad sense, not just tape backup. Our tape backup of PostgreSQL is done on the local primary. Because writing directly to tape is not supported, the backup is first written to WOS (Web Object Scaler) distributed storage and then copied from WOS to tape.
Since backing up on the primary competes with the application for IO resources, we originally also evaluated backing up on a local standby, but the pg_rman tool we use still needs to issue the start-backup and end-backup commands on the primary. Considering the network impact, and given that IO resources are currently abundant, the backup remains on the primary for now; we may migrate backup operations to the standby later. Offsite database backup is achieved by building a remote disaster-recovery standby: PostgreSQL's own streaming replication transmits and applies the WAL asynchronously to keep the primary and the remote standby in sync.
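As a hedged sketch of what such an asynchronous streaming setup involves on a 9.x-era PostgreSQL (values and host names are illustrative; ALTER SYSTEM requires 9.4 or later):

```sql
-- Primary-side settings for streaming replication (restart required):
ALTER SYSTEM SET wal_level = 'hot_standby';
ALTER SYSTEM SET max_wal_senders = 5;
ALTER SYSTEM SET wal_keep_segments = 512;

-- The remote standby is seeded with a base backup and started with a
-- recovery.conf along these lines (a config file, shown as comments):
--   standby_mode     = 'on'
--   primary_conninfo = 'host=primary.example port=5432 user=repl'
```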
In this setup, the lag between primary and standby depends on the rate of WAL generation, not on database size. Our largest PostgreSQL cluster is currently 3TB and produces around 80MB of WAL per minute; under such a load, the primary and the remote standby stay essentially in near-synchronous step.
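For reference, replication lag can be measured with built-in functions; a sketch using the 9.x-era function names (renamed in PostgreSQL 10):

```sql
-- On the primary: the current WAL write location.
SELECT pg_current_xlog_location();

-- On the standby: replay gap in bytes, and replay delay in time.
SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
                             pg_last_xlog_replay_location()) AS replay_gap_bytes,
       now() - pg_last_xact_replay_timestamp()               AS replay_delay;
```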
InfoQ: Using the open-source PostgreSQL database, how does Ping An Technology optimize the stability and performance of the system?
Wang Yang: Indeed, PostgreSQL is an open-source database, and we were especially concerned about system stability when we introduced it. Ping An is an integrated financial services group whose business covers every corner of the financial industry, and the various lines of business have complex mutual call relationships; if one system has a problem, there is often a chain reaction. Because failures can cause not only economic loss but also reputational damage, we put a great deal of work into the high availability, reliability, and stability of the system.
Initially we adopted a one-primary, two-standby high-availability architecture: one primary and one standby locally, one standby remotely, all on local disks. That way, when the primary has a problem, service can quickly switch to the local standby. But we soon found that data loss can occur during such a failover! For some financial applications, data loss is unacceptable. Because synchronous replication between primary and standby would slow OLTP applications due to network latency, replication was left asynchronous (although PostgreSQL does support synchronous replication, and does it better than MySQL).
Since strong consistency between primary and standby was not guaranteed, data loss could occur during a switchover, affecting the RPO. At the same time, local disks coupled computation and storage tightly. Our data showed that CPU and RAM caused problems more often than storage did; so even a CPU or RAM failure, with storage perfectly intact, would force a switchover and thus potential data loss. So we optimized the architecture: we moved from local disks to shared disks and built a primary/standby server cluster for PostgreSQL on top. Now, when a server fails, the cluster quickly moves the Postgres instance from the failed host to the standby host, and because the disk is shared, data consistency and integrity are guaranteed; no data is lost.
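For reference (a sketch of PostgreSQL's built-in mechanism, not the configuration Ping An runs), synchronous replication is enabled per standby and can be relaxed per transaction where latency matters:

```sql
-- Name the standby (matching its application_name) whose confirmation
-- is required before the primary acknowledges a commit:
ALTER SYSTEM SET synchronous_standby_names = 'dr_standby';

-- Latency-sensitive work can opt out for a single transaction:
BEGIN;
SET LOCAL synchronous_commit = 'local';
-- ... writes that tolerate asynchronous replication ...
COMMIT;
```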
In terms of performance, shared storage also provides better IO than local disks, in both IOPS and throughput. Every architecture optimization is not only proven feasible in theory but also rigorously tested and verified before implementation. The choice of Linux IO scheduler reflects this. The Linux kernel offers four IO scheduling strategies: noop, CFQ (Completely Fair Queueing), deadline, and anticipatory. We ran IO benchmarks on both mechanical disks and SSDs, simulating database IO behavior in the tests: different combinations of random reads, random writes, sequential reads, and sequential writes, at different IO sizes, monitoring performance under each scheduling strategy in each combination, and then set IO scheduler specifications based on the performance data. On mechanical disks, CFQ and deadline outperformed the other two strategies, with little difference between them, so we chose CFQ for its greater adaptability. On SSD flash, our tests showed noop and deadline beating CFQ and anticipatory, and since the RHEL 7 kernel already defaults to deadline, we use deadline on SSDs.
For an open-source database, monitoring is naturally essential too. All our databases are monitored with a unified Zabbix framework, which significantly reduces operational costs. Combining PostgreSQL's own characteristics with years of Oracle operations experience, we customized a number of monitoring metrics: monitoring of slow queries, monitoring of the remaining transaction age (to decide whether an early vacuum is needed), monitoring of checkpoint efficiency, and so on. To get a comprehensive view of each cluster's performance, and to make trend analysis and anomaly location easier, we borrowed the excellent AWR (Automatic Workload Repository) concept from Oracle and developed our own pgawr database performance reporting tool. It contains not only OS performance data but also the overall load and performance data of the DB, plus lists of SQL statements ranked along different dimensions, so that when a database host runs abnormally we can quickly diagnose the problem and find the offending SQL.
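Sketches of the kinds of checks described, using standard 9.x-era catalogs (thresholds are illustrative):

```sql
-- Slow queries: statements running longer than one minute.
SELECT pid, now() - query_start AS runtime, state, query
FROM   pg_stat_activity
WHERE  state <> 'idle'
  AND  now() - query_start > interval '1 minute';

-- Remaining transaction age: how close each database is to xid
-- wraparound, i.e. whether an early (anti-wraparound) vacuum is needed.
SELECT datname, age(datfrozenxid) AS xid_age
FROM   pg_database
ORDER  BY xid_age DESC;

-- Checkpoint efficiency: timed vs. requested checkpoints.
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM   pg_stat_bgwriter;
```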
All of this is aimed at making PostgreSQL run more stably, and at ensuring that when problems do occur they can be quickly located, analyzed, and resolved, meeting the financial industry's requirements for system uptime and recovery time.
InfoQ: Can you tell us about Ping An Technology's open-source database architecture design, and what specifications you have set?
Wang Yang: From the high-availability perspective, protection for the PostgreSQL database is divided into two levels: host failures first, then storage failures. Historically most problems have been host failures, with storage failures accounting for only a small percentage. Given the differing failure probabilities of hosts and storage, we decoupled computation from storage. Against host failures, we use shared storage and a Linux server cluster, so we can switch quickly when a host problem occurs; the decoupling of computing and storage also makes planned host maintenance more convenient and greatly reduces the possibility of data loss. Storage failures are rare, but their impact is very large, so we do not neglect them either: we built PostgreSQL standbys both locally and remotely, each replicating asynchronously from the primary. If the primary's storage has a problem that cannot be recovered quickly, we switch to the local standby to continue service; if a regional problem occurs, we switch to the remote standby. This achieves "two sites, three centers" protection.
In addition to the high-availability architecture specification, our PostgreSQL architecture specification also defines:
- The PostgreSQL version number used in the production environment, and the PostgreSQL patch and upgrade strategy;
- Naming conventions for server and operating-system users;
- The character set used by database applications;
- Volume and directory naming rules for each cluster;
- Standard parameter settings for the PostgreSQL database.
In addition, we have developed a user architecture specification, which defines the standard initialization users of a PostgreSQL cluster, such as the DBA user, the deployment user, the replication user, the monitoring user, and so on. For each database we create an owner user and application connection users, following the principle of least privilege: application connection users are granted only the permissions they need, avoiding the risk of harmful operations. For operators, the specification likewise defines dedicated users for each responsibility. This separates DBA, operations, and developer permissions, minimizes potential harm to the database, and makes subsequent auditing and tracking easier.
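A minimal sketch of such a least-privilege layout (role, database, and schema names here are hypothetical, not Ping An's actual naming convention):

```sql
-- Separate roles for ownership, application access, and monitoring:
CREATE ROLE app_owner LOGIN;   -- owns the schema objects
CREATE ROLE app_conn  LOGIN;   -- used by the application to connect
CREATE ROLE monitor   LOGIN;   -- read-only monitoring user

GRANT CONNECT ON DATABASE appdb TO app_conn, monitor;
GRANT USAGE   ON SCHEMA app     TO app_conn;
GRANT SELECT, INSERT, UPDATE, DELETE
      ON ALL TABLES IN SCHEMA app TO app_conn;   -- DML only: no DDL
```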
For the healthy, stable operation of the database system, development specifications matter just as much. In the development specification aimed at developers, we describe naming conventions and design guidelines separately. The naming conventions cover database objects such as tables, columns, indexes, and views, while the design guidelines cover principles for designing tables, indexes, and partitioned tables. These specifications let developers design application systems relatively simply and reduce performance risk when the system runs. Furthermore, we have integrated these specifications into our database automation audit platform: for example, at the version-submission stage it automatically audits the database version for compliance and checks whether SQL statements contain full table scans, implicit conversions, and other high-risk operations, further strengthening quality control, reducing go-live risk, and simplifying operations.
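A sketch of one such high-risk pattern an audit can flag (the table and index are hypothetical): a conversion applied to an indexed column defeats the index and forces a full table scan, while the equivalent range predicate does not.

```sql
-- Hypothetical table: trades(trade_time timestamp, ...),
-- with an index on trade_time.

EXPLAIN SELECT * FROM trades
WHERE  trade_time::date = DATE '2016-01-01';   -- cast on the column
--   Seq Scan on trades ...                       full table scan

EXPLAIN SELECT * FROM trades
WHERE  trade_time >= TIMESTAMP '2016-01-01'
  AND  trade_time <  TIMESTAMP '2016-01-02';   -- sargable rewrite
--   Index Scan using trades_trade_time_idx ...   uses the index
```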
InfoQ: What new technologies have you introduced to optimize the architecture and specifications during subsequent operations, and to what end?
Wang Yang: After all, PostgreSQL has not been at Ping An for long, about a year and a half so far. Although we formulated a PostgreSQL rollout strategy that spreads adoption gradually by system complexity and system importance level, we still ran into problems along the way. Our requirement is that every problem be traced to its root cause; only then can we adjust and optimize in a targeted way.
The move described earlier, from a primary/standby architecture on local disks to a Linux server cluster with shared storage, is an obvious example. Let me share one more. Once, on a running database, we found WAL being generated extremely fast: the database itself was only 100GB, but it produced more than 300GB of WAL per hour. This was a typical small-database, high-load case. Detailed analysis showed that beyond the database's inherently high load, the problem was also related to unoptimized checkpoint parameters: the high load and the checkpoint configuration formed a vicious circle, generating a huge amount of WAL and putting disk space, and thus availability, at risk. The parameters were adjusted once, but because the root cause had not been found, the problem was not fundamentally solved. With more and more high-load, important databases moving to PostgreSQL, this was a large hidden danger for continued rollout, so we had to find the root cause and fix it thoroughly. After studying the principles of WAL generation, analyzing the logs, and even reading the PostgreSQL source code, we finally reached a clearer understanding of the WAL generation mechanism, and formulated a parameter-adjustment strategy along with a specification for future versions. The problem mainly involves two parameters: the WAL segment size and checkpoint_segments. The defaults are low, 16MB for the WAL segment size and 32 for checkpoint_segments, which is not well suited to high-load financial applications. Since the WAL segment size can only be set when PostgreSQL is compiled, only checkpoint_segments could be adjusted on the existing system. On this particular database, checkpoint_segments was first raised to 128 without finding the root cause, which did not fundamentally solve the problem; after the analysis, raising it to 512 reduced the WAL volume significantly and achieved the desired result. For future PostgreSQL versions, we adjusted the specification to set the WAL segment size to 64MB at compile time (four times the default), and to tune the max_wal_size parameter introduced in 9.5 flexibly according to system load, so the system can run financial applications more stably.
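A hedged sketch of the tuning described (parameter values here are illustrative, not Ping An's production settings; ALTER SYSTEM requires 9.4 or later):

```sql
-- Before 9.5 (postgresql.conf): allow more WAL between checkpoints.
--   checkpoint_segments = 512

-- 9.5 and later: checkpoint_segments is gone; size the WAL instead.
ALTER SYSTEM SET max_wal_size = '8GB';
SELECT pg_reload_conf();   -- max_wal_size is reloadable

-- The WAL segment size itself is fixed at compile time, e.g.:
--   ./configure --with-wal-segsize=64     (64MB segments, 4x default)
```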
In addition, as PostgreSQL's support for JSON has grown richer and more mature, we have started recommending the JSON data types on PostgreSQL. In the relational database world, PostgreSQL supported JSON earlier than MySQL and Oracle: MySQL added JSON support in 5.7, and Oracle in 12cR1 (backported to 11.2.0.4); PostgreSQL was ahead of both. Compared with MongoDB, the leading document database with native JSON support, our tests and validation found PostgreSQL's performance to be quite respectable. And while MongoDB, being NoSQL, does not support joins between tables (or collections), PostgreSQL fills this gap. For applications that need JSON document storage but must also join it with relational, structured data, PostgreSQL is undoubtedly a very suitable database.
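A minimal sketch of mixing documents with relational data (table and column names are hypothetical; jsonb and GIN indexing require 9.4 or later):

```sql
CREATE TABLE customers (
    customer_id bigint PRIMARY KEY,
    name        text
);

CREATE TABLE events (
    event_id    bigserial PRIMARY KEY,
    customer_id bigint REFERENCES customers,
    payload     jsonb               -- schemaless document part
);

-- Join JSON documents with relational data, filtering on a JSON key:
SELECT c.name, e.payload ->> 'channel' AS channel
FROM   events e
JOIN   customers c USING (customer_id)
WHERE  e.payload @> '{"type": "payment"}';

-- A GIN index accelerates the containment (@>) predicate:
CREATE INDEX ON events USING gin (payload);
```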
Of course, we are still constantly learning about PostgreSQL and exploring it, but the better we understand it, the more our confidence grows. In particular, the recently released PostgreSQL 9.6 brings many eagerly anticipated new features and lets us see PostgreSQL's future!
InfoQ: With the combination of PostgreSQL and finance working so well, what insights can you share with InfoQ's readers?
Wang Yang: 2015 could be called Ping An's Year One of PostgreSQL. From first contact with PostgreSQL at the end of 2014, through preparation, research, evaluation, testing, specification-setting, and rollout strategy at the beginning of 2015, to real adoption, we now have more than 1,000 PostgreSQL instances running. Along the way we have been proud and happy about the achievements, but we have also felt the difficulty. Today, though, I can say the original choice was correct. I am glad we chose such an excellent database product as PostgreSQL, and that through the team's efforts PostgreSQL has taken root, sprouted, and thrived at Ping An. The process was not free of doubters, but we believed in our own judgment and choice, and more importantly we believed that the PostgreSQL product itself could help us run financial applications stably, reduce database operating costs, embrace open-source technology, and achieve the goal of independently controllable IT. So we patiently persuaded, cooperated fully, and guided carefully until we won everyone's approval. Gradually, facts and figures proved that PostgreSQL is a very stable, high-performance database product capable of running financial-grade applications.
Every database product has its application scenarios, and every company has different resources. When choosing a database product, do not simply follow the herd; analyze your own situation comprehensively, taking in future database technology trends, business characteristics, financial regulatory requirements, and the skills of your staff, and then choose the database that suits you best. Moreover, with database technology booming today, an enterprise may well choose several database products for different application scenarios, meeting the needs of storing and accessing different data types. That is itself a process of finding the optimal configuration and maximizing value.
Another point I think is very important: use each database according to its original design ideas and design goals, rather than treating it as a black box. Understand it first, then adapt to it; only in this way can you achieve the best fit and extract its full performance. Otherwise, as the saying goes, a melon forced off the vine is not sweet, and performance problems will follow.
Although PostgreSQL's application scale at Ping An Technology is already very large and much has been achieved, Ping An still has a long way to go with PostgreSQL. For example: how to use distributed architecture to enhance PostgreSQL's scale-out capability and support VLDB (Very Large Database) workloads, running not only OLAP but also OLTP loads; and how to enrich PostgreSQL's performance and diagnostic data to further improve problem analysis, localization, and resolution, minimizing the impact on the system.
These are the challenges we face next, especially for an integrated financial services group like Ping An. But it is gratifying to see PostgreSQL growing ever more mature and capable: many of the features we want are either already implemented or already in the roadmap. PostgreSQL's community keeps growing and becoming more active, and new companies keep adopting it. In such a favorable situation, I believe PostgreSQL will consolidate its leadership in the open-source database field, becoming not only the most advanced open-source database but also the most powerful open-source database for the financial services industry!