Original post (Chinese translation): https://www.oschina.net/translate/showdown-mysql-8-vs-postgresql-10?lang=chs&page=2#
English original: http://rachbelaid.com/introduction-to-postgres-physical-storage/
Now that MySQL 8 and PostgreSQL 10 have been released, it is a good time to revisit how the two major open-source relational databases compete with each other.
Before these releases, the common view was that Postgres was stronger on feature set and admired for its "academic" style, while MySQL was more battle-tested for massively concurrent reads and writes.
But with the latest releases, the gap between the two has narrowed noticeably.
Feature comparison
Let's take a look at the "trendy" features we all like to talk about.
| Feature | MySQL 8 | PostgreSQL |
| --- | --- | --- |
| **Queries & analytics** | | |
| Common table expressions (CTEs) | ✓ New | ✓ |
| Window functions | ✓ New | ✓ |
| **Data types** | | |
| JSON support | ✓ Improved | ✓ |
| GIS / SRS | ✓ Improved | ✓ |
| Full-text search | ✓ | ✓ |
| **Scalability** | | |
| Logical replication | ✓ | ✓ New |
| Semi-synchronous replication | ✓ | ✓ New |
| Declarative partitioning | ✓ | ✓ New |
It used to be said that MySQL was best suited to online transaction processing and PostgreSQL to analytical workloads. That is no longer true.
Common table expressions (CTEs) and window functions used to be the main reasons to pick PostgreSQL. But now, recursively traversing an employees table by referencing boss_id in the same table, or finding a median (the 50th percentile) in a sorted result set, is no longer a problem on MySQL.
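As a quick illustration, here is a minimal, self-contained sketch of both features using Python's bundled SQLite (which also supports recursive CTEs, and gained window functions in 3.25); the employees table, its columns (id, name, boss_id, salary), and the data are invented for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        boss_id INTEGER,   -- self-reference to employees.id
        salary  INTEGER
    );
    INSERT INTO employees VALUES
        (1, 'Ann', NULL, 100), (2, 'Bob', 1, 80), (3, 'Cid', 2, 60);
""")

# Recursive CTE: walk the reporting chain from the root downward.
chain = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE boss_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e JOIN chain c ON e.boss_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth
""").fetchall()

# Window functions: the median salary of a sorted result set.
(median,) = conn.execute("""
    SELECT AVG(salary) FROM (
        SELECT salary,
               ROW_NUMBER() OVER (ORDER BY salary) AS rn,
               COUNT(*)     OVER ()                AS cnt
        FROM employees
    ) WHERE rn IN ((cnt + 1) / 2, (cnt + 2) / 2)
""").fetchone()

print(chain)   # [('Ann', 0), ('Bob', 1), ('Cid', 2)]
print(median)  # 80.0
```

The same queries, modulo dialect details, now run on both MySQL 8 and PostgreSQL.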
Replication in PostgreSQL long lacked configuration flexibility, which is why Uber switched to MySQL. But now, with the logical replication feature, you can perform a zero-downtime upgrade by creating a replica on a new Postgres version and switching over to it. Truncating a stale partition in a huge time-series event table is also much easier.
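A rough sketch of the PostgreSQL 10 publication/subscription commands such a switchover builds on; the names, host, and credentials here are hypothetical, and a real migration needs schema syncing and cutover steps around them:

```sql
-- On the old (publisher) cluster:
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new-version (subscriber) cluster, after loading the schema:
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-db dbname=app user=replicator'
    PUBLICATION upgrade_pub;
```

Once the subscriber has caught up, you repoint the application and retire the old cluster.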
Feature-wise, the two databases have now converged.
What are the differences?
That leaves us with just one question: so what would make you choose one over the other?
Ecosystem is one factor. MySQL has a vibrant ecosystem, including MariaDB, Percona, Galera, and more, as well as storage engines other than InnoDB, though this can also be confusing. Postgres's high-end options have been limited, but that is changing with the new features introduced in the latest release.
Governance is another factor. When Oracle (or originally, Sun) acquired MySQL, everyone feared they would ruin the product, but that has not been the case for the past ten years. In fact, development has accelerated since the acquisition. Postgres, for its part, has deep experience in community-driven project management and collaboration.
Core architecture does not change very often, and although it has not been discussed in much detail lately, it is worth revisiting.
Here is an overview:
| Feature | MySQL 8 | PostgreSQL |
| --- | --- | --- |
| Architecture | Single process | Multi-process |
| Concurrency | Multithreading | fork(2) |
| Table structure | Clustered index | Heap |
| Page compression | Transparent | TOAST |
| Updates | In-place / rollback segments | Append-only / HOT |
| Garbage collection | Purge threads | Autovacuum processes |
| Transaction log | REDO log (WAL) | WAL |
| Replication log | Separate (binlog) | WAL |
Process vs Thread
When Postgres forks a child process to establish a connection, each connection can take up to roughly 10MB. That puts much more pressure on memory than MySQL's thread-per-connection model, where the default thread stack size is 256KB on 64-bit platforms. (Of course, thread-local sort buffers and the like make this overhead less significant, though it cannot be ignored.)
Although copy-on-write shares some immutable memory state with the parent process, the baseline overhead of a process-based architecture becomes burdensome once you have more than 1,000 concurrent connections, and it can be one of the most important factors in capacity planning.
Concretely, if you run a Rails application on 30 servers, each with 16 CPU cores and 32 threads, you have 960 connections. Probably fewer than 0.1% of applications will ever go beyond that scale, but it is something to keep in mind.
Clustered index vs heap table
A clustered index is a table structure in which rows are embedded directly in the B-tree of their primary key. A (non-clustered) heap is a regular table structure in which rows are stored separately from the indexes.
With a clustered index, a lookup by primary key can retrieve an entire row with a single I/O; without one, you always have to chase a reference, which takes at least two I/Os. Since foreign-key references and JOINs trigger primary-key lookups, the impact can be large, compounding over a huge number of queries.
One theoretical disadvantage of a clustered index is that a lookup through a secondary index has to traverse twice as many tree nodes: first a scan of the secondary index, then a traversal of the clustered index, which is itself a tree.
However, if you follow the modern table-design convention of using an auto-increment integer as the primary key [1] (a so-called surrogate key), a clustered index is almost always desirable. It is an even better fit if you often do ORDER BY id to retrieve the most recent (or oldest) N records.
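To illustrate the access pattern, here is a small sketch using Python's SQLite, whose INTEGER PRIMARY KEY is the rowid, so the table is effectively organized by the key, similar in spirit to InnoDB's clustered index; the events table is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY AUTOINCREMENT is a surrogate key: rows are stored
# in key order, so the newest rows sit at the tail of the B-tree.
conn.execute("""
    CREATE TABLE events (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT
    )
""")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [(f"event-{i}",) for i in range(100)])

# "Most recent N records": a short walk down the right edge of the tree,
# with each row retrieved alongside its key.
latest = conn.execute(
    "SELECT id FROM events ORDER BY id DESC LIMIT 3").fetchall()
print(latest)  # [(100,), (99,), (98,)]
```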
Postgres does not support clustered indexes, while MySQL (InnoDB) does not support heaps. But either way, if you have plenty of memory, the difference should be small.
Page structure and compression
Both Postgres and MySQL use page-based physical storage (8KB and 16KB pages, respectively).
(For background, see the "Introduction to Postgres physical storage" article linked at the top.)
A Postgres page starts with a header holding metadata about the page, which we will not discuss in detail here. The header is followed by an array of item identifiers, each an (offset, length) pair pointing to a tuple, that is, a data row. In Postgres, multiple versions of the same record can be stored on the same page in this way.
MySQL's tablespace structure is similar to Oracle's, with multiple levels: segments, extents, pages, and rows.
It also has a separate segment for undo data, called the rollback segment. Unlike Postgres, MySQL keeps multiple versions of the same record in this separate area.
In both databases, a row must fit in a single page, which means a row must be smaller than 8KB. (MySQL must fit at least 2 rows per page, which works out to 16KB / 2 = 8KB.)
So what happens when you have a large JSON object in a column?
Postgres uses TOAST, a dedicated shadow-table store. Large objects are pulled out and stored there when possible, so when you select rows and columns, the big blobs do not pollute your precious buffer cache. TOASTed objects are compressed as well.
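For the curious, PostgreSQL exposes this machinery; the table and column names below are hypothetical:

```sql
-- Locate the TOAST relation backing a table:
SELECT reltoastrelid::regclass FROM pg_class WHERE relname = 'docs';

-- Change a column's strategy, e.g. EXTERNAL = out-of-line, uncompressed:
ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;
```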
MySQL has a more sophisticated feature called transparent page compression, thanks to a contribution from the high-end SSD storage vendor Fusion-io. It was designed to make better use of SSDs, where write volume is directly tied to the lifetime of the device.
MySQL's compression applies not only to large objects stored off-page but to all pages. It works by punching holes in sparse files, which is supported by modern filesystems such as ext4 and Btrfs.
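Enabling it is a per-table option; a sketch with an invented table name, assuming file-per-table tablespaces and a filesystem with hole-punching support:

```sql
-- Transparent page compression: pages are compressed on write, and the
-- unused tail of each 16KB page is released via a filesystem hole punch.
CREATE TABLE metrics (
    id  BIGINT PRIMARY KEY,
    doc JSON
) COMPRESSION = 'zlib';
```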
For more details, see: "Significant performance boost with new MariaDB page compression on FusionIO".
The cost of updates
Another often-overlooked feature with a big impact on performance, and perhaps the most controversial topic, is updates.
It is another reason Uber abandoned Postgres, which prompted many Postgres advocates to publish rebuttals.
Both are MVCC databases, keeping multiple versions of data for isolation.
To do this, Postgres keeps the old data in the heap until it is vacuumed, while MySQL moves the old data to a separate area called the rollback segment.
In Postgres, when you update a row, the entire row must be copied, along with the index entries pointing to it. This is partly because Postgres does not support clustered indexes, so the physical location of a row referenced from an index is not abstracted behind a logical key.
To address this, Postgres uses heap-only tuples (HOT) to avoid updating the indexes when possible. But if updates are frequent enough (or if a tuple is large), the tuple's history can easily exceed the 8KB page size, spilling across multiple pages and limiting the feature's effectiveness. When to prune and/or defragment is decided by heuristics. Also, setting a fillfactor lower than 100 reduces space efficiency, a trade-off that is hard to reason about at table-creation time.
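For instance, a lower fillfactor can be set at creation time to leave headroom for HOT; the table and the 90% figure below are illustrative only:

```sql
-- Leave ~10% of every heap page free so that updated row versions can
-- stay on the same page and qualify as heap-only tuple (HOT) updates.
CREATE TABLE counters (
    id  BIGINT PRIMARY KEY,
    val BIGINT
) WITH (fillfactor = 90);
```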
This restriction goes even deeper: because index tuples carry no information about transactions, index-only scans were not supported until 9.2. It is one of the oldest and most important optimizations, supported by all major databases including MySQL, Oracle, IBM DB2, and Microsoft SQL Server. But even in the latest version, when many updates have set dirty bits in the visibility map, Postgres cannot fully rely on index-only scans, and often falls back to a Seq Scan when we don't want one.
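A sketch of the setup an index-only scan needs, with invented names; whether the planner actually uses it depends on how much of the table the visibility map marks as all-visible:

```sql
-- A covering index: the query below can, in principle, be answered from
-- the index alone, without touching the heap.
CREATE INDEX orders_cust_amount_idx ON orders (customer_id, amount);

EXPLAIN SELECT customer_id, sum(amount)
FROM orders
GROUP BY customer_id;
-- After heavy updates (dirty visibility-map bits), the plan may degrade
-- from an "Index Only Scan" back to a plain "Seq Scan".
```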
On MySQL, updates happen in place, and the old row versions are sequestered in a separate area called the rollback segment. The result is that you don't need VACUUM; commits are very fast while rollbacks are relatively slow, which is a desirable trade-off for most use cases.
MySQL is also smart enough to purge history as soon as possible. If the transaction isolation level is READ COMMITTED or lower, history is purged as each statement completes.
Transaction history size does not affect the main pages, and fragmentation is a non-issue. As a result, overall performance on MySQL is better and more predictable.
Garbage collection
VACUUM in Postgres is expensive because it works in the main heap area, creating direct resource contention. It feels like garbage collection in a programming language: it gets in the way and can stop you at any time.
Configuring autovacuum for tables with billions of rows remains a challenge.
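Per-table overrides are the usual starting point; the table name and numbers below are illustrative, not recommendations:

```sql
-- The default scale factor of 0.2 means a billion-row table accumulates
-- ~200M dead tuples before autovacuum fires; trigger it much earlier.
ALTER TABLE events SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold    = 10000
);
```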
Purge on MySQL can also be quite heavy, but because it runs with dedicated threads inside the separate rollback segment, it does not adversely affect read concurrency in any way. Still, even with the default configuration, there is a chance that a bloated rollback segment slows you down.
A busy table with billions of rows does not cause historical data bloat on MySQL; things like file size on storage and query performance are nearly predictable and stable.
Logs and replication
Postgres has a single source of transaction history known as the write-ahead log (WAL). It is also used for replication, and the new feature called logical replication decodes the binary content on the fly into easier-to-digest logical statements, allowing fine-grained control over the data.
MySQL maintains two separate logs: 1. the InnoDB-specific redo log for crash recovery, and 2. the binary log (binlog) for replication and incremental backups.
The redo log in InnoDB, consistent with Oracle's design, is a maintenance-free circular buffer that never grows over time; it is created at startup with a fixed size. This design guarantees that a contiguous, sequential area is reserved on the physical device, improving performance. A larger redo log yields higher performance, at the cost of longer crash-recovery time.
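The sizing knob is a startup setting in my.cnf; the figures below are illustrative only, since the right size trades write throughput against recovery time:

```ini
[mysqld]
# Total redo capacity = innodb_log_file_size * innodb_log_files_in_group,
# fixed at startup; bigger logs mean fewer checkpoints but longer recovery.
innodb_log_file_size      = 2G
innodb_log_files_in_group = 2
```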
With the new replication features added to Postgres, I would call it a draw between the two.
If the article above is too long to read, here is the summary:
Surprisingly, it turns out the conventional wisdom still holds: MySQL is best suited to online transactions, and PostgreSQL is best suited to append-only workloads, such as the analytical processes of a data warehouse. [2]
As we saw in this article, the vast majority of Postgres's problems come from the append-only, overly redundant heap structure.
A future version of Postgres may need a major overhaul of its storage engine. You don't have to take my word for it: there is actually a discussion about this on the official wiki, which suggests it is time to learn some good ideas from InnoDB.
People say again and again that MySQL is chasing Postgres, but this time, the tide has turned.
——————————————————————————————————————————
[1] Using a UUID as a primary key is a terrible idea, by the way: cryptographic randomness is designed to kill locality of reference, so performance suffers.
[2] When I say Postgres is especially suitable for analytics, I mean it: in case you haven't heard of TimescaleDB, it is a package on top of PostgreSQL that lets you insert 1 million rows per second and hold 100 billion rows per server. Crazy stuff. No wonder Amazon chose PostgreSQL as the basis for Redshift.
Reprinted from: "Battle of Kings": MySQL 8 vs PostgreSQL 10