Goodbye to MongoDB, hello PostgreSQL
Olery was founded almost five years ago. It started out as a single product (Olery Reputation), developed by a Ruby development agency. Over time we began working on a range of different products and applications. Today our products include not only (Olery) Reputation, but also Olery Feedback, the Hotel Review Data API, widgets that can be embedded into websites, and more products and services in the near future.
The number of applications has grown as well. Today we deploy more than 25 different applications (all Ruby); some of these are web applications (Rails or Sinatra), but most are background processing programs.
We are proud of what we have achieved so far, but behind all of it there has always been one thing: the underlying database. Since Olery was founded we have run a database setup that uses MySQL to store core data (users, contracts, and so on) and MongoDB to store reviews and similar data (that is, data that can easily be restored if it is lost). At first this setup worked well, but as the company grew we began to run into various problems, with MongoDB in particular. Some of these problems were caused by how the applications interacted with the databases, others by the databases themselves.
For example, at a certain point we had to delete about 1 million documents from MongoDB and later re-insert the data. This approach left the database locked for several hours, and service performance naturally suffered. The database was not usable again until it had been repaired (that is, until the repairDatabase command had been run on MongoDB), and the repair itself took several hours to complete, with the exact number of hours depending on the size of the database.
In another instance we noticed degraded performance in our applications and traced it to our MongoDB cluster. However, upon further inspection we could not find the actual cause of the problem. No matter what tools we installed or what commands we ran, the cause remained hidden. It was not until we replaced the primaries of the cluster that performance returned to normal.
These are just two examples; we have had many such cases. The core problem was not just that the database was acting up, but that at no point was there any indication of what was causing the problem.
The Problem of Having No Schema
Another core problem we faced stems from one of MongoDB's fundamental features: the lack of a schema. The lack of a schema may sound interesting, and in some cases it can be beneficial. However, with many uses of a schemaless storage engine it leads to internal, implicit schemas. These schemas are not defined by your storage engine but by the behavior of your application and what it expects.
For example, you may have a collection of documents with a title field that your application expects to be a String. That schema matches the current situation perfectly, even though it is not explicitly defined. But it becomes a problem the moment the data structure changes over time, especially if the old data is not migrated to the new structure (which is quite problematic with some schemaless storage engines). For example, you may have the following Ruby code:
post_slug = post.title.downcase.gsub(/\W+/, '-')
This works fine for every document that has a title field returning a String. However, for documents that use a different field name (such as post_title), or that have no title field at all, it breaks. To handle this situation you need to adjust the code as follows:
if post.title
  post_slug = post.title.downcase.gsub(/\W+/, '-')
else
  # ...
end
Another option is to define a schema in your models. For example, Mongoid, a popular MongoDB ODM for Ruby, lets you do exactly that. However, when using such tools you should ask yourself why they do not define the schema in the database instead. Doing so would solve another problem: reusability. If you have only a single application, defining the schema in code is not a big deal. But if you have many applications, it quickly becomes a big headache.
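As a rough illustration (the Post model and its fields below are hypothetical, not taken from our codebase), defining a schema in application code with Mongoid looks roughly like this:

class Post
  include Mongoid::Document

  # The schema lives in the Ruby class rather than in the database,
  # so every application touching this collection has to repeat it.
  field :title, type: String
  field :body,  type: String
end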
Schemaless storage engines promise to make your life easier by removing the restrictions of a schema. In reality, the responsibility for ensuring data consistency is simply shifted onto you, the user. Sometimes a schemaless engine will do the job, but I am willing to bet that more often it works against you.
Requirements for a Good Database
Olery's growing list of requirements forced us to look for a better database. For any system, and for databases in particular, we care a great deal about the following:
Consistency
Visibility of data and system behavior
Correctness and explicitness
Scalability
Consistency is important because it helps set clear expectations about the system. If the data is always stored in the same way, the system can easily make use of it. If the database layer requires that a certain column of a table must exist, the application layer does not need to check whether that data exists. The database must also guarantee the integrity of every operation, even under high load. Nothing is more frustrating than simply inserting data and not being able to find it again a few minutes later.
Visibility covers two things: the system itself and how easy it is to get data out of it. If the system misbehaves, it should be easy to debug. In turn, users should be able to easily find the data they want to query.
Correctness means the system behaves as we expect it to. If a field is defined as numeric, nobody should be able to insert text into it. MySQL is notorious for exactly this; once you do, you get bogus results.
Scalability applies not only to performance but also to the financial aspect, and to how well the system can cope with changing requirements. A system can hardly be said to perform well if it requires a huge amount of money, or if it slows down the development cycle of the systems that depend on it.
Moving Away from MongoDB
With the above requirements in mind, we began looking for a replacement for MongoDB. The features mentioned above are usually a core set of traditional RDBMS features, so we narrowed the candidates down to two: MySQL and PostgreSQL.
MySQL was the first candidate, because some of our critical data was already stored in it. However, MySQL has problems of its own. For example, when a field is defined as int(11) you can happily insert textual data into it, because MySQL will try to convert it. Some examples:
mysql> create table example ( `number` int(11) not null );
Query OK, 0 rows affected (0.08 sec)

mysql> insert into example (number) values (10);
Query OK, 1 row affected (0.08 sec)

mysql> insert into example (number) values ('wat');
Query OK, 1 row affected, 1 warning (0.10 sec)

mysql> insert into example (number) values ('what is this 10 nonsense');
Query OK, 1 row affected, 1 warning (0.14 sec)

mysql> insert into example (number) values ('10 a');
Query OK, 1 row affected, 1 warning (0.09 sec)

mysql> select * from example;
+--------+
| number |
+--------+
|     10 |
|      0 |
|      0 |
|     10 |
+--------+
4 rows in set (0.00 sec)
It is worth noting that MySQL does issue a warning in these cases. However, warnings are just warnings: they are usually, if not always, ignored.
Another problem with MySQL is that any modification to a table (for example, adding a column) results in the table being locked for both reads and writes. This means that any operation using such a table has to wait until the modification has completed. For tables containing a large amount of data this can take hours and may even lead to application downtime. It has led companies such as SoundCloud to develop their own tools (such as lhm) to work around it.
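For illustration, an lhm-based migration looks roughly like the sketch below (the users table and locale column are made-up examples). Instead of altering the original table in place, lhm copies the rows into a shadow table in small batches and then swaps the two tables:

require 'lhm'

# Hypothetical table and column; the column definition is passed as raw SQL.
Lhm.change_table :users do |m|
  m.add_column :locale, "VARCHAR(5) DEFAULT 'en'"
end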
Having learned about the problems above, we started investigating PostgreSQL. PostgreSQL solves many of the problems that MySQL does not. For example, you cannot insert textual data into a numeric field in PostgreSQL:
olery_development=# create table example ( number int not null );
CREATE TABLE

olery_development=# insert into example (number) values (10);
INSERT 0 1

olery_development=# insert into example (number) values ('wat');
ERROR:  invalid input syntax for integer: "wat"
LINE 1: insert into example (number) values ('wat');
                                             ^

olery_development=# insert into example (number) values ('what is this 10 nonsense');
ERROR:  invalid input syntax for integer: "what is this 10 nonsense"
LINE 1: insert into example (number) values ('what is this 10 nonsen...
                                             ^

olery_development=# insert into example (number) values ('10 a');
ERROR:  invalid input syntax for integer: "10 a"
LINE 1: insert into example (number) values ('10 a');
                                             ^
PostgreSQL also has the ability to alter tables in many ways without locking them for every operation. For example, adding a column that has no default value and that can be set to NULL can be done quickly, without locking the entire table.
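A minimal sketch of such a change (the note column is just an example):

-- With no default value and NULLs allowed, PostgreSQL only has to update
-- its catalog instead of rewriting every row of the table.
ALTER TABLE example ADD COLUMN note text;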
There are other interesting features as well, such as trigram-based indexes and searching, full-text search, support for querying JSON, support for querying and storing key-value pairs, publish/subscribe support, and much more.
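As a small taste of two of these features, the statements below sketch a trigram index and a query against a JSON column; the reviews table, its title column, and the payload column are assumptions, not our actual schema:

-- Trigram-based indexing and fuzzy matching via the pg_trgm extension.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX reviews_title_trgm ON reviews USING gin (title gin_trgm_ops);
SELECT id FROM reviews WHERE title % 'grand hotel';

-- Querying a value stored inside a JSON column.
SELECT id FROM reviews WHERE payload->>'locale' = 'en';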
Most importantly, PostgreSQL strikes a balance between performance, reliability, correctness, and consistency.
Migrating to PostgreSQL
In the end we decided to settle on PostgreSQL for the balance it strikes between the various points we care about. However, migrating an entire platform from MongoDB to a very different database is not easy. To simplify the transition, we broke the process down into three steps:
Set up a PostgreSQL database and migrate a small subset of the data.
Update all applications that rely on MongoDB to rely on PostgreSQL instead, together with any refactoring required to support this.
Migrate the production data to the new database and deploy the new platform.
Data Migration
Before migrating all the data to the new database, we migrated a small subset for testing. If migrating even a small subset of the data already causes a great deal of trouble, then migrating the whole database is pointless.
Although there are ready-made tools for this, some of our data had to be transformed (for example, columns renamed and inconsistent data types fixed), and for that we wrote some tools of our own. Most of these were one-off Ruby scripts, each performing a specific task such as moving over reviews, cleaning up data encodings, and correcting primary key sequences.
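As a rough sketch of the shape of such a one-off script (collection, table, and field names below are made up, and the real scripts also handled renames and type conversions), the general pattern was to read documents through the mongo driver and insert rows through Sequel:

require 'mongo'
require 'sequel'

mongo = Mongo::Client.new(['127.0.0.1:27017'], database: 'olery')
pg    = Sequel.connect('postgres://localhost/olery_development')

# Copy a small subset of reviews, renaming fields and cleaning up
# encodings along the way. All names here are illustrative only.
mongo[:reviews].find.limit(1000).each do |doc|
  pg[:reviews].insert(
    title:      doc['title'].to_s.encode('UTF-8', invalid: :replace, undef: :replace),
    score:      doc['score'],
    locale:     doc['language_code'] || doc['language'],
    created_at: doc['created_at']
  )
end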
In this early testing stage we ran into some data problems, but none of them were a major obstacle to the migration. For example, some user-submitted data was not always properly encoded and could not be imported into the new database without being re-encoded first. An interesting change was that reviews used to store the name of the review's language (for example, "Dutch" or "English"); this has now been changed to the language's code, because our new sentiment analysis system uses language codes rather than names.
Updating the Applications
By far the most time was spent on updating the applications, especially those that depended heavily on MongoDB's aggregation framework. Throw in a few legacy Rails applications and you can easily spend a few weeks on testing alone. The process of updating an application was roughly as follows:
Replace any MongoDB driver/setup code with PostgreSQL-related code
Run the tests
Fix the bugs
Run the tests again, repeating until all tests pass
For non-Rails applications we settled on Sequel; for Rails applications we cannot (at least for now) get rid of ActiveRecord. Sequel is a very good database toolkit that supports the vast majority (if not all) of the PostgreSQL features we want to use. Its DSL-based querying is far more powerful than ActiveRecord's, although it can take a little while to get used to.
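To give an idea of what the first step amounted to in practice, a minimal Sequel setup replacing the old MongoDB driver code looks like the sketch below; the connection URL and the Review model are placeholder examples:

require 'sequel'

DB = Sequel.connect(ENV.fetch('DATABASE_URL', 'postgres://localhost/olery_development'))

# Sequel models read their columns from the database schema,
# so no field definitions have to be repeated in application code.
class Review < Sequel::Model
end

Review.where(locale: 'en').count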
For example, suppose you want to count how many users use each locale, together with each locale's percentage (relative to the whole set). A plain SQL query for this looks as follows:
SELECT locale,
       count(*) AS amount,
       (count(*) / sum(count(*)) OVER ()) * 100.0 AS percentage
FROM users
GROUP BY locale
ORDER BY percentage DESC;
In our case this produces the following output (when using the PostgreSQL command-line interface):
 locale | amount |        percentage
--------+--------+---------------------------
 en     |   2779 | 85.193133047210300429000
 nl     |    386 | 11.833231146535867566000
 it     |     40 |  1.226241569589209074000
 de     |     25 |  0.766400980993255671000
 ru     |     17 |  0.521152667075413857000
        |      7 |  0.214592274678111588000
 fr     |      4 |  0.122624156958920907000
 ja     |      1 |  0.030656039239730227000
 ar-AE  |      1 |  0.030656039239730227000
 eng    |      1 |  0.030656039239730227000
 zh-CN  |      1 |  0.030656039239730227000
(11 rows)
Sequel allows you to write the above query in plain Ruby, without the string fragments that ActiveRecord often requires:
star = Sequel.lit('*')

User.select(:locale)
    .select_append { count(star).as(:amount) }
    .select_append { ((count(star) / sum(count(star)).over) * 100.0).as(:percentage) }
    .group(:locale)
    .order(Sequel.desc(:percentage))
If you do not like using Sequel.lit('*'), you can also use the following syntax:
User.select(:locale)
    .select_append { count(users.*).as(:amount) }
    .select_append { ((count(users.*) / sum(count(users.*)).over) * 100.0).as(:percentage) }
    .group(:locale)
    .order(Sequel.desc(:percentage))
While this may be a bit more verbose, both of the queries above are much easier to reuse, without having to resort to string concatenation.
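The reuse mentioned above is mostly a matter of composition: a dataset can be defined once and refined later without any string concatenation. A small hypothetical example:

# Base dataset: the per-locale counts, defined once.
def locale_counts
  User.select(:locale)
      .select_append { count(Sequel.lit('*')).as(:amount) }
      .group(:locale)
end

# Reused with extra conditions or a different ordering, with no SQL strings involved.
locale_counts.order(Sequel.desc(:amount)).all
locale_counts.where(Sequel.~(locale: nil)).all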
Our Rails applications may also be migrated to Sequel in the future, but considering how tightly Rails is coupled to ActiveRecord, we are not entirely sure whether that is worth the time and effort.
Migrating Production Data
Finally we arrived at the process of migrating the production data. Broadly speaking, there are two ways to do this:
Turn off the entire platform until all data has been migrated.
Migrate data while keeping the system running.
The first option has an obvious drawback: downtime. The second option does not require downtime but is hard to deal with; for example, with this approach you must account for all the data that is added while you are migrating, or you will lose it.
Fortunately, Olery's setup is somewhat unique in that the vast majority of write operations to our database are quite regular, and the frequently changing data (such as user and contact information) accounts for only a small portion of the total data volume, so it takes relatively little time to migrate compared with the review data.
The basic workflow of this part is:
Migrate critical data such as users and contacts, essentially all the data we cannot afford to lose under any circumstances.
Migrate the less important data (that is, data we can re-scrape or re-compute later).
Test that everything works correctly on a set of separate servers.
Switch the production environment over to the new servers.
Re-migrate the data from step 1, ensuring that data created in the meantime is not lost.
Step 2 took by far the most time, roughly 24 hours. By contrast, migrating the data mentioned in steps 1 and 5 took only about 45 minutes.
Conclusion
The migration has been completed, and roughly a month later we are very satisfied with the results. So far the effects have been nothing but positive: application performance has improved dramatically in all kinds of situations. For example, the response times of our Hotel Review Data API (running on Sinatra) are much lower than before the migration:
The migration took place on January 21. The peaks simply indicate a hard restart of the application (leading to slightly slower response times while it happens). After the 21st, the average response time is roughly half of what it used to be.
Another process that saw a huge performance improvement is what we call the "review persister", a background program whose only goal is to save review data (review content, review scores, and so on). When we finally finished the many big changes needed for the migration, the results were exciting:
Our scrapers also became faster:
The performance gain of the scrapers is not as large as that of the review persister. Since the scrapers only use the database to check whether a review already exists (a relatively fast operation), this result is not surprising.
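For reference, that existence check essentially boils down to a one-liner of the following shape (the table and column names are assumptions):

require 'sequel'

DB = Sequel.connect('postgres://localhost/olery_development')

# A cheap indexed lookup: true when the review has not been stored yet.
def new_review?(external_id)
  DB[:reviews].where(external_id: external_id).empty?
end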
Finally we come to the program that schedules the scraping processes (simply called the "scheduler"):
Because the scheduler only runs at a fixed interval, this graph may be a bit hard to read, but either way there is a clear drop in the average processing time after the migration.
In the end we are very pleased with the results so far, and we certainly will not miss MongoDB. Its performance is great, the tooling around it puts other databases to shame, and querying the data is far more pleasant than with MongoDB (especially for non-developers). Although one service (Olery Feedback) still uses MongoDB (albeit on a separate, fairly small cluster), we plan to port it to PostgreSQL in the future as well.
Original article: Goodbye MongoDB, Hello PostgreSQL