PostgreSQL Archive Error




Running out of disk space in the pg_xlog directory is a fairly common Postgres problem. This important directory holds the WAL (Write Ahead Log) files, which contain a record of all changes made to the database; see the link for more details. Because of the near write-only nature of this directory, it is often put on a separate disk. Fixing the out-of-space error is fairly easy; I will discuss a few remedies below.



When the pg_xlog directory fills up and new files cannot be written to it, Postgres will stop running, try to automatically restart, fail to do so, and give up. The pg_xlog directory is so important that Postgres cannot function until enough space has been cleared out for it to start writing files again. When this problem occurs, the Postgres logs give a pretty clear indication of the problem. They will look similar to this:


PANIC:  could not write to file "pg_xlog/xlogtemp.559": No space left on device
STATEMENT:  insert into abc(a) select 123 from generate_series(1,12345)
LOG:  server process (PID 559) was terminated by signal 6: Aborted
DETAIL:  Failed process was running: insert into abc(a) select 123 from generate_series(1,12345)
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2014-09-16 10:36:47 EDT
LOG:  database system was not properly shut down; automatic recovery in progress
FATAL:  the database system is in recovery mode
LOG:  redo starts at 0/162FE44
LOG:  redo done at 0/1FFFF78
LOG:  last completed transaction was at log time 2014-09-16 10:38:50.010177-04
PANIC:  could not write to file "pg_xlog/xlogtemp.563": No space left on device
LOG:  startup process (PID 563) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure


The "PANIC" seen above is the most severe log level Postgres has, and it basically means "full stop right now!". You'll note in the snippet that a normal SQL command caused the problem, which then caused all other Postgres processes to terminate. Postgres then tried to restart itself, but immediately ran into the same problem (no disk space) and thus refused to start back up. (The "FATAL" line above is another client trying to connect while all of this is going on.)



Before we can look at how to fix things, a little background will help. When Postgres is running normally, there is a finite number of WAL files (roughly twice the value of checkpoint_segments) in the pg_xlog directory. Postgres deletes older WAL files, so the total number of files never climbs too high. When something prevents Postgres from removing the older files, the number of WAL files can grow quite dramatically, culminating in the out-of-space condition seen above. Our solution is therefore two-fold: fix whatever is preventing the old files from being deleted, and clear out enough disk space to allow Postgres to start up again.
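As a quick sanity check on a healthy system, you can compare that setting against the number of files currently in pg_xlog. This is just a sketch; the /pgdata path is a placeholder for your own data directory, and the count is approximate (it also includes the archive_status subdirectory):

$ psql -c 'SHOW checkpoint_segments'
$ ls /pgdata/pg_xlog | wc -l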



The first step is to determine why the WAL files are not being removed. The most common cause is a failing archive_command. If this is the case, you will see archive-specific errors in your Postgres log. The usual causes are a failed network, a downed remote server, or incorrect copying permissions. You might see errors like this:



2013-05-06 23:51:35 EDT [19421]: [206-1] user=,db=,remote= LOG:  archive command failed with exit code 14
2013-05-06 23:51:35 EDT [19421]: [207-1] user=,db=,remote= DETAIL:  The failed archive command was: rsync --whole-file --ignore-existing --delete-after -a pg_xlog/000000010000006b00000016 backup:/archive/000000010000006b00000016
rsync: Failed to exec ssh: Permission denied (13)


The above is from an actual bug report; the problem turned out to be SELinux.


There are some other reasons why WAL files might not be removed, such as a failure to complete a checkpoint, but they are very rare, so we'll focus on archive_command. The quickest solution is to fix the underlying problem by bringing the remote server back up, fixing the permissions, and so on. (To debug, try emulating the archive_command by hand with a small text file, as the postgres user; it is generally safe to ship non-WAL files to the same remote directory.) If you cannot easily or quickly get your archive_command working, change it to a dummy command that always returns true; platform-specific examples follow the sketch below.
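For instance, with the rsync-based archive_command shown in the log excerpt above, a hand test run as the postgres user might look roughly like this. The backup host and paths here are placeholders taken from that excerpt; mirror whatever your real archive_command does:

$ sudo -u postgres touch /tmp/archive_test
$ sudo -u postgres rsync --whole-file --ignore-existing -a /tmp/archive_test backup:/archive/archive_test
$ echo $?   # a non-zero exit status points at the underlying failure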


On *nix boxes:


archive_command = '/bin/true'


On BSD boxes:


archive_command = '/usr/bin/true'


On Windows boxes:


archive_command = 'REM'



This will allow the archive_command to complete successfully, and thus let Postgres start removing older, unused WAL files. Note that changing the archive_command this way means you will need to change it back later and create a fresh base backup, so do this only as a last resort. Even after changing the archive_command, you cannot start the server yet, because the lack of disk space is still a problem. Here is what the logs will look like if you try to start it up again:



LOG:  database system shutdown was interrupted; last known up at 2014-09-16 10:38:54 EDT
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/162FE44
LOG:  redo done at 0/1FFFF78
LOG:  last completed transaction was at log time 2014-09-16 10:38:50.010177-04
PANIC:  could not write to file "pg_xlog/xlogtemp.602": No space left on device
LOG:  startup process (PID 602) was terminated by signal 6: Aborted
LOG:  aborting startup due to startup process failure



At this point, you must give Postgres a little bit of room on the partition/disk that the pg_xlog directory is on. There are four approaches to doing so: removing non-WAL files to clear space, moving the pg_xlog directory, resizing the partition it is on, and removing some of the WAL files yourself.



The easiest solution is to clear up space by removing any non-WAL files that are on the same partition. If you don't have pg_xlog on its own partition, just remove a few files (or move them to another partition) and then start Postgres. You don't need much space; a few hundred megabytes should be more than enough.
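A couple of quick commands help here. Assuming the /pgdata/pg_xlog path used elsewhere in this article, the first shows how full the partition is and the second lists the largest candidates to move or delete:

$ df -h /pgdata/pg_xlog
$ du -ah /pgdata 2>/dev/null | sort -rh | head -20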



This problem occurs often enough that I have a best practice: create a dummy file on your pg_xlog partition whose sole purpose is to get deleted after this problem occurs, and thus free up enough space to allow Postgres to start! Disk space is cheap these days, so just create a 300 MB file and put it in place like so (on Linux):



dd if=/dev/zero of=/pgdata/pg_xlog/do_not_move_this_file bs=1MB count=300



This is a nice trick, because you don't have to worry about finding a file to remove or determining which WALs to delete: simply move or delete the dummy file and you are done. Once things are back to normal, don't forget to put it back in place.



Another way to get more space is to simply move your pg_xlog directory to another partition. Create a directory for it on the other partition, copy over all the files, then make pg_xlog a symlink to the new directory. (Thanks to Bruce in the comments below.)
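A rough sketch of that move, assuming Postgres is stopped (it already is in this scenario), the data directory is /pgdata, and the roomier partition is mounted at /bigdisk (both placeholders):

$ mkdir /bigdisk/pg_xlog
$ chown postgres:postgres /bigdisk/pg_xlog
$ cp -a /pgdata/pg_xlog/. /bigdisk/pg_xlog/
$ mv /pgdata/pg_xlog /pgdata/pg_xlog.old
$ ln -s /bigdisk/pg_xlog /pgdata/pg_xlog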



Another way to get more space on your pg_xlog partition is to resize it. Obviously this is only an option if your OS/filesystem has been set up to allow resizing, but if it has, this is a quick and easy way to give Postgres enough space to start up again. No example code for this one, as the way to resize disks varies so much.



The final approach is to remove some older WAL files. This should only be done as a last resort! It is far better to create space some other way, as removing important WAL files can render your database unusable! If you go this route, first determine which files are safest to remove. One way to determine this is to use the pg_controldata program. Just run it with the location of your data directory as the argument, and you will be rewarded with a screenful of arcane information. The important lines will look like this:



Latest checkpoint's REDO location:            0/4000020
Latest checkpoint's REDO WAL file:            000000010000000000000005



The second line shows the last WAL file processed, and it should be safe to remove any files older than that one. (Unfortunately, older versions of PostgreSQL do not show that line, only the REDO location. While the canonical way to translate a location into a filename is the pg_xlogfile_name() function, it is of little use in this situation, as it requires a live database! Thus, we need another solution.)
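For reference, on versions that do show it, pulling those lines out of the pg_controldata output might look like this (assuming /pgdata is your data directory):

$ pg_controldata /pgdata | grep REDO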



Once you know which WAL file to keep from the pg_controldata output, you can simply delete all WAL files older than that one. (As Craig points out in the comments below, you can instead use the pg_archivecleanup program in standalone mode, sketched after the find example below, which will actually work all the way back to version 8.0.) As with all mass deletion actions, I recommend a three-part approach. First, back everything up. This can be as simple as copying all the files in the pg_xlog directory somewhere else. Second, do a trial run. This means seeing what the deletion would do without actually deleting the files. For some commands, this means using a --dry-run or similar option, but in our example below, we can simply leave out the -delete argument. Third, carefully perform the actual deletion. In our example, we could clear the old WAL files by doing the following (the -not -samefile test keeps the reference file itself from being removed):



$ cp -r /pgdata/pg_xlog/* /home/greg/backups/
$ find -L /pgdata/pg_xlog -not -newer /pgdata/pg_xlog/000000010000000000000005 -not -samefile /pgdata/pg_xlog/000000010000000000000005 | sort | less
$ find -L /pgdata/pg_xlog -not -newer /pgdata/pg_xlog/000000010000000000000005 -not -samefile /pgdata/pg_xlog/000000010000000000000005 -delete
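Alternatively, a rough equivalent using pg_archivecleanup in standalone mode, as mentioned above; the -n option does a dry run that only lists the files it would remove, and the second command performs the actual removal of everything older than the named segment:

$ pg_archivecleanup -n /pgdata/pg_xlog 000000010000000000000005
$ pg_archivecleanup /pgdata/pg_xlog 000000010000000000000005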



Once you have straightened out the archive_command and cleared out some disk space, you are ready to start Postgres up. You may want to adjust your pg_hba.conf to keep everyone else out until you verify all is working. When you start Postgres, the logs will look like this:



LOG:  database system was shut down at 2014-09-16 10:28:12 EDT
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started



After a few minutes, check the pg_xlog directory, and you should see that Postgres has deleted all the extra WAL files, and the number left should be roughly twice the checkpoint_segments setting. If you adjusted pg_hba.conf, adjust it again to let clients back in. If you changed your archive_command to always return true, remember to change it back as well and generate a new base backup.
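If you do lock things down while verifying, a temporary pg_hba.conf along these lines is one sketch; keep a copy of your real rules so you can restore them afterwards, and reload the configuration after each change:

# temporary rules: local superuser only, everyone else rejected
local   all   postgres               peer
host    all   all        0.0.0.0/0   reject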



Now that the problem is fixed, how do you prevent it from happening again? First, you should use the tail_n_mail program to monitor your Postgres log files, so the moment the archive_command starts failing, you will receive an email and can deal with it right away. Making sure your pg_xlog partition has plenty of space is a good strategy as well, as the longer it takes to fill up, the more time you have to correct the problem before you run out of disk space.



Another way to stay on top of the problem is to get alerted when the pg_xlog directory starts filling up. Regardless of whether it is on its own partition or not, you should be using a standard tool like Nagios to alert you when the disk space starts to run low. You can also use the check_postgres program to alert you if the number of WAL files in the pg_xlog directory goes above a specified number.
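For example, a standard Nagios disk check plus a check_postgres WAL-file check could look like this; the thresholds and path are illustrative only:

$ check_disk -w 20% -c 10% -p /pgdata/pg_xlog
$ check_postgres.pl --action=wal_files --warning=40 --critical=80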



In summary, here are the things you should do now to prevent, detect, and/or mitigate the problem of running out of disk space in pg_xlog:


Move pg_xlog to its own partition. This not only increases performance, but keeps things simple and makes things like disk resizing easier.
Create a dummy file in the pg_xlog directory as described above. This is a placeholder file that will prevent the partition from being completely filled with WAL files when 100% disk space is reached.
Use tail_n_mail to instantly detect archive_command failures and deal with them before they lead to a disk space error (not to mention the stale standby server problem!)
Monitor the disk space and/or number of WAL files (via check_postgres) so that you are notified that the WALs are growing out of control. Otherwise your first notification may be when the database PANICs and shuts down!


In summary, don't panic if you run out of space. Follow the steps above, and rest assured that no data corruption or data loss has occurred. It's not fun, but there are far worse Postgres problems to run into! :)
8 Comments:



Michael Heaney said ...


A dummy log file - very clever! I'll be implementing this myself.
September 26, 2014 at 9:04:00 AM EDT


Bruce Momjian said ...


Another idea is to move the pg_xlog directory to another partition that has more space, and use a symlink to point to the new location.
September 26, 2014 at 9:26:00 AM EDT


Greg Sabino Mullane said ...


Thanks Bruce, good point.
September 26, 2014 at 4:01:00 PM EDT


Robins Tharakan said ...


Neat! Ran into this a few years back and this post would have helped a lot that day.

The cause in that instance is the only thing I wanted to add to this list.

There, the cause was a rogue SQL that created an audaciously large temp folder (not WAL). Since WAL was not in a separate partition, the first alarm we saw was a PANIC !! (Someone in IT realized then that some basic Nagios alerts on the box would go a long way!)
September 26, 2014 at 10:18:00 PM EDT


Craig Ringer said ...


Please don't manually remove transaction logs except as a last resort.

Use pg_archivecleanup - see http://www.postgresql.org/docs/9.3/static/pgarchivecleanup.html - by preference. Or, as Bruce says, symlink the xlog dir to a different location.

Removing still-required transaction logs will render your database unable to start and the recovery steps required may cause corruption. You should have a big fat warning about this.

Similarly, while bypassing archiving with an archive_command that returns true without archiving will get the master running, you will then have to re-take base backups for any replicas and backups. Failure to do so will leave you with unrecoverable PITR-based backups that you'll only discover when something goes pear-shaped.

While I understand the advice you're giving, and I think it's useful, I think you need some more warnings about consequences and the vital importance of backing everything up if you're going to go deleting WAL archives etc.

September 27, 2014 at 3:45:00 AM EDT


Greg Sabino Mullane said ...


Thanks Craig, I made it clearer that removing WAL is the last resort, that a base backup is needed, and added a mention of pg_archivecleanup.
September 27, 2014 at 9:07:00 AM EDT


Yang Shen said ...


My two cents:

we may use pg_switch_xlog to switch pg_xlog to a new one, and then safely remove the old ones.
November 20, 2014 at 6:03:00 PM EST


Yang Shen said ...


I should say that in production, the way I mentioned is not good. If you set up a read-only server to absorb pg_xlog, 'pg_xlog/' wouldn't grow too much. It could bring up another topic: how to set the number of pg_xlog files.
November 20, 2014 at 6:29:00 PM EST

