XFS logdev: a perfect solution to the cgroup IOPS-limit interference problem (ext4 data=writeback can also solve it)


https://github.com/digoal/blog/blob/master/201601/20160107_02.md
Author

Digoal

Date

2016-01-07

Label

PostgreSQL, XFS, ext4, cgroup, IOPS, IO hang, writeback, ordered, XFS logdev

Background

On Linux, both ext4 and XFS are journaling file systems: before metadata is written, the metadata journal must be written first.

(The journal is similar to a database redo log and is used for crash recovery.)

Metadata includes the file system's inodes, directories, and indirect-block information. Creating a file (or directory), changing a file's size, or changing a file's modification time all involve metadata writes.

In ext4 and XFS, metadata journal writes are serialized, much like a database redo log.

Cgroup's blkio module controls the read/write IOPS, throughput, and so on of processes against a specified block device.

When we limit IOPS, we may run into interference problems, precisely because metadata journal operations are serialized.

For example:

There is one device; find its major and minor numbers.

# ll /dev/mapper/aliflash-lv01
lrwxrwxrwx 1 root root 7 Jan  7 11:12 /dev/mapper/aliflash-lv01 -> ../dm-0
# ll /dev/dm-0
brw-rw---- 1 root disk 253, 0 Jan  7 11:22 /dev/dm-0

Create an XFS or ext4 file system on this block device and mount it at /data01.

Initialize two PostgreSQL database instances, placed in separate directories under /data01.

Limit one of the PostgreSQL instances to 100 write IOPS on this block device (253:0).

ps -ewf|grep postgres
digoal 24259     1  0 12:58 pts/4  00:00:00 /home/digoal/pgsql9.5/bin/postgres     (listening on 1921)
digoal 24260 24259  0 12:58 ?      00:00:00 postgres: logger process
digoal 24262 24259  0 12:58 ?      00:00:00 postgres: checkpointer process
digoal 24263 24259  0 12:58 ?      00:00:00 postgres: writer process
digoal 24264 24259  0 12:58 ?      00:00:00 postgres: wal writer process
digoal 24265 24259  0 12:58 ?      00:00:00 postgres: autovacuum launcher process
digoal 24266 24259  0 12:58 ?      00:00:00 postgres: stats collector process
digoal 24293     1  0 12:58 pts/4  00:00:00 /home/digoal/pgsql9.5/bin/postgres -D /data01/digoal/pg_root     (listening on 1922)
digoal 24294 24293  0 12:58 ?      00:00:00 postgres: logger process
digoal 24296 24293  0 12:58 ?      00:00:20 postgres: checkpointer process
digoal 24297 24293  0 12:58 ?      00:00:00 postgres: writer process
digoal 24298 24293  0 12:58 ?      00:00:00 postgres: wal writer process
digoal 24299 24293  0 12:58 ?      00:00:00 postgres: autovacuum launcher process
digoal 24300 24293  0 12:58 ?      00:00:00 postgres: stats collector process

Limit the IOPS of the 1921 instance:

cd /sys/fs/cgroup/blkio/
mkdir cg1
cd cg1
echo "253:0 100" > blkio.throttle.write_iops_device
echo 24259 > tasks
echo 24260 > tasks
echo 24262 > tasks
echo 24263 > tasks
echo 24264 > tasks
echo 24265 > tasks
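The steps above can also be wrapped in a small script. A sketch, assuming the cgroup v1 blkio controller is mounted at /sys/fs/cgroup/blkio and that the instance's port appears in the postmaster command line (both are assumptions; adjust to your layout):

```shell
#!/bin/bash
# Sketch: throttle one PostgreSQL instance (listening on 1921) to
# 100 write IOPS on device 253:0 via the cgroup v1 blkio controller.
CG=/sys/fs/cgroup/blkio/cg1
mkdir -p "$CG"
echo "253:0 100" > "$CG/blkio.throttle.write_iops_device"

# Assumption: the port string identifies the postmaster process.
PM=$(pgrep -f 'bin/postgres.*1921' | head -1)   # postmaster pid
for pid in "$PM" $(pgrep -P "$PM"); do          # postmaster + children
  echo "$pid" > "$CG/tasks"
done
```

Needs root and a mounted cgroup v1 blkio hierarchy; newly forked backends must be added to `tasks` as well if they should be throttled.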

Start a stress test that modifies metadata in bulk; CREATE DATABASE works well for this.

(CREATE DATABASE copies the template database's data files in bulk and calls fsync. This produces a large number of metadata modifications, which in turn trigger metadata journal writes.)

vi test.sh

#!/bin/bash
for ((i=0;i<100;i++))
do
  psql -h 127.0.0.1 -p 1921 -c "create database db$i"
done

. ./test.sh  

Watch the block device's IOPS; write IOPS is capped at 100.

iostat -x 1
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.00    0.00    0.03    3.12    0.00   96.84

Device: rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s avgrq-sz avgqu-sz  await  svctm  %util
dm-0      0.00   0.00  0.00 100.00    0.00 1600.00    16.00     0.00   0.00   0.00   0.00

Now connect to the 1922 instance and test its performance:

pgbench -i -s 100 -h 127.0.0.1 -p 1922 

pgbench -M prepared -n -r -P 1 -c 96 -j 96 -T 100 -h 127.0.0.1 -p 1922
progress: 1.0 s, 33.0 tps, lat 2.841 ms stddev 1.746
progress: 2.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 3.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 4.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 5.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 6.0 s, 197.2 tps, lat 2884.437 ms stddev 2944.982
progress: 7.0 s, 556.6 tps, lat 33.527 ms stddev 34.798
progress: 8.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 9.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 10.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 11.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 12.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 13.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 14.0 s, 0.0 tps, lat -nan ms stddev -nan
progress: 15.0 s, 0.0 tps, lat -nan ms stddev -nan

As you can see, the 1922 instance's performance is dragged down by the 1921 instance, even though the block device itself is capable of hundreds of thousands of IOPS.

Why?

Because metadata journal writes are serialized: when the 1921 instance's metadata journal operations slow down, they also stall the 1922 instance's metadata journal operations on the same file system.

Even "select 1;" is affected, because every time a frontend process establishes a connection, PostgreSQL creates a temporary relcache init file global/pg_internal.init.<pid>.

Trace the postmaster process of the second database instance:

[root@digoal ~]# strace -t -f -p 24293 > ./conn

Connect to the second database instance:

postgres@digoal-> strace -t psql -h 127.0.0.1 -p 1922
execve("/opt/pgsql/bin/psql", ["psql", "-h", "127.0.0.1", "-p", "1922"], [/* vars */]) = 0 <0.009976>
brk(0)                                  = 0x1747000 <0.000007>

...
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1)    <-- stuck here

At this point you can see a startup process in the system, forked by the postmaster. Note its process number; it corresponds to the entries in the conn file below.

[root@digoal postgresql-9.4.4]# ps -efw|grep start
postgres 46147 24293  0 19:43 ?  00:00:00 postgres: postgres postgres 127.0.0.1(17947) startup

Excerpt of the strace -t psql -h 127.0.0.1 -p 1922 output:

setsockopt(3, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0 <0.000008>
connect(3, {sa_family=AF_INET, sin_port=htons(1922), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress) <0.000943>
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000011>
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000124>
getsockname(3, {sa_family=AF_INET, sin_port=htons(17947), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0 <0.000015>
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000008>
sendto(3, "\0\0\0\10\4\322\26/", 8, MSG_NOSIGNAL, NULL, 0) = 8 <0.000050>
poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}]) <0.000600>
recvfrom(3, "N", 16384, 0, NULL, NULL) = 1 <0.000010>
poll([{fd=3, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}]) <0.000007>
sendto(3, "\0\0\0t\0\3\0\0user\0postgres\0database\0p"..., MSG_NOSIGNAL, NULL, 0) = <0.000020>

The poll response time reached 67 seconds:

poll([{fd=3, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=3, revents=POLLIN}]) <67.436925>    <-- response time reached 67 seconds
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0\32application_name\0p"..., 16384, 0, NULL, NULL) = 322 <0.000017>

Once the connection is established, look at the trace of the postmaster process. You can see that startup process 46147 spends 66 seconds in a write() call, because that write triggered a metadata operation.

[root@digoal ~]# grep "pid 46147" conn | less
[pid 46147] mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f1403d000 <0.000012>
[pid 46147] unlink("global/pg_internal.init.46147") = -1 ENOENT (No such file or directory) <0.000059>
[pid 46147] open("global/pg_internal.init.46147", O_WRONLY|O_CREAT|O_TRUNC, 0666) = <0.000068>
[pid 46147] fstat({st_mode=S_IFREG|0600, st_size=0, ...}) = 0 <0.000013>
[pid 46147] mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f1403c000 <0.000020>
[pid 46147] write(*, "f2w\0008\1\0\0\0\0\0\0\200\6\0\0\0\0\0\0u2\0\0\0\0\0\0\0\0\0\0"..., 4096 <unfinished ...>
[pid 46147] <... write resumed> )       = 4096 <66.317440>
[pid 46147] --- SIGALRM (Alarm clock) @ 0 (0) ---
We found the corresponding code:

write_relcache_init_file@src/backend/utils/cache/relcache.c

Trace this C file again, this time with SystemTap:

[root@digoal ~]# cat trc.stp
global f_start[999999]

probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c").call {
  f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
}

probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c").return {
  t = gettimeofday_ms()
  a = execname(); b = cpu(); c = pid(); d = pp(); e = tid()
  if (f_start[a,c,e,b] && t - f_start[a,c,e,b] > 1) {
    # printf("time:%d, execname:%s, pp:%s, par:%s\n", t - f_start[a,c,e,b], a, d, $$locals$$)
    printf("time:%d, execname:%s, pp:%s\n", t - f_start[a,c,e,b], a, d)
  }
}

Because the startup process is spawned dynamically, we can only loop like this:

[root@digoal ~]# cat t.sh
#!/bin/bash
for ((i=0;i<1;i=0))
do
  PID=`ps -ewf|grep start|grep -v grep|awk '{print $2}'`
  stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x $PID
done

Reconnect and trace again; the results are as follows:

postgres@digoal-> strace -t psql -h 127.0.0.1 -p 1922

[root@digoal ~]# . ./t.sh
Pass 1: parsed user script and library script(s) using 209296virt/36828res/3172shr/34516data kb, in 180usr/20sys/199real ms.
Pass 2: analyzed script: 102 probe(s), 7 function(s), 4 embed(s), 1 global(s) using 223800virt/51400res/4172shr/49020data kb, in 80usr/60sys/142real ms.
Pass 3: translated to C into "/tmp/stapBw7mDQ/stap_b17f8a3318ccf4b972f4b84491bbdc1e_54060_src.c" using 223800virt/51744res/4504shr/49020data kb, in 10usr/40sys/57real ms.
Pass 4: compiled C into "stap_b17f8a3318ccf4b972f4b84491bbdc1e_54060.ko" in 1440usr/370sys/1640real ms.
Pass 5: starting run.
time:6134, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return
time:3, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return
time:6, execname:postgres, pp:process("/opt/pgsql9.4.4/bin/postgres").function("write_item@/opt/soft_bak/postgresql-9.4.4/src/backend/utils/cache/relcache.c:4979").return
...
How do we solve this problem, i.e. isolate each database instance's IOPS so that instances do not interfere with each other?

Solution 1

Different instances use different file systems.

For example

mkfs.ext4 /dev/mapper/vgdata01-lv01
mkfs.ext4 /dev/mapper/vgdata01-lv02
mount /dev/mapper/vgdata01-lv01 /data01
mount /dev/mapper/vgdata01-lv02 /data02

The two database instances are placed in /data01 and /data02, respectively.

Restricting the IOPS of /dev/mapper/vgdata01-lv01 then has no effect on the other file system.

The disadvantage of this method: with many instances, storage must be split into many small file systems, which prevents flexible space management and sharing.

Solution 2

For ext4

Normally data is written in order: before metadata can be modified, the data blocks it refers to must already have hit the disk. A metadata write may therefore be forced to flush dirty data pages first.


If dirty data pages flush slowly, metadata writes can block. And because metadata journal writes are serialized, this inevitably affects metadata journal modifications by other processes.

Mount the ext4 file system with data=writeback.

The principle of this method: metadata writes no longer wait for the corresponding data to be flushed first. This can leave the metadata newer than the data (for example, an inode is new while some of the blocks it references do not exist yet, or are stale, already-deleted blocks).

Since metadata writes do not wait for data writes, the serialized journal operations are no longer blocked behind slow data flushes.

       data={journal|ordered|writeback}
              Specifies the journalling mode for file data.  Metadata is always journaled.  To use modes other than ordered on the root filesystem, pass the mode to the kernel as boot parameter, e.g. rootflags=data=journal.

              journal
                     All data is committed into the journal prior to being written into the main filesystem.

              ordered
                     This is the default mode.  All data is forced directly out to the main file system prior to its metadata being committed to the journal.

              writeback
                     Data ordering is not preserved - data may be written into the main filesystem after its metadata has been committed to the journal.  This is rumoured to be the highest-throughput option.  It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

Disadvantage: after a file system or operating system crash, metadata and data may be inconsistent, and stale blocks may appear in files.
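A minimal sketch of applying this mode, reusing the device and mount-point names from the earlier example (adjust to your system; the tune2fs alternative bakes the mode into the superblock):

```shell
# Mount ext4 so that only metadata is journaled, data writes are not ordered:
mount -o data=writeback /dev/mapper/aliflash-lv01 /data01

# Or make writeback the default journalling mode for future mounts:
tune2fs -o journal_data_writeback /dev/mapper/aliflash-lv01

# Verify the active mount option:
grep data01 /proc/mounts
```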

Solution 3

Use a separate block device for the journal. When limiting IOPS, do not restrict IO on the journal block device (metadata journal writes are small and fast, so there is no need to throttle them); restrict only the IO of the data block device.

This approach is only suitable for XFS file systems.

With ext4 this method does not achieve the desired effect. The ext4 separate-journal-device setup is listed below for reference; it did not help in testing, but you can try it.

Create the logical volumes, one for data and one for the journal:

# pvcreate /dev/dfa
# pvcreate /dev/dfb
# pvcreate /dev/dfc
# vgcreate aliflash /dev/dfa /dev/dfb /dev/dfc
# lvcreate -i 3 -I 8 -L 1T -n lv01 aliflash
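The XFS logdev setup itself is not spelled out above. A sketch, assuming a small second logical volume (lv02 here, a hypothetical name) is created on the same volume group to hold the external log:

```shell
# Hypothetical journal LV; 2G is generous for an XFS log.
lvcreate -L 2G -n lv02 aliflash

# Put the XFS metadata journal on the dedicated log device.
mkfs.xfs -f -l logdev=/dev/mapper/aliflash-lv02 /dev/mapper/aliflash-lv01

# The logdev option must be repeated at mount time.
mount -o logdev=/dev/mapper/aliflash-lv02 /dev/mapper/aliflash-lv01 /data01

# In the cgroup, throttle only the data device's major:minor (253:0 in
# the earlier example); leave the journal device unthrottled so the
# serialized journal writes stay fast.
echo "253:0 100" > /sys/fs/cgroup/blkio/cg1/blkio.throttle.write_iops_device
```

With this layout, a throttled instance's data writes are capped, but its metadata journal writes land on the unthrottled log device, so they no longer stall other instances sharing the file system.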
