gpdb: The global deadlock detector (GDD) process on the Greenplum master node shows abnormal RSS memory usage.

1. 02:09:43, the log reports a global deadlock

$ cat gpdb-2024-04-18_000000.csv | grep deadlock
2024-04-18 02:09:43.246943 UTC,,,p26809,th-1525422016,,,,0,con380,cmd5036,seg-1,,,,sx1,"LOG","00000","global deadlock detected! Final graph is :{""seg0"": [""p1241 of dtx3430845286 con1088326 waits for a transactionid lock on ShareLock mode, blocked by p1232 of dtx3430845283 con1088326"",""p1232 of dtx3430845283 con1088324 waits for a transactionid lock on ShareLock mode, blocked by p1225 of dtx3430845280 con1088324"",""p1237 of dtx3430845285 con1088325 waits for a transactionid lock on ShareLock mode, blocked by p1222 of dtx3430845278 con1088325"",""p1223 of dtx3430845279 con1088321 waits for a transactionid lock on ShareLock mode, blocked by p1232 of dtx3430845283 con1088321"",""p1250 of dtx3430845289 con1088328 waits for a transactionid lock on ShareLock mode, blocked by p1237 of dtx3430845285 con1088328""],""seg1"": [""p8451 of dtx3430845279 con1088321 waits for a transactionid lock on ShareLock mode, blocked by p8466 of dtx3430845285 con1088321"",""p8466 of dtx3430845285 con1088325 waits for a transactionid lock on ShareLock mode, blocked by p8478 of dtx3430845289 con1088325"",""p8460 of dtx3430845283 con1088324 waits for a transactionid lock on ShareLock mode, blocked by p8450 of dtx3430845278 con1088324"",""p8478 of dtx3430845289 con1088328 waits for a transactionid lock on ShareLock mode, blocked by p8470 of dtx3430845286 con1088328"",""p8453 of dtx3430845280 con1088322 waits for a transactionid lock on ShareLock mode, blocked by p8478 of dtx3430845289 con1088322"",""p8457 of dtx3430845281 con1088323 waits for a transactionid lock on ShareLock mode, blocked by p8450 of dtx3430845278 con1088323""],""seg2"": [""p26938 of dtx3430845281 con1088323 waits for a transactionid lock on ShareLock mode, blocked by p26933 of dtx3430845279 con1088323"",""p26959 of dtx3430845289 con1088328 waits for a transactionid lock on ShareLock mode, blocked by p26952 of dtx3430845286 con1088328"",""p26952 of dtx3430845286 con1088326 waits for a transactionid lock on ShareLock mode, 
blocked by p26933 of dtx3430845279 con1088326"",""p26947 of dtx3430845285 con1088325 waits for a transactionid lock on ShareLock mode, blocked by p26938 of dtx3430845281 con1088325"",""p26932 of dtx3430845278 con1088320 waits for a transactionid lock on ShareLock mode, blocked by p26941 of dtx3430845283 con1088320""]}",,,,,,,0,,"gddbackend.c",228,

The log shows that the global deadlock detector runs as process 26809 (p26809 in the log line).
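The "Final graph" embedded in that log record is hard to read inline. As a sketch (using the log filename from this incident), the individual wait-for edges can be pulled out so the cycle is easier to follow:

```shell
# Extract each "X waits for ..., blocked by Y" edge from the GDD log record.
grep 'global deadlock detected' gpdb-2024-04-18_000000.csv \
  | grep -oE 'p[0-9]+ of dtx[0-9]+ con[0-9]+ waits for [^,]+, blocked by p[0-9]+ of dtx[0-9]+' \
  | sort -u
```

Each output line is one edge of the deadlock cycle on one segment.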

2. 02:09:50, the master's memory usage starts to grow abnormally

[screenshot: abnormal RSS growth on the master node]

3. 02:17:55, the kernel invoked the oom-killer

Apr 18 02:17:55 mdw kernel: tuned invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
Apr 18 02:17:55 mdw kernel: CPU: 51 PID: 1965 Comm: tuned Tainted: G            E     5.4.258-1.el7.elrepo.x86_64 #1
Apr 18 02:17:55 mdw kernel: Hardware name: Amazon EC2 c6i.16xlarge/, BIOS 1.0 10/16/2017
Apr 18 02:17:55 mdw kernel: Call Trace:
Apr 18 02:17:55 mdw kernel: dump_stack+0x6d/0x8b
Apr 18 02:17:55 mdw kernel: dump_header+0x4f/0x1eb
Apr 18 02:17:55 mdw kernel: oom_kill_process.cold+0xb/0x10
Apr 18 02:17:55 mdw kernel: out_of_memory+0x1cf/0x500
Apr 18 02:17:55 mdw kernel: __alloc_pages_slowpath+0xdaa/0xeb0
Apr 18 02:17:55 mdw kernel: __alloc_pages_nodemask+0x2c6/0x300
Apr 18 02:17:55 mdw kernel: alloc_pages_current+0x87/0xe0
Apr 18 02:17:55 mdw kernel: __page_cache_alloc+0x72/0x90
Apr 18 02:17:55 mdw kernel: pagecache_get_page+0xbf/0x300
Apr 18 02:17:55 mdw kernel: filemap_fault+0x6ea/0xac0
Apr 18 02:17:55 mdw kernel: ? unlock_page_memcg+0x12/0x20
Apr 18 02:17:55 mdw kernel: ? alloc_set_pte+0x114/0x5c0
Apr 18 02:17:55 mdw kernel: ? _cond_resched+0x19/0x30
Apr 18 02:17:55 mdw kernel: ? down_read+0x13/0xa0
Apr 18 02:17:55 mdw kernel: __xfs_filemap_fault+0x6c/0x200 [xfs]
Apr 18 02:17:55 mdw kernel: xfs_filemap_fault+0x37/0x40 [xfs]
Apr 18 02:17:55 mdw kernel: __do_fault+0x3c/0x170
Apr 18 02:17:55 mdw kernel: do_fault+0x24b/0x620
Apr 18 02:17:55 mdw kernel: __handle_mm_fault+0x478/0x720
Apr 18 02:17:55 mdw kernel: handle_mm_fault+0xca/0x200
Apr 18 02:17:55 mdw kernel: do_user_addr_fault+0x1e2/0x430
Apr 18 02:17:55 mdw kernel: __do_page_fault+0x42/0x50
Apr 18 02:17:55 mdw kernel: do_page_fault+0x36/0x120
Apr 18 02:17:55 mdw kernel: do_async_page_fault+0x43/0xd0
Apr 18 02:17:55 mdw kernel: ? prepare_exit_to_usermode+0x8c/0xb0
Apr 18 02:17:55 mdw kernel: async_page_fault+0x3e/0x50
Apr 18 02:17:55 mdw kernel: RIP: 0033:0x7f855a6e9770
Apr 18 02:17:55 mdw kernel: Code: Bad RIP value.
Apr 18 02:17:55 mdw kernel: RSP: 002b:00007f8548b05dd8 EFLAGS: 00010202
Apr 18 02:17:55 mdw kernel: RAX: 00007f855a96fb60 RBX: 00007f855aa58800 RCX: 0000000000000000
Apr 18 02:17:55 mdw kernel: RDX: 00007f855a99a1c8 RSI: 0000000000000001 RDI: 00007f85482d6bd0
Apr 18 02:17:55 mdw kernel: RBP: 00007f85482d6bd0 R08: 00007f8548b05dd0 R09: 00007f8548b05b80
Apr 18 02:17:55 mdw kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00007f85520df360
Apr 18 02:17:55 mdw kernel: R13: 0000000000000001 R14: 00007f853c0131e0 R15: 00007f8552955488
......
Apr 18 02:17:55 mdw kernel: Tasks state (memory values in pages):
Apr 18 02:17:55 mdw kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
......
Apr 18 02:17:55 mdw kernel: [  26809]   996 26809 33710005 30046023 241467392        0             0 postgres
......
Apr 18 02:17:55 mdw kernel: oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=postgres,pid=26809,uid=996
Apr 18 02:17:55 mdw kernel: Out of memory: Killed process 26809 (postgres) total-vm:134840020kB, anon-rss:120147564kB, file-rss:0kB, shmem-rss:36528kB, UID:996 pgtables:235808kB oom_score_adj:0
Apr 18 02:17:55 mdw systemd: Started Push metric to node_exporter textfiles.
Apr 18 02:17:59 mdw kernel: oom_reaper: reaped process 26809 (postgres), now anon-rss:0kB, file-rss:0kB, shmem-rss:36528kB

Process 26809 consumed an abnormal amount of memory and was killed by the OOM killer.

anon-rss:120147564 kB ≈ 114.6 GiB
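The conversion can be reproduced directly from the dmesg line above (a one-off sketch; the echoed line is copied from the excerpt):

```shell
# Pull the anon-rss figure out of the OOM-killer line and convert kB to GiB.
echo 'Out of memory: Killed process 26809 (postgres) total-vm:134840020kB, anon-rss:120147564kB, file-rss:0kB, shmem-rss:36528kB' \
  | grep -oE 'anon-rss:[0-9]+' \
  | cut -d: -f2 \
  | awk '{printf "%.1f GiB\n", $1 / 1024 / 1024}'
# prints: 114.6 GiB
```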

4. 02:18:00, memory usage starts to drop

ENV

postgres=# select version();
                                                       version
--------------------------------------------------------------------------------
 PostgreSQL 12.12 (Greenplum Database 7.0.0 build commit:0a7a3566873325aca1789ae6f818c80f17a9402d) on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit compiled on Oct 19 2023 16:40:25 Bhuvnesh C.
(1 row)

$ pg_config
BINDIR = /usr/local/greenplum-db/bin
DOCDIR = /usr/local/greenplum-db/share/doc/postgresql
HTMLDIR = /usr/local/greenplum-db/share/doc/postgresql
INCLUDEDIR = /usr/local/greenplum-db/include
PKGINCLUDEDIR = /usr/local/greenplum-db/include/postgresql
INCLUDEDIR-SERVER = /usr/local/greenplum-db/include/postgresql/server
LIBDIR = /usr/local/greenplum-db/lib
PKGLIBDIR = /usr/local/greenplum-db/lib/postgresql
LOCALEDIR = /usr/local/greenplum-db/share/locale
MANDIR = /usr/local/greenplum-db/share/man
SHAREDIR = /usr/local/greenplum-db/share/postgresql
SYSCONFDIR = /usr/local/greenplum-db/etc/postgresql
PGXS = /usr/local/greenplum-db/lib/postgresql/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/usr/local/gpdb' '--disable-orca' '--disable-gpcloud' '--disable-gpfdist' '--with-perl' '--with-python' '--with-libxml' 'CFLAGS=-g -O3 -march=native' 'CXXFLAGS=-g -O3 -march=native'
CC = gcc -std=gnu99
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -Wall -Wmissing-prototypes -Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-unused-but-set-variable -g -O3 -march=native  -Werror=uninitialized -Werror=implicit-function-declaration
CFLAGS_SL = -fPIC
LDFLAGS = -Wl,--as-needed -Wl,-rpath,'/usr/local/gpdb/lib',--enable-new-dtags
LDFLAGS_EX =
LDFLAGS_SL =
LIBS = -lpgcommon -lpgport -lpthread -lbz2 -lxml2 -lrt -lz -lreadline -lrt -lcrypt -ldl -lm  -lcurl   -lzstd
VERSION = PostgreSQL 12.12

So far I haven't found steps to reproduce the problem. Do you need any other information?

About this issue

  • Original URL
  • State: open
  • Created 2 months ago
  • Comments: 27 (9 by maintainers)

Most upvoted comments

@yanboer We have another good practice for tracing similar OOM issues: use the following gdb script to print the call stacks of allocations in the GDD memory context:

handle SIGUSR1 nostop
set pagination off
set logging file gdb.output
set logging on

break AllocSetAllocImpl if context == gddContext
continue
set $loop = 100000 # loop count
while ($loop > 0)
    bt -10
    set $loop = $loop - 1
    continue
end

set logging off
quit

When you find the memory usage increasing, attach the script immediately: gdb --pid={the pid of the gdd process} < gdb_script.in

Then provide the gdb.output to us, thanks.
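Since the growth window is short, it may help to automate the attach. This is a sketch only: the pgrep pattern, polling interval, threshold, and script filename are all assumptions to adjust for your installation.

```shell
#!/bin/sh
# Poll the GDD process RSS and attach the gdb script once it crosses a threshold.
pid=$(pgrep -f 'global deadlock detector')   # assumed process title
threshold_kb=$((8 * 1024 * 1024))            # attach once RSS exceeds 8 GiB
while sleep 10; do
    rss_kb=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    if [ "${rss_kb:-0}" -gt "$threshold_kb" ]; then
        gdb --pid="$pid" < gdb_script.in
        break
    fi
done
```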

$ cat /proc/sys/kernel/core_pattern
core

Since the core pattern is a relative path, you might look in COORDINATOR_DATA_DIRECTORY to see whether any core files were written there.


Next time it happens, you can use gcore to generate a core dump of the GDD process.