Some background: I wanted to drop all caches on one of our systems, because we have several hosts that consume more than 512GB of memory, at which point the oom-killer kicks in. I picked a host at random and tried to drop its cache (this host had not yet consumed all of its memory), but the session hung.
I issued the following command to drop the cache:
echo 3 | sudo tee /proc/sys/vm/drop_caches
but I never get control of the session back, and the command goes into a loop. Ctrl-C does not help in that session.
OS information:
Linux version 3.13.0-98-generic (buildd@lgw01-52) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #145~precise1-Ubuntu SMP Sat Oct 8 20:16:31 UTC 2016
vmstat does not show anything notable:
raven@s295401x7712103c:~$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
6 0 0 254539840 2260 30588 0 0 5 14 0 0 9 6 85 0
15 0 0 254521488 2400 31196 0 0 322 624 181284 292469 7 14 79 0
8 0 0 254531584 2432 31776 0 0 136 1254 183746 295388 8 13 79 0
11 0 0 254552928 2576 31976 0 0 494 846 182298 296141 7 13 79 0
6 0 0 254552960 2604 32072 0 0 44 846 184103 296127 9 12 79 0
12 0 0 254536736 2612 32648 0 0 536 744 186093 298011 9 14 77 0
13 0 0 254542048 2624 32736 0 0 34 590 186877 299039 9 14 77 0
7 0 0 254550256 2644 32736 0 0 490 470 181235 297326 7 12 81 0
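The vmstat figures above put buff and cache in the tens of megabytes against more than 254,000,000 kB free, so there is very little page cache to reclaim on this host. A minimal sketch for checking that before writing to drop_caches, assuming a standard /proc/meminfo layout; the function name cache_share is hypothetical:

```shell
# Print how much of total RAM is held by Buffers + Cached, as reported by a
# meminfo-style file. Defaults to /proc/meminfo; cache_share is a made-up name.
cache_share() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:" { total = $2 }
        $1 == "Buffers:"  { buffers = $2 }
        $1 == "Cached:"   { cached = $2 }   # exact match, so SwapCached: is ignored
        END {
            if (total > 0)
                printf "cache_kb=%d total_kb=%d share=%.2f%%\n",
                       buffers + cached, total, 100 * (buffers + cached) / total
        }
    ' "$meminfo"
}

# Usage: cache_share            (reads /proc/meminfo)
#        cache_share ./meminfo  (reads a saved copy)
```

If the share is tiny, as it appears to be here, dropping caches is unlikely to free a meaningful amount of memory.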
top shows my tee processes each chewing up an entire CPU:
top - 20:23:20 up 27 days, 1:03, 5 users, load average: 13.73, 13.88, 13.96
Tasks: 714 total, 4 running, 710 sleeping, 0 stopped, 0 zombie
Cpu(s): 7.9%us, 13.2%sy, 0.0%ni, 78.8%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 528278628k total, 273635376k used, 254643252k free, 15312k buffers
Swap: 3906556k total, 0k used, 3906556k free, 56744k cached
PID PPID S %CPU TIME TIME+ P PR NI USER VIRT %MEM RES SHR SWAP CODE DATA nFLT nDRT WCHAN Flags COMMAND
5339 1 S 185 127,55 7675:37 54 20 0 libvirt- 12.3g 1.1 5.8g 7300 6.5g 2872 12g 0 0 - .84.618. /usr/bin/kvm -name 1f530008-5990-40c7-8a1c-e38bc8739
7327 1 S 105 142,31 8551:41 31 20 0 libvirt- 36.9g 1.9 9.7g 7348 27g 2872 36g 0 0 - .84.618. /usr/bin/kvm -name 88b328c0-7b3f-4f4d-96d9-1019927a0
13960 13949 R 100 30500w 5124097h 18 20 0 root 5880 0.0 680 576 5200 24 316 1 0 - ..4.61.. tee /proc/sys/vm/drop_caches
57401 57400 R 100 26:22 26:22.42 5 20 0 root 5880 0.0 680 576 5200 24 316 0 0 - ..4.61.. tee /proc/sys/vm/drop_caches
The processes cannot be killed, even with SIGKILL:
raven@s295401x7712103c:~$ sudo kill 13960
[sudo] password for raven:
raven@s295401x7712103c:~$
raven@s295401x7712103c:~$ ps -f 13960
UID PID PPID C STIME TTY STAT TIME CMD
root 13960 13949 99 18:02 pts/16 R+ 21114751:35 tee /proc/sys/vm/drop_caches
raven@s295401x7712103c:~$ sudo kill -11 13960
raven@s295401x7712103c:~$ ps -f 13960
UID PID PPID C STIME TTY STAT TIME CMD
root 13960 13949 99 18:02 pts/16 R+ 21114751:48 tee /proc/sys/vm/drop_caches
raven@s295401x7712103c:~$ sudo kill -9 13960
raven@s295401x7712103c:~$ ps -f 13960
UID PID PPID C STIME TTY STAT TIME CMD
root 13960 13949 99 18:02 pts/16 R+ 21114752:00 tee /proc/sys/vm/drop_caches
raven@s295401x7712103c:~$ sudo kill -9 13960
raven@s295401x7712103c:~$ ps -f 13960
UID PID PPID C STIME TTY STAT TIME CMD
root 13960 13949 99 18:02 pts/16 R+ 21114752:27 tee /proc/sys/vm/drop_caches
raven@s295401x7712103c:~$
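The ps output shows STAT R+ throughout: the task is runnable and burning CPU, not sleeping. A pending signal, SIGKILL included, is only acted on when the task returns to user space or is woken from a killable sleep, so a task that keeps spinning inside the kernel never gets to process it. A small sketch for checking a task's state from /proc, assuming the usual State: line in /proc/<pid>/status; task_state and its optional second argument are hypothetical names for illustration:

```shell
# Print the one-letter scheduler state (R, S, D, Z, ...) of a task.
# $1 = pid, $2 = proc root (defaults to /proc; overridable for testing).
# task_state is a made-up helper name, not an existing tool.
task_state() {
    pid="$1"
    proc_root="${2:-/proc}"
    awk '$1 == "State:" { print $2; exit }' "$proc_root/$pid/status"
}

# Usage: task_state 13960
```

A result of R for a process that ignores kill -9 points at a loop in kernel mode rather than a blocked I/O wait, which would show up as D.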
After digging through some forum posts I decided to grab a stack dump, and sure enough it shows both tee processes stuck in the KVM MMU shrinker (mmu_shrink_scan): CPU 3 is spinning in _raw_spin_lock and CPU 17 is in mmu_page_zap_pte, as shown below:
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074386] NMI backtrace for cpu 3
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074390] CPU: 3 PID: 57401 Comm: tee Tainted: G OX 3.13.0-98-generic #145~precise1-Ubuntu
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074392] Hardware name: Supermicro SYS-2028TP-HC0R-S4-GT009/X10DRT-PS, BIOS 2.0b 05/09/2017
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074393] task: ffff8829b9914800 ti: ffff882ce5a7c000 task.ti: ffff882ce5a7c000
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074394] RIP: 0010:[<ffffffff8176e987>] [<ffffffff8176e987>] _raw_spin_lock+0x37/0x50
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074400] RSP: 0018:ffff882ce5a7dcd8 EFLAGS: 00000202
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074401] RAX: 000000000000429e RBX: 0000000000000080 RCX: 0000000000008b44
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074402] RDX: 0000000000008b46 RSI: 0000000000008b46 RDI: ffffffffa0436818
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074403] RBP: ffff882ce5a7dcd8 R08: ffffea00ab1aa900 R09: ffffffffa03fc7ea
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074406] R10: dead000000000200 R11: 0000000000000000 R12: ffffffffa04331e0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074410] R13: 0000000000001ec3 R14: ffff882ce5a7de48 R15: 0000000000000002
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074415] FS: 00007f507af9d700(0000) GS:ffff887e7ee60000(0000) knlGS:0000000000000000
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074420] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074424] CR2: ffffffffff600000 CR3: 0000002cdf1cb000 CR4: 00000000003427e0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074428] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074434] Stack:
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074437] ffff882ce5a7dd38 ffffffffa03fe726 ffff887dc0e6f000 0000000000000001
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074450] ffff882ce5a7dcf8 ffff882ce5a7dcf8 ffffffffa0436818 0000000000000080
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074459] ffffffffa04331e0 0000000000001ec3 ffff882ce5a7de48 0000000000000002
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074463] Call Trace:
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074477] [<ffffffffa03fe726>] mmu_shrink_scan+0x26/0x220 [kvm]
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074484] [<ffffffff8116c851>] shrink_slab_node+0x121/0x2b0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074490] [<ffffffff8139242b>] ? find_first_bit+0x1b/0x80
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074493] [<ffffffff8116e9e0>] shrink_slab+0xb0/0x110
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074497] [<ffffffff8122c073>] drop_caches_sysctl_handler+0x73/0xa0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074502] [<ffffffff81240c5c>] proc_sys_call_handler+0xbc/0xd0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074505] [<ffffffff81240c84>] proc_sys_write+0x14/0x20
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074509] [<ffffffff811ce2f5>] vfs_write+0xc5/0x1f0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074512] [<ffffffff811ce7f2>] SyS_write+0x52/0xa0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074517] [<ffffffff81106a60>] ? __audit_syscall_exit+0x230/0x2d0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074520] [<ffffffff81777add>] system_call_fastpath+0x1a/0x1f
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.074521] Code: 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0c 0f 1f 44 00 00 f3 90 83 e8 01 74 0a <0f> b7 0f 66 39 ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb da 66 0f
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076452] NMI backtrace for cpu 17
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076459] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.286 msecs
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076467] CPU: 17 PID: 13960 Comm: tee Tainted: G OX 3.13.0-98-generic #145~precise1-Ubuntu
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076471] Hardware name: Supermicro SYS-2028TP-HC0R-S4-GT009/X10DRT-PS, BIOS 2.0b 05/09/2017
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076474] task: ffff887dbee79800 ti: ffff883807bc0000 task.ti: ffff883807bc0000
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076478] RIP: 0010:[<ffffffffa03fe1a2>] [<ffffffffa03fe1a2>] mmu_page_zap_pte+0x22/0x100 [kvm]
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076505] RSP: 0018:ffff883807bc1c58 EFLAGS: 00000282
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076506] RAX: 0000000000000000 RBX: ffff887d2b100000 RCX: c000000000000006
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076507] RDX: ffff882efa0c3b80 RSI: ffff881c9acf77a0 RDI: ffff887d2b100000
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076508] RBP: ffff883807bc1c78 R08: ffffea0076ff6700 R09: ffffffffa03fc7ea
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076509] R10: dead000000000200 R11: 0000000000000000 R12: ffff881c9acf77a0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076510] R13: ffff887d2b100000 R14: ffff883807bc1cf8 R15: 0000000000000000
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076511] FS: 00007f18469a6700(0000) GS:ffff887e7f020000(0000) knlGS:0000000000000000
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076512] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076513] CR2: 00000000f6370000 CR3: 000000380a1cc000 CR4: 00000000003427e0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076514] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076516] Stack:
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076517] ffff881c9acf77a0 ffff887d2b100000 0000000000000b88 ffff881c9acf77a0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076522] ffff883807bc1cd8 ffffffffa03fe2d7 ffff883807bc1cf8 ffff883807bc1cf8
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076526] ffff887d2692c040 0000000000000000 ffff883807bc1cd8 ffff883807bc1cf8
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076529] Call Trace:
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076541] [<ffffffffa03fe2d7>] kvm_mmu_prepare_zap_page+0x57/0x2b0 [kvm]
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076550] [<ffffffffa03fe867>] mmu_shrink_scan+0x167/0x220 [kvm]
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076558] [<ffffffff8116c851>] shrink_slab_node+0x121/0x2b0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076562] [<ffffffff8116e9e0>] shrink_slab+0xb0/0x110
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076567] [<ffffffff8122c073>] drop_caches_sysctl_handler+0x73/0xa0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076573] [<ffffffff81240c5c>] proc_sys_call_handler+0xbc/0xd0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076576] [<ffffffff81240c84>] proc_sys_write+0x14/0x20
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076581] [<ffffffff811ce2f5>] vfs_write+0xc5/0x1f0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076584] [<ffffffff811ce7f2>] SyS_write+0x52/0xa0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076591] [<ffffffff81106a60>] ? __audit_syscall_exit+0x230/0x2d0
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076595] [<ffffffff81777add>] system_call_fastpath+0x1a/0x1f
Apr 3 20:05:28 s295401x7712103c kernel: [2337358.076596] Code: f0 4c 8b 6d f8 c9 c3 66 90 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 48 8b 0d dc 80 03 00 48 89 5d f0 4c 89 65 f8 48 89 fb 48 8b 02 <a8> 01 41 89 c4 74 4f 48 89 c7 48 21 cf 48 39 f9 0f 84 a0 00 00
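Dumps like the one above are long, so it can help to reduce each per-CPU backtrace to one line naming the task and the function at RIP. A rough sketch, assuming the log format shown above (a "CPU: N PID: P Comm: name" header followed by a "RIP:" line); summarize_backtraces is a hypothetical helper name, and the note about SysRq is an assumption about how such a dump is usually triggered, not something stated in this post:

```shell
# Read kernel log text on stdin and print one line per dumped CPU:
#   cpu=<n> pid=<pid> comm=<name> rip=<symbol+offset/length>
# Such dumps are commonly produced with: echo l > /proc/sysrq-trigger (as root).
summarize_backtraces() {
    awk '
        # Header of each per-CPU dump: "... CPU: 3 PID: 57401 Comm: tee ..."
        / CPU: [0-9]+ PID: [0-9]+ Comm: / {
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU:")  cpu  = $(i + 1)
                if ($i == "PID:")  pid  = $(i + 1)
                if ($i == "Comm:") comm = $(i + 1)
            }
        }
        # RIP line: the token shaped like symbol+0xoff/0xlen is the running function
        / RIP: / {
            sym = ""
            for (i = 1; i <= NF; i++)
                if ($i ~ /\+0x[0-9a-f]+\/0x[0-9a-f]+$/) sym = $i
            print "cpu=" cpu, "pid=" pid, "comm=" comm, "rip=" sym
        }
    '
}

# Usage: summarize_backtraces < /var/log/kern.log
```

Taking the dump a few times and comparing the summaries shows whether a task is always at the same RIP (spinning on a lock) or moving through a loop.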
Has anyone run into this situation before? If so, I would appreciate a pointer to any patch for it, or any other way to resolve it.
Thanks, S.R.