drop_caches never returns, and each instance consumes an entire CPU


A bit of background: I wanted to drop the entire cache on one of our systems, because several of our hosts are consuming more than 512GB of memory and the oom-killer kicks in. I picked a host at random and tried to drop the cache (this host had not yet consumed all of its memory), but the session hung.

I issued the following command to drop the cache

 echo 3 | sudo tee /proc/sys/vm/drop_caches

but I never get control of the session back; the command appears to loop forever. Ctrl+C does not help in that session.
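
For context, here is a minimal sketch of what the values written to drop_caches mean, as I understand the kernel's sysctl/vm documentation. Value 3 combines the page cache (1) and reclaimable slab objects (2), and the slab part is the one that goes through shrink_slab in the traces further down. The function names and the DROP_CACHES_FOR_REAL guard variable are hypothetical, made up for this sketch; by default it only prints the command instead of running it.

```shell
#!/bin/sh
# Describe what each value written to /proc/sys/vm/drop_caches asks the kernel to free.
describe_drop_caches() {
    case "$1" in
        1) echo "page cache" ;;
        2) echo "reclaimable slab objects (dentries, inodes)" ;;
        3) echo "page cache and reclaimable slab objects" ;;
        *) echo "unknown value: $1" >&2; return 1 ;;
    esac
}

# Dry run by default: print the command instead of executing it.
# DROP_CACHES_FOR_REAL is a hypothetical guard variable used only in this sketch.
safe_drop_caches() {
    mode="$1"
    describe_drop_caches "$mode" >/dev/null || return 1
    if [ "${DROP_CACHES_FOR_REAL:-0}" = "1" ]; then
        sync    # flush dirty pages first so more of the cache is droppable
        echo "$mode" | sudo tee /proc/sys/vm/drop_caches
    else
        echo "would run: sync; echo $mode | sudo tee /proc/sys/vm/drop_caches"
    fi
}

safe_drop_caches 3
```

If the documentation is right, writing 1 instead of 3 should skip the slab shrinkers altogether, though I have not verified that this avoids the hang on an affected host.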

OS information:

Linux version 3.13.0-98-generic (buildd@lgw01-52) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #145~precise1-Ubuntu SMP Sat Oct 8 20:16:31 UTC 2016

vmstat shows nothing notable:

raven@s295401x7712103c:~$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 6  0      0 254539840   2260  30588    0    0     5    14    0    0  9  6 85  0
15  0      0 254521488   2400  31196    0    0   322   624 181284 292469  7 14 79  0
 8  0      0 254531584   2432  31776    0    0   136  1254 183746 295388  8 13 79  0
11  0      0 254552928   2576  31976    0    0   494   846 182298 296141  7 13 79  0
 6  0      0 254552960   2604  32072    0    0    44   846 184103 296127  9 12 79  0
12  0      0 254536736   2612  32648    0    0   536   744 186093 298011  9 14 77  0
13  0      0 254542048   2624  32736    0    0    34   590 186877 299039  9 14 77  0
 7  0      0 254550256   2644  32736    0    0   490   470 181235 297326  7 12 81  0

top shows my tee processes each chewing up an entire CPU:

top - 20:23:20 up 27 days,  1:03,  5 users,  load average: 13.73, 13.88, 13.96
Tasks: 714 total,   4 running, 710 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.9%us, 13.2%sy,  0.0%ni, 78.8%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  528278628k total, 273635376k used, 254643252k free,    15312k buffers
Swap:  3906556k total,        0k used,  3906556k free,    56744k cached

  PID  PPID S %CPU   TIME    TIME+   P  PR  NI USER      VIRT %MEM  RES  SHR SWAP CODE DATA nFLT nDRT WCHAN     Flags    COMMAND
 5339     1 S  185 127,55   7675:37 54  20   0 libvirt- 12.3g  1.1 5.8g 7300 6.5g 2872  12g    0    0 -         .84.618. /usr/bin/kvm -name 1f530008-5990-40c7-8a1c-e38bc8739
 7327     1 S  105 142,31   8551:41 31  20   0 libvirt- 36.9g  1.9 9.7g 7348  27g 2872  36g    0    0 -         .84.618. /usr/bin/kvm -name 88b328c0-7b3f-4f4d-96d9-1019927a0
13960 13949 R  100 30500w  5124097h 18  20   0 root      5880  0.0  680  576 5200   24  316    1    0 -         ..4.61.. tee /proc/sys/vm/drop_caches
57401 57400 R  100  26:22  26:22.42  5  20   0 root      5880  0.0  680  576 5200   24  316    0    0 -         ..4.61.. tee /proc/sys/vm/drop_caches
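
One thing the top header above suggests: buffers (15312k) plus cached (56744k) add up to roughly 70 MB, against about 273 GB used, so the page cache itself is a tiny part of the picture on this host. The sketch below sums the fields that drop_caches could plausibly give back. The function name is hypothetical, and SReclaimable is not visible in the output above, so its value on this host is unknown.

```shell
#!/bin/sh
# Sum the memory that drop_caches could plausibly release, in kB.
# Reads a meminfo-style file; defaults to /proc/meminfo.
reclaimable_kb() {
    awk '
        $1 == "Buffers:"      { sum += $2 }
        $1 == "Cached:"       { sum += $2 }
        $1 == "SReclaimable:" { sum += $2 }
        END { printf "%.0f\n", sum }
    ' "${1:-/proc/meminfo}"
}

# Only report on systems that actually expose /proc/meminfo.
if [ -r /proc/meminfo ]; then
    echo "reclaimable (kB): $(reclaimable_kb)"
fi
```

If that number is small compared to the used memory, the consumers are more likely anonymous memory (for example the kvm guests) than cache.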

The processes cannot be killed, even with SIGKILL:

  raven@s295401x7712103c:~$ sudo kill 13960
  [sudo] password for raven:
  raven@s295401x7712103c:~$
  raven@s295401x7712103c:~$ ps -f 13960
  UID        PID  PPID  C STIME TTY      STAT   TIME CMD
  root     13960 13949 99 18:02 pts/16   R+   21114751:35 tee /proc/sys/vm/drop_caches
  raven@s295401x7712103c:~$ sudo kill -11 13960
  raven@s295401x7712103c:~$ ps -f 13960
  UID        PID  PPID  C STIME TTY      STAT   TIME CMD
  root     13960 13949 99 18:02 pts/16   R+   21114751:48 tee /proc/sys/vm/drop_caches
  raven@s295401x7712103c:~$ sudo kill -9 13960
  raven@s295401x7712103c:~$ ps -f 13960
  UID        PID  PPID  C STIME TTY      STAT   TIME CMD
  root     13960 13949 99 18:02 pts/16   R+   21114752:00 tee /proc/sys/vm/drop_caches
  raven@s295401x7712103c:~$ sudo kill -9 13960
  raven@s295401x7712103c:~$ ps -f 13960
  UID        PID  PPID  C STIME TTY      STAT   TIME CMD
  root     13960 13949 99 18:02 pts/16   R+   21114752:27 tee /proc/sys/vm/drop_caches
  raven@s295401x7712103c:~$
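
The likely reason none of the signals take effect is that a task only acts on a pending signal, SIGKILL included, when it leaves kernel mode, and these tee processes are stuck running inside the write() system call. A small sketch for checking the state of such a process from /proc follows. The function name is hypothetical, and it assumes the usual layout of /proc/<pid>/stat, where the state letter follows the parenthesised command name.

```shell
#!/bin/sh
# Print the scheduler state letter (R, S, D, Z, ...) of a PID, read from /proc/<pid>/stat.
proc_state() {
    # The command name in field 2 is parenthesised and may contain spaces,
    # so strip everything through the last ") " before taking the next field.
    sed 's/^.*) //' "/proc/$1/stat" 2>/dev/null | cut -d' ' -f1
}

# Example: show the state of the current shell.
echo "state of this shell: $(proc_state $$)"

# With root privileges, the kernel-side stack of a stuck PID can sometimes be read with:
#   sudo cat /proc/<pid>/stack
# (this depends on kernel configuration and may show little for a task that is running)
```

A state of R, as in the ps output above, together with 99% CPU means the task is busy in the kernel rather than sleeping, which matches a spin or a loop rather than a blocked I/O wait.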

After digging through some forum posts I decided to grab a stack dump, and sure enough it shows both tee processes inside the KVM shrinker (mmu_shrink_scan): CPU 3 is spinning in _raw_spin_lock, and CPU 17 is in mmu_page_zap_pte, as below:

  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074386] NMI backtrace for cpu 3
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074390] CPU: 3 PID: 57401 Comm: tee Tainted: G           OX 3.13.0-98-generic #145~precise1-Ubuntu
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074392] Hardware name: Supermicro SYS-2028TP-HC0R-S4-GT009/X10DRT-PS, BIOS 2.0b 05/09/2017
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074393] task: ffff8829b9914800 ti: ffff882ce5a7c000 task.ti: ffff882ce5a7c000
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074394] RIP: 0010:[<ffffffff8176e987>]  [<ffffffff8176e987>] _raw_spin_lock+0x37/0x50
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074400] RSP: 0018:ffff882ce5a7dcd8  EFLAGS: 00000202
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074401] RAX: 000000000000429e RBX: 0000000000000080 RCX: 0000000000008b44
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074402] RDX: 0000000000008b46 RSI: 0000000000008b46 RDI: ffffffffa0436818
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074403] RBP: ffff882ce5a7dcd8 R08: ffffea00ab1aa900 R09: ffffffffa03fc7ea
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074406] R10: dead000000000200 R11: 0000000000000000 R12: ffffffffa04331e0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074410] R13: 0000000000001ec3 R14: ffff882ce5a7de48 R15: 0000000000000002
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074415] FS:  00007f507af9d700(0000) GS:ffff887e7ee60000(0000) knlGS:0000000000000000
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074420] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074424] CR2: ffffffffff600000 CR3: 0000002cdf1cb000 CR4: 00000000003427e0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074428] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074434] Stack:
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074437]  ffff882ce5a7dd38 ffffffffa03fe726 ffff887dc0e6f000 0000000000000001
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074450]  ffff882ce5a7dcf8 ffff882ce5a7dcf8 ffffffffa0436818 0000000000000080
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074459]  ffffffffa04331e0 0000000000001ec3 ffff882ce5a7de48 0000000000000002
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074463] Call Trace:
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074477]  [<ffffffffa03fe726>] mmu_shrink_scan+0x26/0x220 [kvm]
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074484]  [<ffffffff8116c851>] shrink_slab_node+0x121/0x2b0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074490]  [<ffffffff8139242b>] ? find_first_bit+0x1b/0x80
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074493]  [<ffffffff8116e9e0>] shrink_slab+0xb0/0x110
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074497]  [<ffffffff8122c073>] drop_caches_sysctl_handler+0x73/0xa0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074502]  [<ffffffff81240c5c>] proc_sys_call_handler+0xbc/0xd0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074505]  [<ffffffff81240c84>] proc_sys_write+0x14/0x20
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074509]  [<ffffffff811ce2f5>] vfs_write+0xc5/0x1f0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074512]  [<ffffffff811ce7f2>] SyS_write+0x52/0xa0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074517]  [<ffffffff81106a60>] ? __audit_syscall_exit+0x230/0x2d0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074520]  [<ffffffff81777add>] system_call_fastpath+0x1a/0x1f
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.074521] Code: 02 00 f0 0f c1 07 89 c2 c1 ea 10 66 39 c2 75 02 5d c3 83 e2 fe 0f b7 f2 b8 00 80 00 00 eb 0c 0f 1f 44 00 00 f3 90 83 e8 01 74 0a <0f> b7 0f 66 39 ca 75 f1 5d c3 0f 1f 80 00 00 00 00 eb da 66 0f


  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076452] NMI backtrace for cpu 17
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076459] INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 1.286 msecs
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076467] CPU: 17 PID: 13960 Comm: tee Tainted: G           OX 3.13.0-98-generic #145~precise1-Ubuntu
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076471] Hardware name: Supermicro SYS-2028TP-HC0R-S4-GT009/X10DRT-PS, BIOS 2.0b 05/09/2017
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076474] task: ffff887dbee79800 ti: ffff883807bc0000 task.ti: ffff883807bc0000
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076478] RIP: 0010:[<ffffffffa03fe1a2>]  [<ffffffffa03fe1a2>] mmu_page_zap_pte+0x22/0x100 [kvm]
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076505] RSP: 0018:ffff883807bc1c58  EFLAGS: 00000282
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076506] RAX: 0000000000000000 RBX: ffff887d2b100000 RCX: c000000000000006
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076507] RDX: ffff882efa0c3b80 RSI: ffff881c9acf77a0 RDI: ffff887d2b100000
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076508] RBP: ffff883807bc1c78 R08: ffffea0076ff6700 R09: ffffffffa03fc7ea
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076509] R10: dead000000000200 R11: 0000000000000000 R12: ffff881c9acf77a0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076510] R13: ffff887d2b100000 R14: ffff883807bc1cf8 R15: 0000000000000000
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076511] FS:  00007f18469a6700(0000) GS:ffff887e7f020000(0000) knlGS:0000000000000000
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076512] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076513] CR2: 00000000f6370000 CR3: 000000380a1cc000 CR4: 00000000003427e0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076514] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076515] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076516] Stack:
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076517]  ffff881c9acf77a0 ffff887d2b100000 0000000000000b88 ffff881c9acf77a0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076522]  ffff883807bc1cd8 ffffffffa03fe2d7 ffff883807bc1cf8 ffff883807bc1cf8
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076526]  ffff887d2692c040 0000000000000000 ffff883807bc1cd8 ffff883807bc1cf8
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076529] Call Trace:
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076541]  [<ffffffffa03fe2d7>] kvm_mmu_prepare_zap_page+0x57/0x2b0 [kvm]
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076550]  [<ffffffffa03fe867>] mmu_shrink_scan+0x167/0x220 [kvm]
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076558]  [<ffffffff8116c851>] shrink_slab_node+0x121/0x2b0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076562]  [<ffffffff8116e9e0>] shrink_slab+0xb0/0x110
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076567]  [<ffffffff8122c073>] drop_caches_sysctl_handler+0x73/0xa0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076573]  [<ffffffff81240c5c>] proc_sys_call_handler+0xbc/0xd0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076576]  [<ffffffff81240c84>] proc_sys_write+0x14/0x20
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076581]  [<ffffffff811ce2f5>] vfs_write+0xc5/0x1f0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076584]  [<ffffffff811ce7f2>] SyS_write+0x52/0xa0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076591]  [<ffffffff81106a60>] ? __audit_syscall_exit+0x230/0x2d0
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076595]  [<ffffffff81777add>] system_call_fastpath+0x1a/0x1f
  Apr  3 20:05:28 s295401x7712103c kernel: [2337358.076596] Code: f0 4c 8b 6d f8 c9 c3 66 90 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 48 8b 0d dc 80 03 00 48 89 5d f0 4c 89 65 f8 48 89 fb 48 8b 02 <a8> 01 41 89 c4 74 4f 48 89 c7 48 21 cf 48 39 f9 0f 84 a0 00 00
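
For anyone who wants to collect the same kind of dump: one way is the magic SysRq 'l' operation, which logs a backtrace for all active CPUs. The sketch below is a dry run by default and also includes a small filter that reduces a log like the one above to one line per CPU showing where it was executing. Both function names and the SYSRQ_FOR_REAL guard variable are hypothetical, and the filter assumes the 3.13-era log format shown above.

```shell
#!/bin/sh
# Trigger a backtrace of all active CPUs via SysRq 'l'.
# Dry run unless SYSRQ_FOR_REAL=1 (a hypothetical guard variable for this sketch).
dump_cpu_backtraces() {
    if [ "${SYSRQ_FOR_REAL:-0}" = "1" ]; then
        echo l | sudo tee /proc/sysrq-trigger >/dev/null
    else
        echo "would run: echo l | sudo tee /proc/sysrq-trigger"
    fi
}

# Reduce a kernel log on stdin to "cpu N: function+offset/size" using the RIP lines.
summarise_rip() {
    awk '
        /NMI backtrace for cpu/ { cpu = $NF }
        / RIP: / {
            for (i = 1; i <= NF; i++)
                if ($i ~ /\+0x/) print "cpu " cpu ": " $i
        }
    '
}

dump_cpu_backtraces
```

Usage would be something like `dmesg | summarise_rip`, or the same with the syslog file as input. Repeating the dump a few times shows whether a CPU sits at the same instruction or moves around inside the shrinker.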

Has anyone run into this situation before? If so, I would appreciate a pointer to any patch for this, or any other way to resolve it.

Thanks, S.R.

    
by S.R. 04.04.2018 / 06:37

0 answers