Ubuntu 16.04 LVM-concatenated NVMe lock-up under high load


I'm running a VM on Google Cloud with Ubuntu 16.04; patches are up to date to within a week or two.

I'm using 6 "ephemeral SSD" drives, which are managed by the built-in nvme driver.

I have them concatenated with LVM; XFS is the filesystem.
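For reference, a linear (concatenated) LVM volume across the local SSDs can be built roughly like this; the device names, volume group name, and mount point are assumptions, not taken from the question:

```shell
# Assumed device names: on GCE, local SSDs under the nvme driver
# typically show up as /dev/nvme0n1 .. /dev/nvme0n6.
pvcreate /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 \
         /dev/nvme0n4 /dev/nvme0n5 /dev/nvme0n6
vgcreate vg_localssd /dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 \
                     /dev/nvme0n4 /dev/nvme0n5 /dev/nvme0n6

# Linear (concatenated) layout is the lvcreate default;
# -l 100%FREE spans all extents across the six PVs.
lvcreate -l 100%FREE -n lv_data vg_localssd

mkfs.xfs /dev/vg_localssd/lv_data
mount /dev/vg_localssd/lv_data /mnt/data
```

This produces the dm-0 device-mapper target that shows up in the iostat output below.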

Occasionally (once every few days), when they are under high load, processes that have files open on them get stuck and cannot be killed. Ubuntu won't reboot; I have to hard-reset to recover.
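When it happens again, the stuck processes can be confirmed as uninterruptible sleeps and their kernel stacks captured; a minimal diagnostic sketch (requires root for the sysrq part):

```shell
# List processes in uninterruptible sleep (state D) along with the
# kernel function they are blocked in (wchan):
ps axo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'

# Dump kernel stack traces of all blocked tasks into dmesg/kern.log,
# which shows exactly where in the block/nvme path they are waiting:
echo w > /proc/sysrq-trigger
dmesg | tail -n 100
```

The sysrq "w" dump is often the most useful thing to attach to a bug report for hangs like this.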

iostat -x usually shows one or two disks in the nvme set at 100% utilization (along with the device mapper, dm-0). However, I don't believe any I/O is actually happening.
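Those utilization figures can be watched live with something like the following (the device names and 2-second interval are assumptions):

```shell
# Extended per-device stats for the NVMe disks and dm-0, sampled
# every 2 seconds. %util pinned at 100 while r/s and w/s are near
# zero suggests requests stuck in the queue rather than real I/O.
iostat -x nvme0n1 nvme0n2 dm-0 2
```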

Curiously, while in this state I can still log in, create and delete files, and browse the filesystem, even with those processes stuck.

Google Support suggested raising watchdog_thresh from the default of 10 to 30, which I have done. While I wait for the problem to happen again, I thought I'd ask for other ideas on how to prevent it.
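For anyone wanting to apply the same change, the lockup-detector threshold can be set at runtime and persisted via sysctl; a minimal sketch (the drop-in file name is an assumption):

```shell
# Raise the soft/hard lockup detector threshold from 10s to 30s
# for the running kernel (lost on reboot):
sysctl -w kernel.watchdog_thresh=30

# Persist the setting across reboots:
echo 'kernel.watchdog_thresh = 30' > /etc/sysctl.d/99-watchdog.conf
sysctl --system
```

Note this only makes the watchdog more tolerant before declaring a lockup; it does not address whatever is stalling the nvme queues.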

kern.log gets a batch of these messages, and even if I don't see them again for hours, those processes stay stuck forever:

Jun  5 08:06:15 myserver kernel: [223272.311717] kworker/46:1H: page allocation failure: order:2, mode:0x2084020
Jun  5 08:06:15 myserver kernel: [223272.311720] CPU: 46 PID: 43371 Comm: kworker/46:1H Not tainted 4.4.0-127-generic #153-Ubuntu
Jun  5 08:06:15 myserver kernel: [223272.311722] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Jun  5 08:06:15 myserver kernel: [223272.311729] Workqueue: kblockd blk_mq_run_work_fn
Jun  5 08:06:15 myserver kernel: [223272.311732]  0000000000000286 6190c11485834004 ffff883c6c357980 ffffffff814001c3
Jun  5 08:06:15 myserver kernel: [223272.311734]  0000000002084020 0000000000000000 ffff883c6c357a10 ffffffff81198d6a
Jun  5 08:06:15 myserver kernel: [223272.311737]  0000000000000001 ffff88683fffb6f0 ffff883c6c3579d8 ffffffff00000030
Jun  5 08:06:15 myserver kernel: [223272.311739] Call Trace:
Jun  5 08:06:15 myserver kernel: [223272.311745]  [<ffffffff814001c3>] dump_stack+0x63/0x90
Jun  5 08:06:15 myserver kernel: [223272.311751]  [<ffffffff81198d6a>] warn_alloc_failed+0xfa/0x150
Jun  5 08:06:15 myserver kernel: [223272.311755]  [<ffffffff8119ca0d>] __alloc_pages_slowpath.constprop.88+0x48d/0xb00
Jun  5 08:06:15 myserver kernel: [223272.311759]  [<ffffffff8119d308>] __alloc_pages_nodemask+0x288/0x2a0
Jun  5 08:06:15 myserver kernel: [223272.311763]  [<ffffffff811e6f5c>] alloc_pages_current+0x8c/0x110
Jun  5 08:06:15 myserver kernel: [223272.311766]  [<ffffffff8119aec9>] alloc_kmem_pages+0x19/0x90
Jun  5 08:06:15 myserver kernel: [223272.311770]  [<ffffffff811b893e>] kmalloc_order_trace+0x2e/0xe0
Jun  5 08:06:15 myserver kernel: [223272.311772]  [<ffffffff811f3830>] __kmalloc+0x230/0x250
Jun  5 08:06:15 myserver kernel: [223272.311783]  [<ffffffffc0010613>] nvme_queue_rq+0xe3/0xa10 [nvme]
Jun  5 08:06:15 myserver kernel: [223272.311785]  [<ffffffff813db0d9>] __blk_mq_run_hw_queue+0x239/0x3a0
Jun  5 08:06:15 myserver kernel: [223272.311788]  [<ffffffff813db5e2>] blk_mq_run_work_fn+0x12/0x20
Jun  5 08:06:15 myserver kernel: [223272.311793]  [<ffffffff8109cd4b>] process_one_work+0x16b/0x490
Jun  5 08:06:15 myserver kernel: [223272.311796]  [<ffffffff8109d0bb>] worker_thread+0x4b/0x4d0
Jun  5 08:06:15 myserver kernel: [223272.311799]  [<ffffffff8109d070>] ? process_one_work+0x490/0x490
Jun  5 08:06:15 myserver kernel: [223272.311802]  [<ffffffff810a3487>] kthread+0xe7/0x100
Jun  5 08:06:15 myserver kernel: [223272.311804]  [<ffffffff810a33a0>] ? kthread_create_on_node+0x1e0/0x1e0
Jun  5 08:06:15 myserver kernel: [223272.311809]  [<ffffffff818510f5>] ret_from_fork+0x55/0x80
Jun  5 08:06:15 myserver kernel: [223272.311812]  [<ffffffff810a33a0>] ? kthread_create_on_node+0x1e0/0x1e0
Jun  5 08:06:15 myserver kernel: [223272.311813] Mem-Info:
Jun  5 08:06:15 myserver kernel: [223272.311825] active_anon:7773737 inactive_anon:481121 isolated_anon:0
Jun  5 08:06:15 myserver kernel: [223272.311825]  active_file:11395174 inactive_file:55405993 isolated_file:32
Jun  5 08:06:15 myserver kernel: [223272.311825]  unevictable:913 dirty:2241807 writeback:371322 unstable:0
Jun  5 08:06:15 myserver kernel: [223272.311825]  slab_reclaimable:2662247 slab_unreclaimable:41455
Jun  5 08:06:15 myserver kernel: [223272.311825]  mapped:7584 shmem:877 pagetables:25656 bounce:0
Jun  5 08:06:15 myserver kernel: [223272.311825]  free:505454 free_pcp:1564 free_cma:0
Jun  5 08:06:15 myserver kernel: [223272.311831] Node 0 DMA free:15904kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun  5 08:06:15 myserver kernel: [223272.311837] lowmem_reserve[]: 0 2962 419208 419208 419208
Jun  5 08:06:15 myserver kernel: [223272.311843] Node 0 DMA32 free:1665176kB min:476kB low:592kB high:712kB active_anon:4kB inactive_anon:716kB active_file:4kB inactive_file:4kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3129288kB managed:3048504kB mlocked:0kB dirty:0kB writeback:0kB mapped:688kB shmem:680kB slab_reclaimable:7116kB slab_unreclaimable:1660kB kernel_stack:32kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun  5 08:06:15 myserver kernel: [223272.311849] lowmem_reserve[]: 0 0 416246 416246 416246
Jun  5 08:06:15 myserver kernel: [223272.311854] Node 0 Normal free:340736kB min:67100kB low:83872kB high:100648kB active_anon:31094944kB inactive_anon:1923768kB active_file:45580692kB inactive_file:221623968kB unevictable:3652kB isolated(anon):0kB isolated(file):128kB present:433061888kB managed:426235960kB mlocked:3652kB dirty:8966836kB writeback:1485680kB mapped:29648kB shmem:2828kB slab_reclaimable:10641872kB slab_unreclaimable:164160kB kernel_stack:14816kB pagetables:102624kB unstable:0kB bounce:0kB free_pcp:6236kB local_pcp:684kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun  5 08:06:15 myserver kernel: [223272.311860] lowmem_reserve[]: 0 0 0 0 0
Jun  5 08:06:15 myserver kernel: [223272.311863] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15904kB
Jun  5 08:06:15 myserver kernel: [223272.311873] Node 0 DMA32: 36*4kB (ME) 49*8kB (ME) 48*16kB (ME) 74*32kB (MEH) 57*64kB (MEH) 24*128kB (MEH) 14*256kB (MEH) 9*512kB (MH) 8*1024kB (ME) 2*2048kB (M) 399*4096kB (M) = 1665176kB
Jun  5 08:06:15 myserver kernel: [223272.311883] Node 0 Normal: 51041*4kB (UMEH) 17075*8kB (UMH) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 340764kB
Jun  5 08:06:15 myserver kernel: [223272.311892] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun  5 08:06:15 myserver kernel: [223272.311894] Node 0 hugepages_total=56000 hugepages_free=1616 hugepages_surp=0 hugepages_size=2048kB
Jun  5 08:06:15 myserver kernel: [223272.311895] 66808087 total pagecache pages
Jun  5 08:06:15 myserver kernel: [223272.311897] 5351 pages in swap cache
Jun  5 08:06:15 myserver kernel: [223272.311899] Swap cache stats: add 3329272, delete 3323921, find 64238633/65200782
Jun  5 08:06:15 myserver kernel: [223272.311900] Free swap  = 171830936kB
Jun  5 08:06:15 myserver kernel: [223272.311901] Total swap = 171966456kB
Jun  5 08:06:15 myserver kernel: [223272.311902] 109051792 pages RAM
Jun  5 08:06:15 myserver kernel: [223272.311903] 0 pages HighMem/MovableOnly
Jun  5 08:06:15 myserver kernel: [223272.311904] 1726700 pages reserved
Jun  5 08:06:15 myserver kernel: [223272.311905] 0 pages cma reserved
Jun  5 08:06:15 myserver kernel: [223272.311906] 0 pages hwpoisoned
by user791211, 08.06.2018 / 21:17

0 answers