Debugging recurring crashes on a small Linux server

I have a small machine that I use as a multipurpose server on my local network (backups, monitoring, NFS, torrents, etc.). Every now and then I notice that the fan is running at full speed and I can't ssh in. After a reboot everything is fine, but I have never managed to get to the bottom of what is causing the outage.

I recently added Prometheus and Grafana to the server, which has given me a bit more insight. Here is a quick timeline of the most recent incident, based on what I see there:

  • 16:20: CPU load rises to 3.5 for 40 minutes, then levels off at around 2 (before that it had been close to 0).
  • 20:06: Contact with the server is lost (the Prometheus logs stop).
  • 23:00: I hear the fan whirring and reboot the server.

Looking at the logs after the reboot, the only thing in /var/log/messages for this period is:

Sep 27 19:56:44 larch kernel: [257806.553544] PGD 0 
Sep 27 19:56:44 larch kernel: [257806.553567] 
Sep 27 19:56:44 larch kernel: [257806.553591] Oops: 0002 [#1] SMP
Sep 27 19:56:44 larch kernel: [257806.553628] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay arc4 evdev ath9k ath9k_common efi_pstore ath9k_hw kvm_amd nls_ascii nls_cp437 vfat fat ath kvm irqbypass mac80211 pcspkr serio_raw efivars k10temp cfg80211 ath3k amdkfd snd_hda_codec_realtek snd_hda_codec_generic bluetooth shpchp radeon snd_hda_codec_hdmi snd_hda_intel rfkill sp5100_tco snd_hda_codec snd_hda_core snd_hwdep snd_pcm ttm sg ir_rc6_decoder drm_kms_helper ir_lirc_codec lirc_dev snd_timer snd soundcore drm i2c_algo_bit rc_rc6_mce ite_cir rc_core button acpi_cpufreq parport_pc ppdev nfsd auth_rpcgss
Sep 27 19:56:44 larch kernel: [257806.554557]  oid_registry nfs_acl lp lockd grace parport sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 mbcache sd_mod ata_generic uas usb_storage ohci_pci ahci libahci pata_atiixp xhci_pci xhci_hcd psmouse ohci_hcd ehci_pci ehci_hcd r8169 mii libata scsi_mod i2c_piix4 usbcore usb_common
Sep 27 19:56:44 larch kernel: [257806.555016] CPU: 1 PID: 35 Comm: kswapd0 Not tainted 4.9.0-7-amd64 #1 Debian 4.9.110-3+deb9u2
Sep 27 19:56:44 larch kernel: [257806.555102] Hardware name: ZOTAC ZBOXNANO-AD10/ZBOXNANO-AD10, BIOS 4.6.4 12/06/2011
Sep 27 19:56:44 larch kernel: [257806.555178] task: ffff9443c9d80440 task.stack: ffffb37780840000
Sep 27 19:56:44 larch kernel: [257806.555238] RIP: 0010:[]  [] dentry_unlink_inode+0x52/0x150
Sep 27 19:56:44 larch kernel: [257806.555331] RSP: 0018:ffffb37780843bc8  EFLAGS: 00010246
Sep 27 19:56:44 larch kernel: [257806.555385] RAX: ffff9442beb12fb0 RBX: ffff94438069d240 RCX: 0000000000000000
Sep 27 19:56:44 larch kernel: [257806.555456] RDX: 0000000000000100 RSI: ffff9442b802fe48 RDI: ffff94438069d240
Sep 27 19:56:44 larch kernel: [257806.555527] RBP: ffff9443b172c798 R08: ffff94438069d2d0 R09: ffffb37780843d38
Sep 27 19:56:44 larch kernel: [257806.555599] R10: 0000000000000000 R11: ffff944303092f40 R12: ffff94438069d298
Sep 27 19:56:44 larch kernel: [257806.555670] R13: ffff94438069d298 R14: ffff94438069d240 R15: 0000000000000000
Sep 27 19:56:44 larch kernel: [257806.555743] FS:  0000000000000000(0000) GS:ffff9443ced00000(0000) knlGS:0000000000000000
Sep 27 19:56:44 larch kernel: [257806.555823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 27 19:56:44 larch kernel: [257806.555881] CR2: 0000000000000108 CR3: 000000013e102000 CR4: 00000000000006f0
Sep 27 19:56:44 larch kernel: [257806.555952] Stack:
Sep 27 19:56:44 larch kernel: [257806.555976]  ffff94438069d240 ffff944383de1b40 ffffffffbc61fcff ffff944383de1b40
Sep 27 19:56:44 larch kernel: [257806.556062]  ffff94438069d2c0 ffffb37780843c38 ffffffffbc62029d 0000000000000362
Sep 27 19:56:44 larch kernel: [257806.556147]  0000000000052029 ffff9443c73684c0 ffff9443c7368000 0000000000000000
Sep 27 19:56:44 larch kernel: [257806.558380] Call Trace:
Sep 27 19:56:44 larch kernel: [257806.560587]  [] ? __dentry_kill+0xaf/0x160
Sep 27 19:56:44 larch kernel: [257806.562837]  [] ? shrink_dentry_list+0xfd/0x300
Sep 27 19:56:44 larch kernel: [257806.565045]  [] ? prune_dcache_sb+0x52/0x70
Sep 27 19:56:44 larch kernel: [257806.567202]  [] ? super_cache_scan+0x10c/0x190
Sep 27 19:56:44 larch kernel: [257806.569339]  [] ? shrink_slab.part.38+0x21a/0x440
Sep 27 19:56:44 larch kernel: [257806.571464]  [] ? shrink_node+0x10a/0x340
Sep 27 19:56:44 larch kernel: [257806.573590]  [] ? kswapd+0x2e7/0x700
Sep 27 19:56:44 larch kernel: [257806.575669]  [] ? mem_cgroup_shrink_node+0x170/0x170
Sep 27 19:56:44 larch kernel: [257806.577721]  [] ? kthread+0xd9/0xf0
Sep 27 19:56:44 larch kernel: [257806.579717]  [] ? kthread_park+0x60/0x60
Sep 27 19:56:44 larch kernel: [257806.581658]  [] ? ret_from_fork+0x44/0x70
Sep 27 19:56:44 larch kernel: [257806.583547] Code: 00 00 25 ff ff 8f fe 89 07 48 8b 87 b8 00 00 00 48 85 c0 74 32 48 8b 97 b0 00 00 00 48 85 d2 48 89 10 0f 84 e0 00 00 00 48 85 c9  89 42 08 48 c7 83 b0 00 00 00 00 00 00 00 48 c7 83 b8 00 00 
Sep 27 19:56:44 larch kernel: [257806.591248]  RSP 
Sep 27 19:56:44 larch kernel: [257806.593046] CR2: 0000000000000108
Sep 27 19:56:44 larch kernel: [257806.597728] ---[ end trace 3d94bfea732521fc ]---
Sep 27 19:56:44 larch kernel: [257806.854081] general protection fault: 0000 [#2] SMP
Sep 27 19:56:45 larch kernel: [257806.855804] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay arc4 evdev ath9k ath9k_common efi_pstore ath9k_hw kvm_amd nls_ascii nls_cp437 vfat fat ath kvm irqbypass mac80211 pcspkr serio_raw efivars k10temp cfg80211 ath3k amdkfd snd_hda_codec_realtek snd_hda_codec_generic bluetooth shpchp radeon snd_hda_codec_hdmi snd_hda_intel rfkill sp5100_tco snd_hda_codec snd_hda_core snd_hwdep snd_pcm ttm sg ir_rc6_decoder drm_kms_helper ir_lirc_codec lirc_dev snd_timer snd soundcore drm i2c_algo_bit rc_rc6_mce ite_cir rc_core button acpi_cpufreq parport_pc ppdev nfsd auth_rpcgss
Sep 27 19:56:45 larch kernel: [257806.870503]  oid_registry nfs_acl lp lockd grace parport sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 mbcache sd_mod ata_generic uas usb_storage ohci_pci ahci libahci pata_atiixp xhci_pci xhci_hcd psmouse ohci_hcd ehci_pci ehci_hcd r8169 mii libata scsi_mod i2c_piix4 usbcore usb_common
Sep 27 19:56:45 larch kernel: [257806.878569] CPU: 1 PID: 35 Comm: kswapd0 Tainted: G      D         4.9.0-7-amd64 #1 Debian 4.9.110-3+deb9u2
Sep 27 19:56:45 larch kernel: [257806.882577] Hardware name: ZOTAC ZBOXNANO-AD10/ZBOXNANO-AD10, BIOS 4.6.4 12/06/2011
Sep 27 19:56:45 larch kernel: [257806.884648] task: ffff9443c9d80440 task.stack: ffffb37780840000
Sep 27 19:56:45 larch kernel: [257806.886724] RIP: 0010:[]  [] __wake_up_common+0x28/0x90
Sep 27 19:56:45 larch kernel: [257806.888839] RSP: 0018:ffffb37780843e70  EFLAGS: 00010086
Sep 27 19:56:45 larch kernel: [257806.890932] RAX: 2e9195b6e438a597 RBX: ffffb37780843f10 RCX: 0000000000000000
Sep 27 19:56:45 larch kernel: [257806.893064] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffb37780843f10
Sep 27 19:56:45 larch kernel: [257806.895213] RBP: ffffb37780843f18 R08: 0000000000000000 R09: ffffb377808437b8
Sep 27 19:56:45 larch kernel: [257806.897359] R10: 0000000000000000 R11: ffffb377808437a8 R12: 0000000000000282
Sep 27 19:56:45 larch kernel: [257806.899520] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000046
Sep 27 19:56:45 larch kernel: [257806.901680] FS:  0000000000000000(0000) GS:ffff9443ced00000(0000) knlGS:0000000000000000
Sep 27 19:56:45 larch kernel: [257806.903870] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 27 19:56:45 larch kernel: [257806.906055] CR2: 0000000000000108 CR3: 000000013e102000 CR4: 00000000000006f0
Sep 27 19:56:45 larch kernel: [257806.908266] Stack:
Sep 27 19:56:45 larch kernel: [257806.910467]  0000000100000000 ffffb37780843f10 ffffb37780843f08 0000000000000282
Sep 27 19:56:45 larch kernel: [257806.912764]  0000000000000000 0000000000000001 0000000000000046 ffffffffbc4bb9e1
Sep 27 19:56:45 larch kernel: [257806.915040]  ffff9443c9d80b60 ffff9443c9d80440 0000000000000000 ffffffffbc4760a0
Sep 27 19:56:45 larch kernel: [257806.917275] Call Trace:
Sep 27 19:56:45 larch kernel: [257806.919443]  [] ? complete+0x31/0x40
Sep 27 19:56:45 larch kernel: [257806.921614]  [] ? mm_release+0xb0/0x130
Sep 27 19:56:45 larch kernel: [257806.923765]  [] ? do_exit+0x150/0xaf0
Sep 27 19:56:45 larch kernel: [257806.925918]  [] ? rewind_stack_do_exit+0x17/0x20
Sep 27 19:56:45 larch kernel: [257806.928076] Code: 00 00 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 41 89 cd 55 53 48 89 fd 48 83 c5 08 48 83 ec 08 48 8b 47 08 89 54 24 04 48 39 c5  8b 08 74 46 48 8d 78 e8 4c 8d 79 e8 41 89 f6 4d 89 c4 8b 1f 
Sep 27 19:56:45 larch kernel: [257806.937109]  RSP 
Sep 27 19:56:45 larch kernel: [257806.939215] ---[ end trace 3d94bfea732521fd ]---
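
To make a dump like this easier to read, here is a minimal Python sketch that condenses each report into the task that was running, the function named on the RIP line, and the functions listed in the call trace. It assumes the syslog line layout shown above ("date host kernel: [uptime] text") and only recognises the two report headers that appear in this log ("Oops" and "general protection fault"). The function name `summarize_oops` and the dictionary keys are my own choices for illustration, not part of any existing tool.

```python
import re

# Syslog prefix as it appears above: "Sep 27 19:56:44 larch kernel: [257806.553591] text"
LINE_RE = re.compile(r"^(\w{3}\s+\d+ [\d:]+) (\S+) kernel: \[\s*([\d.]+)\]\s*(.*)$")
# Lines that open a new report, e.g. "Oops: 0002 [#1] SMP"
HEADER_RE = re.compile(r"^(Oops|general protection fault): \S+ \[#(\d+)\]")
# Task information, e.g. "CPU: 1 PID: 35 Comm: kswapd0 Not tainted ..."
CPU_RE = re.compile(r"^CPU: (\d+) PID: (\d+) Comm: (\S+)")
# Symbol with offset and size, e.g. "dentry_unlink_inode+0x52/0x150"
SYMBOL_RE = re.compile(r"([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def summarize_oops(lines):
    """Return one dict per kernel report found in an iterable of syslog lines."""
    reports, current, in_trace = [], None, False
    for raw in lines:
        match = LINE_RE.match(raw.rstrip("\n"))
        if not match:
            continue  # not a kernel line in the expected layout
        timestamp, text = match.group(1), match.group(4)
        header = HEADER_RE.match(text)
        cpu = CPU_RE.match(text)
        if header:
            current = {
                "time": timestamp,
                "kind": header.group(1),
                "number": int(header.group(2)),
                "pid": None,
                "comm": None,
                "rip": None,
                "trace": [],
            }
            reports.append(current)
            in_trace = False
        elif current is None:
            continue  # ignore lines that are outside any report
        elif text.startswith("---[ end trace"):
            current, in_trace = None, False
        elif text.startswith("Call Trace:"):
            in_trace = True
        elif text.startswith("Code:"):
            in_trace = False  # the call trace ends where the code bytes begin
        elif text.startswith("RIP:"):
            symbol = SYMBOL_RE.search(text)
            if symbol:
                current["rip"] = symbol.group(1)
        elif cpu:
            current["pid"] = int(cpu.group(2))
            current["comm"] = cpu.group(3)
        elif in_trace:
            symbol = SYMBOL_RE.search(text)
            if symbol:
                current["trace"].append(symbol.group(1))
    return reports
```

Passing the open /var/log/messages file to `summarize_oops` should yield two reports for the excerpt above, both for the kswapd0 task (PID 35): the first with `dentry_unlink_inode` on the RIP line, the second with `__wake_up_common`.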

So, a few questions:

1) Any idea what the log entry above is telling me?

2) I would really like to know which process was causing the high CPU load between 16:00 and 20:00. Is there any way for me to find that out after the reboot?

3) I feel like I have been here a few times before without knowing what to do next. Are there other obvious, systematic steps to follow, either to better understand what happened last time or to be better prepared for the next time it happens?
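
To illustrate the kind of preparation I have in mind for questions 2 and 3, here is a rough Python sketch that samples per-process CPU time from /proc and appends the biggest consumers to a file on disk, so the history would survive a hang and a reboot. It assumes a Linux /proc filesystem, where the utime and stime values are fields 14 and 15 of /proc/<pid>/stat. The function names, the 60 second interval and the log path in the usage note are hypothetical choices, and this is a sketch rather than a tested monitoring setup.

```python
import os
import time


def parse_stat(stat_text):
    """Return (pid, comm, cpu_ticks) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    fields = right.split()
    # fields[0] is the state (field 3), so utime (14) and stime (15) are at 11 and 12.
    return int(pid_str), comm, int(fields[11]) + int(fields[12])


def read_cpu_ticks():
    """Map pid -> (comm, cumulative CPU ticks) for every readable process."""
    ticks = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as stat_file:
                pid, comm, total = parse_stat(stat_file.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        ticks[pid] = (comm, total)
    return ticks


def top_consumers(before, after, count=5):
    """Return (ticks_used, pid, comm) for the busiest processes between two samples."""
    deltas = []
    for pid, (comm, total) in after.items():
        previous_total = before.get(pid, (comm, 0))[1]  # new processes start from 0
        deltas.append((total - previous_total, pid, comm))
    deltas.sort(reverse=True)
    return deltas[:count]


def record(log_path, interval=60):
    """Append the top CPU consumers to log_path every `interval` seconds, forever."""
    previous = read_cpu_ticks()
    while True:
        time.sleep(interval)
        current = read_cpu_ticks()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as log:
            for delta, pid, comm in top_consumers(previous, current):
                log.write(f"{stamp} pid={pid} comm={comm} ticks={delta}\n")
            log.flush()
            os.fsync(log.fileno())  # force the data to disk in case the machine hangs
        previous = current
```

Running something like `record("/var/log/cpu-top.log")` in the background (the path is only an example) would leave one line per busy process per minute, which could then be compared against the Grafana timeline after the next incident.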

PS: I am running Debian on this machine, but I experienced similar problems when I was running similar tasks on Arch.

    
by michaelmcandrew 28.09.2018 / 17:02
