I have a small machine that I use as a multipurpose server on my local network (backups, monitoring, NFS, torrents, etc.). Every so often I notice that the fan is running at full speed and I can't SSH in. After a reboot everything is fine, but I've never managed to get to the bottom of what is causing the outage.
I recently added Prometheus and Grafana to the server, which has given me a bit more visibility. A quick timeline of the most recent incident, based on what I see there:
Looking at the logs after the reboot, the only thing in /var/log/messages during this period is:
Sep 27 19:56:44 larch kernel: [257806.553544] PGD 0
Sep 27 19:56:44 larch kernel: [257806.553567]
Sep 27 19:56:44 larch kernel: [257806.553591] Oops: 0002 [#1] SMP
Sep 27 19:56:44 larch kernel: [257806.553628] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay arc4 evdev ath9k ath9k_common efi_pstore ath9k_hw kvm_amd nls_ascii nls_cp437 vfat fat ath kvm irqbypass mac80211 pcspkr serio_raw efivars k10temp cfg80211 ath3k amdkfd snd_hda_codec_realtek snd_hda_codec_generic bluetooth shpchp radeon snd_hda_codec_hdmi snd_hda_intel rfkill sp5100_tco snd_hda_codec snd_hda_core snd_hwdep snd_pcm ttm sg ir_rc6_decoder drm_kms_helper ir_lirc_codec lirc_dev snd_timer snd soundcore drm i2c_algo_bit rc_rc6_mce ite_cir rc_core button acpi_cpufreq parport_pc ppdev nfsd auth_rpcgss
Sep 27 19:56:44 larch kernel: [257806.554557] oid_registry nfs_acl lp lockd grace parport sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 mbcache sd_mod ata_generic uas usb_storage ohci_pci ahci libahci pata_atiixp xhci_pci xhci_hcd psmouse ohci_hcd ehci_pci ehci_hcd r8169 mii libata scsi_mod i2c_piix4 usbcore usb_common
Sep 27 19:56:44 larch kernel: [257806.555016] CPU: 1 PID: 35 Comm: kswapd0 Not tainted 4.9.0-7-amd64 #1 Debian 4.9.110-3+deb9u2
Sep 27 19:56:44 larch kernel: [257806.555102] Hardware name: ZOTAC ZBOXNANO-AD10/ZBOXNANO-AD10, BIOS 4.6.4 12/06/2011
Sep 27 19:56:44 larch kernel: [257806.555178] task: ffff9443c9d80440 task.stack: ffffb37780840000
Sep 27 19:56:44 larch kernel: [257806.555238] RIP: 0010:[] [] dentry_unlink_inode+0x52/0x150
Sep 27 19:56:44 larch kernel: [257806.555331] RSP: 0018:ffffb37780843bc8 EFLAGS: 00010246
Sep 27 19:56:44 larch kernel: [257806.555385] RAX: ffff9442beb12fb0 RBX: ffff94438069d240 RCX: 0000000000000000
Sep 27 19:56:44 larch kernel: [257806.555456] RDX: 0000000000000100 RSI: ffff9442b802fe48 RDI: ffff94438069d240
Sep 27 19:56:44 larch kernel: [257806.555527] RBP: ffff9443b172c798 R08: ffff94438069d2d0 R09: ffffb37780843d38
Sep 27 19:56:44 larch kernel: [257806.555599] R10: 0000000000000000 R11: ffff944303092f40 R12: ffff94438069d298
Sep 27 19:56:44 larch kernel: [257806.555670] R13: ffff94438069d298 R14: ffff94438069d240 R15: 0000000000000000
Sep 27 19:56:44 larch kernel: [257806.555743] FS: 0000000000000000(0000) GS:ffff9443ced00000(0000) knlGS:0000000000000000
Sep 27 19:56:44 larch kernel: [257806.555823] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 27 19:56:44 larch kernel: [257806.555881] CR2: 0000000000000108 CR3: 000000013e102000 CR4: 00000000000006f0
Sep 27 19:56:44 larch kernel: [257806.555952] Stack:
Sep 27 19:56:44 larch kernel: [257806.555976] ffff94438069d240 ffff944383de1b40 ffffffffbc61fcff ffff944383de1b40
Sep 27 19:56:44 larch kernel: [257806.556062] ffff94438069d2c0 ffffb37780843c38 ffffffffbc62029d 0000000000000362
Sep 27 19:56:44 larch kernel: [257806.556147] 0000000000052029 ffff9443c73684c0 ffff9443c7368000 0000000000000000
Sep 27 19:56:44 larch kernel: [257806.558380] Call Trace:
Sep 27 19:56:44 larch kernel: [257806.560587] [] ? __dentry_kill+0xaf/0x160
Sep 27 19:56:44 larch kernel: [257806.562837] [] ? shrink_dentry_list+0xfd/0x300
Sep 27 19:56:44 larch kernel: [257806.565045] [] ? prune_dcache_sb+0x52/0x70
Sep 27 19:56:44 larch kernel: [257806.567202] [] ? super_cache_scan+0x10c/0x190
Sep 27 19:56:44 larch kernel: [257806.569339] [] ? shrink_slab.part.38+0x21a/0x440
Sep 27 19:56:44 larch kernel: [257806.571464] [] ? shrink_node+0x10a/0x340
Sep 27 19:56:44 larch kernel: [257806.573590] [] ? kswapd+0x2e7/0x700
Sep 27 19:56:44 larch kernel: [257806.575669] [] ? mem_cgroup_shrink_node+0x170/0x170
Sep 27 19:56:44 larch kernel: [257806.577721] [] ? kthread+0xd9/0xf0
Sep 27 19:56:44 larch kernel: [257806.579717] [] ? kthread_park+0x60/0x60
Sep 27 19:56:44 larch kernel: [257806.581658] [] ? ret_from_fork+0x44/0x70
Sep 27 19:56:44 larch kernel: [257806.583547] Code: 00 00 25 ff ff 8f fe 89 07 48 8b 87 b8 00 00 00 48 85 c0 74 32 48 8b 97 b0 00 00 00 48 85 d2 48 89 10 0f 84 e0 00 00 00 48 85 c9 89 42 08 48 c7 83 b0 00 00 00 00 00 00 00 48 c7 83 b8 00 00
Sep 27 19:56:44 larch kernel: [257806.591248] RSP
Sep 27 19:56:44 larch kernel: [257806.593046] CR2: 0000000000000108
Sep 27 19:56:44 larch kernel: [257806.597728] ---[ end trace 3d94bfea732521fc ]---
Sep 27 19:56:44 larch kernel: [257806.854081] general protection fault: 0000 [#2] SMP
Sep 27 19:56:45 larch kernel: [257806.855804] Modules linked in: xt_nat xt_tcpudp veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack nf_nat nf_conntrack br_netfilter bridge stp llc overlay arc4 evdev ath9k ath9k_common efi_pstore ath9k_hw kvm_amd nls_ascii nls_cp437 vfat fat ath kvm irqbypass mac80211 pcspkr serio_raw efivars k10temp cfg80211 ath3k amdkfd snd_hda_codec_realtek snd_hda_codec_generic bluetooth shpchp radeon snd_hda_codec_hdmi snd_hda_intel rfkill sp5100_tco snd_hda_codec snd_hda_core snd_hwdep snd_pcm ttm sg ir_rc6_decoder drm_kms_helper ir_lirc_codec lirc_dev snd_timer snd soundcore drm i2c_algo_bit rc_rc6_mce ite_cir rc_core button acpi_cpufreq parport_pc ppdev nfsd auth_rpcgss
Sep 27 19:56:45 larch kernel: [257806.870503] oid_registry nfs_acl lp lockd grace parport sunrpc efivarfs ip_tables x_tables autofs4 ext4 crc16 jbd2 crc32c_generic fscrypto ecb glue_helper lrw gf128mul ablk_helper cryptd aes_x86_64 mbcache sd_mod ata_generic uas usb_storage ohci_pci ahci libahci pata_atiixp xhci_pci xhci_hcd psmouse ohci_hcd ehci_pci ehci_hcd r8169 mii libata scsi_mod i2c_piix4 usbcore usb_common
Sep 27 19:56:45 larch kernel: [257806.878569] CPU: 1 PID: 35 Comm: kswapd0 Tainted: G D 4.9.0-7-amd64 #1 Debian 4.9.110-3+deb9u2
Sep 27 19:56:45 larch kernel: [257806.882577] Hardware name: ZOTAC ZBOXNANO-AD10/ZBOXNANO-AD10, BIOS 4.6.4 12/06/2011
Sep 27 19:56:45 larch kernel: [257806.884648] task: ffff9443c9d80440 task.stack: ffffb37780840000
Sep 27 19:56:45 larch kernel: [257806.886724] RIP: 0010:[] [] __wake_up_common+0x28/0x90
Sep 27 19:56:45 larch kernel: [257806.888839] RSP: 0018:ffffb37780843e70 EFLAGS: 00010086
Sep 27 19:56:45 larch kernel: [257806.890932] RAX: 2e9195b6e438a597 RBX: ffffb37780843f10 RCX: 0000000000000000
Sep 27 19:56:45 larch kernel: [257806.893064] RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffb37780843f10
Sep 27 19:56:45 larch kernel: [257806.895213] RBP: ffffb37780843f18 R08: 0000000000000000 R09: ffffb377808437b8
Sep 27 19:56:45 larch kernel: [257806.897359] R10: 0000000000000000 R11: ffffb377808437a8 R12: 0000000000000282
Sep 27 19:56:45 larch kernel: [257806.899520] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000046
Sep 27 19:56:45 larch kernel: [257806.901680] FS: 0000000000000000(0000) GS:ffff9443ced00000(0000) knlGS:0000000000000000
Sep 27 19:56:45 larch kernel: [257806.903870] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 27 19:56:45 larch kernel: [257806.906055] CR2: 0000000000000108 CR3: 000000013e102000 CR4: 00000000000006f0
Sep 27 19:56:45 larch kernel: [257806.908266] Stack:
Sep 27 19:56:45 larch kernel: [257806.910467] 0000000100000000 ffffb37780843f10 ffffb37780843f08 0000000000000282
Sep 27 19:56:45 larch kernel: [257806.912764] 0000000000000000 0000000000000001 0000000000000046 ffffffffbc4bb9e1
Sep 27 19:56:45 larch kernel: [257806.915040] ffff9443c9d80b60 ffff9443c9d80440 0000000000000000 ffffffffbc4760a0
Sep 27 19:56:45 larch kernel: [257806.917275] Call Trace:
Sep 27 19:56:45 larch kernel: [257806.919443] [] ? complete+0x31/0x40
Sep 27 19:56:45 larch kernel: [257806.921614] [] ? mm_release+0xb0/0x130
Sep 27 19:56:45 larch kernel: [257806.923765] [] ? do_exit+0x150/0xaf0
Sep 27 19:56:45 larch kernel: [257806.925918] [] ? rewind_stack_do_exit+0x17/0x20
Sep 27 19:56:45 larch kernel: [257806.928076] Code: 00 00 00 0f 1f 44 00 00 41 57 41 56 41 55 41 54 41 89 cd 55 53 48 89 fd 48 83 c5 08 48 83 ec 08 48 8b 47 08 89 54 24 04 48 39 c5 8b 08 74 46 48 8d 78 e8 4c 8d 79 e8 41 89 f6 4d 89 c4 8b 1f
Sep 27 19:56:45 larch kernel: [257806.937109] RSP
Sep 27 19:56:45 larch kernel: [257806.939215] ---[ end trace 3d94bfea732521fd ]---
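For reference, the first oops above can be summarized mechanically. This is a minimal Python sketch, not a standard tool: `summarize_oops` and `decode_page_fault` are hypothetical helpers that assume the rsyslog line prefix shown above and the x86_64 oops layout of a 4.9 kernel. It extracts the task, the faulting function (RIP), CR2 and the call-trace symbols, and decodes the `Oops: 0002` page-fault error code bits. It stops at the first `end trace` marker because the second report (`general protection fault`, `Tainted: G D`, where `D` means the kernel already oopsed) was raised while kswapd0 was being torn down after the first one (`rewind_stack_do_exit` and `do_exit` in its trace), so it is most likely fallout rather than a separate cause.

```python
import re

# Strips the rsyslog prefix, e.g. "Sep 27 19:56:44 larch kernel: [257806.553544] ".
SYSLOG_PREFIX = re.compile(r"^\w{3} +\d+ [\d:]+ \S+ kernel: \[\s*[\d.]+\] ?")
# "symbol+0xOFFSET/0xSIZE"; symbols may contain dots (shrink_slab.part.38).
SYMBOL = r"([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
PAGE_SIZE = 4096  # x86_64 base page size


def summarize_oops(text):
    """Summarize the FIRST oops in a syslog excerpt (hypothetical helper)."""
    summary = {"trace": []}
    for raw in text.splitlines():
        line = SYSLOG_PREFIX.sub("", raw).strip()
        if line.startswith("---[ end trace"):
            break  # later reports are often fallout from the first one
        m = re.match(r"(?:Oops|general protection fault): .*", line)
        if m:
            summary.setdefault("header", m.group(0))
        m = re.match(r"CPU: (\d+) PID: (\d+) Comm: (\S+) .*?(\d+\.\d+\.\d+\S*)", line)
        if m:
            summary["cpu"], summary["pid"], summary["comm"], summary["kernel"] = m.groups()
        m = re.match(r"RIP: .*?" + SYMBOL, line)
        if m:
            summary["rip"] = m.group(1)
        m = re.match(r"CR2: ([0-9a-f]+)", line)
        if m:
            summary.setdefault("cr2", m.group(1))
        m = re.match(r"\[<?[0-9a-f]*>?\]\s+(?:\? )?" + SYMBOL, line)
        if m:
            summary["trace"].append(m.group(1))
    return summary


def decode_page_fault(error_code_hex, cr2_hex):
    """Decode the x86 page-fault error code printed after "Oops:".

    Only meaningful for page-fault oopses; the number after
    "general protection fault:" is a different kind of error code.
    """
    code, cr2 = int(error_code_hex, 16), int(cr2_hex, 16)
    flags = [
        "protection violation" if code & 0x1 else "page not present",
        "write access" if code & 0x2 else "read access",
        "user mode" if code & 0x4 else "kernel mode",
    ]
    if code & 0x8:
        flags.append("reserved page-table bit set")
    if code & 0x10:
        flags.append("instruction fetch")
    # x86 reports faults below PAGE_SIZE as a NULL pointer dereference:
    # typically a NULL struct pointer plus the offset of the field accessed.
    target = "NULL pointer + 0x%x" % cr2 if cr2 < PAGE_SIZE else hex(cr2)
    return flags, target
```

Usage would look like `s = summarize_oops(open("/var/log/messages").read())` followed by `decode_page_fault(s["header"].split()[1], s["cr2"])`. On the excerpt above this describes a kernel-mode write to a not-present page at NULL + 0x108 inside `dentry_unlink_inode`, reached from kswapd0 through the dentry-cache shrinker (`prune_dcache_sb`, `shrink_slab`).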
So, a few questions:
1) Any idea what the log entries above are telling me?
2) I'd really like to know which process was causing the high CPU load between 16:00 and 20:00. Is there any way to find that out after the reboot?
3) I feel like I've been here a few times before but didn't know what to do next. Are there other obvious, systematic steps to take, either to better understand what happened last time or to be better prepared for the next time it happens?
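On questions 2 and 3: if the Prometheus setup only collects host-level metrics, per-process CPU history cannot be reconstructed after the fact unless something was already recording it. A minimal Python sketch of such a recorder, assuming a Linux `/proc` filesystem and a log path of your choosing; `read_cpu_ticks`, `top_cpu` and `log_snapshot` are hypothetical helpers, not an existing tool. It reads utime + stime from `/proc/<pid>/stat` twice, converts the difference to a CPU percentage using `SC_CLK_TCK`, and appends the busiest processes to a file with `fsync` so the record should survive a hard reset.

```python
import os
import time

CLK_TCK = os.sysconf("SC_CLK_TCK")  # clock ticks per second used in /proc/<pid>/stat


def read_cpu_ticks(proc="/proc"):
    """Map pid -> (comm, utime + stime in clock ticks) from /proc/<pid>/stat."""
    ticks = {}
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                data = f.read()
        except OSError:
            continue  # the process exited (or is unreadable) between listdir() and open()
        # comm is wrapped in parentheses and may itself contain spaces or ")".
        close = data.rindex(")")
        comm = data[data.index("(") + 1:close]
        fields = data[close + 1:].split()
        # fields[0] is field 3 (state), so utime (14) and stime (15) are 11 and 12.
        ticks[int(entry)] = (comm, int(fields[11]) + int(fields[12]))
    return ticks


def top_cpu(before, after, elapsed, n=5):
    """Return [(cpu_percent, pid, comm)] for the n busiest processes.

    Percentages can exceed 100 for multi-threaded processes on multi-core CPUs.
    """
    usage = []
    for pid, (comm, t1) in after.items():
        t0 = before[pid][1] if pid in before else 0  # new process: all its time is recent
        pct = 100.0 * max(t1 - t0, 0) / (CLK_TCK * elapsed)
        usage.append((pct, pid, comm))
    usage.sort(reverse=True)
    return usage[:n]


def log_snapshot(path, interval=10.0, n=5):
    """Sample for about `interval` seconds and append the top n processes to `path`."""
    start, before = time.monotonic(), read_cpu_ticks()
    time.sleep(interval)
    after, elapsed = read_cpu_ticks(), time.monotonic() - start
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as f:
        for pct, pid, comm in top_cpu(before, after, elapsed, n):
            f.write("%s %6.1f%% %6d %s\n" % (stamp, pct, pid, comm))
        f.flush()
        os.fsync(f.fileno())  # force the record to disk before a possible hard reset
```

For example, `log_snapshot("/var/log/cpu-top.log")` (a hypothetical path) could be called in a `while True:` loop from a systemd service, or once a minute from cron. A purpose-built recorder such as atop's logging mode covers the same need more thoroughly. If the machine is thrashing, this script may stall along with everything else, so where the log stops may itself be a useful clue.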
PS: I'm running Debian on this machine, but I had similar problems when I was running similar workloads on Arch.
Tags debian