Debian server keeps rebooting unexpectedly

My lab server, running Debian Wheezy 7.8 (stable), keeps rebooting without any notification, sometimes after a few hours of activity. The server is set up for numerical computation under considerably high load, as well as parallel computing. I printed the log from /var/log/messages and the output of last reboot, but I found these log messages hard to understand. I tried to find the entries from just before the reboot time by looking at the same time in /var/log/messages, but it seems that /var/log/messages only shows messages from after the reboot happened.

I have been searching around and found that some people have the same problem, but the cause seems to differ from case to case, and /var/log/messages seems to be the key to investigating it. What does my /var/log/messages actually say about this unwanted reboot? And how can a beginner start learning to read this log? I mean, are there any important keywords to look for, or something like that?

Thanks for any help you can provide.

last reboot

reboot   system boot  3.2.0-4-amd64    Wed May 20 03:29 - 12:43  (09:14)
reboot   system boot  3.2.0-4-amd64    Tue May 19 16:01 - 12:43  (20:42)

/var/log/messages

May 18 07:35:01 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2400" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
May 19 07:35:01 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2400" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
May 19 16:01:19 labserver kernel: imklog 5.8.11, log source = /proc/kmsg started.
May 19 16:01:19 labserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.11" x-pid="2401" x-info="http://www.rsyslog.com"] start
May 19 16:01:19 labserver kernel: [    0.000000] Initializing cgroup subsys cpuset
May 19 16:01:19 labserver kernel: [    0.000000] Initializing cgroup subsys cpu
May 19 16:01:19 labserver kernel: [    0.000000] Linux version 3.2.0-4-amd64 ([email protected]) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 SMP Debian 3.2.65-1+deb7u2
May 19 16:01:19 labserver kernel: [    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-4-amd64 root=UUID=1fc245ac-9058-4208-862a-7f4e8e1b20b2 ro text
May 19 16:01:19 labserver kernel: [    0.000000] BIOS-provided physical RAM map:
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000000000000 - 000000000009ac00 (usable)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000000009ac00 - 00000000000a0000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000000100000 - 000000007df71000 (usable)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007df71000 - 000000007e0f1000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007e0f1000 - 000000007e2ec000 (ACPI NVS)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007e2ec000 - 000000007f367000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 000000007f367000 - 000000007f800000 (ACPI NVS)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000080000000 - 0000000090000000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000fed1c000 - 00000000fed40000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 00000000ff000000 - 0000000100000000 (reserved)
May 19 16:01:19 labserver kernel: [    0.000000]  BIOS-e820: 0000000100000000 - 0000000880000000 (usable)
May 19 16:01:19 labserver kernel: [    0.000000] NX (Execute Disable) protection: active
May 19 16:01:19 labserver kernel: [    0.000000] SMBIOS 2.7 present.
May 19 16:01:19 labserver kernel: [    0.000000] No AGP bridge found
May 19 16:01:19 labserver kernel: [    0.000000] last_pfn = 0x880000 max_arch_pfn = 0x400000000
May 19 16:01:19 labserver kernel: [    0.000000] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
May 19 16:01:19 labserver kernel: [    0.000000] last_pfn = 0x7df71 max_arch_pfn = 0x400000000
May 19 16:01:19 labserver kernel: [    0.000000] found SMP MP-table at [ffff8800000fd900] fd900
May 19 16:01:19 labserver kernel: [    0.000000] Using GB pages for direct mapping
May 19 16:01:19 labserver kernel: [    0.000000] init_memory_mapping: 0000000000000000-000000007df71000
May 19 16:01:19 labserver kernel: [    0.000000] init_memory_mapping: 0000000100000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000] RAMDISK: 36bea000 - 375ed000
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: RSDP 00000000000f04a0 00024 (v02 ALASKA)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: XSDT 000000007e204088 0008C (v01 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FACP 000000007e211040 0010C (v05 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI Warning: FADT (revision 5) is longer than ACPI 2.0 version, truncating length 268 to 244 (20110623/tbfadt-288)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: DSDT 000000007e2041a8 0CE96 (v02 ALASKA    A M I 00000015 INTL 20051117)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FACS 000000007e2e3080 00040
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: APIC 000000007e211150 00100 (v03 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: FPDT 000000007e211250 00044 (v01 ALASKA    A M I 01072009 AMI  00010013)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: MCFG 000000007e211298 0003C (v01 ALASKA OEMMCFG. 01072009 MSFT 00000097)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HPET 000000007e2112d8 00038 (v01 ALASKA    A M I 01072009 AMI. 00000005)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: PRAD 000000007e211310 000BE (v02 PRADID  PRADTID 00000001 MSFT 03000001)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: SPMI 000000007e2113d0 00040 (v05 A M I   OEMSPMI 00000000 AMI. 00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: SSDT 000000007e211410 D0CB0 (v02  INTEL    CpuPm 00004000 INTL 20051117)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: EINJ 000000007e2e20c0 00130 (v01    AMI AMI EINJ 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: ERST 000000007e2e21f0 00230 (v01  AMIER AMI ERST 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HEST 000000007e2e2420 000A8 (v01    AMI AMI HEST 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: BERT 000000007e2e24c8 00030 (v01    AMI AMI BERT 00000000      00000000)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: DMAR 000000007e2e24f8 000C4 (v01 A M I   OEMDMAR 00000001 INTL 00000001)
May 19 16:01:19 labserver kernel: [    0.000000] No NUMA configuration found
May 19 16:01:19 labserver kernel: [    0.000000] Faking a node at 0000000000000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000] Initmem setup node 0 0000000000000000-0000000880000000
May 19 16:01:19 labserver kernel: [    0.000000]   NODE_DATA [000000087fffb000 - 000000087fffffff]
May 19 16:01:19 labserver kernel: [    0.000000] Zone PFN ranges:
May 19 16:01:19 labserver kernel: [    0.000000]   DMA      0x00000010 -> 0x00001000
May 19 16:01:19 labserver kernel: [    0.000000]   DMA32    0x00001000 -> 0x00100000
May 19 16:01:19 labserver kernel: [    0.000000]   Normal   0x00100000 -> 0x00880000
May 19 16:01:19 labserver kernel: [    0.000000] Movable zone start PFN for each node
May 19 16:01:19 labserver kernel: [    0.000000] early_node_map[3] active PFN ranges
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00000010 -> 0x0000009a
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00000100 -> 0x0007df71
May 19 16:01:19 labserver kernel: [    0.000000]     0: 0x00100000 -> 0x00880000
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: PM-Timer IO Port: 0x408
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x02] lapic_id[0x02] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x04] lapic_id[0x04] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x06] lapic_id[0x06] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x08] lapic_id[0x08] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x0a] lapic_id[0x0a] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x01] lapic_id[0x01] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x03] lapic_id[0x03] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x05] lapic_id[0x05] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x07] lapic_id[0x07] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x09] lapic_id[0x09] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC (acpi_id[0x0b] lapic_id[0x0b] enabled)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x00] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x06] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x08] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0a] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x05] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x07] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x09] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: LAPIC_NMI (acpi_id[0x0b] high edge lint[0x1])
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: IOAPIC (id[0x00] address[0xfec00000] gsi_base[0])
May 19 16:01:19 labserver kernel: [    0.000000] IOAPIC[0]: apic_id 0, version 32, address 0xfec00000, GSI 0-23
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: IOAPIC (id[0x02] address[0xfec01000] gsi_base[24])
May 19 16:01:19 labserver kernel: [    0.000000] IOAPIC[1]: apic_id 2, version 32, address 0xfec01000, GSI 24-47
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
May 19 16:01:19 labserver kernel: [    0.000000] Using ACPI (MADT) for SMP configuration information
May 19 16:01:19 labserver kernel: [    0.000000] ACPI: HPET id: 0x8086a701 base: 0xfed00000
May 19 16:01:19 labserver kernel: [    0.000000] SMP: Allowing 12 CPUs, 0 hotplug CPUs
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000000009a000 - 000000000009b000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000000009b000 - 00000000000a0000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000000a0000 - 00000000000e0000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000000e0000 - 0000000000100000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007df71000 - 000000007e0f1000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007e0f1000 - 000000007e2ec000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007e2ec000 - 000000007f367000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007f367000 - 000000007f800000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 000000007f800000 - 0000000080000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 0000000080000000 - 0000000090000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 0000000090000000 - 00000000fed1c000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000fed1c000 - 00000000fed40000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000fed40000 - 00000000ff000000
May 19 16:01:19 labserver kernel: [    0.000000] PM: Registered nosave memory: 00000000ff000000 - 0000000100000000
May 19 16:01:19 labserver kernel: [    0.000000] Allocating PCI resources starting at 90000000 (gap: 90000000:6ed1c000)
May 19 16:01:19 labserver kernel: [    0.000000] Booting paravirtualized kernel on bare hardware
May 19 16:01:19 labserver kernel: [    0.000000] setup_percpu: NR_CPUS:512 nr_cpumask_bits:512 nr_cpu_ids:12 nr_node_ids:1
May 19 16:01:19 labserver kernel: [    0.000000] PERCPU: Embedded 27 pages/cpu @ffff88087fc00000 s78848 r8192 d23552 u131072
May 19 16:01:19 labserver kernel: [    0.000000] Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 8258294
May 19 16:01:19 labserver kernel: [    0.000000] Policy zone: Normal
May 19 16:01:19 labserver kernel: [    0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-3.2.0-4-amd64 root=UUID=1fc245ac-9058-4208-862a-7f4e8e1b20b2 ro text
May 19 16:01:19 labserver kernel: [    0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
May 19 16:01:19 labserver kernel: [    0.000000] xsave/xrstor: enabled xstate_bv 0x7, cntxt size 0x340
May 19 16:01:19 labserver kernel: [    0.000000] Checking aperture...
May 19 16:01:19 labserver kernel: [    0.000000] No AGP bridge found
May 19 16:01:19 labserver kernel: [    0.000000] Memory: 32975732k/35651584k available (3434k kernel code, 2130964k absent, 544888k reserved, 3305k data, 576k init)
May 19 16:01:19 labserver kernel: [    0.000000] Hierarchical RCU implementation.
May 19 16:01:19 labserver kernel: [    0.000000]    RCU dyntick-idle grace-period acceleration is enabled.
May 19 16:01:19 labserver kernel: [    0.000000] NR_IRQS:33024 nr_irqs:1184 16
May 19 16:01:19 labserver kernel: [    0.000000] Extended CMOS year: 2000
May 19 16:01:19 labserver kernel: [    0.000000] Console: colour VGA+ 80x25
May 19 16:01:19 labserver kernel: [    0.000000] console [tty0] enabled
May 19 16:01:19 labserver kernel: [    0.000000] Fast TSC calibration using PIT
May 19 16:01:19 labserver kernel: [    0.004000] Detected 2100.074 MHz processor.
May 19 16:01:19 labserver kernel: [    0.000003] Calibrating delay loop (skipped), value calculated using timer frequency.. 4200.14 BogoMIPS (lpj=8400296)
May 19 16:01:19 labserver kernel: [    0.000144] pid_max: default: 32768 minimum: 301
May 19 16:01:19 labserver kernel: [    0.000253] Security Framework initialized
May 19 16:01:19 labserver kernel: [    0.000324] AppArmor: AppArmor disabled by boot time parameter
May 19 16:01:19 labserver kernel: [    0.002355] Dentry cache hash table entries: 4194304 (order: 13, 33554432 bytes)
May 19 16:01:19 labserver kernel: [    0.011585] Inode-cache hash table entries: 2097152 (order: 12, 16777216 bytes)
May 19 16:01:19 labserver kernel: [    0.015724] Mount-cache hash table entries: 256
May 19 16:01:19 labserver kernel: [    0.015915] Initializing cgroup subsys cpuacct
May 19 16:01:19 labserver kernel: [    0.015986] Initializing cgroup subsys memory
May 19 16:01:19 labserver kernel: [    0.016063] Initializing cgroup subsys devices
May 19 16:01:19 labserver kernel: [    0.016133] Initializing cgroup subsys freezer
May 19 16:01:19 labserver kernel: [    0.016201] Initializing cgroup subsys net_cls
May 19 16:01:19 labserver kernel: [    0.016270] Initializing cgroup subsys blkio
May 19 16:01:19 labserver kernel: [    0.016344] Initializing cgroup subsys perf_event
May 19 16:01:19 labserver kernel: [    0.016441] CPU: Physical Processor ID: 0
May 19 16:01:19 labserver kernel: [    0.016509] CPU: Processor Core ID: 0
May 19 16:01:19 labserver kernel: [    0.017564] mce: CPU supports 23 MCE banks
May 19 16:01:19 labserver kernel: [    0.017670] CPU0: Thermal monitoring enabled (TM1)
May 19 16:01:19 labserver kernel: [    0.017768] using mwait in idle threads.
May 19 16:01:19 labserver kernel: [    0.018315] ACPI: Core revision 20110623
May 19 16:01:19 labserver kernel: [    0.049889] DMAR: Host address width 46
May 19 16:01:19 labserver kernel: [    0.049958] DMAR: DRHD base: 0x000000fbffc000 flags: 0x1
May 19 16:01:19 labserver kernel: [    0.050034] IOMMU 0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
May 19 16:01:19 labserver kernel: [    0.050122] DMAR: RMRR base: 0x0000007f239000 end: 0x0000007f247fff
May 19 16:01:19 labserver kernel: [    0.050195] DMAR: ATSR flags: 0x0
May 19 16:01:19 labserver kernel: [    0.050261] DMAR: RHSA base: 0x000000fbffc000 proximity domain: 0x0
May 19 16:01:19 labserver kernel: [    0.050427] IOAPIC id 0 under DRHD base  0xfbffc000 IOMMU 0
May 19 16:01:19 labserver kernel: [    0.050497] IOAPIC id 2 under DRHD base  0xfbffc000 IOMMU 0
May 19 16:01:19 labserver kernel: [    0.050568] HPET id 0 under DRHD base 0xfbffc000
May 19 16:01:19 labserver kernel: [    0.050741] Enabled IRQ remapping in x2apic mode
May 19 16:01:19 labserver kernel: [    0.050810] Enabling x2apic
May 19 16:01:19 labserver kernel: [    0.050875] Enabled x2apic
May 19 16:01:19 labserver kernel: [    0.050943] Switched APIC routing to cluster x2apic.
May 19 16:01:19 labserver kernel: [    0.051552] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
May 19 16:01:19 labserver kernel: [    0.091256] CPU0: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz stepping 04
May 19 16:01:19 labserver kernel: [    0.195570] Performance Events: PEBS fmt1+, generic architected perfmon, Intel PMU driver.
May 19 16:01:19 labserver kernel: [    0.195802] ... version:                3
May 19 16:01:19 labserver kernel: [    0.195869] ... bit width:              48
May 19 16:01:19 labserver kernel: [    0.195936] ... generic registers:      4
May 19 16:01:19 labserver kernel: [    0.196003] ... value mask:             0000ffffffffffff
May 19 16:01:19 labserver kernel: [    0.196073] ... max period:             000000007fffffff
May 19 16:01:19 labserver kernel: [    0.196143] ... fixed-purpose events:   3
May 19 16:01:19 labserver kernel: [    0.196210] ... event mask:             000000070000000f
May 19 16:01:19 labserver kernel: [    0.196468] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.196637] Booting Node   0, Processors  #1
May 19 16:01:19 labserver kernel: [    0.312587] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.312765]  #2
May 19 16:01:19 labserver kernel: [    0.424400] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.424578]  #3
May 19 16:01:19 labserver kernel: [    0.536316] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.536489]  #4
May 19 16:01:19 labserver kernel: [    0.648124] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.648303]  #5
May 19 16:01:19 labserver kernel: [    0.759941] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.760115]  #6
May 19 16:01:19 labserver kernel: [    0.871864] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.872050]  #7
May 19 16:01:19 labserver kernel: [    0.983690] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    0.983866]  #8
May 19 16:01:19 labserver kernel: [    1.095600] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.095774]  #9
May 19 16:01:19 labserver kernel: [    1.207414] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.207589]  #10
May 19 16:01:19 labserver kernel: [    1.319223] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.319400]  #11 Ok.
May 19 16:01:19 labserver kernel: [    1.431095] NMI watchdog enabled, takes one hw-pmu counter.
May 19 16:01:19 labserver kernel: [    1.431192] Brought up 12 CPUs
May 19 16:01:19 labserver kernel: [    1.431260] Total of 12 processors activated (50398.84 BogoMIPS).
May 19 16:01:19 labserver kernel: [    1.450786] devtmpfs: initialized
May 19 16:01:19 labserver kernel: [    1.455360] PM: Registering ACPI NVS region at 7e0f1000 (2076672 bytes)
May 19 16:01:19 labserver kernel: [    1.455494] PM: Registering ACPI NVS region at 7f367000 (4820992 bytes)
May 19 16:01:19 labserver kernel: [    1.455843] print_constraints: dummy: 
May 19 16:01:19 labserver kernel: [    1.455977] NET: Registered protocol family 16
May 19 16:01:19 labserver kernel: [    1.456140] ACPI: bus type pci registered
May 19 16:01:19 labserver kernel: [    1.456268] PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000)
May 19 16:01:19 labserver kernel: [    1.456361] PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] reserved in E820
May 19 16:01:19 labserver kernel: [    1.466673] PCI: Using configuration type 1 for base access
May 19 16:01:19 labserver kernel: [    1.468173] bio: create slab <bio-0> at 0
May 19 16:01:19 labserver kernel: [    1.468353] ACPI: Added _OSI(Module Device)
May 19 16:01:19 labserver kernel: [    1.468422] ACPI: Added _OSI(Processor Device)
May 19 16:01:19 labserver kernel: [    1.468491] ACPI: Added _OSI(3.0 _SCP Extensions)
May 19 16:01:19 labserver kernel: [    1.468560] ACPI: Added _OSI(Processor Aggregator Device)
May 19 16:01:19 labserver kernel: [    1.484562] ACPI: Executed 1 blocks of module-level executable AML code
May 19 16:01:19 labserver kernel: [    1.727818] ACPI: Interpreter enabled
May 19 16:01:19 labserver kernel: [    1.727891] ACPI: (supports S0 S1 S4 S5)
May 19 16:01:19 labserver kernel: [    1.728159] ACPI: Using IOAPIC for interrupt routing
May 19 16:01:19 labserver kernel: [    1.736531] ACPI: No dock devices found.
May 19 16:01:19 labserver kernel: [    1.736630] HEST: Table parsing has been initialized.
May 19 16:01:19 labserver kernel: [    1.736704] PCI: Using host bridge windows from ACPI; if necessary, use "pci=nocrs" and report a bug
May 19 16:01:19 labserver kernel: [    1.737041] ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
May 19 16:01:19 labserver kernel: [    1.737361] pci_root PNP0A08:00: host bridge window [io  0x0000-0x03af]
May 19 16:01:19 labserver kernel: [    1.737435] pci_root PNP0A08:00: host bridge window [io  0x03e0-0x0cf7]
May 19 16:01:19 labserver kernel: [    1.737508] pci_root PNP0A08:00: host bridge window [io  0x03b0-0x03df]
May 19 16:01:19 labserver kernel: [    1.737586] pci_root PNP0A08:00: host bridge window [io  0x0d00-0xffff]
May 19 16:01:19 labserver kernel: [    1.737659] pci_root PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff]
May 19 16:01:19 labserver kernel: [    1.737747] pci_root PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff]
May 19 16:01:19 labserver kernel: [    1.737834] pci_root PNP0A08:00: host bridge window [mem 0xfed0e000-0xfed0ffff]
May 19 16:01:19 labserver kernel: [    1.737922] pci_root PNP0A08:00: host bridge window [mem 0x80000000-0xfbffffff]
May 19 16:01:19 labserver kernel: [    1.740791] pci 0000:00:01.0: PCI bridge to [bus 01-01]
May 19 16:01:19 labserver kernel: [    1.745575] pci 0000:00:01.1: PCI bridge to [bus 02-03]
May 19 16:01:19 labserver kernel: [    1.745700] pci 0000:00:02.0: PCI bridge to [bus 04-04]
May 19 16:01:19 labserver kernel: [    1.745816] pci 0000:00:03.0: PCI bridge to [bus 05-05]
May 19 16:01:19 labserver kernel: [    1.745933] pci 0000:00:03.2: PCI bridge to [bus 06-06]
May 19 16:01:19 labserver kernel: [    1.746285] pci 0000:00:11.0: PCI bridge to [bus 07-07]
May 19 16:01:19 labserver kernel: [    1.746541] pci 0000:00:1e.0: PCI bridge to [bus 08-08] (subtractive decode)
May 19 16:01:19 labserver kernel: [    1.747170]  pci0000:00: Requesting ACPI _OSC control (0x1d)
May 19 16:01:19 labserver kernel: [    1.747465]  pci0000:00: ACPI _OSC control (0x15) granted
May 19 16:01:19 labserver kernel: [    1.756901] ACPI: PCI Root Bridge [UNC0] (domain 0000 [bus ff])
May 19 16:01:19 labserver kernel: [    1.758443]  pci0000:ff: Requesting ACPI _OSC control (0x1d)
May 19 16:01:19 labserver kernel: [    1.758528]  pci0000:ff: ACPI _OSC control (0x1d) granted
May 19 16:01:19 labserver kernel: [    1.759439] ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 10 *11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.760105] ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 *10 11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.760768] ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 *5 6 10 11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.761383] ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 10 *11 12 14 15)
May 19 16:01:19 labserver kernel: [    1.762006] ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.762729] ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.763450] ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 11 12 14 15) *0
May 19 16:01:19 labserver kernel: [    1.764170] ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 *7 10 11 12 14 15)
    
by Franky 20.05.2015 / 10:15

1 answer

You need to provide more information, especially the log entries from just before the system rebooted. As far as I can see, though, they may not tell you much more. Check other logs as well, such as syslog.
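One way to pull out the entries logged just before each boot is to look for the line that marks a fresh start of logging and print what precedes it. The following is a minimal Python sketch, not an established tool: the function names `entries_before_boot` and `print_report` are made up for illustration, and the boot marker is an assumption based on the rsyslog `imklog ... started.` line visible in the excerpt above.

```python
import re
from collections import deque

# Assumption: a fresh boot is recognised by rsyslog's kernel-log module
# announcing itself, as in the excerpt ("kernel: imklog 5.8.11, ... started.").
BOOT_MARKER = re.compile(r"kernel: imklog .* started\.")


def entries_before_boot(lines, context=5):
    """Return, for each boot marker, the `context` lines logged just before it."""
    recent = deque(maxlen=context)  # sliding window over the latest lines
    results = []
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_MARKER.search(line):
            results.append(list(recent))
            recent.clear()
        else:
            recent.append(line)
    return results


def print_report(path="/var/log/messages", context=5):
    """Print the lines preceding every boot found in the given log file."""
    # Reading this file usually requires root or membership of the adm group.
    with open(path, errors="replace") as log:
        blocks = entries_before_boot(log, context)
    for number, block in enumerate(blocks, start=1):
        print(f"--- before boot #{number} ---")
        print("\n".join(block))
```

Calling `print_report()` on the excerpt above would show only the two `rsyslogd was HUPed` lines before the 16:01 boot. Those are typically produced by routine log rotation, which suggests nothing was written at the moment of the crash.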

In my experience, sudden reboots with no indication of what actually went wrong are usually hardware related. Otherwise the kernel would have had a chance to write something to the logs to give a clue.
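When the kernel does get to write something, a keyword scan is a reasonable first pass for a beginner. Below is a rough Python sketch under stated assumptions: the pattern list is only a starting point of phrases the Linux kernel is known to print for panics, oopses, out-of-memory kills, machine-check errors, lockups, thermal throttling and disk errors. It is not exhaustive, and the names `SUSPICIOUS`, `find_suspicious` and `scan_file` are illustrative.

```python
import re

# Assumption: a starting list of case-insensitive patterns, not an exhaustive one.
SUSPICIOUS = re.compile(
    r"kernel panic|oops|\bBUG:|out of memory|machine check|hardware error"
    r"|soft lockup|temperature above threshold|i/o error",
    re.IGNORECASE,
)


def find_suspicious(lines):
    """Yield (line_number, text) for every line matching one of the patterns."""
    for number, line in enumerate(lines, start=1):
        if SUSPICIOUS.search(line):
            yield number, line.rstrip("\n")


def scan_file(path):
    """Scan one log file and return all suspicious lines as a list."""
    with open(path, errors="replace") as log:
        return list(find_suspicious(log))
```

For example, `scan_file("/var/log/messages")` or `scan_file("/var/log/syslog")` returns the matching lines with their line numbers. An empty result for the hours before a reboot fits the hardware explanation above.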

Some common causes of sudden reboots:

  • Overheating, probably the leading cause. Get an idea of the temperature and try to log it. Does the server have a display that can show the temperature? Is the room properly cooled? Perhaps replace the thermal compound on the heatsinks covering the CPU(s).

  • Bad hardware or drivers. Get a list of the hardware using "lspci", for example. A bad DIMM can cause a system to halt and/or reboot suddenly (reseat the DIMMs, CPUs and cards). I recall a server that rebooted occasionally because of an issue with its Intel Ethernet card. Sometimes a faulty disk can also cause such problems, though typically the system just halts instead of rebooting.

  • A bad UPS. I recall a battery-backed UPS slowly going bad, and one of the signs was a regular weekly power cycle of the servers attached to it. You might also have a badly configured power-cycle schedule.
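To follow up on the overheating suggestion, temperatures can be logged periodically and the last reading compared with the reboot times reported by last reboot. This is a minimal Python sketch, assuming the kernel exposes sensor values through sysfs files in millidegrees Celsius. Which of the listed paths exist depends on the kernel and the hardware, and the function names are illustrative. The `sensors` command from lm-sensors is an alternative if it is installed.

```python
import glob
import os
import time

# Assumption: sensors appear as sysfs files holding millidegrees Celsius.
# Which of these paths exist depends on the kernel and the hardware.
DEFAULT_PATTERNS = (
    "/sys/class/thermal/thermal_zone*/temp",
    "/sys/class/hwmon/hwmon*/temp*_input",
    "/sys/class/hwmon/hwmon*/device/temp*_input",
)


def read_temperatures(patterns=DEFAULT_PATTERNS):
    """Return {sensor_path: degrees_celsius} for every readable sensor file."""
    readings = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as sensor:
                    readings[path] = int(sensor.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor vanished or returned something unexpected
    return readings


def log_temperatures(logfile, interval=60, patterns=DEFAULT_PATTERNS):
    """Append one timestamped line per interval until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        temps = read_temperatures(patterns)
        values = " ".join(f"{path}={temp:.1f}" for path, temp in temps.items())
        with open(logfile, "a") as out:
            out.write(f"{stamp} {values}\n")
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before a possible crash
        time.sleep(interval)
```

Running `log_temperatures("/var/tmp/temps.log")` in the background during a heavy job leaves a record that survives the reboot. The path is only an example.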

answered 20.05.2015 / 10:56