Solucionando problemas de registros da NIC por NETDEV WATCHDOG

1

Eu tenho um servidor CentOS 6.6 (Linux 2.6.32-504) rodando em um novo Dell PowerEdge R320. No outro dia, comecei a perceber problemas em / var / log / messages relacionados a um dos meus NICs.

Nota: este é um cluster de servidores que requer que a eth0 seja usada em uma rede crossover para um servidor de backup redundante, então usamos eth1 para conversar com a rede do cliente e o mundo; então seria desafiador alternar as NICs para fins de solução de problemas. Eu também notarei que não estou vendo nenhum problema no eth0.

Eu gostaria de solucionar isso o máximo possível ~ é uma falha de hardware, o kernel do Linux não gosta do hardware da Dell, etc.? Eu não sou um administrador Linux de carreira, então eu sou meio que verde nesse sentido.

As pesquisas do Google parecem ter mais perguntas para mim do que respostas (e só parecem se aplicar a outras distros; mas apontam possibilidades como problemas de kernel / firmware. Eu imploro humildemente pelos conhecimentos e idéias dessa comunidade!

A única coisa que notei foi que a interface estava voltando para 100Mb / half (quando está realmente em 1Gb / full).

Eu fornecerei o máximo que puder, isso é relevante.

Aqui está o snippet de /var/log/messages onde eu notei o problema ... isso significa alguma coisa para alguém?

Mar 22 16:10:07 ind1un043 kernel: ------------[ cut here ]------------
Mar 22 16:10:07 ind1un043 kernel: WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0x26b/0x280() (Not tainted)
Mar 22 16:10:07 ind1un043 kernel: Hardware name: PowerEdge R320
Mar 22 16:10:07 ind1un043 kernel: NETDEV WATCHDOG: eth1 (tg3): transmit queue 0 timed out
Mar 22 16:10:07 ind1un043 kernel: Modules linked in: nls_utf8 hfsplus(U) mpt3sas mpt2sas scsi_transport_sas raid_class mptctl mptbase dell_rbu coretemp 8021q garp stp llc wctc4xxp(U) dahdi_transcode(U) wcb4xxp(U) wctdm(U) wcfxo(U) wctdm24xxp(U) wcte11xp(U) wct1xxp(U) wcte12xp(U) dahdi_voicebus(U) wct4xxp(U) oct612x(U) dahdi(U) crc_ccitt ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 uinput iTCO_wdt iTCO_vendor_support microcode dcdbas sb_edac edac_core ipmi_devintf power_meter acpi_ipmi ipmi_si ipmi_msghandler sg shpchp tg3 ptp pps_core lpc_ich mfd_core ext4 jbd2 mbcache sr_mod cdrom sd_mod crc_t10dif ahci wmi megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]
Mar 22 16:10:07 ind1un043 kernel: Pid: 9, comm: ksoftirqd/1 Not tainted 2.6.32-504.el6.x86_64 #1
Mar 22 16:10:07 ind1un043 kernel: Call Trace:
Mar 22 16:10:07 ind1un043 kernel: <IRQ>  [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81074ee6>] ? warn_slowpath_fmt+0x46/0x50
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8147df7b>] ? dev_watchdog+0x26b/0x280
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81060d7c>] ? scheduler_tick+0xcc/0x260
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8147dd10>] ? dev_watchdog+0x0/0x280
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81087db7>] ? run_timer_softirq+0x197/0x340
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff810b0275>] ? tick_dev_program_event+0x65/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d8b1>] ? __do_softirq+0xc1/0x1e0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff810b034a>] ? tick_program_event+0x2a/0x30
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100fc15>] ? do_softirq+0x65/0xa0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d765>] ? irq_exit+0x85/0x90
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff81533c0a>] ? smp_apic_timer_interrupt+0x4a/0x60
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100bb93>] ? apic_timer_interrupt+0x13/0x20
Mar 22 16:10:07 ind1un043 kernel: <EOI>  [<ffffffff8107d4c5>] ? ksoftirqd+0xd5/0x110
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8107d3f0>] ? ksoftirqd+0x0/0x110
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8109e66e>] ? kthread+0x9e/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Mar 22 16:10:07 ind1un043 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Mar 22 16:10:07 ind1un043 kernel: ---[ end trace c154004f7af06fb3 ]---
Mar 22 16:10:07 ind1un043 kernel: tg3 0000:02:00.1: eth1: transmit timed out, resetting
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000000: 0x165f14e4, 0x00100406, 0x02000000, 0x00800010
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000010: 0xd90d000c, 0x00000000, 0xd90e000c, 0x00000000
Mar 22 16:10:08 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00000020: 0xd90f000c, 0x00000000, 0x00000000, 0x04f71028
... (TOO MANY CHARACTERS TO POST ON SERVER FAULT, SO STRIPPING) ...
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00007020: 0x00000000, 0x00000000, 0x00000406, 0x10004000
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0x00007030: 0x000e0000, 0x00004af8, 0x00170030, 0x00000000
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0: Host status block [00000001:00000063:(0000:0695:0000):(0000:01b8)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 0: NAPI info [00000063:00000063:(01a5:01b8:01ff):0000:(0095:0000:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 1: Host status block [00000001:000000ac:(0000:0000:0000):(083a:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 1: NAPI info [000000ac:000000ac:(0000:0000:01ff):083a:(003a:003a:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 2: Host status block [00000001:0000004f:(089e:0000:0000):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 2: NAPI info [00000046:00000046:(0000:0000:01ff):0895:(0095:0095:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 3: Host status block [00000001:00000012:(0000:0000:0000):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 3: NAPI info [00000012:00000012:(0000:0000:01ff):0fbc:(07bc:07bc:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 4: Host status block [00000001:000000d4:(0000:0000:0742):(0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: 4: NAPI info [000000d4:000000d4:(0000:0000:01ff):0742:(0742:0742:0000:0000)]
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: tg3_stop_block timed out, ofs=1400 enable_bit=2
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: tg3_stop_block timed out, ofs=c00 enable_bit=2
Mar 22 16:10:09 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is down
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 22 16:10:11 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled

Mas antes disso, aqui está o trecho de /var/log/messages onde a NIC inicializa normalmente durante uma inicialização ...

Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 14:18:77:32:77:3f
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: dma_rwctrl[00000001] dma_mask[64-bit]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: PCI INT B -> GSI 17 (level, low) -> IRQ 17
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 14:18:77:32:77:40
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: dma_rwctrl[00000001] dma_mask[64-bit]

Aqui está o trecho de /var/log/messages onde a NIC aparece normalmente, mas não tem certeza em qual contexto (é logo após o init imediatamente acima deste trecho) ...

Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth0: link is not ready
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Link is up at 1000 Mbps, full duplex
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: Flow control is on for TX and on for RX
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.0: eth0: EEE is disabled
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 21 16:01:13 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled
Mar 21 16:01:13 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

E aqui está o trecho de /var/log/messages , onde a NIC parece ter outros problemas em breve, renegociando ou algo assim. Isso acontece aleatoriamente várias vezes nos próximos dias ...

Mar 21 16:01:42 ind1un043 kernel: ADDRCONF(NETDEV_UP): eth1: link is not ready
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: Link is up at 100 Mbps, half duplex
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: Flow control is off for TX and off for RX
Mar 21 16:01:44 ind1un043 kernel: tg3 0000:02:00.1: eth1: EEE is disabled
Mar 21 16:01:44 ind1un043 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready

Aqui está o meu lspci | grep -i net ...

02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5720 Gigabit Ethernet PCIe
    
por csr19us 25.03.2016 / 16:14

1 resposta

0

Não tenho certeza se isso é exatamente uma "resposta" ou não, mas isso resolveu meu problema, no que diz respeito a esse caso específico. Isso parece estar relacionado à ACPI.

Eu editei meu arquivo /etc/grub.conf para adicionar o seguinte ao final da linha kernel :

noapic acpi=off pci=noacpi

Agora não vejo mais erros no log.

Poderia ser que os recursos de economia de energia estavam derrubando a interface e trazendo-a de volta em um modo mais lento? Ou poderia ser algum bug com o hardware / firmware e este kernel? Não tenho certeza ~ mas isso, pelo menos, corrige meu problema (é um servidor de produção, cuja rede nunca queremos ficar inoperantes).

    
por 29.03.2016 / 15:05