Root-Causing and Fixing NIC Buffer Overruns on 10Gb Interfaces in Linux (SCTP)


I'm seeing a high rate of packet errors (almost all of them overruns) on both 10Gb NICs attached to my Linux server. The system handles high volumes of SCTP traffic (very little TCP), so this is probably a Linux kernel tuning issue.

However, none of the tuning parameters I've tried so far seem to have much effect, and I'm still seeing high volumes of packet overruns. Any pointers on other things I could try to get the system handling packets efficiently would be much appreciated!

:~# ifconfig ens4f1

ens4f1    Link encap:Ethernet  HWaddr 5c:b9:01:de:0d:4c  
      UP BROADCAST RUNNING PROMISC MULTICAST  MTU:9000  Metric:1
      RX packets:22313514162 errors:17598241316 dropped:68 overruns:17598241316 frame:0
      TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000 
      RX bytes:31767480894219 (31.7 TB)  TX bytes:0 (0.0 B)
      Interrupt:17 Memory:c9800000-c9ffffff 

System details:

  • OS: Ubuntu Linux (4.11.0-14-generic #20~16.04.1-Ubuntu SMP x86_64)
  • CPU cores: 72
  • NIC model: NetXtreme II BCM57810 10 Gigabit Ethernet
  • RAM: 240 GiB

Sample NIC statistics showing the packet error rate:

 for i in $(seq 1 10); do echo "$i) $(date) - $(ifconfig ens4f0 | grep RX | grep overruns)"; sleep 5; done

1) Thu Oct 12 19:50:40 SGT 2017 - RX packets:8364065830 errors:2594507718 dropped:215 overruns:2594507718 frame:0
2) Thu Oct 12 19:50:45 SGT 2017 - RX packets:8365336060 errors:2596662672 dropped:215 overruns:2596662672 frame:0
3) Thu Oct 12 19:50:50 SGT 2017 - RX packets:8366602087 errors:2598840959 dropped:215 overruns:2598840959 frame:0
4) Thu Oct 12 19:50:55 SGT 2017 - RX packets:8367881271 errors:2600989229 dropped:215 overruns:2600989229 frame:0
5) Thu Oct 12 19:51:01 SGT 2017 - RX packets:8369147536 errors:2603157030 dropped:215 overruns:2603157030 frame:0
6) Thu Oct 12 19:51:06 SGT 2017 - RX packets:8370149567 errors:2604904183 dropped:215 overruns:2604904183 frame:0
7) Thu Oct 12 19:51:11 SGT 2017 - RX packets:8371298018 errors:2607183939 dropped:215 overruns:2607183939 frame:0
8) Thu Oct 12 19:51:16 SGT 2017 - RX packets:8372455587 errors:2609411186 dropped:215 overruns:2609411186 frame:0
9) Thu Oct 12 19:51:21 SGT 2017 - RX packets:8373585102 errors:2611680597 dropped:215 overruns:2611680597 frame:0
10) Thu Oct 12 19:51:26 SGT 2017 - RX packets:8374678508 errors:2614053000 dropped:215 overruns:2614053000 frame:0
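The counters above are cumulative, so rates only show up as differences between samples. In the first sample set, overruns grow by 19,545,282 in 46 seconds (about 425,000 per second), while RX packets grow by 10,612,678 (about 231,000 per second). The sketch below automates that arithmetic for the loop's output. It assumes the default `date` format and that all samples fall on the same day, since only HH:MM:SS is parsed. `overrun_rates` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: convert the cumulative ifconfig counters printed by the sampling
# loop into per-second rates between consecutive samples. Assumes lines like
#   "N) Thu Oct 12 19:50:40 SGT 2017 - RX packets:... overruns:... frame:0"
# and that all samples are from the same day (only HH:MM:SS is used).
overrun_rates() {
  awk '/overruns:/ {
    split($5, t, ":")                               # $5 is the HH:MM:SS field
    now = t[1] * 3600 + t[2] * 60 + t[3]
    for (i = 1; i <= NF; i++) {
      if ($i ~ /^packets:/)  pkts = substr($i, 9)   # text after "packets:"
      if ($i ~ /^overruns:/) ovr  = substr($i, 10)  # text after "overruns:"
    }
    if (have_prev && now > prev_t) {
      dt = now - prev_t
      printf "%ds  rx_pkts/s=%.0f  overruns/s=%.0f\n", dt, (pkts - prev_p) / dt, (ovr - prev_o) / dt
    }
    prev_t = now; prev_p = pkts; prev_o = ovr; have_prev = 1
  }'
}

# Usage:
# for i in $(seq 1 10); do echo "$i) $(date) - $(ifconfig ens4f0 | grep RX | grep overruns)"; sleep 5; done | overrun_rates
```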

However, checking with tc shows no ring buffer overruns on the NIC:

tc -s qdisc show dev ens4f0|egrep drop

Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) 
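The qdiscs listed by `tc -s qdisc show` sit on the transmit path, so receive-side overruns would not appear in their counters. The driver's own statistics from `ethtool -S` are a more direct place to look for RX drops. Counter names are driver-specific, so the name pattern below is an assumption and may need adjusting for bnx2x. `nonzero_drop_counters` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: from `ethtool -S` output on stdin, print only non-zero counters
# whose names suggest drops, discards or FIFO/buffer exhaustion.
# The name pattern is a guess; adjust it to the driver's actual counters.
nonzero_drop_counters() {
  awk '{
    value = $NF                                  # count is the last field
    name = $0
    sub(/:[ \t]*[0-9]+[ \t]*$/, "", name)        # drop the trailing ": <count>"
    gsub(/^[ \t]+/, "", name)                    # drop ethtool indentation
    if (tolower(name) ~ /drop|discard|fifo|no_buff|overrun/ && value + 0 > 0)
      print name ": " value
  }'
}

# Usage (as root): ethtool -S ens4f0 | nonzero_drop_counters
```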

Checking TCP retransmissions, the rate is low:

  for i in $(seq 1 10); do echo "$(date) - $(netstat -s | grep -i retransmited)"; sleep 2; done

 Thu Oct 12 20:04:29 SGT 2017 - 10633 segments retransmited
 Thu Oct 12 20:04:31 SGT 2017 - 10634 segments retransmited
 Thu Oct 12 20:04:33 SGT 2017 - 10636 segments retransmited
 Thu Oct 12 20:04:35 SGT 2017 - 10636 segments retransmited
 Thu Oct 12 20:04:37 SGT 2017 - 10638 segments retransmited
 Thu Oct 12 20:04:39 SGT 2017 - 10639 segments retransmited
 Thu Oct 12 20:04:41 SGT 2017 - 10640 segments retransmited
 Thu Oct 12 20:04:43 SGT 2017 - 10640 segments retransmited
 Thu Oct 12 20:04:45 SGT 2017 - 10643 segments retransmited
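netstat's wording varies between versions (note its "retransmited" spelling). The same counter can instead be read from `/proc/net/snmp`, where TCP retransmissions appear as the `RetransSegs` column of the `Tcp:` lines. This counter covers TCP only, so it says nothing about the SCTP traffic. A minimal sketch; `tcp_retrans_segs` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print TCP RetransSegs from /proc/net/snmp (or a file given as $1).
# The first "Tcp:" line holds the column names, the second holds the values.
tcp_retrans_segs() {
  awk '$1 == "Tcp:" {
         if (!seen_header) {
           for (i = 2; i <= NF; i++) col[$i] = i   # map column name to index
           seen_header = 1
         } else {
           print $(col["RetransSegs"])
           exit
         }
       }' "${1:-/proc/net/snmp}"
}

# Usage: a=$(tcp_retrans_segs); sleep 2; b=$(tcp_retrans_segs); echo "$((b - a)) retransmitted segments in 2 s"
```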

What I've tried so far:

  • Tuning the NIC parameters (interrupt coalescing, offloads, larger NIC ring buffers, etc.):

    ethtool -L ens4f0 combined 30

    ethtool -K ens4f0 rx on tx on sg on tso on

    ethtool -C ens4f0 rx-usecs 96

    ethtool -C ens4f0 adaptive-rx on

    ethtool -G ens4f0 rx 4078 tx 4078

  • sysctl kernel tuning (mostly increasing the kernel's TCP buffers):

    sysctl -w net.ipv4.tcp_low_latency=1

    sysctl -w net.ipv4.tcp_max_syn_backlog=16384

    sysctl -w net.core.optmem_max=20480000

    sysctl -w net.core.netdev_max_backlog=5000000

    sysctl -w net.ipv4.tcp_rmem="65536 1747600 83886080"

    sysctl -w net.core.somaxconn=1280

    sysctl -w kernel.sched_min_granularity_ns=10000000

    sysctl -w kernel.sched_wakeup_granularity_ns=15000000

    sysctl -w net.ipv4.tcp_wmem="65536 1747600 83886080"

    sysctl -w net.core.wmem_max=2147483647

    sysctl -w net.core.wmem_default=2147483647

    sysctl -w net.core.rmem_max=2147483647

    sysctl -w net.core.rmem_default=2147483647

    sysctl -w net.ipv4.tcp_congestion_control=cubic

    sysctl -w net.ipv4.tcp_rmem="163840 3495200 268754560"

    sysctl -w net.ipv4.tcp_wmem="163840 3495200 268754560"

    sysctl -w net.ipv4.udp_rmem_min="163840 3495200 268754560"

    sysctl -w net.ipv4.udp_wmem_min="163840 3495200 268754560"

    sysctl -w net.ipv4.tcp_mem="268754560 268754560 268754560"

    sysctl -w net.ipv4.udp_mem="268754560 268754560 268754560"

    sysctl -w net.ipv4.tcp_mtu_probing=1

    sysctl -w net.ipv4.tcp_slow_start_after_idle=0
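For the ring-size change in the first bullet, the hardware limit can be read from the "Pre-set maximums" section of `ethtool -g` rather than hard-coded. This is a small sketch that assumes the usual `ethtool -g` output layout. `max_rx_ring` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: extract the pre-set maximum RX ring size from `ethtool -g` output
# read on stdin. Assumes a "Pre-set maximums:" section followed by an
# "RX:" line, which is the common ethtool layout.
max_rx_ring() {
  awk '/^Pre-set maximums/ { in_max = 1; next }
       in_max && $1 == "RX:" { print $2; exit }'
}

# Usage (as root):
# max=$(ethtool -g ens4f0 | max_rx_ring) && ethtool -G ens4f0 rx "$max"
```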

Results afterwards (apparently not much different):

 :~# for i in $(seq 1 10); do echo "$i) $(date) - $(ifconfig ens4f1 | grep RX | grep overruns)"; sleep 5; done

 1) Thu Oct 12 20:42:56 SGT 2017 - RX packets:16260617113 errors:10964865836 dropped:68 overruns:10964865836 frame:0
 2) Thu Oct 12 20:43:01 SGT 2017 - RX packets:16263268608 errors:10969589847 dropped:68 overruns:10969589847 frame:0
 3) Thu Oct 12 20:43:06 SGT 2017 - RX packets:16265869693 errors:10974489639 dropped:68 overruns:10974489639 frame:0
 4) Thu Oct 12 20:43:11 SGT 2017 - RX packets:16268487078 errors:10979323070 dropped:68 overruns:10979323070 frame:0
 5) Thu Oct 12 20:43:16 SGT 2017 - RX packets:16271098501 errors:10984193349 dropped:68 overruns:10984193349 frame:0
 6) Thu Oct 12 20:43:21 SGT 2017 - RX packets:16273804004 errors:10988857622 dropped:68 overruns:10988857622 frame:0
 7) Thu Oct 12 20:43:26 SGT 2017 - RX packets:16276493470 errors:10993340211 dropped:68 overruns:10993340211 frame:0
 8) Thu Oct 12 20:43:31 SGT 2017 - RX packets:16278612090 errors:10997152436 dropped:68 overruns:10997152436 frame:0
 9) Thu Oct 12 20:43:36 SGT 2017 - RX packets:16281253727 errors:11001834579 dropped:68 overruns:11001834579 frame:0
 10) Thu Oct 12 20:43:41 SGT 2017 - RX packets:16283972622 errors:11006374277 dropped:68 overruns:11006374277 frame:0
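As a rough guide, most of the sysctls above size socket buffers. Those only come into play after the driver has passed a packet up the stack, so they are not directly involved in the NIC-level overrun counter. Of the list, `net.core.netdev_max_backlog` governs the per-CPU input queue. Its effect can be checked in `/proc/net/softnet_stat`, whose first three hex columns are:

  • packets processed
  • packets dropped at the backlog queue
  • "time squeeze" events, where the receive poll loop ran out of budget or time

A minimal bash decoder; `softnet_summary` is a hypothetical helper, and row N normally (not always) corresponds to CPU N.

```shell
#!/usr/bin/env bash
# Sketch: decode the first three hex columns of /proc/net/softnet_stat
# (or a file given as $1): processed, dropped, time_squeeze.
softnet_summary() {
  local row=0 processed dropped squeezed rest
  while read -r processed dropped squeezed rest; do
    printf 'cpu%d processed=%d dropped=%d time_squeeze=%d\n' \
      "$row" "$((16#$processed))" "$((16#$dropped))" "$((16#$squeezed))"
    row=$((row + 1))
  done < "${1:-/proc/net/softnet_stat}"
}

# Usage: run twice a few seconds apart and compare dropped/time_squeeze.
# softnet_summary
```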

Setting the CPU frequency governor to performance:

cpufreq-set -r -g performance

Results (nothing significant):

 :~# for i in $(seq 1 10); do echo "$i) $(date) - $(ifconfig ens4f1 | grep RX | grep overruns)"; sleep 5; done

 1) Thu Oct 12 21:53:07 SGT 2017 - RX packets:18506492788 errors:14622639426 dropped:68 overruns:14622639426 frame:0
 2) Thu Oct 12 21:53:12 SGT 2017 - RX packets:18509314581 errors:14626750641 dropped:68 overruns:14626750641 frame:0
 3) Thu Oct 12 21:53:17 SGT 2017 - RX packets:18511485458 errors:14630268859 dropped:68 overruns:14630268859 frame:0
 4) Thu Oct 12 21:53:22 SGT 2017 - RX packets:18514223562 errors:14634547845 dropped:68 overruns:14634547845 frame:0
 5) Thu Oct 12 21:53:27 SGT 2017 - RX packets:18516926578 errors:14638745143 dropped:68 overruns:14638745143 frame:0
 6) Thu Oct 12 21:53:32 SGT 2017 - RX packets:18519605412 errors:14642929021 dropped:68 overruns:14642929021 frame:0
 7) Thu Oct 12 21:53:37 SGT 2017 - RX packets:18522523560 errors:14647108982 dropped:68 overruns:14647108982 frame:0
 8) Thu Oct 12 21:53:42 SGT 2017 - RX packets:18525185869 errors:14651577286 dropped:68 overruns:14651577286 frame:0
 9) Thu Oct 12 21:53:47 SGT 2017 - RX packets:18527947266 errors:14655961847 dropped:68 overruns:14655961847 frame:0
 10) Thu Oct 12 21:53:52 SGT 2017 - RX packets:18530703288 errors:14659988398 dropped:68 overruns:14659988398 frame:0
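As described in the cpufrequtils documentation, `cpufreq-set` acts on one CPU by default (CPU 0 unless `-c` is given). `-r` extends that to the CPUs hardware-related to it. On a 72-core machine it is therefore worth confirming which governor each core actually ended up with. The sketch counts cores per governor from sysfs. The optional base-directory argument exists only to make the hypothetical `governor_summary` helper testable.

```shell
#!/usr/bin/env bash
# Sketch: count CPUs per cpufreq governor. Reads sysfs by default; the
# optional argument overrides the base directory (used for testing).
governor_summary() {
  local base=${1:-/sys/devices/system/cpu}
  cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}

# Usage: governor_summary    (ideally a single line reporting 72 performance)
```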

Results using sar:

:~# sar -n EDEV 5 3| egrep "(ens4f1|IFACE)"

11:17:43 PM     IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s  txcarr/s  rxfram/s  rxfifo/s  txfifo/s
11:17:48 PM    ens4f1 360809.40      0.00      0.00      0.00      0.00      0.00      0.00 360809.40      0.00
11:17:53 PM    ens4f1 382500.40      0.00      0.00      0.00      0.00      0.00      0.00 382500.40      0.00
11:17:58 PM    ens4f1 353717.00      0.00      0.00      0.00      0.00      0.00      0.00 353717.00      0.00
Average:       ens4f1 365675.60      0.00      0.00      0.00      0.00      0.00      0.00 365675.60      0.00
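In the sar output, rxerr/s equals rxfifo/s, so every receive error on ens4f1 is counted as a FIFO error. net-tools ifconfig prints this same `rx_fifo_errors` counter as "overruns". The counters can be sampled straight from sysfs without ifconfig or sar. In this minimal sketch, `rate_per_sec` and `rx_fifo_rate` are hypothetical helpers, and the third argument exists only for testing.

```shell
#!/usr/bin/env bash
# Sketch: per-second RX packet and FIFO-error rates for one interface, read
# from /sys/class/net/<iface>/statistics (the kernel counters behind
# ifconfig and sar). $3 overrides the sysfs base directory, for testing only.
rate_per_sec() { echo $(( ($2 - $1) / $3 )); }   # integer division

rx_fifo_rate() {
  local iface=$1 interval=${2:-5} base=${3:-/sys/class/net}
  local dir="$base/$iface/statistics" p1 f1 p2 f2
  p1=$(<"$dir/rx_packets"); f1=$(<"$dir/rx_fifo_errors")
  sleep "$interval"
  p2=$(<"$dir/rx_packets"); f2=$(<"$dir/rx_fifo_errors")
  echo "rx_packets/s=$(rate_per_sec "$p1" "$p2" "$interval") rx_fifo_errors/s=$(rate_per_sec "$f1" "$f2" "$interval")"
}

# Usage: for i in $(seq 1 10); do rx_fifo_rate ens4f1 5; done
```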

I also tuned some SCTP-specific parameters, again with no effect:

sysctl -w net.core.rmem_max=900000000
sysctl -w net.core.wmem_max=900000000

sysctl -w net.sctp.sctp_mem="2100000000 2100000000 2100000000"
sysctl -w net.sctp.sctp_rmem="2100000000 2100000000 2100000000"
sysctl -w net.sctp.sctp_wmem="2100000000 2100000000 2100000000"

sysctl -w net.ipv4.udp_mem="5000000000 5000000000 5000000000"
sysctl -w net.ipv4.udp_mem="10000000000 10000000000 10000000000"

 for i in $(seq 1 10); do echo "$i) $(date) - $(ifconfig ens4f1 | grep RX | grep overruns)"; sleep 5; done
 1) Sat Oct 14 21:55:55 SGT 2017 - RX packets:84379103241 errors:56372972367 dropped:58 overruns:56372972367 frame:0
 2) Sat Oct 14 21:56:00 SGT 2017 - RX packets:84381451420 errors:56377777944 dropped:58 overruns:56377777944 frame:0
 3) Sat Oct 14 21:56:05 SGT 2017 - RX packets:84383737427 errors:56382434478 dropped:58 overruns:56382434478 frame:0
 4) Sat Oct 14 21:56:10 SGT 2017 - RX packets:84386524128 errors:56386618268 dropped:58 overruns:56386618268 frame:0
 5) Sat Oct 14 21:56:15 SGT 2017 - RX packets:84389578203 errors:56390512483 dropped:58 overruns:56390512483 frame:0
 6) Sat Oct 14 21:56:20 SGT 2017 - RX packets:84392673120 errors:56394472475 dropped:58 overruns:56394472475 frame:0
 7) Sat Oct 14 21:56:25 SGT 2017 - RX packets:84395714973 errors:56398573221 dropped:58 overruns:56398573221 frame:0
 8) Sat Oct 14 21:56:30 SGT 2017 - RX packets:84398951451 errors:56402297479 dropped:58 overruns:56402297479 frame:0
 9) Sat Oct 14 21:56:35 SGT 2017 - RX packets:84401039177 errors:56406013473 dropped:58 overruns:56406013473 frame:0
 10) Sat Oct 14 21:56:40 SGT 2017 - RX packets:84403558097 errors:56410804379 dropped:58 overruns:56410804379 frame:0
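To check whether SCTP itself is discarding anything, the kernel exposes SCTP MIB counters as "Name Value" lines in `/proc/net/sctp/snmp` once the sctp module is loaded. The sketch prints the non-zero counters whose names contain "Discard". Which counters exist depends on the kernel version, so the filter is an assumption, and `sctp_discards` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: list non-zero SCTP counters whose names mention "Discard", from
# /proc/net/sctp/snmp (or a file given as $1). Each line is "Name Value".
sctp_discards() {
  awk '$1 ~ /Discard/ && $2 + 0 > 0 { print $1, $2 }' "${1:-/proc/net/sctp/snmp}"
}

# Usage: sctp_discards    (run twice a few seconds apart to see growth)
```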
    
by Traiano Welcome, 15.10.2017 / 10:04
