How to find the cause of a high load average on Linux

I am trying to find out why my server's load average reaches 150 and at times climbs to 200.

This is an Ubuntu Server running in a VMware ESXi virtual machine with 8 CPUs and 26 GB of RAM. It is a mail server (running Kerio Mailserver) with about 500 user accounts. The load is high during working hours (from 8:30 AM to 6:30 PM). Kerio Mailserver is an Exchange-like mail server; users sync Outlook through a connector.

This is the output of some of the commands I have been using to try to find the cause of the high load:

TOP:

top - 16:25:32 up 46 days, 20:59,  2 users,  load average: 178.44, 164.61, 156.84
Tasks: 241 total,   1 running, 240 sleeping,   0 stopped,   0 zombie
%Cpu(s): 45.8 us, 53.6 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem:  24689476 total, 24234796 used,   454680 free,   675324 buffers
KiB Swap: 23436284 total,   763136 used, 22673148 free. 14616960 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND                                                                                                                                          
 1253 root      20   0 13.889g 6.918g  22680 S 99.3 29.4 190621:37 mailserver                                                                                                                                       
   64 root      20   0       0      0      0 S  0.2  0.0 257:29.22 kswapd0                                                                                                                                          
30482 root      20   0   25036   3160   2516 R  0.1  0.0   0:01.33 top                                                                                                                                              
 8227 root      20   0  318856  12232  11288 S  0.0  0.0  15:39.14 smbd                                                                                                                                             
    1 root      20   0   36408   3144   1884 S  0.0  0.0   0:06.40 init                                                                                                                                             
    2 root      20   0       0      0      0 S  0.0  0.0   0:00.49 kthreadd                                                                                                                                         
    3 root      20   0       0      0      0 S  0.0  0.0   2:32.48 ksoftirqd/0                                                                                                                                      
    5 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/0:0H                                                                                                                                     
    7 root      20   0       0      0      0 S  0.0  0.0  81:48.05 rcu_sched                                                                                                                                        
    8 root      20   0       0      0      0 S  0.0  0.0   0:00.00 rcu_bh                                                                                                                                           
    9 root      rt   0       0      0      0 S  0.0  0.0   0:05.96 migration/0                                                                                                                                      
   10 root      rt   0       0      0      0 S  0.0  0.0   0:12.15 watchdog/0                                                                                                                                       
   11 root      rt   0       0      0      0 S  0.0  0.0   0:11.93 watchdog/1                                                                                                                                       
   12 root      rt   0       0      0      0 S  0.0  0.0   0:06.02 migration/1                                                                                                                                      
   13 root      20   0       0      0      0 S  0.0  0.0   2:17.19 ksoftirqd/1                                                                                                                                      
   14 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kworker/1:0                                                                                                                                      
   15 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/1:0H                                                                                                                                     
   16 root      rt   0       0      0      0 S  0.0  0.0   0:10.84 watchdog/2                                                                                                                                       
   17 root      rt   0       0      0      0 S  0.0  0.0   0:06.29 migration/2                                                                                                                                      
   18 root      20   0       0      0      0 S  0.0  0.0   2:22.20 ksoftirqd/2                                                                                                                                      
   19 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kworker/2:0                                                                                                                                      
   20 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/2:0H                                                                                                                                     
   21 root      rt   0       0      0      0 S  0.0  0.0   0:10.75 watchdog/3                                                                                                                                       
   22 root      rt   0       0      0      0 S  0.0  0.0   0:06.34 migration/3                                                                                                                                      
   23 root      20   0       0      0      0 S  0.0  0.0   2:07.07 ksoftirqd/3                                                                                                                                      
   24 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kworker/3:0                                                                                                                                      
   25 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/3:0H                                                                                                                                     
   26 root      rt   0       0      0      0 S  0.0  0.0   0:11.49 watchdog/4                                                                                                                                       
   27 root      rt   0       0      0      0 S  0.0  0.0   0:06.34 migration/4                                                                                                                                      
   28 root      20   0       0      0      0 S  0.0  0.0   1:50.66 ksoftirqd/4                                                                                                                                      
   30 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/4:0H                                                                                                                                     
   31 root      rt   0       0      0      0 S  0.0  0.0   0:11.48 watchdog/5                                                                                                                                       
   32 root      rt   0       0      0      0 S  0.0  0.0   0:06.45 migration/5                                                                                                                                      
   33 root      20   0       0      0      0 S  0.0  0.0   2:04.74 ksoftirqd/5                                                                                                                                      
   34 root      20   0       0      0      0 S  0.0  0.0   1:30.98 kworker/5:0                                                                                                                                      
   35 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/5:0H                                                                                                                                     
   36 root      rt   0       0      0      0 S  0.0  0.0   0:11.22 watchdog/6                                                                                                                                       
   37 root      rt   0       0      0      0 S  0.0  0.0   0:06.40 migration/6                                                                                                                                      
   38 root      20   0       0      0      0 S  0.0  0.0   2:23.44 ksoftirqd/6                                                                                                                                      
   40 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/6:0H                                                                                                                                     
   41 root      rt   0       0      0      0 S  0.0  0.0   0:11.06 watchdog/7                                                                                                                                       
   42 root      rt   0       0      0      0 S  0.0  0.0   0:06.50 migration/7                                                                                                                                      
   43 root      20   0       0      0      0 S  0.0  0.0   2:06.70 ksoftirqd/7                                                                                                                                      
   45 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 kworker/7:0H                                                                                                                                     
   46 root      20   0       0      0      0 S  0.0  0.0   0:00.00 kdevtmpfs                                                                                                                                        
   47 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 netns                                                                                                                                            
   48 root       0 -20       0      0      0 S  0.0  0.0   0:00.00 perf                                                                                                                                             
   49 root      20   0       0      0      0 S  0.0  0.0   0:08.92 khungtaskd        
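
On Linux the load average counts tasks that are runnable (state R) or in uninterruptible sleep (state D), and it counts threads rather than processes, so the single "mailserver" line above can hide many contributing threads. Below is a minimal sketch, assuming a Linux /proc filesystem, that tallies the thread states of one process. The function names are illustrative and not part of any tool.

```python
import os
from collections import Counter


def thread_state(stat_text):
    """Return the one-letter state from the text of a /proc/<pid>/task/<tid>/stat file."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split on the LAST ')'.
    after_comm = stat_text.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_thread_states(pid, proc_root="/proc"):
    """Count the states (R, S, D, ...) of every thread that belongs to one process."""
    states = Counter()
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as fh:
                states[thread_state(fh.read())] += 1
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were scanning
    return states
```

Calling `count_thread_states(1253)` as root on the server (1253 is the mailserver PID in the listing above) would show how many of its threads are in R or D at that instant. Pressing H inside top gives a comparable per-thread view.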

IOSTAT

Linux 4.4.0-124-generic (mardom-mail)   08/06/2018  _x86_64_    (8 CPU)

08/06/2018 04:26:33 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.86    88.05   95.18  148.71  1809.49  2014.93    31.36     1.03    4.21    6.02    3.05   0.82  20.08

08/06/2018 04:26:35 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               1.50   164.00  255.00  234.00  3744.00  2028.00    23.61     2.06    4.21    6.00    2.26   1.35  66.00

08/06/2018 04:26:37 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00   160.50  211.00   25.50  3164.00  1188.00    36.80     1.30    5.45    5.90    1.80   2.88  68.00

08/06/2018 04:26:39 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.50   207.00  129.00   63.50  2808.00  1942.00    49.35     1.16    6.06    8.48    1.13   3.35  64.40

* sdb is the device where the mail data is stored *
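
When iostat is run with an interval, its first report is the average since boot, so only the later reports describe the current load. A small sketch that parses the extended output into rows and drops that first report per device follows. It assumes the column layout shown above, and `SAMPLE` is copied from the first two reports.

```python
SAMPLE = """\
08/06/2018 04:26:33 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.86    88.05   95.18  148.71  1809.49  2014.93    31.36     1.03    4.21    6.02    3.05   0.82  20.08

08/06/2018 04:26:35 PM
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               1.50   164.00  255.00  234.00  3744.00  2028.00    23.61     2.06    4.21    6.00    2.26   1.35  66.00
"""


def parse_iostat_reports(text):
    """Parse `iostat -x` text into a list of {column: value} dicts, one per device row."""
    rows, header = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Device:":
            header = fields[1:]
        elif header and len(fields) == len(header) + 1:
            try:
                values = [float(v) for v in fields[1:]]
            except ValueError:
                continue  # not a device row
            row = dict(zip(header, values))
            row["device"] = fields[0]
            rows.append(row)
    return rows


def interval_rows(rows):
    """Drop the first row seen for each device (the since-boot average)."""
    seen, result = set(), []
    for row in rows:
        if row["device"] in seen:
            result.append(row)
        seen.add(row["device"])
    return result


if __name__ == "__main__":
    for row in interval_rows(parse_iostat_reports(SAMPLE)):
        print(row["device"], row["await"], row["%util"])
```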

SYSCTL:

fs.file-max = 2097152
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
net.core.rmem_default = 31457280
net.core.rmem_max = 12582912
net.core.wmem_default = 31457280
net.core.wmem_max = 12582912
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 65536
net.core.optmem_max = 25165824
net.ipv4.tcp_mem = 65536 131072 262144
net.ipv4.udp_mem = 65536 131072 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.udp_rmem_min = 16384
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.udp_wmem_min = 16384
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_rmem = 20240 87380 16582912
net.ipv4.tcp_wmem = 20240 87380 16582912
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 8000
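
The listing above sets net.ipv4.tcp_rmem, net.ipv4.tcp_wmem and net.core.netdev_max_backlog twice. If these lines come from a sysctl configuration file applied in this order, the later value would be the one in effect. A minimal sketch that reports such duplicates follows. `SAMPLE` holds a few lines copied from the listing, and the function name is illustrative.

```python
SAMPLE = """\
net.core.netdev_max_backlog = 65536
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.tcp_rmem = 20240 87380 16582912
net.ipv4.tcp_wmem = 20240 87380 16582912
net.core.netdev_max_backlog = 8000
"""


def parse_sysctl(text):
    """Return (effective, duplicates) for sysctl-style `key = value` lines.

    effective maps each key to the last value seen; duplicates maps each
    repeated key to every value it was given, in file order.
    """
    effective, duplicates = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue  # blank line, comment, or not an assignment
        key, value = (part.strip() for part in line.split("=", 1))
        if key in effective:
            duplicates.setdefault(key, [effective[key]]).append(value)
        effective[key] = value  # a later line overrides an earlier one
    return effective, duplicates


if __name__ == "__main__":
    effective, duplicates = parse_sysctl(SAMPLE)
    for key, values in duplicates.items():
        print(f"{key}: set {len(values)} times, last value {effective[key]!r}")
```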

NETSTAT -S

This is part of the output of the "netstat -s" command:

TcpExt:
    147823 SYN cookies sent
    90353 SYN cookies received
    649892 invalid SYN cookies received
    35112 resets received for embryonic SYN_RECV sockets
    598 packets pruned from receive queue because of socket buffer overrun
    1248 ICMP packets dropped because they were out-of-window
    1750991 TCP sockets finished time wait in fast timer
    115289 TCP sockets finished time wait in slow timer
    198679 passive connections rejected because of time stamp
    313672 packets rejects in established connections because of timestamp
    30746108 delayed acks sent
    52041 delayed acks further delayed because of locked socket
    Quick ack mode was activated 3386942 times
    708473 times the listen queue of a socket overflowed
    1060673 SYNs to LISTEN sockets dropped
    3845729 packets directly queued to recvmsg prequeue.
    356530 bytes directly in process context from backlog
    352524121 bytes directly received in process context from prequeue
    477210889 packet headers predicted
    269376 packets header predicted and directly queued to user
    863932645 acknowledgments not containing data payload received
    1164901897 predicted acknowledgments
    5140 times recovered from packet loss due to fast retransmit
    3122668 times recovered from packet loss by selective acknowledgements
    17 bad SACK blocks received
    Detected reordering 2283 times using FACK
    Detected reordering 2792 times using SACK
    Detected reordering 79 times using reno fast retransmit
    Detected reordering 7587 times using time stamp
    11657 congestion windows fully recovered without slow start
    7574 congestion windows partially recovered using Hoe heuristic
    113083 congestion windows recovered without slow start by DSACK
    1749877 congestion windows recovered without slow start after partial ack
    TCPLostRetransmit: 505007
    446 timeouts after reno fast retransmit
    69083 timeouts after SACK recovery
    37057 timeouts in loss state
    10089153 fast retransmits
    193583 forward retransmits
    842022 retransmits in slow start
    5758947 other TCP timeouts
    TCPLossProbes: 16066532
    TCPLossProbeRecovery: 213644
    551 classic Reno fast retransmits failed
    81107 SACK retransmits failed
    74 times receiver scheduled too late for direct processing
    9695 packets collapsed in receive queue due to low socket buffer
    3556447 DSACKs sent for old packets
    99902 DSACKs sent for out of order packets
    11030608 DSACKs received
    47032 DSACKs for out of order packets received
    2171835 connections reset due to unexpected data
    1370307 connections reset due to early user close
    1984601 connections aborted due to timeout
    TCPSACKDiscard: 52
    TCPDSACKIgnoredOld: 38928
    TCPDSACKIgnoredNoUndo: 4151872
    TCPSpuriousRTOs: 50607
    TCPSackShifted: 11911393
    TCPSackMerged: 12335047
    TCPSackShiftFallback: 12395942
    IPReversePathFilter: 1
    TCPReqQFullDoCookies: 161423
    TCPRetransFail: 123
    TCPRcvCoalesce: 306299180
    TCPOFOQueue: 6851749
    TCPOFOMerge: 94743
    TCPChallengeACK: 74466
    TCPSYNChallenge: 6119
    TCPFastOpenCookieReqd: 3
    TCPSpuriousRtxHostQueues: 1761
    TCPAutoCorking: 59800735
    TCPFromZeroWindowAdv: 52396
    TCPToZeroWindowAdv: 52396
    TCPWantZeroWindowAdv: 416614
    TCPSynRetrans: 2205685
    TCPOrigDataSent: -230466592
    TCPHystartTrainDetect: 950609
    TCPHystartTrainCwnd: 17827670
    TCPHystartDelayDetect: 649377
    TCPHystartDelayCwnd: 27358928
    TCPACKSkippedSynRecv: 227586
    TCPACKSkippedPAWS: 32292
    TCPACKSkippedSeq: 30707
    TCPACKSkippedFinWait2: 7
    TCPACKSkippedTimeWait: 48
    TCPACKSkippedChallenge: 1494
    TCPWinProbe: 279293
    TCPKeepAlive: 1685
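
These counters are cumulative since boot (46 days here), so a single snapshot does not show what is happening during the busy hours. One option is to compare two snapshots taken a few minutes apart. The sketch below parses the two most common line shapes of "netstat -s" and computes the difference. It skips lines where the number sits mid-sentence, and the function names are illustrative.

```python
import re

SAMPLE = """\
TcpExt:
    708473 times the listen queue of a socket overflowed
    1060673 SYNs to LISTEN sockets dropped
    TCPSynRetrans: 2205685
    TCPOrigDataSent: -230466592
"""


def parse_netstat_counters(text):
    """Map counter names to integer values for `netstat -s` style lines."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        named = re.match(r"^(\w+):\s+(-?\d+)$", line)  # "TCPSynRetrans: 2205685"
        if named:
            counters[named.group(1)] = int(named.group(2))
            continue
        leading = re.match(r"^(\d+)\s+(.+)$", line)  # "708473 times the listen queue ..."
        if leading:
            counters[leading.group(2)] = int(leading.group(1))
    return counters


def counter_deltas(before, after):
    """Return only the counters that changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

In practice `before` and `after` would come from two saved runs of the command, for example one at the start and one at the end of a five-minute window during working hours.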

When I run "netstat -nat | wc -l" I get 2988 connections.
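
Note that "wc -l" also counts the two header lines printed by netstat, and the total mixes listening sockets with every connection state. A breakdown by state may be more telling. Below is a minimal sketch that counts states from the text of "netstat -nat". The addresses in `SAMPLE` are made-up documentation addresses, not data from this server.

```python
from collections import Counter

# Hypothetical sample in the layout printed by `netstat -nat`.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:25              0.0.0.0:*               LISTEN
tcp        0      0 192.0.2.10:443          198.51.100.7:51234      ESTABLISHED
tcp        0      0 192.0.2.10:443          198.51.100.8:51300      ESTABLISHED
tcp        0      0 192.0.2.10:443          203.0.113.5:40000       TIME_WAIT
tcp        0      0 192.0.2.10:443          203.0.113.6:40001       SYN_RECV
"""


def count_tcp_states(netstat_text):
    """Count TCP sockets per state, ignoring header lines."""
    states = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Socket rows start with tcp or tcp6 and carry the state in column 6.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(f"{state:12} {count}")
```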

As you can see in the "top" output, the "mailserver" process is consuming more than 90% of the CPU, but could that alone cause such a high load?

** UPDATE ** I noticed that when I block some VLANs from connecting to the server (VLANs with about 60 computers or more), the load starts to drop. Could this be a network-related problem, or a matter of overall server capacity?
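
To see which VLANs contribute the most connections, the remote addresses from "netstat -nat" can be grouped by subnet. The sketch below assumes each VLAN maps to an IPv4 /24, which is only a guess about this network. The addresses in `SAMPLE` are made-up documentation addresses.

```python
import ipaddress
from collections import Counter

# Hypothetical sample in the layout printed by `netstat -nat`.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:443             0.0.0.0:*               LISTEN
tcp        0      0 192.0.2.10:443          198.51.100.7:51234      ESTABLISHED
tcp        0      0 192.0.2.10:443          198.51.100.8:51300      ESTABLISHED
tcp        0      0 192.0.2.10:443          203.0.113.5:40000       ESTABLISHED
"""


def connections_per_subnet(netstat_text, prefix_len=24):
    """Count non-listening IPv4 TCP sockets per remote subnet."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or not fields[0].startswith("tcp") or fields[5] == "LISTEN":
            continue
        host = fields[4].rsplit(":", 1)[0]  # strip the port from "address:port"
        try:
            addr = ipaddress.ip_address(host)
        except ValueError:
            continue  # not a plain address
        if addr.version != 4:
            continue  # this sketch handles IPv4 only
        network = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[str(network)] += 1
    return counts


if __name__ == "__main__":
    for subnet, count in connections_per_subnet(SAMPLE).most_common():
        print(f"{subnet:20} {count}")
```

Comparing these counts before and after blocking a VLAN would show whether the drop in load tracks the number of client connections.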

    
by Alberto Medina 06.08.2018 / 22:05

0 answers