I'm managing a bare-metal server running a MySQL database alongside other software. We're seeing performance problems with some MySQL queries; they occur intermittently and almost exclusively affect UPDATE queries.
As an example, I have a specific query that runs every 15 seconds. Most of the time it completes in milliseconds, but occasionally the query profile shows this:
Status Duration
starting 0.000026
checking permissions 0.000003
Opening tables 0.000011
System lock 0.000002
init 0.000007
update 0.000052
Waiting for query cache lock 0.000001
update 0.000002
end 0.000001
query end 15.198987
closing tables 0.000023
freeing items 0.000026
logging slow query 0.000014
logging slow query 0.000001
cleaning up 0.000002
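A quick sanity check on those numbers (my own arithmetic, not additional server output): summing the stage durations confirms that essentially all of the time is spent in the "query end" stage.

```shell
# Sum the stage durations from the profile above; every stage except
# "query end" (15.198987 s) is in the microsecond range.
total=$(awk '{ for (i = 1; i <= NF; i++) s += $i } END { printf "%.6f", s }' <<'EOF'
0.000026 0.000003 0.000011 0.000002 0.000007
0.000052 0.000001 0.000002 0.000001 15.198987
0.000023 0.000026 0.000014 0.000001 0.000002
EOF
)
echo "total: ${total}s"
```

So the query itself (the "update" stages) is done in microseconds; the 15 s stall is entirely in "query end", which is where MySQL flushes and commits.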
I tried the suggestions in this question, to no effect. I then started looking at lower-level items, since I noticed that some of the other software was also showing intermittently slow responses.
Here is the top output when performance is good:
top - 12:09:48 up 138 days, 21:57, 2 users, load average: 2.35, 1.73, 1.55
Tasks: 392 total, 2 running, 388 sleeping, 2 stopped, 0 zombie
%Cpu0 : 16.4 us, 1.3 sy, 0.0 ni, 80.3 id, 1.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu1 : 9.3 us, 0.7 sy, 0.0 ni, 89.1 id, 0.7 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu2 : 4.0 us, 0.7 sy, 0.0 ni, 95.0 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 3.0 us, 0.7 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 10.0 us, 0.7 sy, 0.0 ni, 89.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 14.9 us, 0.7 sy, 0.0 ni, 84.4 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 2.6 us, 0.3 sy, 0.0 ni, 96.4 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 1.3 us, 0.0 sy, 0.0 ni, 98.0 id, 0.7 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32836768 total, 24516984 used, 8319784 free, 143868 buffers
KiB Swap: 0 total, 0 used, 0 free. 15166836 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26264 root 20 0 1255040 81148 17792 S 19.3 0.2 367:38.32 node
1450 root 20 0 1249548 106060 17760 S 7.6 0.3 2612:39 node
30815 root 20 0 1245184 71652 17780 S 5.6 0.2 898:00.21 node
20700 root 20 0 1238644 63004 16812 S 2.3 0.2 635:20.34 node
281 root 20 0 193348 101184 97840 S 2.0 0.3 2832:15 systemd-journal
2974 root 20 0 48404 21688 3812 S 1.3 0.1 1329:51 openvpn
26283 postgres 20 0 235384 45408 42648 S 1.3 0.1 10:17.88 postgres
26284 postgres 20 0 235380 44780 42008 S 1.3 0.1 10:14.81 postgres
25764 root 20 0 1232976 58792 17716 S 1.0 0.2 42:51.01 node
26279 postgres 20 0 235356 45772 43028 S 1.0 0.1 10:13.60 postgres
12793 root 20 0 23956 3300 2488 R 0.7 0.0 0:00.11 top
26280 postgres 20 0 235384 44764 41992 S 0.7 0.1 10:16.64 postgres
27135 postgres 20 0 235380 44932 42156 S 0.7 0.1 10:15.24 postgres
27136 postgres 20 0 235360 44288 41532 S 0.7 0.1 10:15.45 postgres
And when it is bad:
top - 12:13:19 up 138 days, 22:00, 2 users, load average: 3.34, 2.45, 1.88
Tasks: 393 total, 2 running, 389 sleeping, 2 stopped, 0 zombie
%Cpu0 : 11.1 us, 0.0 sy, 0.0 ni, 0.0 id, 87.8 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu1 : 1.3 us, 0.0 sy, 0.0 ni, 75.5 id, 22.8 wa, 0.0 hi, 0.3 si, 0.0 st
%Cpu2 : 5.7 us, 0.3 sy, 0.0 ni, 8.4 id, 85.6 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 11.0 us, 0.3 sy, 0.0 ni, 74.3 id, 14.3 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 0.7 us, 0.0 sy, 0.0 ni, 87.3 id, 12.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 1.3 us, 0.0 sy, 0.0 ni, 98.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 1.0 us, 0.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 3.0 us, 0.3 sy, 0.0 ni, 93.3 id, 3.4 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 32836768 total, 24870588 used, 7966180 free, 143868 buffers
KiB Swap: 0 total, 0 used, 0 free. 15073140 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26264 root 20 0 1252992 109712 17792 S 10.0 0.3 368:16.02 node
1450 root 20 0 1255692 112632 17760 S 4.3 0.3 2612:57 node
30815 root 20 0 1240064 81032 17780 S 4.0 0.2 898:13.71 node
20700 root 20 0 1243764 74472 16812 S 2.3 0.2 635:24.22 node
281 root 20 0 58644 22468 19124 S 1.0 0.1 2832:18 systemd-journal
25764 root 20 0 1240144 64560 17716 S 1.0 0.2 42:53.01 node
2974 root 20 0 48404 21688 3812 S 0.7 0.1 1329:53 openvpn
7 root 20 0 0 0 0 S 0.3 0.0 728:52.82 rcu_sched
770 postgres 20 0 235648 45920 42780 S 0.3 0.1 8:11.67 postgres
5476 root 20 0 0 0 0 S 0.3 0.0 0:00.15 kworker/u16:1
12329 root 20 0 0 0 0 D 0.3 0.0 0:00.18 kworker/2:1
12778 root 20 0 0 0 0 R 0.3 0.0 0:00.25 kworker/2:2
26278 postgres 20 0 232340 17764 15544 S 0.3 0.1 5:38.38 postgres
28478 mysql 20 0 7210212 3.677g 8560 S 0.3 11.7 1849:14 mysqld
30829 postgres 20 0 232340 17484 15272 S 0.3 0.1 14:01.09 postgres
1 root 20 0 28968 5188 2864 S 0.0 0.0 5:44.60 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:12.38 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 75:39.66 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
Following some advice on troubleshooting high I/O load on Linux, I ran iostat -xm 5.
Here is a normal output:
avg-cpu: %user %nice %system %iowait %steal %idle
6.05 0.00 0.63 0.80 0.00 92.52
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 2.60 0.00 58.40 0.00 0.64 22.58 0.02 0.32 0.00 0.32 0.29 1.68
sdb 0.00 2.80 0.00 58.20 0.00 0.64 22.66 0.12 2.10 0.00 2.10 1.70 9.92
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 24.00 0.00 0.10 8.53 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 0.00 27.80 0.00 0.53 39.19 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 24.20 0.00 0.53 45.02 0.22 9.06 0.00 9.06 2.64 6.40
and a slow one:
avg-cpu: %user %nice %system %iowait %steal %idle
3.90 0.00 0.28 51.05 0.00 44.77
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 7.20 0.00 42.20 0.00 15.24 739.82 0.35 8.27 0.00 8.27 0.74 3.12
sdb 0.00 7.40 0.00 62.00 0.00 24.69 815.55 110.16 2142.10 0.00 2142.10 16.13 100.00
md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
md2 0.00 0.00 0.00 0.40 0.00 0.00 24.00 0.00 0.00 0.00 0.00 0.00 0.00
md3 0.00 0.00 0.00 0.20 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.20 0.00 0.00 8.00 219.50 3880180.00 0.00 3880180.00 5000.00 100.00
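The asymmetry between the two members is striking. Pulling a couple of numbers out of the slow sample above (plain arithmetic on the values already shown, nothing newly measured):

```shell
# Ratio of write latency (w_await) between the two RAID1 members
ratio=$(awk 'BEGIN { printf "%.0f", 2142.10 / 8.27 }')
# Average write size in KB per request for sdb: wMB/s * 1024 / w/s
slow_kb=$(awk 'BEGIN { printf "%.0f", 24.69 * 1024 / 62.0 }')  # slow sample
norm_kb=$(awk 'BEGIN { printf "%.0f", 0.64 * 1024 / 58.2 }')   # normal sample
echo "sdb w_await is ${ratio}x sda's; writes grew from ~${norm_kb}KB to ~${slow_kb}KB each"
```

So during the stalls sdb is serving far larger writes (avgrq-sz ~816 sectors vs ~23 normally) at 100% utilization, while sda handles a comparable stream with single-digit-millisecond latency.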
sda and sdb are Samsung SSD 850 EVO 1TB drives. Each has an identical partition table, forming four md devices in RAID1:
Personalities : [raid1]
md3 : active raid1 sda4[0] sdb4[1]
940987072 blocks super 1.2 [2/2] [UU]
bitmap: 2/8 pages [8KB], 65536KB chunk
md2 : active raid1 sda3[0] sdb3[1]
19514368 blocks super 1.2 [2/2] [UU]
md1 : active raid1 sda2[0] sdb2[1]
15617024 blocks super 1.2 [2/2] [UU]
md0 : active raid1 sda1[0] sdb1[1]
487104 blocks super 1.2 [2/2] [UU]
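For completeness, one can rule out a degraded array by scanning the member-status brackets (shown here against a captured copy of the /proc/mdstat output above so it runs anywhere; on the real host you'd read /proc/mdstat directly):

```shell
# Count md status lines where a member is missing, i.e. [U_] or [_U]
# instead of [UU]. Sample data is copied from the /proc/mdstat above.
mdstat='md3 : active raid1 sda4[0] sdb4[1]
      940987072 blocks super 1.2 [2/2] [UU]
md2 : active raid1 sda3[0] sdb3[1]
      19514368 blocks super 1.2 [2/2] [UU]'
degraded=$(printf '%s\n' "$mdstat" | grep -c -E '\[(U_|_U)\]' || true)
echo "degraded arrays: $degraded"
```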
md0 is /boot, type xfs. md1 is for swap, but is currently unused. md2 is /, type xfs. Finally, md3 is the backing device for an encrypted volume (dm-0) where the data is stored:
LUKS header information for /dev/md3
Version: 1
Cipher name: aes
Cipher mode: xts-plain64
Hash spec: sha1
Payload offset: 4096
What particularly bothers me is that the I/O load on sdb appears much higher than on sda. Since both disks only provide member partitions for RAID1 arrays, that doesn't seem right.
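To catch the imbalance as it happens, the iostat device lines can be filtered for anything with a large await (a hedged sketch: the 100 ms threshold is arbitrary, and here it runs against the captured slow sample rather than live iostat output):

```shell
# Device lines copied from the slow iostat -xm sample; field 10 is "await".
sample='sda 0.00 7.20 0.00 42.20 0.00 15.24 739.82 0.35 8.27 0.00 8.27 0.74 3.12
sdb 0.00 7.40 0.00 62.00 0.00 24.69 815.55 110.16 2142.10 0.00 2142.10 16.13 100.00'
# Print the names of devices whose await exceeds 100 ms.
hot=$(printf '%s\n' "$sample" | awk '$10 > 100 { print $1 }')
echo "devices with high await: $hot"
```

Run live, the same filter would be something like iostat -xm 5 | awk '$1 ~ /^sd/ && $10 > 100' (the $1 match skips header lines).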
I checked the SMART data for sdb and ran both the short and extended self-tests without any problems being detected:
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 8719 -
# 2 Short offline Completed without error 00% 8711 -
The system logs show no entries for sdb other than smartd logging temperature changes every 30 minutes.
Kernel version:
Linux no 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux
What am I missing?