I virtualized a datacenter a few months ago. We run a pool of three HP DL360 G5 servers, each with 32 GB of memory and two Intel Xeons. Recently we have been having two problems. The first is that disk reads have become extremely slow: typing "ls" in a Linux VM directory that holds only a few files takes several seconds to return a listing. The second is that VMs in the cluster sometimes remount their filesystems read-only on their own. dmesg on the hosts shows a flood of "DRDY ERR" errors. Our main storage repositories live on a Drobo B800i, shared over iSCSI. I have posted iostat output and a grep of the DRDY errors from dmesg below. These are enterprise servers and they are going down intermittently, which is never good:
Here is the iostat output from one of the servers:
[root@XenServer-1 tmp]# iostat
Linux 2.6.32.43-0.4.1.xs1.8.0.835.170778xen (XenServer-1.ethoplex.com) 31/07/2014
avg-cpu: %user %nice %system %iowait %steal %idle
0.42 0.00 0.46 3.51 0.40 95.21
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
cciss/c0d0 17.30 76.54 304.24 893755376 3552874247
cciss/c0d0p1 1.04 0.27 22.82 3169526 266433488
cciss/c0d0p2 0.00 0.01 0.00 73890 0
cciss/c0d0p3 16.25 76.24 281.43 890365720 3286440759
sda 76.84 59.78 87.32 698047689 1019733585
dm-0 0.68 0.95 0.28 11071656 3217737
sdb 3.44 177.64 37.74 2074378210 440737634
dm-2 0.00 0.01 0.00 135808 2216
dm-3 12.23 361.61 131.55 4222728781 1536204287
sdc 4.05 27.93 328.02 326147810 3830552980
sdd 6.23 101.72 113.03 1187808537 1319897350
tda 1.61 9.74 40.01 113749658 467248640
dm-28 0.84 36.78 23.11 429521222 269838659
dm-14 0.24 56.24 0.00 656723598 0
dm-21 0.08 18.17 0.00 212172507 0
tdb 0.08 0.12 1.44 1384368 16853616
dm-5 0.38 4.03 36.17 47063052 422416430
tdc 0.61 4.03 36.10 47062722 421602000
dm-7 1.26 17.74 5.51 207110960 64292628
tde 1.22 17.64 5.49 206019946 64129696
dm-30 0.03 0.01 0.60 61956 6979438
dm-4 0.02 0.00 8.85 1014 103326613
tdd 0.11 0.00 8.82 1264 103049216
dm-9 0.00 0.02 0.05 175978 591472
tdg 0.00 0.02 0.05 175950 590704
dm-10 0.01 0.09 0.21 1104226 2488947
tdf 0.01 0.09 0.21 1105562 2472346
dm-6 0.00 0.00 0.04 1568 419135
dm-16 0.00 0.01 0.00 132105 0
dm-17 0.03 0.05 0.76 625890 8867990
dm-8 0.00 0.06 0.10 752923 1226072
tdh 0.00 0.07 0.10 788356 1218922
tdi 0.00 0.00 0.00 884 0
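In case it helps anyone read the table above, here is a minimal Python sketch that parses plain iostat device rows and picks out the busiest devices. It assumes the six-column layout shown in this post. The function name `parse_iostat`, the dictionary keys, and the shortened `SAMPLE` text are my own illustration, not part of any tool.

```python
def parse_iostat(text):
    """Return one dict per device row of plain `iostat` output."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # A device row is a name followed by five numeric columns.
        if len(parts) != 6:
            continue
        try:
            row = {
                "device": parts[0],
                "tps": float(parts[1]),
                "read_per_s": float(parts[2]),
                "write_per_s": float(parts[3]),
                "blocks_read": int(parts[4]),
                "blocks_written": int(parts[5]),
            }
        except ValueError:
            continue  # header line or the avg-cpu values row
        rows.append(row)
    return rows


# A few rows copied from the output above (shortened for illustration).
SAMPLE = """\
avg-cpu: %user %nice %system %iowait %steal %idle
0.42 0.00 0.46 3.51 0.40 95.21
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
cciss/c0d0 17.30 76.54 304.24 893755376 3552874247
sda 76.84 59.78 87.32 698047689 1019733585
sdb 3.44 177.64 37.74 2074378210 440737634
dm-3 12.23 361.61 131.55 4222728781 1536204287
"""

rows = parse_iostat(SAMPLE)
busiest = max(rows, key=lambda r: r["tps"])
top_reader = max(rows, key=lambda r: r["read_per_s"])
print(busiest["device"], top_reader["device"])  # prints "sda dm-3"
```

Note that iostat run without an interval reports averages since boot, so these figures may not reflect what the disks are doing during a slow period.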
dmesg, grepped for DRDY:
[11645348.631020] ata1.00: status: { DRDY ERR }
[11646434.714902] ata1.00: status: { DRDY ERR }
[11648427.773389] ata1.00: status: { DRDY ERR }
[11648950.139954] ata1.00: status: { DRDY ERR }
[11649612.475350] ata1.00: status: { DRDY ERR }
[11650177.522603] ata1.00: status: { DRDY ERR }
[11650649.818020] ata1.00: status: { DRDY }
[11651837.989833] ata1.00: status: { DRDY ERR }
[11654729.414605] ata1.00: status: { DRDY ERR }
[11655685.782290] ata1.00: status: { DRDY ERR }
[11657120.774143] ata1.00: status: { DRDY ERR }
[11659704.724995] ata1.00: status: { DRDY }
[11661322.210812] ata1.00: status: { DRDY ERR }
[11662029.088563] ata1.00: status: { DRDY ERR }
[11663314.187972] ata1.00: status: { DRDY ERR }
[11667978.796829] ata1.00: status: { DRDY ERR }
[11670487.088008] ata1.00: status: { DRDY ERR }
[11671800.577054] ata1.00: status: { DRDY ERR }
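To get a feel for how often these status lines show up, here is a rough Python sketch that counts them by flag set and measures the gap between consecutive entries. It assumes the line format shown above and treats the bracketed number as seconds since boot. The names `STATUS_RE`, `parse_status_events`, and `summarize` are hypothetical helpers of mine, and `SAMPLE` is just four lines copied from the grep output.

```python
import re
from collections import Counter

# Matches lines such as: [11645348.631020] ata1.00: status: { DRDY ERR }
STATUS_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(ata\d+(?:\.\d+)?): status: \{ (.+?) \}"
)


def parse_status_events(text):
    """Return (timestamp, port, flags) tuples for each status line."""
    events = []
    for line in text.splitlines():
        m = STATUS_RE.search(line)
        if m:
            flags = tuple(m.group(3).split())
            events.append((float(m.group(1)), m.group(2), flags))
    return events


def summarize(events):
    """Count events per flag set and compute gaps between them in seconds."""
    counts = Counter(flags for _, _, flags in events)
    times = [t for t, _, _ in events]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return counts, gaps


# Four consecutive lines copied from the grep output above.
SAMPLE = """\
[11649612.475350] ata1.00: status: { DRDY ERR }
[11650177.522603] ata1.00: status: { DRDY ERR }
[11650649.818020] ata1.00: status: { DRDY }
[11651837.989833] ata1.00: status: { DRDY ERR }
"""

events = parse_status_events(SAMPLE)
counts, gaps = summarize(events)
print(counts[("DRDY", "ERR")], counts[("DRDY",)])
print([round(g) for g in gaps])
```

On the four sample lines, the gaps work out to several minutes each, so the entries look periodic rather than like a continuous burst.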
dmesg excerpt:
[11464689.083861] sr 1:0:0:0: CDB: Get event status notification: 4a 01 00 00 10 00 00 00 08 00
[11464689.083875] ata1.00: cmd a0/00:00:00:08:00/00:00:00:00:00/a0 tag 0 pio 16392 in
[11464689.083876]          res 40/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x4 (timeout)
[11464689.083896] ata1.00: status: { DRDY }
[11464694.133755] ata1: link is slow to respond, please be patient (ready=0)
[11464699.123711] ata1: device not ready (errno=-16), forcing hardreset
[11464699.123727] ata1: soft resetting link
[11464699.344063] ata1.00: configured for PIO0
[11464699.348375] ata1: EH complete
[11464706.383733] ata1.00: qc timeout (cmd 0xa0)
[11464706.383766] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[11464706.383782] sr 1:0:0:0: CDB: Test Unit Ready: 00 00 00 00 00 00
[11464706.383794] ata1.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[11464706.383795]          res 51/20:03:00:00:00/00:00:00:00:00/a0 Emask 0x5 (timeout)
[11464706.383806] ata1.00: status: { DRDY ERR }
[11464711.433625] ata1: link is slow to respond, please be patient (ready=0)
[11464716.433591] ata1: device not ready (errno=-16), forcing hardreset