My server was running a RAID 1 array with two disks. One of those disks failed today and was replaced.
I copied the GPT partition table from the surviving disk (sdb) to the new hard disk (sda) with:
sgdisk -R /dev/sda /dev/sdb
and randomized the GUIDs on the new disk with:
sgdisk -G /dev/sda
I then added the two new partitions to their RAID arrays:
mdadm /dev/md4 -a /dev/sda4
and
mdadm /dev/md5 -a /dev/sda5
/dev/md4 was rebuilt correctly, but /dev/md5 was not.
When I ran cat /proc/mdstat right after those commands, it showed this:
Personalities : [raid1]
md5 : active raid1 sda5[2] sdb5[1]
2820667711 blocks super 1.2 [2/1] [_U]
[>....................] recovery = 0.0% (2109952/2820667711) finish=423.0min speed=111050K/sec
md4 : active raid1 sda4[2] sdb4[0]
15727544 blocks super 1.2 [2/2] [UU]
unused devices: <none>
That looked correct: it was trying to rebuild md5. But a few minutes later it stopped, and now cat /proc/mdstat returns:
Personalities : [raid1]
md5 : active raid1 sda5[2](S) sdb5[1]
2820667711 blocks super 1.2 [2/1] [_U]
md4 : active raid1 sda4[2] sdb4[0]
15727544 blocks super 1.2 [2/2] [UU]
unused devices: <none>
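To make the difference between the two outputs explicit, here is a minimal Python sketch that reads the text above and reports, per array, whether it is degraded and which members are marked as spares with (S). The function name parse_mdstat and the regular expressions are my own illustration, not part of mdadm, and they only cover the simple layout shown here (an active array with a RAID level on the header line).

```python
import re

# Sample text copied from the second /proc/mdstat output above.
MDSTAT = """\
Personalities : [raid1]
md5 : active raid1 sda5[2](S) sdb5[1]
2820667711 blocks super 1.2 [2/1] [_U]
md4 : active raid1 sda4[2] sdb4[0]
15727544 blocks super 1.2 [2/2] [UU]
unused devices: <none>
"""


def parse_mdstat(text):
    """Return {array: {"members": {dev: flags}, "expected": n, "active": m}}."""
    arrays = {}
    current = None
    for line in text.splitlines():
        # Header line, e.g. "md5 : active raid1 sda5[2](S) sdb5[1]"
        head = re.match(r"^(md\d+)\s*:\s*(\S+)\s+(\S+)\s+(.*)$", line)
        if head:
            name, _state, _level, rest = head.groups()
            members = {}
            for dev, flags in re.findall(r"(\w+)\[\d+\]((?:\(\w\))*)", rest):
                members[dev] = flags  # "(S)" marks a spare, "" a normal member
            current = {"members": members, "expected": None, "active": None}
            arrays[name] = current
            continue
        # Status line, e.g. "... [2/1] [_U]": expected devices / active devices
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if counts and current is not None:
            current["expected"] = int(counts.group(1))
            current["active"] = int(counts.group(2))
    return arrays


for name, info in parse_mdstat(MDSTAT).items():
    degraded = info["active"] < info["expected"]
    spares = [dev for dev, flags in info["members"].items() if "(S)" in flags]
    print(name, "degraded" if degraded else "ok", "spares:", spares)
```

For the sample above this reports md5 as degraded with sda5 sitting as a spare, and md4 as fully in sync.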
Why did it stop rebuilding onto the new disk? Here is what I get when running mdadm --detail /dev/md5:
/dev/md5:
Version : 1.2
Creation Time : Sun Sep 16 15:26:58 2012
Raid Level : raid1
Array Size : 2820667711 (2690.00 GiB 2888.36 GB)
Used Dev Size : 2820667711 (2690.00 GiB 2888.36 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Sat Dec 27 04:01:26 2014
State : clean, degraded
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Name : rescue:5 (local to host rescue)
UUID : 29868a4d:f63c6b43:ee926581:fd775604
Events : 5237753
Number Major Minor RaidDevice State
0 0 0 0 removed
1 8 21 1 active sync /dev/sdb5
2 8 5 - spare /dev/sda5
Thanks @Michael Hampton for your answer. I'm back after a night's sleep :-) So I checked dmesg and got this:
[Sat Dec 27 04:01:04 2014] md: recovery of RAID array md5
[Sat Dec 27 04:01:04 2014] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
[Sat Dec 27 04:01:04 2014] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
[Sat Dec 27 04:01:04 2014] md: using 128k window, over a total of 2820667711k.
[Sat Dec 27 04:01:04 2014] RAID1 conf printout:
[Sat Dec 27 04:01:04 2014] --- wd:2 rd:2
[Sat Dec 27 04:01:04 2014] disk 0, wo:0, o:1, dev:sdb4
[Sat Dec 27 04:01:04 2014] disk 1, wo:0, o:1, dev:sda4
[Sat Dec 27 04:01:21 2014] ata2.00: exception Emask 0x0 SAct 0x1e000 SErr 0x0 action 0x0
[Sat Dec 27 04:01:21 2014] ata2.00: irq_stat 0x40000008
[Sat Dec 27 04:01:21 2014] ata2.00: cmd 60/80:68:00:12:51/03:00:0d:00:00/40 tag 13 ncq 458752 in
[Sat Dec 27 04:01:21 2014] res 41/40:80:68:14:51/00:03:0d:00:00/00 Emask 0x409 (media error) <F>
[Sat Dec 27 04:01:21 2014] ata2.00: configured for UDMA/133
[Sat Dec 27 04:01:21 2014] sd 1:0:0:0: [sdb] Unhandled sense code
[Sat Dec 27 04:01:21 2014] sd 1:0:0:0: [sdb]
[Sat Dec 27 04:01:21 2014] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sat Dec 27 04:01:21 2014] sd 1:0:0:0: [sdb]
[Sat Dec 27 04:01:21 2014] Sense Key : Medium Error [current] [descriptor]
[Sat Dec 27 04:01:21 2014] Descriptor sense data with sense descriptors (in hex):
[Sat Dec 27 04:01:21 2014] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Sat Dec 27 04:01:21 2014] 0d 51 14 68
[Sat Dec 27 04:01:21 2014] sd 1:0:0:0: [sdb]
[Sat Dec 27 04:01:21 2014] Add. Sense: Unrecovered read error - auto reallocate failed
[Sat Dec 27 04:01:21 2014] sd 1:0:0:0: [sdb] CDB:
[Sat Dec 27 04:01:21 2014] Read(16): 88 00 00 00 00 00 0d 51 12 00 00 00 03 80 00 00
[Sat Dec 27 04:01:21 2014] end_request: I/O error, dev sdb, sector 223417448
[Sat Dec 27 04:01:21 2014] ata2: EH complete
[Sat Dec 27 04:01:24 2014] ata2.00: exception Emask 0x0 SAct 0x8 SErr 0x0 action 0x0
[Sat Dec 27 04:01:24 2014] ata2.00: irq_stat 0x40000008
[Sat Dec 27 04:01:24 2014] ata2.00: cmd 60/08:18:68:14:51/00:00:0d:00:00/40 tag 3 ncq 4096 in
[Sat Dec 27 04:01:24 2014] res 41/40:08:68:14:51/00:00:0d:00:00/00 Emask 0x409 (media error) <F>
[Sat Dec 27 04:01:24 2014] ata2.00: configured for UDMA/133
[Sat Dec 27 04:01:24 2014] sd 1:0:0:0: [sdb] Unhandled sense code
[Sat Dec 27 04:01:24 2014] sd 1:0:0:0: [sdb]
[Sat Dec 27 04:01:24 2014] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[Sat Dec 27 04:01:24 2014] sd 1:0:0:0: [sdb]
[Sat Dec 27 04:01:24 2014] Sense Key : Medium Error [current] [descriptor]
[Sat Dec 27 04:01:24 2014] Descriptor sense data with sense descriptors (in hex):
[Sat Dec 27 04:01:24 2014] 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
[Sat Dec 27 04:01:24 2014] 0d 51 14 68
[Sat Dec 27 04:01:24 2014] sd 1:0:0:0: [sdb]
[Sat Dec 27 04:01:24 2014] Add. Sense: Unrecovered read error - auto reallocate failed
[Sat Dec 27 04:01:24 2014] sd 1:0:0:0: [sdb] CDB:
[Sat Dec 27 04:01:24 2014] Read(16): 88 00 00 00 00 00 0d 51 14 68 00 00 00 08 00 00
[Sat Dec 27 04:01:24 2014] end_request: I/O error, dev sdb, sector 223417448
[Sat Dec 27 04:01:24 2014] ata2: EH complete
[Sat Dec 27 04:01:24 2014] md/raid1:md5: sdb: unrecoverable I/O read error for block 4219904
[Sat Dec 27 04:01:24 2014] md: md5: recovery interrupted.
[Sat Dec 27 04:01:24 2014] RAID1 conf printout:
[Sat Dec 27 04:01:24 2014] --- wd:1 rd:2
[Sat Dec 27 04:01:24 2014] disk 0, wo:1, o:1, dev:sda5
[Sat Dec 27 04:01:24 2014] disk 1, wo:0, o:1, dev:sdb5
[Sat Dec 27 04:01:24 2014] RAID1 conf printout:
[Sat Dec 27 04:01:24 2014] --- wd:1 rd:2
[Sat Dec 27 04:01:24 2014] disk 1, wo:0, o:1, dev:sdb5
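As a cross-check on the log above, the failing sector can be read straight out of the hex dumps. The sketch below is only my own illustration: it decodes the last four bytes of the descriptor sense data and the Read(16) CDB of the second attempt, assuming the standard Read(16) layout (opcode, flags, 8-byte LBA, 4-byte transfer length) and 512-byte logical sectors.

```python
def lba_from_hex(byte_string):
    """Interpret space-separated hex bytes as a big-endian integer."""
    return int.from_bytes(bytes.fromhex(byte_string), "big")


# Last four bytes of the descriptor sense data in the log above.
sense_lba = lba_from_hex("0d 51 14 68")

# Read(16) CDB from the second attempt: byte 0 is the opcode (0x88),
# byte 1 holds flags, bytes 2-9 the starting LBA, bytes 10-13 the
# transfer length in sectors.
cdb = bytes.fromhex("88 00 00 00 00 00 0d 51 14 68 00 00 00 08 00 00")
start_lba = int.from_bytes(cdb[2:10], "big")
length_sectors = int.from_bytes(cdb[10:14], "big")

print(sense_lba)             # 223417448, same as "dev sdb, sector 223417448"
print(start_lba)             # 223417448
print(length_sectors * 512)  # 4096, same as "ncq 4096 in" (512-byte sectors assumed)
```

Both attempts fail on the same sector of sdb: the first was a large read that included it, and the second was an 8-sector retry starting exactly there.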
So it looks like a read error. But the SMART data doesn't seem that bad (if I'm reading it correctly):
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 088 087 006 Pre-fail Always - 154455820
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 5
5 Reallocated_Sector_Ct 0x0033 084 084 036 Pre-fail Always - 21664
7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 38808769144
9 Power_On_Hours 0x0032 071 071 000 Old_age Always - 26073
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 5
183 Runtime_Bad_Block 0x0032 099 099 000 Old_age Always - 1
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 721
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 4295032833
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 063 061 045 Old_age Always - 37 (Min/Max 33/37)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 3
193 Load_Cycle_Count 0x0032 095 095 000 Old_age Always - 10183
194 Temperature_Celsius 0x0022 037 040 000 Old_age Always - 37 (0 21 0 0)
197 Current_Pending_Sector 0x0012 088 088 000 Old_age Always - 2072
198 Offline_Uncorrectable 0x0010 088 088 000 Old_age Offline - 2072
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 157045479198210
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 4435703883570
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 5487937263078
SMART Error Log Version: 1
ATA Error Count: 6 (device log contains only the most recent five errors)
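For anyone reading the table above, a small Python sketch like the following pulls out the raw counters usually looked at first for media problems (IDs 5, 187, 197 and 198). The choice of IDs and the helper name parse_rows are my own assumptions for illustration; the rows are copied from the output above.

```python
# Rows copied from the SMART attribute table above.
SMART_ROWS = """\
5 Reallocated_Sector_Ct 0x0033 084 084 036 Pre-fail Always - 21664
187 Reported_Uncorrect 0x0032 001 001 000 Old_age Always - 721
197 Current_Pending_Sector 0x0012 088 088 000 Old_age Always - 2072
198 Offline_Uncorrectable 0x0010 088 088 000 Old_age Offline - 2072
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

# Attribute IDs to report (an assumption made for this sketch).
WATCH = (5, 187, 197, 198)


def parse_rows(text):
    """Map attribute ID to its name, normalized value, threshold and raw value."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # skip headers and anything that is not an attribute row
        rows[int(fields[0])] = {
            "name": fields[1],
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "raw": int(fields[9]),
        }
    return rows


rows = parse_rows(SMART_ROWS)
nonzero = [attr_id for attr_id in WATCH if rows[attr_id]["raw"] > 0]
for attr_id in nonzero:
    row = rows[attr_id]
    above = row["value"] > row["thresh"]
    print(attr_id, row["name"], "raw:", row["raw"], "value above threshold:", above)
```

All four watched attributes have nonzero raw values here, while their normalized values are still above the thresholds, so a check based only on thresholds would not flag this disk.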
Anyway, thanks for your answer. And yes, if I were setting up the server again, I definitely wouldn't use more than one partition for my RAID array (in this case md5 is actually using LVM).
Thanks,