My RAID5 array is about to die: while rebuilding onto the replacement for a failed HDD, read errors occurred on another HDD, so the rebuild cannot complete. I just want to know whether it is possible to recover at least part of the remaining data - most of it should still be there anyway.
I do not have a full backup, because 25 TB of data is not easy to back up. I know: no backup, no mercy. Fine. But I want to try. I cannot imagine that it is impossible to bring this array back to life.
My current setup is shown in the command output below.
One day, /dev/sda failed with the following message:
This is an automatically generated mail message from mdadm running on ***
A Fail event had been detected on md device /dev/md/0.
It could be related to component device /dev/sda1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sde1[5] sdb1[1] sda1[0](F) sdc1[7] sdd1[6]
23441545216 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [_UUUU]
bitmap: 4/44 pages [16KB], 65536KB chunk
unused devices: <none>
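(For reference: before pulling a drive, the failing md component can be matched to a physical disk by its serial number, e.g. via the udev by-id symlinks or smartctl - standard Linux/udev and smartmontools commands, nothing specific to this setup:)
$ ls -l /dev/disk/by-id/ | grep sda      # the by-id names contain the drive serial number
$ sudo smartctl -i /dev/sda | grep -i serial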
So here is what I did.
Checked the disk:
$ sudo smartctl -a /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-93-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD60EFRX-68MYMN1
Serial Number: WD-WX41D75LN5CP
LU WWN Device Id: 5 0014ee 2620bfa28
Firmware Version: 82.00A82
User Capacity: 6.001.175.126.016 bytes [6,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5700 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Jan 30 22:02:37 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 7004) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 723) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 001 001 051 Pre-fail Always FAILING_NOW 37802
3 Spin_Up_Time 0x0027 217 189 021 Pre-fail Always - 8141
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 26
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 077 077 000 Old_age Always - 17248
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 26
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 4
193 Load_Cycle_Count 0x0032 194 194 000 Old_age Always - 19683
194 Temperature_Celsius 0x0022 113 103 000 Old_age Always - 39
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
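The log above also shows that no SMART self-test has ever been run on this drive. Purely for reference, an extended surface self-test can be started and its result read back with smartctl roughly like this (standard smartmontools usage, noted here only for context):
$ sudo smartctl -t long /dev/sda       # start the extended self-test in the background
$ sudo smartctl -l selftest /dev/sda   # show the self-test log / progress afterwards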
Checked the array:
$ sudo mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Dec 15 22:20:33 2015
Raid Level : raid5
Array Size : 23441545216 (22355.60 GiB 24004.14 GB)
Used Dev Size : 5860386304 (5588.90 GiB 6001.04 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Jan 30 22:07:30 2018
State : clean
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : fractal.hostname:0 (local to host fractal.hostname)
UUID : e9bdcf76:c5e04a88:32d5dfa6:557bdeaf
Events : 80222
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
7 8 33 2 active sync /dev/sdc1
6 8 49 3 active sync /dev/sdd1
5 8 65 4 active sync /dev/sde1
# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sde1[5] sdb1[1] sda1[0] sdc1[7] sdd1[6]
23441545216 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
bitmap: 4/44 pages [16KB], 65536KB chunk
unused devices: <none>
So the disk had not actually failed yet, but it was about to. I therefore removed it from the array,
# mdadm /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0
# mdadm /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0
rebooted, and added the new drive after copying the partition table:
sgdisk --backup=gpt_backup /dev/sdb
sgdisk --load-backup=gpt_backup /dev/sda
sgdisk -G /dev/sda
mdadm --manage /dev/md0 -a /dev/sda1
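(The sgdisk calls copy the GPT layout from a healthy member to the replacement and then randomize its GUIDs. Purely for reference, the result can be sanity-checked and the rebuild watched with standard tools such as:)
sgdisk -p /dev/sda     # print the new partition table of the replacement disk
cat /proc/mdstat       # follow the recovery progress after the re-add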
During the rebuild, another drive (/dev/sdb) failed as follows:
This is an automatically generated mail message from mdadm
running on fractal.hostname
A Fail event had been detected on md device /dev/md/0.
It could be related to component device /dev/sdb1.
Faithfully yours, etc.
P.S. The /proc/mdstat file currently contains the following:
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sda1[8] sdb1[1](F) sde1[5] sdc1[7] sdd1[6]
23441545216 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/3] [__UUU]
[===================>.] recovery = 99.5% (5835184192/5860386304) finish=595.0min speed=703K/sec
bitmap: 0/44 pages [0KB], 65536KB chunk
unused devices: <none>
Obviously, some sectors needed for the rebuild could not be read:
This message was generated by the smartd daemon running on:
host name: fractal
DNS domain: hostname
The following warning/error was logged by the smartd daemon:
Device: /dev/sdb [SAT], 167 Currently unreadable (pending) sectors
Device info:
WDC WD60EFRX-68MYMN1, S/N:WD-WX41D75LN3YP, WWN:5-0014ee-20cb6e76b, FW:82.00A82, 6.00 TB
For details see host's SYSLOG.
You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Jan 31 14:15:20 2018 CET
Another message will be sent in 24 hours if the problem persists.
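For reference, the pending-sector counter and the corresponding kernel read errors can be inspected directly (journalctl assumes a systemd journal; otherwise /var/log/syslog holds the same messages):
$ sudo smartctl -A /dev/sdb | grep -iE 'pending|uncorrect'   # pending / uncorrectable sector attributes
$ journalctl -k | grep -i sdb                                # kernel I/O errors for the drive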
I tried the same procedure three times, and every time it failed in the same way (only at a different point of progress, sometimes 60%, sometimes 99.5%).
After several attempts I reassembled the array, but it ended the same way:
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb1[1](S) sde1[5](S) sdc1[7](S) sdd1[6](S) sda1[8](S)
29301944199 blocks super 1.2
unused devices: <none>
# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Raid Level : raid0
Total Devices : 5
Persistence : Superblock is persistent
State : inactive
Name : fractal.hostname:0 (local to host fractal.hostname)
UUID : e9bdcf76:c5e04a88:32d5dfa6:557bdeaf
Events : 133218
Number Major Minor RaidDevice
- 8 1 - /dev/sda1
- 8 17 - /dev/sdb1
- 8 33 - /dev/sdc1
- 8 49 - /dev/sdd1
- 8 65 - /dev/sde1
# mdadm --assemble --update=resync --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: forcing event count in /dev/sdb1(1) from 132262 upto 133218
mdadm: clearing FAULTY flag for device 0 in /dev/md0 for /dev/sda1
mdadm: Marking array /dev/md0 as 'clean'
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
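The Input/output error from RUN_ARRAY should correspond to read errors reported by the kernel during assembly; for reference, they can be looked up with e.g.:
# dmesg | tail -n 50            # recent kernel messages, including md / ata errors
# journalctl -k | grep -i md0   # the same from the journal, filtered for the array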
# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Raid Level : raid0
Total Devices : 2
Persistence : Superblock is persistent
State : inactive
Name : fractal.hostname:0 (local to host fractal.hostname)
UUID : e9bdcf76:c5e04a88:32d5dfa6:557bdeaf
Events : 133218
Number Major Minor RaidDevice
- 8 49 - /dev/sdd1
- 8 65 - /dev/sde1
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sde1[5](S) sdd1[6](S)
11720776864 blocks super 1.2
unused devices: <none>
# mdadm --assemble --update=resync --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: /dev/sdd1 is busy - skipping
mdadm: /dev/sde1 is busy - skipping
mdadm: Merging with already-assembled /dev/md/0
mdadm: Marking array /dev/md/0 as 'clean'
mdadm: /dev/md/0 has been started with 4 drives (out of 5) and 1 spare.
# mdadm -D /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Dec 15 22:20:33 2015
Raid Level : raid5
Array Size : 23441545216 (22355.60 GiB 24004.14 GB)
Used Dev Size : 5860386304 (5588.90 GiB 6001.04 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sat Feb 10 09:42:03 2018
State : clean, degraded, recovering
Active Devices : 4
Working Devices : 5
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 512K
Rebuild Status : 0% complete
Name : fractal.hostname:0 (local to host fractal.hostname)
UUID : e9bdcf76:c5e04a88:32d5dfa6:557bdeaf
Events : 133220
Number Major Minor RaidDevice State
8 8 1 0 spare rebuilding /dev/sda1
1 8 17 1 active sync /dev/sdb1
7 8 33 2 active sync /dev/sdc1
6 8 49 3 active sync /dev/sdd1
5 8 65 4 active sync /dev/sde1
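(For completeness: the progress of this forced rebuild can be followed with the usual tools, e.g.:)
# watch cat /proc/mdstat                            # live recovery progress
# mdadm --detail /dev/md0 | grep 'Rebuild Status'   # percentage as reported by mdadm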
Do you see any chance of recovering my data without loss? Is there any way to recover at least part of my data? Obviously the array could resync up to 99.5%, but I could not make use of it.
Many thanks in advance.
Tags: raid, mdadm, linux, software-raid