RAID5 fails during recovery (mdadm): any chance of resyncing?


My RAID5 is about to die: while restoring a failed HDD, read errors occurred on another HDD, so a rebuild is impossible. I just want to know whether it is possible to recover at least part of the remaining data - most of it should still be there anyway.

I don't have a full backup, because 25 TB of data is not easy to back up. I know: no backup, no mercy. Fair enough. But I want to try. I can't imagine that it is impossible to bring this array back to life.

My current setup:

  • RAID5 with 5 hard drives (WD WD60EFRX Red SA3), 6 TB each
  • the OS is on a separate SSD
  • Ubuntu 16.04 with an mdadm software RAID

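For reference, an array like this is usually created with a command along the following lines. This is only a sketch from memory with example device names, not the exact original invocation; the chunk size and internal bitmap match what mdadm -D reports further down:

$ sudo mdadm --create /dev/md0 --level=5 --raid-devices=5 \
      --chunk=512 --bitmap=internal \
      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
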
One day, /dev/sda failed with the following message:

This is an automatically generated mail message from mdadm running on ***

A Fail event had been detected on md device /dev/md/0.

It could be related to component device /dev/sda1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid5 sde1[5] sdb1[1] sda1[0](F) sdc1[7] sdd1[6]
23441545216 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/4] [_UUUU]
bitmap: 4/44 pages [16KB], 65536KB chunk

unused devices: <none>

Then I did the following:

Checked the disk:

$ sudo smartctl -a /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-93-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD60EFRX-68MYMN1
Serial Number:    WD-WX41D75LN5CP
LU WWN Device Id: 5 0014ee 2620bfa28
Firmware Version: 82.00A82
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5700 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Jan 30 22:02:37 2018 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: FAILED!
Drive failure expected in less than 24 hours. SAVE ALL DATA.
See vendor-specific Attribute list for failed Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        ( 7004) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 723) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x303d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   001   001   051    Pre-fail  Always   FAILING_NOW 37802
  3 Spin_Up_Time            0x0027   217   189   021    Pre-fail  Always       -       8141
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       17248
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   194   194   000    Old_age   Always       -       19683
194 Temperature_Celsius     0x0022   113   103   000    Old_age   Always       -       39
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

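The self-test log above shows that no self-test has ever been run on this disk. For completeness, a short test can be started and its result read back with standard smartctl calls like these (about 2 minutes according to the polling time above):

$ sudo smartctl -t short /dev/sda     # start the short self-test
$ sudo smartctl -l selftest /dev/sda  # read the self-test log when it is done
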
Checked the array:

$ sudo mdadm -D /dev/md0
    /dev/md0:
            Version : 1.2
      Creation Time : Tue Dec 15 22:20:33 2015
         Raid Level : raid5
         Array Size : 23441545216 (22355.60 GiB 24004.14 GB)
      Used Dev Size : 5860386304 (5588.90 GiB 6001.04 GB)
       Raid Devices : 5
      Total Devices : 5
        Persistence : Superblock is persistent

      Intent Bitmap : Internal

        Update Time : Tue Jan 30 22:07:30 2018
              State : clean
     Active Devices : 5
    Working Devices : 5
     Failed Devices : 0
      Spare Devices : 0

             Layout : left-symmetric
         Chunk Size : 512K

               Name : fractal.hostname:0  (local to host fractal.hostname)
               UUID : e9bdcf76:c5e04a88:32d5dfa6:557bdeaf
             Events : 80222

        Number   Major   Minor   RaidDevice State
           0       8        1        0      active sync   /dev/sda1
           1       8       17        1      active sync   /dev/sdb1
           7       8       33        2      active sync   /dev/sdc1
           6       8       49        3      active sync   /dev/sdd1
           5       8       65        4      active sync   /dev/sde1
    #  cat /proc/mdstat
    Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
    md0 : active raid5 sde1[5] sdb1[1] sda1[0] sdc1[7] sdd1[6]
          23441545216 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU]
          bitmap: 4/44 pages [16KB], 65536KB chunk

    unused devices: <none>

So the disk hasn't failed yet, but it is going to. So I removed it from the array,

# mdadm /dev/md0 --fail /dev/sda1
mdadm: set /dev/sda1 faulty in /dev/md0

# mdadm /dev/md0 --remove /dev/sda1
mdadm: hot removed /dev/sda1 from /dev/md0

rebooted, and added the new one after copying the partition tables:

sgdisk --backup=gpt_backup /dev/sdb
sgdisk --load-backup=gpt_backup /dev/sda
sgdisk -G /dev/sda

mdadm --manage /dev/md0 -a /dev/sda1

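Before re-adding the partition one can verify that the GPT copy went through as intended; sgdisk can print both tables for comparison, and /proc/mdstat then shows the rebuild progress (standard tooling, nothing array-specific assumed):

$ sudo sgdisk -p /dev/sdb    # source partition table
$ sudo sgdisk -p /dev/sda    # copy on the replacement disk
$ cat /proc/mdstat           # rebuild progress after the -a
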
During the recovery, another drive (/dev/sdb) failed as follows:

This is an automatically generated mail message from mdadm
running on fractal.hostname

A Fail event had been detected on md device /dev/md/0.

It could be related to component device /dev/sdb1.

Faithfully yours, etc.

P.S. The /proc/mdstat file currently contains the following:

Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10] 
md0 : active raid5 sda1[8] sdb1[1](F) sde1[5] sdc1[7] sdd1[6]
     23441545216 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/3] [__UUU]
     [===================>.]  recovery = 99.5% (5835184192/5860386304) finish=595.0min speed=703K/sec
     bitmap: 0/44 pages [0KB], 65536KB chunk

unused devices: <none>

It obviously could not read some sectors that it needed for the rebuild:

This message was generated by the smartd daemon running on:

  host name:  fractal
  DNS domain: hostname

The following warning/error was logged by the smartd daemon:

Device: /dev/sdb [SAT], 167 Currently unreadable (pending) sectors

Device info:
WDC WD60EFRX-68MYMN1, S/N:WD-WX41D75LN3YP, WWN:5-0014ee-20cb6e76b, FW:82.00A82, 6.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Wed Jan 31 14:15:20 2018 CET
Another message will be sent in 24 hours if the problem persists.

I tried the same procedure three times, and every time it failed in the same way (only at a different progress point, sometimes 60%, sometimes 99.5%).
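
Since the rebuild apparently dies on unreadable sectors of /dev/sdb, the relevant SMART counters of all remaining members can be listed in one go, for example like this (device names as in my setup):

$ for d in /dev/sd[a-e]; do
      echo "== $d =="
      sudo smartctl -A "$d" | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'
  done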

After several attempts I reassembled the array, but it ended the same way:

# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sdb1[1](S) sde1[5](S) sdc1[7](S) sdd1[6](S) sda1[8](S)
      29301944199 blocks super 1.2

unused devices: <none>
# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
     Raid Level : raid0
  Total Devices : 5
    Persistence : Superblock is persistent

          State : inactive

           Name : fractal.hostname:0  (local to host fractal.hostname)
           UUID : e9bdcf76:c5e04a88:32d5dfa6:557bdeaf
         Events : 133218

    Number   Major   Minor   RaidDevice

       -       8        1        -        /dev/sda1
       -       8       17        -        /dev/sdb1
       -       8       33        -        /dev/sdc1
       -       8       49        -        /dev/sdd1
       -       8       65        -        /dev/sde1
# mdadm --assemble --update=resync --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: forcing event count in /dev/sdb1(1) from 132262 upto 133218
mdadm: clearing FAULTY flag for device 0 in /dev/md0 for /dev/sda1
mdadm: Marking array /dev/md0 as 'clean'
mdadm: failed to RUN_ARRAY /dev/md0: Input/output error
# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
     Raid Level : raid0
  Total Devices : 2
    Persistence : Superblock is persistent

          State : inactive

           Name : fractal.hostname:0  (local to host fractal.hostname)
           UUID : e9bdcf76:c5e04a88:32d5dfa6:557bdeaf
         Events : 133218

    Number   Major   Minor   RaidDevice

       -       8       49        -        /dev/sdd1
       -       8       65        -        /dev/sde1
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : inactive sde1[5](S) sdd1[6](S)
      11720776864 blocks super 1.2

unused devices: <none>
# mdadm --assemble --update=resync --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
mdadm: /dev/sdd1 is busy - skipping
mdadm: /dev/sde1 is busy - skipping
mdadm: Merging with already-assembled /dev/md/0
mdadm: Marking array /dev/md/0 as 'clean'
mdadm: /dev/md/0 has been started with 4 drives (out of 5) and 1 spare.
# mdadm -D /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Dec 15 22:20:33 2015
     Raid Level : raid5
     Array Size : 23441545216 (22355.60 GiB 24004.14 GB)
  Used Dev Size : 5860386304 (5588.90 GiB 6001.04 GB)
   Raid Devices : 5
  Total Devices : 5
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat Feb 10 09:42:03 2018
          State : clean, degraded, recovering
 Active Devices : 4
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 512K

 Rebuild Status : 0% complete

           Name : fractal.hostname:0  (local to host fractal.hostname)
           UUID : e9bdcf76:c5e04a88:32d5dfa6:557bdeaf
         Events : 133220

    Number   Major   Minor   RaidDevice State
       8       8        1        0      spare rebuilding   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       7       8       33        2      active sync   /dev/sdc1
       6       8       49        3      active sync   /dev/sdd1
       5       8       65        4      active sync   /dev/sde1

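For completeness, the event counters that the forced assembly is reconciling (132262 vs. 133218 above) can be read straight from the member superblocks; this is plain mdadm --examine output with some filtering, device names as above:

$ sudo mdadm --examine /dev/sd[a-e]1 | grep -E '^/dev/|Events|Update Time|Array State'
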
Do you see any chance of recovering my data without loss? Is there any possibility of recovering at least part of my data? Obviously it could sync up to 99.5%, but then I could not use it.

Many thanks in advance.

    
by swat, 10.02.2018 / 09:59
