ZFS keeps repairing the same disk when scrubbing - what does this mean?

I have a ZFS pool made up of 6 disks, organized as 3 striped mirrors. The server is a Supermicro X11SSM-F with a Xeon CPU and 32 GB of ECC RAM, running Ubuntu 17.04. I'm using two Icy Dock MB154SP-B enclosures to physically host the disks, and there are 8 SATA 3 connectors on the motherboard, so the disks are presented directly to ZFS (no RAID card in between).
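
For reference, a pool with this layout (three two-way mirrors striped together) would have been created with something along the following lines; the device names below are placeholders, not the actual disks in this pool:

$ sudo zpool create cloudpool \
    mirror /dev/disk/by-id/<diskA> /dev/disk/by-id/<diskB> \
    mirror /dev/disk/by-id/<diskC> /dev/disk/by-id/<diskD> \
    mirror /dev/disk/by-id/<diskE> /dev/disk/by-id/<diskF>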

This setup had been working fine until very recently. Then I noticed, when running zpool status, that the last scrub had repaired some data:

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub repaired 2.98M in 4h56m with 0 errors on Sun Jul  9 05:20:16 2017
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors
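
To see whether ZFS attributes those repairs to a particular device, it can also help to look at the verbose pool status and at the ZFS event log (zpool events is specific to ZFS on Linux; I'm assuming the version shipped with Ubuntu 17.04 supports it):

$ sudo zpool status -v cloudpool   # lists any files affected by known data errors
$ sudo zpool events -v cloudpool   # low-level events, including per-vdev checksum/IO errors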

Intrigued, I decided to run a new scrub:

$ sudo zpool scrub cloudpool
... giving it a few minutes to run ...

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub in progress since Tue Jul 11 22:55:12 2017
    124M scanned out of 4.52T at 4.59M/s, 286h55m to go
    256K repaired, 0.00% done
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0  (repairing)
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

Once the scrub finished, I got the following:

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub repaired 624K in 4h35m with 0 errors on Wed Jul 12 03:31:00 2017
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

I then decided to scrub the pool again. After letting it run for a while, I'm getting the following:

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub in progress since Wed Jul 12 09:55:19 2017
    941G scanned out of 4.52T at 282M/s, 3h42m to go
    112K repaired, 20.34% done
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0  (repairing)
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

Looking at the disk's SMART data, I don't see anything suspicious (except maybe Raw_Read_Error_Rate?):

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.10.0-26-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E5NFLRU3
LU WWN Device Id: 5 0014ee 262ee543f
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 12 10:19:08 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (52020) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 520) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       99
  3 Spin_Up_Time            0x0027   186   176   021    Pre-fail  Always       -       7683
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6735
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7500
194 Temperature_Celsius     0x0022   110   108   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        13         -
# 2  Conveyance offline  Completed without error       00%         1         -
# 3  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
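
For what it's worth, I could also kick off a fresh extended self-test so the drive re-reads its own surface (assuming /dev/sdb still maps to this disk), and then check whether anything shows up in the self-test log:

$ sudo smartctl -t long /dev/sdb        # start an extended (offline) self-test
$ sudo smartctl -l selftest /dev/sdb    # check the result once the ~520-minute test has finished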

However, I do see some strange messages in the dmesg output:

[100240.777601] ata2.00: exception Emask 0x0 SAct 0x3000000 SErr 0x0 action 0x0
[100240.777608] ata2.00: irq_stat 0x40000008
[100240.777614] ata2.00: failed command: READ FPDMA QUEUED
[100240.777624] ata2.00: cmd 60/00:c0:c8:bc:01/01:00:00:00:00/40 tag 24 ncq dma 131072 in
                         res 41/40:00:a8:bd:01/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[100240.777628] ata2.00: status: { DRDY ERR }
[100240.777631] ata2.00: error: { UNC }
[100240.779320] ata2.00: configured for UDMA/133
[100240.779342] sd 1:0:0:0: [sdb] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100240.779346] sd 1:0:0:0: [sdb] tag#24 Sense Key : Medium Error [current] 
[100240.779350] sd 1:0:0:0: [sdb] tag#24 Add. Sense: Unrecovered read error - auto reallocate failed
[100240.779354] sd 1:0:0:0: [sdb] tag#24 CDB: Read(16) 88 00 00 00 00 00 00 01 bc c8 00 00 01 00 00 00
[100240.779357] blk_update_request: I/O error, dev sdb, sector 114088
[100240.779384] ata2: EH complete
[100244.165785] ata2.00: exception Emask 0x0 SAct 0x3d SErr 0x0 action 0x0
[100244.165793] ata2.00: irq_stat 0x40000008
[100244.165798] ata2.00: failed command: READ FPDMA QUEUED
[100244.165807] ata2.00: cmd 60/00:00:c8:be:01/01:00:00:00:00/40 tag 0 ncq dma 131072 in
                         res 41/40:00:70:bf:01/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[100244.165811] ata2.00: status: { DRDY ERR }
[100244.165814] ata2.00: error: { UNC }
[100244.167465] ata2.00: configured for UDMA/133
[100244.167488] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100244.167492] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] 
[100244.167496] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[100244.167500] sd 1:0:0:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 01 be c8 00 00 01 00 00 00
[100244.167503] blk_update_request: I/O error, dev sdb, sector 114544
[100244.167531] ata2: EH complete
[100248.177949] ata2.00: exception Emask 0x0 SAct 0x41c00002 SErr 0x0 action 0x0
[100248.177957] ata2.00: irq_stat 0x40000008
[100248.177963] ata2.00: failed command: READ FPDMA QUEUED
[100248.177972] ata2.00: cmd 60/00:f0:c8:c0:01/01:00:00:00:00/40 tag 30 ncq dma 131072 in
                         res 41/40:00:b8:c1:01/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[100248.177977] ata2.00: status: { DRDY ERR }
[100248.177980] ata2.00: error: { UNC }
[100248.179638] ata2.00: configured for UDMA/133
[100248.179667] sd 1:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100248.179671] sd 1:0:0:0: [sdb] tag#30 Sense Key : Medium Error [current] 
[100248.179675] sd 1:0:0:0: [sdb] tag#30 Add. Sense: Unrecovered read error - auto reallocate failed
[100248.179679] sd 1:0:0:0: [sdb] tag#30 CDB: Read(16) 88 00 00 00 00 00 00 01 c0 c8 00 00 01 00 00 00
[100248.179682] blk_update_request: I/O error, dev sdb, sector 115128
[100248.179705] ata2: EH complete
...

Going through dmesg, I see 31 such occurrences in the logs:

[100240.779357] blk_update_request: I/O error, dev sdb, sector 114088
[100244.167503] blk_update_request: I/O error, dev sdb, sector 114544
[100248.179682] blk_update_request: I/O error, dev sdb, sector 115128
[100251.599649] blk_update_request: I/O error, dev sdb, sector 115272
[100255.812020] blk_update_request: I/O error, dev sdb, sector 115576
[100259.636088] blk_update_request: I/O error, dev sdb, sector 115768
[100263.400169] blk_update_request: I/O error, dev sdb, sector 116000
[100267.912099] blk_update_request: I/O error, dev sdb, sector 116472
[100271.300223] blk_update_request: I/O error, dev sdb, sector 116680
[100274.732989] blk_update_request: I/O error, dev sdb, sector 117000
[100279.665331] blk_update_request: I/O error, dev sdb, sector 118624
[100283.043738] blk_update_request: I/O error, dev sdb, sector 118768
[100286.456260] blk_update_request: I/O error, dev sdb, sector 119072
[100293.472354] blk_update_request: I/O error, dev sdb, sector 7814018576
[100298.443416] blk_update_request: I/O error, dev sdb, sector 119496
[100302.236908] blk_update_request: I/O error, dev sdb, sector 119968
[100305.655675] blk_update_request: I/O error, dev sdb, sector 120032
[100309.450754] blk_update_request: I/O error, dev sdb, sector 120496
[100313.724792] blk_update_request: I/O error, dev sdb, sector 121512
[100324.782008] blk_update_request: I/O error, dev sdb, sector 186032
[100329.002031] blk_update_request: I/O error, dev sdb, sector 189536
[100333.057101] blk_update_request: I/O error, dev sdb, sector 189680
[100336.476953] blk_update_request: I/O error, dev sdb, sector 189888
[100341.133527] blk_update_request: I/O error, dev sdb, sector 190408
[100349.890540] blk_update_request: I/O error, dev sdb, sector 191824
[353944.190625] blk_update_request: I/O error, dev sdb, sector 115480
[353951.660635] blk_update_request: I/O error, dev sdb, sector 116536
[353959.391011] blk_update_request: I/O error, dev sdb, sector 118976
[353966.811863] blk_update_request: I/O error, dev sdb, sector 120176
[353978.447354] blk_update_request: I/O error, dev sdb, sector 189984
[393732.681767] blk_update_request: I/O error, dev sdb, sector 190000
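
One way to spot-check whether any of these sectors is still unreadable is to ask the drive to read it directly, bypassing ZFS (read-only; the sector number is taken from the first log line above, and I'm assuming /dev/sdb is still this drive):

$ sudo hdparm --read-sector 114088 /dev/sdb   # errors out if the sector is still unreadable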

I'm not quite sure what to make of this:

  • Why does the scrub keep repairing data on the same disk? The amount of repaired data is going down, which seems to indicate that the data is being "permanently" fixed, but why is there still data to repair when running a scrub literally hours after the previous one?
  • Why don't I see any indication of read/write/checksum errors in zpool status, even though ZFS finds data to repair on every scrub?
  • Why am I seeing UREs on the disk, yet nothing looks suspicious in the SMART report?
  • What does auto reallocate failed mean? Has the disk already run out of spare blocks? This system has been running without problems for 6 months, so I would have expected issues with this disk to show up somewhat earlier. (A quick way to check the relevant SMART counters is sketched right after this list.)
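
A minimal way to keep an eye on the counters that should move if the drive were actually remapping or running out of spare sectors (again assuming /dev/sdb):

$ sudo smartctl -A /dev/sdb | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'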

And, in practical terms, what does this mean for this particular disk? Does it need to be replaced?

Edit #1: After the latest scrub, I now get the following:

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub repaired 0 in 4h35m with 0 errors on Wed Jul 12 21:44:41 2017
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

smartctl now says:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       122
  3 Spin_Up_Time            0x0027   186   176   021    Pre-fail  Always       -       7683
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6749
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7507
194 Temperature_Celsius     0x0022   114   108   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

So... am I fine now? What was all that about?

by FlorentR 12.07.2017 / 11:14

0 answers