Problematic hard drive in LVM - is it broken?


I have an LVM volume set up across several hard drives, and one of them seems to be failing, or at least something strange is going on. Every time the logical volume `series` sees heavy write activity, the running program (rTorrent, most of the time) hangs and dmesg reports

ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
ata6.00: irq_stat 0x00400000, PHY RDY changed
ata6: SError: { PHYRdyChg LinkSeq TrStaTrns }
ata6.00: failed command: FLUSH CACHE EXT
ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:2c:ff:e3:e3/00:00:39:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
end_request: I/O error, dev sdf, sector 0
ata6: EH complete
I/O error in filesystem ("dm-3") meta-data dev dm-3 block 0x640092a       ("xlog_iodone") error 5 buf count 32768
xfs_force_shutdown(dm-3,0x2) called from line 1043 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8119b919
Filesystem "dm-3": Log I/O Error Detected.  Shutting down filesystem: dm-3
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(dm-3,0x2) called from line 811 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8119ccfb
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
... and so on

The volume itself:

--- Logical volume ---
  LV Name                /dev/storage/series
  VG Name                storage
  LV UUID                sF6I3A-Ttt5-PEml-BY5i-edOV-43ha-5P75Z3
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                2.86 TiB
  Current LE             748800
  Segments               29
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

I then umount all the LVM volumes and try to run xfs_check on one of them (all the logical volumes use XFS). It says

ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_check. If you are unable to mount the filesystem, then use the xfs_repair -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.

so I go ahead and mount it, which works fine, then unmount again so I can run the check. The check runs for a while, until it gets killed for using too much memory.
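The replay-then-check sequence described above looks roughly like this (the mount point `/mnt/series` is hypothetical; adjust to your setup):

```shell
# Mount once so XFS replays its journal, then unmount before checking.
# /mnt/series is a hypothetical mount point.
mount /dev/storage/series /mnt/series
umount /mnt/series
xfs_check /dev/storage/series
```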

# xfs_check /dev/storage/series 
/usr/sbin/xfs_check: line 31: 14350 Killed 
                xfs_db$DBOPTS -F -i -p xfs_check -c "check$OPTS" $1

dmesg then reports

xfs_db invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
xfs_db cpuset=/ mems_allowed=0
Pid: 14350, comm: xfs_db Tainted: P           2.6.32-gentoo-r7 #1
Call Trace:
 [<ffffffff81067aec>] ? 0xffffffff81067aec
 [<ffffffff8107a848>] 0xffffffff8107a848
 [<ffffffff8104ee2c>] ? 0xffffffff8104ee2c
 [<ffffffff8107ac83>] 0xffffffff8107ac83
 [<ffffffff8107adf1>] 0xffffffff8107adf1
 [<ffffffff8107d460>] 0xffffffff8107d460
 [<ffffffff8129d69e>] ? 0xffffffff8129d69e
 [<ffffffff8108a40d>] 0xffffffff8108a40d
 [<ffffffff8108bd67>] 0xffffffff8108bd67
 [<ffffffff810258ff>] 0xffffffff810258ff
 [<ffffffff8140290f>] 0xffffffff8140290f
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 103
CPU    1: hi:  186, btch:  31 usd: 177
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:  35
CPU    1: hi:  186, btch:  31 usd: 155
active_anon:717606 inactive_anon:271926 isolated_anon:0
 active_file:155 inactive_file:217 isolated_file:0
 unevictable:0 dirty:0 writeback:48 unstable:0
 free:6959 slab_reclaimable:1102 slab_unreclaimable:4133
 mapped:156 shmem:0 pagetables:3644 bounce:0
DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15272kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2999 4009 4009
DMA32 free:10020kB min:6052kB low:7564kB high:9076kB active_anon:2377112kB inactive_anon:594248kB active_file:252kB inactive_file:268kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3071904kB mlocked:0kB dirty:0kB writeback:16kB mapped:196kB shmem:0kB slab_reclaimable:1620kB slab_unreclaimable:3980kB kernel_stack:56kB pagetables:3636kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:800 all_unreclaimable? yes
lowmem_reserve[]: 0 0 1010 1010
Normal free:1928kB min:2036kB low:2544kB high:3052kB active_anon:493312kB inactive_anon:493456kB active_file:368kB inactive_file:600kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:176kB mapped:428kB shmem:0kB slab_reclaimable:2788kB slab_unreclaimable:12552kB kernel_stack:1008kB pagetables:10940kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2872 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 3*16kB 3*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
DMA32: 459*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 10020kB
Normal: 482*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1928kB
2990 total pagecache pages
2626 pages in swap cache
Swap cache stats: add 129611, delete 126985, find 334/869
Free swap  = 0kB
Total swap = 498004kB
1048560 pages RAM
34218 pages reserved
1846 pages shared
1006066 pages non-shared
Out of memory: kill process 14350 (xfs_db) score 105765 or a child
Killed process 14350 (xfs_db)

The memory problems are probably unrelated, although I don't know why xfs_check needs that much.

smartctl has this to say about the drive:

# smartctl -a /dev/sdf
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA family
Device Model:     WDC WD5000AAKS-00YGA0
Serial Number:    WD-WCAS80682099
Firmware Version: 12.01C02
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 17 23:17:17 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (13200) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 154) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   226   181   021    Pre-fail  Always       -       3675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28688
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       35
194 Temperature_Celsius     0x0022   112   095   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28541         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SMART doesn't seem to think much is wrong, but obviously something is going on. Unfortunately, I'm not sure what to try next. I'd like to avoid swapping cables or replacing the drive until I'm sure it's necessary, but any suggestions are welcome.

Update

As suggested by @Zoredache, I ran badblocks on the drive.

# badblocks -s /dev/sdf
Checking for bad blocks (read-only test): done

and as far as I can tell, this should print a list of bad blocks, which means it didn't find any…
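By default badblocks prints any bad block numbers to stdout; for a scan worth keeping, the result can also be logged to a file (the output path here is arbitrary):

```shell
# Read-only scan: -s shows progress, -v is verbose,
# -o writes bad block numbers to a file (an empty file means none were found).
badblocks -sv -o /tmp/sdf-badblocks.txt /dev/sdf
```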

by PerfectlyNormal, 2011-05-17 23:33

1 answer


Try disabling NCQ for the problematic drive (reference: this page and this page)

echo 1 > /sys/block/sdX/device/queue_depth
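You can inspect the current queue depth before changing it; setting it to 1 effectively disables NCQ (here `sdf` is substituted for `sdX`, matching the drive in the question, and the change does not survive a reboot):

```shell
# Show the current NCQ queue depth (31 is typical when NCQ is active)
cat /sys/block/sdf/device/queue_depth
# A depth of 1 disables NCQ for this drive until the next reboot
echo 1 > /sys/block/sdf/device/queue_depth
```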

You could also try swapping the SATA cable to the drive, since a weak or borderline electrical connection can cause these kinds of errors as well.

As for your memory problem when running xfs_check: you simply need more RAM and/or swap space. That's a very large filesystem, so I'm not surprised xfs_check needs a lot of memory.
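Since your dmesg output shows `Free swap = 0kB`, a temporary swap file may be enough to get xfs_check through the run without adding RAM. A rough sketch (the file path and size are arbitrary; pick a filesystem with free space):

```shell
# Create and enable a 4 GiB temporary swap file,
# run the check, then remove the swap file afterwards.
dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
xfs_check /dev/storage/series
swapoff /swapfile
rm /swapfile
```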

answered 2011-05-20 18:08