I have an LVM volume set up across several hard drives, and one of them seems to be failing, or at least something strange is going on. Whenever the logical volume series sees heavy write activity, the running program (rTorrent most of the time) hangs and dmesg reports:
ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
ata6.00: irq_stat 0x00400000, PHY RDY changed
ata6: SError: { PHYRdyChg LinkSeq TrStaTrns }
ata6.00: failed command: FLUSH CACHE EXT
ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:2c:ff:e3:e3/00:00:39:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
end_request: I/O error, dev sdf, sector 0
ata6: EH complete
I/O error in filesystem ("dm-3") meta-data dev dm-3 block 0x640092a ("xlog_iodone") error 5 buf count 32768
xfs_force_shutdown(dm-3,0x2) called from line 1043 of file fs/xfs/xfs_log.c. Return address = 0xffffffff8119b919
Filesystem "dm-3": Log I/O Error Detected. Shutting down filesystem: dm-3
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(dm-3,0x2) called from line 811 of file fs/xfs/xfs_log.c. Return address = 0xffffffff8119ccfb
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
... and so on
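To check that the flag names on the SError line are just the hex value 0x1810000 spelled out, here is a small Python sketch. The bit-to-name table is written from my reading of the SATA SError register layout and the labels libata prints, so treat it as an assumption rather than an authoritative reference; `decode_serror` is a name I made up for illustration.

```python
# Assumed mapping of SError bit positions to the short names libata prints.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the names of all known bits set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


# Value taken from the "SErr 0x1810000" field in the dmesg output above.
flags = decode_serror(0x1810000)
print(flags)  # ['PHYRdyChg', 'LinkSeq', 'TrStaTrns']
```

All three set bits sit in the upper (diagnostic) half of the register and describe link-level events, which matches the "PHY RDY changed" and "ATA bus error" wording in the log.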
The volume itself:
--- Logical volume ---
LV Name /dev/storage/series
VG Name storage
LV UUID sF6I3A-Ttt5-PEml-BY5i-edOV-43ha-5P75Z3
LV Write Access read/write
LV Status available
# open 1
LV Size 2.86 TiB
Current LE 748800
Segments 29
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:3
I then umount all the LVM volumes and try to run xfs_check on one of them (all the logical volumes use XFS). It says:
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_check. If you are unable to mount the filesystem, then use
the xfs_repair -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.
So I go ahead and mount it, which works fine, then umount it again so I can run the check.
That runs for a while, until it is killed for using too much memory.
# xfs_check /dev/storage/series
/usr/sbin/xfs_check: line 31: 14350 Killed
xfs_db$DBOPTS -F -i -p xfs_check -c "check$OPTS" $1
dmesg then reports:
xfs_db invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
xfs_db cpuset=/ mems_allowed=0
Pid: 14350, comm: xfs_db Tainted: P 2.6.32-gentoo-r7 #1
Call Trace:
[<ffffffff81067aec>] ? 0xffffffff81067aec
[<ffffffff8107a848>] 0xffffffff8107a848
[<ffffffff8104ee2c>] ? 0xffffffff8104ee2c
[<ffffffff8107ac83>] 0xffffffff8107ac83
[<ffffffff8107adf1>] 0xffffffff8107adf1
[<ffffffff8107d460>] 0xffffffff8107d460
[<ffffffff8129d69e>] ? 0xffffffff8129d69e
[<ffffffff8108a40d>] 0xffffffff8108a40d
[<ffffffff8108bd67>] 0xffffffff8108bd67
[<ffffffff810258ff>] 0xffffffff810258ff
[<ffffffff8140290f>] 0xffffffff8140290f
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 103
CPU 1: hi: 186, btch: 31 usd: 177
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 35
CPU 1: hi: 186, btch: 31 usd: 155
active_anon:717606 inactive_anon:271926 isolated_anon:0
active_file:155 inactive_file:217 isolated_file:0
unevictable:0 dirty:0 writeback:48 unstable:0
free:6959 slab_reclaimable:1102 slab_unreclaimable:4133
mapped:156 shmem:0 pagetables:3644 bounce:0
DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15272kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2999 4009 4009
DMA32 free:10020kB min:6052kB low:7564kB high:9076kB active_anon:2377112kB inactive_anon:594248kB active_file:252kB inactive_file:268kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3071904kB mlocked:0kB dirty:0kB writeback:16kB mapped:196kB shmem:0kB slab_reclaimable:1620kB slab_unreclaimable:3980kB kernel_stack:56kB pagetables:3636kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:800 all_unreclaimable? yes
lowmem_reserve[]: 0 0 1010 1010
Normal free:1928kB min:2036kB low:2544kB high:3052kB active_anon:493312kB inactive_anon:493456kB active_file:368kB inactive_file:600kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:176kB mapped:428kB shmem:0kB slab_reclaimable:2788kB slab_unreclaimable:12552kB kernel_stack:1008kB pagetables:10940kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2872 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 3*16kB 3*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
DMA32: 459*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 10020kB
Normal: 482*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1928kB
2990 total pagecache pages
2626 pages in swap cache
Swap cache stats: add 129611, delete 126985, find 334/869
Free swap = 0kB
Total swap = 498004kB
1048560 pages RAM
34218 pages reserved
1846 pages shared
1006066 pages non-shared
Out of memory: kill process 14350 (xfs_db) score 105765 or a child
Killed process 14350 (xfs_db)
The memory problems are probably unrelated, although I don't know why xfs_check needs that much.
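For scale, the page counts in the OOM report can be turned into sizes with a few lines of Python. This is only back-of-the-envelope arithmetic on the numbers above, assuming 4 KiB pages (the "*4kB" entries in the free lists suggest that); the anonymous-memory figure is system-wide, so attributing most of it to xfs_db is my guess.

```python
PAGE_KIB = 4  # assumption: 4 KiB pages

# Figures copied from the OOM report above.
ram_pages = 1048560        # "1048560 pages RAM"
active_anon = 717606       # "active_anon:717606"
inactive_anon = 271926     # "inactive_anon:271926"
swap_total_kib = 498004    # "Total swap = 498004kB"
swap_free_kib = 0          # "Free swap = 0kB"

ram_mib = ram_pages * PAGE_KIB / 1024
anon_mib = (active_anon + inactive_anon) * PAGE_KIB / 1024
swap_used_mib = (swap_total_kib - swap_free_kib) / 1024

print(f"RAM:              {ram_mib:.0f} MiB")
print(f"Anonymous memory: {anon_mib:.0f} MiB")
print(f"Swap in use:      {swap_used_mib:.0f} MiB")
```

So roughly 3.8 GiB of the machine's 4 GiB was anonymous memory and the swap was completely full when the kernel killed xfs_db.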
smartctl has this to say about the drive:
# smartctl -a /dev/sdf
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Blue Serial ATA family
Device Model: WDC WD5000AAKS-00YGA0
Serial Number: WD-WCAS80682099
Firmware Version: 12.01C02
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue May 17 23:17:17 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (13200) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 154) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 226 181 021 Pre-fail Always - 3675
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 33
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000e 200 200 051 Old_age Always - 0
9 Power_On_Hours 0x0032 061 061 000 Old_age Always - 28688
10 Spin_Retry_Count 0x0012 100 253 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0012 100 253 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 32
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 35
194 Temperature_Celsius 0x0022 112 095 000 Old_age Always - 38
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
200 Multi_Zone_Error_Rate 0x0008 200 200 051 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 28541 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SMART seems to think not much is wrong, but something is obviously going on. Unfortunately, I'm not sure what to try next. I'd like to avoid swapping cables or replacing the drive until I'm sure it's necessary, but any suggestions are welcome.
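To avoid eyeballing the attribute table, here is a rough Python sketch that scans smartctl-style attribute rows and lists the ones with a non-zero raw value. The set of IDs to watch is my own pick, not something smartctl defines, and `nonzero_watched` is a hypothetical helper name; the sample rows are copied from the output above.

```python
# Assumption: these are the attributes worth watching; the choice is mine.
WATCHED_IDS = {5, 196, 197, 198, 199}

SAMPLE = """\
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
"""


def nonzero_watched(table):
    """Return (name, raw_value) for watched attributes whose raw value is not zero."""
    hits = []
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) != 0:
            hits.append((name, int(raw)))
    return hits


hits = nonzero_watched(SAMPLE)
print(hits)  # [('UDMA_CRC_Error_Count', 1)]
```

The only non-zero entry is UDMA_CRC_Error_Count, which is often read as pointing at the link (cable or port) rather than the platters, though a count of 1 is hardly conclusive.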
Update
As suggested by @Zoredache, I ran badblocks on the drive.
# badblocks -s /dev/sdf
Checking for bad blocks (read-only test): done
As far as I understand, this would have printed a list of bad blocks, which means it found none…