A Debian 6 machine with software RAID1 that we manage (but have no physical access to) is throwing all kinds of errors about its disks (both ATA1 and ATA2).
I have no idea what could be causing this. The disks appear to be fine, and we haven't noticed any lockups or other problems with the sites the server hosts.
I know this is a "could be anything" question, but I'm really hoping someone can help.
Specs:
- Debian 6, running the Xen hypervisor
- Disks: 250 GB WDC WD2500AAKX-00U6AA0
- NCQ supported and enabled: ata2.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32), AA
- SATA controller: Intel Corporation Patsburg 6-Port SATA AHCI Controller (rev 06)
- Kernel: 2.6.32-5-xen-amd64
- RAM: 16 GB
- CPU: Intel(R) Xeon(R) E5-2620 0 @ 2.00GHz
Here are some of the errors:
[2013-05-13 21:36:17] ata1.00: exception Emask 0x10 SAct 0x3 SErr 0x400100 action 0x6 frozen
[2013-05-13 21:36:17] ata1.00: irq_stat 0x08000000, interface fatal error
[2013-05-13 21:36:17] ata1: SError: { UnrecovData Handshk }
[2013-05-13 21:36:17] ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-13 21:36:17] ata1.00: cmd 61/08:00:98:1f:5e/00:00:0d:00:00/40 tag 0 ncq 4096 out
[2013-05-13 21:36:17] res 40/00:0c:58:3a:62/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-13 21:36:17] ata1.00: status: { DRDY }
[2013-05-13 21:36:17] ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-13 21:36:17] ata1.00: cmd 61/08:08:58:3a:62/00:00:0d:00:00/40 tag 1 ncq 4096 out
[2013-05-13 21:36:17] res 40/00:0c:58:3a:62/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-13 21:36:17] ata1.00: status: { DRDY }
[2013-05-13 21:36:17] ata1: hard resetting link
[2013-05-13 21:36:17] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[2013-05-13 21:36:17] ata1.00: configured for UDMA/133
[2013-05-13 21:36:17] ata1: EH complete
and
[2013-05-15 08:58:29] ata1.00: exception Emask 0x10 SAct 0x40f SErr 0x400100 action 0x6 frozen
[2013-05-15 08:58:29] ata1.00: irq_stat 0x08000000, interface fatal error
[2013-05-15 08:58:29] ata1: SError: { UnrecovData Handshk }
[2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29] ata1.00: cmd 61/58:00:48:c4:6b/00:00:0d:00:00/40 tag 0 ncq 45056 out
[2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29] ata1.00: status: { DRDY }
[2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29] ata1.00: cmd 61/10:08:78:c8:6b/01:00:0d:00:00/40 tag 1 ncq 139264 out
[2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29] ata1.00: status: { DRDY }
[2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29] ata1.00: cmd 61/b0:10:c8:ca:6b/00:00:0d:00:00/40 tag 2 ncq 90112 out
[2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29] ata1.00: status: { DRDY }
[2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29] ata1.00: cmd 61/58:18:78:cb:6b/00:00:0d:00:00/40 tag 3 ncq 45056 out
[2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29] ata1.00: status: { DRDY }
[2013-05-15 08:58:29] ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29] ata1.00: cmd 61/b0:50:c8:c7:6b/00:00:0d:00:00/40 tag 10 ncq 90112 out
[2013-05-15 08:58:29] res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29] ata1.00: status: { DRDY }
[2013-05-15 08:58:29] ata1: hard resetting link
[2013-05-15 08:58:29] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[2013-05-15 08:58:29] ata1.00: configured for UDMA/133
[2013-05-15 08:58:29] ata1: EH complete
and
[2013-05-19 01:21:19] ata2.00: exception Emask 0x10 SAct 0x3 SErr 0x400100 action 0x6 frozen
[2013-05-19 01:21:19] ata2.00: irq_stat 0x08000000, interface fatal error
[2013-05-19 01:21:19] ata2: SError: { UnrecovData Handshk }
[2013-05-19 01:21:19] ata2.00: failed command: WRITE FPDMA QUEUED
[2013-05-19 01:21:19] ata2.00: cmd 61/58:00:e8:75:93/00:00:12:00:00/40 tag 0 ncq 45056 out
[2013-05-19 01:21:19] res 40/00:0c:40:76:93/00:00:12:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-19 01:21:19] ata2.00: status: { DRDY }
[2013-05-19 01:21:19] ata2.00: failed command: WRITE FPDMA QUEUED
[2013-05-19 01:21:19] ata2.00: cmd 61/b0:08:40:76:93/00:00:12:00:00/40 tag 1 ncq 90112 out
[2013-05-19 01:21:19] res 40/00:0c:40:76:93/00:00:12:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-19 01:21:19] ata2.00: status: { DRDY }
[2013-05-19 01:21:19] ata2: hard resetting link
[2013-05-19 01:21:19] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[2013-05-19 01:21:19] ata2.00: configured for UDMA/133
[2013-05-19 01:21:19] ata2: EH complete
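For anyone reading these messages, the hex fields can be unpacked by hand. The following is a minimal Python sketch, not an official tool: the function names (`decode_serr`, `decode_sact`, `decode_ncq_cmd`) are made up for illustration, the bit names are my understanding of the labels libata prints for the SATA SError register, and the taskfile decoding assumes 512-byte sectors and the usual NCQ layout (sector count in the feature field, tag in the upper bits of the count field).

```python
# Bit positions of the SATA SError register; the short names are the labels
# libata is assumed to print (this table is an assumption, not from the post).
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value):
    """Return the names of all SError bits set in `value`."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def decode_sact(value):
    """Return the NCQ tags (0..31) that were outstanding according to SAct."""
    return [tag for tag in range(32) if value & (1 << tag)]


def decode_ncq_cmd(taskfile):
    """Decode a libata 'cmd' string such as 61/08:00:98:1f:5e/00:00:0d:00:00/40.

    Assumes an NCQ command and 512-byte logical sectors. A sector count of 0
    (which means 65536 sectors for NCQ) is not handled here.
    """
    command, low, high, _device = taskfile.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    sectors = (hob_feature << 8) | feature
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    return {
        "command": int(command, 16),
        "tag": nsect >> 3,          # NCQ tag lives in bits 7:3 of the count field
        "sectors": sectors,
        "bytes": sectors * 512,     # assumed 512-byte sectors
        "lba": lba,
    }


if __name__ == "__main__":
    print(decode_serr(0x400100))
    print(decode_sact(0x40F))
    print(decode_ncq_cmd("61/10:08:78:c8:6b/01:00:0d:00:00/40"))
```

Run against the values above, this agrees with what the kernel already printed: `SErr 0x400100` has exactly the UnrecovData and Handshk bits set, `SAct 0x40f` corresponds to tags 0, 1, 2, 3 and 10, and the decoded byte counts match the `ncq ... out` sizes in the log.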
SMART reports no errors. Here is the SMART output for sda (sdb is similar):
=== START OF INFORMATION SECTION ===
Device Model: WDC WD2500AAKX-00U6AA0
Serial Number: WD-WCC2H0107714
Firmware Version: 15.01H15
User Capacity: 250,059,350,016 bytes
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 145 145 021 Pre-fail Always - 3750
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 9
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 3430
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 7
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 6
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2
194 Temperature_Celsius 0x0022 110 108 000 Old_age Always - 33
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
And the interface error counters for sda:
# smartctl -l sataphy /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
General Purpose Logging (GPL) feature set supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 165 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 165 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000a 2 49 Device-to-host register FISes sent due to a COMRESET
0x000b 2 79 CRC errors within host-to-device FIS
0x8000 4 12672920 Vendor specific
and for sdb:
# smartctl -l sataphy /dev/sdb
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
General Purpose Logging (GPL) feature set supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 45 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 45 R_ERR response for host-to-device data FIS
0x0005 2 2 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 2 R_ERR response for host-to-device non-data FIS
0x000a 2 46 Device-to-host register FISes sent due to a COMRESET
0x000b 2 22 CRC errors within host-to-device FIS
0x8000 4 12672927 Vendor specific
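To compare the two drives side by side, the counter tables can be parsed and filtered for non-zero entries. This is only a rough sketch: `parse_phy_counters` and `nonzero_ids` are hypothetical helper names, the regular expression assumes the four-column layout shown above, and the sample text is copied from the two outputs in this post.

```python
import re

SDA = """\
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 165 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 165 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x000a 2 49 Device-to-host register FISes sent due to a COMRESET
0x000b 2 79 CRC errors within host-to-device FIS
0x8000 4 12672920 Vendor specific
"""

SDB = """\
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 45 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 45 R_ERR response for host-to-device data FIS
0x0005 2 2 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 2 R_ERR response for host-to-device non-data FIS
0x000a 2 46 Device-to-host register FISes sent due to a COMRESET
0x000b 2 22 CRC errors within host-to-device FIS
0x8000 4 12672927 Vendor specific
"""

# One row: hex ID, size in bytes, decimal value, free-text description.
ROW = re.compile(r"^(0x[0-9a-fA-F]{4})\s+(\d+)\s+(\d+)\s+(.+)$")


def parse_phy_counters(text):
    """Map counter ID -> (value, description); non-matching lines are skipped."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            counters[int(match.group(1), 16)] = (int(match.group(3)), match.group(4))
    return counters


def nonzero_ids(counters):
    """IDs of counters with a non-zero value, ignoring vendor-specific ones (>= 0x8000)."""
    return sorted(cid for cid, (value, _) in counters.items() if value and cid < 0x8000)


if __name__ == "__main__":
    for name, text in (("sda", SDA), ("sdb", SDB)):
        counters = parse_phy_counters(text)
        for cid in nonzero_ids(counters):
            value, description = counters[cid]
            print(f"{name} 0x{cid:04x} {value:6d} {description}")
```

On both drives, every non-zero data-FIS counter is in the host-to-device direction, which is consistent with the failed commands in the kernel log all being writes.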