ATA erros no Debian 6 Xen host, mas os discos estão bem

2

Uma máquina Debian 6 com o software RAID1 que gerenciamos (mas não tem acesso físico) está emitindo todos os tipos de erros sobre os discos (tanto ATA1 quanto ATA2).

Eu não tenho ideia do que isso poderia ser. Os discos parecem estar bem. Nós não notamos trancamentos ou o que for com os sites que o servidor está servindo.

Eu sei que esta é uma pergunta 'poderia ser qualquer coisa', mas eu realmente espero que alguém possa me ajudar.

Especificações:

  1. Debian 6, executando o hipervisor Xen
  2. Discos: 250 GB WDC WD2500AAKX-00U6AA0
  3. NCQ suportado e ativado: setores ata2.00: 488397168, multi 16: LBA48 NCQ (profundidade 31/32), AA
  4. Controlador SATA: controlador SAHIA AHCI de 6 portas da Patsburg da Intel Corporation (rev 06)
  5. Kernel: 2.6.32-5-xen-amd64
  6. Ram: 16 GB
  7. CPU Intel (R) Xeon (R) E5-2620 0 @ 2.00GHz

Aqui estão alguns dos erros:

[2013-05-13 21:36:17]  ata1.00: exception Emask 0x10 SAct 0x3 SErr 0x400100 action 0x6 frozen
[2013-05-13 21:36:17]  ata1.00: irq_stat 0x08000000, interface fatal error
[2013-05-13 21:36:17]  ata1: SError: { UnrecovData Handshk }
[2013-05-13 21:36:17]  ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-13 21:36:17]  ata1.00: cmd 61/08:00:98:1f:5e/00:00:0d:00:00/40 tag 0 ncq 4096 out
[2013-05-13 21:36:17]           res 40/00:0c:58:3a:62/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-13 21:36:17]  ata1.00: status: { DRDY }
[2013-05-13 21:36:17]  ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-13 21:36:17]  ata1.00: cmd 61/08:08:58:3a:62/00:00:0d:00:00/40 tag 1 ncq 4096 out
[2013-05-13 21:36:17]           res 40/00:0c:58:3a:62/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-13 21:36:17]  ata1.00: status: { DRDY }
[2013-05-13 21:36:17]  ata1: hard resetting link
[2013-05-13 21:36:17]  ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[2013-05-13 21:36:17]  ata1.00: configured for UDMA/133
[2013-05-13 21:36:17]  ata1: EH complete

e

[2013-05-15 08:58:29]  ata1.00: exception Emask 0x10 SAct 0x40f SErr 0x400100 action 0x6 frozen
[2013-05-15 08:58:29]  ata1.00: irq_stat 0x08000000, interface fatal error
[2013-05-15 08:58:29]  ata1: SError: { UnrecovData Handshk }
[2013-05-15 08:58:29]  ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29]  ata1.00: cmd 61/58:00:48:c4:6b/00:00:0d:00:00/40 tag 0 ncq 45056 out
[2013-05-15 08:58:29]           res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29]  ata1.00: status: { DRDY }
[2013-05-15 08:58:29]  ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29]  ata1.00: cmd 61/10:08:78:c8:6b/01:00:0d:00:00/40 tag 1 ncq 139264 out
[2013-05-15 08:58:29]           res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29]  ata1.00: status: { DRDY }
[2013-05-15 08:58:29]  ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29]  ata1.00: cmd 61/b0:10:c8:ca:6b/00:00:0d:00:00/40 tag 2 ncq 90112 out
[2013-05-15 08:58:29]           res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29]  ata1.00: status: { DRDY }
[2013-05-15 08:58:29]  ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29]  ata1.00: cmd 61/58:18:78:cb:6b/00:00:0d:00:00/40 tag 3 ncq 45056 out
[2013-05-15 08:58:29]           res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29]  ata1.00: status: { DRDY }
[2013-05-15 08:58:29]  ata1.00: failed command: WRITE FPDMA QUEUED
[2013-05-15 08:58:29]  ata1.00: cmd 61/b0:50:c8:c7:6b/00:00:0d:00:00/40 tag 10 ncq 90112 out
[2013-05-15 08:58:29]           res 40/00:1c:78:cb:6b/00:00:0d:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-15 08:58:29]  ata1.00: status: { DRDY }
[2013-05-15 08:58:29]  ata1: hard resetting link
[2013-05-15 08:58:29]  ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[2013-05-15 08:58:29]  ata1.00: configured for UDMA/133
[2013-05-15 08:58:29]  ata1: EH complete

e

[2013-05-19 01:21:19]  ata2.00: exception Emask 0x10 SAct 0x3 SErr 0x400100 action 0x6 frozen
[2013-05-19 01:21:19]  ata2.00: irq_stat 0x08000000, interface fatal error
[2013-05-19 01:21:19]  ata2: SError: { UnrecovData Handshk }
[2013-05-19 01:21:19]  ata2.00: failed command: WRITE FPDMA QUEUED
[2013-05-19 01:21:19]  ata2.00: cmd 61/58:00:e8:75:93/00:00:12:00:00/40 tag 0 ncq 45056 out
[2013-05-19 01:21:19]           res 40/00:0c:40:76:93/00:00:12:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-19 01:21:19]  ata2.00: status: { DRDY }
[2013-05-19 01:21:19]  ata2.00: failed command: WRITE FPDMA QUEUED
[2013-05-19 01:21:19]  ata2.00: cmd 61/b0:08:40:76:93/00:00:12:00:00/40 tag 1 ncq 90112 out
[2013-05-19 01:21:19]           res 40/00:0c:40:76:93/00:00:12:00:00/40 Emask 0x10 (ATA bus error)
[2013-05-19 01:21:19]  ata2.00: status: { DRDY }
[2013-05-19 01:21:19]  ata2: hard resetting link
[2013-05-19 01:21:19]  ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[2013-05-19 01:21:19]  ata2.00: configured for UDMA/133
[2013-05-19 01:21:19]  ata2: EH complete

O SMART não apresenta erros. Aqui está o smart de SDA (SDB é similar):

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2500AAKX-00U6AA0
Serial Number:    WD-WCC2H0107714
Firmware Version: 15.01H15
User Capacity:    250,059,350,016 bytes

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   145   145   021    Pre-fail  Always       -       3750
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3430
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   110   108   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

E os erros da interface sda:

# smartctl -l sataphy /dev/sda
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

General Purpose Logging (GPL) feature set supported
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2          165  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2          165  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x000a  2           49  Device-to-host register FISes sent due to a COMRESET
0x000b  2           79  CRC errors within host-to-device FIS
0x8000  4     12672920  Vendor specific

e sdb:

# smartctl -l sataphy /dev/sdb
smartctl 5.40 2010-07-12 r3124 [x86_64-unknown-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

General Purpose Logging (GPL) feature set supported
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2           45  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2           45  R_ERR response for host-to-device data FIS
0x0005  2            2  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            2  R_ERR response for host-to-device non-data FIS
0x000a  2           46  Device-to-host register FISes sent due to a COMRESET
0x000b  2           22  CRC errors within host-to-device FIS
0x8000  4     12672927  Vendor specific
    
por Halfgaar 11.07.2013 / 15:58

1 resposta

4

O handshake parece sugerir que o controlador está tendo problemas para falar com os drives. Eu suspeito de interferência elétrica, cabos ruins ou possivelmente um controlador ruim. Neste último caso, a placa-mãe precisaria ser substituída. Você só encontrará o culpado por um processo de eliminação, testando cada peça de hardware separadamente.

    
por 11.07.2013 / 19:59

Tags