Problem with RAID5 (mdadm) - disk detached


I have these lines in /var/log/syslog:

    Apr 18 16:53:05 Server kernel: [4487878.816036] ata4: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
    Apr 18 16:53:05 Server kernel: [4487878.816058] ata4: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
    Apr 18 16:53:05 Server kernel: [4487878.816059]   dhfis 0x1 dmafis 0x1 sdbfis 0x0
    Apr 18 16:53:05 Server kernel: [4487878.816093] ata4: ATA_REG 0x40 ERR_REG 0x0
    Apr 18 16:53:05 Server kernel: [4487878.816108] ata4: tag : dhfis dmafis sdbfis sacitve
    Apr 18 16:53:05 Server kernel: [4487878.816125] ata4: tag 0x0: 1 1 0 1
    Apr 18 16:53:05 Server kernel: [4487878.816150] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
    Apr 18 16:53:05 Server kernel: [4487878.816178] ata4.00: failed command: WRITE FPDMA QUEUED
    Apr 18 16:53:05 Server kernel: [4487878.816199] ata4.00: cmd 61/08:00:00:88:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
    Apr 18 16:53:05 Server kernel: [4487878.816200]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    Apr 18 16:53:05 Server kernel: [4487878.816253] ata4.00: status: { DRDY }
    Apr 18 16:53:05 Server kernel: [4487878.816272] ata4: hard resetting link
    Apr 18 16:53:05 Server kernel: [4487878.816274] ata4: nv: skipping hardreset on occupied port
    Apr 18 16:53:06 Server kernel: [4487879.676029] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Apr 18 16:53:07 Server kernel: [4487880.416749] ata4.00: n_sectors mismatch 3907029168 != 268435455
    Apr 18 16:53:07 Server kernel: [4487880.416752] ata4.00: revalidation failed (errno=-19)
    Apr 18 16:53:07 Server kernel: [4487880.416773] ata4.00: limiting speed to UDMA/133:PIO2
    Apr 18 16:53:11 Server kernel: [4487884.676024] ata4: hard resetting link
    Apr 18 16:53:11 Server kernel: [4487884.676027] ata4: nv: skipping hardreset on occupied port
    Apr 18 16:53:12 Server kernel: [4487885.144032] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Apr 18 16:53:12 Server kernel: [4487885.240185] ata4.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
    Apr 18 16:53:12 Server kernel: [4487885.240190] ata4.00: revalidation failed (errno=-5)
    Apr 18 16:53:12 Server kernel: [4487885.240210] ata4.00: disabled
    Apr 18 16:53:17 Server kernel: [4487890.144023] ata4: hard resetting link
    Apr 18 16:53:17 Server kernel: [4487891.024033] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Apr 18 16:53:17 Server kernel: [4487891.033357] ata4.00: ATA-8: WDC WD20EARS-00S8B1, 80.00A80, max UDMA/133
    Apr 18 16:53:17 Server kernel: [4487891.033360] ata4.00: 3907029168 sectors, multi 1: LBA48 NCQ (depth 31/32)
    Apr 18 16:53:17 Server kernel: [4487891.048347] ata4.00: configured for UDMA/133
    Apr 18 16:53:17 Server kernel: [4487891.048361] sd 3:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Apr 18 16:53:17 Server kernel: [4487891.048365] sd 3:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
    Apr 18 16:53:17 Server kernel: [4487891.048369] Descriptor sense data with sense descriptors (in hex):
    Apr 18 16:53:17 Server kernel: [4487891.048371]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
    Apr 18 16:53:17 Server kernel: [4487891.048378]         00 00 00 00
    Apr 18 16:53:17 Server kernel: [4487891.048382] sd 3:0:0:0: [sdc] Add. Sense: No additional sense information
    Apr 18 16:53:17 Server kernel: [4487891.048385] sd 3:0:0:0: [sdc] CDB: Write(10): 2a 00 e8 e0 88 00 00 00 08 00
    Apr 18 16:53:17 Server kernel: [4487891.048393] end_request: I/O error, dev sdc, sector 3907028992
    Apr 18 16:53:17 Server kernel: [4487891.048420] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048440] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048458] end_request: I/O error, dev sdc, sector 3907028992
    Apr 18 16:53:17 Server kernel: [4487891.048477] md: super_written gets error=-5, uptodate=0
    Apr 18 16:53:17 Server kernel: [4487891.048482] raid5: Disk failure on sdc, disabling device.
    Apr 18 16:53:17 Server kernel: [4487891.048483] raid5: Operation continuing on 3 devices.
    Apr 18 16:53:17 Server kernel: [4487891.048525] ata4: EH complete
    Apr 18 16:53:17 Server kernel: [4487891.048554] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048576] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048596] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048615] sd 3:0:0:0: [sdc] READ CAPACITY(16) failed
    Apr 18 16:53:17 Server kernel: [4487891.048617] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    Apr 18 16:53:17 Server kernel: [4487891.048620] sd 3:0:0:0: [sdc] Sense not available.
    Apr 18 16:53:17 Server kernel: [4487891.048624] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048643] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048663] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048681] sd 3:0:0:0: [sdc] READ CAPACITY failed
    Apr 18 16:53:17 Server kernel: [4487891.048683] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    Apr 18 16:53:17 Server kernel: [4487891.048685] sd 3:0:0:0: [sdc] Sense not available.
    Apr 18 16:53:17 Server kernel: [4487891.048689] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048709] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048800] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048860] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.049028] sd 3:0:0:0: [sdc] Asking for cache data failed
    Apr 18 16:53:17 Server kernel: [4487891.049048] sd 3:0:0:0: [sdc] Assuming drive cache: write through
    Apr 18 16:53:17 Server kernel: [4487891.049071] sdc: detected capacity change from 2000398934016 to 0
    Apr 18 16:53:17 Server kernel: [4487891.049080] ata4.00: detaching (SCSI 3:0:0:0)
    Apr 18 16:53:18 Server kernel: [4487891.061149] sd 3:0:0:0: [sdc] Stopping disk
    Apr 18 16:53:18 Server kernel: [4487891.485492] RAID5 conf printout:
    Apr 18 16:53:18 Server kernel: [4487891.485496]  --- rd:4 wd:3
    Apr 18 16:53:18 Server kernel: [4487891.485500]  disk 0, o:1, dev:sdb
    Apr 18 16:53:18 Server kernel: [4487891.485502]  disk 1, o:0, dev:sdc
    Apr 18 16:53:18 Server kernel: [4487891.485504]  disk 2, o:1, dev:sdd
    Apr 18 16:53:18 Server kernel: [4487891.485506]  disk 3, o:1, dev:sde
    Apr 18 16:53:18 Server kernel: [4487891.497014] RAID5 conf printout:
    Apr 18 16:53:18 Server kernel: [4487891.497016]  --- rd:4 wd:3
    Apr 18 16:53:18 Server kernel: [4487891.497018]  disk 0, o:1, dev:sdb
    Apr 18 16:53:18 Server kernel: [4487891.497019]  disk 2, o:1, dev:sdd
    Apr 18 16:53:18 Server kernel: [4487891.497021]  disk 3, o:1, dev:sde
    Apr 18 16:53:18 Server kernel: [4487891.838719] scsi 3:0:0:0: Direct-Access     ATA      WDC WD20EARS-00S 80.0 PQ: 0 ANSI: 5
    Apr 18 16:53:18 Server kernel: [4487891.838886] sd 3:0:0:0: Attached scsi generic sg3 type 0
    Apr 18 16:53:18 Server kernel: [4487891.838911] sd 3:0:0:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
    Apr 18 16:53:18 Server kernel: [4487891.838964] sd 3:0:0:0: [sdf] Write Protect is off
    Apr 18 16:53:18 Server kernel: [4487891.838967] sd 3:0:0:0: [sdf] Mode Sense: 00 3a 00 00
    Apr 18 16:53:18 Server kernel: [4487891.838988] sd 3:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    Apr 18 16:53:20 Server kernel: [4487891.839147]  sdf: unknown partition table
    Apr 18 16:53:20 Server kernel: [4487893.130026] sd 3:0:0:0: [sdf] Attached SCSI disk

At this point I can't do anything with /dev/sdc. Is there any way to try to reattach it? I don't want to shut down the server unless it is absolutely necessary.

System:

  • Debian Stable 2.6.32-5-amd64
  • mdadm version 3.1.4-1+8efb9d1

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb[0] sdc[4](F) sde[3] sdd[2]
      5860543488 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

unused devices: <none>
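
The `(F)` flag after `sdc[4]` and the `[U_UU]` map both mark the failed member. As a minimal sketch, assuming the mdstat layout shown above, failed members could be picked out like this (`failed_members` is a hypothetical helper name, not part of mdadm):

```shell
#!/bin/sh
# List array members flagged (F) in an mdstat-format file.
# failed_members is a hypothetical helper; it reads /proc/mdstat
# unless another file is given as the first argument.
failed_members() {
    mdstat_file="${1:-/proc/mdstat}"
    # Array lines look like "md0 : active raid5 sdb[0] sdc[4](F) ...".
    # For each member ending in (F), print the array and the bare device name.
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                dev = $i
                sub(/\[.*/, "", dev)
                print $1, dev
            }
    }' "$mdstat_file"
}
```

Run against the output above, `failed_members` would print `md0 sdc`.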

mdadm --examine --scan

ARRAY /dev/md0 UUID=1a7744b5:912ec7af:f82a9565:e3b453b4
    
posted by poscaman, 18.04.2011 / 16:17

3 answers


Try the following with the /proc filesystem:

link

    
posted 18.04.2011 / 18:14

I'm not sure what you think you will gain by adding a failed disk back into your array. These are not soft errors; the disk is on its way out.

Apr 18 16:53:05 Server kernel: [4487878.816178] ata4.00: failed command: WRITE FPDMA QUEUED
Apr 18 16:53:05 Server kernel: [4487878.816199] ata4.00: cmd 61/08:00:00:88:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
Apr 18 16:53:05 Server kernel: [4487878.816200]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 18 16:53:05 Server kernel: [4487878.816253] ata4.00: status: { DRDY }
Apr 18 16:53:05 Server kernel: [4487878.816272] ata4: hard resetting link
Apr 18 16:53:05 Server kernel: [4487878.816274] ata4: nv: skipping hardreset on occupied port
Apr 18 16:53:06 Server kernel: [4487879.676029] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:07 Server kernel: [4487880.416749] ata4.00: n_sectors mismatch 3907029168 != 268435455
Apr 18 16:53:07 Server kernel: [4487880.416752] ata4.00: revalidation failed (errno=-19)

A write command timed out, the link was reset, and the kernel now sees a sector-count mismatch on the drive.

Apr 18 16:53:12 Server kernel: [4487885.240185] ata4.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Apr 18 16:53:12 Server kernel: [4487885.240190] ata4.00: revalidation failed (errno=-5)
Apr 18 16:53:12 Server kernel: [4487885.240210] ata4.00: disabled

The drive failed to respond to an IDENTIFY command.

Apr 18 16:53:17 Server kernel: [4487891.048615] sd 3:0:0:0: [sdc] READ CAPACITY(16) failed
Apr 18 16:53:17 Server kernel: [4487891.048617] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 18 16:53:17 Server kernel: [4487891.048620] sd 3:0:0:0: [sdc] Sense not available.

The drive did not respond to a READ CAPACITY command.

The fact that the disk came back far enough to present a block device to Linux is a red herring. You should replace it, not spend time trying to get a disk that looks very much like it is failing back into a RAID array. Even if you did get it back in, it would fail again soon, silently corrupt your data, or both.
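
If you do replace it, the md side of the swap could be sketched roughly as follows. This is only an outline under assumptions: the array is `/dev/md0` as in the question, and the name of the new disk is whatever the kernel assigns it (passed in as an argument here). `replacement_plan` is a hypothetical helper that prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Print (do not run) the mdadm steps for swapping out a failed member.
# replacement_plan is a hypothetical helper; review its output before
# running any of the printed commands as root.
replacement_plan() {
    array="$1"      # e.g. /dev/md0
    new_disk="$2"   # device name of the replacement disk (an assumption)
    # Drop members that md has already marked faulty.
    echo "mdadm --manage $array --remove failed"
    # Drop members whose device node is no longer connected.
    echo "mdadm --manage $array --remove detached"
    # Add the replacement; md then rebuilds onto it.
    echo "mdadm --manage $array --add $new_disk"
    # Check rebuild progress.
    echo "cat /proc/mdstat"
}
```

For example, `replacement_plan /dev/md0 /dev/sdX` prints four lines, where `/dev/sdX` stands in for the real name of the new disk.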

Replacing SATA disks does not, technically, require powering down. I understand that your chassis may not have hot-swap bays and may not allow easy access for replacing disks, but you might consider taking this opportunity to install a SATA hot-swap bay adapter. Something like this one from Addonics, for example: it fits in three 5.25" bays and provides five 3.5" hot-swap drive trays. It makes replacing disks much easier.

    
posted 18.04.2011 / 22:02

I had the same problem with a Marvell controller. I disabled NCQ and it has not happened again.

echo 1 > /sys/block/YOUR_DEVICE/device/queue_depth
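
To confirm the setting took effect, the same sysfs file can be read back. A minimal sketch, assuming the usual `/sys/block/<device>/device/queue_depth` layout (`queue_state` is a hypothetical helper; its optional second argument only exists so a different sysfs root can be supplied):

```shell
#!/bin/sh
# Report the current queue depth of a block device.
# queue_state is a hypothetical helper; the sysfs root defaults to /sys.
queue_state() {
    dev="$1"
    sysfs="${2:-/sys}"
    depth=$(cat "$sysfs/block/$dev/device/queue_depth") || return 1
    # A depth of 1 means only one command is outstanding at a time.
    if [ "$depth" -le 1 ]; then
        echo "$dev: queue_depth=$depth (command queuing off)"
    else
        echo "$dev: queue_depth=$depth (command queuing on)"
    fi
}
```

Note that a value written through sysfs does not survive a reboot, so it would need to be reapplied at boot.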
    
posted 28.06.2012 / 17:04