Try the following with the /proc filesystem:
I have these lines in /var/log/syslog:
Apr 18 16:53:05 Server kernel: [4487878.816036] ata4: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
Apr 18 16:53:05 Server kernel: [4487878.816058] ata4: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
Apr 18 16:53:05 Server kernel: [4487878.816059] dhfis 0x1 dmafis 0x1 sdbfis 0x0
Apr 18 16:53:05 Server kernel: [4487878.816093] ata4: ATA_REG 0x40 ERR_REG 0x0
Apr 18 16:53:05 Server kernel: [4487878.816108] ata4: tag : dhfis dmafis sdbfis sacitve
Apr 18 16:53:05 Server kernel: [4487878.816125] ata4: tag 0x0: 1 1 0 1
Apr 18 16:53:05 Server kernel: [4487878.816150] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
Apr 18 16:53:05 Server kernel: [4487878.816178] ata4.00: failed command: WRITE FPDMA QUEUED
Apr 18 16:53:05 Server kernel: [4487878.816199] ata4.00: cmd 61/08:00:00:88:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
Apr 18 16:53:05 Server kernel: [4487878.816200] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 18 16:53:05 Server kernel: [4487878.816253] ata4.00: status: { DRDY }
Apr 18 16:53:05 Server kernel: [4487878.816272] ata4: hard resetting link
Apr 18 16:53:05 Server kernel: [4487878.816274] ata4: nv: skipping hardreset on occupied port
Apr 18 16:53:06 Server kernel: [4487879.676029] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:07 Server kernel: [4487880.416749] ata4.00: n_sectors mismatch 3907029168 != 268435455
Apr 18 16:53:07 Server kernel: [4487880.416752] ata4.00: revalidation failed (errno=-19)
Apr 18 16:53:07 Server kernel: [4487880.416773] ata4.00: limiting speed to UDMA/133:PIO2
Apr 18 16:53:11 Server kernel: [4487884.676024] ata4: hard resetting link
Apr 18 16:53:11 Server kernel: [4487884.676027] ata4: nv: skipping hardreset on occupied port
Apr 18 16:53:12 Server kernel: [4487885.144032] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:12 Server kernel: [4487885.240185] ata4.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Apr 18 16:53:12 Server kernel: [4487885.240190] ata4.00: revalidation failed (errno=-5)
Apr 18 16:53:12 Server kernel: [4487885.240210] ata4.00: disabled
Apr 18 16:53:17 Server kernel: [4487890.144023] ata4: hard resetting link
Apr 18 16:53:17 Server kernel: [4487891.024033] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:17 Server kernel: [4487891.033357] ata4.00: ATA-8: WDC WD20EARS-00S8B1, 80.00A80, max UDMA/133
Apr 18 16:53:17 Server kernel: [4487891.033360] ata4.00: 3907029168 sectors, multi 1: LBA48 NCQ (depth 31/32)
Apr 18 16:53:17 Server kernel: [4487891.048347] ata4.00: configured for UDMA/133
Apr 18 16:53:17 Server kernel: [4487891.048361] sd 3:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 18 16:53:17 Server kernel: [4487891.048365] sd 3:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
Apr 18 16:53:17 Server kernel: [4487891.048369] Descriptor sense data with sense descriptors (in hex):
Apr 18 16:53:17 Server kernel: [4487891.048371] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Apr 18 16:53:17 Server kernel: [4487891.048378] 00 00 00 00
Apr 18 16:53:17 Server kernel: [4487891.048382] sd 3:0:0:0: [sdc] Add. Sense: No additional sense information
Apr 18 16:53:17 Server kernel: [4487891.048385] sd 3:0:0:0: [sdc] CDB: Write(10): 2a 00 e8 e0 88 00 00 00 08 00
Apr 18 16:53:17 Server kernel: [4487891.048393] end_request: I/O error, dev sdc, sector 3907028992
Apr 18 16:53:17 Server kernel: [4487891.048420] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048440] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048458] end_request: I/O error, dev sdc, sector 3907028992
Apr 18 16:53:17 Server kernel: [4487891.048477] md: super_written gets error=-5, uptodate=0
Apr 18 16:53:17 Server kernel: [4487891.048482] raid5: Disk failure on sdc, disabling device.
Apr 18 16:53:17 Server kernel: [4487891.048483] raid5: Operation continuing on 3 devices.
Apr 18 16:53:17 Server kernel: [4487891.048525] ata4: EH complete
Apr 18 16:53:17 Server kernel: [4487891.048554] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048576] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048596] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048615] sd 3:0:0:0: [sdc] READ CAPACITY(16) failed
Apr 18 16:53:17 Server kernel: [4487891.048617] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 18 16:53:17 Server kernel: [4487891.048620] sd 3:0:0:0: [sdc] Sense not available.
Apr 18 16:53:17 Server kernel: [4487891.048624] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048643] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048663] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048681] sd 3:0:0:0: [sdc] READ CAPACITY failed
Apr 18 16:53:17 Server kernel: [4487891.048683] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 18 16:53:17 Server kernel: [4487891.048685] sd 3:0:0:0: [sdc] Sense not available.
Apr 18 16:53:17 Server kernel: [4487891.048689] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048709] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048800] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.048860] sd 3:0:0:0: rejecting I/O to offline device
Apr 18 16:53:17 Server kernel: [4487891.049028] sd 3:0:0:0: [sdc] Asking for cache data failed
Apr 18 16:53:17 Server kernel: [4487891.049048] sd 3:0:0:0: [sdc] Assuming drive cache: write through
Apr 18 16:53:17 Server kernel: [4487891.049071] sdc: detected capacity change from 2000398934016 to 0
Apr 18 16:53:17 Server kernel: [4487891.049080] ata4.00: detaching (SCSI 3:0:0:0)
Apr 18 16:53:18 Server kernel: [4487891.061149] sd 3:0:0:0: [sdc] Stopping disk
Apr 18 16:53:18 Server kernel: [4487891.485492] RAID5 conf printout:
Apr 18 16:53:18 Server kernel: [4487891.485496] --- rd:4 wd:3
Apr 18 16:53:18 Server kernel: [4487891.485500] disk 0, o:1, dev:sdb
Apr 18 16:53:18 Server kernel: [4487891.485502] disk 1, o:0, dev:sdc
Apr 18 16:53:18 Server kernel: [4487891.485504] disk 2, o:1, dev:sdd
Apr 18 16:53:18 Server kernel: [4487891.485506] disk 3, o:1, dev:sde
Apr 18 16:53:18 Server kernel: [4487891.497014] RAID5 conf printout:
Apr 18 16:53:18 Server kernel: [4487891.497016] --- rd:4 wd:3
Apr 18 16:53:18 Server kernel: [4487891.497018] disk 0, o:1, dev:sdb
Apr 18 16:53:18 Server kernel: [4487891.497019] disk 2, o:1, dev:sdd
Apr 18 16:53:18 Server kernel: [4487891.497021] disk 3, o:1, dev:sde
Apr 18 16:53:18 Server kernel: [4487891.838719] scsi 3:0:0:0: Direct-Access ATA WDC WD20EARS-00S 80.0 PQ: 0 ANSI: 5
Apr 18 16:53:18 Server kernel: [4487891.838886] sd 3:0:0:0: Attached scsi generic sg3 type 0
Apr 18 16:53:18 Server kernel: [4487891.838911] sd 3:0:0:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Apr 18 16:53:18 Server kernel: [4487891.838964] sd 3:0:0:0: [sdf] Write Protect is off
Apr 18 16:53:18 Server kernel: [4487891.838967] sd 3:0:0:0: [sdf] Mode Sense: 00 3a 00 00
Apr 18 16:53:18 Server kernel: [4487891.838988] sd 3:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Apr 18 16:53:20 Server kernel: [4487891.839147] sdf: unknown partition table
Apr 18 16:53:20 Server kernel: [4487893.130026] sd 3:0:0:0: [sdf] Attached SCSI disk
At this point I can't do anything with /dev/sdc. Is there any way to try to re-attach it? I don't want to shut down the server unless it is absolutely necessary.
System:
cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb[0] sdc[4](F) sde[3] sdd[2]
5860543488 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]
unused devices: <none>
mdadm --examine --scan
ARRAY /dev/md0 UUID=1a7744b5:912ec7af:f82a9565:e3b453b4
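For readers less familiar with the mdstat output above: the `[4/3] [U_UU]` field says three of the four members are up, and the underscore marks the missing slot. A minimal sketch of checking for that condition in a script follows; the function name `degraded_arrays` is made up for illustration, and it assumes the status field appears on a line after the `mdN :` line, as it does here.

```shell
# degraded_arrays [FILE]
# Print the name of every md array whose status field (e.g. [U_UU])
# contains an underscore, i.e. has at least one member missing.
# FILE defaults to /proc/mdstat; the parameter exists so a saved copy can be checked.
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md0 : active raid5 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The status field consists only of U and _ between square brackets.
        match($0, /\[[U_]+\]/) {
            if (index(substr($0, RSTART, RLENGTH), "_") > 0) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Running `degraded_arrays` with no argument reads the live /proc/mdstat.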
I'm not sure what you think you'll gain by adding a failed disk back into your array. These are not soft errors - the disk is on its way out.
Apr 18 16:53:05 Server kernel: [4487878.816178] ata4.00: failed command: WRITE FPDMA QUEUED
Apr 18 16:53:05 Server kernel: [4487878.816199] ata4.00: cmd 61/08:00:00:88:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
Apr 18 16:53:05 Server kernel: [4487878.816200] res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 18 16:53:05 Server kernel: [4487878.816253] ata4.00: status: { DRDY }
Apr 18 16:53:05 Server kernel: [4487878.816272] ata4: hard resetting link
Apr 18 16:53:05 Server kernel: [4487878.816274] ata4: nv: skipping hardreset on occupied port
Apr 18 16:53:06 Server kernel: [4487879.676029] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:07 Server kernel: [4487880.416749] ata4.00: n_sectors mismatch 3907029168 != 268435455
Apr 18 16:53:07 Server kernel: [4487880.416752] ata4.00: revalidation failed (errno=-19)
It failed a write command, the link was reset, and now the kernel sees a sector count mismatch on the drive.
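The mismatched number is worth a second look. 268435455 is exactly 2^28 - 1, the largest sector count a 28-bit LBA field can hold, which suggests the drive answered with a capped size after the reset rather than its real capacity. A quick check of the arithmetic, using only figures from the log (the variable names are mine):

```shell
# Sector counts taken from the "n_sectors mismatch" log line.
expected=3907029168   # what the kernel had on record for the drive
reported=268435455    # what the drive reported after the link reset

# 2^28 - 1: the highest sector count expressible with 28-bit LBA.
lba28_max=$(( (1 << 28) - 1 ))

# 1 if the reported value is exactly the LBA28 ceiling, 0 otherwise.
is_lba28_cap=$(( reported == lba28_max ))

# Real capacity in bytes at 512 bytes per logical sector.
expected_bytes=$(( expected * 512 ))

echo "LBA28 maximum:            $lba28_max"
echo "reported == LBA28 max:    $is_lba28_cap"
echo "expected capacity, bytes: $expected_bytes"
```

The byte figure comes out as 2000398934016, the same value the kernel prints later in "detected capacity change from 2000398934016 to 0".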
Apr 18 16:53:12 Server kernel: [4487885.240185] ata4.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Apr 18 16:53:12 Server kernel: [4487885.240190] ata4.00: revalidation failed (errno=-5)
Apr 18 16:53:12 Server kernel: [4487885.240210] ata4.00: disabled
It failed to respond to an IDENTIFY command.
Apr 18 16:53:17 Server kernel: [4487891.048615] sd 3:0:0:0: [sdc] READ CAPACITY(16) failed
Apr 18 16:53:17 Server kernel: [4487891.048617] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 18 16:53:17 Server kernel: [4487891.048620] sd 3:0:0:0: [sdc] Sense not available.
The drive failed to respond to a READ CAPACITY command.
The fact that the disk came back far enough to present a block device to Linux is a red herring. You should be replacing it, not spending time trying to get a disk that looks very much like it is failing back into a RAID array. Even if you did get it back in, it would fail again soon, silently corrupt your data, or both.
Replacing SATA disks does not technically require powering the system down. I understand that your chassis may not have hot-swap bays and may not give easy access for replacing disks, but you might take this opportunity to install a SATA hot-swap bay adapter. Something like this one from Addonics, for example: it fits in three 5.25" bays and provides five 3.5" hot-swap drive trays. It makes replacing disks much easier.
I had the same problem with a Marvell controller. I disabled NCQ and it hasn't happened again.
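For the md side of the swap, here is a rough sketch of the usual sequence. It only prints the commands so they can be reviewed before anything is run. The function name `print_replace_plan` is hypothetical, /dev/sdX is a placeholder for whatever name the new disk gets, and I am assuming whole-disk members as in the mdstat output from the question.

```shell
# print_replace_plan ARRAY NEW_DISK
# Print (do not run) the mdadm commands for swapping out a failed member.
print_replace_plan() {
    array=$1
    new_disk=$2
    # Drop members md has marked faulty, then members whose device node is gone.
    echo "mdadm --manage $array --remove failed"
    echo "mdadm --manage $array --remove detached"
    # Add the replacement; md starts rebuilding onto it.
    echo "mdadm --manage $array --add $new_disk"
    # Watch the rebuild progress.
    echo "cat /proc/mdstat"
}
```

For example, `print_replace_plan /dev/md0 /dev/sdX` prints the four commands in order. The `failed` and `detached` keywords are handy here because the old /dev/sdc node no longer exists and so cannot be named directly.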
echo 1 > /sys/block/YOUR_DEVICE/device/queue_depth
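A slightly more careful version of the same thing, as a sketch: it checks that the sysfs file is writable and shows the value before and after. The function name and the optional third argument (an alternate sysfs root, only there so it can be tried against a scratch directory) are my own additions, not anything standard.

```shell
# set_queue_depth DEVICE DEPTH [SYSFS_ROOT]
# Write DEPTH to the device's queue_depth file; a depth of 1 effectively turns NCQ off.
set_queue_depth() {
    dev=$1
    depth=$2
    sysfs=${3:-/sys}
    knob="$sysfs/block/$dev/device/queue_depth"

    # Bail out early on a wrong device name or missing privileges.
    if [ ! -w "$knob" ]; then
        echo "cannot write $knob (wrong device name, or not root?)" >&2
        return 1
    fi

    echo "$dev: queue_depth was $(cat "$knob")"
    echo "$depth" > "$knob"
    echo "$dev: queue_depth is now $(cat "$knob")"
}
```

Run as root, for example `set_queue_depth sdc 1`. The setting does not survive a reboot, so it has to be reapplied at boot if it turns out to help.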