I have a pair of identically configured HP DL320e servers, each with two 6TB WD Red drives in a software RAID 1 array. The DL320e has an onboard RAID controller, which is disabled in favour of Linux software RAID.
Both machines appear to work well and the RAID arrays look normal, except that every time raid-check runs (1am Sunday from the default weekly crontab, but also if I run raid-check manually) one drive goes offline. After that, the device files for the "failed" drive are removed (e.g. /dev/sda2), but after a cold reboot they reappear, and the "failed" drive can be re-added to the array and appears to work normally.
This has been happening since the (new) machines and disks were installed a few months ago. According to smartctl none of the drives has any bad sectors. Nevertheless, based on some posts elsewhere, I tried using hdparm to write over the sectors identified in /var/log/messages in order to force the drive to detect and reallocate any bad sectors, to no effect.
I also tried writing zeros to the whole of /dev/sdb2 and /dev/sdb3 using dd. This completed without errors and did not cause any bad sectors to be reallocated, which seems to indicate that the whole surface of the drive can be written successfully.
I have run all the SMART diagnostics using smartctl and everything came back OK.
Since all the machines were installed from new, the failure has occurred on both systems, and at least 3 of the 4 drives have failed (both drives in one machine have failed at different times), I am not inclined to believe that this is caused by faulty hardware. The fact that a dd from /dev/zero to the whole of a "failed" drive completed shows that the drive is writable across its entire surface.
The drives are configured with 3 partitions: biosboot, /boot and root + /home.
The logs from both servers are more or less identical, although they report different sector numbers, and the sector numbers reported from week to week on the same server also differ.
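To compare the sector numbers between runs, the `cmd` field of the libata "failed command" lines quoted further down can be decoded into an LBA. The following is only a minimal sketch in Python, assuming the usual libata field order (command/feature:nsect:lbal:lbam:lbah/hob fields/device) and that for NCQ commands the sector count sits in the feature field and the tag in the upper bits of nsect. The helper name `decode_ncq_cmd` is my own and not part of any tool.

```python
import re

# Taskfile layout as printed by libata (assumed field order):
#   cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/HOB_FEAT:HOB_NSECT:HOB_LBAL:HOB_LBAM:HOB_LBAH/DEV
CMD_RE = re.compile(
    r"cmd ([0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/((?:[0-9a-f]{2}:){4}[0-9a-f]{2})"
    r"/([0-9a-f]{2})"
)


def decode_ncq_cmd(line):
    """Decode the 'cmd' field of a libata error line (hypothetical helper).

    Returns a dict with the command byte, 48-bit LBA, sector count and NCQ
    tag, or None if the line contains no 'cmd' field.
    """
    m = CMD_RE.search(line)
    if m is None:
        return None
    command = int(m.group(1), 16)
    feat, nsect, lbal, lbam, lbah = (int(x, 16) for x in m.group(2).split(":"))
    hob_feat, _hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in m.group(3).split(":")
    )
    # 48-bit LBA: the three "hob" bytes are the high half, the rest the low half.
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    # NCQ (0x60 = READ FPDMA QUEUED): the count is carried in the feature
    # field and the tag in bits 7:3 of the sector count field.
    return {
        "command": command,
        "lba": lba,
        "sectors": (hob_feat << 8) | feat,
        "tag": nsect >> 3,
    }


if __name__ == "__main__":
    sample = ("Jun 7 01:09:53 1000 kernel: ata1.00: "
              "cmd 60/80:00:80:f6:dd/00:00:08:00:00/40 tag 0 ncq 65536 in")
    print(decode_ncq_cmd(sample))
```

For the 01:09:53 tag 0 line this gives LBA 148764288 with a count of 128 sectors (65536 bytes), which matches the sector in the first `end_request: I/O error` line and the `ncq 65536` size in the log, so the decoding at least agrees with what the kernel itself reports.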
/proc/mdstat reports:
sh-4.2# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sda3[0] sdb3[1]
5859876672 blocks super 1.2 [2/2] [UU]
bitmap: 2/44 pages [8KB], 65536KB chunk
md127 : active raid1 sda2[2] sdb2[3]
511936 blocks super 1.0 [2/2] [UU]
unused devices: <none>
sh-4.2#
Time passes until 1am on Sunday, and then:
WARNING: Your hard drive is failing
Device: /dev/sda [SAT], unable to open device
sh-4.2# cat /proc/mdstat
Personalities : [raid1]
md126 : active raid1 sda3[0](F) sdb3[1]
5859876672 blocks super 1.2 [2/1] [_U]
bitmap: 5/44 pages [20KB], 65536KB chunk
md127 : active raid1 sda2[2](F) sdb2[3]
511936 blocks super 1.0 [2/1] [_U]
unused devices: <none>
sh-4.2#
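Since the failure only shows up after the weekly check, the degraded state above could be picked up by a small script rather than by reading /proc/mdstat by hand. This is a rough Python sketch that only handles the simple layout shown in the output above; the function name `degraded_arrays` and the embedded sample text are illustrative assumptions, and other mdstat variants (e.g. arrays in auto-read-only state) are not handled.

```python
import re

# "md126 : active raid1 sda3[0](F) sdb3[1]"
ARRAY_RE = re.compile(r"^(md\d+) : (\w+) (\w+) (.+)$")
# "... [2/1] [_U]"  ->  configured devices / active devices / per-slot state
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")

# Sample copied from the degraded output above (illustration only).
SAMPLE_DEGRADED = """\
Personalities : [raid1]
md126 : active raid1 sda3[0](F) sdb3[1]
5859876672 blocks super 1.2 [2/1] [_U]
bitmap: 5/44 pages [20KB], 65536KB chunk
md127 : active raid1 sda2[2](F) sdb2[3]
511936 blocks super 1.0 [2/1] [_U]
unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Map each degraded array name to its members flagged (F)."""
    result = {}
    current, members = None, []
    for raw in mdstat_text.splitlines():
        line = raw.strip()
        m = ARRAY_RE.match(line)
        if m:
            # Remember the array and its member list until the status line.
            current, members = m.group(1), m.group(4).split()
            continue
        s = STATUS_RE.search(line)
        if s and current:
            configured, active = int(s.group(1)), int(s.group(2))
            if active < configured:
                # Strip the "[slot](F)" suffix to leave the bare device name.
                result[current] = [
                    re.sub(r"\[\d+\]\(F\)$", "", dev)
                    for dev in members
                    if dev.endswith("(F)")
                ]
            current = None
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_DEGRADED))
```

On a live system the same function could be fed the contents of /proc/mdstat instead of the sample string.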
/var/log/messages reports:
Jun 7 01:00:01 1000 kernel: md: data-check of RAID array md126
Jun 7 01:00:01 1000 kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Jun 7 01:00:01 1000 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Jun 7 01:00:01 1000 kernel: md: using 128k window, over a total of 5859876672k.
Jun 7 01:00:07 1000 kernel: md: delaying data-check of md127 until md126 has finished (they share one or more physical units)
Jun 7 01:01:01 1000 systemd: Starting Session 1544 of user root.
Jun 7 01:01:01 1000 systemd: Started Session 1544 of user root.
Jun 7 01:03:43 1000 kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
Jun 7 01:03:43 1000 kernel: ata1: SError: { CommWake }
Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:00:80:1b:70/00:00:03:00:00/40 tag 0 ncq 65536 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:03:43 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:08:00:1c:70/00:00:03:00:00/40 tag 1 ncq 65536 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:03:43 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:03:43 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:03:43 1000 kernel: ata1.00: cmd 60/80:10:00:0d:70/00:00:03:00:00/40 tag 2 ncq 65536 in
res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
repeated with increasing tag values up to 30, followed by
Jun 7 01:07:10 1000 kernel: ata1.00: cmd 60/80:f0:00:cd:7f/00:00:06:00:00/40 tag 30 ncq 65536 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:07:10 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:07:10 1000 kernel: ata1: hard resetting link
Jun 7 01:07:11 1000 kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jun 7 01:07:11 1000 kernel: ata1.00: configured for UDMA/133
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:07:11 1000 kernel: ata1: EH complete
Jun 7 01:09:53 1000 kernel: ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x40000 action 0x6 frozen
Jun 7 01:09:53 1000 kernel: ata1: SError: { CommWake }
Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:00:80:f6:dd/00:00:08:00:00/40 tag 0 ncq 65536 in
res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:08:00:f7:dd/00:00:08:00:00/40 tag 1 ncq 65536 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
More repetitions up to tag 30, and then
Jun 7 01:09:53 1000 kernel: ata1.00: failed command: READ FPDMA QUEUED
Jun 7 01:09:53 1000 kernel: ata1.00: cmd 60/80:f0:00:f6:dd/00:00:08:00:00/40 tag 30 ncq 65536 in
res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Jun 7 01:09:53 1000 kernel: ata1.00: status: { DRDY }
Jun 7 01:09:53 1000 kernel: ata1: hard resetting link
Jun 7 01:09:59 1000 kernel: ata1: link is slow to respond, please be patient (ready=0)
Jun 7 01:10:01 1000 systemd: Starting Session 1545 of user root.
Jun 7 01:10:01 1000 systemd: Started Session 1545 of user root.
Jun 7 01:10:03 1000 kernel: ata1: COMRESET failed (errno=-16)
Jun 7 01:10:03 1000 kernel: ata1: hard resetting link
Jun 7 01:10:04 1000 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Jun 7 01:10:09 1000 kernel: ata1: hard resetting link
Jun 7 01:10:09 1000 kernel: ata1: SATA link down (SStatus 0 SControl 300)
Jun 7 01:10:09 1000 kernel: ata1: limiting SATA link speed to 1.5 Gbps
Jun 7 01:10:14 1000 kernel: ata1: hard resetting link
Jun 7 01:10:14 1000 kernel: ata1: SATA link down (SStatus 0 SControl 310)
Jun 7 01:10:14 1000 kernel: ata1.00: disabled
Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
One more block of the same
Jun 7 01:10:14 1000 kernel: ata1.00: device reported invalid CHS sector 0
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun 7 01:10:14 1000 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun 7 01:10:14 1000 kernel: Sense Key : Aborted Command [current] [descriptor]
Jun 7 01:10:14 1000 kernel: Descriptor sense data with sense descriptors (in hex):
Jun 7 01:10:14 1000 kernel: 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Jun 7 01:10:14 1000 kernel: 00 00 00 00
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun 7 01:10:14 1000 kernel: Add. Sense: No additional sense information
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] CDB:
Jun 7 01:10:14 1000 kernel: Read(16): 88 00 00 00 00 00 08 dd f6 80 00 00 00 80 00 00
Jun 7 01:10:14 1000 kernel: end_request: I/O error, dev sda, sector 148764288
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: rejecting I/O to offline device
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] killing request
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
and finally
Jun 7 01:10:14 1000 kernel: Read(16): 88 00 00 00 00 00 08 dd fb 00 00 00 00 80 00 00
Jun 7 01:10:14 1000 kernel: end_request: I/O error, dev sda, sector 148765440
Jun 7 01:10:14 1000 kernel: ata1: EH complete
Jun 7 01:10:14 1000 kernel: md: super_written gets error=-5, uptodate=0
Jun 7 01:10:14 1000 kernel: md/raid1:md126: Disk failure on sda3, disabling device.
md/raid1:md126: Operation continuing on 1 devices.
Jun 7 01:10:14 1000 kernel: ata1.00: detaching (SCSI 0:0:0:0)
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] Stopping disk
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda] START_STOP FAILED
Jun 7 01:10:14 1000 kernel: sd 0:0:0:0: [sda]
Jun 7 01:10:14 1000 kernel: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jun 7 01:10:14 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun 7 01:10:14 1000 kernel: md: md126: data-check interrupted.
Jun 7 01:10:14 1000 kernel: md: super_written gets error=-19, uptodate=0
Jun 7 01:10:14 1000 kernel: md/raid1:md127: Disk failure on sda2, disabling device.
md/raid1:md127: Operation continuing on 1 devices.
Jun 7 01:10:15 1000 kernel: md: md127 still in use.
Jun 7 01:10:15 1000 kernel: md: md126 still in use.
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md127/md/dev-sda2/block symlink
Jun 7 01:10:15 1000 udisksd[3364]: Unable to resolve /sys/devices/virtual/block/md126/md/dev-sda3/block symlink
Jun 7 01:20:01 1000 systemd: Created slice user-0.slice.
Jun 7 01:20:01 1000 systemd: Starting Session 1546 of user root.
Jun 7 01:20:01 1000 systemd: Started Session 1546 of user root.
Jun 7 01:30:01 1000 systemd: Created slice user-0.slice.
Jun 7 01:30:01 1000 systemd: Starting Session 1547 of user root.
Jun 7 01:30:01 1000 systemd: Started Session 1547 of user root.
Jun 7 01:36:58 1000 smartd[977]: Device: /dev/sda [SAT], open() failed: No such device
Jun 7 01:36:58 1000 smartd[977]: Sending warning via /usr/libexec/smartmontools/smartdnotify to root ...
Jun 7 01:36:58 1000 smartd[977]: Warning via /usr/libexec/smartmontools/smartdnotify to root produced unexpected output (80 bytes) to STDOUT/STDERR:
Jun 7 01:36:58 1000 smartd[977]: /usr/libexec/smartmontools/smartdnotify: line 13: /dev/pts/0: Permission denied
Jun 7 01:36:58 1000 smartd[977]: Warning via /usr/libexec/smartmontools/smartdnotify to root: successful
If anyone can suggest what might be going wrong here, I would be very grateful.