O que causa o tempo limite do comando SCSI em 30 segundos?

3

Descrição do sistema:
CentOS 7.
Placa SuperMicro X9DRW-iF.
BIOS mais recente (e ROM de opção SCU), usando a ROM de opção SCU.
Quatro HDDs de 300 GB da WD conectados ao SCU da placa. Seis HDDs conectados às seis portas SATA individuais. Software normal sendo executado para nossos negócios.

Eu configurei este sistema e duas vezes em duas semanas, descobri que todas as quatro unidades conectadas ao SCU estavam "desaparecidas", o que significa que os pontos de montagem do ls -l'ing mostravam

ls: cannot access disk1: Input/output error
ls: cannot access disk2: Input/output error
ls: cannot access disk3: Input/output error
ls: cannot access disk4: Input/output error

Eu vi do dmesg que isci estava abandonando comandos, tentando e não restaurando dispositivos, etc. (a saída do dmesg está na parte inferior).

De Controle do RedHat a página Temporizador de Comando SCSI e Status do Dispositivo , acho que esta é a sequência de eventos de erro "esperada" quando um comando SCSI expira e quando o dispositivo não pode ser redefinido. No entanto, não tenho certeza se é o barramento ou o host que está sendo redefinido. Mas afeta todas as quatro unidades no SCU.

Eu leio essa pergunta ServerFault que eu acho identifica esse problema como" redefinição de loop ", mas não consegui encontrar mais informações sobre redefinições de loop.

Verifiquei que os tempos limites do comando SCSI para essas quatro unidades são 30 segundos, o padrão para RHEL / CentOS 7. (cat / sys / block / sda / dispositivo / tempo limite) Reduzir o tempo limite para 1 segundo ou 0 segundos pode rapidamente fazer com que a uma unidade seja desabilitada, mas raramente, ou nunca consegui fazer com que uma unidade falhasse para desativar todas as quatro unidades na SCU.

Então, aqui estão algumas das minhas perguntas. O objetivo final é configurar as unidades para que esse loop-reset ou "se livrar de todas as unidades SCU" não aconteça.

  1. O que faz com que um comando SCSI demore mais de 30 segundos? Isso parece muito longo.
  2. O que faria com que todas as quatro unidades da SCU fossem afetadas? Isso implica que a reinicialização do dispositivo falhou e que o manipulador de erros foi ligado para redefinir o barramento / host, de acordo com o artigo da Red Hat. O que faria com que a reinicialização da unidade falhasse?
  3. Que tipos de coisas podem fazer isso acontecer raramente, na ordem de uma vez por semana, como eu observei?
  4. Os tempos limite dos comandos SCSI são aceitáveis ou devem ser evitados aumentando-se o tempo limite? De máxima importância para mim é manter as unidades on-line; esperar um pouco por uma resposta da unidade não é um grande problema.
  5. Eu diagnostiquei corretamente isso como uma resposta "esperada" para um tempo limite de comando SCSI? (isto é, um tempo limite de comando SCSI?)
  6. O que tornaria as unidades na SCU mais ou menos responsivas do que as unidades conectadas às portas SATA individuais? Eu não tive problemas com eles.
  7. O enfileiramento de comandos tem algo a ver com isso? A profundidade da fila é 31 para todas as quatro unidades.
  8. Quais são os valores recomendados para tempos limite de comando SCSI para unidades em uma SCU?
  9. No final do dia, como posso reconfigurar para evitar esse problema?

Obrigado por toda e qualquer ajuda. A saída do Dmesg (editada por brevidade) da primeira falha é copiada abaixo. Observe que as unidades são sda, b, c, d e ata7,8,9,10.

17:40:41: isci 0000:08:00.0: isci_task_abort_task: dev = ffff881fce964490 (STP/SATA), task = ffff881fcfc95900, old_request == ffff881fcea13000
17:40:51: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout single
17:41:01: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964490
17:41:01: isci 0000:08:00.0: isci_task_abort_task: SATA/STP request or complete_in_target (0), or IDEV_GONE (0), thus no TMF
17:41:01: isci 0000:08:00.0: isci_task_abort_task: Done; dev = ffff881fce964490, task = ffff881fcfc95900 , old_request == ffff881fcea13000
17:41:01: isci 0000:08:00.0: isci_task_abort_task: dev = ffff881fce964310 (STP/SATA), task = ffff881fd0ce97c0, old_request == ffff881fcf125000
17:41:11: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout single
17:41:21: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964310
17:41:21: isci 0000:08:00.0: isci_task_abort_task: SATA/STP request or complete_in_target (0), or IDEV_GONE (0), thus no TMF
17:41:21: isci 0000:08:00.0: isci_task_abort_task: Done; dev = ffff881fce964310, task = ffff881fd0ce97c0 , old_request == ffff881fcf125000
17:43:50: INFO: task app:20227 blocked for more than 120 seconds.
17:43:50: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:43:50: app           D ffff881fffcd3680     0 20227  13315 0x00000080
17:43:50: ffff883fcb15f8e0 0000000000000082 ffff883fcb15ffd8 0000000000013680
17:43:50: ffff883fcb15ffd8 0000000000013680 ffff881fd1434fa0 ffff883fc03d13b0
17:43:50: 7fffffffffffffff ffff881fd1434fa0 0000000000000002 0000000000000000
17:43:50: Call Trace:
17:43:50: [<ffffffff81609e39>] schedule+0x29/0x70
17:43:50: [<ffffffff81607d79>] schedule_timeout+0x209/0x2d0
17:43:50: [<ffffffff81609312>] __down_common+0xd2/0x14a
17:43:50: [<ffffffffa02292dd>] ? _xfs_buf_find+0x16d/0x2c0 [xfs]
17:43:50: [<ffffffff816093a7>] __down+0x1d/0x1f
17:43:50: [<ffffffff8109d311>] down+0x41/0x50
17:43:50: [<ffffffffa02290dc>] xfs_buf_lock+0x3c/0xd0 [xfs]
17:43:50: [<ffffffffa02292dd>] _xfs_buf_find+0x16d/0x2c0 [xfs]
17:43:50: [<ffffffffa022945a>] xfs_buf_get_map+0x2a/0x180 [xfs]
17:43:50: [<ffffffffa0229eec>] xfs_buf_read_map+0x2c/0x140 [xfs]
17:43:50: [<ffffffffa028d809>] xfs_trans_read_buf_map+0x2d9/0x4a0 [xfs]
17:43:50: [<ffffffffa02723cd>] xfs_read_agi+0x9d/0x110 [xfs]
17:43:50: [<ffffffffa0272474>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
17:43:50: [<ffffffffa0273138>] xfs_dialloc+0xe8/0x270 [xfs]
17:43:50: [<ffffffffa02757c1>] xfs_ialloc+0x71/0x6a0 [xfs]
17:43:50: [<ffffffffa0242b77>] ? kmem_zone_alloc+0x77/0x100 [xfs]
17:43:50: [<ffffffffa0275e6a>] xfs_dir_ialloc+0x7a/0x280 [xfs]
17:43:50: [<ffffffff81609222>] ? down_write+0x12/0x30
17:43:50: [<ffffffffa027656a>] xfs_create+0x48a/0x680 [xfs]
17:43:50: [<ffffffffa0237abb>] xfs_vn_mknod+0xbb/0x1e0 [xfs]
17:43:50: [<ffffffffa0237bf6>] xfs_vn_mkdir+0x16/0x20 [xfs]
17:43:50: [<ffffffff811d26c7>] vfs_mkdir+0xb7/0x160
17:43:50: [<ffffffff811d853f>] SyS_mkdirat+0x6f/0xe0
17:43:50: [<ffffffff811d85c9>] SyS_mkdir+0x19/0x20
17:43:50: [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b
17:43:50: INFO: task app:20228 blocked for more than 120 seconds.
17:43:50: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
17:43:50: app           D ffff881fffc93680     0 20228  13315 0x00000080
17:43:50: ffff883fcef7b9e8 0000000000000082 ffff883fcef7bfd8 0000000000013680
17:43:50: ffff883fcef7bfd8 0000000000013680 ffff883fcc1d0b60 ffff881fffc93f48
17:43:50: ffff881fd07ea800 ffff883fcc1d0b60 0000000000000000 0000000000000000
17:43:50: Call Trace:
17:43:50: [<ffffffff8160a13d>] io_schedule+0x9d/0x140
17:43:50: [<ffffffff81204593>] do_blockdev_direct_IO+0xc03/0x2620
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffff81206005>] __blockdev_direct_IO+0x55/0x60
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffffa0222780>] ? xfs_finish_ioend_sync+0x30/0x30 [xfs]
17:43:50: [<ffffffffa0221b2a>] xfs_vm_direct_IO+0xda/0x180 [xfs]
17:43:50: [<ffffffffa0222e60>] ? xfs_get_blocks+0x20/0x20 [xfs]
17:43:50: [<ffffffffa0222780>] ? xfs_finish_ioend_sync+0x30/0x30 [xfs]
17:43:50: [<ffffffff8115872d>] generic_file_direct_write+0xcd/0x190
17:43:50: [<ffffffffa0299636>] xfs_file_dio_aio_write+0x215/0x254 [xfs]
17:43:50: [<ffffffffa022ee0d>] xfs_file_aio_write+0x13d/0x150 [xfs]
17:43:50: [<ffffffff811c64cd>] do_sync_write+0x8d/0xd0
17:43:50: [<ffffffff811c6c6d>] vfs_write+0xbd/0x1e0
17:43:50: [<ffffffff811c76b8>] SyS_write+0x58/0xb0
17:43:50: [<ffffffff81614a29>] system_call_fastpath+0x16/0x1b

... <Many more sequences like the above.>...

17:54:21: ata7.00: exception Emask 0x0 SAct 0x7e0000ff SErr 0x0 action 0x6 frozen
17:54:21: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
17:54:21: ata8.00: failed command: FLUSH CACHE EXT
17:54:21: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 30
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata8.00: status: { DRDY }
17:54:21: ata8: hard resetting link
17:54:21: ata9.00: exception Emask 0x0 SAct 0x1fff SErr 0x0 action 0x6 frozen
17:54:21: ata10.00: exception Emask 0x0 SAct 0x1fff SErr 0x0 action 0x6 frozen
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/50:00:40:08:00/00:00:00:00:00/40 tag 0 ncq 40960 out
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/50:00:40:08:00/00:00:00:00:00/40 tag 0 ncq 40960 out
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/18:00:08:08:00/00:00:00:00:00/40 tag 1 ncq 12288 out
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/18:00:08:08:00/00:00:00:00:00/40 tag 1 ncq 12288 out
     res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata9.00: cmd 61/02:00:01:08:00/00:00:00:00:00/40 tag 2 ncq 1024 out
     res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata10.00: cmd 61/02:00:01:08:00/00:00:00:00:00/40 tag 2 ncq 1024 out
     res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
17:54:21: ata9.00: status: { DRDY }
17:54:21: ata10.00: status: { DRDY }
17:54:21: ata9.00: failed command: WRITE FPDMA QUEUED
17:54:21: ata10.00: failed command: WRITE FPDMA QUEUED
17:54:22: ata7.00: status: { DRDY }
17:54:22: ata7: hard resetting link
17:54:24: perf interrupt took too long (2510 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:31: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:32: isci 0000:08:00.0: isci_remote_device_terminate_requests host0 timeout all
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce962ab8; hard reset failed (0x21)
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce962bb0; hard reset failed (0x21)
17:54:56: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce9629c0; hard reset failed (0x21)
17:54:57: isci 0000:08:00.0: isci_port_perform_hard_reset: iport = ffff881fce9628c8; hard reset failed (0x21)
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964310
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964490
17:55:06: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce9643d0
17:55:06: sas: ata8: end_device-0:1: reset failed (errno=-16)
17:55:06: sas: ata10: end_device-0:3: reset failed (errno=-16)
17:55:06: ata10: hard resetting link
17:55:06: ata8: hard resetting link
17:55:06: sas: ata9: end_device-0:2: reset failed (errno=-16)
17:55:06: ata9: hard resetting link
17:55:07: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964250
17:55:07: sas: ata7: end_device-0:0: reset failed (errno=-16)
17:55:07: ata7: hard resetting link

...<More of the above.>...


17:56:22: sas: ata10: end_device-0:3: reset failed (errno=-16)
17:56:22: ata10: reset failed, giving up
17:56:22: sas: ata8: end_device-0:1: reset failed (errno=-16)
17:56:22: ata8: reset failed, giving up
17:56:22: ata8.00: disabled
17:56:22: ata8.00: device reported invalid CHS sector 0
17:56:22: ata8: EH complete
17:56:22: ata10.00: disabled
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10.00: device reported invalid CHS sector 0
17:56:22: ata10: EH complete
17:56:22: sas: ata9: end_device-0:2: reset failed (errno=-16)
17:56:22: ata9: reset failed, giving up
17:56:22: ata9.00: disabled
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9.00: device reported invalid CHS sector 0
17:56:22: ata9: EH complete
17:56:22: isci 0000:08:00.0: isci_remote_device_wait_for_resume_from_abort: #### Timeout waiting for resume: ffff881fce964250
17:56:22: sas: ata7: end_device-0:0: reset failed (errno=-16)
17:56:22: ata7: reset failed, giving up
17:56:22: ata7.00: disabled
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7.00: device reported invalid CHS sector 0
17:56:22: ata7: EH complete
17:56:22: sd 0:0:0:0: [sda]
17:56:22: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
17:56:22: sd 0:0:0:0: [sda] CDB:
17:56:22: Write(10): 2a 00 08 70 08 01 00 00 02 00
17:56:22: end_request: I/O error, dev sda, sector 141559809
17:56:22: sd 0:0:0:0: [sda]
17:56:22: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
17:56:22: XFS (sda1): metadata I/O error: block 0x8700001 ("xfs_buf_iodone_callbacks") error 5 numblks 1
17:56:22: XFS (sdc1): metadata I/O error: block 0x19500040 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sdd1): metadata I/O error: block 0x19500040 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sda1): metadata I/O error: block 0x8700002 ("xfs_buf_iodone_callbacks") error 5 numblks 1
17:56:22: XFS (sdc1): metadata I/O error: block 0x19500050 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: XFS (sdd1): metadata I/O error: block 0x19500050 ("xfs_buf_iodone_callbacks") error 5 numblks 16
17:56:22: Buffer I/O error on device sdb1, logical block 35424025
17:56:22: lost page write due to I/O error on sdb1
    
por jqr 04.12.2015 / 16:54

0 respostas