Multipathd config for LSI HBA 3008


I have 5 JBODs connected to the controller via an LSI SAS3008. I am running Arch Linux 4.14.41-1-lts and multipath-tools v0.7.6 (03/10/2018).

My problem is that when a disk starts throwing I/O errors and flapping, multipathd keeps checking the disk and remapping the failed path.

Jul 23 04:59:51 FKM1 multipathd[5315]: 35000c50093d4e7c7: sdbe - tur checker timed out
Jul 23 04:59:51 FKM1 multipathd[5315]: checker failed path 67:128 in map 35000c50093d4e7c7
Jul 23 04:59:51 FKM1 multipathd[5315]: 35000c50093d4e7c7: remaining active paths: 0
Jul 23 04:59:51 FKM1 multipathd[5315]: sdbe: mark as failed
Jul 23 04:59:56 FKM1 multipathd[5315]: checker failed path 67:128 in map 35000c50093d4e7c7
Jul 23 05:04:37 FKM1 multipathd[5315]: 67:128: reinstated
Jul 23 05:04:37 FKM1 multipathd[5315]: 35000c50093d4e7c7: remaining active paths: 1
Jul 23 05:05:27 FKM1 multipathd[5315]: 35000c50093d4e7c7: sdbe - tur checker timed out
Jul 23 05:05:27 FKM1 multipathd[5315]: checker failed path 67:128 in map 35000c50093d4e7c7
Jul 23 05:05:27 FKM1 multipathd[5315]: 35000c50093d4e7c7: remaining active paths: 0
Jul 23 05:05:27 FKM1 multipathd[5315]: sdbe: mark as failed
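
(For reference, the state multipathd reports for each path can be watched while this is happening, using standard multipath-tools commands; 35000c50093d4e7c7 is the affected map from the log above.)

# Topology and per-path status of the affected map
multipath -ll 35000c50093d4e7c7
# Every path as the daemon currently sees it (dm state, checker state)
multipathd -k"show paths"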

Because of this, multipath keeps trying to remap the faulty disk every time it shows up again.

[Fri Aug  3 00:18:37 2018] alua: device handler registered
[Fri Aug  3 00:18:37 2018] emc: device handler registered
[Fri Aug  3 00:18:37 2018] rdac: device handler registered
[Fri Aug  3 00:18:37 2018] device-mapper: uevent: version 1.0.3
[Fri Aug  3 00:18:37 2018] device-mapper: ioctl: 4.37.0-ioctl (2017-09-20) initialised: [email protected]
[Fri Aug  3 00:18:43 2018] device-mapper: multipath service-time: version 0.3.0 loaded
[Fri Aug  3 00:18:43 2018] device-mapper: table: 254:0: multipath: error getting device
[Fri Aug  3 00:18:43 2018] device-mapper: ioctl: error adding target to table
[Fri Aug  3 00:18:43 2018] device-mapper: table: 254:0: multipath: error getting device
[Fri Aug  3 00:18:43 2018] device-mapper: ioctl: error adding target to table
[Fri Aug  3 00:21:19 2018] sd 12:0:16:0: attempting task abort! scmd(ffffa03a6c4de948)
[Fri Aug  3 00:21:19 2018] sd 12:0:16:0: [sdbh] tag#1 CDB: opcode=0x88 88 00 00 00 00 02 ba a0 f0 00 00 00 02 00 00 00
[Fri Aug  3 00:21:19 2018] scsi target12:0:16: handle(0x001c), sas_address(0x5000c50093d5135d), phy(8)
[Fri Aug  3 00:21:19 2018] scsi target12:0:16: enclosure_logical_id(0x500304800929f87f), slot(8)
[Fri Aug  3 00:21:19 2018] scsi target12:0:16: enclosure level(0x0001),connector name(1   )
[Fri Aug  3 00:21:19 2018] sd 12:0:16:0: task abort: SUCCESS scmd(ffffa03a6c4de948)
[Fri Aug  3 00:21:19 2018] sd 12:0:16:0: attempting task abort! scmd(ffffa07b2eb87d48)
[Fri Aug  3 00:21:19 2018] sd 12:0:16:0: [sdbh] tag#0 CDB: opcode=0x88 88 00 00 00 00 02 ba a0 f0 00 00 00 02 00 00 00
[Fri Aug  3 00:21:19 2018] scsi target12:0:16: handle(0x001c), sas_address(0x5000c50093d5135d), phy(8)
[Fri Aug  3 00:21:19 2018] scsi target12:0:16: enclosure_logical_id(0x500304800929f87f), slot(8)
[Fri Aug  3 00:21:19 2018] scsi target12:0:16: enclosure level(0x0001),connector name(1   )
[Fri Aug  3 00:21:19 2018] sd 12:0:16:0: task abort: SUCCESS scmd(ffffa07b2eb87d48)
[Fri Aug  3 00:21:21 2018] device-mapper: multipath: Failing path 67:176.
[Fri Aug  3 00:21:21 2018] sd 12:0:16:0: attempting task abort! scmd(ffffa03a89b38148)
[Fri Aug  3 00:21:21 2018] sd 12:0:16:0: [sdbh] tag#11 CDB: opcode=0x0 00 00 00 00 00 00
[Fri Aug  3 00:21:21 2018] scsi target12:0:16: handle(0x001c), sas_address(0x5000c50093d5135d), phy(8)
[Fri Aug  3 00:21:21 2018] scsi target12:0:16: enclosure_logical_id(0x500304800929f87f), slot(8)
[Fri Aug  3 00:21:21 2018] scsi target12:0:16: enclosure level(0x0001),connector name(1   )
[Fri Aug  3 00:21:21 2018] sd 12:0:16:0: task abort: SUCCESS scmd(ffffa03a89b38148)
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721044480
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 0
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 512
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721043968
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721044480
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 0
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 512
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721043968
[Fri Aug  3 00:21:26 2018] print_req_error: I/O error, dev dm-208, sector 11721044480
[Fri Aug  3 00:21:57 2018] sd 12:0:16:0: attempting task abort! scmd(ffffa03a89b3f148)

After a while the cycle repeats, until the mpt3sas driver gives up and resets the LSI card.

[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: iomem(0x00000000fbe40000), mapped(0xffffbe0e8dca0000), size(65536)
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: ioport(0x000000000000e000), size(256)
[Fri Aug  3 00:18:12 2018] usb 2-1-port6: over-current condition
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: sending message unit reset !!
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: message unit reset: SUCCESS
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: Allocated physical memory: size(20778 kB)
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: Current Controller Queue Depth(9564),Max Controller Queue Depth(9664)
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: Scatter Gather Elements per IO(128)
[Fri Aug  3 00:18:12 2018] usb 3-14.1: new low-speed USB device number 3 using xhci_hcd
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: LSISAS3008: FWVersion(15.00.02.00), ChipRevision(0x02), BiosVersion(08.35.00.00)
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: Protocol=(
[Fri Aug  3 00:18:12 2018] Initiator
[Fri Aug  3 00:18:12 2018] ,Target
[Fri Aug  3 00:18:12 2018] ),
[Fri Aug  3 00:18:12 2018] Capabilities=(
[Fri Aug  3 00:18:12 2018] TLR
[Fri Aug  3 00:18:12 2018] ,EEDP
[Fri Aug  3 00:18:12 2018] ,Snapshot Buffer
[Fri Aug  3 00:18:12 2018] ,Diag Trace Buffer
[Fri Aug  3 00:18:12 2018] ,Task Set Full
[Fri Aug  3 00:18:12 2018] ,NCQ
[Fri Aug  3 00:18:12 2018] )
[Fri Aug  3 00:18:12 2018] scsi host13: Fusion MPT SAS Host
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: sending port enable !!
[Fri Aug  3 00:18:12 2018] mpt3sas_cm4: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (528262416 kB)
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: host_add: handle(0x0001), sas_addr(0x500605b00c482a80), phys(8)
[Fri Aug  3 00:18:12 2018] mpt3sas_cm3: expander_add: handle(0x0009), parent(0x0001), sas_addr(0x5003048017aed57f), phys(38)
[Fri Aug  3 00:18:12 2018] scsi 13:0:0:0: Direct-Access     SEAGATE  ST800FM0173      0007 PQ: 0 ANSI: 6

When mpt3sas issues a "diag reset", it means I lose an entire 90-disk JBOD at once! And because of that, a single faulty disk can suspend my ZFS pool.

Now I am looking for a solution, and I think that if I could tell multipath "do not remap a path if its disk has failed 3 times", my problem would be solved: the faulty disk would no longer be used by the pool, and if the pool does not touch the faulty disk, it cannot cause I/O errors.

So, put simply, I am looking for a way to stop using a failed disk for good.

I found some settings for /etc/multipath.conf, but I am not sure whether they will solve my problem. Can you tell me the best solution for my problem?

defaults {
    user_friendly_names     no
    polling_interval        10
    path_selector           "round-robin 0"
    path_grouping_policy    failover
    path_checker            readsector0
    failback                manual
    no_path_retry           3
    prio                    rdac
}


blacklist_exceptions {
        property "(ID_WWN|SCSI_IDENT_.*|ID_SERIAL)"
}
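
While searching I also came across the "shaky path" options documented in recent multipath.conf man pages, which sound close to "do not reinstate a path that keeps failing". Below is a sketch of the snippet that would go into the defaults section above; I am not sure whether v0.7.6 already supports these options, and the values are only illustrative (check man multipath.conf for the exact semantics and units):

defaults {
    # Do not reinstate a path that keeps flapping; keep it failed for a while instead.
    san_path_err_threshold      3       # fail the path after 3 checker errors ...
    san_path_err_forget_rate    100     # ... counted over a window of 100 checker runs
    san_path_err_recovery_time  3600    # keep it out of use for this recovery period before retrying
}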

Here is the full dmesg log -> link

    
asked by Morphinz 07.08.2018 / 17:02

1 answer


It is not multipathd that is aborting those SCSI commands; it is the Linux kernel. Once an abort is not handled in due time, SCSI error handling kicks in and progressively more and more gets reset (all the way up to an HBA reset) in an attempt to bring the disk back. You need to convince Linux to declare the disk dead sooner.
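
(For reference, the per-device timeouts involved can be inspected through the standard SCSI sysfs attributes; sdbe below is the flapping path from the logs in the question.)

cat /sys/block/sdbe/device/timeout      # seconds before an outstanding command is considered failed
cat /sys/block/sdbe/device/eh_timeout   # seconds allowed for error-handling commands (aborts, resets)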

You can write a udev rule that reduces the timeout on the disk corresponding to that path (link) so it is declared offline sooner, but that will probably require a lot of experimentation (and the risk is that it may end up applying to all paths).
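
A minimal sketch of such a rule, assuming the flapping disk can be matched by its WWN (verify the exact property with udevadm info); the file name and the timeout values are only illustrative and will need tuning:

# /etc/udev/rules.d/60-scsi-disk-timeout.rules  (hypothetical file name)
# Shorten the command timeout and error-handler timeout for this one disk so the
# kernel declares it dead sooner instead of escalating to bus/HBA resets.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", \
  ENV{ID_WWN}=="0x5000c50093d4e7c7", \
  ATTR{device/timeout}="10", ATTR{device/eh_timeout}="5"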

    
answered 22.09.2018 / 18:55