I'm seeing a strange error in the logs, which started like this:
:39:35 host1 kernel: [54674279.243416] mpt2sas0: fault_state(0x2651)!
:39:35 host1 kernel: [54674279.243543] mpt2sas0: sending diag reset !!
:39:36 host1 kernel: [54674280.481215] mpt2sas0: diag reset: SUCCESS
:39:36 host1 kernel: [54674280.713443] mpt2sas0: LSISAS2008: FWVersion(07.15.08.00), ChipRevision(0x03), BiosVersion(07.02.03.00)
:39:36 host1 kernel: [54674280.713451] mpt2sas0: Dell 6Gbps SAS HBA: Vendor(0x1000), Device(0x0072), SSVID(0x1028), SSDID(0x1F1C)
:39:36 host1 kernel: [54674280.713455] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
:39:36 host1 kernel: [54674280.713518] mpt2sas0: sending port enable !!
:39:43 host1 kernel: [54674287.616666] mpt2sas0: port enable: SUCCESS
:39:43 host1 kernel: [54674287.616814] mpt2sas0: search for end-devices: start
:39:43 host1 kernel: [54674287.617657] scsi target7:0:3: handle(0x0009), sas_addr(0x590b11c410294314), enclosure logical id(0x590b11c007729400), slot(7)
:39:43 host1 kernel: [54674287.617735] scsi target7:0:2: handle(0x000a), sas_addr(0x590b11c41025f914), enclosure logical id(0x590b11c007729400), slot(3)
:39:43 host1 kernel: [54674287.617807] mpt2sas0: search for end-devices: complete
:39:43 host1 kernel: [54674287.617810] mpt2sas0: search for raid volumes: start
:39:43 host1 kernel: [54674287.617813] mpt2sas0: search for responding raid volumes: complete
:39:43 host1 kernel: [54674287.617816] mpt2sas0: search for expanders: start
:39:43 host1 kernel: [54674287.617818] mpt2sas0: search for expanders: complete
:39:43 host1 kernel: [54674287.617833] mpt2sas0: search for end-devices: start
:39:43 host1 kernel: [54674287.618468] scsi target7:0:3: handle(0x0009), sas_addr(0x590b11c410294314), enclosure logical id(0x590b11c007729400), slot(7)
:39:43 host1 kernel: [54674287.618543] scsi target7:0:2: handle(0x000a), sas_addr(0x590b11c41025f914), enclosure logical id(0x590b11c007729400), slot(3)
:39:43 host1 kernel: [54674287.618614] mpt2sas0: search for end-devices: complete
:39:43 host1 kernel: [54674287.618617] mpt2sas0: search for raid volumes: start
:39:43 host1 kernel: [54674287.618619] mpt2sas0: search for responding raid volumes: complete
:39:43 host1 kernel: [54674287.618622] mpt2sas0: search for expanders: start
:39:43 host1 kernel: [54674287.618624] mpt2sas0: search for expanders: complete
:39:43 host1 kernel: [54674287.618632] mpt2sas0: _base_fault_reset_work: hard reset: success
:39:43 host1 kernel: [54674287.618639] mpt2sas0: removing unresponding devices: start
:39:43 host1 kernel: [54674287.618642] mpt2sas0: removing unresponding devices: complete
:39:43 host1 kernel: [54674287.618654] mpt2sas0: scan devices: start
:39:43 host1 kernel: [54674287.619530] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
:39:43 host1 kernel: [54674287.619866] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
and the last message repeats many times per second. Other information that may be relevant:
This is a Dell machine running an old Linux kernel, connected via SAS to a Dell disk array.
# uname -a
Linux host1 3.2.0-34-generic #53-Ubuntu SMP Thu Nov 15 10:48:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
# modinfo -F version mpt2sas
10.100.00.00
# lspci | grep LSI
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03)
08:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
With more debugging enabled in mpt2sas, this is the output:
mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
phy-7:4: refresh: parent sas_addr(0x590b11c007729400),
link_rate(0x08), phy(4)
attached_handle(0x0000), sas_addr(0x0000000000000000)
Other machines, connected to different volumes on the same disk array, work normally. The disk array and iDRAC logs give no clues; everything looks normal. Googling turned up a few horror stories in which the RAID ends up losing all of its disks. The problem is not tied to unusually high load.
The behavior goes on for hours.
Red Hat seems to have a very similar question, but there is no solution yet (?):
Unfortunately, I cannot reboot the machine to experiment.
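To put a number on "many times per second", here is a rough sketch that counts the `_scsih_add_device()` failure lines per second from the kernel log. It only assumes the line format shown in the excerpts above; the function name `failures_per_second` and the regular expression are my own, not part of any mpt2sas tooling.

```python
import re
from collections import Counter

# Matches kernel log lines of the form seen above:
#   ... kernel: [54674287.619530] mpt2sas0: failure at <path>:<line>/<func>()!
# Captures the whole-second part of the timestamp, the adapter, and the call site.
FAILURE_RE = re.compile(r"\[\s*(\d+)\.\d+\]\s+(mpt2sas\d+): failure at (\S+)!")


def failures_per_second(lines):
    """Return a Counter keyed by (second, call_site) for each failure line."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            second, _adapter, site = match.groups()
            counts[(int(second), site)] += 1
    return counts


# Two of the lines from the log above, used as sample input.
sample = [
    ":39:43 host1 kernel: [54674287.618654] mpt2sas0: scan devices: start",
    ":39:43 host1 kernel: [54674287.619530] mpt2sas0: failure at "
    "/build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!",
    ":39:43 host1 kernel: [54674287.619866] mpt2sas0: failure at "
    "/build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!",
]

for (second, site), n in sorted(failures_per_second(sample).items()):
    print(second, n, site)
```

On the real machine, the same function could be fed the lines of the kernel log file (or the output of `dmesg`) instead of the sample list, to see whether the rate stays constant over the hours the problem lasts.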