First, here is the short version of my question. I have a red light blinking on a drive in a RAID array, and although MegaCli reports no disk failures or warnings, some MegaCli commands show 24 disks while others show only 23. I also see the following error occurring daily:
Event Description: Controller encountered a fatal error and was reset
Are these things related? Is there a problem here?
Now here is the longer version. I inherited responsibility for a server (let's call it my_server) that is hosted in a data center and that I believe has an LSI MegaRAID SAS 9265-8i with a RAID 50 (RAID 5+0) configuration. I received an email from the data center saying that a red light is blinking on one of this server's hard drives. Unfortunately I know almost nothing about RAID arrays, so I have had to feel my way through the MegaRAID SAS Software User Guide and various online tutorials.
I ssh'ed into the server to try to diagnose the problem. What follows is a sample shell session that shows my efforts and provides some relevant information about the system in question.
First, I check some basic system information:
$ cat /etc/issue
CentOS release 6.4 (Final)
Kernel \r on an \m
$ uname -a
Linux my_server 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed Jun 12 03:34:52 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Next, I check the RAID array and the MegaCli version:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -aALL | grep "Product Name"
Product Name : LSI MegaRAID SAS 9265-8i
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -a0 | grep 'RAID Level'
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -v
MegaCLI SAS RAID Management Tool Ver 8.04.07 May 28, 2012
(c)Copyright 2011, LSI Corporation, All Rights Reserved.
Exit Code: 0x00
Now, some summary information about the drives in the array:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpallinfo -a0 | grep -A8 "Device Present"
Device Present
================
Virtual Drives : 1
Degraded : 0
Offline : 0
Physical Devices : 27
Disks : 24
Critical Disks : 0
Failed Disks : 0
Everything looks fine here. Then I check for S.M.A.R.T. alerts:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | grep 'S.M.A.R.T.'
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
[...]
Drive has flagged a S.M.A.R.T alert : No
Drive has flagged a S.M.A.R.T alert : No
No S.M.A.R.T. alerts, so after reading some tutorials I run a few other commands:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -ldinfo -lall -a0 | grep Drives
Number Of Drives : 23
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -CfgDsply -aALL | grep -Pi 'SPAN|Span\ Ref|Number\ of'
Number of DISK GROUPS: 1
Number of Spans: 1
SPAN: 0
Span Reference: 0x00
Number of PDs: 23
Number of VDs: 1
Number of dedicated Hotspares: 0
Number Of Drives : 23
Span Depth : 1
Drive's postion: DiskGroup: 0, Span: 0, Arm: 0
Drive's postion: DiskGroup: 0, Span: 0, Arm: 1
Drive's postion: DiskGroup: 0, Span: 0, Arm: 2
Drive's postion: DiskGroup: 0, Span: 0, Arm: 3
[...]
Drive's postion: DiskGroup: 0, Span: 0, Arm: 20
Drive's postion: DiskGroup: 0, Span: 0, Arm: 21
Drive's postion: DiskGroup: 0, Span: 0, Arm: 22
Now I'm a bit confused, because some commands (e.g. adpallinfo and pdlist) show 24 disks present while others (e.g. ldinfo and CfgDsply) show only 23.
Finally, I generate an event log file and look for signs of trouble:
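One way to look into the 24 versus 23 mismatch, as a rough sketch: tally the `Firmware state` line of every physical drive, so that any drive that is present but not in the same state as the others stands out. The function name `tally_firmware_states` is made up for this example, and it assumes that `-PDList` prints one `Firmware state:` line per drive, formatted like the one in `-PDInfo` output.

```shell
#!/bin/sh
# Tally the "Firmware state" lines of a MegaCli -PDList dump read from stdin.
# Assumption: each physical drive contributes exactly one line that starts
# with "Firmware state:". tally_firmware_states is a hypothetical helper name.
tally_firmware_states() {
    grep '^Firmware state:' \
        | sed 's/^Firmware state:[[:space:]]*//' \
        | sort \
        | uniq -c \
        | sed 's/^[[:space:]]*//'    # strip the padding that uniq -c adds
}

# Intended use on the server (not executed here):
#   sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL | tally_firmware_states
```

Each output line is a count followed by a state. If the counts add up to 24 but only 23 drives share the state of the array members, the odd one out is the drive worth inspecting with `-PDInfo`.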
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -adpeventlog -getevents -f lsi-events.log -a0 -nolog
$ cat lsi-events.log | grep -P -i 'fail|error|warn'
[...]
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
Event Description: Controller encountered a fatal error and was reset
$ cat lsi-events.log | grep -B6 -A3 -P -i 'fail|error|warn'
[...]
seqNum: 0x000f8644
Time: Sun Feb 26 07:32:16 2017
Code: 0x00000159
Class: 2
Locale: 0x20
Event Description: Controller encountered a fatal error and was reset
Event Data:
===========
None
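To check whether the resets really are daily, the log can be grouped by date. This is only a sketch: `fatal_resets_per_day` is a hypothetical helper, and it assumes every event block carries a `Time:` line in the format shown above before its `Event Description:` line.

```shell
#!/bin/sh
# Count "Controller encountered a fatal error and was reset" events per day.
# Reads the file(s) given as arguments, or stdin when none are given.
# Assumption: "Time: <weekday> <month> <day> <hh:mm:ss> <year>" precedes each
# "Event Description:" line, as in the lsi-events.log excerpt in this post.
fatal_resets_per_day() {
    awk '
        /^Time:/ { day = $3 " " $4 " " $6 }    # e.g. "Feb 26 2017"
        /^Event Description: Controller encountered a fatal error and was reset/ {
            if (!(day in count)) order[++n] = day    # remember first-seen order
            count[day]++
        }
        END { for (i = 1; i <= n; i++) print order[i] ": " count[order[i]] }
    ' "$@"
}

# Intended use (not executed here):
#   fatal_resets_per_day lsi-events.log
```

The output is one line per date, in the order the dates first appear in the log, which makes gaps or clusters easy to spot.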
I also look for messages specifically related to slot 23:
$ cat lsi-events.log | grep -P -i 's23' | tail -30
Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Inserted: PD 1f(e0x21/s23)
Event Description: Inserted: PD 1f(e0x21/s23) Info: enclPd=21, scsiType=0, portMap=10, sasAddr=5000c50034366199,5000c5003436619a
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
Event Description: Global Hot Spare PD 1f(e0x21/s23) (global,rev) disabled
Event Description: State change on PD 1f(e0x21/s23) from HOT SPARE(2) to UNCONFIGURED_GOOD(0)
Event Description: Power state change on PD 1f(e0x21/s23) from POWERSAVE(1) to TRANSITION(ff)
Event Description: Global Hot Spare created on PD 1f(e0x21/s23) (global,rev)
Event Description: State change on PD 1f(e0x21/s23) from UNCONFIGURED_GOOD(0) to HOT SPARE(2)
Event Description: Power state change on PD 1f(e0x21/s23) from TRANSITION(ff) to ON(0)
Event Description: Power state change on PD 1f(e0x21/s23) from ON(0) to POWERSAVE(1)
I contacted the data center and was told that the blinking light was on drive 10, so I looked at that drive:
$ sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDInfo -PhysDrv [33:10] -a0
Enclosure Device ID: 33
Slot Number: 10
Drive's postion: DiskGroup: 0, Span: 0, Arm: 10
Enclosure position: 1
Device Id: 18
WWN: 5000C500344D5940
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 1.819 TB [0xe8e088b0 Sectors]
Non Coerced Size: 1.818 TB [0xe8d088b0 Sectors]
Coerced Size: 1.818 TB [0xe8d00000 Sectors]
Emulated Drive: No
Firmware state: Online, Spun Up
Commissioned Spare : No
Emergency Spare : No
Device Firmware Level: 0006
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x5000c500344d5941
SAS Address(1): 0x5000c500344d5942
Connected Port Number: 0(path0) 1(path1)
Inquiry Data: SEAGATE ST32000444SS 00069WM6369D
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :26C (78.80 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Port-1 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
Exit Code: 0x00
I also tried using smartctl:
$ sudo smartctl -a -d megaraid,18 /dev/sdc
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-358.11.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
Vendor: SEAGATE
Product: ST32000444SS
Revision: 0006
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Logical block size: 512 bytes
Logical Unit id: 0x5000c500344d5943
Serial number: 9WM6369D0000914458SC
Device type: disk
Transport protocol: SAS
Local Time is: Tue Feb 28 17:18:33 2017 CST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature: 26 C
Drive Trip Temperature: 68 C
Manufactured in week 21 of year 2011
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 41
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 41
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 3508224337
Blocks received from initiator = 38846232
Blocks read from cache and sent to initiator = 44013719
Number of read and write commands whose size <= segment size = 2649500
Number of read and write commands whose size > segment size = 4
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 45862.30
number of minutes until next internal SMART test = 46
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    22540834        0         0  22540834   22540834        230.346           0
write:          0        0         0         0          0         20.012           0
verify: 161330204        1         0 161330205  161330205       1896.577           0
Non-medium error count: 0
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged
Long (extended) Self Test duration: 18500 seconds [308.3 minutes]
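The same smartctl check could be repeated for every drive behind the controller rather than only device 18. The sketch below pulls the `Device Id` values out of a `-PDList` dump. `list_device_ids` is a hypothetical helper name, and reusing `/dev/sdc` as the block device for every drive is an assumption carried over from the command above.

```shell
#!/bin/sh
# Print the numeric "Device Id" of every physical drive found in a MegaCli
# -PDList dump read from stdin. list_device_ids is a hypothetical helper name.
list_device_ids() {
    awk -F': *' '/^Device Id:/ { print $2 }'
}

# Intended use on the server (not executed here). smartctl -H prints only the
# overall health status; /dev/sdc is assumed to be the device exposed by the
# MegaRAID controller, as in the smartctl command earlier in this post.
#   for id in $(sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDList -a0 | list_device_ids); do
#       echo "== megaraid,$id =="
#       sudo smartctl -H -d megaraid,"$id" /dev/sdc
#   done
```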