RAID performance suddenly slow

16

Recently, we noticed that our database queries were taking longer than usual to run. After some investigation, it looks like we are getting very slow disk reads.

We ran into a similar problem in the past, caused by the RAID controller starting a relearn cycle on the BBU and switching to write-through. That does not appear to be the case this time.

I ran bonnie++ a few times over the course of a few days. Here are the results:

The 22-82 M/s figures seem pretty abysmal. Running dd against the raw device for a few minutes shows anywhere from 15.8 MB/s to 225 MB/s reads (see the update below). iotop doesn't indicate any other process competing for IO, so I don't know why the read speed is so variable.

The RAID card is a MegaRAID SAS 9280 with 12 SAS drives (15k, 300GB) in RAID 10 with an XFS filesystem (OS on two SSDs configured in RAID 1). I don't see any S.M.A.R.T. alerts, and the array doesn't appear to be degraded.

I also ran xfs_check, and there don't appear to be any XFS consistency problems.

What should the next investigative steps be here?

Server specs

Ubuntu 12.04.5 LTS
128GB RAM
Intel(R) Xeon(R) [email protected]

Output of xfs_repair -n:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 3
        - agno = 2
        - agno = 0
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

Output of megacli -AdpAllInfo -aAll:

Versions
================
Product Name: LSI MegaRAID SAS 9280-4i4e
Serial No: SV24919344
FW Package Build: 12.12.0-0124

Mfg. Data
================
Mfg. Date: 12/06/12
Rework Date: 00/00/00
Revision No: 04B
Battery FRU: N/A

Image Versions in Flash:
================
FW Version: 2.130.363-1846
BIOS Version: 3.25.00_4.12.05.00_0x05180000
Preboot CLI Version: 04.04-020:#%00009
WebBIOS Version: 6.0-51-e_47-Rel
NVDATA Version: 2.09.03-0039
Boot Block Version: 2.02.00.00-0000
BOOT Version: 09.250.01.219

Pending Images in Flash
================
None

PCI Info
================
Controller Id: 0000
Vendor Id: 1000
Device Id: 0079
SubVendorId: 1000
SubDeviceId: 9282
Host Interface: PCIE
ChipRevision: B4
Link Speed: 0
Number of Frontend Port: 0
Device Interface: PCIE
Number of Backend Port: 8
Port : Address
0      5003048001c1e47f
1      0000000000000000
2      0000000000000000
3      0000000000000000
4      0000000000000000
5      0000000000000000
6      0000000000000000
7      0000000000000000

HW Configuration
================
SAS Address: 500605b005a6cbc0
BBU: Present
Alarm: Present
NVRAM: Present
Serial Debugger: Present
Memory: Present
Flash: Present
Memory Size: 512MB
TPM: Absent
On board Expander: Absent
Upgrade Key: Absent
Temperature sensor for ROC: Absent
Temperature sensor for controller: Absent

Settings
================
Current Time: 14:58:51 7/11, 2016
Predictive Fail Poll Interval: 300sec
Interrupt Throttle Active Count: 16
Interrupt Throttle Completion: 50us
Rebuild Rate: 30%
PR Rate: 30%
BGI Rate: 30%
Check Consistency Rate: 30%
Reconstruction Rate: 30%
Cache Flush Interval: 4s
Max Drives to Spinup at One Time: 4
Delay Among Spinup Groups: 2s
Physical Drive Coercion Mode: Disabled
Cluster Mode: Disabled
Alarm: Enabled
Auto Rebuild: Enabled
Battery Warning: Enabled
Ecc Bucket Size: 15
Ecc Bucket Leak Rate: 1440 Minutes
Restore HotSpare on Insertion: Disabled
Expose Enclosure Devices: Enabled
Maintain PD Fail History: Enabled
Host Request Reordering: Enabled
Auto Detect BackPlane Enabled: SGPIO/i2c SEP
Load Balance Mode: Auto
Use FDE Only: No
Security Key Assigned: No
Security Key Failed: No
Security Key Not Backedup: No
Default LD PowerSave Policy: Controller Defined
Maximum number of direct attached drives to spin up in 1 min: 120
Auto Enhanced Import: No
Any Offline VD Cache Preserved: No
Allow Boot with Preserved Cache: No
Disable Online Controller Reset: No
PFK in NVRAM: No
Use disk activity for locate: No
POST delay: 90 seconds
BIOS Error Handling: Stop On Errors
Current Boot Mode: Normal

Capabilities
================
RAID Level Supported: RAID0, RAID1, RAID5, RAID6, RAID00, RAID10, RAID50, RAID60, PRL 11, PRL 11 with spanning, SRL 3 supported, PRL11-RLQ0 DDF layout with no span, PRL11-RLQ0 DDF layout with span
Supported Drives: SAS, SATA

Allowed Mixing:
Mix in Enclosure Allowed
Mix of SAS/SATA of HDD type in VD Allowed

Status
================
ECC Bucket Count: 0

Limitations
================
Max Arms Per VD: 32
Max Spans Per VD: 8
Max Arrays: 128
Max Number of VDs: 64
Max Parallel Commands: 1008
Max SGE Count: 80
Max Data Transfer Size: 8192 sectors
Max Strips PerIO: 42
Max LD per array: 16
Min Strip Size: 8 KB
Max Strip Size: 1.0 MB
Max Configurable CacheCade Size: 0 GB
Current Size of CacheCade: 0 GB
Current Size of FW Cache: 350 MB

Device Present
================
Virtual Drives: 2
  Degraded: 0
  Offline: 0
Physical Devices: 16
  Disks: 14
  Critical Disks: 0
  Failed Disks: 0

Supported Adapter Operations
================
Rebuild Rate: Yes
CC Rate: Yes
BGI Rate: Yes
Reconstruct Rate: Yes
Patrol Read Rate: Yes
Alarm Control: Yes
Cluster Support: No
BBU: Yes
Spanning: Yes
Dedicated Hot Spare: Yes
Revertible Hot Spares: Yes
Foreign Config Import: Yes
Self Diagnostic: Yes
Allow Mixed Redundancy on Array: No
Global Hot Spares: Yes
Deny SCSI Passthrough: No
Deny SMP Passthrough: No
Deny STP Passthrough: No
Support Security: No
Snapshot Enabled: No
Support the OCE without adding drives: Yes
Support PFK: Yes
Support PI: No
Support Boot Time PFK Change: No
Disable Online PFK Change: No
PFK TrailTime Remaining: 0 days 0 hours
Support Shield State: No
Block SSD Write Disk Cache Change: No

Supported VD Operations
================
Read Policy: Yes
Write Policy: Yes
IO Policy: Yes
Access Policy: Yes
Disk Cache Policy: Yes
Reconstruction: Yes
Deny Locate: No
Deny CC: No
Allow Ctrl Encryption: No
Enable LDBBM: No
Support Breakmirror: No
Power Savings: No

Supported PD Operations
================
Force Online: Yes
Force Offline: Yes
Force Rebuild: Yes
Deny Force Failed: No
Deny Force Good/Bad: No
Deny Missing Replace: No
Deny Clear: No
Deny Locate: No
Support Temperature: Yes
NCQ: No
Disable Copyback: No
Enable JBOD: No
Enable Copyback on SMART: No
Enable Copyback to SSD on SMART Error: Yes
Enable SSD Patrol Read: No
PR Correct Unconfigured Areas: Yes
Enable Spin Down of UnConfigured Drives: Yes
Disable Spin Down of hot spares: No
Spin Down time: 30
T10 Power State: No

Error Counters
================
Memory Correctable Errors: 0
Memory Uncorrectable Errors: 0

Cluster Information
================
Cluster Permitted: No
Cluster Active: No

Default Settings
================
Phy Polarity: 0
Phy PolaritySplit: 0
Background Rate: 30
Strip Size: 256kB
Flush Time: 4 seconds
Write Policy: WB
Read Policy: Adaptive
Cache When BBU Bad: Disabled
Cached IO: No
SMART Mode: Mode 6
Alarm Disable: Yes
Coercion Mode: None
ZCR Config: Unknown
Dirty LED Shows Drive Activity: No
BIOS Continue on Error: 0
Spin Down Mode: None
Allowed Device Type: SAS/SATA Mix
Allow Mix in Enclosure: Yes
Allow HDD SAS/SATA Mix in VD: Yes
Allow SSD SAS/SATA Mix in VD: No
Allow HDD/SSD Mix in VD: No
Allow SATA in Cluster: No
Max Chained Enclosures: 16
Disable Ctrl-R: Yes
Enable Web BIOS: Yes
Direct PD Mapping: No
BIOS Enumerate VDs: Yes
Restore Hot Spare on Insertion: No
Expose Enclosure Devices: Yes
Maintain PD Fail History: Yes
Disable Puncturing: No
Zero Based Enclosure Enumeration: No
PreBoot CLI Enabled: Yes
LED Show Drive Activity: Yes
Cluster Disable: Yes
SAS Disable: No
Auto Detect BackPlane Enable: SGPIO/i2c SEP
Use FDE Only: No
Enable Led Header: No
Delay during POST: 0
EnableCrashDump: No
Disable Online Controller Reset: No
EnableLDBBM: No
Un-Certified Hard Disk Drives: Allow
Treat Single span R1E as R10: No
Max LD per array: 16
Power Saving option: Don't Auto spin down Configured Drives
Max power savings option is not allowed for LDs. Only T10 power conditions are to be used.
Default spin down time in minutes: 30
Enable JBOD: No
TTY Log In Flash: No
Auto Enhanced Import: No
BreakMirror RAID Support: No
Disable Join Mirror: No
Enable Shield State: No
Time taken to detect CME: 60s

Output of megacli -AdpBbuCmd -GetBbuStatus -aAll:

BBU status for Adapter: 0

BatteryType: iBBU
Voltage: 4068 mV
Current: 0 mA
Temperature: 30 C
Battery State: Optimal
BBU Firmware Status:

  Charging Status: Charging
  Voltage: OK
  Temperature: OK
  Learn Cycle Requested: No
  Learn Cycle Active: No
  Learn Cycle Status: OK
  Learn Cycle Timeout: No
  I2c Errors Detected: No
  Battery Pack Missing: No
  Battery Replacement required: No
  Remaining Capacity Low: No
  Periodic Learn Required: No
  Transparent Learn: No
  No space to cache offload: No
  Pack is about to fail & should be replaced: No
  Cache Offload premium feature required: No
  Module microcode update required: No

GasGuageStatus:
  Fully Discharged: No
  Fully Charged: No
  Discharging: Yes
  Initialized: Yes
  Remaining Time Alarm: No
  Discharge Terminated: No
  Over Temperature: No
  Charging Terminated: No
  Over Charged: No
Relative State of Charge: 88 %
Charger System State: 49169
Charger System Ctrl: 0
Charging current: 512 mA
Absolute state of charge: 87 %
Max Error: 4 %

Exit Code: 0x00
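Since an earlier slowdown was traced to a BBU relearn cycle, the relevant fields of this status dump are the two "Learn Cycle" lines. A minimal sketch for pulling them out of the command's output follows; the function name bbu_learn_state is hypothetical, and it assumes the "Key : Value" layout that MegaCli normally prints, which may differ between versions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "megacli -AdpBbuCmd -GetBbuStatus -aAll" output on
# stdin and print only the learn-cycle request/active flags as key=value pairs.
# Assumes the usual "  Key   : Value" layout; adjust if your version differs.
bbu_learn_state() {
    awk -F' *: *' '
        /^ *Learn Cycle (Requested|Active)/ {
            sub(/^ */, "", $1)      # drop the leading indentation from the key
            print $1 "=" $2
        }'
}

# Example (needs root and the controller present):
#   megacli -AdpBbuCmd -GetBbuStatus -aAll | bbu_learn_state
```

If either value comes back as Yes, the controller may have dropped to write-through for the duration of the cycle, which would point back at the earlier failure mode rather than at the disks.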

Output of megacli -LDInfo -Lall -aAll:

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size: 111.281 GB
Sector Size: 512
Mirror Data: 111.281 GB
State: Optimal
Strip Size: 256 KB
Number Of Drives: 2
Span Depth: 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Is VD Cached: No

Virtual Drive: 1 (Target Id: 1)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size: 1.633 TB
Sector Size: 512
Mirror Data: 1.633 TB
State: Optimal
Strip Size: 256 KB
Number Of Drives per span: 2
Span Depth: 6
Default Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Is VD Cached: No
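To rule out a silent switch to write-through, the field to compare is the first item of "Current Cache Policy" for each virtual drive. The following is a small sketch, not a definitive tool: the function name cache_policies is hypothetical, and it assumes the "Virtual Drive: N (Target Id: N)" and "Current Cache Policy: ..." line formats that MegaCli normally prints.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "megacli -LDInfo -Lall -aAll" output on stdin and
# print the active write policy (first item of "Current Cache Policy") per VD.
cache_policies() {
    awk -F': *' '
        /^Virtual Drive:/ {
            split($2, a, " ")       # "$2" looks like "1 (Target Id"; keep the number
            vd = a[1]
        }
        /^Current Cache Policy/ {
            split($2, p, ",")       # first comma-separated item is the write policy
            print "VD " vd ": " p[1]
        }'
}

# Example (needs root and the controller present):
#   megacli -LDInfo -Lall -aAll | cache_policies
```

For the output above this would report WriteBack for both virtual drives, which is consistent with the statement that the BBU is not the cause this time.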

Update: On Andrew's advice, I ran dd for a few minutes to see what kind of throughput I would get on raw disk reads:

dd if=/dev/sdb of=/dev/null bs=256k
19701+0 records in
19700+0 records out
5164236800 bytes (5.2 GB) copied, 202.553 s, 25.5 MB/s

Results from other runs, with highly variable throughput:

18706857984 bytes (19 GB) copied, 1181.51 s, 15.8 MB/s
20923023360 bytes (21 GB) copied, 388.137 s, 53.9 MB/s
21205876736 bytes (21 GB) copied, 55.5997 s, 381 MB/s
25391005696 bytes (25 GB) copied, 153.903 s, 165 MB/s
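To compare several dd runs at a glance, the rate can be recomputed from the byte count and elapsed time on each summary line. This is only a sketch: the function name dd_rate_range is hypothetical, and it assumes the GNU dd summary format "N bytes (...) copied, T s, R MB/s", with MB meaning 10^6 bytes as dd reports it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read GNU dd summary lines on stdin and print the number
# of runs plus the slowest and fastest rate, recomputed as bytes / seconds.
dd_rate_range() {
    awk '
        /bytes .* copied/ {
            rate = $1 / $(NF-3) / 1e6   # $1 = bytes, $(NF-3) = elapsed seconds
            if (n == 0 || rate < min) min = rate
            if (n == 0 || rate > max) max = rate
            n++
        }
        END {
            if (n) printf "runs=%d min=%.1f MB/s max=%.1f MB/s\n", n, min, max
        }'
}

# Example: collect the summaries (dd writes them to stderr) and summarize.
#   dd if=/dev/sdb of=/dev/null bs=256k count=20000 2>> dd.log
#   dd_rate_range < dd.log
```

A spread of more than an order of magnitude between runs on an otherwise idle array is the kind of pattern that is hard to explain by filesystem layout alone and suggests looking at the individual drives.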

Update 2: output of megacli -PDlist -aall: link

    
posted by danpelota 11.07.2016 / 17:25

3 answers

5

As Michal pointed out in his comment, the issue was a "prefailing" disk. There were no red flags in the MegaRAID controller diagnostics, and the SMART Health Status: reported by smartctl was OK, but running smartctl against each disk revealed a huge non-medium error count (I wrote a quick bash script to loop through each disk ID). Here are the relevant bits of the full output:

# Ran this for each individual disk on the /dev/sdb array:
smartctl -a -d megaraid,18  /dev/sdb

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    7950078        0         0   7950078    7950078        660.801           0
write:         0        0         0         0          0        363.247           0
verify:       12        0         0        12         12          0.002           0

Non-medium error count:  3253718

All of the other drives showed a non-medium error count of 0, except for this one (disk ID 18). I identified the disk, swapped it for a new one, and went back to getting 3gbps reads.

According to the smartmontools wiki:
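The original loop script is not shown, so the following is only a sketch of how such a per-disk check could look. The function names and the example device IDs are hypothetical; the `-d megaraid,N` device type and the "Non-medium error count:" line are taken from the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the "Non-medium error count" value found in
# "smartctl -a" output supplied on stdin.
non_medium_count() {
    awk -F: '/^Non-medium error count/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Hypothetical helper: query each MegaRAID device ID behind a block device.
# Usage: scan_disks /dev/sdb 8 9 10 18   (the IDs here are placeholders)
scan_disks() {
    local dev=$1; shift
    local id count
    for id in "$@"; do
        count=$(smartctl -a -d "megaraid,${id}" "$dev" | non_medium_count)
        echo "disk ${id}: non-medium errors = ${count:-n/a}"
    done
}
```

The device IDs to pass in can be read from the "Device Id" fields of megacli -PDList -aAll; a single drive whose count stands far above the others is the candidate for replacement.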

The displayed error logs (if available) are displayed on separate lines:

  • write error counters

  • read error counters

  • verify error counters (only displayed if non-zero)

  • non-medium error counter (only a single number displayed). This represents the number of recoverable events other than write, read or verify errors.

  • error events are held in the "Last n error events" log page. The number of error event records held (i.e. "n") is vendor specific (e.g. up to 23 records are held for Hitachi 10K300 model disks). The contents of each error event record is in ASCII and vendor specific. The parameter code associated with each error event record indicates the relative time at which the error event occurred. A higher parameter code indicates that the error event occurred later in time. If this log page is not supported by the device then "Error Events logging not supported" is output. If this log page is supported and there are error event records then each one is prefixed by "Error event <n>:" where <n> is the parameter code.

    
posted 03.06.2017 / 18:44
0

You need to check your drive's fragmentation:

xfs_db -r /dev/sdbx
frag

You will get a response like this:

actual 347954, ideal 15723, fragmentation factor 95.48%

If your fragmentation factor is high, you will need to defragment your disk (yes, I know, just like on Windows...) :/

To defragment your disk: xfs_fsr -v /dev/sdbx
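The fragmentation check can also be run non-interactively with xfs_db's -c option, and the reported factor can be recomputed from the two extent counts. This is a sketch under the assumption that the factor is (actual - ideal) / actual, which matches the sample line above; the function name frag_factor is hypothetical and /dev/sdbx is the placeholder device from this answer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the "actual N, ideal M, fragmentation factor P%"
# line from xfs_db's frag command on stdin and recompute the percentage.
frag_factor() {
    awk '
        /fragmentation factor/ {
            gsub(/,/, "")                         # strip commas so fields are plain numbers
            printf "%.2f\n", ($2 - $4) / $2 * 100 # $2 = actual extents, $4 = ideal extents
        }'
}

# Example (read-only, run as root against the real device):
#   xfs_db -r -c frag /dev/sdbx | frag_factor
```

Note that this factor approaches 100% quickly: it only says that files have more extents than the ideal, not how large those extents are, so a high value alone does not prove that fragmentation is what slows the reads.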

    
posted 28.03.2017 / 15:27
0

With LSI, there are a few things that are really important.

1) Flash the RAID firmware. You are a few revisions behind the current release.

2) Flash the firmware on the drives and make sure it is up to date as well.

3) Update your driver. Based on the release notes on LSI's site, they released a new driver at the end of January.

Afterwards, you can rerun your tests to see whether anything changes.

    
posted 21.04.2017 / 00:18