Reasons for Linux to kick a disk out of a RAID10 array (degrade it), other than physical failure

1

I would like to understand the set of conditions under which Linux considers a drive to be actually faulty, kicks it out of the array, and treats the RAID10 array as degraded.

As far as I know, neither SMART self-test reports, nor any of the SMART values for reallocated sectors, nor block parity problems (/sys/block/md0/md/mismatch_cnt > 0 or even > 10,000) are treated as a reason to exclude a disk from the array.
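For reference, a minimal Python sketch of how these two pieces of state can be inspected: the mismatch counter in sysfs and the members that md has flagged with `(F)` in `/proc/mdstat`. The function names and the `sysfs_root` parameter are illustrative assumptions, not part of any md tooling.

```python
import re
from pathlib import Path


def read_mismatch_cnt(md_name="md0", sysfs_root="/sys"):
    """Return the mismatch counter md exposes for the array in sysfs.

    sysfs_root is a hypothetical parameter that only exists so the
    function can be pointed at a fake tree when testing.
    """
    path = Path(sysfs_root) / "block" / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def faulty_members(mdstat_text, md_name="md0"):
    """Return the member devices flagged (F) on the array's /proc/mdstat line."""
    for line in mdstat_text.splitlines():
        if line.startswith(md_name + " :"):
            # Members look like "sdb1[1]"; a failed one is "sdb1[1](F)".
            return re.findall(r"([\w-]+)\[\d+\]\(F\)", line)
    return []
```

On a live system one would call `read_mismatch_cnt()` and `faulty_members(Path("/proc/mdstat").read_text())`. A large first value with an empty second list is exactly the situation described above: mismatches alone do not get a member kicked.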

So what are these reasons (apart from the drive being physically unable to communicate, of course)?

    
by Vladislav Rastrusny 28.11.2017 / 18:58

1 answer

2

In principle, both read and write block errors can take a disk offline. However, the specific behavior depends on the kernel in use.

From the RECOVERY section of the md man page:

If the md driver detects a write error on a device in a RAID1, RAID4, RAID5, RAID6, or RAID10 array, it immediately disables that device (marking it as faulty) and continues operation on the remaining devices. If there are spare drives, the driver will start recreating on one of the spare drives the data which was on that failed drive, either by copying a working drive in a RAID1 configuration, or by doing calculations with the parity block on RAID4, RAID5 or RAID6, or by finding and copying originals for RAID10.

In kernels prior to about 2.6.15, a read error would cause the same effect as a write error. In later kernels, a read-error will instead cause md to attempt a recovery by overwriting the bad block. i.e. it will find the correct data from elsewhere, write it over the block that failed, and then try to read it back again. If either the write or the re-read fail, md will treat the error the same way that a write error is treated, and will fail the whole device.
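The rules quoted above can be sketched as a small decision function. This is only a toy model of the behavior the man page describes, not kernel code: the names `Outcome` and `handle_io_error` and all parameters are made up for illustration, and the model deliberately ignores the bad block list available from Linux 3.5.

```python
from enum import Enum


class Outcome(Enum):
    REPAIRED = "bad block overwritten with data from elsewhere"
    DEVICE_FAILED = "device marked faulty, array degraded"


def handle_io_error(kind, rewrite_ok=True, reread_ok=True, legacy_kernel=False):
    """Model md's reaction to an I/O error as described in RECOVERY.

    kind          -- "read" or "write"
    rewrite_ok    -- did overwriting the bad block with recovered data succeed?
    reread_ok     -- did reading the block back afterwards succeed?
    legacy_kernel -- True for kernels before about 2.6.15
    """
    if kind == "write":
        # A write error disables the device immediately.
        return Outcome.DEVICE_FAILED
    if kind == "read":
        if legacy_kernel:
            # Old kernels treated read errors like write errors.
            return Outcome.DEVICE_FAILED
        if rewrite_ok and reread_ok:
            return Outcome.REPAIRED
        # Failed rewrite or failed re-read is handled like a write error.
        return Outcome.DEVICE_FAILED
    raise ValueError("kind must be 'read' or 'write'")
```

For example, `handle_io_error("read")` yields `Outcome.REPAIRED`, while `handle_io_error("read", reread_ok=False)` yields `Outcome.DEVICE_FAILED`, which is the path by which a read problem ends up degrading the array.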

Be sure to also read the BAD BLOCK LIST section:

From Linux 3.5 each device in an md array can store a list of known-bad-blocks. This list is 4K in size and usually positioned at the end of the space between the superblock and the data.

When a block cannot be read and cannot be repaired by writing data recovered from other devices, the address of the block is stored in the bad block list. Similarly if an attempt to write a block fails, the address will be recorded as a bad block. If attempting to record the bad block fails, the whole device will be marked faulty.

Attempting to read from a known bad block will cause a read error. Attempting to write to a known bad block will be ignored if any write errors have been reported by the device. If there have been no write errors then the data will be written to the known bad block and if that succeeds, the address will be removed from the list.

This allows an array to fail more gracefully - a few blocks on different devices can be faulty without taking the whole array out of action.
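The bad block list rules quoted above can be modeled roughly as follows. This is a simplified sketch, not md's implementation: the class `MemberDevice`, its method names, and the `capacity` parameter are hypothetical (the man page only says the list is 4K in size), and "recording fails" is modeled solely as the list being full.

```python
class MemberDevice:
    """Toy model of one md member device with a bad block list."""

    def __init__(self, capacity=4):
        # capacity is an assumed stand-in for the fixed-size on-disk list.
        self.capacity = capacity
        self.bad_blocks = set()
        self.seen_write_error = False
        self.faulty = False

    def _record(self, addr):
        """Record addr as bad; if it cannot be recorded, fail the device."""
        if addr in self.bad_blocks:
            return
        if len(self.bad_blocks) >= self.capacity:
            self.faulty = True
        else:
            self.bad_blocks.add(addr)

    def unrepairable_read_error(self, addr):
        """A block could not be read and could not be repaired by rewriting."""
        self._record(addr)

    def read(self, addr):
        """Reading a known bad block results in a read error."""
        return "read error" if addr in self.bad_blocks else "ok"

    def write(self, addr, succeeds=True):
        """Apply the write rules for normal and known-bad blocks."""
        if addr in self.bad_blocks and self.seen_write_error:
            # The device has reported write errors before: skip the write.
            return "ignored"
        if succeeds:
            # A successful write clears the block from the list.
            self.bad_blocks.discard(addr)
            return "written"
        self.seen_write_error = True
        self._record(addr)
        return "device failed" if self.faulty else "recorded as bad"
```

In this model a device accumulates a few bad blocks without being kicked, and `faulty` only flips once a new bad block can no longer be recorded, which matches the "fail more gracefully" intent of the quoted section.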

    
by 28.11.2017 / 19:08