Hard disk in software RAID5 disappeared

I have 6 drives in my box. Two of them are 256GB OCZ-AGILITY4 ATA SSDs.

  • Each is partitioned into a 48 GB block and a 208 GB block.
  • The 48 GB partitions are striped as RAID0 and serve as swap space (md0).
  • The 208 GB partitions are mirrored as RAID1 and hold the / file system (md1).

There are also 4x3.0TB ATA ST3000DM001-9YN166 drives, each with a single partition; these 4 partitions form a RAID5 array (md128). Everything is formatted ext4 and runs on Ubuntu Server 12.04. sda and sdb are the SSDs, while sdc, sdd, sde and sdf are the HDDs.

Randomly (as far as I can tell) md128 became read-only. The syslog looks like this for the event:

Aug 23 16:25:24 crick kernel: [617040.416257] ata4.00: exception Emask 0x0 SAct 0x1f SErr 0x0 action 0x0
Aug 23 16:25:24 crick kernel: [617040.416260] ata4.00: irq_stat 0x40000008
Aug 23 16:25:24 crick kernel: [617040.416262] ata4.00: failed command: READ FPDMA QUEUED
Aug 23 16:25:24 crick kernel: [617040.416265] ata4.00: cmd 60/08:08:00:af:cc/00:00:9c:00:00/40 tag 1 ncq 4096 in
Aug 23 16:25:24 crick kernel: [617040.416265]          res 41/40:08:00:af:cc/00:00:9c:00:00/00 Emask 0x409 (media error) <F>
Aug 23 16:25:24 crick kernel: [617040.416266] ata4.00: status: { DRDY ERR }
Aug 23 16:25:24 crick kernel: [617040.416267] ata4.00: error: { UNC }
Aug 23 16:25:24 crick kernel: [617040.417510] ata4.00: configured for UDMA/133
Aug 23 16:25:24 crick kernel: [617040.417527] sd 3:0:0:0: [sdd] Unhandled sense code
Aug 23 16:25:24 crick kernel: [617040.417528] sd 3:0:0:0: [sdd]  
Aug 23 16:25:24 crick kernel: [617040.417529] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Aug 23 16:25:24 crick kernel: [617040.417530] sd 3:0:0:0: [sdd]  
Aug 23 16:25:24 crick kernel: [617040.417531] Sense Key : Medium Error [current] [descriptor]
Aug 23 16:25:24 crick kernel: [617040.417533] Descriptor sense data with sense descriptors (in hex):
Aug 23 16:25:24 crick kernel: [617040.417534]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Aug 23 16:25:24 crick kernel: [617040.417538]         9c cc af 00 
Aug 23 16:25:24 crick kernel: [617040.417540] sd 3:0:0:0: [sdd]  
Aug 23 16:25:24 crick kernel: [617040.417541] Add. Sense: Unrecovered read error - auto reallocate failed
Aug 23 16:25:24 crick kernel: [617040.417542] sd 3:0:0:0: [sdd] CDB: 
Aug 23 16:25:24 crick kernel: [617040.417543] Read(10): 28 00 9c cc af 00 00 00 08 00
Aug 23 16:25:24 crick kernel: [617040.417547] end_request: I/O error, dev sdd, sector 2630659840
Aug 23 16:25:24 crick kernel: [617040.417550] md/raid:md128: read error not correctable (sector 2630657792 on sdd1).
Aug 23 16:25:24 crick kernel: [617040.417552] md/raid:md128: Disk failure on sdd1, disabling device.
Aug 23 16:25:24 crick kernel: [617040.417552] md/raid:md128: Operation continuing on 2 devices.
Aug 23 16:25:24 crick kernel: [617040.417563] ata4: EH complete
Aug 23 16:25:25 crick kernel: [617040.455605] RAID conf printout:
Aug 23 16:25:25 crick kernel: [617040.455609]  --- level:5 rd:4 wd:2
Aug 23 16:25:25 crick kernel: [617040.455610]  disk 0, o:1, dev:sdc1
Aug 23 16:25:25 crick kernel: [617040.455611]  disk 1, o:0, dev:sdd1
Aug 23 16:25:25 crick kernel: [617040.455612]  disk 2, o:1, dev:sde1
Aug 23 16:25:25 crick kernel: [617040.489941] RAID conf printout:
Aug 23 16:25:25 crick kernel: [617040.489945]  --- level:5 rd:4 wd:2
Aug 23 16:25:25 crick kernel: [617040.489947]  disk 0, o:1, dev:sdc1
Aug 23 16:25:25 crick kernel: [617040.489948]  disk 2, o:1, dev:sde1

A slew of:
Aug 23 16:25:25 crick kernel: [617040.539926] Buffer I/O error on device md128, logical block 986401023
with different block addresses, then:

Aug 23 16:25:25 crick kernel: [617040.539929] EXT4-fs warning (device md128): ext4_end_bio:248: I/O error writing to inode 42993727 (offset 11551637504 size 524288 starting block 986400896)
Aug 23 16:25:25 crick kernel: [617040.541690] JBD2: Detected IO errors while flushing file data on md128-8
Aug 23 16:25:25 crick kernel: [617040.541707] Aborting journal on device md128-8.
Aug 23 16:25:25 crick kernel: [617040.541720] EXT4-fs error (device md128) in ext4_free_blocks:4702: Journal has aborted
Aug 23 16:25:25 crick kernel: [617040.541727] Buffer I/O error on device md128, logical block 1098416128
Aug 23 16:25:25 crick kernel: [617040.541729] lost page write due to I/O error on md128
Aug 23 16:25:25 crick kernel: [617040.541734] Buffer I/O error on device md128, logical block 0
Aug 23 16:25:25 crick kernel: [617040.541736] lost page write due to I/O error on md128
Aug 23 16:25:25 crick kernel: [617040.541740] JBD2: Error -5 detected when updating journal superblock for md128-8.
Aug 23 16:25:25 crick kernel: [617040.541743] EXT4-fs (md128): delayed block allocation failed for inode 49152114 at logical offset 994912 with max blocks 2048 with error -30
Aug 23 16:25:25 crick kernel: [617040.541745] EXT4-fs (md128): This should not happen!! Data will be lost
Aug 23 16:25:25 crick kernel: [617040.541745] 
Aug 23 16:25:25 crick kernel: [617040.541806] EXT4-fs (md128): previous I/O error to superblock detected
Aug 23 16:25:25 crick kernel: [617040.542518] EXT4-fs error (device md128) in ext4_da_writepages:2390: Journal has aborted
Aug 23 16:25:25 crick kernel: [617040.542526] JBD2: Detected IO errors while flushing file data on md128-8
Aug 23 16:25:25 crick kernel: [617040.542529] Buffer I/O error on device md128, logical block 0
Aug 23 16:25:25 crick kernel: [617040.542531] lost page write due to I/O error on md128
Aug 23 16:25:25 crick kernel: [617040.542798] EXT4-fs error (device md128): ext4_journal_start_sb:371: Detected aborted journal
Aug 23 16:25:25 crick kernel: [617040.542810] EXT4-fs (md128): Remounting filesystem read-only
Aug 23 16:25:25 crick kernel: [617040.542815] EXT4-fs (md128): previous I/O error to superblock detected
Aug 23 16:25:25 crick kernel: [617040.542835] EXT4-fs (md128): I/O error while writing superblock
Aug 23 16:25:25 crick kernel: [617040.542838] EXT4-fs (md128): ext4_da_writepages: jbd2_start: 15984 pages, ino 49152114; err -30
Aug 23 16:25:25 crick kernel: [617040.544887] EXT4-fs error (device md128): ext4_journal_start_sb:371: Detected aborted journal
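
The numbers in the log above are consistent with each other, which can be checked with a little arithmetic. The sketch below decodes the Read(10) CDB that failed and compares it with the two sector numbers the kernel printed. It assumes 512-byte logical sectors (which is what fdisk reports for sdd). Reading the 2048-sector gap as the start offset of sdd1 on the disk is an inference, since fdisk cannot display the GPT.

```python
# Decode the failed Read(10) command from the log: "28 00 9c cc af 00 00 00 08 00".
cdb = bytes.fromhex("28 00 9c cc af 00 00 00 08 00")

opcode = cdb[0]                            # 0x28 is READ(10)
lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length in blocks

# Sector numbers copied from the kernel messages.
disk_sector = 2630659840   # "end_request: I/O error, dev sdd, sector ..."
md_sector = 2630657792     # "read error not correctable (sector ... on sdd1)"

print("opcode:", hex(opcode))
print("LBA from CDB:", lba)
print("request size in bytes (assuming 512-byte sectors):", blocks * 512)
print("sdd sector minus sdd1 sector:", disk_sector - md_sector)
```

The LBA decoded from the CDB equals the sector in the `end_request` line, and the request size matches the `ncq 4096 in` field, so all three messages describe the same unreadable 4 KiB region on sdd.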

When I use Disk Utility or gparted, sdf is gone; it isn't even detected anymore. As such, md128 is degraded. The first thing I did was unmount it through Disk Utility, then I tried to see whether anything else could find it:

fdisk -l

Disk /dev/sda: 256.1 GB, 256060514304 bytes
255 heads, 63 sectors/track, 31130 cylinders, total 500118192 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000a88fe

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1       406368256   500117503    46874624   fd  Linux raid autodetect
/dev/sda2            2046   406368255   203183105    5  Extended
/dev/sda5            2048   406368255   203183104   fd  Linux raid autodetect

Partition table entries are not in disk order

Disk /dev/md0: 96.0 GB, 95998181376 bytes
2 heads, 4 sectors/track, 23437056 cylinders, total 187496448 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes
Disk identifier: 0x00000000

Disk /dev/md0 doesn't contain a valid partition table

Disk /dev/md1: 207.9 GB, 207925149696 bytes
2 heads, 4 sectors/track, 50762976 cylinders, total 406103808 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

Disk /dev/md1 doesn't contain a valid partition table

Disk /dev/sdb: 256.1 GB, 256060514304 bytes
255 heads, 63 sectors/track, 31130 cylinders, total 500118192 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000b0740

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *   406368256   500117503    46874624   fd  Linux raid autodetect
/dev/sdb2            2046   406368255   203183105    5  Extended
/dev/sdb5            2048   406368255   203183104   fd  Linux raid autodetect

Partition table entries are not in disk order

WARNING: GPT (GUID Partition Table) detected on '/dev/sdc'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdc: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.

Disk /dev/md128: 9001.4 GB, 9001370124288 bytes
2 heads, 4 sectors/track, -2097367168 cylinders, total 17580801024 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 1572864 bytes
Disk identifier: 0x00000000

Disk /dev/md128 doesn't contain a valid partition table

WARNING: GPT (GUID Partition Table) detected on '/dev/sde'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sde: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.

WARNING: GPT (GUID Partition Table) detected on '/dev/sdd'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdd: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders, total 5860533168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1               1  4294967295  2147483647+  ee  GPT
Partition 1 does not start on physical sector boundary.
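
Two figures in the fdisk output above look alarming but appear to be display artifacts rather than damage. The sketch below checks the md128 size against what a 4-member RAID5 should provide, and shows that the negative cylinder count matches a signed 32-bit wrap-around. The wrap-around explanation is an assumption that happens to reproduce the printed value; it has not been confirmed against the fdisk source.

```python
SECTOR_BYTES = 512

md128_sectors = 17580801024                  # from "Disk /dev/md128"
md128_bytes = md128_sectors * SECTOR_BYTES

# RAID5 over n members holds n-1 members' worth of data; one share goes to parity.
members = 4
per_member_bytes = md128_bytes // (members - 1)

def to_int32(value):
    """Wrap an integer the way a signed 32-bit variable would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value

heads, sectors_per_track = 2, 4              # geometry fdisk printed for md128
cylinders = md128_sectors // (heads * sectors_per_track)

print("md128 bytes:", md128_bytes)
print("data bytes per member:", per_member_bytes)
print("true cylinder count:", cylinders)
print("as signed 32-bit:", to_int32(cylinders))
print("largest 32-bit sector value:", (1 << 32) - 1)
```

The per-member figure is just under the 3000592982016 bytes of each 3 TB disk, as expected once partition and metadata overhead are taken out. The last line prints 4294967295, the same value fdisk shows as the end of the protective `ee` entry on the GPT disks.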

cat /etc/fstab

# <file system> <mount point>   <type>  <options>       <dump>  <pass>
proc            /proc           proc    nodev,noexec,nosuid 0       0
# / was on /dev/md127 during installation
UUID=b63b7341-0b85-40ed-a67b-acfe4f65f563 /               ext4    errors=remount-ro 0       1
# /genome was on /dev/md128 during installation
UUID=bd6b54be-12ca-479d-879e-8d788fa9d039 /genome         ext4    defaults        0       2
# swap was on /dev/md126 during installation
UUID=0470550a-6e92-485d-ad41-665c3f313287 none            swap    sw              0       0

~ # cat /proc/mdstat

Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md128 : active raid5 sde1[2] sdc1[0]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U_U_]

md1 : active raid1 sdb5[1] sda5[0]
      203051904 blocks super 1.2 [2/2] [UU]

md0 : active raid0 sdb1[1] sda1[0]
      93748224 blocks super 1.2 512k chunks

unused devices: <none>
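
The `[4/2] [U_U_]` field in the mdstat output above is the key detail: md128 expects 4 members and has only 2 active, while RAID5 needs at least 3 of 4 to serve data. A minimal sketch of checking for this automatically follows. The function name `degraded_arrays` is hypothetical, and the parser only handles the layout shown here, not every form /proc/mdstat can take.

```python
import re

# The /proc/mdstat text from the post, embedded so the sketch is self-contained.
MDSTAT = """\
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md128 : active raid5 sde1[2] sdc1[0]
      8790400512 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/2] [U_U_]

md1 : active raid1 sdb5[1] sda5[0]
      203051904 blocks super 1.2 [2/2] [UU]

md0 : active raid0 sdb1[1] sda1[0]
      93748224 blocks super 1.2 512k chunks

unused devices: <none>
"""

def degraded_arrays(text):
    """Return {array: (expected, active, slot_map)} for arrays missing members."""
    result = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"^(md\d+) : ", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[4/2] [U_U_]": expected/active counts, then one flag per slot.
        status = re.search(r"\[(\d+)/(\d+)\] \[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = (expected, active, status.group(3))
    return result

print(degraded_arrays(MDSTAT))
```

On a live system the same function could be fed the contents of `/proc/mdstat` instead of the embedded string. md0 is skipped because RAID0 has no redundancy and prints no member-status field.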

~ # mdadm --detail --scan

ARRAY /dev/md/0 metadata=1.2 name=crick:0 UUID=d0d08eab:a7e54021:25973acb:10dd5fba
ARRAY /dev/md/1 metadata=1.2 name=crick:1 UUID=e2774945:5b3ee3eb:2ad9390f:35153b82
ARRAY /dev/md/128 metadata=1.2 name=crick:128 UUID=345cb755:0ae1c919:d98a45ca:1baf3364

This is where my understanding of Linux isn't the best. From what I can tell, the array superblock sees the UUID for the sdf it was built on, but (for some reason) sdf isn't being detected. This means the RAID array is broken, and so Ubuntu sets the file system to RO as a safety precaution. I can still open, edit and create files on / without any problem, and the OS still runs perfectly; I just can't access the data on md128. I haven't yet shut the machine down to pull the drive and see whether it is faulty.

About a week ago the same problem (a random switch to RO) occurred, only that time it hit md1 (the SSDs). When that happened the superblock on md128 was wiped. I got lucky: I reinstalled the OS and had to rebuild the RAID using the existing partitions (i.e. I didn't format the partitions), and I was able to recover 99.9% of the data (the files that were being accessed at the time were damaged).

  • Is the current problem with md128 one that I have only managed to band-aid until now, or is it a sign of some deeper issue that I should be worried about?
  • What is the best way to move forward: replace sdf and restripe into md128, replace sdf and tear down / rebuild md128, or something else?

EDIT: Sorry about the original post; I'm new to the forum and hadn't read the formatting instructions. Fixing it now.

    
by Seth_m55 26.08.2013 / 19:02

1 answer

Most likely sdf died some time ago and you didn't notice. Now sdd has some bad sectors. The failed reads cause the file system to switch to read-only. You need to replace the failed drive and fsck the file system.

    
by psusi 27.08.2013 / 15:37
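
The reasoning in the answer rests on how single-parity RAID works: one missing member can be recomputed from the others, but a second failure on top of that cannot. The toy sketch below illustrates this with XOR parity over one stripe. It is a simplified model with made-up 4-byte chunks and a hypothetical helper `xor_blocks`, not md's actual on-disk layout (which rotates parity across members and uses 512k chunks here).

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe across 4 members: 3 data chunks plus 1 parity chunk.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)
stripe = data + [parity]

# One member missing: XOR of the three survivors reproduces the lost chunk.
lost = 1
survivors = [chunk for i, chunk in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors)
print("single loss recoverable:", rebuilt == stripe[lost])

# Two members missing: the survivors only yield the XOR of the two lost chunks,
# which is not enough to separate them again.
lost_pair = (1, 3)
survivors = [chunk for i, chunk in enumerate(stripe) if i not in lost_pair]
combined = xor_blocks(survivors)
print("double loss leaves only their XOR:",
      combined == xor_blocks([stripe[1], stripe[3]]))
```

This matches the log in the question: once sdd1 was disabled the kernel reported "Operation continuing on 2 devices", one fewer than the 3 a 4-member RAID5 needs, and the writes to md128 started failing.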