Lots of fun problems with RAID and SATA controllers

I bought a used 3ware RAID card on eBay a few months ago and set up four 1 TB Caviar drives in RAID 5. The hardware RAID managed only about 5 MB/s on writes with write caching disabled, and I don't have a UPS, so leaving the write cache on seemed like a bad idea. So I grabbed a spare 2 TB drive from work, copied my data onto it, and set up mdadm on a fresh install of Debian 6. That worked fine for about two weeks, and then I started getting read errors. mdadm reported that two drives had failed. I shut down, booted into the Debian installer, and started reading the syslog.

The first thing I saw was a bunch of these:

May 18 10:31:14 osiris kernel: [937288.579972] kswapd0: page allocation failure. order:5, mode:0x4020
May 18 10:31:14 osiris kernel: [937288.579978] Pid: 47, comm: kswapd0 Not tainted 2.6.32-5-686 #1
May 18 10:31:14 osiris kernel: [937288.579981] Call Trace:
May 18 10:31:14 osiris kernel: [937288.579994]  [] ? __alloc_pages_nodemask+0x484/0x4d9
May 18 10:31:14 osiris kernel: [937288.580000]  [] ? __get_free_pages+0xc/0x17
May 18 10:31:14 osiris kernel: [937288.580037]  [] ? __kmalloc+0x30/0x128
May 18 10:31:14 osiris kernel: [937288.580049]  [] ? pskb_expand_head+0x4f/0x157
May 18 10:31:14 osiris kernel: [937288.580059]  [] ? __pskb_pull_tail+0x3f/0x1fb
May 18 10:31:14 osiris kernel: [937288.580071]  [] ? sock_wfree+0x17/0x4b
May 18 10:31:14 osiris kernel: [937288.580084]  [] ? dev_queue_xmit+0xe4/0x38e
May 18 10:31:14 osiris kernel: [937288.580096]  [] ? neigh_resolve_output+0x1df/0x227
May 18 10:31:14 osiris kernel: [937288.580109]  [] ? ip_finish_output2+0x187/0x1c2
May 18 10:31:14 osiris kernel: [937288.580121]  [] ? ip_local_out+0x15/0x17
May 18 10:31:14 osiris kernel: [937288.580132]  [] ? ip_queue_xmit+0x31d/0x378
May 18 10:31:14 osiris kernel: [937288.580145]  [] ? bictcp_cong_avoid+0x14/0x2c9
May 18 10:31:14 osiris kernel: [937288.580157]  [] ? tcp_write_xmit+0x3e7/0x874
May 18 10:31:14 osiris kernel: [937288.580167]  [] ? tcp_ack+0x1611/0x1802
May 18 10:31:14 osiris kernel: [937288.580178]  [] ? tcp_transmit_skb+0x595/0x5cc
May 18 10:31:14 osiris kernel: [937288.580189]  [] ? tcp_write_xmit+0x7a3/0x874
May 18 10:31:14 osiris kernel: [937288.580200]  [] ? tcp_ack+0x1611/0x1802
May 18 10:31:14 osiris kernel: [937288.580210]  [] ? tcp_established_options+0x1d/0x8b
May 18 10:31:14 osiris kernel: [937288.580221]  [] ? tcp_current_mss+0x38/0x53
May 18 10:31:14 osiris kernel: [937288.580232]  [] ? __tcp_push_pending_frames+0x1e/0x50
May 18 10:31:14 osiris kernel: [937288.580243]  [] ? tcp_data_snd_check+0x1b/0xd2
May 18 10:31:14 osiris kernel: [937288.580254]  [] ? tcp_rcv_established+0xd2/0x626
May 18 10:31:14 osiris kernel: [937288.580266]  [] ? tcp_v4_do_rcv+0x15f/0x2cf
May 18 10:31:14 osiris kernel: [937288.580276]  [] ? tcp_v4_rcv+0x3d2/0x602
May 18 10:31:14 osiris kernel: [937288.580288]  [] ? ip_local_deliver_finish+0x10c/0x18c
May 18 10:31:14 osiris kernel: [937288.580299]  [] ? ip_rcv_finish+0x2c4/0x2d8
May 18 10:31:14 osiris kernel: [937288.580310]  [] ? netif_receive_skb+0x3bb/0x3d6
May 18 10:31:14 osiris kernel: [937288.580340]  [] ? e1000_clean_jumbo_rx_irq+0x4f8/0x5bb [e1000]
May 18 10:31:14 osiris kernel: [937288.580356]  [] ? e1000_clean+0x29f/0x40d [e1000]
May 18 10:31:14 osiris kernel: [937288.580370]  [] ? e1000_clean_jumbo_rx_irq+0x579/0x5bb [e1000]
May 18 10:31:14 osiris kernel: [937288.580382]  [] ? net_rx_action+0x96/0x194
May 18 10:31:14 osiris kernel: [937288.580395]  [] ? __do_softirq+0xaa/0x156
May 18 10:31:14 osiris kernel: [937288.580406]  [] ? do_softirq+0x31/0x3c
May 18 10:31:14 osiris kernel: [937288.580416]  [] ? irq_exit+0x26/0x58
May 18 10:31:14 osiris kernel: [937288.580429]  [] ? do_IRQ+0x78/0x89
May 18 10:31:14 osiris kernel: [937288.580440]  [] ? common_interrupt+0x30/0x38
May 18 10:31:14 osiris kernel: [937288.580452]  [] ? free_hot_cold_page+0x182/0x1a3
May 18 10:31:14 osiris kernel: [937288.580463]  [] ? __pagevec_free+0x4e/0x58
May 18 10:31:14 osiris kernel: [937288.580473]  [] ? release_pages+0xe7/0x124
May 18 10:31:14 osiris kernel: [937288.580484]  [] ? __pagevec_release+0x15/0x1d
May 18 10:31:14 osiris kernel: [937288.580495]  [] ? invalidate_mapping_pages+0x6a/0x98
May 18 10:31:14 osiris kernel: [937288.580505]  [] ? shrink_icache_memory+0xd7/0x1d3
May 18 10:31:14 osiris kernel: [937288.580515]  [] ? shrink_slab+0xe6/0x13f
May 18 10:31:14 osiris kernel: [937288.580525]  [] ? kswapd+0x3d8/0x54f
May 18 10:31:14 osiris kernel: [937288.580536]  [] ? isolate_pages_global+0x0/0x1bc
May 18 10:31:14 osiris kernel: [937288.580550]  [] ? autoremove_wake_function+0x0/0x2d
May 18 10:31:14 osiris kernel: [937288.580564]  [] ? complete+0x28/0x36
May 18 10:31:14 osiris kernel: [937288.580574]  [] ? kswapd+0x0/0x54f
May 18 10:31:14 osiris kernel: [937288.580584]  [] ? kthread+0x61/0x66
May 18 10:31:14 osiris kernel: [937288.580595]  [] ? kthread+0x0/0x66
May 18 10:31:14 osiris kernel: [937288.580606]  [] ? kernel_thread_helper+0x7/0x10
May 18 10:31:14 osiris kernel: [937288.580612] Mem-Info:
May 18 10:31:14 osiris kernel: [937288.580618] DMA per-cpu:
May 18 10:31:14 osiris kernel: [937288.580625] CPU    0: hi:    0, btch:   1 usd:   0
May 18 10:31:14 osiris kernel: [937288.580632] CPU    1: hi:    0, btch:   1 usd:   0
May 18 10:31:14 osiris kernel: [937288.580639] CPU    2: hi:    0, btch:   1 usd:   0
May 18 10:31:14 osiris kernel: [937288.580647] CPU    3: hi:    0, btch:   1 usd:   0
May 18 10:31:14 osiris kernel: [937288.580654] Normal per-cpu:
May 18 10:31:14 osiris kernel: [937288.580660] CPU    0: hi:  186, btch:  31 usd: 157
May 18 10:31:14 osiris kernel: [937288.580668] CPU    1: hi:  186, btch:  31 usd:  93
May 18 10:31:14 osiris kernel: [937288.580676] CPU    2: hi:  186, btch:  31 usd:  91
May 18 10:31:14 osiris kernel: [937288.580683] CPU    3: hi:  186, btch:  31 usd: 167
May 18 10:31:14 osiris kernel: [937288.580690] HighMem per-cpu:
May 18 10:31:14 osiris kernel: [937288.580697] CPU    0: hi:  186, btch:  31 usd: 155
May 18 10:31:14 osiris kernel: [937288.580704] CPU    1: hi:  186, btch:  31 usd: 173
May 18 10:31:14 osiris kernel: [937288.580711] CPU    2: hi:  186, btch:  31 usd:  85
May 18 10:31:14 osiris kernel: [937288.580718] CPU    3: hi:  186, btch:  31 usd: 165
May 18 10:31:14 osiris kernel: [937288.580730] active_anon:22503 inactive_anon:10669 isolated_anon:0
May 18 10:31:14 osiris kernel: [937288.580733]  active_file:25150 inactive_file:287773 isolated_file:0
May 18 10:31:14 osiris kernel: [937288.580737]  unevictable:0 dirty:0 writeback:52 unstable:0
May 18 10:31:14 osiris kernel: [937288.580741]  free:20455 slab_reclaimable:8509 slab_unreclaimable:7454
May 18 10:31:14 osiris kernel: [937288.580744]  mapped:5500 shmem:1407 pagetables:627 bounce:0
May 18 10:31:14 osiris kernel: [937288.580759] DMA free:3588kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:1116kB inactive_file:288kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15784kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:176kB slab_unreclaimable:528kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 18 10:31:14 osiris kernel: [937288.580781] lowmem_reserve[]: 0 861 2015 2015
May 18 10:31:14 osiris kernel: [937288.580810] Normal free:42680kB min:3720kB low:4648kB high:5580kB active_anon:12kB inactive_anon:780kB active_file:46720kB inactive_file:188384kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:881880kB mlocked:0kB dirty:0kB writeback:0kB mapped:76kB shmem:0kB slab_reclaimable:33860kB slab_unreclaimable:29288kB kernel_stack:1408kB pagetables:124kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 18 10:31:14 osiris kernel: [937288.580833] lowmem_reserve[]: 0 0 9234 9234
May 18 10:31:14 osiris kernel: [937288.580862] HighMem free:35552kB min:512kB low:1756kB high:3004kB active_anon:90000kB inactive_anon:41896kB active_file:52764kB inactive_file:962420kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1182056kB mlocked:0kB dirty:0kB writeback:208kB mapped:21924kB shmem:5628kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:2384kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 18 10:31:14 osiris kernel: [937288.580885] lowmem_reserve[]: 0 0 0 0
May 18 10:31:14 osiris kernel: [937288.580908] DMA: 13*4kB 4*8kB 13*16kB 1*32kB 1*64kB 1*128kB 2*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3588kB
May 18 10:31:14 osiris kernel: [937288.580956] Normal: 8286*4kB 356*8kB 85*16kB 160*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42600kB
May 18 10:31:14 osiris kernel: [937288.581007] HighMem: 1248*4kB 2476*8kB 620*16kB 8*32kB 5*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 35552kB
May 18 10:31:14 osiris kernel: [937288.581062] 315224 total pagecache pages
May 18 10:31:14 osiris kernel: [937288.581068] 947 pages in swap cache
May 18 10:31:14 osiris kernel: [937288.581075] Swap cache stats: add 7675, delete 6728, find 14833/15430
May 18 10:31:14 osiris kernel: [937288.581082] Free swap  = 3900340kB
May 18 10:31:14 osiris kernel: [937288.581089] Total swap = 3905528kB
May 18 10:31:14 osiris kernel: [937288.594263] 524144 pages RAM
May 18 10:31:14 osiris kernel: [937288.594268] 297858 pages HighMem
May 18 10:31:14 osiris kernel: [937288.594270] 5625 pages reserved
May 18 10:31:14 osiris kernel: [937288.594272] 123757 pages shared
May 18 10:31:14 osiris kernel: [937288.594275] 454801 pages non-shared

It looks like this might be a kernel bug that doesn't actually cause any problems.
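For reference, the "order:5" in the first line of that trace means the kernel wanted 2^5 contiguous pages, which is 128 kB with 4 kB pages, while the "Normal:" free-list line in the same dump shows no free block of 128 kB or larger. The sketch below is a minimal way to read such a line; the function names are made up for illustration, and the 4 kB page size is an assumption taken from the smallest bucket in the log.

```python
import re

PAGE_KB = 4  # assumed page size, taken from the smallest "4kB" bucket in the log


def parse_buddy_line(line):
    """Return {block_size_kB: free_block_count} from a free-list line of the dump."""
    # The trailing "= 42600kB" total has no '*', so it is not matched.
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def can_satisfy(order, buckets):
    """True if at least one free block holds 2**order contiguous pages."""
    needed_kb = PAGE_KB * (2 ** order)
    return any(count > 0 for size, count in buckets.items() if size >= needed_kb)


# The "Normal:" line copied from the log above.
normal = ("Normal: 8286*4kB 356*8kB 85*16kB 160*32kB 2*64kB 0*128kB "
          "0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 42600kB")

buckets = parse_buddy_line(normal)
total_kb = sum(size * count for size, count in buckets.items())

print(total_kb)                 # sum of the buckets, matches the 42600kB in the log
print(can_satisfy(5, buckets))  # order 5 = 128 kB contiguous
print(can_satisfy(4, buckets))  # order 4 = 64 kB contiguous
```

Read this way, the zone had roughly 42 MB free but only in small pieces, which fits a failed large contiguous allocation rather than a machine that was actually out of memory.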

The next event of note was this:

May 18 13:13:03 osiris kernel: [946997.132469] ata4: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
May 18 13:13:03 osiris kernel: [946997.132507] ata4: SError: { PHYRdyChg }
May 18 13:13:03 osiris kernel: [946997.132536] ata4: hard resetting link
May 18 13:13:06 osiris kernel: [947000.544016] ata4: COMRESET failed (errno=-19)
May 18 13:13:06 osiris kernel: [947000.544044] ata4: reset failed (errno=-19), retrying in 7 secs
May 18 13:13:08 osiris kernel: [947002.255353] 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0019): Drive removed:port=2.
May 18 13:13:08 osiris kernel: [947002.255525] 3w-9xxx: scsi0: AEN: WARNING (0x04:0x0019): Drive removed:port=3.
May 18 13:13:13 osiris kernel: [947007.132027] ata4: hard resetting link
May 18 13:13:18 osiris kernel: [947012.040156] 3w-9xxx: scsi0: AEN: INFO (0x04:0x001A): Drive inserted:port=3.
May 18 13:13:19 osiris kernel: [947012.896015] ata4: link is slow to respond, please be patient (ready=-19)
May 18 13:13:20 osiris kernel: [947013.913066] 3w-9xxx: scsi0: AEN: INFO (0x04:0x001A): Drive inserted:port=2.
May 18 13:13:20 osiris kernel: [947014.352032] ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
May 18 13:13:20 osiris kernel: [947014.377670] ata4.00: configured for UDMA/100
May 18 13:13:20 osiris kernel: [947014.377680] ata4: EH complete

Nobody was physically touching the machine at the time, so I took this to mean a failing controller, which is understandable given that it was relatively old and bought used.
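The excerpt above mixes messages from two sources: the ata4 lines and the 3w-9xxx driver lines. A minimal sketch for lining them up by kernel timestamp is shown below; the function name and the choice of which prefixes to keep are assumptions for illustration, and the sample lines are copied from the log above.

```python
import re

# Captures the kernel uptime stamp and the message that follows it.
STAMP_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\] (.*)$")


def link_events(lines):
    """Return sorted (uptime_seconds, source, message) tuples for ata*/3w-9xxx lines."""
    events = []
    for line in lines:
        m = STAMP_RE.search(line)
        if not m:
            continue
        uptime, msg = float(m.group(1)), m.group(2)
        source = msg.split(":", 1)[0]  # e.g. "ata4", "ata4.00" or "3w-9xxx"
        if source.startswith("ata") or source == "3w-9xxx":
            events.append((uptime, source, msg))
    return sorted(events)


sample = [
    "May 18 13:13:03 osiris kernel: [946997.132469] ata4: exception Emask 0x10 "
    "SAct 0x0 SErr 0x10000 action 0xe frozen",
    "May 18 13:13:08 osiris kernel: [947002.255353] 3w-9xxx: scsi0: AEN: WARNING "
    "(0x04:0x0019): Drive removed:port=2.",
    "May 18 13:13:20 osiris kernel: [947014.352032] ata4: SATA link up 1.5 Gbps "
    "(SStatus 113 SControl 310)",
]

events = link_events(sample)
first_ata = next(t for t, src, _ in events if src.startswith("ata"))
first_3w = next(t for t, src, _ in events if src == "3w-9xxx")
gap = first_3w - first_ata

print(f"{gap:.1f} s between the first ata event and the first 3w-9xxx event")
```

Run over a whole syslog instead of three sample lines, the same function would show whether link resets and drive removals tend to cluster in time.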

I had a new controller shipped overnight and installed it the next day. SMART reported no errors on any of the drives, and all the short tests passed. So I ran mdadm --assemble --force, which cleared the failed flag on all the drives and brought the array up without trouble. fsck says the filesystem is clean, and it mounts without problems.

So I say "great!" and reboot. The machine jumps straight to the netboot screen, apparently ignoring the bootable hard drive. (Note that /boot is NOT on the mdadm array; the array holds only /home.)

This is where I'm stuck. I have no idea why the BIOS won't boot from this drive. No GRUB or anything. It's very frustrating to have a fully working system in a chroot environment under an installer and then not be able to boot it.
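One low-level thing that can be checked from the installer environment, assuming the boot disk uses a classic MBR layout (an assumption, nothing above confirms it), is whether its first sector still ends with the 0x55 0xAA boot signature that a BIOS typically looks for. The sketch below is only illustrative; the function names and the device path in the usage note are hypothetical.

```python
def has_mbr_boot_signature(sector: bytes) -> bool:
    """True if a 512-byte first sector ends with the 0x55 0xAA boot signature."""
    return len(sector) >= 512 and sector[510:512] == b"\x55\xaa"


def read_first_sector(device_path: str) -> bytes:
    """Read the first 512 bytes of a block device or disk image (needs read access)."""
    with open(device_path, "rb") as dev:
        return dev.read(512)


# Self-contained demonstration on in-memory sectors rather than a real disk.
blank = bytes(512)                    # all zeros, no signature
signed = bytes(510) + b"\x55\xaa"     # zeros followed by the signature

print(has_mbr_boot_signature(blank))
print(has_mbr_boot_signature(signed))
```

On the real machine the call would look like has_mbr_boot_signature(read_first_sector("/dev/sda")), with /dev/sda standing in for whatever the boot disk is actually called. A present signature says nothing about whether the boot loader code is intact or whether the BIOS boot order points at this disk; it only rules out one failure mode.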

    
by Alex S 21.05.2011 / 21:42

0 answers