I have Solaris 11 (+ the latest SRU) running on an HP DL385 G7 (attached to a P2000 storage array with 30 disks; the disks are presented as separate RAID0 units, but I'm using ZFS RAIDZ1 on top of them), acting as our file server. Every couple of days the system freezes and has to be rebooted. There is nothing unusual in the logs or in fmdump.
I ended up with a cron job that dumps various statistics to disk every 2 minutes; they show a load spike and a drop in free memory shortly before the crash.
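For reference, a minimal sketch of the kind of collector the cron job runs (the output path and the exact top invocation are assumptions; the real script differs slightly):

#!/bin/sh
# Collector sketch: one timestamped snapshot per run, invoked from cron every 2 minutes.
TS=`date '+%y%m%d%H%M%S'`
OUT=/var/adm/stats/top.$TS
top -b > $OUT 2>&1                  # batch-mode top snapshot (load avg, memory, Kernel: line)
echo "::memstat" | mdb -k >> $OUT   # kernel memory breakdown, including "ZFS File Data"

Grepping the load averages out of the collected files: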
$ grep load top.120512*
top.120512063601:last pid: 21751; load avg: 0.61, 2.30, 2.93; up 4+17:03:45 06:36:02
top.120512063800:last pid: 21765; load avg: 0.27, 1.62, 2.59; up 4+17:05:44 06:38:01
top.120512064000:last pid: 21779; load avg: 0.29, 1.17, 2.30; up 4+17:07:45 06:40:02
top.120512064200:last pid: 21793; load avg: 0.56, 0.97, 2.09; up 4+17:09:44 06:42:01
top.120512064400:last pid: 21807; load avg: 0.20, 0.71, 1.85; up 4+17:11:45 06:44:02
top.120512064600:last pid: 21821; load avg: 0.60, 0.66, 1.68; up 4+17:13:45 06:46:02
top.120512064800:last pid: 21835; load avg: 1.25, 0.87, 1.64; up 4+17:15:44 06:48:01
top.120512065000:last pid: 21851; load avg: 4.77, 2.35, 2.10; up 4+17:17:45 06:50:02
top.120512065200:last pid: 21864; load avg: 5.10, 3.20, 2.45; up 4+17:19:45 06:52:02
top.120512065400:last pid: 21878; load avg: 5.81, 4.16, 2.91; up 4+17:21:44 06:54:01
top.120512065601:last pid: 21892; load avg: 5.26, 4.53, 3.20; up 4+17:23:45 06:56:02
top.120512065800:last pid: 21906; load avg: 5.36, 4.79, 3.46; up 4+17:25:45 06:58:02
// here was the crash
top.120512163801:last pid: 701; load avg: 1.18, 0.29, 0.10; up 0+00:01:16 16:38:02
top.120512164000:last pid: 1456; load avg: 0.36, 0.33, 0.14; up 0+00:03:16 16:40:02
top.120512164200:last pid: 1470; load avg: 0.14, 0.26, 0.14; up 0+00:05:16 16:42:02
top.120512164400:last pid: 1499; load avg: 0.39, 0.35, 0.19; up 0+00:07:15 16:44:01
top.120512164600:last pid: 1513; load avg: 0.10, 0.26, 0.17; up 0+00:09:16 16:46:02
Or grep Memory:
top.120512064600:Memory: 16G phys mem, 2031M free mem, 2048M total swap, 2048M free swap
top.120512064800:Memory: 16G phys mem, 2047M free mem, 2048M total swap, 2048M free swap
top.120512065000:Memory: 16G phys mem, 1443M free mem, 2048M total swap, 2048M free swap
top.120512065200:Memory: 16G phys mem, 1313M free mem, 2048M total swap, 2048M free swap
top.120512065400:Memory: 16G phys mem, 892M free mem, 2048M total swap, 2048M free swap
top.120512065601:Memory: 16G phys mem, 418M free mem, 2048M total swap, 2048M free swap
top.120512065800:Memory: 16G phys mem, 294M free mem, 2048M total swap, 2044M free swap
// restart
top.120512163801:Memory: 16G phys mem, 14G free mem, 2048M total swap, 2048M free swap
Or grep trap:
top.120512064800:Kernel: 50542 ctxsw, 13 trap, 113144 intr, 850 syscall, 9 flt
top.120512065000:Kernel: 76357 ctxsw, 9 trap, 199203 intr, 399 syscall, 9 flt
top.120512065200:Kernel: 72294 ctxsw, 13 trap, 254779 intr, 481 syscall, 9 flt
top.120512065400:Kernel: 87671 ctxsw, 11 trap, 256663 intr, 401 syscall, 11 flt
top.120512065601:Kernel: 72696 ctxsw, 11 trap, 281765 intr, 402 syscall, 11 flt
top.120512065800:Kernel: 77316 ctxsw, 458 trap, 272329 intr, 412 syscall, 450 flt
// restarted here
top.120512163801:Kernel: 1570 ctxsw, 10 trap, 2380 intr, 1741 syscall, 9 flt
This comes from echo "::memstat" | mdb -k (the columns are pages, MB, and percentage of total memory):
top.120512064800:ZFS File Data 2898132 11320 69%
top.120512065000:ZFS File Data 3039466 11872 73%
top.120512065200:ZFS File Data 3081508 12037 74%
top.120512065400:ZFS File Data 3188175 12453 76%
top.120512065601:ZFS File Data 3309405 12927 79%
top.120512065800:ZFS File Data 3393392 13255 81%
// restart
top.120512163801:ZFS File Data 70094 273 2%
top.120512164000:ZFS File Data 93547 365 2%
top.120512164200:ZFS File Data 197571 771 5%
top.120512164400:ZFS File Data 1175965 4593 28%
top.120512164600:ZFS File Data 1205865 4710 29%
top.120512164800:ZFS File Data 2537072 9910 61%
The ZFS pool is not corrupted, the actual load is below average (compared to our other file servers), and the hardware also seems to be fine.
What do you think could be causing this behavior? What other statistics should I collect?
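In case it helps, one more statistic I can easily add to the cron job is the raw ARC size from kstat; to the best of my knowledge these statistic names exist on Solaris 11, but treat the exact names as an assumption:

# current ARC size, target size and configured maximum, in bytes
kstat -p zfs:0:arcstat:size zfs:0:arcstat:c zfs:0:arcstat:c_max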
Edit:
$ zpool status -v
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Wed Apr 25 14:40:49 2012
config:

        NAME          STATE     READ WRITE CKSUM
        rpool         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            c3t0d0s0  ONLINE       0     0     0
            c3t1d0s0  ONLINE       0     0     0

errors: No known data errors

  pool: volume
 state: ONLINE
  scan: resilvered 285G in 2h57m with 0 errors on Mon May 7 22:01:38 2012
config:

        NAME                                       STATE     READ WRITE CKSUM
        volume                                     ONLINE       0     0     0
          raidz1-0                                 ONLINE       0     0     0
            c0t600C0FF00012FBB1F749674F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DDDA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1DEA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D7CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1EAA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DEBA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1F0A1154F01000000d0  ONLINE       0     0     0
          raidz1-1                                 ONLINE       0     0     0
            c0t600C0FF00012FC7DFCA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB1FDA1154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D08A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB109A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D14A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB115A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D20A2154F01000000d0  ONLINE       0     0     0
          raidz1-2                                 ONLINE       0     0     0
            c0t600C0FF00012FBB171A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D2CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB12DA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D38A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB139A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D44A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7DA3CA754F01000000d0  ONLINE       0     0     0
          raidz1-3                                 ONLINE       0     0     0
            c0t600C0FF00012FC7D50A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB151A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D5CA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB15DA2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D68A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FBB169A2154F01000000d0  ONLINE       0     0     0
            c0t600C0FF00012FC7D70A2154F01000000d0  ONLINE       0     0     0
        spares
          c0t600C0FF00012FBB1D7A1154F01000000d0    AVAIL
          c0t600C0FF000131E9277AD154F01000000d0    AVAIL

errors: No known data errors
$ zfs list
NAME                       USED  AVAIL  REFER  MOUNTPOINT
rpool                     24.4G   249G  39.5K  /rpool
rpool/ROOT                14.1G   249G    31K  legacy
rpool/ROOT/solaris        5.59M   249G  11.5G  /
rpool/ROOT/solaris-1      14.1G   249G  11.5G  /
rpool/ROOT/solaris-1/var  2.15G   249G  1.94G  /var
rpool/ROOT/solaris/var    2.71M   249G  1.29G  /var
rpool/dump                8.24G   250G  7.98G  -
rpool/export                63K   249G    32K  /export
rpool/export/home           31K   249G    31K  /export/home
rpool/swap                2.06G   249G  2.00G  -
volume                    8.77T  33.6T  6.77T  /volume
volume/gluster            33.5G  1.97T  33.5G  /volume/gluster
Edit 2:
Here is a diff of various statistics: link (left: "normal" system state, right: just one minute before the crash).