I have a server running scientific computing applications. It runs SLES 11 SP3.
Here are some of my observations. Running
ps aux
hangs. However, the top command
works fine, and here is its output:
top - 21:02:49 up 403 days, 5:36, 5 users, load average: 21.01, 20.31, 18.79
Tasks: 271 total, 6 running, 241 sleeping, 24 stopped, 0 zombie
Cpu(s): 0.0%us, 6.3%sy, 0.0%ni, 49.9%id, 43.8%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 258428M total, 156884M used, 101543M free, 0M buffers
Swap: 7999M total, 2588M used, 5411M free, 151700M cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
39993 root 20 0 9064 1276 824 R 0 0.0 0:00.43 top
1 root 20 0 10548 68 36 S 0 0.0 5:24.51 init
2 root 20 0 0 0 0 S 0 0.0 0:06.20 kthreadd
3 root 20 0 0 0 0 S 0 0.0 5:17.22 ksoftirqd/0
6 root RT 0 0 0 0 S 0 0.0 51:15.16 migration/0
8 root RT 0 0 0 0 S 0 0.0 4:39.14 migration/1
10 root 20 0 0 0 0 S 0 0.0 1:30.39 ksoftirqd/1
13 root RT 0 0 0 0 S 0 0.0 1:34.02 migration/2
15 root 20 0 0 0 0 S 0 0.0 0:16.03 ksoftirqd/2
17 root RT 0 0 0 0 S 0 0.0 1:19.64 migration/3
19 root 20 0 0 0 0 S 0 0.0 0:13.99 ksoftirqd/3
21 root RT 0 0 0 0 S 0 0.0 1:44.44 migration/4
23 root 20 0 0 0 0 S 0 0.0 0:18.40 ksoftirqd/4
25 root RT 0 0 0 0 S 0 0.0 1:42.13 migration/5
26 root 20 0 0 0 0 S 0 0.0 14:56.25 kworker/5:0
27 root 20 0 0 0 0 S 0 0.0 0:18.18 ksoftirqd/5
29 root RT 0 0 0 0 S 0 0.0 1:43.97 migration/6
30 root 20 0 0 0 0 S 0 0.0 12:30.00 kworker/6:0
31 root 20 0 0 0 0 S 0 0.0 0:15.76 ksoftirqd/6
33 root RT 0 0 0 0 S 0 0.0 1:41.60 migration/7
35 root 20 0 0 0 0 S 0 0.0 0:12.94 ksoftirqd/7
37 root RT 0 0 0 0 S 0 0.0 5:13.03 migration/8
39 root 20 0 0 0 0 S 0 0.0 1:05.10 ksoftirqd/8
41 root RT 0 0 0 0 R 0 0.0 3:35.18 migration/9
43 root 20 0 0 0 0 S 0 0.0 0:45.77 ksoftirqd/9
44 root RT 0 0 0 0 R 0 0.0 2:21.35 watchdog/9
45 root RT 0 0 0 0 S 0 0.0 3:14.10 migration/10
46 root 20 0 0 0 0 S 0 0.0 25:52.76 kworker/10:0
47 root 20 0 0 0 0 S 0 0.0 0:29.33 ksoftirqd/10
48 root RT 0 0 0 0 S 0 0.0 2:11.92 watchdog/10
49 root RT 0 0 0 0 S 0 0.0 3:03.78 migration/11
51 root 20 0 0 0 0 S 0 0.0 0:29.36 ksoftirqd/11
52 root RT 0 0 0 0 S 0 0.0 2:09.54 watchdog/11
53 root RT 0 0 0 0 S 0 0.0 3:13.56 migration/12
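The top summary shows a load average around 21 with only 6 running tasks and 43.8%wa. On Linux the load average also counts tasks in uninterruptible sleep (state D), so some of the load may come from blocked tasks rather than busy ones. Below is a minimal sketch for listing them without ps, assuming that /proc/<pid>/stat is still readable (plausible here, since top itself works). The helper name list_by_state and its proc_root parameter are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# list_by_state [PROC_ROOT] [STATE]
# Print "pid state comm" for every process whose state letter in
# PROC_ROOT/<pid>/stat equals STATE (default: D, uninterruptible sleep).
# Hypothetical helper; PROC_ROOT is a parameter only so it can be tested.
list_by_state() {
    proc_root=${1:-/proc}
    want=${2:-D}
    for stat_file in "$proc_root"/[0-9]*/stat; do
        [ -r "$stat_file" ] || continue
        # A process may exit between the glob expansion and the read.
        line=$(cat "$stat_file" 2>/dev/null) || continue
        pid=${line%% *}
        # comm is wrapped in parentheses and may itself contain spaces
        # or ')', so cut at the first '(' and the last ')'.
        comm=${line#*'('}
        comm=${comm%')'*}
        # The state letter is the first field after the last ") ".
        rest=${line##*') '}
        state=${rest%% *}
        if [ "$state" = "$want" ]; then
            printf '%s %s %s\n' "$pid" "$state" "$comm"
        fi
    done
}

list_by_state /proc D
```

Calling list_by_state /proc T would list the stopped tasks instead, which top counts as 24 here.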
More details about the server:
# cat /etc/SuSE-release
SUSE Linux Enterprise Server 11 (x86_64)
VERSION = 11
PATCHLEVEL = 3
# uname -a
Linux n049 3.0.76-0.11-default #1 SMP Fri Jun 14 08:21:43 UTC 2013 (ccab990) x86_64 x86_64 x86_64 GNU/Linux
I tried writing to procfs with:
echo 0 > /proc/sys/kernel/nmi_watchdog
It also hangs with no response.
I would like to know what could cause this kind of problem.
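One way to narrow this down is to test individual procfs reads without risking the current shell. The sketch below starts the read in the background and polls it, so a reader that blocks is simply left behind. It avoids timeout(1) because a reader stuck in uninterruptible sleep cannot be killed and timeout may then not return either. The helper name probe_read is hypothetical, and the idea that ps aux hangs on a per-process file such as /proc/<pid>/cmdline is an assumption, not something the output above proves.

```shell
#!/bin/sh
# probe_read FILE [SECONDS]
# Returns 0 if a background "cat FILE" finished (successfully or with an
# error) within the limit, 1 if it is still running after SECONDS.
# Hypothetical helper; the PID of the reader is left in $probe_pid.
probe_read() {
    file=$1
    limit=${2:-2}
    cat "$file" >/dev/null 2>&1 &
    probe_pid=$!
    waited=0
    # kill -0 only checks whether the reader still exists.
    while kill -0 "$probe_pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            return 1    # still blocked; do not wait for it
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example: probe a single file (assumed path; substitute a PID of interest).
if probe_read /proc/self/cmdline 2; then
    echo "read finished"
else
    echo "read still blocked (reader pid $probe_pid)"
fi
```

Running it over the PIDs that show up in state D first keeps the number of stuck background readers small.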
EDIT: Following @Mat's suggestion, I have also posted the output of the following commands:
iostat
Linux 3.0.76-0.11-default (n049) 03/01/2016 _x86_64_
avg-cpu: %user %nice %system %iowait %steal %idle
79.98 0.00 0.35 0.25 0.00 19.41
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
sda 2.46 38.15 845.71 1329385997 29467274973
nfsstat
Client rpc stats:
calls retrans authrefrsh
544779536 157 544901831
Client nfs v3:
null getattr setattr lookup access readlink
0 0% 296447353 54% 17723933 3% 6435727 1% 44311228 8% 204 0%
read write create mkdir symlink mknod
47603223 8% 122630268 22% 750048 0% 45661 0% 4 0% 0 0%
remove rmdir rename link readdir readdirplus
509509 0% 42896 0% 5249 0% 198 0% 0 0% 856848 0%
fsstat fsinfo pathconf commit
15677 0% 22716 0% 11358 0% 7380246 1%
panfs_stat / projects /
opstats timestamp: 1456830706:536977000
PanFS Client Exported Opstats
callback__breaks 806486
callback__break_all 4
callback__cancels 8389
ioctl__getattr 0
ioctl__setattr 0
op__device_create 30
op__dir_create 72765
op__dir_delete 3389
op__dir_fmlookup 2913280
op__dir_lookup 4415755
op__dir_fmreaddir 1457842
op__file_create 1423833
op__file_link_create 0
op__file_delete 867145
op__file_rename 59860
op__file_silly_rename 4453
op__getattr_total 36999003848
op__ioctl 356581655
op__llapi_sync 120384107
op__read 552594596
op__read__total_bytes 14363262382940
op__sync 501053283
op__sync__total_bytes 7978169577178
op__symlink_create 0
op__symlink_follow 3125353
op__symlink_read 3550281
op__setattr 65362728
op__write 30887557434
op__write__total_bytes 15818900301081
op__write_retried 0
op__writepage 411
PanFS Syscall Opstats
close suc 262891996 / unsuc 396 / started 262892393 longest 37:592322933 / 0:000578105 avg 0:000151250 / 0:000061983
create suc 1423829 / unsuc 5 / started 1423834 longest 185:525407516 / 0:007318163 avg 0:009095410 / 0:002275271
fsync suc 425 / unsuc 0 / started 425 longest 0:961370596 / 0:000000000 avg 0:070172226 / 0:000000000
getattr suc 491151434 / unsuc 1 / started 491151435 longest 48:897901802 / 0:039782489 avg 0:000001457 / 0:039782489
getxattr suc 6516 / unsuc 9640 / started 16156 longest 0:000270379 / 0:000034562 avg 0:000006670 / 0:000000955
ioctl suc 1 / unsuc 356582148 / started 356582150 longest 0:000008650 / 0:096541018 avg 0:000008650 / 0:000000963
link suc 0 / unsuc 0 / started 0 longest 0:000000000 / 0:000000000 avg 0:000000000 / 0:000000000
llseek suc 741005516 / unsuc 1021903 / started 742027419 longest 0:279553783 / 0:000021232 avg 0:000000451 / 0:000000282
lock suc 1452 / unsuc 0 / started 1452 longest 0:016063605 / 0:000000000 avg 0:001258309 / 0:000000000
lookup suc 4415772 / unsuc 0 / started 4415772 longest 139:467332687 / 0:000000000 avg 0:010307139 / 0:000000000
mkdir suc 72765 / unsuc 0 / started 72765 longest 1:101200148 / 0:000000000 avg 0:003296509 / 0:000000000
mknod suc 30 / unsuc 0 / started 30 longest 0:159658010 / 0:000000000 avg 0:021141579 / 0:000000000
mmap suc 10953869 / unsuc 0 / started 10953869 longest 0:011133309 / 0:000000000 avg 0:000000688 / 0:000000000
open suc 262892479 / unsuc 16 / started 262892495 longest 244:019141374 / 0:171940090 avg 0:000008309 / 0:079117046
permission suc 4479767553 / unsuc 1975542 / started 4481743095 longest 148:505477264 / 2:209368229 avg 0:000001927 / 0:000035222
put_super suc 5464 / unsuc 0 / started 5464 longest 0:001834006 / 0:000000000 avg 0:000040821 / 0:000000000
read suc 552597080 / unsuc 1 / started 552597081 longest 193:103296270 / 0:012257444 avg 0:000047965 / 0:012257444
readdir suc 1664834 / unsuc 10 / started 1664844 longest 3:179359832 / 1:464181024 avg 0:003504540 / 0:155413678
rename suc 55407 / unsuc 0 / started 55407 longest 20:462678110 / 0:000000000 avg 0:008082163 / 0:000000000
rmdir suc 3153 / unsuc 236 / started 3389 longest 0:433589477 / 0:000464708 avg 0:002029711 / 0:000313255
setattr suc 65373798 / unsuc 0 / started 65373798 longest 243:868835310 / 0:000000000 avg 0:004721321 / 0:000000000
setxattr suc 0 / unsuc 0 / started 0 longest 0:000000000 / 0:000000000 avg 0:000000000 / 0:000000000
statfs suc 214 / unsuc 0 / started 214 longest 0:012208633 / 0:000000000 avg 0:000528460 / 0:000000000
symlink suc 0 / unsuc 0 / started 0 longest 0:000000000 / 0:000000000 avg 0:000000000 / 0:000000000
unlink suc 870893 / unsuc 0 / started 870893 longest 14:820432065 / 0:000000000 avg 0:001506087 / 0:000000000
vfs_admit suc 0 / unsuc 0 / started 0 longest 0:000000000 / 0:000000000 avg 0:000000000 / 0:000000000
write suc 30888837091 / unsuc 0 / started 30888838130 longest 1041:545276662 / 0:000000000 avg 0:000010204 / 0:000000000
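The "longest" column above is easier to scan when sorted. The following is a rough sketch that ranks the syscall lines by their first "longest" value. It assumes that value is seconds:nanoseconds for successful calls, which the panfs_stat output itself does not state, and the function name slowest_ops is hypothetical.

```shell
#!/bin/sh
# slowest_ops [N]
# Read "PanFS Syscall Opstats" lines on stdin and print the N operations
# with the largest first "longest" value as "<whole seconds> <operation>".
# Assumption: the value is seconds:nanoseconds for successful calls.
slowest_ops() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "longest") {
                split($(i + 1), parts, ":")
                print parts[1] + 0, $1
                break
            }
        }
    }' | sort -rn | head -n "${1:-5}"
}

# Four lines copied from the output above.
slowest_ops 3 <<'EOF'
fsync suc 425 / unsuc 0 / started 425 longest 0:961370596 / 0:000000000 avg 0:070172226 / 0:000000000
open suc 262892479 / unsuc 16 / started 262892495 longest 244:019141374 / 0:171940090 avg 0:000008309 / 0:079117046
setattr suc 65373798 / unsuc 0 / started 65373798 longest 243:868835310 / 0:000000000 avg 0:004721321 / 0:000000000
write suc 30888837091 / unsuc 0 / started 30888838130 longest 1041:545276662 / 0:000000000 avg 0:000010204 / 0:000000000
EOF
```

Under that assumption, write, open and setattr each had at least one call that took several minutes, even though their averages are small.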
sar -n DEV
Average: lo 0.03 0.03 0.00 0.00 0.00 0.00 0.00
Average: eth0 1.34 0.53 0.16 0.10 0.00 0.00 0.01
Average: eth1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Average: ib0 1.05 5.14 0.15 0.81 0.00 0.00 0.00