The nodes in our Hadoop cluster run Red Hat 5.3 with kernel 2.6.18-194.17.4 (an old kernel version). We found some hosts at 100% CPU utilization; in particular, every CPU core is at 100% system time (%sy):
top - 20:56:21 up 340 days, 22:28, 1 user, load average: 2297.16, 2298.69, 2298.88
Tasks: 17923 total, 132 running, 17753 sleeping, 0 stopped, 38 zombie
Cpu(s): 0.2%us, 99.7%sy, 0.1%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 35840000k total, 33995836k used, 1844164k free, 2432312k buffers
Swap: 0k total, 0k used, 0k free, 12193444k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3362 eo 18 0 0 0 0 Z 10.0 0.0 101:32.83 java <defunct>
12818 eo 22 0 3896m 1.2g 18m S 7.4 3.6 728:05.05 java
21396 qifei 19 0 26240 13m 812 R 6.1 0.0 9:48.80 top
1425 eo 18 0 3632m 1.0g 26m D 4.2 3.0 42:11.92 java
1398 eo 15 0 0 0 0 Z 4.2 0.0 41:09.95 java <defunct>
1595 eo 18 0 0 0 0 Z 3.8 0.0 41:11.94 java <defunct>
6079 root 25 0 93744 19m 3004 R 3.7 0.1 20:34.63 apolloHostComma
6254 root 25 0 8068 456 380 R 3.7 0.0 20:28.19 date
2671 root 25 0 25004 3996 1404 R 2.5 0.0 265:33.27 apolloHostComma
4573 root 25 0 23420 2352 1376 R 2.5 0.0 20:10.33 apolloHostComma
4710 root 25 0 25400 4436 1404 R 2.5 0.0 19:50.97 apolloHostComma
5047 root 25 0 174m 17m 5852 R 2.5 0.1 19:19.46 yum
5568 root 25 0 25136 4104 1404 R 2.5 0.0 19:36.23 apolloHostComma
5649 root 25 0 24344 3296 1400 R 2.5 0.0 19:54.40 apolloHostComma
6132 root 25 0 25004 4056 1404 R 2.5 0.0 19:26.55 apolloHostComma
7084 snitch 25 0 8708 252 112 R 2.5 0.0 20:06.13 sh
7201 root 25 0 8368 716 584 R 2.5 0.0 19:27.99 ps
7749 root 25 0 27808 2840 1484 R 2.5 0.0 19:58.13 auth-sync.pl
7975 root 25 0 31168 4000 1548 R 2.5 0.0 20:04.87 report
7977 root 25 0 9772 772 476 R 2.5 0.0 19:55.76 apollo-polling-
8174 snitch 25 0 8708 708 588 R 2.5 0.0 19:52.57 sh
8307 eo 25 0 26008 3000 1480 R 2.5 0.0 19:49.94 perl
8583 root 25 0 25268 4296 1404 R 2.5 0.0 19:05.10 apolloHostComma
9832 eo 18 0 0 0 0 Z 2.5 0.0 18:08.24 java <defunct>
9856 eo 18 0 3454m 12m 7572 D 2.5 0.0 18:08.24 java
9882 eo 18 0 0 0 0 Z 2.5 0.0 18:24.09 java <defunct>
666 root 25 0 174m 17m 5876 R 2.5 0.1 12:47.36 yum
1343 root 25 0 74820 1240 592 R 2.5 0.0 277:03.67 crond
1571 eo 18 0 3649m 563m 26m D 2.5 1.6 20:27.40 java
1601 eo 18 0 0 0 0 Z 2.5 0.0 21:15.44 java <defunct>
2858 root 25 0 24872 3944 1404 R 2.5 0.0 20:30.74 apolloHostComma
2881 root 25 0 53016 15m 1852 R 2.5 0.0 19:25.97 apolloHostComma
3166 root 25 0 29396 4340 1452 R 2.5 0.0 264:38.79 RotateLogFiles.
4392 root 25 0 29988 6980 1520 R 2.5 0.0 20:59.13 apolloHostComma
4608 root 25 0 55224 15m 1804 R 2.5 0.0 20:46.56 apolloHostComma
4624 root 25 0 24740 3808 1404 R 2.5 0.0 20:46.17 apolloHostComma
4637 root 25 0 25004 4036 1404 R 2.5 0.0 20:46.43 apolloHostComma
4681 root 25 0 28736 3608 1452 R 2.5 0.0 20:55.49 RotateLogFiles.
4760 eo 18 0 0 0 0 Z 2.5 0.0 20:04.55 java <defunct>
4979 root 25 0 74820 860 212 R 2.5 0.0 19:58.63 crond
5023 root 25 0 25484 2492 1472 R 2.5 0.0 19:41.18 auth-sync.pl
5460 eo 25 0 23288 2220 1272 R 2.5 0.0 19:37.19 cron-babysit
5551 eo 25 0 31916 6912 1608 R 2.5 0.0 19:36.55 cron-babysit
5560 root 25 0 22496 696 532 R 2.5 0.0 20:42.10 report
5564 root 25 0 8708 244 92 R 2.5 0.0 19:36.86 SnitchAgentCont
From the first several lines of the top output, it is not obvious how the CPU is being consumed.
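One way to get an overview of a listing this long is to tally the process rows by state and command. The following is only a minimal sketch, assuming the rows use the 12-column layout shown above (PID through COMMAND); `tally_states` and the `sample` list are hypothetical names introduced for illustration, and the sample rows are copied from the output above.

```python
from collections import Counter


def tally_states(top_lines):
    """Count process rows per (state, command) in top's process table."""
    counts = Counter()
    for line in top_lines:
        # At most 12 fields, so a command such as "java <defunct>" stays whole.
        fields = line.split(None, 11)
        # Skip the column header and anything that is not a process row.
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        state = fields[7]              # the "S" column: R, S, D, Z, ...
        command = fields[11].strip()   # the COMMAND column
        counts[(state, command)] += 1
    return counts


# A few rows copied from the listing above, used here as sample input.
sample = [
    "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
    "3362 eo 18 0 0 0 0 Z 10.0 0.0 101:32.83 java <defunct>",
    "12818 eo 22 0 3896m 1.2g 18m S 7.4 3.6 728:05.05 java",
    "1425 eo 18 0 3632m 1.0g 26m D 4.2 3.0 42:11.92 java",
    "6079 root 25 0 93744 19m 3004 R 3.7 0.1 20:34.63 apolloHostComma",
    "6254 root 25 0 8068 456 380 R 3.7 0.0 20:28.19 date",
    "2671 root 25 0 25004 3996 1404 R 2.5 0.0 265:33.27 apolloHostComma",
]

if __name__ == "__main__":
    for (state, command), n in tally_states(sample).most_common():
        print(state, command, n)
```

Applied to the full listing, a tally like this makes it easier to see that most rows are short-lived commands (apolloHostComma, crond, yum, ps, date) stuck in state R, next to java processes in state D or Z.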
Sometimes we see kswapd0 near the top of the list; this may be because we have no swap space.
It is impossible to print the command line of the java process with top, ps, or /proc/<pid>/cmdline, because the console hangs if we try.
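Per-process kernel time is also exposed in /proc/<pid>/stat, a separate file from cmdline. The sketch below ranks processes by the `stime` field (field 15 in the proc(5) layout, in clock ticks). It has not been tried on the affected hosts, so whether reading `stat` would block there as well is unknown. It also assumes a Python 3 interpreter, which these hosts may not have. The function names are hypothetical.

```python
import os


def parse_stat_line(line):
    """Extract pid, comm, state, utime and stime from /proc/<pid>/stat text."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST ')' rather than on whitespace.
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    fields = tail.split()
    # fields[0] is field 3 (state); utime is field 14, stime is field 15.
    return {
        "pid": int(pid_str),
        "comm": comm,
        "state": fields[0],
        "utime": int(fields[11]),  # user-mode time, in clock ticks
        "stime": int(fields[12]),  # kernel-mode time, in clock ticks
    }


def rank_by_stime(stat_lines, limit=10):
    """Return the processes with the most accumulated kernel time."""
    rows = [parse_stat_line(text) for text in stat_lines if ")" in text]
    rows.sort(key=lambda row: row["stime"], reverse=True)
    return rows[:limit]


def read_all_stat(proc_root="/proc"):
    """Read every /proc/<pid>/stat, skipping processes that exit meanwhile."""
    lines = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as handle:
                lines.append(handle.read())
        except (IOError, OSError):
            continue
    return lines


if __name__ == "__main__" and os.path.isdir("/proc"):
    for row in rank_by_stime(read_all_stat()):
        print(row["pid"], row["comm"], row["state"], row["stime"])
```

Because `stime` is cumulative since process start, two snapshots taken a few seconds apart and subtracted would show who is burning kernel time right now, rather than who has burned the most over 340 days of uptime.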
My question is: how can we find out what is pegging the CPU in the kernel?