picos médios de carga do Ubuntu, mas as CPUs estão ociosas

7

Temos um servidor em uma rede na nuvem, fornecido por terceiros. Estamos executando o Ubuntu 10.04 Server Edition.

O problema acontece em momentos aparentemente aleatórios, cerca de uma a três vezes por dia. A média de carga na parte superior geralmente fica em torno de 2 e o servidor executa peachy, mas nesses tempos aleatórios, a média da carga aumenta para cerca de 30 a 35 e tudo fica paralisado. Não é possível acessar nossos sites, não pode executar comandos no servidor, não pode fazer nada. Não é possível efetuar login se você ainda não estava conectado.

A única maneira pela qual conseguimos ver as médias de carga alta é por executar constantemente a parte superior, de modo que ela já esteja em execução quando o problema ocorrer. Parece que, se já estiver em execução, continuará funcionando corretamente, mas, se não estiver em execução, não será possível iniciá-lo. A incapacidade de executar quaisquer comandos quando chega a esse estado tem dificultado o diagnóstico do problema ... além disso, não nos consideramos especialistas em servidor de maneira alguma.

O que parece estranho para mim é que a média de carga aumenta muito, mas os processadores permanecem ociosos e há bastante memória livre. Novamente, eu não sou um especialista, mas meu entendimento básico é que, se a memória estiver disponível e os processadores não estiverem no máximo, então não deve haver processos esperando (muito possivelmente eu esteja enganado).

Como eu estava digitando isso, eu peguei como ele estava começando a cravar e conseguiu executar alguns comandos antes de tudo bloqueado. A saída é a seguinte:

uname -a

Linux <server name> 2.6.32-308-ec2 #16-Ubuntu SMP Thu Sep 16 14:28:38 UTC 2010 i686 GNU/Linux

topo

top - 10:55:08 up 15:28,  4 users,  load average: 12.29, 7.01, 3.89
Tasks: 313 total,   3 running, 308 sleeping,   0 stopped,   2 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4210704k total,  2163024k used,  2047680k free,   162320k buffers
Swap:  2096440k total,        0k used,  2096440k free,  1690464k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
    1 root      20   0  2804 1644 1204 S    0  0.0   0:00.18 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 R    0  0.0   0:00.08 migration/0
    4 root      20   0     0    0    0 S    0  0.0   0:00.01 ksoftirqd/0
    5 root      RT   0     0    0    0 R    0  0.0   0:00.01 watchdog/0
    6 root      20   0     0    0    0 S    0  0.0   0:00.06 events/0

ps axf

  PID TTY      STAT   TIME COMMAND
    2 ?        S      0:00 [kthreadd]
    3 ?        R      0:00  \_ [migration/0]
    4 ?        S      0:00  \_ [ksoftirqd/0]
    5 ?        R      0:00  \_ [watchdog/0]
    6 ?        S      0:00  \_ [events/0]
    7 ?        S      0:00  \_ [cpuset]
    8 ?        S      0:00  \_ [khelper]

<removed a bunch of processes to save space in the post, all had status S>

    1 ?        Ss     0:00 /sbin/init
  245 ?        S      0:00 upstart-udev-bridge --daemon
  251 ?        S /var/run/motd.new
25554 ?        S      0:00          \_ run-parts --lsbsysinit /etc/update-motd.d
25558 ?        S      0:00              \_ /bin/sh /etc/update-motd.d/10-help-text
25560 ?        D      0:00                  \_ /bin/sh /etc/update-motd.d/10-help-text
  852 ?        Ss     0:00 cron
 1374 ?        S      0:00  \_ CRON
 1377 ?        Ss     0:00  |   \_ /bin/sh -c /var/www/secure/caddy2_prod/scripts/main.pl
 1379 ?        S      0:02  |       \_ /usr/bin/perl /var/www/secure/caddy2_prod/scripts/main.pl
 1385 ?        Z      0:00  |           \_ [check.pl] 
 1375 ?        S      0:00  \_ CRON
 1376 ?        Ss     0:00      \_ /bin/sh -c /var/www/secure/caddy2_test/scripts/main.pl
 1378 ?        S      0:00          \_ /usr/bin/perl /var/www/secure/caddy2_test/scripts/main.pl
 1384 ?        Z      0:00              \_ [check.pl] 
  855 ?        Ss     0:00 atd
  868 ?        Ssl    6:36 /usr/sbin/mysqld
  890 ?        S      0:00 /bin/bash /usr/sbin/xe-daemon -p /var/run/xe-daemon.pid
25563 ?        S      0:00  \_ /bin/sh /usr/sbin/xe-update-guest-attrs --memory
25564 ?        D      0:00      \_ /bin/sh /usr/sbin/xe-update-guest-attrs --memory
 1161 ?        Ss     0:00 /usr/lib/postfix/master
 3102 ?        S      0:00  \_ qmgr -l -t fifo -u
22013 ?        S      0:00  \_ pickup -l -t fifo -u -c
 1181 ?        Ssl    3:17 /usr/sbin/asterisk -p -U asterisk
 1182 ?        S      0:00  \_ astcanary /var/run/asterisk/alt.asterisk.canary.tweet.tweet.tweet
 1222 ?        Ss     0:00 /usr/sbin/apache2 -k start
31682 ?        S      0:01  \_ /usr/sbin/apache2 -k start
31716 ?        S      0:01  \_ /usr/sbin/apache2 -k start
13548 ?        S      0:00  \_ /usr/sbin/apache2 -k start
25593 ?        S      0:00  |   \_ /usr/bin/perl -w /usr/lib/cgi-bin/caddy2/patch.pl
25594 ?        D      0:00  |       \_ /usr/bin/perl -w /usr/lib/cgi-bin/caddy2/patch.pl
13637 ?        S      0:00  \_ /usr/sbin/apache2 -k start
16061 ?        S      0:00  \_ /usr/sbin/apache2 -k start
23116 ?        S      0:00  \_ /usr/sbin/apache2 -k start
25565 ?        D      0:00  |   \_ /usr/sbin/apache2 -k start
23117 ?        S      0:00  \_ /usr/sbin/apache2 -k start
23118 ?        S      0:00  \_ /usr/sbin/apache2 -k start
23119 ?        S      0:00  \_ /usr/sbin/apache2 -k start
23121 ?        S      0:00  \_ /usr/sbin/apache2 -k start
 1268 tty1     Ss+    0:00 /sbin/getty -8 38400 tty1
 1396 ?        S      0:00 /usr/local/caddy2/servers/test/caddy2serverd localhost caddy2test 1981

<removed a bunch of processes like the one above to save space in the post, there were about 100, all with status S>

25590 ?        S      0:00  \_ /usr/local/caddy2/servers/prod/caddy2serverd localhost caddy2prod 1991
25538 ?        D      0:00 /bin/bash ./impsys-snap.sh nohup
25596 ?        Ss     0:00 /sbin/getty -L hvc0 9600 linux

Eu noto que existem alguns processos no estado D, o que eu acho que indica que é um processo zumbi. Eu não tenho idéia se essas são a causa do problema, ou qual é a diferença entre processos no estado D versus processos no estado Z.

Se pensarmos que estas são as causas do problema, o que posso fazer sobre isso? Não tenho ideia do que faz com que um processo entre no estado D e, portanto, não tenha idéia de como impedir que isso aconteça.

Sua ajuda é muito apreciada. Obrigado!

ATUALIZAÇÃO:

Eu olhei para o nosso kern.log e descobri que ele está cheio de mensagens como estas:

<removed to clean up post, further detail added below>

O registro de data e hora em alguns deles parece coincidir com o tempo que o servidor nos trava, então acho que isso tem algo a ver com isso. Também passamos essas informações para o nosso provedor de servidores, mas isso indica alguma coisa útil? Em caso afirmativo, isso indica um problema no meu final ou no fim do provedor do meu servidor?

UPDATE 2:

Aqui está o kern.log o tempo todo que pareceu relevante. Às 10:52:24, a média de carga começou a subir. Eu tenho a saída do ps axf por volta de 10:54:02. Depois de um minuto ou dois (possivelmente exatamente dois minutos, se 10:56:02 indicar), o sistema ficou sem resposta e eu não consegui executar comandos. Aqui está o log:


Mar 25 08:08:57 cloud kernel: [45483.026983] INFO: task apache2:9642 blocked for more than 120 seconds.
Mar 25 08:08:57 cloud kernel: [45483.026986] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 08:08:57 cloud kernel: [45483.026989] apache2       D ea63de60     0  9642   9068 0x00000000
Mar 25 08:08:57 cloud kernel: [45483.026992]  ea63de78 00000282 ea63ded4 ea63de60 c01d9096 00000000 00000000 00000000
Mar 25 08:08:57 cloud kernel: [45483.026996]  c06f61c0 ea1a6fac c06f61c0 c06f61c0 c06f61c0 ea1a6fac c06f61c0 c06f61c0
Mar 25 08:08:57 cloud kernel: [45483.027000]  c1305740 0000296f ea1a6d00 ec844d00 9a352c09 0003c456 ea63dea0 c0106c51
Mar 25 08:08:57 cloud kernel: [45483.027003] Call Trace:
Mar 25 08:08:57 cloud kernel: [45483.027006]  [] ? __link_path_walk+0x626/0xc20
Mar 25 08:08:57 cloud kernel: [45483.027010]  [] ? sched_clock+0x21/0x80
Mar 25 08:08:57 cloud kernel: [45483.027013]  [] schedule_timeout+0x175/0x250
Mar 25 08:08:57 cloud kernel: [45483.027018]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 08:08:57 cloud kernel: [45483.027021]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 08:08:57 cloud kernel: [45483.027023]  [] wait_for_common+0xc6/0x180
Mar 25 08:08:57 cloud kernel: [45483.027026]  [] ? default_wake_function+0x0/0x10
Mar 25 08:08:57 cloud kernel: [45483.027028]  [] wait_for_completion+0x12/0x20
Mar 25 08:08:57 cloud kernel: [45483.027031]  [] sched_migrate_task+0xe4/0x100
Mar 25 08:08:57 cloud kernel: [45483.027033]  [] sched_exec+0x3b/0x50
Mar 25 08:08:57 cloud kernel: [45483.027036]  [] do_execve+0xc4/0x360
Mar 25 08:08:57 cloud kernel: [45483.027038]  [] sys_execve+0x28/0x60
Mar 25 08:08:57 cloud kernel: [45483.027041]  [] syscall_call+0x7/0xb
Mar 25 09:27:03 cloud kernel: [50344.466167] nf_conntrack version 0.5.0 (16384 buckets, 65536 max)
Mar 25 09:27:03 cloud kernel: [50344.466452] CONFIG_NF_CT_ACCT is deprecated and will be removed soon. Please use
Mar 25 09:27:03 cloud kernel: [50344.466454] nf_conntrack.acct=1 kernel parameter, acct=1 nf_conntrack module option or
Mar 25 09:27:03 cloud kernel: [50344.466455] sysctl net.netfilter.nf_conntrack_acct=1 to enable it.
Mar 25 10:52:24 cloud kernel: [55167.785176] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:52:24 cloud kernel: [55167.785202] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:52:24 cloud kernel: [55167.785217] 
Mar 25 10:52:24 cloud kernel: [55167.785221] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu) 
Mar 25 10:52:24 cloud kernel: [55167.785224] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:52:24 cloud kernel: [55167.785228] EIP is at 0xc01013a7
Mar 25 10:52:24 cloud kernel: [55167.785230] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:52:24 cloud kernel: [55167.785232] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:52:24 cloud kernel: [55167.785235]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:52:24 cloud kernel: [55167.785242] CR0: 8005003b CR2: b6620000 CR3: 2a65e000 CR4: 00002660
Mar 25 10:52:24 cloud kernel: [55167.785247] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:52:24 cloud kernel: [55167.785250] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:52:24 cloud kernel: [55167.785252] Call Trace:
Mar 25 10:52:24 cloud kernel: [55167.785259]  [] ? xen_safe_halt+0x15/0x40
Mar 25 10:52:24 cloud kernel: [55167.785264]  [] xen_idle+0x29/0x80
Mar 25 10:52:24 cloud kernel: [55167.785267]  [] cpu_idle+0x8f/0xc0
Mar 25 10:52:24 cloud kernel: [55167.785272]  [] rest_init+0x53/0x60
Mar 25 10:52:24 cloud kernel: [55167.785278]  [] start_kernel+0x379/0x37f
Mar 25 10:52:24 cloud kernel: [55167.785282]  [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:52:24 cloud kernel: [55167.785286]  [] i386_start_kernel+0x67/0x6e
Mar 25 10:53:30 cloud kernel: [55233.281412] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:53:30 cloud kernel: [55233.281421] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:53:30 cloud kernel: [55233.281444] 
Mar 25 10:53:30 cloud kernel: [55233.281449] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu) 
Mar 25 10:53:30 cloud kernel: [55233.281453] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:53:30 cloud kernel: [55233.281457] EIP is at 0xc01013a7
Mar 25 10:53:30 cloud kernel: [55233.281460] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:53:30 cloud kernel: [55233.281463] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:53:30 cloud kernel: [55233.281466]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:53:30 cloud kernel: [55233.281474] CR0: 8005003b CR2: 09827024 CR3: 013c4000 CR4: 00002660
Mar 25 10:53:30 cloud kernel: [55233.281480] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:53:30 cloud kernel: [55233.281484] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:53:30 cloud kernel: [55233.281487] Call Trace:
Mar 25 10:53:30 cloud kernel: [55233.281498]  [] ? xen_safe_halt+0x15/0x40
Mar 25 10:53:30 cloud kernel: [55233.281504]  [] xen_idle+0x29/0x80
Mar 25 10:53:30 cloud kernel: [55233.281509]  [] cpu_idle+0x8f/0xc0
Mar 25 10:53:30 cloud kernel: [55233.281516]  [] rest_init+0x53/0x60
Mar 25 10:53:30 cloud kernel: [55233.281524]  [] start_kernel+0x379/0x37f
Mar 25 10:53:30 cloud kernel: [55233.281529]  [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:53:30 cloud kernel: [55233.281535]  [] i386_start_kernel+0x67/0x6e
Mar 25 10:54:36 cloud kernel: [55298.785478] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:54:36 cloud kernel: [55298.785538] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:54:36 cloud kernel: [55298.785551] 
Mar 25 10:54:36 cloud kernel: [55298.785554] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu) 
Mar 25 10:54:36 cloud kernel: [55298.785556] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:54:36 cloud kernel: [55298.785560] EIP is at 0xc01013a7
Mar 25 10:54:36 cloud kernel: [55298.785561] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:54:36 cloud kernel: [55298.785563] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:54:36 cloud kernel: [55298.785565]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:54:36 cloud kernel: [55298.785570] CR0: 8005003b CR2: 08ddb00c CR3: 28e7a000 CR4: 00002660
Mar 25 10:54:36 cloud kernel: [55298.785573] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:54:36 cloud kernel: [55298.785575] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:54:36 cloud kernel: [55298.785576] Call Trace:
Mar 25 10:54:36 cloud kernel: [55298.785584]  [] ? xen_safe_halt+0x15/0x40
Mar 25 10:54:36 cloud kernel: [55298.785588]  [] xen_idle+0x29/0x80
Mar 25 10:54:36 cloud kernel: [55298.785591]  [] cpu_idle+0x8f/0xc0
Mar 25 10:54:36 cloud kernel: [55298.785596]  [] rest_init+0x53/0x60
Mar 25 10:54:36 cloud kernel: [55298.785602]  [] start_kernel+0x379/0x37f
Mar 25 10:54:36 cloud kernel: [55298.785605]  [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:54:36 cloud kernel: [55298.785608]  [] i386_start_kernel+0x67/0x6e
Mar 25 10:54:57 cloud kernel: [55318.911014] INFO: task impsys-snap.sh:25538 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911061] impsys-snap.s D e8ac5e60     0 25538      1 0x00000004
Mar 25 10:54:57 cloud kernel: [55318.911065]  e8ac5e78 00000282 ec53bb64 e8ac5e60 c01d9498 00000000 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911072]  c06f61c0 c13088dc c06f61c0 c06f61c0 c06f61c0 c13088dc c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911077]  ea1d5040 00003221 c1308630 c068d280 c012abc8 0003cd53 0004ae99 e8ac5eac
Mar 25 10:54:57 cloud kernel: [55318.911083] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911092]  [] ? __link_path_walk+0xa28/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911098]  [] ? inc_rt_group+0xf8/0x110
Mar 25 10:54:57 cloud kernel: [55318.911103]  [] ? update_curr+0x169/0x2c0
Mar 25 10:54:57 cloud kernel: [55318.911114]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911117]  [] ? check_preempt_wakeup+0x152/0x370
Mar 25 10:54:57 cloud kernel: [55318.911120]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911122]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911125]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911128]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911130]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911134]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911137]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911139]  [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911142] INFO: task lesspipe:25544 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911148] lesspipe      D e9487e60     0 25544  25543 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911151]  e9487e78 00000286 e9487ed4 e9487e60 c01d9096 c01e1204 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911155]  c06f61c0 ec3bb4ec c06f61c0 c06f61c0 c06f61c0 ec3bb4ec c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911159]  ec269900 00003221 ec3bb240 c068d280 6d9860b9 0003cd53 e9487ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911163] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911165]  [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911169]  [] ? dput+0x84/0x160
Mar 25 10:54:57 cloud kernel: [55318.911172]  [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911175]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911179]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911181]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911184]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911187]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911189]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911192]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911194]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911197]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911199]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911201]  [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911204] INFO: task 10-help-text:25560 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911206] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911209] 10-help-text  D e8995e60     0 25560  25558 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911211]  e8995e78 00000282 e8995ed4 e8995e60 c01d9096 00000000 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911215]  c06f61c0 e8af122c c06f61c0 c06f61c0 c06f61c0 e8af122c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911219]  ec295e40 00003229 e8af0f80 c068d280 3d76c004 0003cd5c e8995ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911223] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911225]  [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911228]  [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911231]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911233]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911236]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911238]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911241]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911243]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911246]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911248]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911251]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911253]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911256]  [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911258] INFO: task xe-update-guest:25564 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911260] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911263] xe-update-gue D e9535e60     0 25564  25563 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911265]  e9535e78 00000286 e9535ed4 e9535e60 c01d9096 c01e1204 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911269]  c06f61c0 ea09685c c06f61c0 c06f61c0 c06f61c0 ea09685c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911273]  ec269ac0 0000322b ea0965b0 c068d280 f58f849c 0003cd5d e9535ea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911277] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911279]  [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911282]  [] ? dput+0x84/0x160
Mar 25 10:54:57 cloud kernel: [55318.911284]  [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911287]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911290]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911292]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911294]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911297]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911299]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911302]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911305]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911307]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911310]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911312]  [] syscall_call+0x7/0xb
Mar 25 10:54:57 cloud kernel: [55318.911314] INFO: task apache2:25565 blocked for more than 120 seconds.
Mar 25 10:54:57 cloud kernel: [55318.911316] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 25 10:54:57 cloud kernel: [55318.911319] apache2       D e899be60     0 25565  23116 0x00000000
Mar 25 10:54:57 cloud kernel: [55318.911321]  e899be78 00000282 e899bed4 e899be60 c01d9096 c068afa0 00000000 00000000
Mar 25 10:54:57 cloud kernel: [55318.911325]  c06f61c0 e949522c c06f61c0 c06f61c0 c06f61c0 e949522c c06f61c0 c06f61c0
Mar 25 10:54:57 cloud kernel: [55318.911329]  ec31f040 0000322c e9494f80 c068d280 1fcfd913 0003cd5f e899bea0 c0106c51
Mar 25 10:54:57 cloud kernel: [55318.911333] Call Trace:
Mar 25 10:54:57 cloud kernel: [55318.911336]  [] ? __link_path_walk+0x626/0xc20
Mar 25 10:54:57 cloud kernel: [55318.911338]  [] ? sched_clock+0x21/0x80
Mar 25 10:54:57 cloud kernel: [55318.911341]  [] schedule_timeout+0x175/0x250
Mar 25 10:54:57 cloud kernel: [55318.911343]  [] ? sched_clock_cpu+0x14d/0x190
Mar 25 10:54:57 cloud kernel: [55318.911346]  [] ? find_idlest_group+0xa8/0x1b0
Mar 25 10:54:57 cloud kernel: [55318.911348]  [] wait_for_common+0xc6/0x180
Mar 25 10:54:57 cloud kernel: [55318.911351]  [] ? default_wake_function+0x0/0x10
Mar 25 10:54:57 cloud kernel: [55318.911353]  [] wait_for_completion+0x12/0x20
Mar 25 10:54:57 cloud kernel: [55318.911356]  [] sched_migrate_task+0xe4/0x100
Mar 25 10:54:57 cloud kernel: [55318.911359]  [] sched_exec+0x3b/0x50
Mar 25 10:54:57 cloud kernel: [55318.911361]  [] do_execve+0xc4/0x360
Mar 25 10:54:57 cloud kernel: [55318.911364]  [] sys_execve+0x28/0x60
Mar 25 10:54:57 cloud kernel: [55318.911366]  [] syscall_call+0x7/0xb
Mar 25 10:56:02 cloud kernel: [55383.610034] BUG: soft lockup - CPU#0 stuck for 61s! [swapper:0]
Mar 25 10:56:02 cloud kernel: [55383.610056] Modules linked in: nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ipv6 scsi_mod lp parport
Mar 25 10:56:02 cloud kernel: [55383.610073] 
Mar 25 10:56:02 cloud kernel: [55383.610077] Pid: 0, comm: swapper Not tainted (2.6.32-308-ec2 #16-Ubuntu) 
Mar 25 10:56:02 cloud kernel: [55383.610080] EIP: 0061:[] EFLAGS: 00000246 CPU: 0
Mar 25 10:56:02 cloud kernel: [55383.610084] EIP is at 0xc01013a7
Mar 25 10:56:02 cloud kernel: [55383.610086] EAX: 00000000 EBX: 00000001 ECX: 00000000 EDX: c0689f58
Mar 25 10:56:02 cloud kernel: [55383.610089] ESI: c06beb08 EDI: a8b0d3bc EBP: c0689f78 ESP: c0689f70
Mar 25 10:56:02 cloud kernel: [55383.610091]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0069
Mar 25 10:56:02 cloud kernel: [55383.610097] CR0: 8005003b CR2: 098c801c CR3: 2a6d2000 CR4: 00002660
Mar 25 10:56:02 cloud kernel: [55383.610104] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
Mar 25 10:56:02 cloud kernel: [55383.610106] DR6: ffff0ff0 DR7: 00000400
Mar 25 10:56:02 cloud kernel: [55383.610107] Call Trace:
Mar 25 10:56:02 cloud kernel: [55383.610115]  [] ? xen_safe_halt+0x15/0x40
Mar 25 10:56:02 cloud kernel: [55383.610119]  [] xen_idle+0x29/0x80
Mar 25 10:56:02 cloud kernel: [55383.610122]  [] cpu_idle+0x8f/0xc0
Mar 25 10:56:02 cloud kernel: [55383.610127]  [] rest_init+0x53/0x60
Mar 25 10:56:02 cloud kernel: [55383.610133]  [] start_kernel+0x379/0x37f
Mar 25 10:56:02 cloud kernel: [55383.610136]  [] ? unknown_bootoption+0x0/0x1a0
Mar 25 10:56:02 cloud kernel: [55383.610139]  [] i386_start_kernel+0x67/0x6e
    
por Travesty3 25.03.2011 / 17:34

2 respostas

9

D é espera ininterrupta, que era uma vez D isk wait, mas esses dias geralmente têm mais a ver com esperar por sistemas de arquivos de rede. Esses processos contam na média de carga, o que pode ser seu problema.

A coisa que aparece na saída ps é a segunda e última linha: um sistema de arquivos (NAS?) está sendo snapshot e provavelmente toda a atividade do disco nesse sistema de arquivos está sendo bloqueada durante o snapshot. As formas de lidar com isso incluem, mas não se limitam a, usar cachefs e instantâneos com mais frequência para que sejam mais rápidas; o qual usar depende do NAS (alguns sugam durante instantâneos, não importa o que), espaço em disco disponível e suas necessidades. Gostaria de começar procurando a comunidade de suporte do s (não necessariamente o suporte do fornecedor) para o NAS ver quais truques outros usuários criaram para minimizar a latência do snapshot.

    
por 25.03.2011 / 17:48
1

No final, o problema estava relacionado ao hardware. Nós migramos para um servidor diferente e POOF , o problema desapareceu. Pelo menos nos fez dar uma boa olhada em possíveis problemas de desempenho em nossos próprios programas. Encontramos algumas coisas que provavelmente teriam passado despercebidas, mas que deveriam ser limpas.

    
por 24.05.2011 / 18:12