torque pbs 4.0.1: job stays in the queued ('Q') state; the scheduler seems not to receive any notification

I am using torque 4.0.1 on openSUSE 12.1 in a cluster environment. When I submit a job (as simple as "echo hello"), it stays in the 'Q' state and never gets scheduled. I can force the job to run with qrun, and it then runs on the first node without error.

I have been looking for a solution for the past few days, without success. I have read the manual, the logs, and even the source code, but I still cannot locate the problem. I have also searched extensively and tried several suggested fixes, but none of them worked.

Here is some information that may be useful:

  • pbs_sched is running, but its log suggests that it receives no notification about jobs being queued:

    05/13/2012 18:55:08;0002; pbs_sched;Svr;Log;Log opened
    05/13/2012 18:55:08;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120513 opened
    05/13/2012 18:55:08;0002; pbs_sched;Svr;main;pbs_sched startup pid 32604
  • The pbs_server log shows that the job was enqueued into the default queue, batch:

    05/13/2012 19:33:08;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.1, loglevel = 0
    05/13/2012 19:33:56;0100;PBS_Server;Job;16.head;enqueuing into batch, state 1 hop 1
    05/13/2012 19:33:56;0008;PBS_Server;Job;16.head;Job Queued at request of pubuser@head, owner = pubuser@head, job name = STDIN, queue = batch
  • qstat -f 16 shows nothing useful:

    Job Id: 16.head
    Job_Name = STDIN
    Job_Owner = pubuser@head
    job_state = Q
    queue = batch
    server = head
    Checkpoint = u
    ctime = Sun May 13 19:33:56 2012
    Error_Path = head:/fserver/home/pubuser/STDIN.e16
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun May 13 19:33:56 2012
    Output_Path = head:/fserver/home/pubuser/STDIN.o16
    Priority = 0
    qtime = Sun May 13 19:33:56 2012
    Rerunable = True
    Resource_List.walltime = 01:00:00
    substate = 10
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/,
        PBS_O_WORKDIR=/fserver/home/pubuser,PBS_O_HOST=head,PBS_O_SERVER=head,
        PBS_O_WORKDIR=/fserver/home/pubuser
    euser = pubuser
    egroup = users
    queue_rank = 4
    queue_type = E
    etime = Sun May 13 19:33:56 2012
    fault_tolerant = False
    job_radix = 0
    submit_host = head
    init_work_dir = /fserver/home/pubuser
  • All nodes are free:

    sun1
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910403,varattr=,jobs=,state=free,netload=44492032184,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1697420kb,totmem=1802616kb,idletime=241085,nusers=0,nsessions=0,uname=Linux sun1 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun2
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910408,varattr=,jobs=,state=free,netload=39762812881,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1701012kb,totmem=1802616kb,idletime=239982,nusers=0,nsessions=0,uname=Linux sun2 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun3
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910400,varattr=,jobs=,state=free,netload=45984311925,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1699772kb,totmem=1802616kb,idletime=212303,nusers=0,nsessions=0,uname=Linux sun3 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun4
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910407,varattr=,jobs=,state=free,netload=37538584401,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805480kb,totmem=1908308kb,idletime=211197,nusers=0,nsessions=0,uname=Linux sun4 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun5
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910411,varattr=,jobs=,state=free,netload=173547166,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803816kb,totmem=1908308kb,idletime=211199,nusers=0,nsessions=0,uname=Linux sun5 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun6
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910411,varattr=,jobs=,state=free,netload=24641446,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805704kb,totmem=1908308kb,idletime=212999,nusers=0,nsessions=0,uname=Linux sun6 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun7
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910412,varattr=,jobs=,state=free,netload=1548383055,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805432kb,totmem=1908308kb,idletime=215630,nusers=0,nsessions=0,uname=Linux sun7 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun8
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910400,varattr=,jobs=,state=free,netload=128755968,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803448kb,totmem=1908308kb,idletime=211866,nusers=0,nsessions=0,uname=Linux sun8 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun9
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910374,varattr=,jobs=,state=free,netload=1371896399,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805664kb,totmem=1908308kb,idletime=211161,nusers=0,nsessions=0,uname=Linux sun9 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0
  • qmgr -c 'p s':

    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue batch
    #
    create queue batch
    set queue batch queue_type = Execution
    set queue batch resources_default.walltime = 01:00:00
    set queue batch enabled = True
    set queue batch started = True
    #
    # Set server attributes.
    #
    set server scheduling = True
    set server acl_hosts = head
    set server managers = pubuser@head
    set server managers += root@head
    set server operators = pubuser@head
    set server operators += root@head
    set server default_queue = batch
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server node_check_rate = 150
    set server tcp_timeout = 300
    set server job_stat_rate = 45
    set server poll_jobs = True
    set server mom_job_sync = True
    set server keep_completed = 0
    set server submit_hosts = head
    set server next_job_number = 17
    set server moab_array_compatible = True
  • momctl -d 13 on the first node:

    Host: sun1/sun1   Version: 4.0.1   PID: 5362
    Server[0]: head (192.168.0.1:15001)
      Last Msg From Server:   1584 seconds (DeleteJob)
      Last Msg To Server:     7 seconds
    HomeDirectory:          /var/spool/torque/mom_priv
    stdout/stderr spool directory: '/var/spool/torque/spool/' (4457492 blocks available)
    MOM active:             229485 seconds
    Check Poll Time:        45 seconds
    Server Update Interval: 45 seconds
    LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
    Communication Model:    TCP
    MemLocked:              TRUE  (mlock)
    TCP Timeout:            0 seconds
    Trusted Client List:  127.0.0.1:0,192.168.0.1:0,192.168.0.101:0,192.168.0.101:15003,192.168.0.102:15003,192.168.0.103:15003,192.168.0.104:15003,192.168.0.105:15003,192.168.0.106:15003,192.168.0.107:15003,192.168.0.108:15003,192.168.0.109:15003:  0
    Copy Command:           /usr/bin/scp -rpB
    NOTE:  no local jobs detected

    diagnostics complete

What stands out is that TCP Timeout is 0 seconds, which does not look normal. While running the diagnostics, the following entry appeared in mom_logs:

    05/13/2012 20:30:10;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Resource temporarily unavailable (11) in tcp_read_proto_version, no protocol version number End of File (errno 2)

I searched Google for this message but found nothing.
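To make the log excerpts in this post easier to compare across pbs_sched, pbs_server and pbs_mom, here is a minimal Python sketch that splits one log line into its semicolon-separated fields and decodes the event code. The field names (`timestamp`, `event_code`, `daemon`, `obj_class`, `obj_name`, `message`) and the event-class table are my assumptions, inferred from the lines shown here and from the fact that log_events = 511 would enable nine one-bit classes; they are not copied from the Torque documentation.

```python
from typing import List, NamedTuple

# ASSUMPTION: one bit per event class, printed in the log as 4 hex digits.
# The names are illustrative labels, not official Torque identifiers.
EVENT_CLASSES = {
    0x001: "error",
    0x002: "system",
    0x004: "admin",
    0x008: "job",
    0x010: "job_usage",
    0x020: "security",
    0x040: "scheduler",
    0x080: "debug",
    0x100: "debug2",
}


class LogRecord(NamedTuple):
    timestamp: str
    event_code: int
    daemon: str
    obj_class: str
    obj_name: str
    message: str


def parse_log_line(line: str) -> LogRecord:
    """Split a log line of the form 'time;code;daemon;class;object;message'."""
    # Split at most five times so a ';' inside the message is preserved.
    parts = [p.strip() for p in line.strip().split(";", 5)]
    if len(parts) != 6:
        raise ValueError(f"unexpected log line: {line!r}")
    timestamp, code, daemon, obj_class, obj_name, message = parts
    return LogRecord(timestamp, int(code, 16), daemon, obj_class, obj_name, message)


def event_names(code: int) -> List[str]:
    """Return the (assumed) event-class labels whose bits are set in code."""
    return [name for bit, name in EVENT_CLASSES.items() if code & bit]


if __name__ == "__main__":
    rec = parse_log_line(
        "05/13/2012 19:33:56;0100;PBS_Server;Job;16.head;"
        "enqueuing into batch, state 1 hop 1"
    )
    print(rec.daemon, rec.obj_name, event_names(rec.event_code))
```

Filtering the parsed records on `obj_name == "16.head"` across all three log directories would show at a glance whether any daemon other than pbs_server ever mentions the job.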

  • I compiled OpenMPI against this torque 4.0.1 (for tm support), and I can run test programs without problems.

I hope someone can help solve this problem. Thanks!

    
by liding 13.05.2012 / 14:15

0 answers
