Ubuntu 16.04: Slurm srun fails with Intel MPI?

I am trying to install Slurm on a cluster running Ubuntu 16.04.

I am using Intel MPI, and its installation directory is on the head node at /opt/intel/impi_5.01.

According to the Slurm instructions, a variable pointing to libpmi.so needs to be exported.

However, I installed slurm-llnl from the Ubuntu repositories

sudo apt-get install slurm-llnl

and I am not sure where libpmi.so is located. I searched and found the file below; is this the one I am looking for?

/usr/lib/x86_64-linux-gnu/libpmi.so
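
A minimal sketch of how the search and the export might look. It assumes the variable in question is `I_MPI_PMI_LIBRARY`, which Intel MPI reads to find Slurm's PMI library; the helper name `find_libpmi` and the list of directories searched are illustrative only.

```shell
# Hypothetical helper: print the first libpmi.so* found under the given directories.
find_libpmi() {
    find "$@" -name 'libpmi.so*' 2>/dev/null | sort | head -n 1
}

# Search a few common library directories (an assumption; adjust for your system).
pmi_lib="$(find_libpmi /usr/lib /usr/lib64 /usr/local/lib)"

if [ -n "$pmi_lib" ]; then
    # Point Intel MPI at Slurm's PMI library before calling srun.
    export I_MPI_PMI_LIBRARY="$pmi_lib"
    echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY"
else
    echo "libpmi.so not found in the searched directories" >&2
fi
```

On the system described here, the search would be expected to turn up the path shown above.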

Anyway, I exported the variable and tried

srun -p old -N3 -n24 hostname

It returns:

rolly@head:~$ srun -p old -N3 -n24 hostname
node02
node02
node02
node02
node02
node02
node02
node02
node01
node01
head
head
node01
head
head
head
node01
node01
head
node01
head
head
node01
node01

That seems to work.

But when I run my actual job,

srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x

it produces these errors:

mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)

I believe the errors come from mpiexec being run with Intel MPI; it should be using mpirun.
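
One way to test that guess, sketched under assumptions: check which MPI and PMI libraries the binary is linked against, and which MPI plugin types the installed srun offers. The helper name `linked_mpi_libs` is hypothetical, and the `mpi|pmi` pattern is a loose filter that may match unrelated libraries.

```shell
# Hypothetical helper: list MPI/PMI-related shared libraries a binary links against.
linked_mpi_libs() {
    ldd "$1" 2>/dev/null | grep -i -E 'mpi|pmi' || echo "no MPI libraries found for $1"
}

# Inspect the executable from the failing job (path taken from the question).
linked_mpi_libs "$HOME/QE530-CPU/espresso-5.3.0/bin/pw.x"

# If Slurm is installed, show the MPI plugin types srun can use.
if command -v srun >/dev/null 2>&1; then
    srun --mpi=list
fi
```

If the libraries listed do not belong to the MPI installation you expect, the binary is probably not being launched the way it was built to be.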

How can I fix the problem?

Thanks!

    
by Rolly Ng 15.01.2017 / 07:24

1 answer

I found my solution.

1) sudo apt-get install mpich

2) srun --mpi=pmi2

3) Check that the MKL- and Intel-related environment variables are loaded correctly.
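
Put together, the launch might look like the sketch below. It reuses the partition, node count, task count and pw.x path from the question as defaults; the wrapper name `run_pw` and the `DRY_RUN`, `PARTITION`, `NODES`, `TASKS` and `PW_BIN` variables are hypothetical conveniences, not part of Slurm or Quantum ESPRESSO. The Intel and MKL environment is assumed to have been loaded already (for example by sourcing the environment scripts shipped with your installation, whose location varies).

```shell
# Hypothetical wrapper: build the srun command from the question, adding --mpi=pmi2.
run_pw() {
    partition="${PARTITION:-old}"
    nodes="${NODES:-3}"
    tasks="${TASKS:-24}"
    pw_bin="${PW_BIN:-$HOME/QE530-CPU/espresso-5.3.0/bin/pw.x}"

    # Prepend the srun invocation to any extra arguments passed to the wrapper.
    set -- srun --mpi=pmi2 -p "$partition" -N"$nodes" -n"$tasks" "$pw_bin" "$@"

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command, so it can be checked before submitting.
        echo "$*"
    else
        "$@"
    fi
}

# Print the command that would be executed, without running it.
DRY_RUN=1 run_pw
```

Calling `run_pw` without `DRY_RUN=1` would submit the job for real.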

I hope this helps someone with a similar problem.

    
by Rolly Ng 15.01.2017 / 16:41