The AWS Batch documentation has a troubleshooting guide for "Jobs Stuck in RUNNABLE Status".
link
If your compute environment contains compute resources, but your jobs do not progress beyond the RUNNABLE status, then there is something preventing the jobs from actually being placed on a compute resource. Here are some common causes for this issue:
The awslogs log driver is not configured on your compute resources
AWS Batch jobs send their log information to CloudWatch Logs. To enable this, you must configure your compute resources to use the awslogs log driver. If you base your compute resource AMI off of the Amazon ECS-optimized AMI (or Amazon Linux), then this driver is registered by default with the ecs-init package. If you use a different base AMI, then you must ensure that the awslogs log driver is specified as an available log driver with the ECS_AVAILABLE_LOGGING_DRIVERS environment variable when the Amazon ECS container agent is started. For more information, see Compute Resource AMI Specification and Creating a Compute Resource AMI.
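As a rough way to check this cause on an instance, the sketch below parses the text of an agent configuration file and reports whether `awslogs` appears in `ECS_AVAILABLE_LOGGING_DRIVERS`. The function name `awslogs_enabled` is hypothetical, and the code assumes the variable is written as a JSON array on a single `KEY=VALUE` line. It only inspects the file contents; it does not query the running agent.

```python
import json
from typing import Optional


def awslogs_enabled(config_text: str) -> Optional[bool]:
    """Return True/False if ECS_AVAILABLE_LOGGING_DRIVERS is set in the text,
    or None if the variable does not appear at all (hypothetical helper)."""
    for raw in config_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() != "ECS_AVAILABLE_LOGGING_DRIVERS":
            continue
        try:
            drivers = json.loads(value.strip())
        except json.JSONDecodeError:
            return False  # variable is set, but not as a valid JSON list
        return isinstance(drivers, list) and "awslogs" in drivers
    return None
```

On a container instance it could be called as `awslogs_enabled(open("/etc/ecs/ecs.config").read())`. A result of `None` only means the variable is not set in that file.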
Insufficient resources
If your job definitions specify more CPU or memory resources than your compute resources can allocate, then your jobs will never be placed. For example, if your job specifies 4 GiB of memory, and your compute resources have less than that, then the job cannot be placed on those compute resources. In this case, you must reduce the specified memory in your job definition or add larger compute resources to your environment.
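The comparison described above can be sketched as a small helper. The name `placement_blockers` and its parameters are hypothetical; the numbers you pass in are assumed to come from your job definition and from what the instance actually registered with ECS, which is usually somewhat less than the instance type's nominal memory.

```python
def placement_blockers(job_vcpus, job_memory_mib, inst_vcpus, inst_memory_mib):
    """List the reasons a job cannot fit on one instance (empty list = it fits).

    All values are assumptions supplied by the caller; nothing is looked up.
    """
    blockers = []
    if job_vcpus > inst_vcpus:
        blockers.append(
            f"vCPUs: job needs {job_vcpus}, instance offers {inst_vcpus}"
        )
    if job_memory_mib > inst_memory_mib:
        blockers.append(
            f"memory: job needs {job_memory_mib} MiB, "
            f"instance offers {inst_memory_mib} MiB"
        )
    return blockers
```

For the 4 GiB example in the quoted text, `placement_blockers(2, 4096, 2, 3800)` reports a memory blocker, while a 1024 MiB job on the same instance returns an empty list.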
Amazon EC2 instance limit reached
The number of Amazon EC2 instances that your account can launch in an AWS region is determined by your EC2 instance limit. Certain instance types have a per-instance-type limit as well. For more information on your account's Amazon EC2 instance limits (including how to request a limit increase), see Amazon EC2 Service Limits in the Amazon EC2 User Guide for Linux Instances.
Other very common problems I see that cause this:
- No route to the internet
- CPU / memory in the job definition is larger than what the instances provide
- The instance is not registered in the ECS cluster
- The agent is disconnected - link
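The registration and agent items in the list above can be checked from the output of the ECS `DescribeContainerInstances` API. The sketch below is one way to read such a response; `cluster_findings` is a hypothetical helper, and it relies only on the `containerInstances`, `ec2InstanceId`, `agentConnected`, and `status` fields of the response.

```python
def cluster_findings(response):
    """Summarize registration and agent problems in a
    DescribeContainerInstances-style response dict (hypothetical helper)."""
    instances = response.get("containerInstances", [])
    if not instances:
        return ["no container instances are registered in the cluster"]
    findings = []
    for ci in instances:
        name = ci.get("ec2InstanceId", "unknown instance")
        # agentConnected is False when the ECS agent has lost its connection.
        if not ci.get("agentConnected", False):
            findings.append(f"{name}: ECS agent is disconnected")
        # Anything other than ACTIVE means tasks will not be placed there.
        if ci.get("status") != "ACTIVE":
            findings.append(f"{name}: status is {ci.get('status')}")
    return findings
```

With boto3, the response would typically come from `describe_container_instances(cluster=..., containerInstances=...)` on an ECS client, using the ARNs returned by `list_container_instances`. The cluster to inspect is the one that AWS Batch created for your compute environment.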
Additional troubleshooting steps you can follow:
- Manually start the associated ECS task definition in your cluster
- SSH into the container instance and try running docker from inside it
- Curl the ECS and Batch endpoints from inside the container instance
- Remove the CPU / memory constraints in the job definition
- Review /etc/ecs/ecs.config
- Collect the ECS logs - link
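If curl is not handy, the endpoint check from the list above can be approximated with a plain TCP connection test. This is a minimal sketch: `service_hosts` and `can_connect` are hypothetical names, and the hostnames assume the usual `<service>.<region>.amazonaws.com` pattern of the standard commercial regions. A successful connection only shows that there is a network route on port 443, not that credentials or IAM roles are correct.

```python
import socket


def service_hosts(region):
    """Regional ECS and Batch hostnames (assumes the standard naming pattern)."""
    return [f"ecs.{region}.amazonaws.com", f"batch.{region}.amazonaws.com"]


def can_connect(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Run from inside the container instance, something like `[(h, can_connect(h)) for h in service_hosts("us-east-1")]` shows which endpoints are reachable. The region here is only an example.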