I have a web server container running on App Engine that serves a REST API. I went with a fairly standard setup: nginx + PHP-FPM over TCP sockets (I couldn't get a Unix socket to work for some reason). The database connection is also a TCP socket, which runs over a Google Cloud VPN.
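For what it's worth, the Unix-socket variant would look roughly like this (a sketch, not my actual attempt; the socket path is illustrative, and socket-file permissions are the usual failure mode):
; www.conf
listen = /run/php/php-fpm.sock
listen.owner = www-data
listen.group = www-data
listen.mode = 0660
# nginx
fastcgi_pass unix:/run/php/php-fpm.sock;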
The problem
I'm seeing roughly 25% availability on the API. Usually the request ends in a 504 Gateway Timeout after the maximum time (the App Engine nginx proxy is set to 60s). Sometimes it's a 502 Bad Gateway when PHP-FPM times out (request_terminate_timeout).
I'm trying to work out whether the misconfiguration is in the App Engine nginx proxy, my nginx, or my PHP-FPM setup. Nginx should be either closing the sockets or reusing them, but it appears to be doing neither.
Client side
When I run siege against a given endpoint (25 users) for a few minutes, I see:
HTTP/1.1 504 60.88 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.23 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 57.68 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 60.76 secs: 176 bytes ==> GET /path/to/rest
...15 lines...
HTTP/1.1 504 61.06 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 200 33.35 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 32.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 36.61 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 39.00 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 42.47 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 48.51 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 56.45 secs: 13143 bytes ==> GET /path/to/rest
# Another run
HTTP/1.1 200 7.65 secs: 13143 bytes ==> GET /path/to/rest
...10 lines...
HTTP/1.1 200 8.20 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 47.15 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 8.30 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 504 61.15 secs: 176 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.46 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.33 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.25 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 53.63 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 48.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.31 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.97 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.27 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.26 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 54.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.08 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.56 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.83 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 60.85 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 58.99 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.40 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.21 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.61 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 502 52.65 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.13 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.96 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.48 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.81 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.89 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.26 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 6.80 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.44 secs: 166 bytes ==> GET /path/to/rest
This also happens with just 1 user:
HTTP/1.1 502 55.43 secs: 166 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.71 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 200 7.54 secs: 13143 bytes ==> GET /path/to/rest
HTTP/1.1 502 59.21 secs: 166 bytes ==> GET /path/to/rest
Server side
Nginx logs for each case:
# 200
Normal logging i.e. [notice] GET /path/to/rest (param1, param2) ...
# 502
[error] 1059#0: *1395 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"
# 504
[error] 34#0: *326 upstream timed out (110: Operation timed out) while reading response header from upstream, client: 172.18.0.3, server: gaeapp, request: "GET /path/to/rest HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "api.example.com"
This is what netstat -t looks like:
# Before starting
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
# During the siege
tcp 0 0 localhost:56144 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53073 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 localhost:56148 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53071 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 localhost:56106 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56191 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56113 localhost:9000 ESTABLISHED
tcp 0 0 localhost:56150 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:53102 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:53051 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56146 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56117 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56179 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56160 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56168 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56170 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56111 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56115 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56123 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56109 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56113 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56140 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56181 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56121 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56191 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56119 TIME_WAIT
tcp 0 0 localhost:9000 localhost:56142 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56106 ESTABLISHED
tcp 0 0 localhost:9000 localhost:56110 TIME_WAIT
tcp 8 0 localhost:9000 localhost:56144 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56148 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56150 ESTABLISHED
# A minute or so after ending the siege
tcp 0 0 5c2ad0938ce9:53319 192.168.2.29:postgresql ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34578 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34576 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34570 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34565 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:33971 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34580 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34584 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34566 ESTABLISHED
tcp 0 0 localhost:56396 localhost:9000 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34591 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34574 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34072 ESTABLISHED
tcp 0 0 5c2ad0938ce9:http-alt 172.18.0.3:34572 ESTABLISHED
tcp 8 0 localhost:9000 localhost:56396 ESTABLISHED
Configs
nginx.conf
user www-data;
worker_processes auto;
worker_cpu_affinity auto;

events {
    worker_connections 512;
}

http {
    server_tokens off;
    fastcgi_ignore_client_abort off;
    keepalive_timeout 650;
    keepalive_requests 10000;

    gzip on;
    ..more gzip settings..

    server {
        charset utf-8;
        client_max_body_size 512M;
        listen 8080;
        rewrite_log on;

        root /app/web;
        index index.php;

        location / {
            try_files $uri /index.php?$args;
        }

        location ~ \.php$ {
            fastcgi_pass 127.0.0.1:9000;
            include /etc/nginx/fastcgi_params;
            fastcgi_keep_conn off;
            fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
        }
    }

    include /etc/nginx/conf.d/*.conf; # There are no extra conf files
}
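Note that no fastcgi timeouts are set explicitly, so the nginx defaults apply. Spelled out, the relevant directives would be (the values below are the documented defaults, shown for illustration):
fastcgi_connect_timeout 60s;   # establishing the connection to PHP-FPM
fastcgi_send_timeout 60s;      # between two successive writes to PHP-FPM
fastcgi_read_timeout 60s;      # between two successive reads; expiring here raises a 504
The 60s read timeout would also line up with the ~61s 504s above, independent of the App Engine proxy's own 60s limit.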
php-fpm (www.conf)
[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = ondemand
pm.process_idle_timeout = 10s
request_terminate_timeout = 45
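For comparison, a fixed-pool variant of the same config would look something like this (a sketch only; the numbers are illustrative, not tuned):
[www]
user = www-data
group = www-data
listen = 127.0.0.1:9000
pm = static
pm.max_children = 8      ; fixed worker pool instead of on-demand forking
pm.max_requests = 500    ; recycle each worker after this many requests
request_terminate_timeout = 45
With a static pool there is no fork-on-demand latency, and pm.max_requests recycles workers even when process_idle_timeout never fires.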
More information
Disabling keepalive is a bad idea, since Google App Engine constantly polls the container with health checks, and that generates lots of dead TIME_WAIT sockets (I tried it).
Before I set request_terminate_timeout there were lots of CLOSE_WAIT sockets instead of TIME_WAIT. Setting request_terminate_timeout = 45 helps in a way, in that the worker process gets killed and returns 200 after respawning. A lower terminate timeout just produces more 502s and fewer 504s.
process_idle_timeout is ignored because the sockets are not technically idle.
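A way to confirm whether the workers are genuinely busy rather than idle would be the FPM status page (a sketch; /status is an arbitrary path of my choosing):
; www.conf
pm.status_path = /status
# nginx, inside the server block
location = /status {
    fastcgi_pass 127.0.0.1:9000;
    include /etc/nginx/fastcgi_params;
}
Hitting /status during a siege would show how many processes are active, idle, or capped at max_children.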
Setting fastcgi_keep_conn on has no measurable impact on nginx's behavior.
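That may be expected: as I understand it, fastcgi_keep_conn on only reuses connections when paired with an upstream block that has a keepalive pool, which my config lacks. A sketch of that pairing (the upstream name and pool size are illustrative):
upstream php_backend {
    server 127.0.0.1:9000;
    keepalive 8;    # pool of idle connections kept open to PHP-FPM
}

location ~ \.php$ {
    fastcgi_pass php_backend;
    fastcgi_keep_conn on;   # tell PHP-FPM not to close the connection after each request
    include /etc/nginx/fastcgi_params;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}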