I have several servers serving a single site.
The main server runs nginx and php-fpm; all the other servers run only php-fpm. nginx connects to the php-fpm on its own machine through a unix socket and to the others via TCP.
Roughly once an hour (not exactly, sometimes more often), something strange happens: every connection from nginx to the php-fpm servers times out. nginx cannot establish a connection to any of them.
2014/03/24 04:59:09 [error] 2123#0: *925153 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *926742 connect() to unix:/tmp/php-fpm.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://unix:/tmp/php-fpm.sock:", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *925159 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.2:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *923874 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2123#0: *925164 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *909392 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.3:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2124#0: *923098 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.5:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
2014/03/24 04:59:09 [error] 2125#0: *923309 upstream timed out (110: Connection timed out) while connecting to upstream, client: <<client ip removed>>, server: www.example.com, request: "GET /some/address/here HTTP/1.1", upstream: "fastcgi://192.168.1.4:9000", host: "www.example.com", referrer: "http://www.example.com/some/address/here"
Since this is a fairly busy site, the log fills up quickly with entries like the ones above.
This lasts about 10 to 15 seconds, and then everything goes back to normal.
Apart from these connection timeout errors, there don't seem to be any other errors.
I suspect the problem is in nginx, since it happens on all the php-fpm servers at the same time.
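To check that the failures really do hit every backend in the same second, the error log can be grouped by timestamp and upstream. This is only a minimal sketch in Python, assuming the log format shown above. The names count_upstream_errors and report are hypothetical helpers, not part of nginx.

```python
import re
from collections import Counter

# Matches the timestamp at the start of an nginx error-log line and the
# upstream it names, e.g. upstream: "fastcgi://192.168.1.5:9000".
LINE_RE = re.compile(
    r'^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \[error\].*?'
    r'upstream: "(?P<upstream>[^"]+)"'
)


def count_upstream_errors(lines):
    """Count error lines per (second, upstream) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("ts"), match.group("upstream"))] += 1
    return counts


def report(path):
    """Print one line per (second, upstream) pair, sorted by time."""
    with open(path, errors="replace") as fh:
        counts = count_upstream_errors(fh)
    for (ts, upstream), n in sorted(counts.items()):
        print(ts, upstream, n)
```

Calling report("/var/log/nginx/error.log") (the path from my error_log directive) should show whether every upstream, including the unix socket, starts failing within the same second or whether one of them fails first.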
What could cause this? And how can it be resolved?
My nginx configuration is:
user nginx;
worker_processes 4;
worker_rlimit_nofile 30000;
error_log /var/log/nginx/error.log warn;
pid /var/run/nginx.pid;

events {
    worker_connections 4096;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';
    access_log /var/log/nginx/access.log main;
    sendfile on;
    keepalive_timeout 5;
    fastcgi_buffers 256 4k;
    gzip on;
    gzip_disable "msie6";
    fastcgi_cache_path /dev/shm/caches/ levels=1:2 keys_zone=zoneone:4000m max_size=4000m inactive=30m;
    fastcgi_temp_path /var/www/tmp 1 2;
    fastcgi_cache_key "$scheme$proxy_host$request_uri";
    fastcgi_connect_timeout 3s;
    limit_req_zone $binary_remote_addr zone=limitone:200m rate=1r/s;
    limit_req_zone $binary_remote_addr zone=limitcomic:500m rate=40r/m;

    upstream partone {
        server unix:/tmp/php-fpm.sock;
    }

    upstream parttwo {
        server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
    }

    upstream parttre {
        server 192.168.1.2:9000 weight=8 max_fails=0 fail_timeout=2s;
        server 192.168.1.3:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.4:9000 weight=10 max_fails=0 fail_timeout=2s;
        server 192.168.1.5:9000 weight=10 max_fails=0 fail_timeout=2s;
    }

    ... stuff with server, locations and such...
}
As you can see, I don't even use all 5 servers in the same context.
nginx version: nginx/1.4.5