Hi,
On Wed, Aug 12, 2009 at 12:50:00PM -0700, James Hartshorn wrote:
> Hi,
>
> We run HAProxy on Amazon EC2 for HTTP load balancing. On Monday
> (August 11) we upgraded seven of our load balancers in two of our
> products to 1.3.20 from 1.3.15.8 (four servers, all of one product)
> and 1.3.18 (three servers, all of the other product). We kept the
> config files the same. We finished replacing the load balancers by
> 2300 UTC on Aug 11, and at about 0900 UTC Aug 12 the first cluster
> (the one upgraded from 1.3.15.8) started showing performance issues,
> enough to set off our monitoring systems. Response times were
> several seconds.
Please enable the stats page; it will show you a lot of useful information in such cases, most specifically the session rate and the number of concurrent sessions.
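For example, a minimal stats listener could look like this (the port and
credentials below are only placeholders, adjust them to your setup):

    listen stats 0.0.0.0:8081
        mode http
        stats enable               # turn on the statistics report
        stats uri /stats           # URI where the report is served
        stats refresh 10s          # auto-refresh the page every 10 seconds
        stats auth admin:changeme  # basic auth; pick real credentials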
> Logging on to one of the load balancers, I saw normal
> CPU and memory usage, but netstat -anp showed more than 30k lines,
> the majority in TIME_WAIT state.
TIME_WAIT is completely normal. Assuming your system runs with default settings (a 60s timeout on finwait/timewait), 30k TIME_WAIT sessions means you're getting about 500 connections per second (30000 / 60).
> For background, the load
> balancers each point to the same pool of about 60 servers, which at
> the time were handling about 20-30 sessions per server, with the
> servers reporting about 80 requests per second (nominally 60% of peak).
Is that 80 req/s cumulative or per server? It seems extremely low as a cumulative count, but if it's per server, it means 4800 req/s in total, which is approximately in the range of what we have observed on another site running on EC2, the limit most likely being caused by virtualization and/or shared CPU resources.
> At
> this point we put the old load balancers back into production and
> found them to be still working fine.
That's what I find strange then :-/
> At around 1200 UTC Aug 12 a
> nearly identical state occurred on the other set of load balancers (the
> ones upgraded from 1.3.18).
>
> If anyone can see any issues please let me know.
>
> I have pasted a representative haproxy.cfg file below:
> global
> #log 127.0.0.1 local0 info
> #log 127.0.0.1 local1 notice
> #log loghost local0 info
> maxconn 75000
> chroot /var/lib/haproxy
> user haproxy
> group haproxy
> daemon
> #debug
> #quiet
>
> defaults
> #log global
> mode http
> #option httplog
> option dontlognull
> option redispatch
> retries 3
> maxconn 75000
Your defaults and frontend maxconn values should be slightly lower than the global one, so that a single frontend can never fill the whole process. BTW, 75000 seems a bit optimistic for a virtualized environment...
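For instance, something along these lines (the numbers are only
illustrative, size them for your real traffic):

    global
        maxconn 20480        # hard limit for the whole process

    defaults
        maxconn 20000        # slightly lower, so one frontend alone
                             # can never exhaust the process limit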
> contimeout 5000
> clitimeout 50000
> srvtimeout 2000
Is there a reason for 50s on the client and only 2s on the server? I suspect that when your servers slow down, you're killing a lot of requests by sending 504 responses.
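If your servers can legitimately take longer than 2s under load, raising
the server timeout closer to the client one would avoid those 504s, for
example (values purely illustrative):

    contimeout  5000
    clitimeout 50000
    srvtimeout 50000    # match the client side instead of aborting at 2s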
> frontend openx *:80
> #log global
> maxconn 75000
> option forwardfor
> default_backend openx_ec2_hosted_http
>
> backend openx_ec2_hosted_http
> mode http
> #balance roundrobin
> balance leastconn
> option abortonclose
> option httpclose
> #remove the line below if not 1.3.20
> #option httpchk HEAD /health.chk
Why is there a special case for this line and 1.3.20? Are you sure you don't change it when you switch to another version? If so, that may be why your servers are flapping.
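As a reminder, "option httpchk" only changes the check method; health
checks run only against servers declared with "check". A sketch (the
server name and address are invented for the example):

    option httpchk HEAD /health.chk
    server web1 10.1.2.3:80 check inter 2000 rise 2 fall 3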
> timeout queue 500
Same here: 500ms for a queue seems very short (though it is at least consistent with the 2s server timeout).
> #option forceclose
Just in case you have enabled it: avoid forceclose, as you may reach a point where the system refuses to allocate a source port for haproxy to connect to the server.
(...)
Other than the points above, I don't see anything really wrong. Please do enable the stats and save a report. Check the "Dwn" and "Chk" columns for your servers; you might find they're flapping because they take too long to respond to health checks.
Regards,
Willy