
From: Willy Tarreau <w#1wt.eu>
Date: Thu, 7 May 2009 06:26:58 +0200

On Mon, May 04, 2009 at 11:47:10AM +0200, Nicolas MONNET wrote:
> I'm experiencing a problem since updating to 1.3.17, whereby checks
> periodically see a backend service as down, one at a time, but for all 3
> checks; and it picks right up again on the next check. Not sure what
> info I could get you.

Generally this is caused by overloaded servers which cannot respond at all because of the amount of work waiting in their backlog queue. Please add "maxconn 50", for instance, on each "server" line to see if it changes anything. Also, what type of server are you using? For instance, mongrel only accepts one request at a time and will not respond to any health check while it is processing a long request, so with it you need "maxconn 1".
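
As a rough illustration of that suggestion, the "server" lines might look like the sketch below. The section name, server names, addresses and ports are hypothetical placeholders and are not taken from your configuration; timeouts and other settings are left out.

```haproxy
# Hypothetical TCP farm: names, addresses and ports are placeholders.
listen app_tcp 0.0.0.0:8000
    mode tcp
    # maxconn caps the number of concurrent connections sent to each server;
    # excess connections wait in haproxy's queue instead of piling up in the
    # server's backlog, which leaves room for health checks to get through.
    server srv1 192.168.0.11:8000 check maxconn 50
    server srv2 192.168.0.12:8000 check maxconn 50
    # For a server that handles one request at a time (e.g. mongrel):
    # server srv3 192.168.0.13:8000 check maxconn 1
```

The value 50 is only a starting point to see whether the spurious check failures go away; the right limit depends on what the servers can actually sustain.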

> One question: couldn't it be possible to have redispatch work for TCP
> connections?

It does. However, you have one particular setting: you're using "balance source" in your TCP config. That means that when the connection is redispatched, the LB algorithm is applied again, and as long as the server is still seen as up you can only land on that same server, because the size of the farm has not changed. There are two workarounds for this:

1. Don't use "balance source" when not needed :-)
2. Add enough retries to cover the time needed to detect the server as down, taking into account that each attempt waits at least 1 second.

For the second solution, you can combine "inter" and "fastinter" to lower the failure detection time. For instance, "inter 5s fastinter 1s fall 2" will take 5 + 2*1 = 7s to see the server as down. So with at least 8 retries it should be OK. The redispatch will occur once the server has been taken out of the farm, so the source hash algorithm will bring you to another server.
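
Put together, the second workaround could look roughly like the fragment below. Again, the section name, server names, addresses and ports are hypothetical placeholders, and timeouts are omitted for brevity.

```haproxy
# Hypothetical TCP farm: names, addresses and ports are placeholders.
listen app_tcp 0.0.0.0:8000
    mode tcp
    balance source
    # Allow the connection to be sent to another server once the
    # hashed one has been marked down.
    option redispatch
    # Each attempt waits at least 1 second, so 8 retries outlast the
    # 5 + 2*1 = 7s failure detection time configured below.
    retries 8
    server srv1 192.168.0.11:8000 check inter 5s fastinter 1s fall 2
    server srv2 192.168.0.12:8000 check inter 5s fastinter 1s fall 2
    server srv3 192.168.0.13:8000 check inter 5s fastinter 1s fall 2
```

If the check intervals or "fall" are changed, the retry count should be adjusted so that the retries still last longer than the detection time.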

Regards,
Willy

Received on 2009/05/07 06:26

Re: 1.3.17 in TCP mode sees dead servers (but they're not)