On Sat, 16 Feb 2008, Willy Tarreau wrote: <CUT>
> Do you think you really want to cover a large connect time *and* a large
> response time ?
Not large but IMO it is good to allow just one SYN retransmission.
> Do not forget that large connect times are the prime signs
> of failures if they happen often, and that if they happen occasionally,
> they are just caused by a lost packet, and this lost packet may as well
> happen in the response. That's why I think that timeout.check should cover
> the whole check, which in your case should mostly look like this :
>
> <-- connect --> | <---------------- response ---------------->
>       1ms                           5 seconds
>
> If you set timeout.check to 5 seconds, you cover the case above without lost
> packet. But now, let's consider a lost packet in either part :
>
> <------- connect --------> | <---------------- response ---------------->
>         3 seconds                            5 seconds
>
> <-- connect --> | <----------------------- response ---------------------->
>       1ms                                8 seconds
If a packet is lost once a connection is already established, it doesn't always mean that the retransmission takes 3s. I would rather say that in most cases it takes much less time. So it is possible for it to look like this:
<-- connect --> | <----------------------- response ---------------->
      1ms                  ~RTT + 5 seconds (< 8 seconds)
The main question is whether it is better to have a precisely defined timeout for the whole check (connect + response read) or a precisely defined, guaranteed time for a server to deal with a check request.
Let's assume that a server is quite busy, needs 5s to answer, and check.timeout is set to 7s:
0. No packet lost:
<-- connect --> | <----------------------- response -------------------->
      1ms                               5 seconds

1. SYN packet lost (check.timeout covers connect + response read):
<--------------------- check.timeout: 7 seconds ----------------> FAILED
<------- connect --------> | <---------------- response ---------------->
        3 seconds                            5 seconds

2. SYN packet lost (check.timeout covers response read):
                             <--------------------- check.timeout: 7 seconds -(...)->
<------- connect --------> | <---------------- response ----------------> OK
        3 seconds                            5 seconds
IMHO, scenario #1 is better.
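As a rough illustration of the two interpretations compared above, here is a minimal Python sketch. The function name and the `covers_connect` flag are made up for this example and are not HAProxy options; the numbers are the ones from the scenarios (connect 1ms or 3s, response 5s, check.timeout 7s).

```python
def check_result(connect_s, response_s, check_timeout_s, covers_connect):
    """Return "OK" or "FAILED" for one health check.

    covers_connect=True  -> check.timeout bounds connect + response read
    covers_connect=False -> check.timeout bounds the response read only
    (hypothetical flag, used here only to compare both interpretations)
    """
    elapsed = response_s + (connect_s if covers_connect else 0.0)
    return "OK" if elapsed <= check_timeout_s else "FAILED"


# Scenario 0: no packet lost
print(check_result(0.001, 5.0, 7.0, covers_connect=True))
# Scenario 1: SYN lost, timeout covers the whole check (3 + 5 > 7)
print(check_result(3.0, 5.0, 7.0, covers_connect=True))
# Scenario 2: SYN lost, timeout covers the response read only (5 <= 7)
print(check_result(3.0, 5.0, 7.0, covers_connect=False))
```

The sketch does not model the connect phase in scenario 2; it assumes that phase is bounded separately by some connect timeout.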
So, the only problem is to make sure that the timeout used for check-connect is both not too short and not too long. What we have currently is min("timeout connect", "inter"). Maybe this one is wrong? If it is set too high, there is no way one can fix it by playing with timeout.check. Even if it covers the full check.
As I stated above, we should allow by default for 1 retransmission of a SYN. So *maybe* we can just hardcode it to a quite safe ~3.5s? Or we can add another variable (you probably are starting to hate me ;) implicitly initialized to 3.5s (safe value) that one can change to: ~0.5s (no SYN retransmission is allowed) or >= ~8.5s (more SYN retransmissions are allowed).
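To make the ~0.5s and ~3.5s values above concrete, a small sketch, assuming the first SYN retransmission happens 3 seconds after the initial SYN (the figure used in this thread). Later retransmission times depend on the TCP stack, so they are left as a parameter. `syn_retransmissions_allowed` is a hypothetical helper, not part of HAProxy.

```python
def syn_retransmissions_allowed(connect_timeout_s, retransmit_at_s=(3.0,)):
    """Count SYN retransmissions sent before the connect timeout expires.

    retransmit_at_s: seconds after the first SYN at which the stack
    retransmits. Only the 3s value comes from this thread (assumption);
    pass your own stack's schedule for later retransmissions.
    """
    return sum(1 for t in retransmit_at_s if t < connect_timeout_s)


print(syn_retransmissions_allowed(0.5))  # 0: no retransmission fits
print(syn_retransmissions_allowed(3.5))  # 1: one retransmission fits
```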
If timeout.check is not set:
- a health-check never lasts longer than server.inter (current situation)
If timeout.check is set:
<CUT>
> The problem you mention about fastinter being too short for the whole test
> is fine. While using it for the connect only does not make me feel very
> comfortable, I still think this is needed to detect unplugged servers on
> which no connection may be established anymore, otherwise we will wait too
> long. I would propose to maintain <inter> as an upper bound for all tests
> anyway, and clearly state that <fastinter> is just a speedup to detect
> rapidly changing server states in easy to detect situations such as connection
> refused or server off. It will not affect the check total time nor the
> applicative part of the check. In this case, here's the slightly modified
> proposal :
>
> - a health-check never lasts longer than server.inter (current situation)
Glad we agree on this one. ;)
> - a health-check never lasts longer than timeout.check
> - a health-check never waits more than inter|fastinter to connect
> - a health-check never waits more than timeout.connect to connect
IMHO fastinter set for example to 1s may be too short even for a connect timeout, as we want to allow a SYN retransmission.
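A minimal sketch of the bounds in the proposal quoted above, under one reading of the four rules; it is not HAProxy's code. The function and parameter names are hypothetical, `timeout_check=None` stands for timeout.check not being set, and the inter and timeout.connect values are purely illustrative.

```python
def check_limits(inter, timeout_connect, timeout_check=None, fastinter=None):
    """Return (connect_limit, total_limit) in seconds for one health check.

    - the connect phase waits at most min(timeout.connect, inter|fastinter)
    - the whole check lasts at most min(inter, timeout.check)
    fastinter, when given, replaces inter for the connect bound only.
    """
    interval = fastinter if fastinter is not None else inter
    connect_limit = min(timeout_connect, interval)
    total_limit = inter if timeout_check is None else min(inter, timeout_check)
    return connect_limit, total_limit


# fastinter = 1s caps the connect phase at 1s, below the ~3s needed
# for one SYN retransmission.
print(check_limits(inter=10.0, timeout_connect=5.0,
                   timeout_check=7.0, fastinter=1.0))  # (1.0, 7.0)
```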
Best regards,
Krzysztof Olędzki
Received on 2008/02/17 22:55