From: Willy Tarreau <w#1wt.eu>
Date: Sat, 16 Feb 2008 08:02:20 +0100

On Sat, Feb 16, 2008 at 01:42:49AM +0100, Krzysztof Oledzki wrote:
> >I would like to propose something more in sync with your version, but
> >slightly adapted for bizarre situations. In the examples, what I call
> >"inter" will be whatever inter (normal, fast, etc...).
> >
> >- most common setups will use medium connect (about 4-5 seconds) to cover
> > one lost packet, and short inter (about 1s) to quickly detect changes.
>
> OK.
>
> >- some setups will use medium connect with large inter (60s) in order not
> > to flood the server with checks, because we're not interested in quick
> > changes. However, as you noted, it's not fair to permit that long for
> > a connection attempt to succeed, and we should at least kick the check
> > off if we know that normal traffic will not succeed (meaning kick it
> > off after timeout.connect anyway).
>
> Right.
>
> >- some setups use large connect (60s) at least because of queue. We have to
> > support them. Some of them will use short inter (1s) and we don't want
> > that interval to shift to 1 minute because of the connect, and for others
> > with very large inter (60s), we would still like to be able to stop a check if
> > it does not succeed within a few seconds.
>
> Exactly.
>
> >So I would say that we should always bound the connect timeout to
> >min(timeout.connect, inter).
>
> Yes, we could, but...
>
> >Most common setups (first case) will remain unaffected. Second case will
> >be affected but will get back to reality by testing the real service
> >instead of something which might present an apparently up server which
> >does not work for traffic. Third case will remain unaffected
> >(connect=60, inter=1), but last case will not get any better (60, 60).
>
> ... but this is not so simple. A health-check is not only a successful
> connection.

I'm well aware of that. I was just thinking about limiting the whole check, which implies the connection too.

> It is also a work a server has to fulfill to deal with a
> request and this work takes some time. If server is loaded (but not
> overloaded) and a health-check-script is something more than "return '200
> OK'" then it may take some seconds. Especially if script checks for
> example if it is able to read files from a NFS/CIFS/iSCSI storage, connect
> to a database (possibly to wait for a free slot) and to perform some
> selects when it checks if it is supposed to return a 200 or rather a 4xx/5xx
> code when someone scheduled a downtime, ...

OK. This is exactly what timeout.check is supposed to cover.

> >That's exactly where timeout.check is needed. What does it basically mean ?
> >It means that we don't want a check to run for too long. If it does not
> >*complete* within the expected time, kill it. Having it cover the whole
> >sequence reduces the risk of delay shifts due to additions of many small
> >numbers (what if the server responds 1 byte per second after all ?).
>
> But we also want to give a health-check some chances to finish.

Yes, that's what I proposed first (not taking <inter> into account), but as you reminded me, we want a fast detection of failing servers too.

> >It also allows us to *reduce* the allowed connect time to the server for
> >health checks without affecting timeout.connect which is initially for
> >traffic.
> >
> >So with your example values, we still remain at 34 seconds, but we can
> >now reduce timeout.check to make it more meaningful :
> >
> > inter = 15s
> > fastinter = 1s
> > timeout connect = 4s
> > timeout check = 1s (we don't care about retransmits, fall is there for
> > that)
> > fall = 4
> >
> >we get :
> >
> > "min(check,connect,inter) + (fall-1)*min(check,connect,fastinter) +
> > fall*min(inter, connect, check)"
> > Total: 1 + 3*1 + 4*1 = 8s
> >
> >That way, existing setups benefit from the fix for the second case, and
> >new ones can play with timeout.check to enforce the timeout on their
> >checks without depending on other counter-intuitive timeout calculations.
> >
> >Is that OK for you that way ? At least it is for me since I see how I
> >can configure my proxies with this, and I also see how I can explain
> >to users how to use it and what each parameter does. This simply boils
> >down to this :
> >
> > - a health-check never lasts longer than inter|fastinter
> > - a health-check never lasts longer than timeout.check
> > - a health-check never takes longer than timeout.connect to establish
>
> In my situation, when a health-check connection was successfully
> established, it typically requires 0-5 more seconds to finish a test so
> 10s timeout seems to be safe. I would like to detect when a server is down
> or broken ASAP but at the same time I don't want to kick it out if it (or
> a database, a storage, a memcache engine, etc) is saturated for a (short)
> moment. If something went wrong I like to repeat the test soon but also
> not too soon to prevent false alarms and not to flood a server (fastinter
> = ~1s) so I need timeout.check to be large. Not very large but large
> enough. Bounding everything to fastinter is not going to work for me. :(

OK, I think this is precisely the problem: inter and fastinter are *intervals*, not timeouts. inter has been the timeout since the very beginning and should remain so (I think for now). fastinter is just a speedup to quickly detect that a server is failing.

> Bounding it to timeout.check can be fine but only if there were no connect
> timeout because if it happened, server may run out of time to execute a
> health-check-script. So, this is the reason why I designed it that way.

Do you think you really want to cover a large connect time *and* a large response time ? Do not forget that large connect times are the prime signs of failures if they happen often, and that if they happen occasionally, they are just caused by a lost packet, and this lost packet may as well happen in the response. That's why I think that timeout.check should cover the whole check, which in your case should mostly look like this :

  <-- connect --> | <---------------- response ---------------->
       1ms                         5 seconds

If you set timeout.check to 5 seconds, you cover the case above without a lost packet. But now, let's consider a lost packet in either part :

  <------- connect --------> | <---------------- response ---------------->
          3 seconds                             5 seconds

  <-- connect --> | <----------------------- response ---------------------->
       1ms                         8 seconds

In both situations, you need to tolerate 8 seconds, which you do with your 10s check timeout. And for the normal user, there is no need to think long about what finally covers the connect time and how to calculate the maximum time between two checks. Also, this is how it was working with inter before timeout.check, and what we really tried to fix is too large a time granted to a check to finish when intervals were huge.
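To make the three timelines above concrete, here is a minimal Python sketch. It assumes the model described in this message, where timeout.check covers the whole check (connect plus response). The function name check_fits and the scenario labels are made up for illustration and are not part of haproxy.

```python
def check_fits(connect_s, response_s, timeout_check_s):
    """Return True if connect + response completes within timeout.check.

    Assumption: timeout.check bounds the whole check, as argued above.
    """
    return connect_s + response_s <= timeout_check_s


# (connect, response) durations in seconds, read off the three timelines.
scenarios = {
    "no lost packet": (0.001, 5.0),
    "lost packet during connect": (3.0, 5.0),
    "lost packet during response": (0.001, 8.0),
}

TIMEOUT_CHECK = 10.0  # the 10s check timeout mentioned in the thread

results = {
    name: check_fits(connect, response, TIMEOUT_CHECK)
    for name, (connect, response) in scenarios.items()
}

for name, ok in results.items():
    print(f"{name}: {'fits' if ok else 'times out'}")
```

With a 10s check timeout all three cases fit. A hypothetical 6s value would only cover the case without a lost packet.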

The problem you mention about fastinter being too short for the whole test is a valid one. While using it for the connect only does not make me feel very comfortable, I still think this is needed to detect unplugged servers on which no connection may be established anymore, otherwise we will wait too long. I would propose to maintain <inter> as an upper bound for all tests anyway, and clearly state that <fastinter> is just a speedup to detect rapidly changing server states in easy to detect situations such as connection refused or server off. It will not affect the check total time nor the applicative part of the check. In this case, here's the slightly modified proposal :

 - a health-check never lasts longer than server.inter (current situation)
 - a health-check never lasts longer than timeout.check
 - a health-check never waits more than inter|fastinter to connect
 - a health-check never waits more than timeout.connect to connect

That way, we can easily play with timeout.check to set the cursor low enough when either inter or timeout.connect needs to be huge, we fix existing broken setups with both connect and inter very high, and we are still able to detect unplugged servers thanks to fastinter.
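The four rules of the modified proposal can be sketched as two small functions. This is only an illustration of the proposal as worded here, not haproxy code: the function names are hypothetical, all values are in seconds, and None stands for "timeout.check not set".

```python
def _lowest(*bounds):
    """Smallest bound, ignoring parameters that are not set (None)."""
    return min(b for b in bounds if b is not None)


def connect_deadline(interval, timeout_connect, timeout_check=None):
    """Longest time a check may wait for the connection to establish.

    `interval` is inter or fastinter, whichever currently applies.
    The connect phase is part of the check, so timeout.check bounds it too.
    """
    return _lowest(interval, timeout_connect, timeout_check)


def check_deadline(inter, timeout_check=None):
    """Longest time the whole check (connect + response) may last."""
    return _lowest(inter, timeout_check)


# Hypothetical setup with both connect and inter very high.
INTER, FASTINTER, CONNECT, CHECK = 60, 1, 60, 10

print(connect_deadline(INTER, CONNECT, CHECK))      # normal interval
print(connect_deadline(FASTINTER, CONNECT, CHECK))  # fast interval
print(check_deadline(INTER, CHECK))                 # whole check
print(check_deadline(INTER))                        # timeout.check unset
```

Note that fastinter only appears in the connect bound, so it never shortens the applicative part of the check.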

Thus, the total time to detect an unplugged server will be :

  min(inter,connect,check) + (fall-1)*min(fastinter,connect,check) = a few seconds

And it still tolerates long server response times when there are complex things to check on the backend.
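As a worked example of the detection-time formula above, the following Python sketch plugs in the values quoted earlier in the thread (inter=15s, fastinter=1s, timeout connect=4s, fall=4), once with timeout check=1s and once with the 10s check timeout discussed for slow health-check scripts. The function name is made up for illustration.

```python
def unplugged_detection_time(inter, fastinter, connect, check, fall):
    """Worst-case time (seconds) to mark an unplugged server down.

    Implements min(inter,connect,check) + (fall-1)*min(fastinter,connect,check):
    the first failed check is bounded using inter, the remaining fall-1
    failed checks are bounded using fastinter.
    """
    first = min(inter, connect, check)
    following = min(fastinter, connect, check)
    return first + (fall - 1) * following


short_check = unplugged_detection_time(inter=15, fastinter=1, connect=4, check=1, fall=4)
long_check = unplugged_detection_time(inter=15, fastinter=1, connect=4, check=10, fall=4)

print(f"timeout check = 1s  -> {short_check}s")
print(f"timeout check = 10s -> {long_check}s")
```

Under these assumptions, raising timeout.check from 1s to 10s only moves the result from 4s to 7s, because the connect phase stays bounded by timeout.connect and fastinter.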

> Finally, this "keep the old behavior if timeout.check is not set" was
> supposed to keep old configs 100% valid but give a chance to tune
> everything if someone finds a time to read a new manual and decides to use
> the additional "check timeout".

I know, but even after reading the manual, things are complex to understand, because it is never natural for a normal human being that if we fix an upper bound to the response time, it implies that another bound exists for the connect time, which does not depend on it but on two other variables.

Also, it's very difficult to evaluate a total time when you have to add lots of functions of multiple variables. What people really want is not just to finely tune each step of the check, but also to ensure that their detection time will be short. Having the ability to limit this total time by multiplying only two variables to get the worst case is very important. Here, an admin can commit himself saying that in the worst case, the failure detection will not take longer than fall*timeout.check.

Best regards,
Willy

Received on 2008/02/16 08:02

Re: Changes in the check timeouts