On Sat, 29 Sep 2007, Willy Tarreau wrote:
> On Sat, Sep 29, 2007 at 01:14:35PM +0200, Krzysztof Oledzki wrote:
>
> Yes I agree. When some servers are checked every 30 seconds, this can be a
> bit nasty. I intended to have two speeds for checks, a fast one used during
> transitions and a normal one. Basically, upon startup, or just after one
> failed health check or one success on a failed server, it would switch to
> fast checks (eg: inter 1000 instead of inter 10000). It would ensure that
> we could get rid of all these annoying things. Also, it would detect failures
> faster.
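A minimal sketch of the two-speed idea quoted above, for illustration only. The state enum, the function name and the fast_inter parameter are all hypothetical; none of them are part of the patch under discussion.

```c
#include <assert.h>

/* Hypothetical check states: a server is "in transition" at startup, after
 * a first failed check while up, or after a first success while down. */
enum check_state {
    CHK_STARTING,     /* no check result yet */
    CHK_STABLE_UP,    /* up, last checks all succeeded */
    CHK_GOING_DOWN,   /* up, but at least one recent failure */
    CHK_STABLE_DOWN,  /* down, last checks all failed */
    CHK_GOING_UP      /* down, but at least one recent success */
};

/* Returns the interval (ms) to wait before the next check: the normal
 * interval in stable states, the fast one during transitions. */
static int next_check_interval(enum check_state st, int inter, int fast_inter)
{
    assert(inter > 0 && fast_inter > 0);

    switch (st) {
    case CHK_STABLE_UP:
    case CHK_STABLE_DOWN:
        return inter;
    default:
        return fast_inter;
    }
}
```

With inter 10000 and a fast interval of 1000, a server that has just failed one check would be rechecked after 1000 ms instead of 10000 ms.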
I have also been thinking about this for a while. It may also be a good idea to add a restart-delay parameter, so that when haproxy is restarted the new process can finish most of its checks before replacing the old one. Anyway, I'm not sure whether it can be done easily...
>>>> Calculations were moved from cfgparse.c to
>>>> checks.c. There is a new function start_checks() and now it is not called
>>>> when haproxy is started in MODE_CHECK.
>>>>
>>>> With this patch it is also possible to set a global 'spread-check'
>>>> parameter. It takes a percentage value (1..50, probably something near
>>>> 5..10 is a good idea) so haproxy adds or removes that many percent to the
>>>> original interval after each check. My test shows that with 18 backends,
>>>> 54 servers total and 10000ms/5% it takes about 45m to mix them completely.
>>>
>>> I think that we should be *very* careful when subtracting random
>>> percentage.
>>> It is very easy to go down to zero that way, and have a server bombed by
>>> health-checks.
>>
>> It is not possible as spread-check accepts only 1..50, so in a worst case
>> this time should be (server->inter/2)+1.
>
> OK fine. I thought I saw something in the style of "inter + random(100) - 50"
> but maybe I just confused it with something else. In this case, I find it normal
> that we would accept 0 for the spread-check. It would simply disable it but in
> a more convenient way.
This value is in percent and the calculation is performed using ints (not floats), so:
+ if (global.spread_check) {
Do something only when spread_check is enabled (>0)
+ rv = s->inter*global.spread_check/100;
Calculate the dedicated spread-check value for a server from the global percent representation. In a corner case it is for example:
+ rv -= (int) (2*rv*(rand()/(RAND_MAX+1.0)));
Get a random value in range 0..(2*rv-1) and subtract it from rv:
+ tv_ms_add(&t->expire, &t->expire, s->inter+rv);
Final value according to s->inter:
As you may have noticed, it even prefers larger values over lower ones, which should not be noticeable with more reasonable intervals like 1000ms. I spent some time on this, so I hope it works in all corner cases. BTW: I hope no one is crazy enough to set up such small values! ;)
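The calculation walked through above can be reproduced in isolation with a small sketch. The helper names spread_delay() and spread_delay_rand() are hypothetical and not part of the patch; the random fraction is passed in as a parameter so that the bounds can be checked deterministically.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper mirroring the integer arithmetic of the patch lines.
 * inter:        check interval in ms
 * spread_check: global percentage, 0 (disabled) to 50
 * u:            random fraction in [0, 1), i.e. rand()/(RAND_MAX+1.0)
 * Returns the delay in ms until the next check. */
static int spread_delay(int inter, int spread_check, double u)
{
    int rv = 0;

    assert(inter > 0);
    assert(spread_check >= 0 && spread_check <= 50);
    assert(u >= 0.0 && u < 1.0);

    if (spread_check) {
        rv = inter * spread_check / 100;  /* maximum deviation in ms */
        rv -= (int)(2 * rv * u);          /* subtract 0..(2*rv-1) */
    }
    return inter + rv;
}

/* Same computation, drawing the fraction from rand() as the patch does. */
static int spread_delay_rand(int inter, int spread_check)
{
    return spread_delay(inter, spread_check, rand() / (RAND_MAX + 1.0));
}
```

With inter=10000 and spread_check=5 the result stays within 9501..10500 ms, which shows the slight preference for larger values. With spread_check=50 the smallest possible result is inter/2+1 for an even interval, so the delay cannot reach zero.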
Allowing spread-check=0 only makes sense when we set the default value to something > 0. Maybe this could indeed be a good idea; with an initial value like 5% it should not break anything.
>>> BTW, I'm suddenly thinking about something: when I build the servers map,
>>> I use all the server weights and arrange them so that all their occurrences
>>> are as far as possible from each other, while respecting their exact
>>> weight.
>>> It works very well. I'm realizing that we need exactly the same
>>> principle
>>> for the checks. The "weight" here is simply the frequency at which the
>>> servers
>>> must be checked, which is 1/interval. So by using the exact same functions
>>> as
>>> is used to build the servers map, we could build a health-check map that
>>> way :
>>>
>>> foreach srv in all_servers :
>>> weight(srv) = max(all inter) / inter(srv)
>>>
>>> build_health_map(all_servers)
>>>
>>> Afterwards, we would just have to cycle through this map to start the
>>> checks.
>>> It would even remove the need for one timer per server in the timers table.
>>> The immediate advantage is that they would all be spread apart upon
>>> startup,
>>> and we would not need the random anymore (even though it would not harm to
>>> conserve the option).
>>>
>>> What do you think about this ?
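The weight step of the pseudocode quoted above could look roughly like this in C. The function check_weights() is a hypothetical name used only for illustration, and build_health_map() is deliberately not sketched, since its algorithm is not given in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: for n servers with check intervals inter[] (ms),
 * fills weight[i] = max(all inter) / inter[i], so that a server checked
 * more often gets a proportionally larger weight. Integer division
 * truncates, so intervals that do not divide the maximum evenly lose
 * some precision. Returns the maximum interval found. */
static int check_weights(const int *inter, int *weight, size_t n)
{
    int max_inter = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(inter[i] > 0);
        if (inter[i] > max_inter)
            max_inter = inter[i];
    }
    for (i = 0; i < n; i++)
        weight[i] = max_inter / inter[i];

    return max_inter;
}
```

For example, intervals of 50 ms, 1000 ms and 60000 ms give weights of 1200, 60 and 1, so the size of such a map depends heavily on the ratio between the largest and smallest interval.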
>>
>> I need to think about this for a moment
>
> Of course. I explained it quickly and it's not easy. I developed the algorithm
> using a small program which used to draw the map for various values, but I don't
> know where I put it.
>
> :-)
>
> Yes, that's a good point. In fact, we can limit the interval to some reasonable
> small values (eg: not below 50 ms). But even with that, some people would still
> use a 1 minute interval, leading to disproportions of 60/.05 = 1200. This is
> not enormous, but quite big yet.
>
> Oh I have no problem with that. I was just thinking out loud. I think I will
> merge your patch in 1.3.12, but it does not prevent us from thinking about
> evolutions ;-)
Of course, all I was trying to say is that it may not be worth spending too much time on this problem. The goal was to prevent sending checks simultaneously and IMHO it was achieved.
Best regards,
Krzysztof Olędzki
Received on 2007/09/29 17:26