Re: Timed cookies

From: Krzysztof Oledzki <ole#ans.pl>
Date: Sun, 21 Oct 2007 15:52:54 +0200 (CEST)

On Sun, 21 Oct 2007, Willy Tarreau wrote:

> Hi Krzysztof,

Hi Willy,

> I have several comments on your patch (which is globally correct).
Thanks.

> On Sun, Oct 21, 2007 at 03:08:15AM +0200, Krzysztof Oledzki wrote:
>>> Be careful, I don't want to add new features to 1.2.17, or minor
>>> features only. It's basically in fixes-only mode right now. However,
>>> adding 2 columns in the stats page in 1.3 is OK for me.
>>
>> Funny, funny. I have been working on similar functionality for some time
>> now. I was planing to push this patch yesterday (read as: before the
>> midnight) but my initial version only calculated time since last status
>> change. It was nearly a trivial joke to add what you described. So here it
>> is, with small delay required by additional changes. ;)
>>
>> This patch implements new statistics for SLA calculation by adding new
>> filed 'Dwntime' with total down time since restart (both HTTP/CSV) and
>> extending status filed (HTTP) or inserting a new one (CVS) with time
>> showing how long each server/backend is in a current state. Additionaly,
>> down transations are also calculated and displayed for backends, so it is
>> possible to know how many times selected backend was down, generating "No
>> server is available to handle this request." error.
>>
>> New information are presentetd in two different ways:
>> - for HTTP: a "human redable form", one of "100000d 23h", "23h 59m" or
>> "59m 59s"
>
> Please compact the time as much as possible. Simply remove spaces. Also,
> you can compact the number of days. I even think that we can report the
> number of days in digital format. Eg: 12.5d.

Right. If that is fine for you I would like to keep the 24855d23h/23h59m/59m59s format since it is hard to tell what does 12.3d exactly mean. IMHO 12d7h looks better than 12.3d if you need to check if it is what you expect.

> Also, keep in mind that with unix time, you'll never be able to report
> more than 24855 days.

Only when your time is 32bit, not 64. However, I think that keeping here a space for 5 chars is enough even in the future (99999d=273y) so I will shrink it, like you described.

>> - for CSV: seconds
>> I believe that seconds resolution is enough.
> Yes, for that it's enough.

OK.

>> As there are more columns in the status page I decided to shrink some
>> names to make more space:
>> - Check -> Chk
>> - Down -> Dwn
>> - Weight -> Wght
>
> OK, however, I would like the order of the columns not to change, and
> their name to remain unchanged in the CSV output. The reason is that
> people write scripts to convert the output into whatever monitoring
> solution they use, so if we extend the format, we must not break their
> tools. That means that additional fields must append to the right (at
> least in the CSV output).

OK.

>> Making described changes I also made some improvements and fixed some
>> small bugs:
>> - do not set server down if it is down already
>> - do not set server up if it is up already
>> - don't increment s->health above s->rise
>
> You mean above s->rise+s->fall I presume ? Because from what I have in
> mind, the server is down below rise and up between rise and rise+fall.

Exactly, a c&p error. Should be "s->health<s->rise+s->fall-1", as in the patch.

>> - fix colspan in multiple places (mostly broken by my previous patch)
>
> OK, I had already fixed the number of columns as the first post-1.3.13
> patch, but I did not notice that colspan was wrong.

OK, I will try to extract this patch from the git repository to prevent conflicts.

>> To prevent future mistakes I also added new 'END' field with '.', so now
>> it is easy to check if there are no fields missing after new ones were
>> added.
>
> Hmmm I don't like that very much, because we basically dump garbage on
> the output just in order to catch our programming mistakes more easily.
> I can understand why you find it useful, but I think we can control that
> in many other ways. I found that counting the commas was enough to
> confirm that the status field was missing. We could even write a simple
> awk/perl script to ensure that a CSV output is correct.

OK, removed.

>> diff -Nur haproxy-1.3.13-orig/include/common/time.h haproxy-1.3.13-sla/include/common/time.h
>> --- haproxy-1.3.13-orig/include/common/time.h 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/include/common/time.h 2007-10-20 22:12:50.000000000 +0200
>> @@ -27,6 +27,11 @@
>> #include <common/config.h>
>> #include <common/standard.h>
>>
>> +#define SEC hz
>> +#define MINUTE (60 * SEC)
>> +#define HOUR (60 * MINUTE)
>> +#define DAY (24 * HOUR)
>
> Please do not write macros which are specific to one function's context.
> In this case, it will only work in the function which has an "hz" variable.
> I'd prefer that you define SEC to 1 and use "10*SEC*hz" in your function
> when you need the conversion.

OK.

+char *human_time(unsigned, int);
> Please name your arguments above. I have no way to know how I can use
> this function.

OK.

>> diff -Nur haproxy-1.3.13-orig/include/proto/backend.h haproxy-1.3.13-sla/include/proto/backend.h
>> --- haproxy-1.3.13-orig/include/proto/backend.h 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/include/proto/backend.h 2007-10-21 01:42:20.000000000 +0200
>> @@ -39,7 +39,7 @@
>>
>> void recount_servers(struct proxy *px);
>> void recalc_server_map(struct proxy *px);
>> -
>> +unsigned be_downtime(struct proxy *);
>
> same here.

OK.

>> diff -Nur haproxy-1.3.13-orig/include/proto/server.h haproxy-1.3.13-sla/include/proto/server.h
>> --- haproxy-1.3.13-orig/include/proto/server.h 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/include/proto/server.h 2007-10-21 01:41:32.000000000 +0200
>> @@ -31,7 +31,7 @@
>>
>> #include <proto/queue.h>
>>
>> -
>> +unsigned srv_downtime(struct server *);
>
> and here.

OK. :)

>> diff -Nur haproxy-1.3.13-orig/include/types/proxy.h haproxy-1.3.13-sla/include/types/proxy.h
>> --- haproxy-1.3.13-orig/include/types/proxy.h 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/include/types/proxy.h 2007-10-21 02:22:26.000000000 +0200
>> @@ -111,6 +111,11 @@
>> unsigned int maxconn; /* max # of active sessions on the frontend */
>> unsigned int fullconn; /* #conns on backend above which servers are used at full load */
>> struct in_addr except_net, except_mask; /* don't x-forward-for for this address. FIXME: should support IPv6 */
>> +
>> + unsigned down_trans; /* up-down transitions */
>> + unsigned down_time; /* total time the proxy was down */
>> + time_t last_change; /* last time, when the state was changed */
>
>
> Use a long instead of a time_t for last_change, because it's
> awkward to call time() every time you need it. There is already
> a variable for this : "now". It is a timeval, so everywhere you
> need the current time, you can simply use "now.tv_sec". It is
> really cleaner IMHO because it is guaranteed not to change during
> a "cycle". Also, it avoids many useless syscalls.

Right, as always, I did not know that such thing exists. ;) Anyway, I am not sure about this time_t. AFAIK the new standard is to use time_t everywhere where possible since it will allow to switch from 32 to 64 bits in the future. I just checked, and on Linux, struct timeval defines tv_sec as time_t:

struct timeval {

         time_t          tv_sec;         /* seconds */
         suseconds_t     tv_usec;        /* microseconds */
};

For FreeBSD6 it is little more complicated:

struct timeval {
#ifdef __arm__

         time_t tv_sec;
#else

         long tv_sec; /* seconds (XXX should be time_t) */ #endif

         suseconds_t tv_usec; /* and microseconds */ };

So, I would rather keep it as time_t. What do you think?

>> diff -Nur haproxy-1.3.13-orig/include/types/server.h haproxy-1.3.13-sla/include/types/server.h
>> --- haproxy-1.3.13-orig/include/types/server.h 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/include/types/server.h 2007-10-20 20:06:15.000000000 +0200
>> @@ -86,10 +86,14 @@
>> unsigned wscore; /* weight score, used during srv map computation */
>>
>> unsigned failed_checks, down_trans; /* failed checks and up-down transitions */
>> + unsigned down_time; /* total time the server was down */
>> + time_t last_change; /* last time, when the state was changed */
>
> same here.

>> diff -Nur haproxy-1.3.13-orig/src/backend.c haproxy-1.3.13-sla/src/backend.c
>> --- haproxy-1.3.13-orig/src/backend.c 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/src/backend.c 2007-10-21 02:12:57.000000000 +0200
>> @@ -16,6 +16,7 @@
>> #include <stdlib.h>
>> #include <syslog.h>
>> #include <string.h>
>> +#include <time.h>
>
> you don't need time.h anymore :-)

Indeed.

>> @@ -717,6 +718,13 @@
>> return 0;
>> }
>>
>> +unsigned be_downtime(struct proxy *px) {
>> +
>> + if (px->srv_act || px->srv_bck)
>> + return px->down_time;
>> +
>> + return (unsigned)(px->down_time + time(NULL)-px->last_change);
>
> this translates into :
> + return (unsigned)(px->down_time + now.tv_sec - px->last_change);

True, true. :)

>> diff -Nur haproxy-1.3.13-orig/src/cfgparse.c haproxy-1.3.13-sla/src/cfgparse.c
>> --- haproxy-1.3.13-orig/src/cfgparse.c 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/src/cfgparse.c 2007-10-21 02:45:33.000000000 +0200
>> @@ -22,6 +22,7 @@
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <unistd.h>
>> +#include <time.h>
>>
>> #include <common/cfgparse.h>
>> #include <common/config.h>
>> @@ -543,7 +544,8 @@
>> tv_eternity(&curproxy->srvtimeout);
>> tv_eternity(&curproxy->contimeout);
>> tv_eternity(&curproxy->appsession_timeout);
>> -
>> +
>> + time(&curproxy->last_change);
>
> same here : curproxy->last_change = now.tv_sec.

:)

>> diff -Nur haproxy-1.3.13-orig/src/checks.c haproxy-1.3.13-sla/src/checks.c
>> --- haproxy-1.3.13-orig/src/checks.c 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/src/checks.c 2007-10-21 02:47:23.000000000 +0200
>> @@ -57,48 +57,54 @@
>> struct session *sess;
>> int xferred;
>>
>> + if (!(s->state & SRV_RUNNING))
>> + return; // already down, do nothing
>
> I would prefer not to rely on SRV_RUNNING inside checks.c. The reason is
> that it is checks.c which is responsible for setting this flag for others.
> In some sort, it is an output. All the checks algorithms relies on s->health,
> and I'd really like to keep that as the primary source of information.

All I was trying to change here is to prevent doing something which was alredy done. If SRV_RUNNING flag is cleared then there is no need to clear it again. I needed it to bypass updating last_change if nothing was changed.

> (...)
>
>> - if (s->health == s->rise) {
>> - recount_servers(s->proxy);
>> - s->proxy->map_state |= PR_MAP_RECALC;
>> -
>
> Also, BTW, the patch is wrong, because s->health can only equal s->rise
> when the server is *going* down. So we enter the if only during the
> transition. With the change, we enter it every time a server is UP (=
> mostly everytime). If you really want to remove this "if" block, you
> can simply revert the condition and return early :

Not sure about this as set_server_down is only called whet the server is going down. So, IMHO it is pointles to check if it is going down if it _is_ going down. Additionally, if I made this wrong last_change would be updated everytime and it is not what my test shows.

> + if (s->health != s->rise)
> + return; // server is not going down
>
> But I'm not sure it bring anything.
>
>> @@ -455,12 +461,25 @@
>> /* there was a test running */
>> if (s->result > 0) { /* good server detected */
>> //fprintf(stderr, "process_chk: 9\n");
>> - s->health++; /* was bad, stays for a while */
>> + if (s->health<s->rise+s->fall-1)
>> + s->health++; /* was bad, stays for a while */
>
> OK for this one.
>
>> if (s->health >= s->rise) {
>> - s->state |= SRV_RUNNING;
>>
>> - if (s->health == s->rise) {
>> + if (!(s->state & SRV_RUNNING)) {
>
> same as above about the transition vs state.

OK, if you do not like my changes I can move the place where the last_change is updated. However, I thought it is quite intuitive to do it in the same place where s->state is changed, which is not possible currently as s->state is overwritten every time.

>> diff -Nur haproxy-1.3.13-orig/src/dumpstats.c haproxy-1.3.13-sla/src/dumpstats.c
>> --- haproxy-1.3.13-orig/src/dumpstats.c 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/src/dumpstats.c 2007-10-21 02:42:10.000000000 +0200
>> @@ -47,6 +47,7 @@
>> #include <proto/proto_uxst.h>
>> #include <proto/senddata.h>
>> #include <proto/session.h>
>> +#include <proto/server.h>
>>
>> /* This function parses a "stats" statement in the "global" section. It returns
>> * -1 if there is any error, otherwise zero. If it returns -1, it may write an
>> @@ -186,9 +187,9 @@
>> "dreq,dresp,"
>> "ereq,econ,eresp,"
>> "wretr,wredis,"
>> - "weight,act,bck,"
>> - "chkfail,chkdown"
>> - "\n");
>> + "lastchg,status,weight,act,bck,"
>> + "chkfail,chkdown,downtime,"
>> + "END\n");
>
> OK for "status", but please move "lastchg" to the right (just before
> downtime, it would be perfect there).

Done.

> I have not recounted all the commas, but I trust you on this.
;) However, as I need to rework this part I hope is stays OK.

>> diff -Nur haproxy-1.3.13-orig/src/time.c haproxy-1.3.13-sla/src/time.c
>> --- haproxy-1.3.13-orig/src/time.c 2007-10-18 22:38:22.000000000 +0200
>> +++ haproxy-1.3.13-sla/src/time.c 2007-10-21 02:48:02.000000000 +0200
>> @@ -142,6 +142,32 @@
>> return __tv_isgt(tv1, tv2);
>> }
>>
>> +char *human_time(unsigned t, int hz) {
>> + static char rv[sizeof(" 100000d 23h")+1]; // longest of " 23h 59m" and " 59m 59s"
>
> you cannot reach 100000 days, so you can strip one char here. However,
> I don't know what happens with negative time. This would be interesting
> to check (you see, when people brutaly set the system's time by hand in
> the past).

Thank you for pointing this. I will fix this.

>> + char *p = rv;
>> +
>> + if (t < HOUR) {
>> + if (t >= MINUTE)
>> + p += sprintf(p, " %dm", t/MINUTE);
>> +
>> + if (t%MINUTE/SEC || !t)
>
> please even if it looks only like cosmetics, put spaces between
> your variables and operators everywhere. I don't care a dime about
> the 80 columns limit (as you might have noticed), but I want code
> that I can read at 3am when my eyes are starting to cross. And at
> 3am,
>
> if (t % MINUTE / SEC || !t)
>
> is more readable than :
>
> if (t%MINUTE/SEC || !t)

Done.

> As you see, that's a lot of small changes, but nothing particularly
> critical.

Exactly.

> BTW, I generally do not change people's patches before committing
> them. Some contributors may find it insulting that I fix things in
> their back, and others get annoyed because they maintain incremental
> patches on their side.

That is true. However, I have no objection when you fix something trivial, obviously and wrong. Especially that it is _your_ project and there is no way to force you to accept something you do not want to. On the other hand, it is nice to know what was changed, for example to be in sync with the mainline.

> On the other hand, I know it's a waste of time to play ping-pong
> with patches just to fix minor issues, hence why I've recently
> slightly modified some patches I was merging. Depending on what
> you expect when sending a patch, do not hesitate to tell me when
> you're OK with me doing minor changes if I notice anything while
> merging, to ask for review when it is what you're waiting for,
> or to explicitly tell me that you want a commit or reject but no
> change, for instance because you have other changes pending, or
> because you want to discuss the reason for changes before the
> commit. I have no problem adapting to people's expectations, we
> just have to be in sync.

Completely fair for me.

Thank you.

Best regards,

                         Krzysztof Olędzki Received on 2007/10/21 15:52

This archive was generated by hypermail 2.2.0 : 2007/11/04 19:21 CET