Hi Willy,
On 30/11/08 21:01, Willy Tarreau wrote:
> Hi Kai,
>
> On Sun, Nov 30, 2008 at 08:24:17PM +0000, Kai Krueger wrote:
>
>>> Could you try to change the LB algo from leastconn to roundrobin and
>>> reproduce the test above ? If only leastconn triggers the issue, then
>>> there might be a bug in its set_server_down() function. Maybe sometimes
>>> a server is not removed from the tree.
>>>
>>>
>> Ok, I think I might have tracked the problem down a bit more. I still
>> don't have a full grasp of exactly which path the requests go through,
>> but I might have identified one potential way of preventing this. The
>> problem is not that the server is not removed from the tree. That
>> happens correctly. However, there seem to be several paths in
>> proto_http.c which bypass the load balancing tree, with comments of the
>> form "We used to have a free connection slot. Since we'll never use it,
>> we have to inform the server that it may be used by another session."
>> However, the server could have gone down in the meantime, and neither
>> may_dequeue_tasks(server) nor process_srv_queue(server) checks the
>> status of the server before waking up tasks on that server. Adding a
>> check in either of these functions seems to do the trick, but again, I
>> am not sure if this is the right fix, or if my analysis was correct.
>>
>
> Damn, you're perfectly right, shame on me! I'm amazed that this problem
> has never shown up before. I think it is because the low maxconns and the
> long requests trigger it more easily (otherwise the server would have nothing
> left in the queue when completing an old request).
>
I suspect another important reason why other people might not have
noticed this is that, in our case, the backend server still responds with
correct HTTP in a timely manner when marked down, so the other mechanisms
that retry and redispatch sessions on connection errors didn't catch it.
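
Just to make the failure mode concrete, here is a tiny standalone
illustration (this is not the actual haproxy code; the struct, the
SRV_RUNNING flag and the function name are simplified stand-ins): a
dequeue check that only looks at the pending queue will happily hand
work to a server that has already been marked down.

#include <stdio.h>

#define SRV_RUNNING 0x0001  /* stand-in for haproxy's "server is up" flag */

struct server {
	int state;   /* bit field of SRV_* flags */
	int nbpend;  /* sessions still queued on this server */
};

/* Simplified version of the problematic check: it ignores the server
 * state, so queued sessions get woken up even on a DOWN server. */
static int may_dequeue_ignoring_state(const struct server *s)
{
	return (s && s->nbpend);
}

int main(void)
{
	struct server down = { .state = 0 /* marked down */, .nbpend = 5 };

	if (may_dequeue_ignoring_state(&down))
		printf("5 queued sessions would be sent to a DOWN server\n");
	return 0;
}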
> Could you please try this patch ? If it works, I'll merge it and release
> two more versions.
>
I haven't had the chance to test it properly yet, so this is a
preliminary conclusion, but I do get the impression that unfortunately
it didn't work. I need to try it in my test setup again though, as
there is too much other stuff going on on the production server. I did
notice, though, that there is one place in the code where
process_srv_queue() is called without the may_dequeue_tasks() check, and
that is in session_free(), which could potentially explain it?
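
In case it helps, the alternative I had in mind was to put the state
check inside the dequeuing function itself rather than in front of every
caller, so that a caller like session_free() cannot bypass it. A rough
sketch, reusing the toy struct server from above (again, not the real
process_srv_queue(), just the shape of the guard):

/* Sketch only: refuse to dequeue anything for a server that is down,
 * regardless of which code path called us. */
static void process_queue_guarded(struct server *s)
{
	if (!(s->state & SRV_RUNNING))
		return;  /* server is down: leave its queue alone */

	while (s->nbpend > 0) {
		/* in the real code this would wake up the next pending task */
		s->nbpend--;
	}
}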
Kai
> Thanks very much for your long work at tracking this issue down!
>
> Willy
>
> ----
>
> From 80b286a064eaec828b7fd10e98e3f945e8b244f3 Mon Sep 17 00:00:00 2001
> From: Willy Tarreau <w@1wt.eu>
> Date: Sun, 30 Nov 2008 21:51:58 +0100
> Subject: [BUG] do not dequeue requests on a dead server
>
> Kai Krueger reported a problem when a server goes down with active
> connections. A lot of connections were drained by that server. Kai
> did an amazing job at tracking this bug down to the dequeuing
> mechanism which forgets to check the server state before allowing
> a request to be sent to a server.
>
> The problem occurs more often with long requests, which have a chance
> to complete after the server is completely marked down, and to find
> requests in the global queue which have not yet been fetched by other
> servers.
>
> The fix consists in ensuring that a server is up before sending it
> any new request from the queue.
> ---
> include/proto/queue.h | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/include/proto/queue.h b/include/proto/queue.h
> index 6899aee..be78b53 100644
> --- a/include/proto/queue.h
> +++ b/include/proto/queue.h
> @@ -64,10 +64,11 @@ static inline struct pendconn *pendconn_from_px(const struct proxy *px) {
> }
>
> /* returns 0 if nothing has to be done for server <s> regarding queued connections,
> - * and non-zero otherwise. Suited for and if/else usage.
> + * and non-zero otherwise. If the server is down, we always return zero. Suited for
> + * and if/else usage.
> */
> static inline int may_dequeue_tasks(const struct server *s, const struct proxy *p) {
> - return (s && (s->nbpend || p->nbpend) &&
> + return (s && (s->state & SRV_RUNNING) && (s->nbpend || p->nbpend) &&
> (!s->maxconn || s->cur_sess < srv_dynamic_maxconn(s)));
> }
>
>