Re: haproxy alleatory fails on config reload

From: Pablo Escobar <pescobar#cipf.es>
Date: Fri, 30 May 2008 14:22:02 +0200


El Friday 30 May 2008 06:43:34 Willy Tarreau escribió:
> On Fri, May 30, 2008 at 01:22:56AM +0200, Pablo Escobar wrote:
> > Hi Krzysztof and Willy,
> >
> > I have compiled haproxy 1.3.15.1 for x86_64
> >
> > I found with dmesg some of this errors:
> > ip_conntrack: table full, dropping packet.
> >
> > So I tougth I had found the problem but I was wrong. I doubled the
> > ip_conntrack_max value and the dmesg error dissapeared but I still get
> > the same problem. Around 5 reloads ok and then a wrong reload wich takes
> > all my websites offline for EXACTLY 1 MIN.
> >
> > connecting to port 81 I see every backend up. connecting on port 80 I get
> > 503 error on every backend. exactly when the stats web arrives to "uptime
> > 1:01min" I get every backend up on port 80. I am sure the 1min is not
> > random. I have tried it 2 times with the same result.
>
> 1 minute might be the time needed to expire old sessions from your
> conntrack. You can try to reduce ip_conntrack_tcp_timeout_time_wait in
> /proc/sys/net/ipv4/netfilter to see if this has any effect.
>

I have a value of 120 on ip_conntrack_tcp_timeout_time_wait. ¿would be safe to change it to 60?

also I have found a 60 value on "ip_conntrack_tcp_timeout_syn_recv"

[root#haproxy]$>
cat /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_syn_recv 60

¿maybe the problem is related to this?

> Also, it is possible that your conntrack is buggy and does not initiate a
> new session on a SYN reusing a same source port (once your source ports
> wrap around).
>
> You should *really* unload conntrack to see if it makes any difference.

Sorry for my ignorance but........¿If I unload this module could it make my haproxy machine stop working? This is the main machine on my web cluster (doing reverse proxy on every webserver) so if it stops working everything goes down. ¿is safe to remove the conntrack module? Right now I can´t do it because this hours is when I have most of the web traffic but if anyone can confirm that is safe to unload the module I will do some testing tonight.

>
> > Other strange thing is that I am getting this on logs on every reload. on
> > both good reload and wrong reloads
>
> (...)
>
> > May 30 00:47:58 127.0.0.1 haproxy[27763]: Proxy trac failed to enter
> > pause mode.
> > May 30 00:47:58 127.0.0.1 haproxy[27763]: Proxy trac failed to enter
> > pause mode.
>
> OK for those now I think I found why they report this error. haproxy is
> trying to pause all proxies, including the backends, then checks that they
> are correctly paused, which is stupid. I'll fix this. In my opinion, except
> for the unpleasant log messages, it should not cause any trouble.
>

I supposed this wasn´t a problem because I had the same logs on good reload and wrong reloads but it was strange.

> > other strange thing is that in the wrong reload I got this on the log:
> >
> > May 30 00:47:42 127.0.0.1 haproxy[27763]: 127.0.0.1:35292
> > [30/May/2008:00:47:42.500] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34106
> > - - PR-- 0/0/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:42 127.0.0.1 haproxy[27763]: 127.0.0.1:35297
> > [30/May/2008:00:47:42.990] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34084
> > - - PR-- 1/1/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:43 127.0.0.1 haproxy[27763]: 127.0.0.1:35310
> > [30/May/2008:00:47:43.507] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34053
> > - - PR-- 1/1/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:43 127.0.0.1 haproxy[27763]: 127.0.0.1:35315
> > [30/May/2008:00:47:43.622] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34055
> > - - PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:44 127.0.0.1 haproxy[27763]: 127.0.0.1:35328
> > [30/May/2008:00:47:44.109] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34056
> > - - PR-- 1/1/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:44 127.0.0.1 haproxy[27763]: 127.0.0.1:35335
> > [30/May/2008:00:47:44.555] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34056
> > - - PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> >
> > 30/May/2008:00:47:44
> > ---------- HERE STOPS LOGIN FOR AROUND ONE MINUTE
> > --------------------- ---------- THEN CONTINUES LOGGING BUT TIMESTAMP IS
> > 1SEC LATER -----------
>
> Are you really sure about this ? The log messages are only 1 second later,
> so it like your machine paused for 1 minute. Last time I saw such a thing
> was on a machine regularly entering suspend mode due to excessive power
> management settings.
>

Well, I think yes but really yesterday night was 1.30am, I had been for more than 12 hours on the keyboard and I had some "grep -v" on my "tail -f" so maybe it was my fault. Tonight I will try to do some testing again doing just "tail -f" without any "grep -v" to try to confirm this.
> > May 30 00:47:45
> >
> > May 30 00:47:45 127.0.0.1 haproxy[27763]: 127.0.0.1:35338
> > [30/May/2008:00:47:45.035] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34056
> > - - PR-- 0/0/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:45 127.0.0.1 haproxy[27763]: 127.0.0.1:35355
> > [30/May/2008:00:47:45.559] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34059
> > - - PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:46 127.0.0.1 haproxy[27763]: 127.0.0.1:35371
> > [30/May/2008:00:47:46.043] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34060
> > - - PR-- 3/3/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:46 127.0.0.1 haproxy[27763]: 127.0.0.1:35373
> > [30/May/2008:00:47:46.283] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34060
> > - - PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:46 127.0.0.1 haproxy[27763]: 127.0.0.1:35376
> > [30/May/2008:00:47:46.545] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34060
> > - - PR-- 1/1/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:47 127.0.0.1 haproxy[27763]: 127.0.0.1:35389
> > [30/May/2008:00:47:47.217] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34060
> > - - PR-- 5/5/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:47 127.0.0.1 haproxy[27763]: 127.0.0.1:35406
> > [30/May/2008:00:47:47.915] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34061
> > - - PR-- 2/2/0/0/0 0/0 "GET /stats HTTP/1.1"
> > May 30 00:47:48 127.0.0.1 haproxy[27763]: 127.0.0.1:35415
> > [30/May/2008:00:47:48.371] http-in http-in/<STATS> 0/-1/-1/-1/0 200 34096
> > - - PR-- 3/3/0/0/0 0/0 "GET /stats HTTP/1.1"
> >
> > ¿¿maybe can affect that I am using syslog-ng?? I dont think so but.....
>
> No I don't think so either.
> One possibility would be the fact that conntrack is full and drops UDP
> logs, but I really don't think so because once the session is created, the
> same will be used again and again.
>

We think the same about this.

> > The output of ethtool is this:
> >
> > Settings for eth0:
> > Supported ports: [ TP ]
> > Supported link modes: 10baseT/Half 10baseT/Full
> > 100baseT/Half 100baseT/Full
> > 1000baseT/Full
> > Supports auto-negotiation: Yes
> > Advertised link modes: 10baseT/Half 10baseT/Full
> > 100baseT/Half 100baseT/Full
> > 1000baseT/Full
> > Advertised auto-negotiation: Yes
> > Speed: 100Mb/s
> > Duplex: Full
> > Port: Twisted Pair
> > PHYAD: 1
> > Transceiver: internal
> > Auto-negotiation: on
> > Supports Wake-on: g
> > Wake-on: d
> > Link detected: yes
> >
> >
> > Settings for eth1:
> > Supported ports: [ TP ]
> > Supported link modes: 10baseT/Half 10baseT/Full
> > 100baseT/Half 100baseT/Full
> > 1000baseT/Full
> > Supports auto-negotiation: Yes
> > Advertised link modes: 10baseT/Half 10baseT/Full
> > 100baseT/Half 100baseT/Full
> > 1000baseT/Full
> > Advertised auto-negotiation: Yes
> > Speed: 100Mb/s
> > Duplex: Full
> > Port: Twisted Pair
> > PHYAD: 1
> > Transceiver: internal
> > Auto-negotiation: on
> > Supports Wake-on: g
> > Wake-on: d
> > Link detected: yes
> >
> > Not sure if is possible what you mention " duplex mismatch between the
> > haproxy server and the switch". I will try to explore it deeper tomorrow
> > at the lab.
>
> At least you have auto-neg enabled and full duplex reported, so you most
> likely have the same on the other end.
>
> > The 503 error logs that I sent on my last mail I think was ok. I mean,
> > the backend was really down and the 503 error was right. I just added the
> > "trac" backend today and testing it I turned it off. The strange thing is
> > that those are the only 503 errors on my log. On the 1 minute downtime I
> > dont get any 503 error on my haproxy log (I can see the 503 just on my
> > browser).
> >
> > I thik tomorrow I will try to compile latests haproxy version for x86 and
> > do some testing....
>
> First, disable ip_conntrack so that we know where to look for the issue. If
> conntrack is responsible for the problem, we need to understand why it is
> not clear from haproxy's logs and try to improve the logs so that tracking
> that down becomes easier in the future.
>

as said before, If you confirm me that is safe to unload the module I will try tonight when the web traffic is slower.

> > many thanks again for all your help.
> >
> > I go to bed now. enough for one day :)
> >
> :-)
>
> regards,
> willy

many thanks again.
Pablo.

-- 
Pablo Escobar Lopez
Head of Infrastructure & IT Support
Bioinformatics Department
Centro de Investigación Príncipe Felipe (CIPF)
Tfn: (34) 96 328 96 80 ext: 1004
http://bioinfo.cipf.es
Received on 2008/05/30 14:22

This archive was generated by hypermail 2.2.0 : 2008/05/30 14:31 CEST