On Sat, Feb 16, 2008 at 07:41:21AM -0500, Marc Breslow wrote:
> Thanks Willy. I wanted to wait until a slower time to run strace as it
> sounded like it could interrupt or slow down our services. HAProxy is
> running at 50% CPU now with roughly 275 HTTP sessions and 100 TCP sessions.
What's the approximate session rate and data rate? If you have 5000 sessions per second or if you are forwarding 1 Gbps, that could be justified. Otherwise, even my small 500 MHz Geode consumes that much CPU at 100 Mbps, and it runs on under 3 W!
> I generated the trace file. I searched for "refused" and found things like
> 07:26:30.697634 send(372, "HEAD /staging.online HTTP/1.0\r\n\r"..., 33,
> MSG_DONTWAIT|MSG_NOSIGNAL) = -1 ECONNREFUSED (Connection refused) <0.000010>
>
> Is that an example of something that takes a lot of CPU for haproxy?
No, this looks like a health check; it consumes almost nothing. I was more worried about unchecked servers which would be down but regularly selected under high session rates (e.g. thousands of sessions per second).
> Maybe we're not using haproxy in the most effective way.
I don't see why this would be the case. However, as I said, I'm more worried about the kernel, and my worries were amplified by the "top" output you posted which showed very high system CPU usage.
> We have a couple of
> spare web server instances in our cluster that are usually not online. The
> way that we bring them online is by creating the file that haproxy uses to
> see if it's up or down. So every 2.5s it's checking those two servers and
> finding them down.
OK, but that's almost nothing. I don't remember the size of your machine, but on a typical 2 GHz single-proc machine, health-checks can go as high as 20000/s at full throttle, so you're about 50000 times below, which cannot be a problem.
> We also have an entire duplicate haproxy configuration for our testing site
> which we'll add 1 or 2 servers into at any time. We add the servers in by
> touching a different file on the web server that haproxy is constantly
> polling for. 5 or 6 out of 6 of these instances are usually unavailable.
This is the correct way of doing this.
> Is that more overhead for haproxy than if the servers are always available?
Not at all. BTW, you can also run strace in statistics mode :
# strace -c -p $(pidof haproxy)
Wait 5s, then press Ctrl-C. It will output something like this :
Process 12233 attached - interrupt to quit
Process 12233 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 77.38    0.025027          70       355         1 select
  8.75    0.002829           5       548           ioctl
  4.32    0.001397          11       133        33 read
  3.66    0.001185          13        90           write
  2.96    0.000957          10       100         3 sigreturn
  2.94    0.000950           3       276           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.032345                  1502        37 total
If you could post it here, it would help trouble-shoot the problem.
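For anyone who wants to turn such a summary into per-second figures before posting it, here is a minimal Python sketch. It is not part of haproxy or strace; the function names parse_summary and calls_per_second are made up for illustration, and the 5-second window is only an assumption based on the "wait 5s" suggestion, so pass the real duration of your own run.

```python
# Sketch: parse "strace -c" summary text and derive calls per second.
# Assumption: the capture window (default 5 s) is supplied by the user.

SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 77.38    0.025027          70       355         1 select
  8.75    0.002829           5       548           ioctl
  4.32    0.001397          11       133        33 read
  3.66    0.001185          13        90           write
  2.96    0.000957          10       100         3 sigreturn
  2.94    0.000950           3       276           gettimeofday
------ ----------- ----------- --------- --------- ----------------
100.00    0.032345                  1502        37 total
"""


def parse_summary(text):
    """Return one dict per syscall row; header, rulers and total are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have 5 fields (no errors column) or 6 (with errors).
        if len(fields) not in (5, 6) or fields[-1] == "total":
            continue
        try:
            row = {
                "pct_time": float(fields[0]),
                "seconds": float(fields[1]),
                "usecs_per_call": int(fields[2]),
                "calls": int(fields[3]),
                "errors": int(fields[4]) if len(fields) == 6 else 0,
                "syscall": fields[-1],
            }
        except ValueError:
            continue  # ruler lines and other non-numeric text
        rows.append(row)
    return rows


def calls_per_second(rows, window_seconds=5.0):
    """Map syscall name to call rate over the capture window."""
    return {r["syscall"]: r["calls"] / window_seconds for r in rows}


if __name__ == "__main__":
    for name, rate in calls_per_second(parse_summary(SAMPLE)).items():
        print(f"{name:15s} {rate:8.1f} calls/s")
```

With the sample figures above and a 5-second window, select comes out at 71 calls per second. A rate several orders of magnitude higher on a mostly idle proxy would be the kind of number worth reporting.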
> What else can I look for in the trace file?
Large times, and signs of one syscall looping on nothing. Eg: epoll() returning zero and being called immediately afterwards. This would indicate a big bug in haproxy. But since 1.2 and 1.3 are fairly different in this area and both exhibit the problem for you, I'm sceptical.
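Both checks can be automated on a trace file. The following Python sketch is only an illustration under stated assumptions: the trace was captured with strace -tt -T so that each line ends with a <duration> field, as in the send() line quoted above; the function names, the 0.1 s "slow" threshold and the 1 ms "returned immediately" threshold are arbitrary choices of mine, not values taken from haproxy or strace.

```python
# Sketch: scan "strace -tt -T" output for slow syscalls and for
# pollers that return 0 immediately several times in a row.
import re
import sys

LINE_RE = re.compile(
    r"^(?:\d+\s+)?"                      # optional pid (strace -f)
    r"(?:[\d:.]+\s+)?"                   # optional timestamp (-t / -tt)
    r"(?P<name>\w+)\(.*\)\s+=\s+"        # syscall name and arguments
    r"(?P<ret>0x[0-9a-f]+|-?\d+|\?)"     # return value, kept as text
    r".*?<(?P<dur>\d+\.\d+)>\s*$"        # duration appended by -T
)

POLLERS = {"select", "poll", "epoll_wait"}


def parse_trace(lines):
    """Return (name, return_value_text, duration_seconds) per parsed line."""
    calls = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            calls.append((m.group("name"), m.group("ret"), float(m.group("dur"))))
    return calls


def slow_calls(calls, threshold=0.1):
    """Syscalls that took at least `threshold` seconds."""
    return [c for c in calls if c[2] >= threshold]


def longest_busy_poll_run(calls, max_dur=0.001):
    """Longest run of consecutive pollers returning 0 in under `max_dur` s.

    A poller that returns 0 after a real timeout is normal; only quick
    zero returns repeated back to back suggest a busy loop.
    """
    best = run = 0
    for name, ret, dur in calls:
        if name in POLLERS and ret == "0" and dur < max_dur:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        parsed = parse_trace(fh)
    for name, ret, dur in slow_calls(parsed):
        print(f"slow: {name} = {ret} took {dur:.6f}s")
    print("longest busy-poll run:", longest_busy_poll_run(parsed))
```

A run of a handful of quick zero returns can happen by chance; a run in the hundreds or thousands is what would point to a loop.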
Regards,
Willy
Received on 2008/02/16 15:54