Hi Roy,
On 2011-03-18 22:21, Roy Smith wrote:
> Before I reinvent the wheel, has anybody already written code to parse
> haproxy log messages with Python?
I have, although it's not _that_ fast. My approach takes about 1 minute per 100 MB of gzipped logs (at roughly 10:1 compression, so about 1 GB of raw log data).
If your use case is covered by the features of halog, you should definitely try that instead. It's written by Willy himself and can easily max out your streaming file I/O (meaning it is orders of magnitude faster than anything you could do in Python itself).
That said, the gist of my analysis implementation follows. It targets the verbose HTTP log format of HAProxy and Python 2.4. The terminology is the one used in the HAProxy configuration manual; refer to it for a description of the various fields.
#!/usr/bin/env python
# encoding: utf-8
import os
import re
import subprocess as sub

# Does the syslog server escape quotes?
template_escape = True

# One named group per field of the HTTP log format.
haproxy_re = (r'haproxy\[(?P<pid>\d+)\]: '
              r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):(?P<client_port>\d{1,5}) '
              r'\[(?P<date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})\] '
              r'(?P<listener_name>\S+) (?P<server_name>\S+) '
              r'(?P<Tq>(-1|\d+))/(?P<Tw>(-1|\d+))/(?P<Tc>(-1|\d+))/(?P<Tr>(-1|\d+))/'
              r'(?P<Tt>\+?\d+) '
              r'(?P<HTTP_return_code>\d{3}) (?P<bytes_read>\d+) '
              r'(?P<captured_request_cookie>\S+) (?P<captured_response_cookie>\S+) '
              r'(?P<termination_state>[\w-]{4}) (?P<actconn>\d+)/(?P<feconn>\d+)/'
              r'(?P<beconn>\d+)/(?P<srv_conn>\d+)/(?P<retries>\d+) '
              r'(?P<server_queue>\d+)/(?P<listener_queue>\d+) '
              r'(\{(?P<captured_request_headers>.*?)\} )?'
              r'(\{(?P<captured_response_headers>.*?)\} )?')
# Only one of the two variants may be appended, since both define
# the group HTTP_request.
if template_escape:
    haproxy_re += r'\\"(?P<HTTP_request>.+)\\"'
else:
    haproxy_re += r'"(?P<HTTP_request>.+)"'
haproxy_re = re.compile(haproxy_re)


def scan(logfile_path):
    (root, ext) = os.path.splitext(logfile_path)
    process = None
    if ext == ".gz":
        # Use a shellout for unzipping. This is about 2-5 times faster
        # than doing it in python.
        process = sub.Popen(["/bin/gunzip", "--stdout", logfile_path],
                            stdout=sub.PIPE, bufsize=1)
        fd = process.stdout
    else:
        # Uncompressed log: read the file directly.
        fd = open(logfile_path)
    line_no = 0
    line = None
    try:
        for line in fd:
            line_no += 1
            match = haproxy_re.search(line)
            if not match:
                # A non-request, e.g. an error or an info message of HAProxy
                # We just ignore it and continue with the next line
                continue
            fields = match.groupdict()
            if fields["captured_request_headers"]:
                fields["captured_request_headers"] = \
                    fields["captured_request_headers"].split("|")
            if fields["captured_response_headers"]:
                fields["captured_response_headers"] = \
                    fields["captured_response_headers"].split("|")
            # Now you have the matched parts in the fields dict
            # And you can do whatever you like with it :)
    except:
        print "An error occurred in line %s. Last line was:" % line_no
        print line
        raise
    # finalize the file reading
    if process:
        # Assumption: wait for gunzip to exit so it does not linger.
        process.wait()
    else:
        fd.close()
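
As one illustration of what could be done with the fields dict: every capture is a string, so the timers and counters usually need converting before any arithmetic. The following is only a minimal sketch, not part of my script. It assumes, as I read the HAProxy manual, that a timer of -1 means the phase was never reached and that Tt may carry a leading "+". The names timers_re, convert_numbers, count_by_status and the sample lines are made up for this example, and it is written for a current Python interpreter rather than 2.4.

```python
import re

# Hypothetical, trimmed-down sub-pattern: only the timers, the status code
# and the byte count from the full expression.
timers_re = re.compile(
    r'(?P<Tq>-1|\d+)/(?P<Tw>-1|\d+)/(?P<Tc>-1|\d+)/(?P<Tr>-1|\d+)/'
    r'(?P<Tt>\+?\d+) '
    r'(?P<HTTP_return_code>\d{3}) (?P<bytes_read>\d+) ')

TIMER_NAMES = ("Tq", "Tw", "Tc", "Tr", "Tt")


def convert_numbers(fields):
    """Return a copy of fields with the numeric captures turned into ints.

    A timer of -1 becomes None; a leading '+' on Tt is stripped.
    """
    result = dict(fields)
    for name in TIMER_NAMES:
        value = int(fields[name].lstrip("+"))
        if value < 0:
            value = None
        result[name] = value
    result["HTTP_return_code"] = int(fields["HTTP_return_code"])
    result["bytes_read"] = int(fields["bytes_read"])
    return result


def count_by_status(lines):
    """Count the matching lines per HTTP return code."""
    counts = {}
    for line in lines:
        match = timers_re.search(line)
        if not match:
            continue  # not a request line
        fields = convert_numbers(match.groupdict())
        code = fields["HTTP_return_code"]
        counts[code] = counts.get(code, 0) + 1
    return counts


# Made-up fragments of log lines, for illustration only.
sample_lines = [
    "http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 ",
    "http-in static/srv1 12/0/-1/-1/+50 503 212 - - SC-- 1/1/1/0/3 0/0 ",
    "not a request line",
]
```

Calling count_by_status(sample_lines) gives a dict keyed by integer status code. In the real script the same conversion would be applied to the fields dict inside the loop of scan().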
Received on 2011/03/19 17:32