I have many gigabytes of .gz-compressed log files, generally split into very small files. For example, one file contains only entries from 20180114-205608 to 20180114-205746 (about 90 seconds), and is about 350k compressed.
I want to extract these and combine them into one file per day*, compressed.
Log file entries are in date/time order, and filenames are easily sorted into date/time order too, so I can easily generate a stream of log entries in date/time order to work from. So I assume that what I need is something to pipe that stream into which will look at the date of each line and, if it has changed, close any existing output file, open a new one (gzipped), and write to it until the date changes again or EOF.
I could write something like that in a scripting language (for my sins PHP would be easiest, Python I'm getting better at, and bash could likely do it too). But given the volume of data, are there any suggestions as to the "right" tool to use? Is this a job for awk or similar?
As an aside: the files are all archived on my Linux box, but they're sourced from a Windows box, so a cross-platform solution would let me do it on the host; transferring lots of small files isn't as efficient as a few big files (although I have archives going back years on my Linux box to work through first).
* I say a file per day but ideally I could be flexible about the period - day is most likely and probably easiest though
Once you have your ordered stream, with awk you could do e.g.:
awk -F '-' '{f = "split"$1".log";print >> f}' log.txt
which creates split20180114.log, split20180115.log and so forth.
As for best/efficient: there may be ways that are faster, but I'd not optimise prematurely if the above simple way gets the job done and works fast enough.
-- Martijn
On 25 January 2018 at 17:23, Martijn Koster mak-alug@greenhills.co.uk wrote:
Once you have your ordered stream, with awk you could do e.g.:
awk -F '-' '{f = "split"$1".log";print >> f}' log.txt
which creates split20180114.log, split20180115.log and so forth.
Great thanks, I'll give that a go. Can awk either write directly to gzipped files, or can the above be modified to pipe through gzip? It's not just the relative efficiency of doing it in one step but also the volume of disk space I'm going to chew up otherwise. I know that PHP can write .gz directly, but that feels like a horrible tool for the job in general.
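From a skim of the awk docs I think print can be piped into a command, so maybe something like this would do it in one step (an untested guess on my part):

awk -F '-' '{ print | ("gzip >> split" $1 ".log.gz") }' log.txt

That assumes awk keeps one pipe per distinct command string, and that appending gzip output to the same file still leaves a valid .gz.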
As for best/efficient: there may be ways that are faster, but I'd not optimise prematurely if the above simple way gets the job done and works fast enough.
Absolutely agree about not prematurely or over-optimising.
On 26 January 2018 at 10:47, Mark Rogers mark@more-solutions.co.uk wrote:
Can awk either write directly to gzipped files, or can the above be modified to pipe through gzip? It's not just the relative efficiency of doing it in one step but also the volume of disk space I'm going to chew up otherwise. I know that PHP can write .gz directly, but that feels like a horrible tool for the job in general.
I've written a Python script to do this (as it can write directly to .gz much as PHP can but doesn't feel like a horrible choice).
I'm currently splitting into one file per hour's data due to their size; each file has around 1,700,000 lines (yes, that's per hour) by the look of it, and I'm averaging 30-45s per file. That works out at around 8 hours to process one month's data. I have maybe 10 years of files to work through (so that'll run for about a month and a half).
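(Roughly: 24 files at ~35-40s each is about 15 minutes per day of logs, so a 30-day month comes out at 7-8 hours, and ~120 months of that is getting on for 40 days.)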
The script is below (I'm only really getting to grips with Python, so I'm sure it's not great code). I run it using:

$ zcat *.log.gz | ./filter.py
Any comments? For example, would it make more sense to open the .gz files directly in python rather than piping them in? (I assume that zcat is efficient and so are pipes, and I'm unlikely to achieve anything better myself.)
#!/usr/bin/env python3
import gzip, sys, time

def closeLast():
    # Report throughput for the file just finished, then close it
    if fh:
        td = time.time() - ts
        print(' %d lines, %0.3f seconds, %d lines/sec' % (ctr, td, ctr / td))
        fh.close()

oldFilename = False
fh = False
for l in sys.stdin:
    filename = l[0:11]  # YYYYMMDD-HH prefix, i.e. one output file per hour
    if filename != oldFilename:
        closeLast()
        fn = 'filtered/%s.event.log.gz' % filename
        fh = gzip.open(fn, 'wt')
        ts = time.time()
        ctr = 0
        oldFilename = filename
        print("Logging to " + fn, end='', flush=True)
    ctr += 1
    fh.write(l)
closeLast()
On 26 January 2018 at 12:16, Mark Rogers mark@more-solutions.co.uk wrote:
Any comments? For example, would it make more sense to open the .gz files directly in python rather than piping them in? (I assume that zcat is efficient and so are pipes, and I'm unlikely to achieve anything better myself.)
To answer my own question, I had a thought about this.
gzip files can be concatenated to create a valid gzip file. Therefore, if multiple files are being combined I simply need to concatenate them, unless they span a time (eg 1hr) boundary, in which case those will need to be processed line by line.
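For example (filenames made up):

cat 20180114-00.log.gz 20180114-01.log.gz > 20180114.log.gz

and zcat 20180114.log.gz still gives every line of both parts, in order.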
The source log files have filenames which tell me when they start, so I can tell fairly easily if a file needs to be processed (if there is a later file starting in the same period then it doesn't, otherwise it does).
I haven't scripted it yet but this should get me pretty close to the raw performance of the disks, and is probably something that bash will handle fine using a combination of zgrep for the edge cases and cat for the rest (at the sacrifice of easy cross-platform implementation).
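Something like this is the shape of what I have in mind (an untested sketch; the log-YYYYMMDD-HHMMSS.log.gz filenames, the combined/ output directory and the substring offsets are made up to illustrate, and the real script will differ):

#!/bin/bash
# Combine small hourly .gz logs into one .gz per day.
mkdir -p combined
shopt -s nullglob
files=( log-*.log.gz )            # glob expansion sorts these chronologically
for i in "${!files[@]}"; do
    f=${files[$i]}
    day=${f:4:8}                  # YYYYMMDD taken from the filename
    next=${files[$((i+1))]:-}
    if [[ -n $next && ${next:4:8} == "$day" ]]; then
        # A later file starts on the same day, so this one can't straddle
        # midnight: append the raw gzip data (concatenation is still valid).
        cat "$f" >> "combined/$day.log.gz"
    else
        # Last file of the day may spill into the next day: split its lines
        # by their date prefix and recompress each day's share.
        zcat "$f" | awk '{ print | ("gzip >> combined/" substr($0,1,8) ".log.gz") }'
    fi
done

Only the last file of each day pays the decompress/recompress cost; everything else is a straight cat.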
On Thu, 2018-01-25 at 12:00 +0000, Mark Rogers wrote:
But given the volume of data are there any suggestions as to the "right" tool to use?
Isn't this what perl was designed for?
:oD
On 26 January 2018 at 09:13, Huge huge@huge.org.uk wrote:
Isn't this what perl was designed for?
Perl has largely managed to pass me by and dare I say it feels like a legacy language to be starting with now, compared with something like Python (which I have a bit of a love/hate relationship with).
Actually I'm sure that any language will handle this, and since the bottleneck is the filesystem maybe I'm overthinking it.
On Fri, 2018-01-26 at 10:50 +0000, Mark Rogers wrote:
On 26 January 2018 at 09:13, Huge huge@huge.org.uk wrote:
Isn't this what perl was designed for?
Perl has largely managed to pass me by and dare I say it feels like a legacy language
How dare you, Sir!
(You mean it's no longer fashionable.)
to be starting with now, compared with something like Python (which I have a bit of a love/hate relationship with).
Actually I'm sure that any language will handle this, and since the bottleneck is the filesystem maybe I'm overthinking it.
I'd say so.
On 26 January 2018 at 14:05, Huge huge@huge.org.uk wrote:
How dare you, Sir!
(You mean it's no longer fashionable.)
I do mean that, but in a world of shared libraries that also means that most of the exciting new stuff tends to appear in Python* or something else that isn't Perl, which is what led me to the legacy adjective.
No offence :-)
[ * This is why I have devoted some time to learning Python. Aspects of it I hate, but getting stuck with what I was used to wasn't doing me any favours either. ]
Actually I'm sure that any language will handle this, and since the bottleneck is the filesystem maybe I'm overthinking it.
I'd say so.
Referring back to my last post about concatenating files that I know fall within the same period, and only line-processing the ones that straddle boundaries: I wrote a basic bash script to test this theory and the time per hour of data dropped from ~40s to ~0.5s. The downside is that the resulting files are larger, presumably due to repeating dictionaries and the loss of the opportunity to compress similarities between source files (an md5 verification on the extracted files confirms they have the same content, but the file size jumps from, e.g., 11.2MB to 13.1MB). It's a shame to lose that disk space permanently to save a bit of time now, but that's one hell of a time trade-off to make, so I think I'm taking it.

It also shows that this isn't disk bound (since the disk activity will be largely the same either way, I think?). It looks instead like I'm CPU bound on the [de]compression.
It's been an interesting (to me, anyway) programming challenge. I'm happy to share the bash script if anyone is interested, but without any documentation it's a bit meaningless and quite bespoke to my files.
On Fri, 2018-01-26 at 14:53 +0000, Mark Rogers wrote:
On 26 January 2018 at 14:05, Huge huge@huge.org.uk wrote:
How dare you, Sir!
(You mean it's no longer fashionable.)
I do mean that, but in a world of shared libraries that also means that most of the exciting new stuff tends to appear in Python* or something else that isn't Perl, which is what led me to the legacy adjective.
Good point.
No offence :-)
None taken.
[ * This is why I have devoted some time to learning Python. Aspects of it I hate, but getting stuck with what I was used to wasn't doing me any favours either. ]
I started to learn Python, got to the first example which mixed OO and imperative notation styles, threw up and stopped. I do this for fun now, and I wasn't enjoying myself.
Glad you solved your problem.
H.