I have many gigabytes of .gz-compressed log files, generally split into very small files. For example, one file contains only entries from 20180114-205608 to 20180114-205746 (about 90 seconds), and is about 350k compressed.
I want to extract these and combine them into one file per day*, compressed.
Log file entries are in date/time order, and filenames are easily sorted into date/time order too, so I can easily generate a stream of log entries in date/time order to work from. So I assume that what I need is something to pipe that stream into which will look at the date of each line and, if it has changed, close any existing output file, open a new one (gzipped), and write to it until the date changes again or EOF.
I could write something like that in a scripting language (for my sins PHP would be easiest, Python I'm getting better at, and bash could likely do it too). But given the volume of data, are there any suggestions as to the "right" tool to use? Is this a job for awk or similar?
As an aside: the files are all archived on my Linux box, but they're sourced from a Windows box, so a cross-platform solution would let me do it on the host; transferring lots of small files isn't as efficient as a few big files (although I have archives going back years on my Linux box to work through first).
* I say a file per day but ideally I could be flexible about the period - day is most likely and probably easiest though
Once you have your ordered stream, with awk you could do e.g.:
awk -F '-' '{f = "split"$1".log";print >> f}' log.txt
which creates split20180114.log, split20180115.log and so forth.
As for best/efficient: there may be ways that are faster, but I'd not optimise prematurely if the above simple way gets the job done and works fast enough.
-- Martijn
On 25 January 2018 at 17:23, Martijn Koster mak-alug@greenhills.co.uk wrote:
Once you have your ordered stream, with awk you could do e.g.:
awk -F '-' '{f = "split"$1".log";print >> f}' log.txt
which creates split20180114.log, split20180115.log and so forth.
Great thanks, I'll give that a go. Can awk either write directly to gzipped files, or can the above be modified to pipe through gzip? It's not just the relative efficiency of doing it in one step but also the volume of disk space I'm going to chew up otherwise. I know that PHP can write .gz directly, but that feels like a horrible tool for the job in general.
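From a skim of the awk docs I think print can be piped into a command, so maybe something like this would do it in one step (an untested guess on my part):

awk -F '-' '{ print | ("gzip >> split" $1 ".log.gz") }' log.txt

That assumes awk keeps one pipe per distinct command string, and that appending gzip output to the same file still leaves a valid .gz.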
As for best/efficient: there may be ways that are faster, but I'd not optimise prematurely if the above simple way gets the job done and works fast enough.
Absolutely agree about not prematurely or over-optimising.
On 26 January 2018 at 10:47, Mark Rogers mark@more-solutions.co.uk wrote:
Can awk either write directly to gzipped files, or can the above be modified to pipe through gzip? It's not just the relative efficiency of doing it in one step but also the volume of disk space I'm going to chew up otherwise. I know that PHP can write .gz directly, but that feels like a horrible tool for the job in general.
I've written a Python script to do this (as it can write directly to .gz much as PHP can but doesn't feel like a horrible choice).
I'm currently splitting into one file per hour's data due to their size; each file has around 1,700,000 lines (yes, that's per hour) by the look of it, and I'm averaging 30-45s per file. That works out at around 8 hours to process one month's data. I have maybe 10 years of files to work through (so that'll run for about a month and a half).
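(Roughly: 24 files at ~35-40s each is about 15 minutes per day of logs, so a 30-day month comes out at 7-8 hours, and ~120 months of that is getting on for 40 days.)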
The script is below (I'm only really getting to grips with Python, so I'm sure it's not great code). I run it using:

$ zcat *.log.gz | ./filter.py
Any comments? For example, would it make more sense to open the .gz files directly in python rather than piping them in? (I assume that zcat is efficient and so are pipes, and I'm unlikely to achieve anything better myself.)
#!/usr/bin/env python3
import gzip, sys, time

def closeLast():
    # Report throughput for the file just finished, then close it
    if fh:
        td = time.time() - ts
        print(' %d lines, %0.3f seconds, %d lines/sec' % (ctr, td, ctr / td))
        fh.close()

oldFilename = False
fh = False
for l in sys.stdin:
    filename = l[0:11]  # YYYYMMDD-HH prefix, i.e. one output file per hour
    if filename != oldFilename:
        closeLast()
        fn = 'filtered/%s.event.log.gz' % filename
        fh = gzip.open(fn, 'wt')
        ts = time.time()
        ctr = 0
        oldFilename = filename
        print("Logging to " + fn, end='', flush=True)
    ctr += 1
    fh.write(l)
closeLast()
On 26 January 2018 at 12:16, Mark Rogers mark@more-solutions.co.uk wrote:
Any comments? For example, would it make more sense to open the .gz files directly in python rather than piping them in? (I assume that zcat is efficient and so are pipes, and I'm unlikely to achieve anything better myself.)
To answer my own question, I had a thought about this.
gzip files can be concatenated to create a valid gzip file. Therefore, if multiple files are being combined I simply need to concatenate them, unless they span a time (eg 1hr) boundary, in which case those will need to be processed line by line.
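For example (filenames made up):

cat 20180114-00.log.gz 20180114-01.log.gz > 20180114.log.gz

and zcat 20180114.log.gz still gives every line of both parts, in order.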
The source log files have filenames which tell me when they start, so I can tell fairly easily if a file needs to be processed (if there is a later file starting in the same period then it doesn't, otherwise it does).
I haven't scripted it yet but this should get me pretty close to the raw performance of the disks, and is probably something that bash will handle fine using a combination of zgrep for the edge cases and cat for the rest (at the sacrifice of easy cross-platform implementation).
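Something like this is the shape of what I have in mind (an untested sketch; the log-YYYYMMDD-HHMMSS.log.gz filenames, the combined/ output directory and the substring offsets are made up to illustrate, and the real script will differ):

#!/bin/bash
# Combine small hourly .gz logs into one .gz per day.
mkdir -p combined
shopt -s nullglob
files=( log-*.log.gz )            # glob expansion sorts these chronologically
for i in "${!files[@]}"; do
    f=${files[$i]}
    day=${f:4:8}                  # YYYYMMDD taken from the filename
    next=${files[$((i+1))]:-}
    if [[ -n $next && ${next:4:8} == "$day" ]]; then
        # A later file starts on the same day, so this one can't straddle
        # midnight: append the raw gzip data (concatenation is still valid).
        cat "$f" >> "combined/$day.log.gz"
    else
        # Last file of the day may spill into the next day: split its lines
        # by their date prefix and recompress each day's share.
        zcat "$f" | awk '{ print | ("gzip >> combined/" substr($0,1,8) ".log.gz") }'
    fi
done

Only the last file of each day pays the decompress/recompress cost; everything else is a straight cat.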
On Thu, 2018-01-25 at 12:00 +0000, Mark Rogers wrote:
But given the volume of data are there any suggestions as to the "right" tool to use?
Isn't this what perl was designed for?
:oD
On 26 January 2018 at 09:13, Huge huge@huge.org.uk wrote:
Isn't this what perl was designed for?
Perl has largely managed to pass me by and dare I say it feels like a legacy language to be starting with now, compared with something like Python (which I have a bit of a love/hate relationship with).
Actually I'm sure that any language will handle this, and since the bottleneck is the filesystem maybe I'm overthinking it.
On Fri, 2018-01-26 at 10:50 +0000, Mark Rogers wrote:
On 26 January 2018 at 09:13, Huge huge@huge.org.uk wrote:
Isn't this what perl was designed for?
Perl has largely managed to pass me by and dare I say it feels like a legacy language
How dare you, Sir!
(You mean it's no longer fashionable.)
to be starting with now, compared with something like Python (which I have a bit of a love/hate relationship with).
Actually I'm sure that any language will handle this, and since the bottleneck is the filesystem maybe I'm overthinking it.
I'd say so.
On 26 January 2018 at 14:05, Huge huge@huge.org.uk wrote:
How dare you, Sir!
(You mean it's no longer fashionable.)
I do mean that, but in a world of shared libraries that also means that most of the exciting new stuff tends to appear in Python* or something else that isn't Perl, which is what led me to the legacy adjective.
No offence :-)
[ * This is why I have devoted some time to learning Python. Aspects of it I hate, but getting stuck with what I was used to wasn't doing me any favours either. ]
Actually I'm sure that any language will handle this, and since the bottleneck is the filesystem maybe I'm overthinking it.
I'd say so.
Referring back to my last post about concatenating files that I know fall within the same period, and only line-processing the ones that straddle boundaries: I wrote a basic bash script to test this theory and the time per hour of data dropped from ~40s to ~0.5s. The downside is that the resulting files are larger, presumably due to repeating dictionaries and the loss of the opportunity to compress similarities between source files (an md5 verification on the extracted files confirms they have the same content, but the file size jumps from, e.g., 11.2MB to 13.1MB). It's a shame to lose that disk space permanently to save a bit of time now, but that's one hell of a time trade-off to make, so I think I'm taking it.

It also shows that this isn't disk bound (since the disk activity will be largely the same either way, I think?). It looks instead like I'm CPU bound on the [de]compression.
It's been an interesting (to me, anyway) programming challenge. I'm happy to share the bash script if anyone is interested, but without any documentation it's a bit meaningless and quite bespoke to my files.
On Fri, 2018-01-26 at 14:53 +0000, Mark Rogers wrote:
On 26 January 2018 at 14:05, Huge huge@huge.org.uk wrote:
How dare you, Sir!
(You mean it's no longer fashionable.)
I do mean that, but in a world of shared libraries that also means that most of the exciting new stuff tends to appear in Python* or something else that isn't Perl, which is what led me to the legacy adjective.
Good point.
No offence :-)
None taken.
[ * This is why I have devoted some time to learning Python. Aspects of it I hate, but getting stuck with what I was used to wasn't doing me any favours either. ]
I started to learn Python, got to the first example which mixed OO and imperative notation styles, threw up and stopped. I do this for fun now, and I wasn't enjoying myself.
Glad you solved your problem.
H.