On 26 January 2018 at 10:47, Mark Rogers mark@more-solutions.co.uk wrote:
Can awk either write directly to gzipped files, or else can the above be modified to pipe through gzip? It's not just the relative efficiency of doing it in one step but also the volume of disk space I'm going to chew up otherwise. I know that PHP can write .gz directly, but that feels like a horrible tool for the job in general.
I've written a Python script to do this (as it can write directly to .gz much as PHP can but doesn't feel like a horrible choice).
I'm currently splitting into one file per hour's data due to their size; each file has around 1,700,000 lines (yes, that's per hour) by the look of it, and I'm averaging 30-45s per file. That works out to around 8 hours to process one month's data. I have maybe 10 years of files to work through (so that'll run for about a month and a half).
The script is below (I'm only really getting to grips with Python, so I'm sure it's not great code). I run it using: $ zcat *.log.gz | ./filter.py
Any comments? For example, would it make more sense to open the .gz files directly in Python rather than piping them in? (I assume that zcat is efficient and so are pipes, and I'm unlikely to achieve anything better myself.)
#!/usr/bin/env python3
import gzip, sys, time
def closeLast():
    # fh, ts and ctr are module-level globals set in the loop below
    if fh:
        td = time.time() - ts
        print(' %d lines, %0.3f seconds, %d lines/sec' % (ctr, td, ctr / td))
        fh.close()

oldFilename = False
fh = False
for l in sys.stdin:
    # the first 11 characters of each line identify the hour it belongs to
    filename = l[0:11]
    if filename != oldFilename:
        closeLast()
        fn = 'filtered/%s.event.log.gz' % filename
        fh = gzip.open(fn, 'wt')
        ts = time.time()
        ctr = 0
        oldFilename = filename
        # end='' so that closeLast() appends the stats to the same line
        print("Logging to " + fn, end='', flush=True)
    ctr += 1
    fh.write(l)
closeLast()
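
For reference, reading the .gz files directly would look something like the untested sketch below: gzip.open in text mode ('rt') yields the same stream of decoded lines that zcat produces, so the pipe could be dropped. The glob pattern and handle_line are placeholders, not part of the script above.

import glob
import gzip

def handle_line(line):
    pass  # the per-line filtering from filter.py would go here

# assumes the inputs match *.log.gz in the current directory
for path in sorted(glob.glob('*.log.gz')):
    # 'rt' decompresses and decodes to text, yielding lines like zcat does
    with gzip.open(path, 'rt') as src:
        for line in src:
            handle_line(line)

Whether that beats zcat through a pipe is an empirical question; zcat's C decompressor feeding a pipe may well be as fast as, or faster than, decompressing in-process.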
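On the write side, if the gzip module's compression turns out to be the bottleneck, one alternative would be to pipe each output file through an external gzip process rather than compressing in-process. A rough, untested sketch (the output path here is hypothetical):

import subprocess

out = open('filtered/example.event.log.gz', 'wb')
# gzip reads from the pipe and writes compressed data to the file
gz = subprocess.Popen(['gzip', '-c'], stdin=subprocess.PIPE, stdout=out)

gz.stdin.write(b'one log line\n')  # bytes, not str, on this pipe

gz.stdin.close()  # lets gzip flush its buffers and exit
gz.wait()
out.close()

No idea whether that's actually faster here; it just moves the compression work into a separate process so it can overlap with the Python loop.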