On 26 January 2018 at 12:16, Mark Rogers <mark@more-solutions.co.uk> wrote:
Any comments? For example, would it make more sense to open the .gz files directly in Python rather than piping them in? (I assume that zcat is efficient and so are pipes, and I'm unlikely to achieve anything better myself.)
To answer my own question, I had a thought about this.
gzip files can be concatenated to produce a valid gzip file. Therefore, when multiple files are being combined I can simply concatenate them, unless a file spans a time boundary (eg 1hr), in which case it will need to be processed line by line.
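As a quick sanity check of the concatenation property (throwaway filenames, any shell with gzip installed):

    printf 'one\n' | gzip > a.gz
    printf 'two\n' | gzip > b.gz
    cat a.gz b.gz > combined.gz
    zcat combined.gz    # prints "one" then "two" - both members decompress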
The source log files have filenames that tell me when they start, so I can tell fairly easily whether a file needs line-by-line processing: if a later file starts in the same period, this one cannot cross the boundary and can be concatenated as-is; otherwise it has to be split.
I haven't scripted it yet, but this should get me pretty close to the raw performance of the disks, and it's probably something bash will handle fine using a combination of zgrep for the edge cases and cat for the rest (at the cost of easy cross-platform implementation).
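For what it's worth, here's a rough sketch of how I'd expect the bash version to look. The filename pattern (access-YYYYmmddHHMMSS.log.gz) and the assumption that each line's first field starts with YYYYmmddHH are mine, not taken from the real logs, and I've used zcat + awk rather than zgrep for the boundary files, since routing lines to the right hour depends on the log format:

    #!/usr/bin/env bash
    # Rough sketch only - adjust the filename pattern and the per-line
    # timestamp field to match the real log format.
    set -euo pipefail
    mkdir -p out

    files=(access-*.log.gz)   # glob sorts lexically, so list order == time order

    for i in "${!files[@]}"; do
        f=${files[$i]}
        next=${files[$((i + 1))]:-}

        hour=$(basename "$f" | sed -E 's/^access-([0-9]{10}).*/\1/')
        next_hour=""
        [ -n "$next" ] && next_hour=$(basename "$next" | sed -E 's/^access-([0-9]{10}).*/\1/')

        if [ "$hour" = "$next_hour" ]; then
            # A later file starts in the same hour, so this one can't cross
            # the boundary: append the raw gzip member, no decompression.
            cat "$f" >> "out/$hour.gz"
        else
            # Possible boundary-spanning file: decompress and route each
            # line to the hour in its timestamp (field layout is a guess).
            zcat "$f" | awk '{ h = substr($1, 1, 10)
                               print | ("gzip >> out/" h ".gz") }'
        fi
    done

Appending via gzip >> relies on the same multi-member property as the cat path: each append just adds another valid gzip member to that hour's output file, so the two paths can be mixed freely.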