On 26 January 2018 at 12:16, Mark Rogers <mark@more-solutions.co.uk> wrote:
> Any comments? For example, would it make more sense to open the .gz files directly in Python rather than piping them in? (I assume that zcat is efficient and so are pipes, and I'm unlikely to achieve anything better myself.)
To answer my own question, I had a thought about this: gzip files can be concatenated to produce a valid gzip file. Therefore, if multiple files are being combined, I simply need to concatenate them, unless they span a time boundary (e.g. one hour), in which case those files will need to be processed line by line.

The source log files have filenames which tell me when they start, so I can tell fairly easily whether a file needs line-by-line processing: if there is a later file starting in the same period then it doesn't; otherwise it does. I haven't scripted it yet, but this should get me pretty close to the raw performance of the disks, and it is probably something that bash will handle fine using a combination of zgrep for the edge cases and cat for the rest (at the sacrifice of an easy cross-platform implementation).

--
Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450
Registered in England (0456 0902) 21 Drakes Mews, Milton Keynes, MK8 0ER
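As a follow-up, the concatenation property is easy to verify. A minimal sketch using only Python's standard-library gzip module (the log contents here are made up for illustration):

```python
import gzip

# Compress two chunks independently, as two separate log files would be.
part1 = gzip.compress(b"12:00 first event\n")
part2 = gzip.compress(b"12:01 second event\n")

# Byte-for-byte concatenation of gzip members is itself a valid gzip
# stream, so `cat a.gz b.gz > ab.gz` needs no recompression.
combined = part1 + part2

# Python's gzip module reads multi-member streams transparently, as do
# zcat and zgrep.
assert gzip.decompress(combined) == (
    b"12:00 first event\n"
    b"12:01 second event\n"
)
```

This is why plain cat suffices for whole files, with zgrep needed only for the files that straddle a period boundary.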