I have many gigabytes of .gz-compressed log files, mostly made up of
very small files. For example, one file contains only entries from
20180114-205608 to 20180114-205746 (about 90 seconds), and is about
350 kB compressed.
I want to extract these and combine them into one file per day*, compressed.
Log file entries are in date/time order, and filenames are easily
sorted into date/time order too, so I can easily generate a stream of
log entries in date/time order from which to work. So I assume that
what I need is something to pipe that stream into, which will look at
the date on each line and, whenever it changes, close any existing
output file, open a new one (gzipped), and write to it until the date
changes again or EOF is reached.
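In Python that might look something like this rough sketch (assuming
every line starts with its YYYYMMDD-hhmmss timestamp, and with the
output filename pattern being purely illustrative):

    #!/usr/bin/env python3
    # Rough sketch: split an already-sorted stream of log lines (stdin)
    # into one gzipped file per day, keyed on the YYYYMMDD prefix.
    import sys
    import gzip

    current = None   # date of the file currently being written
    out = None

    for line in sys.stdin:
        date = line[:8]              # "20180114" from "20180114-205608 ..."
        if date != current:
            if out is not None:
                out.close()          # finish the previous day's file
            out = gzip.open(date + ".log.gz", "wt")   # illustrative name
            current = date
        out.write(line)

    if out is not None:
        out.close()

Fed with something like zcat *.gz | python3 split-logs.py (whatever the
script ends up being called), that should give one .gz per day.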
I could write something like that in a scripting language (for my sins,
PHP would be easiest, Python I'm getting better at, and bash could
likely do it too). But given the volume of data, are there any
suggestions as to the "right" tool to use? Is this a job for awk or
similar?
As an aside: The files are all archived on my Linux box, but they're
sourced from a Windows box, so a cross-platform solution would let me
do this on the host; transferring lots of small files isn't as
efficient as transferring a few big ones (although I have archives
going back years on my Linux box to work through first).
* I say a file per day but ideally I could be flexible about the
period - day is most likely and probably easiest though
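(With a sketch like the one above, the period is just the prefix of the
timestamp being grouped on; e.g. the first six characters instead of
eight would give one file per month.)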
--
Mark Rogers // More Solutions Ltd (Peterborough Office) // 0844 251 1450
Registered in England (0456 0902) 21 Drakes Mews, Milton Keynes, MK8 0ER