On 26 January 2018 at 14:05, Huge <huge@huge.org.uk> wrote:
How dare you, Sir!
(You mean it's no longer fashionable.)
I do mean that, but in a world of shared libraries that also means that most of the exciting new stuff tends to appear in Python* or something else that isn't Perl, which is what led me to the "legacy" adjective.
No offence :-)
[ * This is why I have devoted some time to learning Python. Aspects of it I hate, but getting stuck with what I was used to wasn't doing me any favours either. ]
Actually, I'm sure that any language will handle this, and since the bottleneck is the filesystem maybe I'm overthinking it.
I'd say so.
Referring back to my last post about concatenating files that I know share the same timestamp and only processing the ones that straddle boundaries: I wrote a basic bash script to test the theory, and the time-per-hour dropped from ~40s to ~0.5s. The downside is that the resulting files are larger, presumably due to repeating dictionaries and the loss of the opportunity to compress similarities between source files (an md5 verification on the extracted files confirms they have the same content, but the file size jumps from, e.g., 11.2MB to 13.1MB).

It's a shame to lose that disk space permanently to save a bit of time now, but that's one hell of a time trade-off, so I think I'm taking it. It also shows that this isn't disk-bound (since the disk activity will be largely the same, I think?); it looks instead like I'm CPU-bound due to the [de]compression.
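For anyone wondering why the concatenated output stays valid but gets bigger: gzip (which is what I'm assuming here, though other stream compressors behave similarly) treats back-to-back compressed members as one stream, so you can cat the .gz files together without touching the data, but each member keeps its own compression state. A throwaway illustration, with a.log and b.log standing in for two of the real source files:

  # Two gzip members cat'ed together decompress to the same bytes,
  # but each keeps its own dictionary/tables, so the result is larger
  # than compressing the combined plaintext in one go.
  gzip -c a.log > a.log.gz
  gzip -c b.log > b.log.gz
  cat a.log.gz b.log.gz > merged.gz        # valid gzip, zero [de]compression cost
  zcat merged.gz | md5sum                  # matches: cat a.log b.log | md5sum
  cat a.log b.log | gzip -c > onepass.gz
  ls -l merged.gz onepass.gz               # merged.gz is typically the bigger one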
It's been an interesting (to me, anyway) programming challenge. I'm happy to share the bash script if anyone is interested, but without any documentation it's a bit meaningless and quite bespoke wrt my files; the rough shape of it is below.
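Something like this, anyway (file naming and layout invented for illustration, and again assuming gzip; the real thing is rather more bespoke):

  #!/usr/bin/env bash
  # Rough sketch only -- assumes one gzipped file per source per capture,
  # named like <source>-YYYYMMDDHH*.gz, and one combined .gz per hour.
  set -euo pipefail

  hour="$1"                            # e.g. 2018012614
  out="combined-${hour}.gz"

  # Fast path: files that sit entirely inside this hour are appended as-is;
  # concatenated gzip members are still a valid stream, so no recompression.
  cat ./*-"${hour}"*.gz > "${out}"

  # Slow path (not shown): the handful of files that straddle an hour
  # boundary still get decompressed, split on timestamp and recompressed --
  # that's the only place the expensive CPU work is left.

  # Sanity check, as per the md5 comparison above.
  zcat "${out}" | md5sum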