Apologies for the slow reply; I've mostly been reading email on my mobile, and it can't send a plain-text reply.
On Thu, 5 May 2022 at 19:34, steve-ALUG@hst.me.uk wrote:
> but you also say there are 200k emails. I'm guessing that there are 200k emails in total, but way less than that in a directory.
Correct
> [NB, I occasionally use fslint to de-dup]
Ditto, both to fslint and "occasionally" - i.e. there are probably loads of things it does that I don't know about!
> Write a program:
That's where I thought I might end up!
> so you say that you have different directories, one email per file.
Yes
> I'm presuming that file Account1/2022/5/1/aaaa.eml could be the same email as Account2/2022/5/1/bbbb.eml but with a different name.
Yes.
In theory contents and file size will match, although I have found that there are some discrepancies due to things like line endings - I'm guessing the restore has fixed some poor formatting issues in the backups.
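One way to compare contents despite that sort of discrepancy is to hash each file after normalising its line endings. A minimal Python sketch, assuming line endings are the only difference between two copies (the function name normalised_digest is made up for illustration):

```python
import hashlib


def normalised_digest(raw: bytes) -> str:
    """Return a SHA-256 hex digest of raw with CRLF and bare CR turned into LF."""
    # Assumption: differing line endings are the only discrepancy between copies.
    text = raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.sha256(text).hexdigest()
```

Two files whose digests match would then be candidates for the same email even if their sizes differ slightly.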
> What information can you extract about the email? Is the 2022/5/1 the date of the email, or the date of the backup?
Date of the email. That and the file size are all I have directly, but I can obviously grep the files for anything else I might want (subject, sender, etc).
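As an alternative to grep, the headers can be pulled out with Python's standard email package, which also copes with folded (multi-line) headers. A rough sketch, assuming each .eml file is a single RFC 5322 message; the function name header_summary is hypothetical:

```python
from email import policy
from email.parser import BytesHeaderParser


def header_summary(raw: bytes) -> dict:
    """Return the Subject and From headers of one raw message as plain strings."""
    # Only the header block is parsed; the body is not examined.
    msg = BytesHeaderParser(policy=policy.default).parsebytes(raw)
    # A missing header yields an empty string rather than None.
    return {name: str(msg.get(name, "")) for name in ("Subject", "From")}
```

Typical use would be header_summary(open(path, "rb").read()) for each file of interest.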
<<Big snip>>
Where I ended up was to use:

    find dir1/ dir2/ -type f -name '*.eml' -printf '%h %s %f\n'

which gives me all the files with file sizes, formatted as:

    dir1/2019/3/30 112874 169d0343152c5ad6.eml
I then wrote a hacky PHP script to sort and filter this output, removing any pair of files that had the same file size within the same date-named directory. It then listed the remaining files along with the Subject of each email, grep'd from the file, for manual processing.
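The filtering step can be sketched in Python as below. This is an illustration, not the PHP script itself. It assumes input lines in the "directory size filename" layout produced by the find command, directory names without spaces, and that a match means "same date path and same size under more than one top-level directory". The function name unmatched_files is made up.

```python
from collections import defaultdict


def unmatched_files(lines):
    """Return paths of files with no same-date, same-size twin in another root."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumption: the directory part contains no spaces; the filename may.
        directory, size, name = line.split(" ", 2)
        # "dir1/2019/3/30" -> root "dir1", date path "2019/3/30"
        root, _, date = directory.partition("/")
        groups[(date, int(size))].append((root, name))

    remaining = []
    for (date, size), members in sorted(groups.items()):
        roots = {root for root, _ in members}
        if len(roots) > 1:
            # Present under more than one root: treat as duplicates and skip.
            continue
        for root, name in sorted(members):
            remaining.append(f"{root}/{date}/{name}")
    return remaining
```

Feeding it the find output, for example via sys.stdin, leaves only the files that still need a manual look. Matching on size alone can of course pair up two different emails that happen to be the same length on the same day.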
Incidentally, I can recommend GYB[1] for archiving Gmail mailboxes, and for restoring those backups later. It's a bit fiddly and often inflexible, but it's done what I need to migrate emails from legacy free workspace accounts to free ones.
> Hope that helps.
It did, thank you.
[1] https://github.com/GAM-team/got-your-back