I have a backup of a GMail account which creates folders by year, month, date and within them .eml files for each email.
I'm migrating to a different GMail account so I'm moving the emails to the new account, and I have a backup also of the second account.
So it's something like: Account1/2022/5/1/aaaa.eml and Account2/2022/5/1/bbbb.eml, the two being the same email under different filenames (a diff of the two files will confirm that).
Of course I don't just have 1 email. I have about 200,000, and the number of files in the two backups is slightly different so I want to find the discrepancies.
So: Any suggestions? I could run a standard dedup program but that'll be very slow given that it'll be comparing 200k files with another 200k files. In most cases I suspect the missing file(s) will stand out just from their file sizes on a per-directory comparison.
On 05/05/2022 09:39, Mark Rogers wrote:
So: Any suggestions? I could run a standard dedup program but that'll be very slow given that it'll be comparing 200k files with another 200k files. In most cases I suspect the missing file(s) will stand out just from their file sizes on a per-directory comparison.
Many thoughts. The first is, just run a dedup and wait. Alternatively...
You say emails are in the format
Account1/2022/5/1/aaaa.eml
but you also say there are 200k emails. I'm guessing that there are 200k emails in total, but way less than that in a directory. If that's the case, write a script and call a dedup program on each directory.
But that'll probably take a while again.
[NB, I occasionally use fslint to de-dup]
Back everything up first!
Write a program:
so you say that you have different directories, one email per file. I'm presuming that file Account1/2022/5/1/aaaa.eml could be the same email as Account2/2022/5/1/bbbb.eml but with a different name.
What information can you extract about the email? Is the 2022/5/1 the date of the email, or the date of the backup? Anyway, my suggestion is this: Presuming the filename is just a sequential number and of no consequence, rename each file. Use whatever info about the email that you can discern. If the 2022/5/1 is the email date, use that. Find the email size. Perhaps do a hash of the email contents (e.g. md5?). Search the email's contents: Find the first date and time in the email. Find the message ID. Rename the file using all of that (but make a note of the original filename).
So, your files will be renamed something like: YYYYMMDDHHMM-MESSAGEID-FILESIZE_IN_BYTES-HASH
Now, if I'm right, identical messages in the Account1 and Account2 directory should be named identically.
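A minimal sketch of that naming scheme in Python, assuming the .eml files are plain RFC 5322 messages with Date and Message-ID headers. The function name canonical_name, the zero fallback for a missing date and the character whitelist are illustrative choices, not part of the original suggestion.

```python
import hashlib
import re
from email.parser import BytesHeaderParser
from email.utils import parsedate_to_datetime


def canonical_name(raw: bytes) -> str:
    """Build a DATE-MESSAGEID-SIZE-HASH name from the raw bytes of one .eml file."""
    headers = BytesHeaderParser().parsebytes(raw)

    # Date header -> YYYYMMDDHHMM; fall back to zeros if missing or unparseable.
    try:
        stamp = parsedate_to_datetime(str(headers["Date"])).strftime("%Y%m%d%H%M")
    except (TypeError, ValueError):
        stamp = "0" * 12

    # Message-ID without angle brackets; anything outside a conservative
    # character set becomes "_" so the result is safe to use as a filename.
    message_id = str(headers["Message-ID"] or "noid").strip().strip("<>")
    message_id = re.sub(r"[^A-Za-z0-9@._+]", "_", message_id)

    # Size and hash are taken over the exact bytes of the file.
    digest = hashlib.md5(raw).hexdigest()
    return f"{stamp}-{message_id}-{len(raw)}-{digest}"
```

Reading each file with open(path, "rb").read() and passing the bytes in gives the new name; keeping a mapping from new name back to original path covers the "make a note of the original filename" step.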
Now you can copy all the Account1 emails into 1 directory, and all the Account2 emails into a 2nd directory. Then run a simple directory file comparison on the whole lot. You should easily see the missing files.
Meld would be my go-to tool to do that. Once you've found the differences, then you can work back to the original filename to work out what's what.
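If a graphical tool is overkill, the same comparison can be sketched in a few lines of Python, assuming the renamed files have been copied into two flat directories. The function name compare_listings and the directory names in the usage note are hypothetical.

```python
import os


def compare_listings(dir_a: str, dir_b: str):
    """Return (names only in dir_a, names only in dir_b), each sorted."""
    names_a = set(os.listdir(dir_a))
    names_b = set(os.listdir(dir_b))
    return sorted(names_a - names_b), sorted(names_b - names_a)
```

Calling compare_listings("Account1-flat", "Account2-flat") would return the two lists of names that have no counterpart on the other side.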
However, I suspect that the original filenames are irrelevant. If that's the case, you could skip the comparison step. Copy all the Account1 emails into a directory. Then copy all the Account2 emails into the same directory. If duplicated files occur (which they should if my assumptions are correct), then skip or overwrite (don't rename). Once all the files are in the directory, import it into the email system.
I'm assuming by the way that the .eml files are in a plain format. The first few lines of your message (the one I'm replying to) saved as a .eml file are
Return-path: main-bounces@lists.alug.org.uk
Envelope-to: steve-alug@hst.me.uk
Delivery-date: Thu, 05 May 2022 09:40:25 +0100
Received: from the.earth.li ([93.93.131.124]) by hst.me.uk with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from main-bounces@lists.alug.org.uk) id 1nmX28-0009Ou-24 for steve-alug@hst.me.uk; Thu, 05 May 2022 09:40:25 +0100
Use a search tool (sed, awk?) to search for and return the text between the first date and the end of line - use this as the date. e.g. https://unix.stackexchange.com/questions/188782/how-to-extract-text-using-se...
Use a search tool to search for and return the text between the first id and the end of line - use this as the id.
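The same search can be sketched in Python rather than sed or awk. This is only an illustration: the function name first_header is made up, it looks only at the header section (everything before the first blank line), and it does not handle header values folded across several lines. Anchoring the match at the start of a line matters, because a plain search for "date" would hit the Delivery-date line first.

```python
import re
from typing import Optional


def first_header(raw: bytes, name: str) -> Optional[str]:
    """Return the value of the first header called `name`, or None if absent."""
    # The header section ends at the first blank line (CRLF or LF endings).
    head = re.split(rb"\r?\n\r?\n", raw, maxsplit=1)[0]
    # Match "Name: value" only at the start of a line, ignoring case.
    pattern = rb"^" + re.escape(name.encode("ascii")) + rb":[ \t]*(.*?)\r?$"
    match = re.search(pattern, head, re.IGNORECASE | re.MULTILINE)
    return match.group(1).decode("utf-8", "replace") if match else None
```

Here first_header(raw, "Date") and first_header(raw, "Message-ID") would give the two values suggested for the new filename.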
Hope that helps.
Steve
Apologies for the slow reply, I've been mostly reading emails on my mobile and it can't do a plain-text reply.
On Thu, 5 May 2022 at 19:34, steve-ALUG@hst.me.uk wrote:
but you also say there are 200k emails. I'm guessing that there are 200k emails in total, but way less than that in a directory.
Correct
[NB, I occasionally use fslint to de-dup]
Ditto, both to fslint and "occasionally" - ie there are probably loads of things it does I don't know about!
Write a program:
That's where I thought I might end up!
so you say that you have different directories, one email per file.
Yes
I'm presuming that file Account1/2022/5/1/aaaa.eml could be the same email as Account2/2022/5/1/bbbb.eml but with a different name.
Yes.
In theory contents and file size will match, although I have found that there are some discrepancies due to things like line endings - I'm guessing the restore has fixed some poor formatting issues in the backups.
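One way round the line-ending differences, sketched in Python under the assumption that CRLF versus LF is the only discrepancy, is to hash a normalised copy of each file instead of the raw bytes. The function name and the choice of SHA-256 are arbitrary.

```python
import hashlib


def normalised_digest(raw: bytes) -> str:
    """Hash the message after converting CRLF line endings to LF."""
    return hashlib.sha256(raw.replace(b"\r\n", b"\n")).hexdigest()
```

Two copies of a message that differ only in line endings then get the same digest, although their file sizes will still differ.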
What information can you extract about the email? Is the 2022/5/1 the date of the email, or the date of the backup?
Date of the email. That and file size are all I have directly but I can obviously grep the files for anything else I might want (subject, sender, etc).
<<Big snip>>
Where I ended up was to use:
find dir1/ dir2/ -type f -name '*.eml' -printf '%h %s %f\n'
which gives me all the files with file sizes, formatted as:
dir1/2019/3/30 112874 169d0343152c5ad6.eml
I then wrote a hacky PHP script to sort and filter this output removing any pairs of files with the same filesize from any date-named directory. It then listed the remaining files along with the Subject of the emails grep'd from the file, for manual processing.
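The PHP script itself isn't shown, so the following is only a rough Python sketch of the filtering step as described, not the actual script. It assumes the find output format above, directory names without spaces, and the two top-level roots dir1 and dir2. The function name unmatched is made up, and the Subject lookup is left out.

```python
from collections import defaultdict


def unmatched(lines, root_a="dir1", root_b="dir2"):
    """Pair off files of equal size within each date directory; return the rest.

    Each input line looks like "dir1/2019/3/30 112874 169d0343152c5ad6.eml".
    Returns a list of (root, date_dir, size, filename) tuples.
    """
    # date_dir -> root -> size -> [filenames]
    by_dir = defaultdict(lambda: {root_a: defaultdict(list), root_b: defaultdict(list)})
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        directory, size, name = line.split(" ", 2)
        root, _, date_dir = directory.partition("/")
        if root in (root_a, root_b):
            by_dir[date_dir][root][int(size)].append(name)

    leftovers = []
    for date_dir in sorted(by_dir):
        a, b = by_dir[date_dir][root_a], by_dir[date_dir][root_b]
        for size in sorted(set(a) | set(b)):
            names_a, names_b = a.get(size, []), b.get(size, [])
            paired = min(len(names_a), len(names_b))  # these cancel out
            leftovers += [(root_a, date_dir, size, n) for n in names_a[paired:]]
            leftovers += [(root_b, date_dir, size, n) for n in names_b[paired:]]
    return leftovers
```

Feeding it the lines of the find output (for example from sys.stdin) leaves only the files with no same-sized partner in the matching date directory, which is the list that then needs checking by hand.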
Incidentally I can recommend GYB[1] for archiving GMail mailboxes, and for restoring said backups later. It's a bit fiddly and often inflexible but it's done what I need to migrate emails from legacy free workspace accounts to free ones.
Hope that helps.
It did, thank you.
[1] https://github.com/GAM-team/got-your-back
On Mon, 9 May 2022 09:05:03 +0100 Mark Rogers mark@more-solutions.co.uk allegedly wrote:
Apologies for the slow reply, I've been mostly reading emails on my mobile and it can't do a plain-text reply.
Mark
Try k-9 mail, that /can/ do plain text.
Mick
---------------------------------------------------------------------
Mick Morgan
gpg fingerprint: FC23 3338 F664 5E66 876B 72C0 0A1F E60B 5BAD D312
https://baldric.net/about-trivia
---------------------------------------------------------------------
On Mon, 16 May 2022 at 17:35, mick mbm@rlogin.net wrote:
Try k-9 mail, that /can/ do plain text.
I used to like K-9 years ago, but haven't used it for ages. I did go to install it a few months ago but the reviews put me off - all of which related to its new user interface.
I did install FairEmail instead which can also handle plain text, but I don't get on with it. It's too restrictive for my taste out of the box, and although everything is configurable, I haven't found the time to sit and go through all the settings to make it functional for me.
Are you currently a K-9 user, and if so how do you get on with its new interface?
On Sun, 22 May 2022 at 09:19, Mark Rogers mark@more-solutions.co.uk wrote:
I used to like K-9 years ago, but haven't used it for ages. I did go to install it a few months ago but the reviews put me off - all of which related to its new user interface.
I just installed K-9 to play with but its lack of XOAUTH2 support combined with me having about 15 Google email accounts to monitor meant I uninstalled it!
(I know it can be worked around by enabling application passwords but that shouldn't really be necessary these days.)
On 22 May 2022 09:19:27 BST, Mark Rogers mark@more-solutions.co.uk wrote:
Are you currently a K-9 user, and if so how do you get on with its new interface?
Yes (he said, replying via K9). And I find the interface fine.
Mick