[ALUG] Re: Comparing large folders of mostly identical file contents with different filenames

5 May 2022


      On 05/05/2022 09:39, Mark Rogers wrote:
...
I have a backup of a GMail account which creates folders by year,
month, date and within them .eml files for each email.
I'm migrating to a different GMail account so I'm moving the emails to
the new account, and I have a backup also of the second account.
So it's something like:
    Account1/2022/5/1/aaaa.eml
and
    Account2/2022/5/1/bbbb.eml
.. with each being identical emails (a diff of the two files will confirm that).
Of course I don't just have 1 email. I have about 200,000, and the
number of files in the two backups is slightly different so I want to
find the discrepancies.
So: Any suggestions? I could run a standard dedup program but that'll
be very slow given that it'll be comparing 200k files with another
200k files. In most cases I suspect the missing file(s) will stand out
just from their file sizes on a per-directory comparison.
Many thoughts.  The first is, just run a dedup and wait. Alternatively...
You say emails are in the format
Account1/2022/5/1/aaaa.eml
but you also say there are 200k emails.  I'm guessing that there are 200k emails in total, but way less than that in a directory.  If that's the case, write a script and call a dedup program on each directory.
But that'll probably take a while again.
[NB, I occasionally use fslint to de-dup]
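Before running a full dedup on each directory, a cheaper first pass might be the per-directory size check from the original question. The following is only a sketch, assuming both backups share the same year/month/day layout under their roots and that matching emails have equal byte sizes. The function names `size_counts` and `dirs_with_size_mismatch` are made up for illustration.

```python
from collections import Counter
from pathlib import Path

def size_counts(directory):
    """Count how many .eml files of each byte size sit directly in `directory`."""
    directory = Path(directory)
    if not directory.is_dir():
        return Counter()  # a directory missing from one backup counts as empty
    return Counter(p.stat().st_size for p in directory.glob("*.eml"))

def dirs_with_size_mismatch(root1, root2):
    """Return {relative_dir: (sizes_only_in_1, sizes_only_in_2)} for differing dirs."""
    root1, root2 = Path(root1), Path(root2)
    # Collect every dated directory that holds .eml files in either backup.
    rel_dirs = {p.parent.relative_to(root)
                for root in (root1, root2)
                for p in root.rglob("*.eml")}
    mismatches = {}
    for rel in sorted(rel_dirs):
        a, b = size_counts(root1 / rel), size_counts(root2 / rel)
        if a != b:
            # Counter subtraction keeps only the surplus sizes on each side.
            mismatches[rel] = (sorted((a - b).elements()),
                               sorted((b - a).elements()))
    return mismatches
```

This only reads file metadata, so it should be quick even for 200k files, but two different emails of equal size in the same directory would not be noticed. Directories it flags can then be checked by content.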
Back everything up first!
Write a program:
So you say that you have different directories, one email per file.  I'm presuming that file Account1/2022/5/1/aaaa.eml could be the same email as Account2/2022/5/1/bbbb.eml but with a different name.
What information can you extract about the email?  Is the 2022/5/1 the date of the email, or the date of the backup?  Anyway, my suggestion is this:  Presuming the filename is just a sequential number and of no consequence, rename each file.  Use whatever info about the email that you can discern.
If the 2022/5/1 is the email date, use that.  Find the email size.  Perhaps do a hash of the email contents (e.g. md5?).  Search the email's contents:  Find the first date and time in the email.  Find the message ID.  Rename the file using all of that (but make a note of the original filename).
So, your files will be renamed something like:
YYYYMMDDHHMM-MESSAGEID-FILESIZE_IN_BYTES-HASH
Now, if I'm right, identical messages in the Account1 and Account2 directory should be named identically.
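A rough sketch of how such a name could be built in Python follows. It rests on several assumptions: the `Date` and `Message-ID` headers are the ones to use (they are set when the mail is sent, whereas `Received` lines are added by each relay), SHA-256 stands in for the md5 suggested above, and the two copies of an email are byte-identical. `canonical_name` is a hypothetical helper name.

```python
import hashlib
import re
from email import message_from_bytes
from email.utils import parsedate_to_datetime

def canonical_name(raw):
    """Build YYYYMMDDHHMM-MESSAGEID-FILESIZE_IN_BYTES-HASH from raw .eml bytes."""
    msg = message_from_bytes(raw)
    try:
        stamp = parsedate_to_datetime(msg["Date"]).strftime("%Y%m%d%H%M")
    except (TypeError, ValueError):
        stamp = "000000000000"  # placeholder: Date header missing or unparseable
    message_id = str(msg["Message-ID"] or "no-id").strip().strip("<>")
    # Replace anything that is awkward in a filename (slashes, spaces, etc.).
    safe_id = re.sub(r"[^A-Za-z0-9._@-]", "_", message_id)
    digest = hashlib.sha256(raw).hexdigest()  # assumption: SHA-256 instead of md5
    return f"{stamp}-{safe_id}-{len(raw)}-{digest}"
```

Calling `canonical_name(path.read_bytes())` on each file and recording the result next to the original path keeps the note of the original filename without renaming anything on disk.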
Now you can copy all the Account1 emails into 1 directory, and all the Account2 emails into a 2nd directory.   Then run a simple directory file comparison on the whole lot.  You should easily see the missing files.
Meld would be my go-to tool to do that.  Once you've found the differences, then you can work back to the original filename to work out what's what.
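The same comparison can also be sketched without renaming or copying anything, by indexing each backup by content hash and keeping the original paths alongside. Each file is hashed once, so this is one pass over roughly 400k files, not 200k files compared against 200k files. It assumes matching emails are byte-identical. `index_by_digest` and `missing_from` are illustrative names only.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def index_by_digest(root):
    """Map SHA-256 digest -> list of .eml paths under `root` with that content."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob("*.eml")):
        index[hashlib.sha256(path.read_bytes()).hexdigest()].append(path)
    return index

def missing_from(index_a, index_b):
    """Return paths whose content is in index_a but appears nowhere in index_b."""
    return [path
            for digest, paths in index_a.items()
            if digest not in index_b
            for path in paths]
```

For example, `missing_from(index_by_digest("Account1"), index_by_digest("Account2"))` would list the Account1 files with no counterpart in Account2, already under their original filenames.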
However, I suspect that the original filenames are irrelevant.  If that's the case, you could skip the comparison step.  Copy all the Account1 emails into a directory.  Then copy all the Account2 emails into the same directory.  If duplicated files occur (which they should if my assumptions are correct), then skip or overwrite (don't rename).
Once all the files are in the directory, import it into the email system.
I'm assuming by the way that the .eml files are in a plain format.  The first few lines of your message (the one I'm replying to) saved as a .eml file are
Return-path: main-bounces@lists.alug.org.uk
Envelope-to: steve-alug@hst.me.uk
Delivery-date: Thu, 05 May 2022 09:40:25 +0100
Received: from the.earth.li ([93.93.131.124])
    by hst.me.uk with esmtps  (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    (Exim 4.93)
    (envelope-from main-bounces@lists.alug.org.uk)
    id 1nmX28-0009Ou-24
    for steve-alug@hst.me.uk; Thu, 05 May 2022 09:40:25 +0100
Use a search tool (sed, awk?)  to search for and return the text between the first date and the end of line - use this as the date.  e.g. https://unix.stackexchange.com/questions/188782/how-to-extract-text-using-se...
Use a search tool to search for and return the text between the first id and the end of line - use this as the id.
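For anyone who would sooner do the search in Python than in sed or awk, here is a small sketch of the same idea: return the text between a header name and the end of its line. It assumes the header of interest sits on a single line (folded headers that continue on the next line are not joined), and `first_header` is a made-up helper name.

```python
import re

def first_header(raw_text, name):
    """Return the value of the first `name:` header line, or None if absent."""
    text = raw_text.replace("\r\n", "\n")
    # Headers end at the first blank line; ignore anything in the body.
    header_block = text.split("\n\n", 1)[0]
    pattern = re.compile(rf"^{re.escape(name)}:[ \t]*(.*)$",
                         re.IGNORECASE | re.MULTILINE)
    match = pattern.search(header_block)
    return match.group(1).strip() if match else None
```

With the sample headers above, `first_header(text, "Delivery-date")` gives the text after `Delivery-date:` on that line. Which header is best to key on (for instance `Date` or `Message-ID`) depends on what the two backups have in common.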
Hope that helps.
Steve
