Checking archives for missing messages

FreshPorts is pretty good about not missing commits. It depends exclusively upon the cvs-all mailing list. If the message doesn’t arrive, it doesn’t get into FreshPorts.

In the interests of increasing the complexity of FreshPorts, I’d like to validate the database against the cvs-all archives. Each commit generates one email message. Each email has a unique message-id. This id is stored in the FreshPorts database.

There are a few ways this can be done. I think it is best to first get a list of commit emails that are not in the database, then decide how best to proceed.

I can easily create a file that contains a list of all the message-ids contained within FreshPorts. It is a 13MB file. I’d like you to write me a script that will isolate the missing messages. Here is roughly what it should do:

ignore replies
look for message-ids that are not in the file named on the command line
place the missing emails in a specified directory
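
A minimal sketch of the first two rules in Python might look like the following. It is an illustration, not a finished tool. It assumes the message-id file holds one id per line (with or without angle brackets) and that a reply can be recognised by an In-Reply-To or References header, or by a subject starting with "Re:". The function names are hypothetical.

```python
from email import message_from_bytes


def normalize_id(value):
    """Strip whitespace and surrounding angle brackets from a message-id."""
    return str(value or "").strip().strip("<>").strip()


def load_known_ids(path):
    """Read one message-id per line into a set for fast lookups."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return {normalize_id(line) for line in handle if line.strip()}


def is_reply(msg):
    """Assumed heuristic: threading headers or a 'Re:' subject mark a reply."""
    if msg.get("In-Reply-To") or msg.get("References"):
        return True
    return str(msg.get("Subject", "")).lstrip().lower().startswith("re:")


def is_missing(msg, known_ids):
    """True for a non-reply whose message-id is absent from known_ids.

    A message with no Message-ID header at all is not reported here;
    that is a choice made for this sketch, not a requirement.
    """
    if is_reply(msg):
        return False
    message_id = normalize_id(msg.get("Message-ID"))
    return bool(message_id) and message_id not in known_ids
```

A message read from stdin could be checked with `is_missing(message_from_bytes(raw_bytes), known_ids)`. Holding the ids in a set keeps each lookup cheap, and a 13MB list fits comfortably in memory.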

There is a wonderful program, formail, that comes with procmail. It can be used to process each email in an archive. It will pass the email to a program for you. You don’t have to parse the whole archive. Just one email at a time.

Use Perl. Or Python. Your choice. Shell scripts are fine too, so long as you use just /bin/sh, not bash or something else. It’s easy. Something like this:

Create this file:

$ less file.sh
#!/bin/sh

cat > tmp/cvs-all.${FILENO}.txt.raw

Then run this:

cat cvs-all-archive | formail -s ./file.sh

That will split the cvs-all-archive file into separate files, one per email. See man formail for an explanation of FILENO.
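
If you pick Python rather than /bin/sh, the same per-message handler could be sketched roughly like this. It relies only on what formail -s already provides: the message on stdin and its number in the FILENO environment variable. The function name and the check for FILENO before running are my own assumptions for illustration.

```python
#!/usr/bin/env python3
import os
import sys


def save_message(raw, out_dir, fileno):
    """Write one raw message to out_dir/cvs-all.<fileno>.txt.raw; return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "cvs-all.%s.txt.raw" % fileno)
    with open(path, "wb") as handle:
        handle.write(raw)
    return path


# Only act when formail has set FILENO, i.e. when run as:
#   formail -s ./this-script < cvs-all-archive
if __name__ == "__main__" and "FILENO" in os.environ:
    # Read bytes, not text, so messages in any encoding are stored unchanged.
    save_message(sys.stdin.buffer.read(), "tmp", os.environ["FILENO"])
```

Unlike the shell version, this creates the tmp directory if it does not already exist.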

What I’d like is a script I can invoke like this:

# your-program /path/to/cvs-all-archive /path-to/file/of/messages-ids /path/to/missing-emails

Where:

/path/to/cvs-all-archive – filename of a cvs-all archive
/path-to/file/of/messages-ids – filename containing list of message-ids to compare against
/path/to/missing-emails – directory into which messages [from the cvs-all archive which have message-ids which do not appear in the list] are placed, one per file
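
To make the three arguments concrete, here is one way the whole thing could be sketched in Python. Note the substitution: it walks the archive with the standard library's mailbox module instead of formail, which assumes the archive is in mbox format (messages separated by "From " lines). The reply test, the output file names (missing.000001.txt and so on), and the function name are likewise assumptions made for this sketch.

```python
import mailbox
import os
import sys


def normalize_id(value):
    """Strip whitespace and surrounding angle brackets from a message-id."""
    return str(value or "").strip().strip("<>").strip()


def find_missing(archive, ids_file, out_dir):
    """Copy non-reply messages whose message-id is not in ids_file into out_dir.

    Returns the number of messages written.
    """
    with open(ids_file, encoding="utf-8", errors="replace") as handle:
        known = {normalize_id(line) for line in handle if line.strip()}
    os.makedirs(out_dir, exist_ok=True)
    written = 0
    box = mailbox.mbox(archive, create=False)
    try:
        for key, msg in box.iteritems():
            subject = str(msg.get("Subject", "")).lstrip().lower()
            if msg.get("In-Reply-To") or msg.get("References") or subject.startswith("re:"):
                continue  # assumed heuristic for "this is a reply"
            message_id = normalize_id(msg.get("Message-ID"))
            if not message_id or message_id in known:
                continue  # no id to compare, or already in the database
            written += 1
            path = os.path.join(out_dir, "missing.%06d.txt" % written)
            with open(path, "wb") as out:
                out.write(box.get_bytes(key))  # raw message, one per file
    finally:
        box.close()
    return written


# Runs only when given exactly the three arguments described above.
if __name__ == "__main__" and len(sys.argv) == 4:
    print(find_missing(sys.argv[1], sys.argv[2], sys.argv[3]), "missing messages written")
```

The files in the output directory can then be inspected by hand before deciding how to feed them back in.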

Does that make sense? Feel free to make suggestions for how the program should work, and what it should do.

Thank you.
