Use of mailbag.csv #5

Closed
gwiedeman opened this issue Jun 29, 2021 · 4 comments

Comments

@gwiedeman
Contributor

The choice of using a CSV tag file to serialize message-level information was also questioned during the working meeting. CSVs invite errors because they can be written with a variety of delimiters and dialects. Large numbers of rows can also cause problems, since many tools impose row limits, often around 1 million rows. We had some useful discussions about using JSON or another serialization without these issues, but concluded that CSVs were more useful for the Nicholas Garza, Teresa Burns, and Gary Richardson personas, since they are likely to be more comfortable opening and reading a CSV file in spreadsheet software than a JSON file. A suggestion from the working meeting was to break up the CSV into multiple files after a certain number of rows, much like WARC files, so we decided to split the file after 100,000 rows.
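As a rough illustration of the 100,000-row split described above, here is a minimal Python sketch. The function name `write_split_csv` and the `mailbag-N.csv` file naming are hypothetical and not taken from the specification; the sketch also pins the CSV dialect explicitly, which is one way to reduce the delimiter and dialect ambiguity mentioned above.

```python
import csv
from pathlib import Path

ROWS_PER_FILE = 100_000  # split threshold mentioned in the comment above


def write_split_csv(rows, header, out_dir, rows_per_file=ROWS_PER_FILE):
    """Write rows to numbered CSV files, starting a new file every rows_per_file rows.

    Returns the list of paths written. Each file repeats the header row.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    writer = None
    try:
        for index, row in enumerate(rows):
            if index % rows_per_file == 0:
                if handle is not None:
                    handle.close()
                # Hypothetical naming scheme, NOT defined by the specification.
                path = out_dir / f"mailbag-{len(paths) + 1}.csv"
                handle = path.open("w", newline="", encoding="utf-8")
                # Pin one dialect so every file is written the same way.
                writer = csv.writer(handle, dialect="excel")
                writer.writerow(header)
                paths.append(path)
            writer.writerow(row)
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Because `rows` can be any iterable, a generator that yields one row per message would let the files be written without holding the whole table in memory.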

We also discussed how the specification’s requirement of a separate mailbag.csv tag file is one of the few major costs of meeting the specification compared with a generic BagIt bag. In reconsidering this, we realized that the CSV file was required because it points to where messages are within the payload directory and also acts as a lookup between the Message-ID and filename-safe Mailbag-Message-ID fields. We had originally required message header information in the mailbag.csv as well, but we’ve decided that this should be optional. Feedback from the working meeting also suggested including a column for attachments, so we added an integer field for the number of attachments.
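To make the lookup role concrete, here is a small Python sketch that reads a mailbag.csv-style table and maps each Message-ID to its Mailbag-Message-ID. The two ID column names come from the discussion above; the `Attachments` column name and all sample values are assumptions made up for illustration, not taken from the specification.

```python
import csv
import io


def build_id_lookup(csv_file):
    """Map each Message-ID to its filename-safe Mailbag-Message-ID."""
    reader = csv.DictReader(csv_file)
    return {row["Message-ID"]: row["Mailbag-Message-ID"] for row in reader}


# Made-up sample rows; "Attachments" is an assumed name for the
# integer attachment-count field mentioned above.
sample = io.StringIO(
    "Mailbag-Message-ID,Message-ID,Attachments\r\n"
    "1,<abc@example.org>,0\r\n"
    "2,<def@example.org>,2\r\n"
)

lookup = build_id_lookup(sample)
```

With a real bag, the same function could be given an open file handle for mailbag.csv (opened with `newline=""`, as the `csv` module recommends) instead of the in-memory sample.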

@jamiepb

jamiepb commented Jul 7, 2021

Is there a way in the mailbag.csv file or elsewhere to indicate a one-to-many relationship among derivatives, for example if there is a single or a few pst files that are converted into eml?

@gwiedeman
Contributor Author

Thank you for your comment. Currently no, and the challenge of documenting this type of relationship is one of the main reasons the Advisory Board was hesitant about including multiple email accounts per #8. That said, multiple PSTs would not necessarily mean multiple accounts, so we definitely need to discuss this more. I could see multiple exports from the same account over time being a common use case.

@jamiepb

jamiepb commented Jul 12, 2021

Currently, Office 365's email export tool cuts PST files off at around 10 GB. While that's liable to change over time, in my experience our recent email account exports have been 1-3 PST files, and they will continue to grow. Allowing for email accounts that comprise multiple PSTs will make the specification more widely applicable and scalable, whether that is handled in mailbag.csv, in the subfolder structure, or left to the user to document elsewhere.

@gwiedeman
Contributor Author

As of version 0.3, the specification supports multiple email accounts, including multiple PST files. mailbag.csv still has one line per message, but users should be able to connect all derivatives to their source files using the Original-File and Derivatives-Path fields.
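One way a user might recover the one-to-many relationship asked about earlier is to group rows by Original-File. The Python sketch below assumes both fields can be treated as plain strings; the file names and paths in the sample are invented for illustration and do not reflect the exact values the specification prescribes.

```python
import csv
import io
from collections import defaultdict


def derivatives_by_source(csv_file):
    """Group Derivatives-Path values under the Original-File they came from."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Original-File"]].append(row["Derivatives-Path"])
    return dict(groups)


# Invented sample: two PST exports, three messages in total.
sample = io.StringIO(
    "Mailbag-Message-ID,Original-File,Derivatives-Path\r\n"
    "1,export-a.pst,Inbox\r\n"
    "2,export-a.pst,Sent\r\n"
    "3,export-b.pst,Inbox\r\n"
)

groups = derivatives_by_source(sample)
```

Each key in `groups` is a source file, and its list holds one entry per message derived from that file.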

Thanks for your feedback! I'll close this, but feel free to reopen if the changes don't address your use case.

@gwiedeman gwiedeman transferred this issue from UAlbanyArchives/mailbagit Jun 15, 2022