Use of mailbag.csv #5

gwiedeman · 2021-06-29T16:48:10Z

The choice of a CSV tag file to serialize message-level information was also questioned during the working meeting. CSVs are error-prone because they can be written with a variety of delimiters and dialects. Large numbers of rows can also cause problems, since many tools have row limits, often around 1 million rows. We had some useful discussions about using JSON or another serialization that does not have these issues, but concluded that CSVs were more useful for the Nicholas Garza, Teresa Burns, and Gary Richardson personas, since they are likely to be more comfortable opening and reading a CSV file in spreadsheet software than a JSON file. One suggestion from the working meeting was to break the CSV into multiple files after a certain number of rows, much like WARC files, so we decided to split the file after 100,000 rows.
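As a rough illustration of the splitting idea, here is a minimal Python sketch that starts a new CSV file every 100,000 rows. The function name `write_split_csv` and the `mailbag-1.csv`, `mailbag-2.csv` naming pattern are assumptions made for this example, not something the specification defines.

```python
import csv
from itertools import islice
from pathlib import Path


def write_split_csv(rows, header, out_dir, max_rows=100_000, stem="mailbag"):
    """Write rows to numbered CSV files, starting a new file every max_rows rows.

    The "<stem>-<n>.csv" naming is hypothetical and only for illustration.
    Returns the list of paths written, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = iter(rows)
    paths = []
    part = 1
    while True:
        # Pull at most max_rows rows for the next file.
        chunk = list(islice(rows, max_rows))
        if not chunk:
            break
        path = out_dir / f"{stem}-{part}.csv"
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)  # default "excel" dialect: comma-delimited
            writer.writerow(header)  # repeat the header in every part
            writer.writerows(chunk)
        paths.append(path)
        part += 1
    return paths
```

Repeating the header row in each part keeps every file readable on its own in spreadsheet software, at the cost of one extra line per file.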

We also discussed how the specification’s requirement of a separate mailbag.csv tag file is one of the few major costs of meeting the specification compared with a generic BagIt bag. In reconsidering this, we realized that the CSV file is required for two reasons: it points to where messages are located within the payload directory, and it acts as a lookup between the Message-ID and the filename-safe Mailbag-Message-ID fields. We had originally required message header information in mailbag.csv as well, but we’ve decided that this should be optional. Feedback from the working meeting also suggested including a column for attachments, so we added an integer field for the number of attachments.
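To make the lookup role concrete, here is a small Python sketch that reads a mailbag.csv-style file and indexes rows by Message-ID. The Message-ID and Mailbag-Message-ID column names come from the discussion above; the `Attachments` and `Path` column names, and all of the sample values, are assumptions made up for this example.

```python
import csv
import io

# Made-up sample data; the "Path" and "Attachments" headers are hypothetical.
SAMPLE = """Mailbag-Message-ID,Message-ID,Path,Attachments
1,<abc@example.org>,Inbox,0
2,<def@example.org>,Inbox/Projects,2
"""


def load_lookup(csv_file):
    """Return a dict mapping each Message-ID to its full CSV row."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        # The attachment count is an integer field, so convert it from text.
        row["Attachments"] = int(row["Attachments"])
        lookup[row["Message-ID"]] = row
    return lookup


lookup = load_lookup(io.StringIO(SAMPLE))
row = lookup["<def@example.org>"]
# The filename-safe ID and the location in the payload come from the same row.
print(row["Mailbag-Message-ID"], row["Path"], row["Attachments"])
```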

jamiepb · 2021-07-07T20:05:33Z

Is there a way, in the mailbag.csv file or elsewhere, to indicate a one-to-many relationship among derivatives, for example when one or a few PST files are converted into EML?

gwiedeman · 2021-07-09T21:01:33Z

Thank you for your comment. Currently no, and the challenge of documenting this type of relationship is one of the main reasons the Advisory Board was hesitant about including multiple email accounts (see #8). That said, multiple PSTs would not necessarily mean multiple accounts, so we definitely need to discuss this more. I could see multiple exports from the same account over time being a common use case.

jamiepb · 2021-07-12T11:32:53Z

Currently, Office 365's email export tool cuts PST files off at around 10 GB. While that limit is liable to change over time, in my experience our recent email account exports have been 1-3 PST files each and will continue to grow. Allowing for email accounts that comprise multiple PSTs will make the specification more widely applicable and scalable, whether that is handled in mailbag.csv, in the subfolder structure, or left to the user to document elsewhere.

gwiedeman · 2022-03-29T16:30:42Z

As of version 0.3, the specification supports multiple email accounts, including multiple PST files. mailbag.csv still has one line per message, but users should be able to make the connections between all derivatives and source files using the Original-File and Derivatives-Path fields.
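For example, one way to recover the one-to-many relationship is to group the rows of mailbag.csv by Original-File. The sketch below assumes that Original-File holds the source file name and Derivatives-Path holds a relative path for each message; the sample file names and values are invented for illustration and do not come from the specification.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows: two PST exports, three messages.
SAMPLE = """Mailbag-Message-ID,Original-File,Derivatives-Path
1,export-1.pst,Inbox
2,export-1.pst,Sent
3,export-2.pst,Inbox
"""


def group_by_source(csv_file):
    """Map each Original-File to the (Mailbag-Message-ID, Derivatives-Path) pairs derived from it."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["Original-File"]].append(
            (row["Mailbag-Message-ID"], row["Derivatives-Path"])
        )
    return dict(groups)


groups = group_by_source(io.StringIO(SAMPLE))
for source, messages in groups.items():
    print(source, "->", len(messages), "message(s)")
```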

Thanks for your feedback! I'll close this, but feel free to reopen if the changes don't address your use case.

gwiedeman added the Specification label Aug 16, 2021

gwiedeman closed this as completed Mar 29, 2022

gwiedeman transferred this issue from UAlbanyArchives/mailbagit Jun 15, 2022
