Line endings and encoding for tag files #2

gwiedeman · 2021-06-29T17:38:02Z

Bagit supports multiple encodings in tag files, and just specifies the encoding in the bagit.txt Tag-File-Character-Encoding field. Bagit-python only supports UTF-8, so it seems relatively appropriate to mandate UTF-8 in a mailbag.

Bagit is unhelpfully agnostic about line endings. It supports both LF and CRLF and does not contain a standard place to document which is used. It seems very problematic to try to detect line endings. Bagit-python only supports LF line endings, so any bag created with Bagit-python will have LF line endings. We discussed using a custom field in bag-info.txt for this, but decided that since Bagit-python mandates LF, we can require it in the specification.

The main problem is that CSV files most commonly use CRLF line endings, as specified by RFC 4180.

Thus, currently mailbag mandates UTF-8 for all tag files, but requires CRLF line endings for mailbag.csv, and LF line endings for all other tag files. ¯\(ツ)/¯
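As a rough illustration of what this rule means for a writer, here is a minimal Python sketch. The helper names `write_tag_file` and `write_mailbag_csv` and the column names are hypothetical and not taken from mailbagit or bagit-python; the point is only that the `newline` argument to `open()` has to be set explicitly, otherwise Python translates `\n` to the platform default.

```python
import csv
import os
import tempfile


def write_tag_file(path, lines):
    """Write a BagIt-style tag file as UTF-8 with LF line endings (hypothetical helper)."""
    # newline="\n" stops Python from translating "\n" to os.linesep on Windows.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")


def write_mailbag_csv(path, rows):
    """Write a CSV as UTF-8 with CRLF line endings (hypothetical helper)."""
    # newline="" hands line-ending control to the csv module;
    # lineterminator is set explicitly even though "\r\n" is the default.
    with open(path, "w", encoding="utf-8", newline="") as f:
        csv.writer(f, lineterminator="\r\n").writerows(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tag_path = os.path.join(tmp, "bag-info.txt")
        csv_path = os.path.join(tmp, "mailbag.csv")
        write_tag_file(tag_path, ["Source-Organization: Example Archive"])
        # Column names below are placeholders, not the real mailbag.csv header.
        write_mailbag_csv(csv_path, [["id", "subject"], ["1", "Hello"]])
        with open(tag_path, "rb") as f:
            print(f.read())
        with open(csv_path, "rb") as f:
            print(f.read())
```

Reading the files back in binary mode is the simplest way to confirm which line endings actually landed on disk.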

nkrabben · 2021-07-21T20:27:12Z

Would it be useful to require LF only for those tag files defined in the RFC? This would reduce requirements on any additional tag files that might be added beyond the mailbag spec.

Since you mention the UTF-8 requirement, will there be a specification about whether or not the UTF-8 byte-order mark (BOM) will be required? The Bagit spec is agnostic outside of bag-info.txt. https://datatracker.ietf.org/doc/html/rfc8493#section-2.3

gwiedeman · 2022-03-29T16:24:43Z

Thanks for your comment and sorry for taking so long to address this. I think only requiring LF for the defined tag files is a good idea, and we'll make that change before a release.

Having experienced fun encoding issues, I also love the idea of requiring more encoding information/portability generally, but I think we have to follow bagit and bagit-python. Looking into it briefly, it seems like bagit-python writes tag files with encoding='utf-8' and does not include the byte-order mark? My non-expert instinct is to be agnostic like bagit, as there doesn't appear to be a consensus on whether it should be included for UTF-8. Definitely open to more expert opinions though.

gwiedeman · 2022-04-15T15:41:57Z

Cross-linking comment https://docs.google.com/document/d/1BZHklc6MKktXJBPcvFvlxLRoX8lCidFemflppqpUQ7s/edit?disco=AAAAW38m-kI

nkrabben · 2022-04-15T15:50:03Z

My preference is to discourage the byte-order mark, but that comes from a naive coding point-of-view where I generally have to invoke extra arguments to handle the BOM. If there are good reasons to allow the BOM (and compatibility with tools that include a BOM by default seems like a good one), agnosticism seems good. I'd love to hear an argument for requiring the BOM, mostly so I can better understand why it's useful.
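For anyone following along, the "extra arguments" in Python usually mean the `utf-8-sig` codec. This is a small sketch using only the standard library; the sample tag line is made up for illustration.

```python
import codecs

text = "Bag-Size: 1 MB\n"  # made-up tag line for illustration

plain = text.encode("utf-8")         # the plain codec never writes a BOM
with_bom = text.encode("utf-8-sig")  # this codec prepends EF BB BF

print(plain.startswith(codecs.BOM_UTF8))     # False
print(with_bom.startswith(codecs.BOM_UTF8))  # True

# Decoding BOM-prefixed bytes as plain UTF-8 leaves U+FEFF at the start,
# which then ends up glued to the first tag name.
leaky = with_bom.decode("utf-8")
print(repr(leaky[0]))  # '\ufeff'

# utf-8-sig strips a BOM when present and is harmless when it is absent,
# so a tolerant reader can use it for both cases.
print(with_bom.decode("utf-8-sig") == text)  # True
print(plain.decode("utf-8-sig") == text)     # True
```

So a reader can be tolerant with `utf-8-sig` even if writers are asked not to emit a BOM.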

gwiedeman · 2022-06-16T20:21:20Z

This has been changed in the draft 1.0 release to require LF for tag files defined by bagit, and CRLF for mailbag.csv. Considered recommending LF for other tag files, but that wouldn't make sense for other CSV files, for example, so it's left agnostic.

Looking into the BOM more, I agree that it would be better if there were no BOMs in UTF-8 tag files. Since we're requiring UTF-8, it should be fair to say that tag files SHOULD NOT include a BOM.
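A checker for these rules could look something like the sketch below. It is only an illustration, not what mailbagit does: `check_tag_bytes` is a hypothetical name, and since the BOM rule is a SHOULD NOT, a real validator would probably report it as a warning rather than an error.

```python
import codecs
import re


def check_tag_bytes(data, line_ending):
    """Return a list of problems found in raw tag file bytes (hypothetical helper).

    line_ending is "LF" for bagit-defined tag files, "CRLF" for mailbag.csv,
    or None for other tag files, where line endings are left agnostic.
    """
    problems = []
    if data.startswith(codecs.BOM_UTF8):
        problems.append("starts with a UTF-8 BOM")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    if line_ending == "LF":
        if b"\r" in data:
            problems.append("contains CR; expected LF only")
    elif line_ending == "CRLF":
        # Flags any LF not preceded by CR, or CR not followed by LF.
        # Limitation: a bare LF inside a quoted CSV field is flagged too.
        if re.search(rb"(?<!\r)\n|\r(?!\n)", data):
            problems.append("contains bare LF or CR; expected CRLF")
    return problems


if __name__ == "__main__":
    print(check_tag_bytes(b"Bag-Size: 1 MB\n", "LF"))
    print(check_tag_bytes(b"\xef\xbb\xbfBag-Size: 1 MB\r\n", "LF"))
    print(check_tag_bytes(b"id,subject\r\n1,Hello\r\n", "CRLF"))
```

Working on raw bytes avoids Python's universal-newline translation, which would otherwise hide the original line endings.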

gwiedeman · 2022-06-16T20:22:58Z

Also, per the Google doc comment, confirmed that LF files read fine now on Windows, so I don't think it's an issue to require LF.

gwiedeman added the Specification label Aug 16, 2021

gwiedeman transferred this issue from UAlbanyArchives/mailbagit Jun 15, 2022

gwiedeman closed this as completed Jun 16, 2022

gwiedeman mentioned this issue Jun 16, 2022

1.0 release #9

Merged
