Line endings and encoding for tag files #2

Closed
gwiedeman opened this issue Jun 29, 2021 · 6 comments

@gwiedeman
Contributor

gwiedeman commented Jun 29, 2021

Bagit supports multiple encodings in tag files, and just specifies the encoding in the bagit.txt Tag-File-Character-Encoding field. Bagit-python only supports UTF-8, so it seems relatively appropriate to mandate UTF-8 in a mailbag.

Bagit is unhelpfully agnostic about line endings. It supports both LF and CRLF and does not contain a standard place to document which is used. It seems very problematic to try to detect line endings. Bagit-python only supports LF line endings, so any bag created with Bagit-python will have LF line endings. We discussed using a custom field in bag-info.txt for this, but decided that since Bagit-python mandates LF, we can require it in the specification.

The main problem is that CSV files most commonly use CRLF line endings as required by RFC4180.
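For context, Python's `csv` module already defaults to CRLF as its line terminator. Here is a minimal sketch of serializing rows with an explicit terminator; the helper name `write_csv_bytes` and the sample rows are made up for illustration and are not part of mailbag or bagit-python.

```python
import csv
import io


def write_csv_bytes(rows, lineterminator="\r\n"):
    """Serialize rows to UTF-8 CSV bytes using an explicit line terminator."""
    # newline="" keeps the text layer from translating line endings,
    # so the terminator chosen for csv.writer is emitted unchanged.
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, lineterminator=lineterminator)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


# Hypothetical rows, only to show the resulting bytes.
print(write_csv_bytes([["id", "subject"], ["1", "hello"]]))
```

When writing to a real file, the same idea applies: open it with `newline=""` so the platform does not rewrite the line endings.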

Thus, currently mailbag mandates UTF-8 for all tag files, but requires CRLF line endings for mailbag.csv, and LF line endings for all other tag files. ¯\_(ツ)_/¯
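A rough sketch of what checking that rule could look like, assuming the tag file has been read as raw bytes. The function name and messages are hypothetical. One known limitation: RFC 4180 allows line breaks inside quoted CSV fields, and this byte-level check does not parse quoting, so an embedded LF in a field would be flagged.

```python
import re

NOT_UTF8 = "not valid UTF-8"
WANT_CRLF = "expected CRLF line endings only"
WANT_LF = "expected LF line endings only"


def tag_file_problems(name, data):
    """Return a list of problems for one tag file given its raw bytes."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append(NOT_UTF8)

    has_crlf = b"\r\n" in data
    # An LF not preceded by CR, and a CR not followed by LF.
    bare_lf = re.search(rb"(?<!\r)\n", data) is not None
    bare_cr = re.search(rb"\r(?!\n)", data) is not None

    if name == "mailbag.csv":
        if bare_lf or bare_cr:
            problems.append(WANT_CRLF)
    elif has_crlf or bare_cr:
        problems.append(WANT_LF)
    return problems


print(tag_file_problems("bag-info.txt", b"Source-Organization: Example\r\n"))
```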

@nkrabben

Would it be useful to require LF only for those tag files defined in the RFC? This would reduce requirements on any additional tag files that might be added beyond the mailbag spec.

Since you mention the UTF-8 requirement, will there be a specification about whether or not the UTF-8 byte-order mark (BOM) will be required? The Bagit spec is agnostic outside of bag-info.txt. https://datatracker.ietf.org/doc/html/rfc8493#section-2.3

@gwiedeman
Contributor Author

Thanks for your comment and sorry for taking so long to address this. I think only requiring LF for the defined tag files is a good idea, and we'll make that change before a release.

Having experienced fun encoding issues, I also love the idea of requiring more encoding information/portability generally, but I think we have to follow bagit and bagit-python. Looking into it briefly, it seems like bagit-python writes tag files with encoding='utf-8' and does not include the byte-order mark? My non-expert instinct is to be agnostic like bagit, as there doesn't appear to be a consensus on whether it should be included for UTF-8. Definitely open to more expert opinions though.
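For reference, a small sketch of how Python's standard codecs behave when writing. This only demonstrates standard-library behavior with a throwaway file (the helper name `written_bytes` is made up); it is not a reading of bagit-python's internals.

```python
import codecs
import os
import tempfile


def written_bytes(text, encoding):
    """Write text to a temporary file with the given encoding and return the raw bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tag.txt")
        # newline="\n" prevents platform-specific line ending translation.
        with open(path, "w", encoding=encoding, newline="\n") as handle:
            handle.write(text)
        with open(path, "rb") as handle:
            return handle.read()


plain = written_bytes("BagIt-Version: 1.0\n", "utf-8")
signed = written_bytes("BagIt-Version: 1.0\n", "utf-8-sig")
print(plain.startswith(codecs.BOM_UTF8))   # the "utf-8" codec writes no BOM
print(signed.startswith(codecs.BOM_UTF8))  # the "utf-8-sig" codec prepends one
```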

@gwiedeman
Contributor Author

@nkrabben

My preference is to discourage the byte-order mark, but that comes from a naive coding point-of-view where I generally have to invoke extra arguments to handle the BOM. If there are good reasons to allow the BOM (and compatibility with tools that include a BOM by default seems like a good one), agnosticism seems good. I'd love to hear an argument for requiring the BOM, mostly so I can better understand why it's useful.
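To illustrate the "extra arguments" point, here is a minimal sketch in Python, assuming the tag file is available as raw bytes. The helper name `decode_tag_bytes` is hypothetical.

```python
import codecs


def decode_tag_bytes(data):
    """Decode UTF-8 tag file bytes, returning (text, had_bom)."""
    had_bom = data.startswith(codecs.BOM_UTF8)
    # "utf-8-sig" drops a leading BOM if present and otherwise acts like "utf-8".
    text = data.decode("utf-8-sig")
    return text, had_bom


with_bom = codecs.BOM_UTF8 + b"BagIt-Version: 1.0\n"

# Plain "utf-8" keeps the BOM as a U+FEFF character at the start of the text,
# which can break naive parsing of the first field name.
print(repr(with_bom.decode("utf-8")[:6]))
print(decode_tag_bytes(with_bom))
```

A reader written this way accepts files with or without a BOM, which fits the agnostic option; a stricter validator could instead report `had_bom` as a warning.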

gwiedeman transferred this issue from UAlbanyArchives/mailbagit Jun 15, 2022
@gwiedeman
Contributor Author

This has been changed in the draft 1.0 release to require LF for tag files defined by bagit, and CRLF for mailbag.csv. We considered recommending LF for other tag files, but that wouldn't make sense for other CSV files, for example, so it's left agnostic.

Looking into the BOM more, I agree that it would be better if there were no BOMs in UTF-8 tag files. Since we're requiring UTF-8, it should be fair to say tag files SHOULD NOT include a BOM.

@gwiedeman
Contributor Author

Also, per the Google doc comment, I confirmed that LF files read fine now in Windows, so I don't think requiring LF is an issue.

@gwiedeman gwiedeman mentioned this issue Jun 16, 2022