ASV in Bonsai #19
Blocks: #74 `ls(1p)` analogue
bonsai/harakit
Reference: bonsai/harakit#19
Originally posted by @trinity in /bonsai/coreutils/pulls/18#issuecomment-2722
Currently the idea is that the coreutils will speak ASV natively. Should that be represented differently in the output of commands if stdout is a terminal?
I don't think the representation should be changed just because of where stdout is going. This unfortunately commonplace behavior already causes a frustrating level of unpredictability with certain tools. ripgrep, for as much as I sing its praises, is one of the worst examples that comes to mind, significantly changing output formatting depending on whether stdout is a TTY or not.
I believe it's helpful in this case to think about what the user would expect to happen in a reasonably-written program, and what existing utilities are already doing as a result. Most people aren't accustomed to working with ASV delimiters, but there is a usually-unprintable delimiter that comes up pretty frequently: null! While a tool such as `npc` (or the much-maligned `cat -v`) may render a null byte as something like `^@`, under normal circumstances, null bytes are simply dumped into the terminal as-is. A lot of programs already support using it as a delimiter in normal output specifically because of the limitations of newline or comma delimitation, and I've yet to see any calls to special-case their output formatting for TTYs.
Another concern that has been brought up a number of times is that some fonts do not print the ASCII field separator character; however, I see this as an issue with the fonts and not an issue for our utilities to solve. Bending over backward to solve issues caused by other software on users’ systems isn’t, in my view, what we should be doing.
I had a number of qualms about ASCII separated values that I've since answered for myself.
Having to change swathes of a system to accommodate Bonsai will make potential users hesitant. My Linux framebuffer doesn't display the field separator, and I'm pretty sure neither does rxvt or unscii or the combination of the two. I can cope because npc(1) is in the tree and ASCII `FS`, `GS`, `RS`, and `US` display `^\`, `^]`, `^^`, and `^_` respectively when piped through it.
My concern was that it may be better for the sake of beginners to have visually comprehensible output if outputting to a tty, and a tab is a visual, intuitive separator. However, I think even if an ASCII separator isn't displayed in any form, that nearly-correct output would lead an informed beginner to read the utility's man page and learn about ASCII separators and why we use them.
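For readers unfamiliar with the `^\`, `^]`, `^^`, `^_` spellings above, here is a minimal Python sketch of caret notation. It is not npc's source; `caret_notation` is a hypothetical name, and leaving newline and tab untouched is borrowed from what `cat -v` does rather than from anything confirmed about npc. Bytes above `0x7F` are passed through as Latin-1 purely to keep the sketch short.

```python
def caret_notation(data: bytes) -> str:
    """Render control bytes as ^X, the way the thread describes npc(1) output.

    Assumption: newline and tab pass through unchanged (as with cat -v).
    """
    out = []
    for byte in data:
        if byte in (0x09, 0x0A):
            out.append(chr(byte))              # keep tab and newline readable
        elif byte < 0x20:
            out.append("^" + chr(byte ^ 0x40)) # 0x1C..0x1F become ^\ ^] ^^ ^_
        elif byte == 0x7F:
            out.append("^?")                   # DEL
        else:
            out.append(chr(byte))              # printable (or high) byte as-is
    return "".join(out)


print(caret_notation(b"name\x1fsize\x1e"))
```

The mapping is just the control byte with bit `0x40` flipped, which is why NUL comes out as `^@` and the four information separators come out as the four punctuation characters that follow `Z` and `[` in ASCII.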
I agree now that this would cause more confusion than it's worth. I initially proposed this as a compromise with Emma; fae wanted ASV by default and I wanted TSV by default with an ASV option.
Can you name a certain tool? All the tools I know have `-0` as a special case to support this; I don't think I know any that use nul as a delimiter by default.
As an aside, nul as a delimiter is interesting because the only time nul is really used is in binary data; it's disallowed in filenames and practically never used in text data (as it's the string terminator in the C standard library). ASCII separators are also practically never used in text data, but there's no reason they couldn't be: they can be used in C standard library strings and filenames. Will this be flatly disallowed, or how will they be quoted?
I'd like to clarify that I agree with the use of ASV, as an unconditional default, for program output in Bonsai. Emma and I discussed this tonight and I came to that conclusion based on what I've mentioned in this comment.
i’m going to close this but feel free to continue discussing
Damn, I replied to this last night via email but it looks like that never got posted. Pasting this from my sent folder and hoping the formatting isn't broken:
It seems to be hardcoded in most terminals that control characters (`0x00` to `0x1F`) should not even be allocated a cell, denying fonts, and users, the ability to render glyphs in their place. This behavior is consistent across at least foot, alacritty, Konsole, u?rxvt, and the Linux VT. I believe this to be a deficiency in the design of terminals, one that has only been allowed to persist for so long due to the neglect shown towards the information separators. I'm not naïve enough to even gesture at this project changing that, but I am spiteful enough to continue pushing for ASV in the face of it, and I have been considering looking into what'd be required to patch foot to allow the printing of certain control codes. I do acknowledge this as a problem, but not enough of one to change my stance.
I'd like to see a program that takes ASV output and formats it in a nice little TSV table. Perhaps this would be better suited as a feature of betta, I don't know.
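A rough Python sketch of what such an ASV-to-TSV filter could look like. Nothing here is an existing Bonsai tool. `asv_to_tsv` is a hypothetical name, and two choices are assumptions of the sketch: a trailing record separator is treated as a terminator, and tabs, newlines and backslashes inside a field are backslash-escaped so they cannot break the table.

```python
US = "\x1f"  # unit separator: between fields
RS = "\x1e"  # record separator: between (or after) records


def asv_to_tsv(text: str) -> str:
    """Convert US/RS-delimited text to tab-separated lines."""
    if text.endswith(RS):
        text = text[:-1]          # assumption: a final RS terminates, not separates
    if not text:
        return ""
    lines = []
    for record in text.split(RS):
        fields = [
            field.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")
            for field in record.split(US)
        ]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


print(asv_to_tsv("name\x1fsize\x1ea.txt\x1f12\x1e"), end="")
```

Used as the last stage of a pipeline, something like this would give the human-readable view without changing what the utilities themselves emit.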
Minor miscommunication: what I meant was that there are tools that do support switches like `-0` and will happily dump nulls into a TTY without any special reformatting.
Yeah, this is an issue. Even some kernel people have expressed a desire to disallow control characters in filenames as there's genuinely never a legitimate reason for them to be there in the first place, but this is unfortunately the userspace we're stuck with now. Insert vague grumbling about Torvalds. While I don't agree with everything being said in it, https://dwheeler.com/essays/fixing-unix-linux-filenames.html goes into just how horribly borked most software already is when it comes to handling filenames with control characters and escape sequences. This is an already-existing problem, and there's no great solution to it apart from letting users figure out that doing stupid things to their filenames is, in fact, stupid.
@silt the worst part of Gitea and the reason I want to make Mintee is that Gitea doesn’t do e-mail thread replies.
My kitty install prints the record separator character.
foot's codebase is delightfully readable and this ended up being quite easy to do. I've instructed it to, upon encountering an information separator, print the appropriate glyph from the Control Pictures block. Haven't noticed any serious issues and I don't particularly expect to. The fact that they are entirely indistinguishable from the actual Unicode glyphs used to indicate their presence is... not great in my opinion. However, at least for my usage, I think this is better than simply ignoring them. One thing to note is that copy-pasting this from the terminal window will not copy the control codes, but rather the Unicode glyphs. I don't love this behavior, but it is again Good Enough For Me.
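To illustrate the mapping described here (this is a Python sketch, not the foot patch, and `show_separators` is a hypothetical name): each C0 control code has a counterpart in the Unicode Control Pictures block at `U+2400` plus its code, so the four information separators land on `U+241C` through `U+241F`.

```python
INFORMATION_SEPARATORS = range(0x1C, 0x20)  # FS, GS, RS, US


def show_separators(text: str) -> str:
    """Replace information separators with their Control Pictures glyphs."""
    return "".join(
        chr(0x2400 + ord(ch)) if ord(ch) in INFORMATION_SEPARATORS else ch
        for ch in text
    )


print(show_separators("name\x1fsize\x1e"))
```

The output of this function has the same drawback noted above: once substituted, the glyph is indistinguishable from a literal `U+241F` that was already in the data.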
Here's the patch for completeness's sake:
I think we ought to reopen this issue as an ongoing discussion regarding ASV handling in Bonsai.
agreed
I was planning on creating a new issue for this, but I guess I'll put it here if this is turning into the ASV discussion thread.
While we have agreed to "use ASV", I have yet to see any discussion of what that practically means. How will ASV be used? USAS X3.4-1968 was already very loose in its description of proper IS usage (see page 10), only asking that their hierarchical relationship be preserved. A look at a more modern version of the ASCII spec shows only further loosening in this regard, with INCITS 4-1986[R2007] dropping the hierarchical requirement. While the proper usage is defined in several places within the document, "4.1.5 Information Separators" most completely explains the current situation. While I was unable to find a more recent revision of the standard, I think it's safe to assume that things haven't gotten any stricter since then. All this is to say that there is no standard for us to follow here, and the precise ways in which Bonsai utilities should/do/will interact with ASV (and IS characters in general) needs to be formally specified.
@trinity has some good input on how to do this
we can write a man(7) page on our asv usage
My rough interpretation of ASV dating back some years (this is why I wrote ascii.h) was the following:
`ASCII_US` is the unit separator. This is for cells on a spreadsheet.
`ASCII_RS` is the record separator. This is for rows on a spreadsheet.
`ASCII_GS` is the group separator. This is for sheets in a spreadsheet document.
`ASCII_FS` is the file separator. This is for terminating files, and means proper ASV files can be cat(1p)ed together without loss of content.
Coded Character Sets, History and Development may have historical information but I'm looking into some other stuff right now.
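The spreadsheet interpretation above can be sketched in Python as follows. This is only one reading of it, not a Bonsai specification: `encode_file` and `decode_stream` are hypothetical names, and the sketch assumes US, RS and GS sit between items while FS terminates each file, with no escaping of separators inside cells.

```python
US, RS, GS, FS = "\x1f", "\x1e", "\x1d", "\x1c"


def encode_file(sheets):
    """Encode one document: a list of sheets, each a list of rows of cells."""
    body = GS.join(
        RS.join(US.join(row) for row in sheet)
        for sheet in sheets
    )
    return body + FS  # FS terminates the file so files can be concatenated


def decode_stream(data):
    """Decode a concatenation of FS-terminated files back into nested lists."""
    files = data.split(FS)
    if files and files[-1] == "":
        files.pop()  # nothing follows the final terminator
    return [
        [[row.split(US) for row in sheet.split(RS)] for sheet in f.split(GS)]
        for f in files
    ]


first = encode_file([[["a", "b"], ["c", "d"]]])
second = encode_file([[["e"]]])
print(decode_stream(first + second))
```

Because every file ends in FS, `first + second` decodes to two separate documents, which is the concatenation property the FS description is after.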
Title changed from "ASV terminal output/representation" to "ASV in Bonsai".
I've been reading the terrific Coded Character Sets, History and Development, which was written by someone who was involved with the creation of ASCII itself. It seems the earliest use of encoded field separators was the Hollerith Card Code (which, I think, predates the identically-named Hollerith Card Code used for representing ASCII on punched cards; I skimmed the article only long enough to realize it wasn't what I wanted). It is named for Herman Hollerith, who designed the code for his tabulation machine, itself designed for the 1890 census, and who wrote the paper An Electric Tabulating System in 1889 about this work. The company Hollerith founded eventually became IBM. The information I can find on the Hollerith encoding is scarce and mainly comes from the aforementioned book, Coded Character Sets.
I haven't looked thoroughly enough into the text encodings that followed (or the ones that preceded it, though Hollerith's may be the first).
Relevant: https://github.com/SixArm/usv
They seem to believe they have monopolized Unicode Separated Values:
Alrighty... it would be more egregious if they weren't giving the trademarks to the public domain eventually^(TM).
The thing that really grinds my gears is that they aren't using the Unicode control characters but the runes representing their graphical representation. Yuck! They did find some issues with using the control characters which compelled them to switch to the displayed runes but they're working on a data format while we're working on an ecosystem so we can actually solve these issues (for ourselves). And most of these caveats we've already run into.
They have a history of ASV document which pilfers Wikipedia and this blog post: https://www.lammertbies.nl/comm/info/ascii-characters which is probably interesting but I haven't dug into it yet.
I find their use of the graphical representation for ESC to be interesting... Maybe we should use ASCII ESC to escape literal field separators?
More discussion involving the proposed USV: https://news.ycombinator.com/item?id=39679378.
I haven't had time to read this but I just found an annotated history of some character codes or ASCII: American Standard Code for Information Infiltration linked from an interesting Hacker News comment that has more references.
FWIW, that's what I did here: https://git.tebibyte.media/sashakoshka/go-asv
I think it's a good idea as it allows arbitrary bytes to be encoded.
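A minimal Python sketch of the ESC idea, to show why it allows arbitrary bytes in a field. This is not go-asv's actual wire format, which I haven't checked; `escape_field` and `split_escaped` are hypothetical names, and the rule assumed here is simply "ESC means the next byte is literal".

```python
ESC = b"\x1b"
SPECIAL = b"\x1b\x1c\x1d\x1e\x1f"  # ESC itself plus FS, GS, RS, US


def escape_field(raw: bytes) -> bytes:
    """Prefix ESC and every information separator in a field with ESC."""
    out = bytearray()
    for byte in raw:
        if byte in SPECIAL:
            out += ESC
        out.append(byte)
    return bytes(out)


def split_escaped(data: bytes, sep: int) -> list:
    """Split on unescaped `sep` bytes, undoing the ESC escaping as we go."""
    fields, current, i = [], bytearray(), 0
    while i < len(data):
        byte = data[i]
        if byte == 0x1B and i + 1 < len(data):
            current.append(data[i + 1])  # escaped byte is taken literally
            i += 2
            continue
        if byte == sep:
            fields.append(bytes(current))
            current = bytearray()
        else:
            current.append(byte)
        i += 1
    fields.append(bytes(current))
    return fields


fields = [b"a\x1fb", b"\x1bc", b"plain"]
line = b"\x1f".join(escape_field(f) for f in fields)
print(split_escaped(line, 0x1F) == fields)
```

The cost is that consumers can no longer split on a raw separator byte; every reader has to understand the escape rule, which is the kind of thing a man(7) page on ASV usage would need to pin down.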