Number parsing routine #94

Open
opened 2024-04-22 13:40:44 +00:00 by trinity · 3 comments
Owner

Rust has good number parsing with the parse method; meanwhile C's stuck in 1989. atoi(3) sucks: it returns 0 if the value is 0 or if there's an error. strtol(3) is the grown-up version of the function but is a real pain to use (set errno to 0, check it after the call, and also check the end pointer). I really want a good number parser to use in C programs.

Things to consider:

  • Where will the library go in the tree?
  • Should there be a libbonsai or should we have specific libraries for specific sets of functionality?
  • Which language should we use for this? How will FFI stuff work?
  • How will Unicode work? Should we accept all types of numerals or just ASCII 0-9?
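
For reference, the strtol(3) dance described above looks roughly like this. It's a minimal sketch, not code from the tree: `parse_long` is a hypothetical name, and it assumes we want base 10 only and want to reject trailing garbage. Note that strtol(3) still accepts leading whitespace and a sign, so this is looser than an "ASCII 0-9 only" rule.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (not in the tree): parse all of s as a base-10 long.
 * Returns 0 and stores the value in *out on success, -1 on any error. */
static int parse_long(const char *s, long *out) {
    char *end;
    long v;

    errno = 0;                        /* strtol sets errno but never clears it */
    v = strtol(s, &end, 10);
    if (end == s) return -1;          /* no digits were consumed */
    if (*end != '\0') return -1;      /* trailing garbage after the number */
    if (errno == ERANGE) return -1;   /* value does not fit in a long */

    *out = v;
    return 0;
}
```

Three separate checks for one conversion is the part that's easy to get subtly wrong each time it's rewritten.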
trinity added the enhancement and help wanted labels 2024-04-22 13:40:44 +00:00
Owner

> Should we accept all types of numerals or just ASCII 0-9?

advocating for only accepting ascii 0-9. homoglyphs are annoying to deal with and often end up being a source of sanitization-related security issues. perhaps an additional unicode-aware version would be warranted.

related: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=47&r=None
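
A rough sketch of what "ASCII 0-9 only" could mean at the byte level. `is_ascii_digits` is a hypothetical name, and this assumes the input is a NUL-terminated byte string; any UTF-8 encoded numeral from another script is made of bytes outside '0'..'9', so it gets rejected without any Unicode tables.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if s is non-empty and consists only of
 * the ASCII bytes '0' through '9', and 0 otherwise. */
static int is_ascii_digits(const char *s) {
    size_t i;

    if (s[0] == '\0') return 0;       /* the empty string is not a number */
    for (i = 0; s[i] != '\0'; i++) {
        /* Plain byte comparison: works whether char is signed or unsigned,
         * because every non-ASCII byte falls outside the '0'..'9' range. */
        if (s[i] < '0' || s[i] > '9') return 0;
    }
    return 1;
}
```

For example, `is_ascii_digits("\xEF\xBC\x91")` (the UTF-8 bytes of FULLWIDTH DIGIT ONE) returns 0, as does a string with a leading sign like "-1".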

Owner

I could write C bindings to the Rust functions, that way they just accept anything Rust does.

I’m not really sure what this is all about, though. Can you explain in-depth what this issue is for?

Author
Owner

Integer parsing has to be done in our C programs in a couple of places:

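  • src/dj.c:299 (https://git.tebibyte.media/bonsai/coreutils/src/branch/main/src/dj.c#L299) has an awkward integer parsing function to handle numeric option arguments.
  • src/intcmp.c:63 (https://git.tebibyte.media/bonsai/coreutils/src/branch/main/src/intcmp.c#L63) has more, different awkward integer parsing.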

I'm starting to implement pg(1) (https://git.tebibyte.media/bonsai/coreutils/src/branch/pg/src/pg.c) from #44. The code is bad and the branch in which I'm implementing it is mainly serving as a playground in which I can toss shit around and see what works. pg(1) needs to parse numeric arguments to configure page lengths.

This will be my third time figuring out integer parsing with strtol(3) and I have found it awkward and difficult to read every time I've done it. strtol(3) seems incredibly overengineered for this task (base configuration? an end pointer?) but scanning a string with isdigit(3p) and using atoi(3p) seems crude and checks each byte in the input at least twice. There are probably even some subtle inconsistencies between integer parsing in dj(1) and intcmp(1) (I really hope not) - I don't wanna take a third risk at inconsistent behavior.

Rust parsing is nice and it would be very welcome in our C utilities. Rust makes this very easy (https://git.tebibyte.media/bonsai/coreutils/src/branch/main/src/swab.rs#L59).
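
To make the ask concrete, here's a sketch of the kind of interface a shared C routine might have: one pass over the input, no errno, and an error code that says what went wrong (loosely in the spirit of Rust's parse returning a Result). Everything here is hypothetical, including the names `parse_err` and `parse_ulong` and the choice of unsigned long; it only handles unsigned decimal input.

```c
#include <limits.h>

/* Hypothetical error codes; not an existing API. */
enum parse_err { PARSE_OK = 0, PARSE_EMPTY, PARSE_INVALID, PARSE_OVERFLOW };

/* Hypothetical routine: parse s as an unsigned decimal number, looking at
 * each byte exactly once. *out is written only when PARSE_OK is returned. */
static enum parse_err parse_ulong(const char *s, unsigned long *out) {
    unsigned long v = 0;

    if (*s == '\0') return PARSE_EMPTY;
    for (; *s != '\0'; s++) {
        unsigned long d;

        if (*s < '0' || *s > '9') return PARSE_INVALID;
        d = (unsigned long)(*s - '0');
        /* v * 10 + d fits exactly when v <= (ULONG_MAX - d) / 10 */
        if (v > (ULONG_MAX - d) / 10) return PARSE_OVERFLOW;
        v = v * 10 + d;
    }
    *out = v;
    return PARSE_OK;
}
```

A caller like pg(1) could then do `if (parse_ulong(optarg, &lines) != PARSE_OK) usage();` (with `lines` and `usage` standing in for whatever the utility actually uses), and dj(1) and intcmp(1) would get identical behavior by calling the same routine.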

Reference: bonsai/coreutils#94