Number parsing routine #94

Open

opened 2024-04-22 13:40:44 +00:00 by trinity · 3 comments

trinity commented

2024-04-22 13:40:44 +00:00

Owner

Rust has good number parsing with the `parse` method; meanwhile C's stuck in 1989. atoi(3) sucks: it returns 0 if the value is 0 *or* if there's an error, so the caller can't tell the two apart. strtol(3) is the grown-up version of the function but is a real pain to use (set errno to 0 before the call, check it afterwards, and also check the end pointer). I really want a good number parser to use in C programs.
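For reference, the strtol(3) dance described above looks roughly like this. It is only a sketch: `parse_long` is a hypothetical name, not something in the tree, and it inherits strtol(3)'s leniency (leading whitespace and a leading `+` or `-` are accepted).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the tree: parse s as a base-10 long.
 * Returns 0 and stores the value in *out on success, -1 on any error. */
static int
parse_long(const char *s, long *out)
{
	char *end;
	long v;

	assert(s != NULL && out != NULL);

	errno = 0;                      /* strtol sets errno but never clears it */
	v = strtol(s, &end, 10);

	if (end == s) return -1;        /* no digits were consumed */
	if (*end != '\0') return -1;    /* trailing garbage, e.g. "12abc" */
	if (errno == ERANGE) return -1; /* value does not fit in a long */

	*out = v;
	return 0;
}
```

Unlike atoi(3), a caller can tell `"0"` (returns 0, value 0) apart from `""` or `"abc"` (returns -1), at the cost of three separate checks.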

Things to consider:

- Where will the library go in the tree?
- Should there be a `libbonsai` or should we have specific libraries for specific sets of functionality?
- Which language should we use for this? How will FFI stuff work?
- How will Unicode work? Should we accept all types of numerals or just ASCII 0-9?

trinity added the enhancement and help wanted labels 2024-04-22 13:40:44 +00:00

trinity referenced this issue 2024-04-22 13:44:43 +00:00: String tokenizing routine #95

silt commented

2024-04-22 21:03:18 +00:00

Owner

> Should we accept all types of numerals or just ASCII 0-9?

advocating for only accepting ascii 0-9. homoglyphs are annoying to deal with and often end up being a source of sanitization-related security issues. perhaps an additional unicode-aware version would be warranted.

related: https://util.unicode.org/UnicodeJsps/confusables.jsp?a=47&r=None
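A minimal sketch of what an ASCII-only parser could look like, assuming non-negative values are enough (page lengths, byte counts). The name `parse_ascii_ulong` and its return convention are made up for illustration. It rejects signs, whitespace, and every non-ASCII byte, so Unicode digits and homoglyphs never get through, and it looks at each input byte exactly once.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: parse a non-negative decimal integer made only of
 * ASCII '0'..'9'. No sign, no whitespace, no locale, no other numerals.
 * Returns 0 and stores the value in *out on success, -1 on any error. */
static int
parse_ascii_ulong(const char *s, unsigned long *out)
{
	unsigned long v = 0;
	size_t i;

	assert(s != NULL && out != NULL);

	if (s[0] == '\0') return -1;    /* empty string is not a number */

	for (i = 0; s[i] != '\0'; ++i) {
		unsigned long d;

		/* Bytes of multi-byte UTF-8 sequences fall outside this range
		 * whether char is signed or unsigned, so they are rejected. */
		if (s[i] < '0' || s[i] > '9') return -1;
		d = (unsigned long)(s[i] - '0');

		/* v * 10 + d would exceed ULONG_MAX: report overflow. */
		if (v > (ULONG_MAX - d) / 10) return -1;
		v = v * 10 + d;
	}

	*out = v;
	return 0;
}
```

A Unicode-aware variant, if one is wanted later, could live next to this under a different name so callers have to opt in to it.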

emma commented

2024-04-24 01:25:54 +00:00

Owner

I could write C bindings to the Rust functions, that way they just accept anything Rust does.

I’m not really sure what this is all about, though. Can you explain in-depth what this issue is for?

trinity commented

2024-04-27 02:32:29 +00:00

Author

Owner

Integer parsing has to be done in our C programs in a couple of places:

- [src/dj.c:299](https://git.tebibyte.media/bonsai/coreutils/src/branch/main/src/dj.c#L299) has an awkward integer parsing function to handle numeric option arguments.
- [src/intcmp.c:63](https://git.tebibyte.media/bonsai/coreutils/src/branch/main/src/intcmp.c#L63) has more, different awkward integer parsing.

I'm [starting to implement pg(1)](https://git.tebibyte.media/bonsai/coreutils/src/branch/pg/src/pg.c) from #44. The code is bad and the branch in which I'm implementing it is mainly serving as a playground in which I can toss shit around and see what works. pg(1) needs to parse numeric arguments to configure page lengths.

This will be my third time figuring out integer parsing with strtol(3) and I have found it awkward and difficult to read every time I've done it. strtol(3) seems incredibly overengineered for this task (base configuration? an end pointer?) but scanning a string with isdigit(3p) and using atoi(3p) seems crude and checks each byte in the input at least twice. There are probably even some subtle inconsistencies between integer parsing in dj(1) and intcmp(1) (I really hope not) - I don't wanna take a third risk at inconsistent behavior.

Rust parsing is nice and it would be very welcome in our C utilities. [Rust makes this very easy](https://git.tebibyte.media/bonsai/coreutils/src/branch/main/src/swab.rs#L59).