Internationalization for non-OS error messages #77

Open
opened 2024-03-02 06:47:19 +00:00 by emma · 9 comments
Owner
No description provided.
emma added the
bug
help wanted
question
labels 2024-03-02 06:47:19 +00:00
silt changed title from i18n for non-OS error messages to Internationalization for non-OS error messages 2024-03-28 07:42:56 +00:00
Owner

alright this turned into a nearly three-hour rabbit hole but i've crawled back out of it with something that might hopefully be of some use:

#[macro_use]
pub mod tstrutils {
    macro_rules! translated_string {
        ( $name:ident, $( $lang:ident: $tstr:expr ),+ $(,)?) => {
            #[allow(non_camel_case_types)]
            pub struct $name {
                $(
                    pub $lang: &'static str,
                )+
            }

            pub const $name: $name = $name {
                $(
                    $lang: $tstr,
                )+
            };
        };
    }
}

mod s {
    translated_string!(
        HELLO_WORLD,
        en: "Hello, world!",
        fr: "Salut le Monde!",
        jp: "こんにちは、 世界!",
    );

    translated_string!(
        VOCALIZATION,
        dog: "woof",
        bird: "chirp",
        cat: "meow",
        cat_fr: "miau",
    );
}

fn main() {
    println!("{}", s::HELLO_WORLD.en);
    println!("{}", s::HELLO_WORLD.fr);
    println!("{}", s::HELLO_WORLD.jp);

    println!("{}", s::VOCALIZATION.dog);
    println!("{}", s::VOCALIZATION.bird);
    println!("{}", s::VOCALIZATION.cat);
    println!("{}", s::VOCALIZATION.cat_fr);
}
$ cargo run
Hello, world!
Salut le Monde!
こんにちは、 世界!
woof
chirp
meow
miau

no clue how idiomatic any of this is, never used macros before either so there are almost certainly issues here but it at least seems like a reasonable enough starting point. thoughts, praise, criticism, personal attacks, etc are all welcome.

Owner

and yes i'm aware of how disgusting it is to reuse the identifier like that, but it works and i don't really see any practical issue with it?
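
For context on why the reuse compiles: Rust resolves type names and value names in separate namespaces, and a braced struct only occupies the type namespace, so a const of the same name does not collide with it. A minimal sketch follows; `Greeting`, `text`, and `describe` are made-up names for illustration and are not part of the macro above.

```rust
// A braced struct lives only in the type namespace...
pub struct Greeting {
    pub text: &'static str,
}

// ...so a const with the same identifier can live in the value namespace.
// (A unit or tuple struct would collide, since those also define a value.)
#[allow(non_upper_case_globals)]
pub const Greeting: Greeting = Greeting { text: "hi" };

// In type position `Greeting` is the struct; in expression position it is the const.
fn describe(g: &Greeting) -> &'static str {
    g.text
}

fn main() {
    println!("{}", describe(&Greeting));
}
```

The main practical cost is readability: a reader has to work out from position whether `Greeting` means the type or the value.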

Owner

one potential issue that i do see is that, because of my choice to do str_id.lang instead of lang.str_id, there's seemingly no simple way to get and store a usable language identifier once at program init and then not incur additional overhead with language selection each time a string is used later on in the program. i made that choice for the sake of keeping related things in the same place, and i stand by it. i believe that this can be elegantly worked around with yet another macro, but unfortunately my only clear idea on that front so far would involve bringing in Syn for its ability to turn strings into identifiers. it's a hefty dependency for smth this minor, but the only alternative i can see at this point is using a hashmap so that we can use strings as identifiers, which i feel only moves that heftiness into runtime.
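
One possible middle ground, sketched here as an assumption rather than as what the thread settled on: keep the `str_id`-first layout, but resolve the language to a small enum once at startup and index each string by it. The names `Lang`, `TStr`, `from_tag`, and `get` are hypothetical.

```rust
// Hypothetical sketch: the language is parsed once, then used as an array index.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Lang {
    En,
    Fr,
    Ja,
}

impl Lang {
    /// Parse a tag such as "fr"; intended to be called once at program init.
    pub fn from_tag(tag: &str) -> Option<Lang> {
        match tag {
            "en" => Some(Lang::En),
            "fr" => Some(Lang::Fr),
            "ja" => Some(Lang::Ja),
            _ => None,
        }
    }
}

/// One translated string; entries are ordered like the `Lang` variants.
pub struct TStr([&'static str; 3]);

impl TStr {
    /// Lookup is a plain array index, with no string comparison per use.
    pub fn get(&self, lang: Lang) -> &'static str {
        self.0[lang as usize]
    }
}

pub const HELLO_WORLD: TStr =
    TStr(["Hello, world!", "Salut le Monde!", "こんにちは、 世界!"]);

fn main() {
    // Resolve the language once, falling back to English.
    let lang = Lang::from_tag("fr").unwrap_or(Lang::En);
    println!("{}", HELLO_WORLD.get(lang));
}
```

The trade-off is that every string must supply every language in a fixed order, which loses the named fields of the macro version.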

Owner

upon closer inspection, it appears that i can do what i want without Syn, but it will still require a proc macro. not a huge deal, but it does add some complexity to the repo

Owner

I've been interested in GNU gettext(3) which is the GNU solution to localization. The underscore macro as a shortcut to gettext(3) is sprinkled throughout the coreutils (here it is in their true(1) abomination).

gettext(3) looks up the given message in a catalog and returns localizations. We could use UNIX locales and implement a gettext(3), mindful of the classic complaints about locales (reimplementing some C standard library functions, at least <ctype.h>). Or have an environment variable BONSAI_LOCALE or something.
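
As a rough sketch of the environment-variable idea: the lookup could check a project-specific override first and then fall through the usual POSIX message-locale variables. `BONSAI_LOCALE` is only the name floated above, and `pick_locale` is a hypothetical helper; neither exists in the repo.

```rust
use std::env;

/// Return the first non-empty value among the candidate variables.
/// `get` abstracts the environment so the logic can be exercised without touching it.
fn pick_locale<F>(get: F) -> Option<String>
where
    F: Fn(&str) -> Option<String>,
{
    // BONSAI_LOCALE is the hypothetical override; the rest follow the usual
    // precedence of LC_ALL over LC_MESSAGES over LANG.
    ["BONSAI_LOCALE", "LC_ALL", "LC_MESSAGES", "LANG"]
        .iter()
        .copied()
        .filter_map(|name| get(name))
        .find(|value| !value.is_empty())
}

fn main() {
    match pick_locale(|name| env::var(name).ok()) {
        Some(locale) => println!("using locale {locale}"),
        None => println!("no locale set, falling back to built-in strings"),
    }
}
```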

The tstrutils solution is a lot easier for programmers but a massive source file with all the combinations of locale and string might be somewhat difficult for non-programmer translators and it's harder to audit big files. I would prefer at least a bonsai/i18n/lang database or something.

Another thought I have is somehow adding errnos and strerrors, but that wouldn't be useful for non-errors (e.g. Usage:).

I haven't considered efficiency for any of this and am out of time to write this comment but will come back to this thread later with ideas.

Owner

after banging my head against proc macros for the past 8 or so hours, i've decided to call it quits on that front.
here's a more reasonable solution with similar ease-of-use to GNU's _():

mod s;
use std::io::Result;

fn main() -> Result<()> {
    println!("\nField accessors:");
    println!("{}", s::HelloWorld.en_US);
    println!("{}", s::HelloWorld.fr_FR);
    println!("{}", s::HelloWorld.ja_JP);

    println!("\ntranslated_string()!:");
    println!("{}", s::HelloWorld.in_locale("en_US")?);
    println!("{}", s::HelloWorld.in_locale("fr_FR")?);
    println!("{}", s::HelloWorld.in_locale("ja_JP")?);

    // psst hey look over here this is the cool part
    println!("\ntranslated_string()! with LANG var:");
    println!("{}", s::HelloWorld);

    println!("\ntranslated_string()! error on untranslated:");
    println!("{}", s::HelloWorld.in_locale("blah")?);

    Ok(())
}
#[macro_export]
macro_rules! translated_string {
    ( $name:ident, $( $lang:ident: $tstr:expr ),+ $(,)?) => {
        #[allow(non_snake_case)]
        #[derive(Debug)]
        pub struct $name {
            $(
                pub $lang: &'static str,
            )+
        }
        
        #[allow(non_upper_case_globals)]
        pub const $name: $name = $name {
            $(
                $lang: $tstr,
            )+
        };

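        // Fully qualified paths are used instead of a `use` here: an import
        // emitted by the macro would be duplicated (and rejected) as soon as
        // the macro is invoked twice in the same module.
        impl $name {
            /// Returns the string for `lang` (e.g. "en_US"), or an error if untranslated.
            pub fn in_locale(&self, lang: &str) -> ::std::io::Result<&'static str> {
                match lang {
                    $( stringify!($lang) => Ok(self.$lang), )+
                    _ => Err(::std::io::Error::other(format!(
                        "Untranslated string for current locale: {}",
                        stringify!($name)
                    )))
                }
            }
        }

        impl ::std::fmt::Display for $name {
            fn fmt(&self, f: &mut ::std::fmt::Formatter) -> ::std::fmt::Result {
                // Drop any ".encoding" suffix from LANG and normalize '-' to '_'.
                let lang = ::std::env::var("LANG").unwrap_or_default();
                let lang = lang.split('.').next().unwrap_or("").replace('-', "_");
                // Return a formatting error instead of unwrapping when LANG is
                // unset or has no translation.
                f.write_str(self.in_locale(&lang).map_err(|_| ::std::fmt::Error)?)
            }
        }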
    };
}
use trans::translated_string;

translated_string!(
    HelloWorld,
    en_US: "Hello, world!",
    fr_FR: "Salut le Monde!",
    ja_JP: "こんにちは、 世界!",
);

it isn't very good or optimized code, but it does work!
this assumes that LANG is set to a standard RFC5646 language tag (or the common-but-nonstandard variant where subtags are instead delimited by an underscore to work around programming language syntax), but it does gracefully ignore any .encoding specifiers in order to maintain compatibility with normal Linux (*nix?) systems

silt self-assigned this 2024-03-28 21:56:01 +00:00
Owner

> a lot easier for programmers but a massive source file with all the combinations of locale and string might be somewhat difficult for non-programmer translators

i've been pondering this issue since you mentioned it, and the conclusion i've come to is this:
ultimately, the goal of this isn't translation—it's localization. localization isn't just blindly mapping words onto other words. it requires context, not only for the locale for which the text is being translated, but for the actual text itself. non-technical people who get scared off by fairly lightweight syntax (fwiw i could make the macro even lighter on this; the quotes are not strictly necessary) aren't going to provide localized strings that are of any more use than what google translate could give us.

> it's harder to audit big files

i've seen the localization files for other large projects, and they aren't any better. since these are rust source files, as long as everything is imported under the s module, splitting them up should not actually be an issue.
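
A small sketch of what that split could look like. Inline modules stand in for separate files such as `s/greetings.rs`, and plain `&str` constants stand in for macro-generated items; the module and constant names are invented for illustration.

```rust
// Strings are grouped into submodules, then re-exported so that callers
// still write `s::NAME` regardless of which file a string lives in.
mod s {
    mod greetings {
        pub const HELLO_EN: &str = "Hello, world!";
    }

    mod errors {
        pub const USAGE_EN: &str = "Usage:";
    }

    pub use errors::*;
    pub use greetings::*;
}

fn main() {
    println!("{}", s::HELLO_EN);
    println!("{}", s::USAGE_EN);
}
```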

Author
Owner

> this assumes that LANG is set to a standard RFC5646 language tag (or the common-but-nonstandard variant where subtags are instead delimited by an underscore to work around programming language syntax), but it does gracefully ignore any .encoding specifiers in order to maintain compatibility with normal Linux (*nix?) systems

The standard is a combination of ISO 639 and ISO 3166 language and country identifiers, not IETF BCP 47 language tags.
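
To make that format concrete, here is a minimal sketch of splitting a locale name of the conventional shape `language[_territory][.codeset][@modifier]` into its parts. The `Locale` struct and `parse_locale` function are hypothetical helpers, and the sketch does no validation against the ISO code lists.

```rust
/// Components of a locale name: language[_territory][.codeset][@modifier]
#[derive(Debug, PartialEq)]
struct Locale<'a> {
    language: &'a str,
    territory: Option<&'a str>,
    codeset: Option<&'a str>,
    modifier: Option<&'a str>,
}

/// Split a locale name into its parts, peeling suffixes from right to left.
fn parse_locale(s: &str) -> Locale<'_> {
    // "@modifier" comes last, so remove it first.
    let (rest, modifier) = match s.split_once('@') {
        Some((head, m)) => (head, Some(m)),
        None => (s, None),
    };
    // Then ".codeset", e.g. ".UTF-8".
    let (rest, codeset) = match rest.split_once('.') {
        Some((head, c)) => (head, Some(c)),
        None => (rest, None),
    };
    // What remains is "language" or "language_territory".
    let (language, territory) = match rest.split_once('_') {
        Some((l, t)) => (l, Some(t)),
        None => (rest, None),
    };
    Locale { language, territory, codeset, modifier }
}

fn main() {
    println!("{:?}", parse_locale("ja_JP.UTF-8"));
    println!("{:?}", parse_locale("sr_RS@latin"));
}
```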

Owner

oops, tysm for the correction :D

Reference: bonsai/coreutils#77