Making npc(1) Unicode-aware #180

Open
opened 2026-02-27 12:28:43 -07:00 by emma · 1 comment
Owner

npc(1) was produced before we settled on a Unicode-native system.

https://doc.rust-lang.org/stable/std/primitive.char.html#method.is_control

https://www.unicode.org/charts/nameslist/n_2400.html

  • Plan 9/9front displays control characters as control pictures in drawterm
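As a rough illustration of the control-picture idea, here is a minimal sketch in C. It assumes the mapping shown in the linked U+2400 chart (NUL through US at U+2400-U+241F, DEL at U+2421). The names control_picture and utf8_encode3 are hypothetical and not part of npc(1).

```c
/* Return the Control Pictures code point for an ASCII control character,
 * or 0 if c is not one. (Hypothetical helper, not from npc(1).) */
static unsigned long
control_picture(int c) {
	if (c >= 0x00 && c <= 0x1f)
		return 0x2400UL + (unsigned long)c; /* U+2400 NUL .. U+241F US */
	if (c == 0x7f)
		return 0x2421UL; /* U+2421 SYMBOL FOR DELETE */
	return 0;
}

/* Encode a code point in U+0800-U+FFFF as three UTF-8 bytes plus a
 * terminating NUL. Every control picture falls in this range. */
static void
utf8_encode3(unsigned long cp, char out[4]) {
	out[0] = (char)(0xe0 | (cp >> 12));
	out[1] = (char)(0x80 | ((cp >> 6) & 0x3f));
	out[2] = (char)(0x80 | (cp & 0x3f));
	out[3] = '\0';
}
```

A caller could pass each input byte to control_picture and, when the result is nonzero, write the bytes from utf8_encode3 in place of the original byte.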
emma added the bug and enhancement labels 2026-02-27 12:28:43 -07:00
Owner

Regardless of the US-ASCII nativity of the system, there has to be at
least a fragment of US-ASCII compatibility in order to attempt use on
ASCII terminals, many of which were implemented in hardware/firmware.

The largest set of potentially non-printing characters should be the
default, and then we could implement -8 for "print eight-bit chars"
and print graphics for the Unicode control characters as encoded in
UTF-8. Per the hyperlinked source, the Unicode Consortium states in
https://www.unicode.org/policies/stability_policy.html#Property_Value
that the set of Unicode control characters won't change, so it would
just take a dash of decoding:

#include <stdio.h>

enum endian { BIG = 0, LITTLE = 1 };

/* If the retval > 0xff, it's a UTF-8 encoded Unicode control character which
 * needs to be converted to its graphic. The high byte is the UTF-8 lead byte.
 * The return type is long so that it can hold EOF as well as the pair. */
static long
supergetc(FILE *stream, enum endian endian /* of stream */) {
	/* PROOFREAD ME! */
	int c;
	int c2;

	/* The Unicode control characters beyond ASCII are 0x80-0x9f.
	 * 0b 110x xxyy 10yy zzzz
	 *       0 0010   01 1111 -> 0b 1001 1111 -> 0x 9f
	 *    1100 0010 1001 1111 -> 0b 1100 0010 1001 1111 -> 0x c2 9f
	 * So the lead byte is always 0xc2 and the continuation byte is
	 * 0x80-0x9f. */

	if ((c = fgetc(stream)) <= 0x7f) /* ASCII or EOF */
		return c;

	/* Byte 1 must be the lead byte (BIG) or a continuation byte in the
	 * control range (LITTLE); otherwise pass it through. */
	if (endian == BIG ? c != 0xc2 : (c < 0x80 || c > 0x9f))
		return c;

	c2 = fgetc(stream);
	if (endian == BIG ? (c2 < 0x80 || c2 > 0x9f) : c2 != 0xc2) {
		/* malformed or not a control character! pass-thru */
		if (c2 != EOF)
			ungetc(c2, stream);
		return c;
	}

	return endian == BIG
		? ((long)c << 8) | c2
		: ((long)c2 << 8) | c;
}

For example.
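To make the bit arithmetic in the comment concrete, here is a small sketch that checks a byte pair and decodes it back to its code point. It assumes the pair has already been split into a lead byte and a continuation byte. The helper names is_c1_pair and utf8_decode2 are hypothetical.

```c
/* Nonzero if (lead, cont) is the UTF-8 encoding of a control character
 * in U+0080-U+009F. (Hypothetical helper.) */
static int
is_c1_pair(int lead, int cont) {
	return lead == 0xc2 && cont >= 0x80 && cont <= 0x9f;
}

/* Decode a two-byte UTF-8 sequence (0b 110x xxxx 10yy yyyy) into its
 * code point by joining the five payload bits of the lead byte with the
 * six payload bits of the continuation byte. */
static unsigned long
utf8_decode2(int lead, int cont) {
	return ((unsigned long)(lead & 0x1f) << 6)
		| (unsigned long)(cont & 0x3f);
}
```

With these, the pair 0xc2 0x9f decodes to 0x9f, the last control character, while 0xc2 0xa0 (U+00A0, no-break space) is rejected by is_c1_pair.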

Reference: bonsai/harakit#180