Need to support larger messages, arrays #2

Open
opened 2025-02-01 17:26:01 -07:00 by sashakoshka · 3 comments
Owner

16 KiB is a lot of data sometimes, but other times it really isn't. It's in the territory where some data could very well go over the limit and cause random issues. The reason it's like this is that large messages could clog the line when using METADAPT-A, blocking other transactions while a gigabyte or more of data is transferred, and they could enable DoS attacks on applications that face the internet (or ingest a lot of data from the user without checking for an upper limit).

For METADAPT-B, this could be solved by simply increasing the message size field to a U64 (like in websockets), because with QUIC each stream is flow-controlled individually. For METADAPT-A, we would need to have the first bit of the message length be a "chunk" bit, meaning the data of the next message in the same transaction should be appended onto the data in this message, and the "chain" of chunks would end when a message doesn't have that bit set. Multiplexing with other transactions would be unaffected.

For DoS mitigation, a good solution would be to allow the protocol to impose a size limit on messages (including chunked ones) that would default to something like 1 megabyte. An application still might want to send unbounded streams of data, though, so there could be an alternate way of reading the next message in a transaction that returns a reader, which would be fed chunks and closed on end. To see how this would work, imagine this scenario:

A protocol defines a file transfer transaction that starts with a Get message from the client, and ends with a Return message from the server. The return message is not TAPE encoded, but contains file contents.

A client initiates the transaction with:

  • Method: Get
  • Resource: /index.txt

and then requests the next message from the API as a reader.

The server sends:

  • Method: Return

using an alternate method where it receives a writer from the API. It pipes the data into the message, then closes the writer, which flushes the buffer and sends the final message.

The client is returned the method code, alongside a reader which is fed new chunks as they arrive.
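The reader/writer flow in this scenario can be mocked up with io.Pipe standing in for the chunked transaction stream; the shape of the API and the file contents here are hypothetical, not an existing hopp API:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// transferFile mocks the Get/Return scenario: the server side is handed
// a writer, the client side a reader, with io.Pipe playing the role of
// the transaction stream.
func transferFile() string {
	pr, pw := io.Pipe()

	// Server: answers the Get with a Return message, piping the file
	// contents into the writer; closing it flushes the buffer and
	// sends the final (unchunked) message.
	go func() {
		file := strings.NewReader("contents of /index.txt")
		io.Copy(pw, file)
		pw.Close()
	}()

	// Client: requested the next message as a reader, which is fed
	// chunks until the writer is closed.
	body, _ := io.ReadAll(pr)
	return string(body)
}

func main() {
	fmt.Println(transferFile()) // prints "contents of /index.txt"
}
```

Because the reader is fed chunks as they arrive, neither side ever has to hold the whole file in memory.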

Author
Owner

Additionally, TAPE could use a redesign. It does not need to be corruption resistant, because a slightly corrupt message should just be rejected outright, and it needs to handle bigger data anyway if these changes are to mean anything. Plain TLV with a U16 tag and a U32 length would work well; the reason the length isn't a U64 is that accepting a buffer of that length would open the floodgates to DoS attacks. A U32 would at least be reasonable.

However, this would end up wasting a lot of data. Most things aren't even going to be close to a U32 in size, so on average three of the four length bytes would be zeros, which means the TL would in many cases be longer than the V. To solve this, there could be some sort of expanding integer format that works a bit like the METADAPT-A chunks, where the first bit of every byte is a chunk bit and all byte chunks are big-endian'd together (not counting the chunk bits). Because the vast majority of data is likely to be at most 127 bytes long, this would allow the vast majority of Ls to be only one byte long, but arbitrarily long fields would still be possible.

Here's how much it costs to encode the type and length information with each encoding method, compared to the size of the actual value:

|                 | U8 | 100 U64 pairs | "Hello world!" | Buffer, 1 megabyte in size |
| --------------- | -: | ------------: | -------------: | -------------------------: |
| Current TAPE    |  4 |           400 |              4 |          Not even possible |
| U32 length      |  6 |           600 |              6 |                          6 |
| Chunked length  |  3 |           300 |              3 |                          6 |
| Length of value |  1 |           800 |             12 |                    1048576 |

Here's how many length chunks it would take to describe 1 megabyte of data:

Binary form: `000100000000000000000000`
Chunking: `000 1000000 0000000 0000000`
Adding chunk bits: `10000000 11000000 10000000 00000000`

As can be seen, chunked encoding in every case either uses the fewest bytes or is tied for the fewest. I'd call that an epic win.

Author
Owner

TAPE could have a DoS mitigation measure similar to METADAPT's, where the protocol can define a maximum field length (for *all* fields, because to enable backwards-compatible protocols, field lengths must be self-describing and not dependent on their type) that is around 1 megabyte by default.

Author
Owner

VILA also needs to be altered to support chunked length encoding, and should probably draw its maximum element length from TAPE.

Reference: sashakoshka/hopp#2