Need to skip over values we don't recognize #8
The problem
TAPE is designed so that the decoder can gloss over data it does not understand. The protocol technically allows for this, but I completely forgot to implement it in the generated decoder, oops. This would be trivial if TAPE messages were still flat tables, but they aren't, because flat tables aren't useful enough. So, let's analyze the problem.
When it happens
There are two reasons something might not match up with the expected data:
The first and most obvious is unrecognized keys. If a key is not in the set of recognized keys for a KTV, the decoder should skip over its value and leave the corresponding struct field blank. Once #6 has been implemented, it should throw an error if the data was not optional.
The second is wrong types. If we are expecting a KTV and get an SBA, we should leave the data empty. The aforementioned concern about #6 also applies here. We don't need to worry about special cases at the structure root, because it would technically be possible to make the structure root an option, so it really is just a normal value. Until #6, we will leave that blank too.
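To make the intended behavior concrete, here is a tiny sketch of what leaving things blank would mean on the Go side. The Connect message, its field names, and its key numbers are all invented for this example and are not part of the real protocol.

```go
// Hypothetical generated message type; the names, keys, and expected
// TAPE types are invented purely for illustration.
type Connect struct {
	Name string // key 0x0000, expected to be an SBA (string)
	Port uint16 // key 0x0001, expected to be an integer
}

// Desired decoding behavior for a KTV arriving over the wire:
//
//   key 0x0000 with the expected type -> decoded into Name as usual
//   key 0x7FFF (unrecognized)         -> value skipped, nothing touched
//   key 0x0001 with the wrong type    -> value skipped, Port left as its
//                                        zero value (0)
//
// Until #6 lands, "blank" just means the Go zero value; after #6, a
// skipped non-optional field should instead produce an error.
```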
Preliminary ideas
The first is going to be pretty simple. All we need to do is have a skimmer function that skims over TAPE data very quickly, and then call it on the KTV value each time we run into a mystery key. It should only return an error if the structure of the data is malformed in such a way that it cannot continue to the next value. This should live in the tape package alongside the dynamic decoding functions, because they will essentially function the same way and could probably share lots of code.

The second is a bit more complicated because of KTV and OTA, which are aggregate types. Go types work a bit differently: if you have an array of an array of an array of ints, that information is represented in one place, whereas TAPE doesn't really do that. All of that information is buried within the data structure, so we don't know what we will be decoding before we actually do it. Whenever we encounter a type we don't expect, we would need to abort decoding of the entire data structure, and then skim over whatever detritus is left, which would literally be in a half-decoded state. The fact that the code is generated flat, and thus cannot use return or defer statements, contributes to the complexity of this problem. We need to go up, but we can't. There is no up, only forward.
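Going back to the first case, here is a minimal sketch of what that skimmer could look like. The tag values, length encoding, and key width are invented for the example and almost certainly differ from the real TAPE wire format, and scalar types are omitted entirely; only the shape of the recursion matters here.

```go
package tape

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"io"
)

// Hypothetical layout, not the real TAPE encoding:
//   SBA: tag, 4-byte length, then that many bytes
//   KTV: tag, 4-byte pair count, then (2-byte key, value) pairs
//   OTA: tag, 4-byte element count, then that many values
const (
	tagSBA byte = 0x01
	tagKTV byte = 0x02
	tagOTA byte = 0x03
)

// Skip reads exactly one TAPE value and throws it away, recursing into
// aggregates. It only fails if the data is so malformed that it cannot
// find the end of the value.
func Skip(reader *bufio.Reader) error {
	tag, err := reader.ReadByte()
	if err != nil {
		return err
	}
	switch tag {
	case tagSBA:
		length, err := readUint32(reader)
		if err != nil {
			return err
		}
		_, err = reader.Discard(int(length))
		return err
	case tagKTV:
		count, err := readUint32(reader)
		if err != nil {
			return err
		}
		for index := uint32(0); index < count; index++ {
			// skip the 2-byte key, then the value it maps to
			if _, err := reader.Discard(2); err != nil {
				return err
			}
			if err := Skip(reader); err != nil {
				return err
			}
		}
		return nil
	case tagOTA:
		count, err := readUint32(reader)
		if err != nil {
			return err
		}
		for index := uint32(0); index < count; index++ {
			if err := Skip(reader); err != nil {
				return err
			}
		}
		return nil
	default:
		// scalar tags would be handled here; for the sketch, anything
		// else counts as malformed
		return fmt.Errorf("tape: unknown tag 0x%02X", tag)
	}
}

func readUint32(reader *bufio.Reader) (uint32, error) {
	var buffer [4]byte
	if _, err := io.ReadFull(reader, buffer[:]); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint32(buffer[:]), nil
}
```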
Of course, the dynamic decoder does not have this aggregate type problem in the first place, because it doesn't expect anything and constructs the destination to fit whatever it sees in the TAPE structure as it decodes it. KTVs are completely dynamic because they are implemented as maps, so the only time it needs to completely comprehend a type is with OTAs. There is a function called typeOf that gets the type of the current tag and returns it as a reflect.Type, which necessitates recursion and peeking at OTAs and their elements.

We could try to do the same thing in the generated decoder, comparing the determined type against the expected type to figure out whether we should decode an array or a table, etc. This is immediately problematic, as it requires memory to be allocated for both the peek buffer and the resulting tree of type information. Even if we end up with some crazy way to keep track of the types, that only solves half of the allocation problem, and we would still be spending extra cycles going over all of that data twice.
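For illustration, this is roughly the shape of that work, continuing the hypothetical layout and constants from the Skip sketch above (and assuming the reflect, io, and fmt imports); the real typeOf is not shown here. Determining the type of an OTA means looking past its header at the element tag, and nested OTAs turn that into recursion over a peeked buffer plus a freshly built tree of reflect.Type values.

```go
// typeAt approximates what the dynamic decoder's typeOf has to do: given
// already-peeked bytes, work out the Go type of the value starting at
// offset. It returns the type and how many header bytes it inspected.
// Every OTA forces a recursive look at its element tag, and the result
// (for example a slice of slices) is a newly allocated type tree.
func typeAt(data []byte, offset int) (reflect.Type, int, error) {
	if offset >= len(data) {
		return nil, 0, io.ErrUnexpectedEOF
	}
	switch data[offset] {
	case tagSBA:
		return reflect.TypeOf([]byte(nil)), 1, nil
	case tagKTV:
		return reflect.TypeOf((map[uint16]any)(nil)), 1, nil
	case tagOTA:
		// the element tag sits after the tag byte and the 4-byte count
		elementType, inspected, err := typeAt(data, offset+5)
		if err != nil {
			return nil, 0, err
		}
		return reflect.SliceOf(elementType), 5 + inspected, nil
	default:
		return nil, 0, fmt.Errorf("tape: unknown tag 0x%02X", data[offset])
	}
}
```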
Performance constraints
The generated decoder is supposed to blaze through data, and it can't do that if it does all the singing and dancing that the dynamic decoder does. It's time for some performance constraints:
I'm not really going to do my usual thing here of making a slow version and speeding it up over time based on evidence and experimentation, because these constraints inform the design so much that it would be impossible to continue without them. I am 99% confident that these constraints will allow for an acceptable baseline of performance (for generated code), and we can still profile and micro-optimize later. This is good enough for me.
Heavy solution
There is a solution that might work very well, which involves completely redoing the generated decoding code. We could create a function for every source type to destination type mapping that exists in the protocol, and then compose them all together. The decoding methods for each message or type would be wrappers around the correct function for their root TAPE -> Go type mapping. The main benefit is that it would make this problem a lot more manageable, because the interface points between the data would be represented by function boundaries. This would allow the use of return and defer statements, and would allow more code sharing, producing a smaller binary. Go would probably inline these where needed.
Would this work? Probably. More investigation is required to make sure. I want to stop re-writing things I don't need to. On the other hand, it is just the decoder.
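To make the idea a bit more concrete, here is a very rough sketch of that shape, reusing the hypothetical Connect message from the first sketch and the Skip and readUint32 helpers from the tape package sketch; readUint16 is another invented helper, and everything is written as if in one package for brevity (in reality the skip call would be tape.Skip). Every name and signature is made up; the point is only that each TAPE -> Go mapping becomes its own function, so a wrong or unknown tag can be handled with an early return instead of fighting flat generated code.

```go
// One function per TAPE -> Go mapping. If the tag is not what it
// expects, it glosses over the value and leaves the destination blank.
func decodeStringFromSBA(reader *bufio.Reader, destination *string) error {
	header, err := reader.Peek(1)
	if err != nil {
		return err
	}
	if header[0] != tagSBA {
		return Skip(reader) // wrong type: leave *destination blank
	}
	if _, err := reader.Discard(1); err != nil {
		return err
	}
	length, err := readUint32(reader)
	if err != nil {
		return err
	}
	buffer := make([]byte, length)
	if _, err := io.ReadFull(reader, buffer); err != nil {
		return err
	}
	*destination = string(buffer)
	return nil
}

// decodeU16FromInt would be analogous; placeholder body for the sketch.
func decodeU16FromInt(reader *bufio.Reader, destination *uint16) error {
	return Skip(reader)
}

// The generated decoding method for a message is just a thin wrapper
// around the mapping function for its root TAPE -> Go mapping.
func (message *Connect) DecodeValue(reader *bufio.Reader) error {
	return decodeConnectFromKTV(reader, message)
}

// decodeConnectFromKTV dispatches each key of the root KTV to the right
// mapping function, and skips anything it does not recognize.
func decodeConnectFromKTV(reader *bufio.Reader, destination *Connect) error {
	header, err := reader.Peek(1)
	if err != nil {
		return err
	}
	if header[0] != tagKTV {
		return Skip(reader) // wrong type at the root: leave it all blank
	}
	if _, err := reader.Discard(1); err != nil {
		return err
	}
	count, err := readUint32(reader)
	if err != nil {
		return err
	}
	for index := uint32(0); index < count; index++ {
		key, err := readUint16(reader)
		if err != nil {
			return err
		}
		switch key {
		case 0x0000:
			err = decodeStringFromSBA(reader, &destination.Name)
		case 0x0001:
			err = decodeU16FromInt(reader, &destination.Port)
		default:
			err = Skip(reader) // unrecognized key: gloss over the value
		}
		if err != nil {
			return err
		}
	}
	return nil
}

func readUint16(reader *bufio.Reader) (uint16, error) {
	var buffer [2]byte
	if _, err := io.ReadFull(reader, buffer[:]); err != nil {
		return 0, err
	}
	return binary.BigEndian.Uint16(buffer[:]), nil
}
```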
Light solution
TODO: find a solution that satisfies the performance constraints, keeps the same identical interface, and works off the same code. I am convinced this is doable, and it might even allow us to extract more data from an unexpected structure. However, continuing this way might introduce unmanageable complexity. It is already a little unmanageable and I am just one pony (kind of).
Full document at: https://git.tebibyte.media/sashakoshka/hopp/src/branch/branched-generated-encoder/design/branched-generated-encoder.md
Forgot to implement tape.Skip, need to do after #7.

This also isn't done in the dynamic decoder; it simply returns an error if the destination type doesn't match the tag. Need to fix this too, and also make sure this is taken into account when working on #11.