Skip to content

Support unknown fields #456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
steven-sheehy opened this issue Apr 4, 2025 · 5 comments · May be fixed by #471
Open

Support unknown fields #456

steven-sheehy opened this issue Apr 4, 2025 · 5 comments · May be fixed by #471
Assignees
Labels
Feature Enhancement Enhancing an existing feature driven by business requirements. Typically backwards compatible.

Comments

@steven-sheehy
Copy link
Member

Problem

The mirror node needs to be both backwards and future compatible with all versions of the record and block stream protobufs. If any field is removed, we still need to parse historical versions of it. If any new transaction type is added but we have not updated the protobuf version to add support, we still store generic transaction info without failing.

Solution

Add optional support for unknown fields when using parse(). In protoc, this takes the form of Map<Integer, Field> fields. parseStrict() should remain unchanged and throw an error if unknown fields are present.

Alternatives

No response

@steven-sheehy steven-sheehy added the Feature Enhancement Enhancing an existing feature driven by business requirements. Typically backwards compatible. label Apr 4, 2025
@anthony-swirldslabs
Copy link
Contributor

As mentioned off-line previously, PBJ deliberately didn't implement support for unknown fields as a performance optimization, so as to avoid allocating objects (e.g. maps, byte arrays, and similar) to keep this information in the models. However, as long as this feature is optional, then PBJ could support unknown fields.

For example, it could be enabled via an additional boolean argument passed to the parse() method, e.g.:

@NonNull
T parse(@NonNull ReadableSequentialData input, boolean strictMode, int maxDepth, boolean preserveUnknownFields) throws ParseException;

In the generated model implementation, we'd add a private field of type Map<Integer, UnknownField> which would be initialized to null if preserveUnknownFields is false. However, if the parsing of unknown fields is enabled, then PBJ could create a map object when it encounters the first unknown field and store a reference to this map in that new field. The UnknownField could be defined similar to:

record UnknownField(ProtoConstants wireType, Bytes data){}

The model's copyBuilder() method would preserve the unknown fields map in the new builder.

Finally, the codec's write() method would serialize all the unknown fields so as to preserve them and mimic the Google Protobuf behavior.

However, it's unclear whether applications need to have access to the unknown fields data. Since they are unknown to the application, it's clear that the application is using a .proto model that differs from the model which was used to serialize the original object. Therefore, with the exception of a few edge cases, the application wouldn't be able to meaningfully interpret the unknown fields.

Could you please confirm if access to the unknown fields is necessary at the model public API level, and if so, please clarify the use case?

@steven-sheehy
Copy link
Member Author

Yes, as already stated, the mirror node uses the data from the unknown fields. It is a requirement that mirror node not fail if it encounters a transaction type that it doesn't understand. But we still want to store generic transaction info so we extract the transaction type ordinal from the protobuf. We also extract the signature type and bytes for newly added signatures we're unaware of. And it's used in various tests. Here's a link to those areas: https://github.com/search?q=repo%3Ahiero-ledger%2Fhiero-mirror-node%20unknownfields&type=code

I might also suggest that the parse method is getting a bit unwieldy with the addition of more overloaded params. We might want to add an options class so it is future compatible with any new option. For example, ParseOptions.newBuilder().depth(1).strictMode(true).unknownFields(true).build() which apps can create once and reuse across requests.

@anthony-swirldslabs
Copy link
Contributor

I agree that an API with an Options argument would be easier to extend. However, PBJ is focused on the performance and we try really hard to avoid allocating new objects, or requiring the client code to allocate new objects just for the purpose of calling a PBJ method. While I understand that the client code could cache and reuse frequently used options instances, I don't think this would establish the right pattern. For this reason, I think we'd go with an extra argument. The backward compatibility can trivially be achieved by adding a default overloaded version of this method to the Codec interface in PBJ that uses the default value of false for this new argument. So the old code wouldn't break, and the new code could start using the new method.

Thanks for elaborating on your use-case. That makes sense. Would an access to an unmodifiable Map<Integer, UnknownField> with record UnknownField(ProtoConstants wireType, Bytes data){} as proposed above be sufficient for you? The keys in the map would be the field numbers, and the values would tell you the Protobuf wire type and the actual bytes containing the data of the field. You'd be responsible for parsing the bytes using the wire type of the data if you need to actually interpret it as, say, an integer, or whatever the wire type of the data might be (see ProtoConstants in PBJ for more details.)

@steven-sheehy
Copy link
Member Author

Yes, that works for us. Thanks for looking into this.

@anthony-swirldslabs anthony-swirldslabs linked a pull request Apr 17, 2025 that will close this issue
2 tasks
@anthony-swirldslabs anthony-swirldslabs moved this to 👀 In Review in Foundation Team Apr 17, 2025
@anthony-swirldslabs
Copy link
Contributor

@steven-sheehy : The fix is out for review at #471

The only notable change from what we've discussed above is the UnknownField which now looks like:

public record UnknownField(ProtoConstants wireType, List<Bytes> bytes) {}

You can see that bytes is now a list. This is necessary to properly support repeated fields. This list will always contain at least one entry. But if the same field is present on the wire multiple times, then the list may contain more entries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature Enhancement Enhancing an existing feature driven by business requirements. Typically backwards compatible.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants