proposal: encoding/json: opt-in for true streaming support

# Overview

I have long wanted proper streaming support in the `encoding/json` library. I’ve been doing some homework to understand the current state of things, and I think I’ve come to grips with most of it.

A number of previous issues relate to this topic: https://github.com/golang/go/issues/7872, https://github.com/golang/go/issues/11046, https://github.com/golang/go/issues/12001, https://github.com/golang/go/issues/14140

In a nutshell: The library implicitly guarantees that marshaling will never write an incomplete JSON object due to an error, and that during unmarshaling, it will never pass an incomplete JSON message to `UnmarshalJSON`, and this seems a reasonable, conservative default, but is not always the desired behavior.

Work toward this has been done on a couple of occasions, but abandoned or stalled for various reasons. See https://go-review.googlesource.com/c/go/+/13818/ and https://go-review.googlesource.com/c/go/+/135595

See also my related post on golang-nuts: https://groups.google.com/d/msg/golang-nuts/ABD4fTkP4Nc/bliIAAAeAQAJ

# The problem to be solved

Dealing with large JSON structures is inefficient, due to the internal buffering done by `encoding/json`. `json.NewEncoder` and `json.NewDecoder` appear to offer streaming benefits, but this is mostly an idiomatic advantage, not a performance one, as internal buffering still takes place.

To elaborate:

When encoding, even with `json.Encoder`, the entire object is marshaled into memory, before it is written to the `io.Writer`. This proposal allows writing the JSON output immediately, rather than waiting for the entire process to complete successfully first.

The same problem occurs in reverse--when reading a large JSON object: you cannot begin processing the result until the entire result is received.

# A naïve solution

I believe a simple solution (simple from the perspective of a consumer of the library--the internal changes are not so simple) would be to add two interfaces:

    type StreamMarshaler interface {
        MarshalJSONStream(io.Writer) error
    }

    type StreamUnmarshaler interface {
        UnmarshalJSONStream(io.Reader) error
    }

During (un)marshaling, where `encoding/json` looks for `json.Marshaler` and `json.Unmarshaler` respectively, it will now look for (and possibly prefer) the new interfaces instead. Wrapping either the old or new interfaces to work as the other is a trivial matter.

With this change, and the requisite internal changes, it would be possible to begin streaming large JSON data to a server immediately, from within a `MarshalJSONStream()` implementation, for instance.

The drawback is that it violates the above mentioned promise of complete reads and writes, even with errors.

# Making it Opt-in

To accommodate this requirement, I believe it would be possible to expose the streaming functionality _only_ with the `json.Encoder` and `json.Decoder` implementations, and only when `SetDirect*` (name TBD, borrowed from https://go-review.googlesource.com/c/go/+/135595/8/src/encoding/json/stream.go#283) is enabled.  So further, the following two functions would be added to the public API:

    func (*Encoder) SetDirectWrite()

    func (*Decoder) SetDirectRead()

The default behavior, even when a type implements one of the new `Stream*` interfaces, will be to operate on an entire JSON object at once. That is to say, the Encoder will internally buffer `MarshalJSONStream`'s output, and process any error before continuing, and a decoder will read an entire JSON object into a buffer, then pass it to `UnmarshalJSONStream` only if there are no errors.

However, when `SetDirect*` is enabled, the library will bypass this internal buffering, allowing for immediate streaming to/from the source/destination.

Enabling streaming with the `SetDirect*` toggle could be enough to already experience a benefit for many users, even without the use of the additional interfaces above.

Toggling `SetDirect*` on will, of course, enable streaming for all types, not just those which implement the new interface above, so this could be considered a separate part of the proposal. In my opinion, this alone would be worth implementing, even if the new interface types above are done later or never.

# Internals

CLs 13818 and 135595 can serve as informative for this part of the discussion. I've also done some digging in the `encoding/json` package (as of 1.12) recently, for more current context.

A large number of internal changes will be necessary to allow for this. I started playing around with a few internals, and I believe this is doable, but will mean a lot of code churn, so will need to be done carefully, in small steps with good code review.

As an exercise, I have successfully rewritten`indent()` to work with streams, rather than on byte slices, and began doing the same with `compact()`. The `encodeState` type would need to work with a standard `io.Writer` rather than specifically a `bytes.Buffer`. This seems to be a bigger change, but not technically difficult. I know there are other changes needed--I haven't done a complete audit of the code.

An open question is how these changes might impact performance. My benchmarks after changing `indent()` showed no change in performance, but it wasn't a particularly rigorous test.

With the internals rewritten to support streams, then it's just a matter of doing the internal buffering at the appropriate place, such as at API boundaries (i.e. in `Marshal()` and `Unmarshal()`), rather than as a bulit-in fundamental concept.  Then, as described above, turning off that buffering when properly configured above.

# Final comments

To be clear, I am interested in working on this. I’m not just trying to throw out a “nice to have, now would somebody do this for me?” type of proposal. But I want to make sure I fully understand the history and context of this situation before I start too far down this rabbit hole.

I'm curious to hear the opinions of others who have been around longer. Perhaps such a proposal was already discussed (and possibly rejected?) in greater length than I can find in the above linked tickets. If so, please point me to the relevant conversation(s).

I am aware of several third-party libraries that offer some support like this, but most have various drawbacks (relying on code generation, or over-complex APIs). I would love to see this kind of support in the standard library.

If this general direction is approved, I think the first step is to break it into smaller parts that can be accomplished incrementally. I have given this thought, but so as not to jump the gun too much, will withhold my thoughts for a while, to allow proper discussion.

And one last aside: CL 13818 also added support for marshaling channels. That may or may not be a good idea (my personal feeling: probably not), but that can be addressed separately.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

proposal: encoding/json: opt-in for true streaming support #33714

Overview

The problem to be solved

A naïve solution

Making it Opt-in

Internals

Final comments

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

proposal: encoding/json: opt-in for true streaming support #33714

Description

Overview

The problem to be solved

A naïve solution

Making it Opt-in

Internals

Final comments

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions