Overview
I have long wanted proper streaming support in the encoding/json library. I’ve been doing some homework to understand the current state of things, and I think I’ve come to grips with most of it.
A number of previous issues relate to this topic: #7872, #11046, #12001, #14140
In a nutshell: the library implicitly guarantees that marshaling will never write an incomplete JSON object due to an error, and that unmarshaling will never pass an incomplete JSON message to UnmarshalJSON. This seems a reasonable, conservative default, but it is not always the desired behavior.
Work toward this has been done on a couple of occasions, but abandoned or stalled for various reasons. See https://go-review.googlesource.com/c/go/+/13818/ and https://go-review.googlesource.com/c/go/+/135595
See also my related post on golang-nuts: https://groups.google.com/d/msg/golang-nuts/ABD4fTkP4Nc/bliIAAAeAQAJ
The problem to be solved
Dealing with large JSON structures is inefficient, due to the internal buffering done by encoding/json. json.NewEncoder and json.NewDecoder appear to offer streaming benefits, but this is mostly an idiomatic advantage, not a performance one, as internal buffering still takes place.
To elaborate:
When encoding, even with json.Encoder, the entire object is marshaled into memory before it is written to the io.Writer. This proposal allows writing the JSON output immediately, rather than waiting for the entire process to complete successfully first.
The same problem occurs in reverse when reading a large JSON object: you cannot begin processing the result until the entire result has been received.
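As a rough illustration of the encoding half of this, the sketch below counts how many Write calls reach the underlying writer. The countingWriter type and the encodeAndCount helper are made up for this example; with the encoding/json implementation as I understand it, the whole value arrives in a single Write, which is consistent with the buffering described above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// countingWriter (hypothetical helper) records how many times Write is
// called and how many bytes it receives in total.
type countingWriter struct {
	calls int
	bytes int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	c.bytes += len(p)
	return len(p), nil
}

// encodeAndCount encodes a slice of n integers with json.Encoder and
// reports how the output reached the underlying writer.
func encodeAndCount(n int) (calls, total int, err error) {
	data := make([]int, n)
	for i := range data {
		data[i] = i
	}
	cw := &countingWriter{}
	err = json.NewEncoder(cw).Encode(data)
	return cw.calls, cw.bytes, err
}

func main() {
	calls, total, err := encodeAndCount(100000)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes delivered in %d Write call(s)\n", total, calls)
}
```

If the encoder streamed its output, one would expect many smaller writes instead of one large one.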
A naïve solution
I believe a simple solution (simple from the perspective of a consumer of the library--the internal changes are not so simple) would be to add two interfaces:
type StreamMarshaler interface {
	MarshalJSONStream(io.Writer) error
}
type StreamUnmarshaler interface {
	UnmarshalJSONStream(io.Reader) error
}
During (un)marshaling, where encoding/json looks for json.Marshaler and json.Unmarshaler respectively, it will now look for (and possibly prefer) the new interfaces instead. Wrapping either the old or new interfaces to work as the other is a trivial matter.
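To show what such wrapping could look like, here is a minimal sketch. The two interfaces are redeclared locally because they do not exist in encoding/json, and the adapter names (asStream, asBuffered, unmarshalAsStream) and the roundTrip helper are invented for this example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
)

// The proposed interfaces, declared locally (they are not in encoding/json).
type StreamMarshaler interface {
	MarshalJSONStream(io.Writer) error
}

type StreamUnmarshaler interface {
	UnmarshalJSONStream(io.Reader) error
}

// asStream adapts an existing json.Marshaler to the proposed interface.
type asStream struct{ m json.Marshaler }

func (a asStream) MarshalJSONStream(w io.Writer) error {
	b, err := a.m.MarshalJSON()
	if err != nil {
		return err
	}
	_, err = w.Write(b)
	return err
}

// asBuffered adapts a StreamMarshaler to json.Marshaler by collecting its
// output in memory first, which keeps the all-or-nothing behavior.
type asBuffered struct{ s StreamMarshaler }

func (a asBuffered) MarshalJSON() ([]byte, error) {
	var buf bytes.Buffer
	if err := a.s.MarshalJSONStream(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// unmarshalAsStream adapts a json.Unmarshaler to the proposed interface.
// It assumes r yields exactly one JSON value and then EOF.
type unmarshalAsStream struct{ u json.Unmarshaler }

func (a unmarshalAsStream) UnmarshalJSONStream(r io.Reader) error {
	b, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	return a.u.UnmarshalJSON(b)
}

// roundTrip pushes raw JSON through all three adapters.
func roundTrip(raw string) (string, error) {
	s := asStream{json.RawMessage(raw)}   // Marshaler -> StreamMarshaler
	b, err := json.Marshal(asBuffered{s}) // StreamMarshaler -> Marshaler
	if err != nil {
		return "", err
	}
	var out json.RawMessage
	err = unmarshalAsStream{&out}.UnmarshalJSONStream(bytes.NewReader(b))
	return string(out), err
}

func main() {
	fmt.Println(roundTrip(`{"a":1}`))
}
```

The adapters are short, but each one reintroduces a full in-memory copy, so they only help with compatibility, not with the buffering problem itself.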
With this change, and the requisite internal changes, it would be possible to begin streaming large JSON data to a server immediately, from within a MarshalJSONStream() implementation, for instance.
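A MarshalJSONStream() implementation along these lines might look like the following sketch. Record, Records, the Next generator field, and streamN are all hypothetical names chosen for illustration; the method writes each array element as soon as it is available, so the full collection never has to sit in memory.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// Record is a hypothetical element type.
type Record struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// Records emits a JSON array from a generator function.
type Records struct {
	Next func() (Record, bool) // returns false when exhausted
}

// MarshalJSONStream follows the method signature proposed above.
func (rs Records) MarshalJSONStream(w io.Writer) error {
	if _, err := io.WriteString(w, "["); err != nil {
		return err
	}
	first := true
	for {
		rec, ok := rs.Next()
		if !ok {
			break
		}
		if !first {
			if _, err := io.WriteString(w, ","); err != nil {
				return err
			}
		}
		first = false
		b, err := json.Marshal(rec) // each element is small; only it is buffered
		if err != nil {
			return err // earlier elements have already been written
		}
		if _, err := w.Write(b); err != nil {
			return err
		}
	}
	_, err := io.WriteString(w, "]")
	return err
}

// streamN streams n generated records into a string for demonstration.
func streamN(n int) (string, error) {
	i := 0
	rs := Records{Next: func() (Record, bool) {
		if i >= n {
			return Record{}, false
		}
		i++
		return Record{ID: i, Name: fmt.Sprintf("r%d", i)}, true
	}}
	var sb strings.Builder
	err := rs.MarshalJSONStream(&sb)
	return sb.String(), err
}

func main() {
	fmt.Println(streamN(2))
}
```

In real use the io.Writer would be something like an HTTP request body rather than a strings.Builder.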
The drawback is that it violates the above-mentioned promise of complete reads and writes, even with errors.
Making it Opt-in
To accommodate this requirement, I believe it would be possible to expose the streaming functionality only with the json.Encoder and json.Decoder implementations, and only when SetDirect* (name TBD, borrowed from https://go-review.googlesource.com/c/go/+/135595/8/src/encoding/json/stream.go#283) is enabled. To that end, the following two methods would be added to the public API:
func (*Encoder) SetDirectWrite()
func (*Decoder) SetDirectRead()
The default behavior, even when a type implements one of the new Stream* interfaces, will be to operate on an entire JSON object at once. That is to say, the Encoder will internally buffer MarshalJSONStream's output, and process any error before continuing, and a decoder will read an entire JSON object into a buffer, then pass it to UnmarshalJSONStream only if there are no errors.
However, when SetDirect* is enabled, the library will bypass this internal buffering, allowing for immediate streaming to/from the source/destination.
Enabling streaming with the SetDirect* toggle could be enough to already experience a benefit for many users, even without the use of the additional interfaces above.
Toggling SetDirect* on will, of course, enable streaming for all types, not just those which implement the new interface above, so this could be considered a separate part of the proposal. In my opinion, this alone would be worth implementing, even if the new interface types above are done later or never.
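The trade-off between the default and the opt-in mode can be sketched with a toy encoder. Everything here is hypothetical: Encoder is a stand-in for json.Encoder reduced to this one decision, SetDirectWrite uses the tentative name from above, and the failing type exists only to force an error mid-stream.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
)

// StreamMarshaler is the proposed interface, declared locally.
type StreamMarshaler interface {
	MarshalJSONStream(io.Writer) error
}

// Encoder is a hypothetical stand-in for json.Encoder.
type Encoder struct {
	w      io.Writer
	direct bool
}

func NewEncoder(w io.Writer) *Encoder { return &Encoder{w: w} }

// SetDirectWrite mirrors the proposed (name TBD) toggle.
func (e *Encoder) SetDirectWrite() { e.direct = true }

func (e *Encoder) Encode(v StreamMarshaler) error {
	if e.direct {
		// Opt-in: output goes straight to the destination.
		return v.MarshalJSONStream(e.w)
	}
	// Default: buffer first, so an error leaves the destination untouched.
	var buf bytes.Buffer
	if err := v.MarshalJSONStream(&buf); err != nil {
		return err
	}
	_, err := e.w.Write(buf.Bytes())
	return err
}

// failing writes the start of an array and then reports an error.
type failing struct{}

func (failing) MarshalJSONStream(w io.Writer) error {
	if _, err := io.WriteString(w, `[1,2,`); err != nil {
		return err
	}
	return errors.New("source failed mid-stream")
}

// run encodes a failing value and returns whatever reached the destination.
func run(direct bool) (string, error) {
	var out bytes.Buffer
	enc := NewEncoder(&out)
	if direct {
		enc.SetDirectWrite()
	}
	err := enc.Encode(failing{})
	return out.String(), err
}

func main() {
	for _, d := range []bool{false, true} {
		out, err := run(d)
		fmt.Printf("direct=%v wrote %q, err=%v\n", d, out, err)
	}
}
```

In the default mode nothing reaches the destination when an error occurs; in direct mode the partial output has already been sent, which is exactly the guarantee a caller gives up by opting in.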
Internals
CLs 13818 and 135595 are informative for this part of the discussion. I've also done some digging in the encoding/json package (as of 1.12) recently, for more current context.
A large number of internal changes will be necessary to allow for this. I started playing around with a few internals, and I believe this is doable, but will mean a lot of code churn, so will need to be done carefully, in small steps with good code review.
As an exercise, I have successfully rewritten indent() to work with streams, rather than on byte slices, and began doing the same with compact(). The encodeState type would need to work with a standard io.Writer rather than specifically a bytes.Buffer. This seems to be a bigger change, but not technically difficult. I know there are other changes needed--I haven't done a complete audit of the code.
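To give a feel for what a stream-based compact() could involve, here is a simplified sketch. It is not the code from my experiment or from either CL: compactWriter and compactChunks are hypothetical names, and unlike the real compact() this version does no validation or escaping. It only drops whitespace outside of strings, carrying its state across Write calls so that chunk boundaries can fall anywhere.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// compactWriter (hypothetical) removes insignificant JSON whitespace from
// the bytes passing through it. It assumes the input is valid JSON.
type compactWriter struct {
	w        io.Writer
	inString bool // currently inside a string literal
	escaped  bool // previous byte inside the string was a backslash
}

func (c *compactWriter) Write(p []byte) (int, error) {
	out := make([]byte, 0, len(p))
	for _, b := range p {
		switch {
		case c.inString:
			out = append(out, b)
			switch {
			case c.escaped:
				c.escaped = false
			case b == '\\':
				c.escaped = true
			case b == '"':
				c.inString = false
			}
		case b == ' ' || b == '\t' || b == '\n' || b == '\r':
			// Whitespace outside a string is dropped.
		default:
			if b == '"' {
				c.inString = true
			}
			out = append(out, b)
		}
	}
	if _, err := c.w.Write(out); err != nil {
		return 0, err
	}
	return len(p), nil
}

// compactChunks feeds the chunks through one compactWriter, one Write each.
func compactChunks(chunks ...string) (string, error) {
	var sb strings.Builder
	cw := &compactWriter{w: &sb}
	for _, ch := range chunks {
		if _, err := io.WriteString(cw, ch); err != nil {
			return "", err
		}
	}
	return sb.String(), nil
}

func main() {
	fmt.Println(compactChunks(`{ "a" : [1, `, `2] }`))
}
```

The only per-stream state is two booleans, which suggests memory use for this step need not grow with the size of the document.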
An open question is how these changes might impact performance. My benchmarks after changing indent() showed no change in performance, but it wasn't a particularly rigorous test.
With the internals rewritten to support streams, it's then just a matter of doing the internal buffering at the appropriate place, such as at API boundaries (i.e. in Marshal() and Unmarshal()), rather than as a built-in fundamental concept. That buffering can then be turned off when the Encoder or Decoder is configured with SetDirect*, as described above.
Final comments
To be clear, I am interested in working on this. I’m not just trying to throw out a “nice to have, now would somebody do this for me?” type of proposal. But I want to make sure I fully understand the history and context of this situation before I start too far down this rabbit hole.
I'm curious to hear the opinions of others who have been around longer. Perhaps such a proposal was already discussed (and possibly rejected?) in greater length than I can find in the above linked tickets. If so, please point me to the relevant conversation(s).
I am aware of several third-party libraries that offer some support like this, but most have various drawbacks (relying on code generation, or over-complex APIs). I would love to see this kind of support in the standard library.
If this general direction is approved, I think the first step is to break it into smaller parts that can be accomplished incrementally. I have given this thought, but so as not to jump the gun too much, will withhold my thoughts for a while, to allow proper discussion.
And one last aside: CL 13818 also added support for marshaling channels. That may or may not be a good idea (my personal feeling: probably not), but that can be addressed separately.