internal/encoding: binary files are incorrectly treated as UTF-8 #3741

Closed
@nichtsundniemand

What version of CUE are you using (cue version)?

$ cue version
cue version (devel)

go version go1.23.5
      -buildmode exe
       -compiler gc
  DefaultGODEBUG asynctimerchan=1,gotypesalias=0,httpservecontentkeepheaders=1,tls3des=1,tlskyber=0,x509keypairleaf=0,x509negativeserial=1
     CGO_ENABLED 0
          GOARCH amd64
            GOOS linux
         GOAMD64 v1
cue.lang.version v0.12.0

Does this issue reproduce with the latest stable release?

Yes, it was found using the latest stable release.

What did you do?

I was working with some ELF files and trying to process them with CUE's embed feature. To embed them, I used type=binary.

As I do not want to include large binaries here, this is a minimal reproducer (which unfortunately cannot be packed into txtar format):

$ printf "\xf0" > invalid.bin
$ hexdump invalid.bin
0000000 00f0                                   
0000001
$ cat repro.cue
@extern(embed)

package repro
import (
	"list"
	"strings"
)

want: '\xf0'
got: '\xef\xbf\xbd'

invalid: bytes @embed(file=invalid.bin, type=binary)
length_check: len(invalid) & 1
content_check: invalid & want

invalid_length_check: len(invalid) & 3
invalid_content_check: invalid & got

bytelist: [for i in list.Range(0, len(invalid), 1) {strings.ByteAt(invalid, i)}]
$ cue eval
content_check: conflicting values '\xf0' and '�':
    
    ./repro.cue:9:7
    ./repro.cue:14:16
    ./repro.cue:14:26
length_check: conflicting values 1 and 3:
    ./repro.cue:13:15
    ./repro.cue:13:30

What did you expect to see?

I was expecting CUE to give me the file's contents verbatim.

What did you see instead?

As the example above shows, CUE's @embed() returns three bytes instead of just one.

Those three bytes are the Unicode "replacement character" (U+FFFD) encoded in UTF-8.

So CUE appears to pass the binary file through a UTF-8 decoder before handing the value to the evaluator.
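The effect can be reproduced outside CUE with the Go standard library alone. The following is a minimal sketch, not CUE's actual code: sanitizeUTF8 is a hypothetical helper that stands in for any lossy UTF-8 decoding step, and it shows how the single byte 0xf0 turns into the three bytes seen above.

```go
package main

import "fmt"

// sanitizeUTF8 is a hypothetical helper (not part of CUE) that mimics a
// lossy UTF-8 decoding step.
func sanitizeUTF8(b []byte) []byte {
	// Converting to []rune decodes the input as UTF-8; each invalid byte
	// becomes U+FFFD. Converting back re-encodes every rune as UTF-8.
	return []byte(string([]rune(string(b))))
}

func main() {
	raw := []byte{0xf0} // same content as invalid.bin
	out := sanitizeUTF8(raw)
	fmt.Printf("raw: % x (len %d)\n", raw, len(raw)) // raw: f0 (len 1)
	fmt.Printf("out: % x (len %d)\n", out, len(out)) // out: ef bf bd (len 3)
}
```

The lengths 1 and 3 match the two values reported in the length_check conflict.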

I was going through the relevant function and found an old comment foreseeing this problem:

For now we assume that all encodings require UTF-8. This will not be the case for some binary protocols. We need to exempt those explicitly here once we introduce them.

My attempt at fixing this issue does exactly that: instead of reading the bytes from the UTF-8 reader at l.265, I pass the raw file reader srcr to ReadAll(). This resolves the issue, at least in my case, so I created a small PR: #3740
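The idea behind the change can be sketched as follows. This is an illustration under stated assumptions, not the code from the PR: readFile, embed, and the binary flag are hypothetical names, and bytes.ToValidUTF8 stands in for whatever UTF-8 reader the decoder really uses. Only the name srcr is taken from the description above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// readFile is a hypothetical stand-in for the decoder's read step.
// For binary input it returns the bytes of srcr untouched; otherwise it
// sanitizes them the way a lossy UTF-8 reader would.
func readFile(srcr io.Reader, binary bool) ([]byte, error) {
	data, err := io.ReadAll(srcr)
	if err != nil {
		return nil, err
	}
	if binary {
		return data, nil // exempt binary data from UTF-8 handling
	}
	// Replace each run of invalid bytes with one U+FFFD.
	return bytes.ToValidUTF8(data, []byte("\uFFFD")), nil
}

// embed is a hypothetical helper that simulates opening a file whose
// content is given directly.
func embed(content []byte, binary bool) ([]byte, error) {
	srcr := bytes.NewReader(content)
	return readFile(srcr, binary)
}

func main() {
	for _, binary := range []bool{false, true} {
		out, err := embed([]byte{0xf0}, binary)
		if err != nil {
			panic(err)
		}
		fmt.Printf("binary=%v: % x\n", binary, out)
	}
}
```

With binary set to true the byte 0xf0 comes back verbatim, which is the behavior expected from type=binary.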

I also noticed that the Decoder is pretty much untested. I can offer to write some Go tests if they are wanted.
