Add golden tests for the validation benchmarks

Currently the `validation` benchmarks don't seem to have any golden tests ensuring that what the scripts evaluate to doesn't change, unlike e.g. the `nofib` or `bitwise` benchmarks (and the `lists` ones at least have some property tests). We should fix that, otherwise an apparent optimization might just as well turn out to be a bug that changes the result of evaluation.

Does the same apply to the `marlowe` benchmarks?
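
For illustration, here is a rough sketch of what such golden tests could look like using `tasty-golden`. This is not how the existing `nofib` or `bitwise` tests are set up; `evaluateScript`, `scriptNames`, and the `golden/` directory layout are all hypothetical placeholders, and the real version would run the actual evaluator and pretty-print its result.

```haskell
module Main (main) where

import qualified Data.ByteString.Lazy.Char8 as BSL
import System.FilePath ((<.>), (</>))
import Test.Tasty (TestTree, defaultMain, testGroup)
import Test.Tasty.Golden (goldenVsString)

-- Hypothetical stand-in: a real implementation would load the named
-- validation script, evaluate it, and render the resulting term.
evaluateScript :: String -> IO String
evaluateScript name = pure ("<evaluation result of " ++ name ++ ">")

-- Hypothetical list of benchmark scripts to cover.
scriptNames :: [String]
scriptNames = ["example-script"]

-- Where the expected output for a script is stored (assumed layout).
goldenPath :: String -> FilePath
goldenPath name = "golden" </> name <.> "golden"

-- One golden test per script: the test fails if the rendered evaluation
-- result differs from the contents of the stored golden file.
goldenEval :: String -> TestTree
goldenEval name =
  goldenVsString name (goldenPath name) (BSL.pack <$> evaluateScript name)

main :: IO ()
main =
  defaultMain $
    testGroup "validation golden tests" (map goldenEval scriptNames)
```

With a setup along these lines, an intentional change in evaluation results would be recorded by rerunning the test suite with `--accept` and reviewing the diff of the golden files, while an unintentional one would show up as a test failure.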