The jl_rng_split function forks a task's RNG state in a way that is essentially
guaranteed to avoid collisions between the RNG streams of all tasks. The main
RNG is the xoshiro256++ RNG whose state is stored in rngState[0..3]. There is
also a small internal RNG used for task forking stored in rngState[4]. This
state is used to iterate an LCG (linear congruential generator), which is then
put through four different variations of the strongest PCG output function,
referred to as PCG-RXS-M-XS-64 [1]. This output function is invertible: it maps
a 64-bit state to a 64-bit output, which is one of the reasons it's not
recommended for general purpose RNGs unless space is at a premium, but in our
usage invertibility is actually a benefit, as is explained below.

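To make the invertibility claim concrete, here is a minimal sketch using the 8-bit member of the same output-function family, which is small enough to check exhaustively. The multiplier 217 and the shift amounts are assumptions recalled from the PCG reference code, and the helper name count_distinct_outputs is hypothetical. The 64-bit variant has the same three-step structure.

```c
#include <stdint.h>

/* PCG-RXS-M-XS at 8 bits: xorshift by a state-dependent amount, multiply by
 * an odd constant, then xorshift by a fixed amount. The constant 217 and the
 * shifts are assumed from the PCG reference code. */
uint8_t rxs_m_xs_8(uint8_t s)
{
    uint8_t w = (uint8_t)(((s >> ((s >> 6u) + 2u)) ^ s) * 217u);
    return (uint8_t)((w >> 6u) ^ w);
}

/* Count how many distinct outputs the 256 possible inputs produce.
 * An invertible function (a bijection) yields exactly 256. */
int count_distinct_outputs(void)
{
    int seen[256] = {0};
    int distinct = 0;
    for (int s = 0; s < 256; s++) {
        uint8_t out = rxs_m_xs_8((uint8_t)s);
        if (!seen[out]) {
            seen[out] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Each step is individually invertible (the top bits that select the shift are untouched by the shift itself, and multiplication by an odd number is invertible modulo a power of two), so every input maps to a unique output.
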
The goal of jl_rng_split is to perturb the state of each child task's RNG in
such a way that for an entire tree of tasks spawned starting with a given
state in a root task, no two tasks have the same RNG state. Moreover, we want to
do this in a way that is deterministic and repeatable based on (1) the root
task's seed, (2) how many random numbers are generated, and (3) the task tree
structure. The RNG state of a parent task is allowed to affect the initial RNG
state of a child task, but the mere fact that a child was spawned should not
alter the RNG output of the parent. This second requirement rules out using the
main RNG to seed children -- some separate state must be maintained and changed
upon forking a child task while leaving the main RNG state unchanged.

The basic approach is that used by the DotMix [2] and SplitMix [3] RNG systems:
each task is uniquely identified by a sequence of "pedigree" numbers, indicating
where in the task tree it was spawned. This vector of pedigree coordinates is
then reduced to a single value by computing a dot product with a common vector
of random weights. The DotMix paper provides a proof that this dot product hash
value (referred to as a "compression function") is collision resistant in the
sense that the pairwise collision probability of two distinct tasks is 1/N where
N is the number of possible weight values. Both DotMix and SplitMix use a prime
value of N because the proof requires that the difference between two distinct
pedigree coordinates must be invertible, which is guaranteed by N being prime.
We take a different approach: we limit pedigree coordinates to being binary --
when a task spawns a child, both tasks share the same pedigree prefix, with the
parent appending a zero and the child appending a one. This way a binary
pedigree vector uniquely identifies each task. Moreover, since the coordinates
are binary, the difference between distinct coordinates is always one (up to
sign), which is its own inverse regardless of whether N is prime or not. This
allows us to compute the dot product modulo 2^64 using native machine
arithmetic, which is considerably more efficient and simpler to implement than
arithmetic in a prime modulus. It also means that when accumulating the dot
product incrementally, as described in SplitMix, we don't need to multiply
weights by anything; we simply add the random weight for the current task tree
depth to the parent's dot product to derive the child's dot product.

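The incremental accumulation described above can be sketched as follows. This is an illustration only: the type pedigree_hash, the function pedigree_fork, and the fixed WEIGHTS table are hypothetical names and values, and a real implementation would generate the per-depth weights pseudorandomly rather than read them from a table.

```c
#include <stdint.h>

#define MAX_DEPTH 4

/* Hypothetical per-depth weights, fixed here only for illustration. */
static const uint64_t WEIGHTS[MAX_DEPTH] = {
    0x9e3779b97f4a7c15ULL, 0xbf58476d1ce4e5b9ULL,
    0x94d049bb133111ebULL, 0xd6e8feb86659fd93ULL,
};

typedef struct {
    uint64_t dot;   /* dot product of the binary pedigree with WEIGHTS, mod 2^64 */
    unsigned depth; /* number of pedigree coordinates appended so far */
} pedigree_hash;

/* Fork a child. The parent appends a 0 coordinate, so its dot product is
 * unchanged; the child appends a 1, so it gains the weight for the current
 * depth. Both end up one level deeper. The caller must keep depth below
 * MAX_DEPTH. */
pedigree_hash pedigree_fork(pedigree_hash *parent)
{
    pedigree_hash child;
    child.dot = parent->dot + WEIGHTS[parent->depth]; /* wraps mod 2^64 */
    child.depth = parent->depth + 1;
    parent->depth += 1;
    return child;
}
```

Forking twice from a root and once from the first child gives pedigrees 00, 10, 01 and 11, whose dot products are 0, WEIGHTS[0], WEIGHTS[1] and WEIGHTS[0] + WEIGHTS[1].
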
We use the LCG in rngState[4] to generate pseudorandom weights for the dot
product. Each time a child is forked, we update the LCG in both parent and
child tasks. In the parent, that's all we have to do -- the main RNG state
remains unchanged (recall that spawning a child should *not* affect subsequent
RNG draws in the parent). The next time the parent forks a child, the dot
product weight used will be different, corresponding to being a level deeper in
the binary task tree. In the child, we use the LCG state to generate four
pseudorandom 64-bit weights (more below) and add each weight to one of the
xoshiro256 state registers, rngState[0..3]. If we assume the main RNG remains
unused in all tasks, then each register rngState[0..3] accumulates a different
Dot/SplitMix dot product hash as additional child tasks are spawned. Each one is
collision resistant with a pairwise collision chance of only 1/2^64. Assuming
that the four pseudorandom 64-bit weight streams are sufficiently independent,
the pairwise collision probability for distinct tasks is 1/2^256. If we somehow
managed to spawn a trillion tasks, the probability of a collision would be on
the order of 1/10^54. Practically impossible. Put another way, this is the same
as the probability of two SHA256 hash values accidentally colliding, which we
generally consider so unlikely as not to be worth worrying about.

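The trillion-task figure can be reproduced with a short calculation. The sketch below uses a union bound over all pairs of tasks, which is an assumption about how the estimate was obtained; the function name collision_bound is hypothetical.

```c
/* Union-bound estimate of the chance that any two of n tasks collide when
 * each pair collides with probability 2^-bits: (n choose 2) / 2^bits. */
double collision_bound(double n, unsigned bits)
{
    double p = n * (n - 1.0) / 2.0; /* number of distinct pairs */
    for (unsigned i = 0; i < bits; i++)
        p /= 2.0; /* halving is exact while p stays a normal double */
    return p;
}
```

With n = 1e12 and bits = 256 this comes out at roughly 4e-54, consistent with the order of magnitude quoted above.
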
What about the random "junk" that's in the xoshiro256 state registers from
normal use of the RNG? For a tree of tasks spawned with no intervening samples
taken from the main RNG, all tasks start with the same junk which doesn't affect
the chance of collision. The Dot/SplitMix papers even suggest adding a random
base value to the dot product, so we can consider whatever happens to be in the
xoshiro256 registers to be that. What if the main RNG gets used between task
forks? In that case, the initial state registers will be different. The DotMix
collision resistance proof doesn't apply without modification, but we can
generalize the setup by adding a different base constant to each compression
function and observe that we still have a 1/N chance of the weight value
matching that exact difference. This proves collision resistance even between
tasks whose dot product hashes are computed with arbitrary offsets. We can
conclude that this scheme provides collision resistance even in the face of
different starting states of the main RNG. Does this seem too good to be true?
Perhaps another way of thinking about it will help. Suppose we seeded each task
completely randomly. Then there would also be a 1/2^256 chance of collision,
just as the DotMix proof gives. Essentially what the proof is telling us is that
if the weights are chosen uniformly and uncorrelated with the rest of the
compression function, then the dot product construction is a good enough way to
pseudorandomly seed each task. From that perspective, it's easier to believe
that adding an arbitrary constant to each seed doesn't worsen its randomness.

This leaves us with the question of how to generate four pseudorandom weights to
add to the rngState[0..3] registers at each depth of the task tree. The scheme
used here is that a single 64-bit LCG state is iterated in both parent and child
at each task fork, and four different variations of the PCG-RXS-M-XS-64 output
function are applied to that state to generate four different pseudorandom
weights. Another obvious way to generate four weights would be to iterate the
LCG four times per task split. There are two main reasons we've chosen to use
four output variants instead:

1. Advancing four times per fork reduces the set of possible weights that each
register can be perturbed by from 2^64 to 2^62, since each register would
only ever see every fourth LCG state. Since collision resistance is
proportional to the number of possible weight values, that would reduce
collision resistance.

2. It's easier to compute four PCG output variants in parallel. Iterating the
LCG is inherently sequential. Each PCG variant can be computed independently
from the LCG state. All four can even be computed at once with SIMD vector
instructions, but the compiler doesn't currently choose to do that.

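Putting the pieces together, a fork along these lines might look like the sketch below. It is not the actual jl_rng_split: the function name rng_split_sketch, the LCG multiplier, the additive offsets in ADD, and the multipliers in MUL are all illustrative assumptions, as is the choice to derive the weights from the LCG state before it is advanced. The only structural requirements reflected here are that the offsets are distinct and the multipliers are odd.

```c
#include <stdint.h>

/* Illustrative constants only, not necessarily those of the real function. */
static const uint64_t LCG_MUL = 0xd1342543de82ef95ULL; /* 1 mod 4: full period */
static const uint64_t ADD[4] = {                       /* distinct offsets */
    0x0123456789abcdefULL, 0x1f2e3d4c5b6a7988ULL,
    0x9e3779b97f4a7c15ULL, 0xc2b2ae3d27d4eb4fULL,
};
static const uint64_t MUL[4] = {                       /* odd multipliers */
    0xaef17502108ef2d9ULL, 0xf1357aea2e62a9c5ULL,
    0xda942042e4dd58b5ULL, 0x2545f4914f6cdd1dULL,
};

/* Fork: advance the fork LCG (slot 4) in both parent and child, and perturb
 * the child's four main registers (slots 0..3) by four output variants of the
 * same LCG state. The parent's main registers are left untouched. */
void rng_split_sketch(uint64_t child[5], uint64_t parent[5])
{
    uint64_t x = parent[4];
    parent[4] = child[4] = x * LCG_MUL + 1;
    for (int i = 0; i < 4; i++) {
        uint64_t p = x + ADD[i];   /* per-register additive variation */
        p ^= p >> ((p >> 59) + 5); /* RXS: state-dependent xorshift */
        p *= MUL[i];               /* M: multiply by an odd constant */
        p ^= p >> 43;              /* XS: fixed xorshift */
        child[i] = parent[i] + p;  /* add the weight, wrapping mod 2^64 */
    }
}
```

The four iterations of the loop do not depend on each other, which is what makes the variants computable in parallel, whereas stepping the LCG four times would form a dependency chain.
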
A key question is whether the approach of using four variations of PCG-RXS-M-XS
is sufficiently random both within and between streams to provide the collision
resistance we expect. We obviously can't test that with 256 bits, but we have
tested it with a reduced state analogue using four PCG-RXS-M-XS-8 output
variations applied to a common 8-bit LCG. Test results do indicate sufficient
independence: a single register has collisions at 2^5 while four registers only
start having collisions at 2^20, which is actually better scaling of collision
resistance than we expect in theory. In theory, with one byte of resistance we
have a 50% chance of some collision at about 20 tasks, which matches, but four
bytes give a 50% chance of collision at 2^17 and our (reduced size analogue)
construction is still collision free at 2^19. This may be due to the next
observation, which guarantees collision avoidance for certain shapes of task
trees as a result of using an invertible RNG to generate weights.

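The theoretical 50% points mentioned above follow from the standard birthday calculation, sketched here under the assumption that task states behave like independent uniform draws. The function name birthday_threshold is hypothetical.

```c
#include <stdint.h>

/* Smallest number of independent uniform draws from n_values possibilities
 * at which the chance of at least one repeated value exceeds 50%. */
uint64_t birthday_threshold(double n_values)
{
    double p_distinct = 1.0; /* probability that all draws so far differ */
    uint64_t draws = 0;
    while (p_distinct >= 0.5) {
        /* the next draw must avoid the `draws` values already taken */
        p_distinct *= 1.0 - (double)draws / n_values;
        draws++;
    }
    return draws;
}
```

For one byte of state (256 values) this gives 20 draws, and for four bytes (2^32 values) it gives a value between 2^16 and 2^17, in line with the figures quoted above.
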
In the specific case where a parent task spawns a sequence of child tasks with
no intervening usage of its main RNG, the parent and child tasks are actually
_guaranteed_ to have different RNG states. This is true because the four PCG
streams each produce every one of the 2^64 possible 64-bit outputs exactly once
in the full 2^64 period of the LCG generator. This is considered a weakness of
PCG-RXS-M-XS when used as a general purpose RNG, but is quite beneficial in this
application. Since each of up to 2^64 children will be perturbed by different
weights, they cannot have hash collisions. What about parent colliding with
child? That can only happen if all four main RNG registers are perturbed by
exactly zero. This seems unlikely, but could it occur? Consider this part of
each output function:

    p ^= p >> ((p >> 59) + 5);
    p *= m[i];
    p ^= p >> 43;

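For experimentation, the fragment above can be wrapped in a function. This is a small sketch in which the name mix is hypothetical and the parameter m stands in for m[i]; any odd multiplier behaves the same way for the purposes of this argument.

```c
#include <stdint.h>

/* The mixing steps quoted above as a function of the input p and an odd
 * multiplier m. Each step is invertible, so the whole map is a bijection
 * on 64-bit values. */
uint64_t mix(uint64_t p, uint64_t m)
{
    p ^= p >> ((p >> 59) + 5); /* 0 stays 0: shifting and xoring zeros */
    p *= m;                    /* 0 stays 0 under multiplication */
    p ^= p >> 43;              /* 0 stays 0 again */
    return p;
}
```

Because the map is a bijection that sends zero to zero, zero is the only input that produces a zero output.
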
It's easy to check that this maps zero to zero. A child RNG state identical to
the parent's can only happen if all four `p` values are zero at the end of this,
which, since the map is invertible, implies that they were all zero at the
beginning. However, that is impossible since the four `p` values differ from `x`
by different additive constants, so they cannot all be zero. Stated more
generally, the non-collision property is this: assuming the main RNG isn't used
between task forks, sibling and parent tasks cannot have RNG collisions. If the
task tree structure is more deeply nested or if there are intervening uses of
the main RNG, we're back to relying on "merely" 256 bits of collision
resistance, but it's nice to know that in what is likely the most common case,
RNG collisions are actually impossible. This fact may also explain
better-than-theoretical collision resistance observed in our experiment with a