
Description
This may be riiiight on the edge of the entropy ceiling, but I think the gains outweigh the losses.
First-class SIMD support is great, but currently it's limited to the case of one-dimensional element-wise operations -- more complex tensor architecture support is on the horizon (and of course GPUs have had it for years), and we don't want to be left in the dust. Furthermore, the premise of fitting a vector into any syntactic role that may be filled by a scalar leads to ambiguity in some cases.
This issue addresses both of these problems.
## Isotropic SIMD
I propose an extension to #6771's syntax: allowing multiple indices between bars, for arbitrary-rank tensors. For instance, a 4x4 matrix of floats is `[|4, 4|]f32`, and a rendering surface might be `[3][|1920, 1080|]u8`. Note that these need not correspond to hardware vectors, although when that is possible the compiler will guarantee it -- a tensor is a syntactic construct, in that it allows one written value to represent multiple parallel values. In particular, a scalar will coerce to a tensor populated with copies of it, whereas an array will not. The compiler may choose to represent tensors in whatever rank order it sees fit, to best optimise for hardware SIMD operation -- this is the primary motivation for condensing multiple indices into one functor.
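To make the coercion rule concrete, here is a sketch in the proposed syntax (the variable names and shapes are illustrative, not part of the proposal):

```zig
// A rank-2 tensor of sixteen floats; the compiler chooses the layout.
var m: [|4, 4|]f32 = undefined;

// A scalar coerces to a tensor of copies, so this sets every element:
m = 1.5;

// An array does NOT coerce -- this would be a compile error:
// m = [4]f32{ 1, 2, 3, 4 };
```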
## Index Operations
The expressive power of tensors highlights a potential syntactic ambiguity also inherent in regular vectors -- does an operation apply to the tensor as a whole, or to its elements individually? For instance, if `locs` is a vector of multi-pointers, is `locs[3]` the fourth element of `locs`, or a gather vector (#3575)? Whichever we choose, how do we represent the other?

I propose that any index of a tensor be taken as an index into the tensor, so in the above case `locs[3]` is unambiguously the fourth element of `locs`. An index into a tensor must contain the appropriate number of indices. In cases where index operations on SIMD values are desired, some or all indices may simply be left out: `locs[][3]` is unambiguously a gather vector. After one index, a tensor behaves syntactically like its child type, regardless of how many indices were actually collapsed -- it is compatible with primitive operators in expressions with scalars or other tensors of the same shape.
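As a sketch of the disambiguation rule, assuming `locs` is a vector of eight multi-pointers (the types and shapes here are illustrative):

```zig
var locs: [|8|][*]const u32 = undefined;

const p = locs[3];   // full index: the fourth multi-pointer itself
const g = locs[][3]; // empty index skips the tensor axis: gathers
                     // element 3 through each pointer, giving a [|8|]u32
```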
By this same method, we may have operations on arbitrary planes of arbitrary-rank tensors: for instance, `a[i,] * b[,j]` is the vector resulting from pairwise multiplying the elements of row `i` of matrix `a` by column `j` of matrix `b`. Non-collapsed indices are paired in order of listing. Such planed tensors may be lvalues or rvalues.
## Iteration
`for` shall be expanded to operate on tensors. If an index is captured, all indices must be captured -- that is, if `a` is a matrix, then `for (a) |e|` and `for (a) |e, i, j|` are both allowed, but `for (a) |e, i|` is not. (Unwanted indices may be ignored with `_`, as usual.)
```zig
fn matrixMultiply(comptime T: type, comptime n: usize, a: [|n, n|]T, b: [|n, n|]T) [|n, n|]T {
    var acc: [|n, n|]T = undefined;
    // Each element is the dot product of row i of `a` with column j of `b`.
    for (acc) |*e, i, j| e.* = @reduce(.Add, a[i,] * b[,j]);
    return acc;
}
```

```zig
fn transpose(in: anytype) anytype {
    const Child = @TypeOf(in).child;
    const shape = @TypeOf(in).shape; // `shape` is the array of dimensions
    var result: [|shape[1], shape[0]|]Child = undefined;
    for (result) |*e, i, j| e.* = in[j, i];
    return result;
}
```
The compiler guarantees that if a `for` map over a tensor does not mutate external state or call non-inline functions, it will be vectorised along relevant axes (that is, entire rows or columns of the iteration will condense to single machine operations) -- when iteration order is significant, minor indices will be incremented first. If #6965 is accepted, we may abstract over the common case of initialising a tensor with a map. (Crazy idea: such an abstraction could be made rank-polymorphic if we had a way to capture all indices at once -- see #6965 (comment).)
## Gather/Scatter
Any syntactic single-index operation (array, multi-pointer, slice, vector) may take an integer tensor instead of an integer. The result of the operation is a tensor of the same shape as the index, populated by the results of indexing the object with each of the elements. Gathering/scattering a higher-rank tensor would lead to ambiguity and is hence not allowed directly -- to do this, first eliminate any other indices: `(a[,y,])[b]` (the parentheses are necessary to apply the gather/scatter to the whole plane and not to its elements).
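For instance, under these rules (the names and shapes are illustrative, and I'm assuming scatter is written as assignment through the same indexing form):

```zig
var data: [16]f32 = undefined;
var idx: [|2, 3|]u32 = undefined;

const gathered = data[idx]; // a [|2, 3|]f32: data[idx[i, j]] at each i, j
data[idx] = gathered * 2.0; // scatter: the tensor index as an lvalue
```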
## Coda
There is a danger, in any SIMD formalism, of making things too dense and "magical" (see: APL). I've tried hard to avoid that here -- I believe this proposal strikes the right balance of concision and expressiveness, and that it has just the right level of restriction to be legible without requiring a plethora of special cases to do anything useful. One downside is that it is abstracted to some degree from the hardware -- however, I believe this is inevitable, given the varying capabilities of SIMD hardware present and future. It will at least be easier to optimise than hand-rolled loops, and in cases where performance comes before portability, intrinsic libraries are still available (#5241 even allows passing vectors directly to assembly).