Description
Can we live without constants for dimension names in datasets (cf. https://github.com/pystatgen/sgkit/pull/10)?
Specifically, I mean these in api.py:
DIM_VARIANT = "variants"
DIM_SAMPLE = "samples"
DIM_PLOIDY = "ploidy"
DIM_ALLELE = "alleles"
DIM_GENOTYPE = "genotypes"
Using these to build/append a dataset involves something like this:
data_vars = { "variant/contig": ([DIM_VARIANT], variant_contig)}
What I'm wondering is if we're going to use constants, why stop at the dimension names? It could look like this:
data_vars = { f"{VAR_GROUP_VARIANT}/{VAR_NAME_CONTIG}": ([DIM_VARIANT], variant_contig)}
but I doubt any of us would prefer it. I think the two biggest advantages of the constants are:
1. Preventing typos
2. Allowing us to change the names
Of these, the second (changing the names) seems unlikely to be important (and we'll probably not use the constants in examples/documentation anyhow), and the first (preventing typos) might eventually be solved with things like python/typing#28 (comment) and pydata/xarray#3967. I'm not going to hold my breath for that, but I do think it's worth asking whether we would all prefer this instead:
data_vars = { "variant/contig": (["variants"], variant_contig)}
As a user, I think I would be happy to make my own constants or use Literal types (checked with mypy) if I wanted some static safety, and to just live with the risk of typos otherwise. As a contributor, I'm not so sure, but I'm leaning towards the latter, i.e. plain strings with no constants. Any thoughts?
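For what it's worth, here is a rough sketch of what the "Literal types" option could look like on the user side. `DimName` and `dims` are hypothetical names made up for illustration, not anything in sgkit or xarray, and `variant_contig` is placeholder data. The idea is that mypy should reject a misspelled dimension name statically, while the helper also checks at runtime for code that isn't type-checked.

```python
from typing import Literal, Tuple, get_args

# Hypothetical alias (not part of sgkit): the dimension names listed in api.py.
DimName = Literal["variants", "samples", "ploidy", "alleles", "genotypes"]


def dims(*names: DimName) -> Tuple[DimName, ...]:
    """Return the given dimension names, raising on any unknown name."""
    allowed = get_args(DimName)  # the tuple of permitted strings
    for name in names:
        if name not in allowed:
            raise ValueError(f"unknown dimension name: {name!r}")
    return names


variant_contig = [0, 0, 1]  # placeholder data for illustration only

# Plain strings at the call site; a type checker such as mypy is expected
# to flag dims("variant") because "variant" is not a member of DimName.
data_vars = {"variant/contig": (list(dims("variants")), variant_contig)}
```

This keeps examples and documentation readable as plain strings, and anyone who wants the extra safety can opt in without the library itself exporting constants.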