Skip to content

Constants for dimensions and variables #17

Closed
@eric-czech

Description

@eric-czech

Can we live without constants for dimension names in datasets (cf. https://github.com/pystatgen/sgkit/pull/10)?

Specifically I mean these in api.py:

DIM_VARIANT = "variants"
DIM_SAMPLE = "samples"
DIM_PLOIDY = "ploidy"
DIM_ALLELE = "alleles"
DIM_GENOTYPE = "genotypes"

Using these to build/append a dataset involves something like this:

data_vars = { "variant/contig": ([DIM_VARIANT], variant_contig)}

What I'm wondering is if we're going to use constants, why stop at the dimension names? It could look like this:

data_vars = { f"{VAR_GROUP_VARIANT}/{VAR_NAME_CONTIG}": ([DIM_VARIANT], variant_contig)}

but I doubt any of us would prefer it. I think the two biggest advantages of the constants are:

  1. Preventing typos
  2. Allowing us to change the names

Of these, 2 seems unlikely to be important (and we'll probably not use the constants in examples/documentation anyhow) and 1 might eventually be solved with things like python/typing#28 (comment) and pydata/xarray#3967. I'm not going to hold my breath for that, but I do think it's worth asking whether or not we would all prefer this instead:

data_vars = { "variant/contig": (["variants"], variant_contig)}

As a user, I think I would be happy to make my own constants or use mypy Literal types if I wanted some static safety and to just live with the risk of typos otherwise. As a contributor, I'm not so sure but I'm leaning towards preferring the latter. Any thoughts?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions