Integer and Float schema types fail after creation #219

Closed
mina-asham opened this issue Feb 24, 2025 · 5 comments · Fixed by #220


@mina-asham

mina-asham commented Feb 24, 2025

When supplying an Avro schema with the option to auto-create the BigQuery table enabled, the connector maps:

  • the Avro int type -> BigQuery INTEGER
  • the Avro float type -> BigQuery FLOAT

This is fine on its own. However, on any subsequent run, when the schema of the existing table is mapped back from BigQuery to Avro to validate it against the supplied Avro schema, the mapper produces:

  • BigQuery INTEGER -> the Avro long type
  • BigQuery FLOAT -> the Avro double type

This makes sense: BigQuery has no 32-bit integer or single-precision float type, so upcasting is the only safe mapping back. But the connector should then refuse to create schemas containing int/float in the first place (or do something smarter, such as storing the original Avro schema in the table's description so it can be recovered later).
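The lossy round trip can be sketched with plain Java maps. The mappings mirror the behavior described above; the class and constant names are illustrative, not the connector's actual code:

```java
import java.util.Map;

public class RoundTrip {
    // Avro -> BigQuery mapping applied on table creation.
    static final Map<String, String> AVRO_TO_BQ = Map.of(
            "int", "INTEGER",
            "long", "INTEGER",
            "float", "FLOAT",
            "double", "FLOAT");

    // BigQuery -> Avro mapping applied on later validation runs.
    // BigQuery only stores 64-bit numeric types, so it can only upcast.
    static final Map<String, String> BQ_TO_AVRO = Map.of(
            "INTEGER", "long",
            "FLOAT", "double");

    public static void main(String[] args) {
        for (String avroType : new String[] {"int", "long", "float", "double"}) {
            String bqType = AVRO_TO_BQ.get(avroType);
            String roundTripped = BQ_TO_AVRO.get(bqType);
            System.out.printf("%s -> %s -> %s (%s)%n",
                    avroType, bqType, roundTripped,
                    avroType.equals(roundTripped) ? "matches" : "MISMATCH");
        }
    }
}
```

Only long and double survive the round trip; int and float come back as their wider counterparts and fail schema validation.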

For now, we wrote a custom schema checker that disallows Avro int/float, since validation will otherwise always fail on subsequent runs, but this check really belongs in the connector.

Here is the validation snippet we currently use:

import com.google.common.base.Preconditions;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;

void validateSchema(String fieldName, Schema schema) {
    Type type = schema.getType();

    if (type == Type.RECORD) {
        schema.getFields().forEach(field -> validateSchema(field.name(), field.schema()));
    } else if (type == Type.ARRAY) {
        validateSchema(fieldName, schema.getElementType());
    } else if (type == Type.UNION) {
        schema.getTypes().forEach(unionSchema -> validateSchema(fieldName, unionSchema));
    } else {
        // In BigQuery, Avro int/long both map to INTEGER and float/double both map to FLOAT.
        // The connector then fails to validate the schema when mapping the BigQuery type
        // back to Avro, since it can only assume the larger Avro type was used, so we need
        // to stop people from using int/float.
        // https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions
        Preconditions.checkArgument(
                type != Type.INT, "Field: %s, INT is not supported for BQ, please use LONG", fieldName);
        Preconditions.checkArgument(
                type != Type.FLOAT, "Field: %s, FLOAT is not supported for BQ, please use DOUBLE", fieldName);

        // Curated list of the Avro types we allow.
        Preconditions.checkArgument(
                type == Type.NULL
                        || type == Type.BYTES
                        || type == Type.STRING
                        || type == Type.BOOLEAN
                        || type == Type.LONG
                        || type == Type.DOUBLE,
                "Field: %s, %s is not supported for BQ",
                fieldName,
                type);
    }
}
@jayehwhyehentee
Collaborator

This is a valid issue, and I agree that disallowing int/float would be a clean solution.
However, it would be a breaking change for our users, and we do not want that so close to the connector's first major release, 1.0.0 (targeted for this month).
That being said, I'll add a callout in the README to advise users against int/float. This issue will remain open until we properly address it in v2.

@mina-asham
Author

I disagree that it's a breaking change, for two reasons:

  • anyone currently using int/float is already broken: they would have to change their schemas to long/double anyway, and in the meantime the connector fails silently for them without their knowing
  • the connector is still on 0.x versions, so it's fair and expected for a breaking change to land in the 1.0 release

@jayehwhyehentee
Collaborator

anyone currently using int/float is already broken: they would have to change their schemas to long/double anyway, and in the meantime the connector fails silently for them

Not entirely true. This does not impact users who create a new table on every run.
Anywho, if table auto-creation is used, then the "subsequent run" could also be a failure recovery. So it's true that the issue is not ignorable.
I guess we can remove int/float until the connector properly supports them.

@mina-asham
Author

I guess we can remove int/float until the connector properly supports them.

I think that's the right path too: restrict in v1, solve properly in v2 (e.g., by encoding the schema in the table metadata/description).

@jayehwhyehentee
Collaborator

Got a better solution: #220. We internally upcast int/float to long/double, without limiting users to the larger types.
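That fix can be sketched as a normalization pass over the Avro types before any comparison with the schema read back from BigQuery. This is a minimal stand-in, not the actual #220 implementation; the AvroType enum and method names are illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;

public class Upcast {
    // Simplified stand-in for Avro's Schema.Type.
    enum AvroType { NULL, BOOLEAN, INT, LONG, FLOAT, DOUBLE, BYTES, STRING }

    // Upcast the 32-bit numeric types to their 64-bit counterparts,
    // matching what BigQuery hands back (INTEGER -> long, FLOAT -> double).
    static AvroType upcast(AvroType type) {
        switch (type) {
            case INT:   return AvroType.LONG;
            case FLOAT: return AvroType.DOUBLE;
            default:    return type;
        }
    }

    // Applied to every leaf of a schema before validation, so a user's
    // int/float fields compare equal to the long/double fields that the
    // BigQuery-to-Avro mapping produces on subsequent runs.
    static List<AvroType> normalize(List<AvroType> fieldTypes) {
        return fieldTypes.stream().map(Upcast::upcast).collect(Collectors.toList());
    }
}
```

Users keep writing int/float in their schemas; only the connector's internal view is widened, so both sides of the comparison agree.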
