Integer and Float schema types fail after creation #219

Closed
mina-asham opened this issue Feb 24, 2025 · 5 comments · Fixed by #220


@mina-asham

mina-asham commented Feb 24, 2025

When supplying an Avro schema with the option to auto-create the BigQuery table enabled, the connector maps:

  • the Avro int type -> BigQuery INTEGER
  • the Avro float type -> BigQuery FLOAT

This is fine on its own. However, on any subsequent run, when the schema of the existing table is mapped back from BigQuery to Avro to validate it against the supplied Avro schema, the mapper produces:

  • BigQuery INTEGER -> the Avro long type
  • BigQuery FLOAT -> the Avro double type

This makes sense: BigQuery has no 32-bit integer or single-precision float type, so upcasting is the only safe mapping back. But the connector should then refuse to create schemas containing int/float in the first place (or do something smarter, such as storing the original Avro schema in the table's description so it can be recovered later).
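The lossy round trip can be sketched with plain Java maps. The mappings mirror the behavior described above; the class and constant names are illustrative, not the connector's actual code:

```java
import java.util.Map;

public class RoundTrip {
    // Avro -> BigQuery mapping applied on table creation.
    static final Map<String, String> AVRO_TO_BQ = Map.of(
            "int", "INTEGER",
            "long", "INTEGER",
            "float", "FLOAT",
            "double", "FLOAT");

    // BigQuery -> Avro mapping applied on later validation runs.
    // BigQuery only stores 64-bit numeric types, so it can only upcast.
    static final Map<String, String> BQ_TO_AVRO = Map.of(
            "INTEGER", "long",
            "FLOAT", "double");

    public static void main(String[] args) {
        for (String avroType : new String[] {"int", "long", "float", "double"}) {
            String bqType = AVRO_TO_BQ.get(avroType);
            String roundTripped = BQ_TO_AVRO.get(bqType);
            System.out.printf("%s -> %s -> %s (%s)%n",
                    avroType, bqType, roundTripped,
                    avroType.equals(roundTripped) ? "matches" : "MISMATCH");
        }
    }
}
```

Only long and double survive the round trip; int and float come back as their wider counterparts and fail schema validation.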

For now, we wrote a custom schema checker that disallows Avro int/float, since validation will otherwise always fail on subsequent runs, but this check really belongs in the connector.

Here is the validation snippet we currently use:

import com.google.common.base.Preconditions;
import org.apache.avro.Schema;
import org.apache.avro.Schema.Type;

void validateSchema(String fieldName, Schema schema) {
    Type type = schema.getType();

    if (type == Type.RECORD) {
        schema.getFields().forEach(field -> validateSchema(field.name(), field.schema()));
    } else if (type == Type.ARRAY) {
        validateSchema(fieldName, schema.getElementType());
    } else if (type == Type.UNION) {
        schema.getTypes().forEach(unionSchema -> validateSchema(fieldName, unionSchema));
    } else {
        // In BigQuery, Avro int/long both map to INTEGER and float/double both map to FLOAT.
        // The connector then fails to validate the schema when mapping the BigQuery type
        // back to Avro, since it can only assume the larger Avro type was used, so we need
        // to stop people from using int/float.
        // https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions
        Preconditions.checkArgument(
                type != Type.INT, "Field: %s, INT is not supported for BQ, please use LONG", fieldName);
        Preconditions.checkArgument(
                type != Type.FLOAT, "Field: %s, FLOAT is not supported for BQ, please use DOUBLE", fieldName);

        // Curated list of the Avro types we allow.
        Preconditions.checkArgument(
                type == Type.NULL
                        || type == Type.BYTES
                        || type == Type.STRING
                        || type == Type.BOOLEAN
                        || type == Type.LONG
                        || type == Type.DOUBLE,
                "Field: %s, %s is not supported for BQ",
                fieldName,
                type);
    }
}
@jayehwhyehentee
Collaborator

This is a valid issue, and I agree that disallowing int/float would be a clean solution.
However, it would be a breaking change for our users, and we do not want that so close to the connector's first major release, 1.0.0 (targeted for this month).
That being said, I'll add a callout in the README to advise users against int/float. This issue will remain open until we properly address it in v2.

@mina-asham
Author

I disagree that it's a breaking change, for two reasons:

  • anyone currently using int/float is already broken: they would have to change their schemas to long/double anyway, and in the meantime the connector fails silently for them without their knowing
  • the connector is still on 0.x versions, so it's fair and expected for a breaking change to land in the 1.0 release

@jayehwhyehentee
Collaborator

anyone currently using int/float is already broken: they would have to change their schemas to long/double anyway, and in the meantime the connector fails silently for them

Not entirely true. This does not impact users who create a new table on every run.
Anywho, if table auto-creation is used, then the "subsequent run" could also be a failure recovery. So it's true that the issue is not ignorable.
I guess we can remove int/float until the connector properly supports them.

@mina-asham
Author

I guess we can remove int/float until the connector properly supports them.

I think that's the right path too: restrict in v1, solve properly in v2 (e.g., by encoding the schema in the table metadata/description).

@jayehwhyehentee
Collaborator

Got a better solution: #220. We internally upcast int/float to long/double, without limiting users to the larger types.
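That fix can be sketched as a normalization pass over the Avro types before any comparison with the schema read back from BigQuery. This is a minimal stand-in, not the actual #220 implementation; the AvroType enum and method names are illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;

public class Upcast {
    // Simplified stand-in for Avro's Schema.Type.
    enum AvroType { NULL, BOOLEAN, INT, LONG, FLOAT, DOUBLE, BYTES, STRING }

    // Upcast the 32-bit numeric types to their 64-bit counterparts,
    // matching what BigQuery hands back (INTEGER -> long, FLOAT -> double).
    static AvroType upcast(AvroType type) {
        switch (type) {
            case INT:   return AvroType.LONG;
            case FLOAT: return AvroType.DOUBLE;
            default:    return type;
        }
    }

    // Applied to every leaf of a schema before validation, so a user's
    // int/float fields compare equal to the long/double fields that the
    // BigQuery-to-Avro mapping produces on subsequent runs.
    static List<AvroType> normalize(List<AvroType> fieldTypes) {
        return fieldTypes.stream().map(Upcast::upcast).collect(Collectors.toList());
    }
}
```

Users keep writing int/float in their schemas; only the connector's internal view is widened, so both sides of the comparison agree.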
