Mixed case identifier support #24551


Open · wants to merge 5 commits into base: master from the mixed-case branch

Conversation

agrawalreetika (Member) commented Feb 13, 2025

Description

Improves identifier handling (schema, table) to align with SQL standards for better compatibility with case-sensitive and case-normalizing databases, while minimizing SPI-breaking changes.

Motivation and Context

RFC details - prestodb/rfcs#36

  • This PR focuses on generalizing schema and table name handling to better align with SQL standards. As per the current API behavior, identifiers are lowercased by default unless explicitly handled (RFC reference).

  • Column name support and related changes will be addressed in a follow-up PR. Currently, column names are lowercased at the SPI level (ColumnMetadata.java#L45). Removing this generic lowercase conversion will require updates to normalize column names via the metadata API in each connector.
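The normalization contract described above can be illustrated with a minimal, self-contained sketch. The real SPI entry point in this PR is ConnectorMetadata.normalizeIdentifier(ConnectorSession, String); the class and constructor flag below are hypothetical stand-ins for how a connector decides between the historical lowercasing and case-sensitive pass-through.

```java
import java.util.Locale;

// Hypothetical stand-in for connector-level identifier normalization.
class IdentifierNormalizer {
    private final boolean caseSensitiveNameMatching;

    IdentifierNormalizer(boolean caseSensitiveNameMatching) {
        this.caseSensitiveNameMatching = caseSensitiveNameMatching;
    }

    // Historical default: identifiers are lowercased (English locale).
    // With case-sensitive-name-matching=true the original casing is kept.
    String normalizeIdentifier(String identifier) {
        return caseSensitiveNameMatching
                ? identifier
                : identifier.toLowerCase(Locale.ENGLISH);
    }
}
```

The point of the SPI change is that this decision moves behind the connector's metadata interface instead of being a hard-coded lowercase at the engine/SPI boundary.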

Impact


Analysis time for EXPLAIN on TPC-H queries (tpch-sf100), measured on a local Mac with an 8 GB JVM, comparing master with the PR changes: [benchmark image]

Test Plan

  • Existing unit tests pass
  • Added MySQL support, with new unit tests covering MySQL when mixed-case support is enabled

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add mixed case support for schema and table names.

JDBC Driver Changes
* Add catalog property ``case-sensitive-name-matching`` to the JDBC connector for mixed-case support.

MySQL Connector Changes
* Add support for mixed-case identifiers in MySQL. It can be enabled by setting ``case-sensitive-name-matching=true`` in the catalog configuration.
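For reference, a MySQL catalog properties file with the new flag enabled might look like the sketch below. The file path, host, and credentials are illustrative; only ``connector.name`` and ``case-sensitive-name-matching`` come from this PR's context.

```properties
# etc/catalog/mysql.properties (illustrative path and connection details)
connector.name=mysql
connection-url=jdbc:mysql://example.net:3306
connection-user=root
connection-password=secret
# Preserve the original case of schema and table names (added by this PR)
case-sensitive-name-matching=true
```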

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Feb 13, 2025
@agrawalreetika agrawalreetika self-assigned this Feb 13, 2025
@agrawalreetika agrawalreetika force-pushed the mixed-case branch 6 times, most recently from 6f127fc to e2f23fd Compare February 17, 2025 15:49
@agrawalreetika agrawalreetika force-pushed the mixed-case branch 10 times, most recently from 1ad263b to bcb3a7c Compare February 25, 2025 09:13
@agrawalreetika agrawalreetika force-pushed the mixed-case branch 3 times, most recently from 0e0c8be to 8e65855 Compare March 8, 2025 16:07
@agrawalreetika agrawalreetika force-pushed the mixed-case branch 3 times, most recently from 8f48ca4 to 63944ed Compare March 15, 2025 20:19
@prestodb-ci (Contributor): @ethanyzhang imported this issue into IBM GitHub Enterprise

@agrawalreetika agrawalreetika force-pushed the mixed-case branch 3 times, most recently from b0a5dd9 to 76cb96e Compare March 28, 2025 10:02
@prestodb-ci (Contributor): @ethanyzhang imported this issue into IBM GitHub Enterprise

@aaneja (Contributor) left a comment

Flush 1 - Finished looking at code. Looking at tests next

if (parts.size() > 2) {
throw new SemanticException(INVALID_SCHEMA_NAME, node, "Too many parts in schema name: %s", schema.get());
}
if (parts.size() == 2) {
catalogName = parts.get(0);
catalogName = parts.get(0).toLowerCase(ENGLISH);
Contributor

Why was toLowerCase added here?

@@ -1478,6 +1521,17 @@ public void addConstraint(Session session, TableHandle tableHandle, TableConstra
ConnectorMetadata metadata = getMetadataForWrite(session, connectorId);
metadata.addConstraint(session.toConnectorSession(connectorId), tableHandle.getConnectorHandle(), tableConstraint);
}
@Override
Contributor

nit: Add an empty line above

for (ConnectorId connectorId : catalogMetadata.listConnectorIds()) {
ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);
ConnectorSession connectorSession = session.toConnectorSession(connectorId);
metadata.listTables(connectorSession, prefix.getSchemaName()).stream()
.map(convertFromSchemaTableName(prefix.getCatalogName()))
.filter(prefix::matches)
.filter(name -> prefix.matches(new QualifiedObjectName(name.getCatalogName(),
Contributor

This is a bit confusing. prefix is supplied by the user and is NOT normalized. name is the output of listTables and should already be normalized? If so, shouldn't we instead normalize the prefix and then call matches?

agrawalreetika (Member, Author) commented Apr 29, 2025

prefix holds the catalog.schema mapping that resulted from listSchemaNames, on which listTables needs to be called. So I think the original value should be passed to the listTables API here.

@ZacBlanco (Contributor) left a comment

Overall, I think this looks good, but I have two concerns

  1. I am paranoid about the overhead that normalizeIdentifiers could introduce. There are quite a few layers of function calls to get into a plugin's ConnectorMetadata instance before hitting normalizeIdentifiers, including setting up the ContextClassLoader for each metadata call. If we have a large query, or have to call many metadata functions during query planning, we might see some small performance hit on the query analysis times. I don't really know how small the impact would be, but it would be nice to verify that we don't have any regressions.

  2. There are a lot of locations where we call normalizeIdentifier in the current approach. However, the source of identifiers all comes from one place: the user query. It seems like with the current solution it is possible to call normalizeIdentifier many times for a single identifier. Do you think it would be possible to move the normalization of identifiers closer to the query analysis phase in order to not have to run it again? Theoretically, we should only need to call it once per identifier in the query, right?
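One way to sketch the once-per-identifier idea from point 2 is to memoize the normalization per query. This is illustrative only, not code from the PR; the class and method names below are hypothetical stand-ins for wherever such a cache would live in the analyzer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical per-query cache: each distinct identifier string is
// normalized once, and repeated lookups reuse the cached result.
class MemoizingNormalizer {
    private final Map<String, String> cache = new HashMap<>();
    private final UnaryOperator<String> delegate;

    MemoizingNormalizer(UnaryOperator<String> delegate) {
        this.delegate = delegate;
    }

    // Returns the cached result when the identifier was seen before, so the
    // (potentially costly) connector normalization runs once per identifier.
    String normalize(String identifier) {
        return cache.computeIfAbsent(identifier, delegate);
    }
}
```

The delegate here stands in for the connector call; the cache would have to be scoped to a single query, since different catalogs can normalize differently.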

@@ -110,11 +111,14 @@ public ConnectorTransactionHandle getTransactionHandleFor(ConnectorId connectorI

public ConnectorId getConnectorId(Session session, QualifiedObjectName table)
{
if (table.getSchemaName().equals(INFORMATION_SCHEMA_NAME)) {
if (table.getSchemaName().equalsIgnoreCase(INFORMATION_SCHEMA_NAME)) {
Contributor

why ignore case now?

agrawalreetika (Member, Author)

> [quoting @ZacBlanco's two concerns above]

Thanks for your review @ZacBlanco

  1. I will try to take some benchmarks with and w/o normalizeIdentifiers and post the results.
  2. In the current state, lowercase conversion is done for user queries (SELECT, DML) as well as for metadata queries. So in the current approach normalizeIdentifiers is called even for metadata calls (for example, SHOW TABLES) to preserve the existing lowercase conversion when no specific casing is done at the connector level. I think there are some places in ConnectorMetadata where the normalizeIdentifiers call can be reduced because the QualifiedObjectName is already normalized; I will make those changes.

aaneja (Contributor) commented Apr 28, 2025

I will try to take some benchmarks with and w/o normalizeIdentifiers and post the results.

YMMV, but adding a session RuntimeMetric should show the impact pretty well, both in terms of count (times called) and wall-clock time
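The measurement pattern being suggested (a count plus accumulated wall-clock time around each call) can be sketched without Presto at all. Presto's RuntimeStats/RuntimeMetric provide this for real; the class below is only an illustrative stand-in showing what such a metric accumulates.

```java
// Illustrative stand-in for a RuntimeMetric-style accumulator.
class CallMetric {
    private long count;
    private long totalNanos;

    // Record one timed call.
    void record(long elapsedNanos) {
        count++;
        totalNanos += elapsedNanos;
    }

    long getCount() { return count; }
    long getTotalNanos() { return totalNanos; }
}
```

Usage would bracket the call being measured, e.g. `long start = System.nanoTime(); /* normalizeIdentifier call */ metric.record(System.nanoTime() - start);`, which is the same shape as the addMetricValue call that later appears in the PR.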

@agrawalreetika agrawalreetika force-pushed the mixed-case branch 3 times, most recently from c28190d to cd4d50f Compare April 29, 2025 04:12
agrawalreetika (Member, Author)

@ZacBlanco @aaneja, updated based on your review comments. Please take a look at your convenience.

@Config("case-insensitive-name-matching")
@ConfigDescription("Deprecated: This will be deprecated in future. Use 'case-sensitive-name-matching=true' instead for mysql. " +
Contributor

nit:

Suggested change
@ConfigDescription("Deprecated: This will be deprecated in future. Use 'case-sensitive-name-matching=true' instead for mysql. " +
@ConfigDescription("Deprecated: This will be removed in future releases. Use 'case-sensitive-name-matching=true' instead for mysql. " +

Comment on lines 142 to 143
@ConfigDescription("Enable case-sensitive matching of schema, table names across the connector, " +
"When disabled, names are matched case-insensitively, using lowercase normalization.")
Contributor

Suggested change
@ConfigDescription("Enable case-sensitive matching of schema, table names across the connector, " +
"When disabled, names are matched case-insensitively, using lowercase normalization.")
@ConfigDescription("Enable case-sensitive matching of schema, table names across the connector. " +
"When disabled, names are matched case-insensitively using lowercase normalization.")

Comment on lines 71 to 72
names for the connector, When disabled, names are matched
case-insensitively, using lowercase normalization.
Contributor

Suggested change
names for the connector, When disabled, names are matched
case-insensitively, using lowercase normalization.
names for the connector. When disabled, names are matched
case-insensitively using lowercase normalization.

@@ -49,6 +49,7 @@
import java.util.Map;

import static com.facebook.presto.SystemSessionProperties.LEGACY_TIMESTAMP;
import static com.facebook.presto.iceberg.IcebergQueryRunner.ICEBERG_CATALOG;
Contributor

You added the import in the base class, but the subclasses are the ones using it?

agrawalreetika (Member, Author) commented Apr 30, 2025

This is unused; I will remove it. I think checkstyle did not flag it because it is in test code.

@@ -160,6 +160,7 @@ public void testInvalidSetTablePropertyProcedureCases()

private Table loadTable(String tableName)
{
tableName = normalizeIdentifier(tableName, ICEBERG_CATALOG);
Contributor

You could add a method to the base class like loadTableAndNormalizeName that just calls normalizeIdentifier then loadTable in the subclass. You would just have to update the existing calls to loadTable.

ConnectorMetadata metadata = catalogMetadata.get().getMetadataFor(connectorId);
normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);
session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime);
return normalizedString;
Contributor

you could remove this return and the above session.getRuntimeStats().addMetricValue, since you are just assigning normalizedString

Contributor

Also just curious how useful this metric might be - are there some normalizations that could take considerable time?

agrawalreetika (Member, Author)

you could remove this return and the above session.getRuntimeStats().addMetricValue, since you are just assigning normalizedString

normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);

This is the connector metadata API call, which is why the metric end time is recorded right after it.

This is helpful for knowing how many times the newly added normalizer API is called for a particular query and how much time it spends overall.

Here is what the runtime metric looks like for one of the TPC-H queries:

"getIdentifierNormalizationTimeNanos":{"name":"getIdentifierNormalizationTimeNanos",
"unit":"NANO",
"sum":8177043,
"count":2,
"max":7601834,
"min":575209}

agrawalreetika (Member, Author)

You could add a method to the base class like loadTableAndNormalizeName that just calls normalizeIdentifier then loadTable in the subclass. You would just have to update the existing calls to loadTable.

Regarding the loadTable method cleanup in tests, I was thinking I could do that in a different PR since that change is not related. Would that be OK?

Contributor

It's not a big deal; it's only a few repeated lines and I thought it would be a little cleaner, but either way is fine.

@agrawalreetika agrawalreetika force-pushed the mixed-case branch 2 times, most recently from 7c2f5cc to 48467ab Compare April 30, 2025 14:13
ConnectorMetadata metadata = catalogMetadata.get().getMetadataFor(connectorId);
normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);
session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime);
return normalizedString;
Contributor

I still don't see why you need to return here. Your session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime) call is the same below, so just assign normalizedString and let it return at the end of the method.

agrawalreetika (Member, Author)

You are right, I overlooked it. We can remove return normalizedString; and the line before it. I will make the changes. Thanks!

BryanCutler previously approved these changes May 1, 2025

@BryanCutler (Contributor) left a comment

LGTM

@@ -644,4 +648,15 @@ public static void dropTableIfExists(QueryRunner queryRunner, String catalogName
{
queryRunner.execute(format("DROP TABLE IF EXISTS %s.%s.%s", catalogName, schemaName, tableName));
}

protected String normalizeIdentifier(String name, String catalogName)
Contributor

If this is only used in Iceberg tests, does it need to be in AbstractTestQueryFramework?

agrawalreetika (Member, Author)

Currently yes, it's only used for Iceberg tests since those use mixed case. But as I checked, this method seems generic; if needed we could use it with other connectors as well. Do you see any problem with keeping it here?

@@ -197,9 +220,12 @@ public static Optional<TableHandle> getOptionalTableHandle(Session session, Tran
ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);

ConnectorTableHandle tableHandle;
String schemaName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getSchemaName());
Contributor

If we already normalize when creating a QualifiedObjectName, why do we have to normalize it again here?

@@ -197,9 +220,12 @@ public static Optional<TableHandle> getOptionalTableHandle(Session session, Tran
ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);

ConnectorTableHandle tableHandle;
String schemaName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getSchemaName());
String tableName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getObjectName());
Contributor

ditto

@@ -90,6 +91,21 @@ public static SchemaTableName toSchemaTableName(QualifiedObjectName qualifiedObj
return new SchemaTableName(qualifiedObjectName.getSchemaName(), qualifiedObjectName.getObjectName());
}

public static SchemaTableName toSchemaTableName(String schemaName, String tableName)
Contributor

checkTableName/checkSchemaName at the top of this file do not seem to be used any more

@@ -105,6 +106,11 @@ public String getSuffix()
return Iterables.getLast(parts);
}

public String getOriginalSuffix()
{
return Iterables.getLast(originalParts);
Contributor

nit: static import

*
* This class includes tests for:
* 1. Creating a schema with a lowercase name.
* 2. Creating a schema with a mixed-case name that has the same syllables as an existing schema.
Contributor

Why are we using the term syllables here? Just say it's the same name as the existing schema but mixed case.

{
// Define schema names and their expected stored values in Hive
String schemaNameMixedSyllables = "HiveMixedCaseOn";
String schemaNameMixed = "HiveMixedCase";
Contributor

Why is this and the above separate cases? They are basically the same test case

Contributor

The first test checks the behaviour when a schema with the same name already exists but in a different case; the second one checks the behaviour when we use mixed case.

* It ensures that data is inserted and retrieved correctly regardless of case sensitivity.
*/
@Test(groups = {MIXED_CASE})
public void testInsertDataWithMixedCaseNames()
Contributor

Is this test necessary? We know none of our code is touching the datapath, so mixed case would not affect rows stored in the table

* It ensures that queries return correct results regardless of case sensitivity.
*/
@Test(groups = {MIXED_CASE})
public void testSelectDataWithMixedCaseNames()
Contributor

This test isn't really any different than the above test. I also think it is not necessary to check on the data path.

agrawalreetika (Member, Author)

Yes, the SELECT test looks the same as the INSERT one. So I think we can remove this one altogether and just keep testInsertDataWithMixedCaseNames() as it is. WDYT?

query("ALTER TABLE " + SCHEMA_NAME_UPPER + ".TESTTABLE02 ADD COLUMN num2 REAL");

// Verify the added columns
assertThat(query("DESCRIBE hivemixedcaseon.testtable"))
Contributor

Why is this one not using a variable for the schema?

@hantangwangd (Member) left a comment

Change overall looks good to me, only some little things and nits.

.map(MetadataUtil::toSchemaTableName)
.map(table -> toSchemaTableName(table))
Member

Do we need this change?

.map(MetadataUtil::toSchemaTableName)
.map(view -> toSchemaTableName(view))
Member

The same as above.

Comment on lines +82 to +87
// public static SchemaTableName toSchemaTableName(QualifiedObjectName qualifiedObjectName, Metadata metadata, Session session)
// {
// String schemaName = metadata.normalizeIdentifier(session, qualifiedObjectName.getCatalogName(), qualifiedObjectName.getSchemaName());
// String tableName = metadata.normalizeIdentifier(session, qualifiedObjectName.getCatalogName(), qualifiedObjectName.getObjectName());
// return toSchemaTableName(schemaName, tableName);
// }
Member

nit: remove this?

{
this.analysis = requireNonNull(analysis, "analysis is null");
this.session = requireNonNull(session, "session is null");
this.metadata = metadata;
Member

nit: Seems metadata shouldn't be null, so maybe add requireNonNull check for metadata?

Comment on lines -113 to +116
return getColumnMappings(analysis, mappedBaseColumns);
return getColumnMappings(analysis, mappedBaseColumns, metadata, session);
Member

Seems the metadata and session parameters turn out to be unused in the end, so should we remove them?

@@ -124,7 +127,7 @@ public Map<String, Map<SchemaTableName, String>> getMaterializedViewColumnMappin
*/
public Map<String, Map<SchemaTableName, String>> getMaterializedViewDirectColumnMappings()
{
return getColumnMappings(analysis, directMappedBaseColumns);
return getColumnMappings(analysis, directMappedBaseColumns, metadata, session);
Member

The same as above.

@@ -211,7 +214,7 @@ protected Void visitComparisonExpression(ComparisonExpression node, Materialized
return null;
}

private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis analysis)
private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis analysis, Metadata metadata, Session session)
Member

metadata and session are unused in this method; should we remove them?

@@ -222,15 +225,15 @@ private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis
field -> new TableColumn(toSchemaTableName(field.getOriginTable().get()), field.getOriginColumnName().get(), true)));
}

private static Map<String, Map<SchemaTableName, String>> getColumnMappings(Analysis analysis, Map<TableColumn, Set<TableColumn>> columnMappings)
private static Map<String, Map<SchemaTableName, String>> getColumnMappings(Analysis analysis, Map<TableColumn, Set<TableColumn>> columnMappings, Metadata metadata, Session session)
Member

The same as above.

{
this.session = requireNonNull(session, "session is null");
this.predicates = requireNonNull(predicates, "predicates is null");
this.metadata = metadata;
Member

nit: add requireNonNull check for metadata?

Labels: from:IBM PR from IBM
Projects: None yet
9 participants