Mixed case identifier support #24551


Open · wants to merge 5 commits into base: master from the mixed-case branch

Conversation

agrawalreetika (Member) commented Feb 13, 2025

Description

Improves identifier handling (schema, table) to align with SQL standards for better compatibility with case-sensitive and case-normalizing databases, while minimizing SPI-breaking changes.

Motivation and Context

RFC details - prestodb/rfcs#36

  • This PR focuses on generalizing schema and table name handling to better align with SQL standards. As per the current API behavior, identifiers are lowercased by default unless explicitly handled (RFC reference).

  • Column name support and related changes will be addressed in a follow-up PR. Currently, column names are lowercased at the SPI level (ColumnMetadata.java#L45). Removing this generic lowercase conversion will require updates to normalize column names via the metadata API in each connector.
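The normalization contract described above can be illustrated with a minimal, self-contained sketch. The real SPI entry point in this PR is ConnectorMetadata.normalizeIdentifier(ConnectorSession, String); the class and constructor flag below are hypothetical stand-ins for how a connector decides between the historical lowercasing and case-sensitive pass-through.

```java
import java.util.Locale;

// Hypothetical stand-in for connector-level identifier normalization.
class IdentifierNormalizer {
    private final boolean caseSensitiveNameMatching;

    IdentifierNormalizer(boolean caseSensitiveNameMatching) {
        this.caseSensitiveNameMatching = caseSensitiveNameMatching;
    }

    // Historical default: identifiers are lowercased (English locale).
    // With case-sensitive-name-matching=true the original casing is kept.
    String normalizeIdentifier(String identifier) {
        return caseSensitiveNameMatching
                ? identifier
                : identifier.toLowerCase(Locale.ENGLISH);
    }
}
```

The point of the SPI change is that this decision moves behind the connector's metadata interface instead of being a hard-coded lowercase at the engine/SPI boundary.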

Impact


Analysis time for EXPLAIN on TPC-H queries (tpch-sf100), measured on a local Mac with an 8 GB JVM, comparing master with the PR changes: [benchmark image]

Test Plan

  • Existing unit tests pass
  • Added MySQL support, with new unit tests covering MySQL when mixed-case support is enabled

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add mixed case support for schema and table names.

JDBC Driver Changes
* Add catalog property ``case-sensitive-name-matching`` to the JDBC connector for mixed-case support.

MySQL Connector Changes
* Add support for mixed-case identifiers in MySQL. It can be enabled by setting ``case-sensitive-name-matching=true`` in the catalog configuration.
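For reference, a MySQL catalog properties file with the new flag enabled might look like the sketch below. The file path, host, and credentials are illustrative; only ``connector.name`` and ``case-sensitive-name-matching`` come from this PR's context.

```properties
# etc/catalog/mysql.properties (illustrative path and connection details)
connector.name=mysql
connection-url=jdbc:mysql://example.net:3306
connection-user=root
connection-password=secret
# Preserve the original case of schema and table names (added by this PR)
case-sensitive-name-matching=true
```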

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Feb 13, 2025
@agrawalreetika agrawalreetika self-assigned this Feb 13, 2025
@agrawalreetika agrawalreetika force-pushed the mixed-case branch 6 times, most recently from 6f127fc to e2f23fd Compare February 17, 2025 15:49
@agrawalreetika agrawalreetika force-pushed the mixed-case branch 10 times, most recently from 1ad263b to bcb3a7c Compare February 25, 2025 09:13
@agrawalreetika agrawalreetika force-pushed the mixed-case branch 3 times, most recently from 0e0c8be to 8e65855 Compare March 8, 2025 16:07
@agrawalreetika agrawalreetika force-pushed the mixed-case branch 3 times, most recently from 8f48ca4 to 63944ed Compare March 15, 2025 20:19
@prestodb-ci (Contributor): @ethanyzhang imported this issue into IBM GitHub Enterprise

@agrawalreetika agrawalreetika force-pushed the mixed-case branch 3 times, most recently from b0a5dd9 to 76cb96e Compare March 28, 2025 10:02
@prestodb-ci (Contributor): @ethanyzhang imported this issue into IBM GitHub Enterprise

@aaneja (Contributor) left a comment

Flush 1 - Finished looking at code. Looking at tests next

if (parts.size() > 2) {
throw new SemanticException(INVALID_SCHEMA_NAME, node, "Too many parts in schema name: %s", schema.get());
}
if (parts.size() == 2) {
catalogName = parts.get(0);
catalogName = parts.get(0).toLowerCase(ENGLISH);
Contributor

Why was toLowerCase added here?

@@ -1478,6 +1521,17 @@ public void addConstraint(Session session, TableHandle tableHandle, TableConstra
ConnectorMetadata metadata = getMetadataForWrite(session, connectorId);
metadata.addConstraint(session.toConnectorSession(connectorId), tableHandle.getConnectorHandle(), tableConstraint);
}
@Override
Contributor

nit: Add an empty line above

for (ConnectorId connectorId : catalogMetadata.listConnectorIds()) {
ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);
ConnectorSession connectorSession = session.toConnectorSession(connectorId);
metadata.listTables(connectorSession, prefix.getSchemaName()).stream()
.map(convertFromSchemaTableName(prefix.getCatalogName()))
.filter(prefix::matches)
.filter(name -> prefix.matches(new QualifiedObjectName(name.getCatalogName(),
Contributor

This is a bit confusing. prefix is supplied by the user and is NOT normalized. name is the output of listTables and should already be normalized? If so, shouldn't we instead normalize the prefix and then call matches?

agrawalreetika (Member, Author) commented Apr 29, 2025

prefix holds the catalog.schema mapping that resulted from listSchemaNames, on which listTables needs to be called. So I think the original value should be passed to the listTables API here.

@ZacBlanco (Contributor) left a comment

Overall, I think this looks good, but I have two concerns

  1. I am paranoid about the overhead that normalizeIdentifiers could introduce. There are quite a few layers of function calls to get into a plugin's ConnectorMetadata instance before hitting normalizeIdentifiers, including setting up the ContextClassLoader for each metadata call. If we have a large query, or have to call many metadata functions during query planning, we might see some small performance hit on the query analysis times. I don't really know how small the impact would be, but it would be nice to verify that we don't have any regressions.

  2. There are a lot of locations where we call normalizeIdentifier in the current approach. However, the source of identifiers all comes from one place: the user query. It seems like with the current solution it is possible to call normalizeIdentifier many times for a single identifier. Do you think it would be possible to move the normalization of identifiers closer to the query analysis phase in order to not have to run it again? Theoretically, we should only need to call it once per identifier in the query, right?
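One way to sketch the once-per-identifier idea from point 2 is to memoize the normalization per query. This is illustrative only, not code from the PR; the class and method names below are hypothetical stand-ins for wherever such a cache would live in the analyzer.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical per-query cache: each distinct identifier string is
// normalized once, and repeated lookups reuse the cached result.
class MemoizingNormalizer {
    private final Map<String, String> cache = new HashMap<>();
    private final UnaryOperator<String> delegate;

    MemoizingNormalizer(UnaryOperator<String> delegate) {
        this.delegate = delegate;
    }

    // Returns the cached result when the identifier was seen before, so the
    // (potentially costly) connector normalization runs once per identifier.
    String normalize(String identifier) {
        return cache.computeIfAbsent(identifier, delegate);
    }
}
```

The delegate here stands in for the connector call; the cache would have to be scoped to a single query, since different catalogs can normalize differently.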

@@ -110,11 +111,14 @@ public ConnectorTransactionHandle getTransactionHandleFor(ConnectorId connectorI

public ConnectorId getConnectorId(Session session, QualifiedObjectName table)
{
if (table.getSchemaName().equals(INFORMATION_SCHEMA_NAME)) {
if (table.getSchemaName().equalsIgnoreCase(INFORMATION_SCHEMA_NAME)) {
Contributor

why ignore case now?

agrawalreetika (Member, Author)

> [quoting @ZacBlanco's two concerns above]

Thanks for your review @ZacBlanco

  1. I will try to take some benchmarks with and w/o normalizeIdentifiers and post the results.
  2. In the current state, lowercase conversion is done for user queries (SELECT, DML) as well as for metadata queries. So in the current approach normalizeIdentifiers is called even for metadata calls (for example, SHOW TABLES) to preserve the existing lowercase conversion when no specific casing is done at the connector level. I think there are some places in ConnectorMetadata where the normalizeIdentifiers call can be reduced because the QualifiedObjectName is already normalized; I will make those changes.

aaneja (Contributor) commented Apr 28, 2025

I will try to take some benchmarks with and w/o normalizeIdentifiers and post the results.

YMMV, but adding a session RuntimeMetric should show the impact pretty well, both in terms of count (times called) and wall-clock time
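The measurement pattern being suggested (a count plus accumulated wall-clock time around each call) can be sketched without Presto at all. Presto's RuntimeStats/RuntimeMetric provide this for real; the class below is only an illustrative stand-in showing what such a metric accumulates.

```java
// Illustrative stand-in for a RuntimeMetric-style accumulator.
class CallMetric {
    private long count;
    private long totalNanos;

    // Record one timed call.
    void record(long elapsedNanos) {
        count++;
        totalNanos += elapsedNanos;
    }

    long getCount() { return count; }
    long getTotalNanos() { return totalNanos; }
}
```

Usage would bracket the call being measured, e.g. `long start = System.nanoTime(); /* normalizeIdentifier call */ metric.record(System.nanoTime() - start);`, which is the same shape as the addMetricValue call that later appears in the PR.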

@agrawalreetika agrawalreetika force-pushed the mixed-case branch 3 times, most recently from c28190d to cd4d50f Compare April 29, 2025 04:12
agrawalreetika (Member, Author)

@ZacBlanco @aaneja, updated based on your review comments. Please take a look at your convenience.

@Config("case-insensitive-name-matching")
@ConfigDescription("Deprecated: This will be deprecated in future. Use 'case-sensitive-name-matching=true' instead for mysql. " +
Contributor

nit:

Suggested change
@ConfigDescription("Deprecated: This will be deprecated in future. Use 'case-sensitive-name-matching=true' instead for mysql. " +
@ConfigDescription("Deprecated: This will be removed in future releases. Use 'case-sensitive-name-matching=true' instead for mysql. " +

Comment on lines 142 to 143
@ConfigDescription("Enable case-sensitive matching of schema, table names across the connector, " +
"When disabled, names are matched case-insensitively, using lowercase normalization.")
Contributor

Suggested change
@ConfigDescription("Enable case-sensitive matching of schema, table names across the connector, " +
"When disabled, names are matched case-insensitively, using lowercase normalization.")
@ConfigDescription("Enable case-sensitive matching of schema, table names across the connector. " +
"When disabled, names are matched case-insensitively using lowercase normalization.")

Comment on lines 71 to 72
names for the connector, When disabled, names are matched
case-insensitively, using lowercase normalization.
Contributor

Suggested change
names for the connector, When disabled, names are matched
case-insensitively, using lowercase normalization.
names for the connector. When disabled, names are matched
case-insensitively using lowercase normalization.

@@ -49,6 +49,7 @@
import java.util.Map;

import static com.facebook.presto.SystemSessionProperties.LEGACY_TIMESTAMP;
import static com.facebook.presto.iceberg.IcebergQueryRunner.ICEBERG_CATALOG;
Contributor

You added the import in the base class, but the subclasses are the ones using it?

agrawalreetika (Member, Author) commented Apr 30, 2025

This is unused; I will remove it. I think checkstyle did not flag it because it is in test code.

@@ -160,6 +160,7 @@ public void testInvalidSetTablePropertyProcedureCases()

private Table loadTable(String tableName)
{
tableName = normalizeIdentifier(tableName, ICEBERG_CATALOG);
Contributor

You could add a method to the base class like loadTableAndNormalizeName that just calls normalizeIdentifier then loadTable in the subclass. You would just have to update the existing calls to loadTable.

ConnectorMetadata metadata = catalogMetadata.get().getMetadataFor(connectorId);
normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);
session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime);
return normalizedString;
Contributor

you could remove this return and the above session.getRuntimeStats().addMetricValue, since you are just assigning normalizedString

Contributor

Also just curious how useful this metric might be - are there some normalizations that could take considerable time?

agrawalreetika (Member, Author)

you could remove this return and the above session.getRuntimeStats().addMetricValue, since you are just assigning normalizedString

normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);

This is the connector metadata API call, which is why the metric end time is recorded right after it.

This is helpful for knowing how many times the newly added normalizer API is called for a particular query and how much time it spends overall.

Here is what the runtime metric looks like for one of the TPC-H queries:

"getIdentifierNormalizationTimeNanos":{"name":"getIdentifierNormalizationTimeNanos",
"unit":"NANO",
"sum":8177043,
"count":2,
"max":7601834,
"min":575209}

agrawalreetika (Member, Author)

You could add a method to the base class like loadTableAndNormalizeName that just calls normalizeIdentifier then loadTable in the subclass. You would just have to update the existing calls to loadTable.

Regarding the loadTable method cleanup in tests, I was thinking I could do that in a different PR since that change is not related. Would that be OK?

Contributor

It's not a big deal; it's only a few repeated lines and I thought it would be a little cleaner, but either way is fine.

@agrawalreetika agrawalreetika force-pushed the mixed-case branch 2 times, most recently from 7c2f5cc to 48467ab Compare April 30, 2025 14:13
ConnectorMetadata metadata = catalogMetadata.get().getMetadataFor(connectorId);
normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);
session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime);
return normalizedString;
Contributor

I still don't see why you need to return here. Your session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime) call is the same below, so just assign normalizedString and let it return at the end of the method.

agrawalreetika (Member, Author)

You are right, I overlooked it. We can remove return normalizedString; and the line before it. I will make the changes. Thanks!

BryanCutler previously approved these changes May 1, 2025

@BryanCutler (Contributor) left a comment

LGTM

@@ -644,4 +648,15 @@ public static void dropTableIfExists(QueryRunner queryRunner, String catalogName
{
queryRunner.execute(format("DROP TABLE IF EXISTS %s.%s.%s", catalogName, schemaName, tableName));
}

protected String normalizeIdentifier(String name, String catalogName)
Contributor

If this is only used in Iceberg tests, does it need to be in AbstractTestQueryFramework?

agrawalreetika (Member, Author)

Currently yes, it's only used for Iceberg tests since those use mixed case. But as I checked, this method seems generic; if needed we could use it with other connectors as well. Do you see any problem with keeping it here?

@@ -197,9 +220,12 @@ public static Optional<TableHandle> getOptionalTableHandle(Session session, Tran
ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);

ConnectorTableHandle tableHandle;
String schemaName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getSchemaName());
Contributor

If we already normalize when creating a QualifiedObjectName, why do we have to normalize it again here?

@@ -197,9 +220,12 @@ public static Optional<TableHandle> getOptionalTableHandle(Session session, Tran
ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);

ConnectorTableHandle tableHandle;
String schemaName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getSchemaName());
String tableName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getObjectName());
Contributor

ditto

@@ -90,6 +91,21 @@ public static SchemaTableName toSchemaTableName(QualifiedObjectName qualifiedObj
return new SchemaTableName(qualifiedObjectName.getSchemaName(), qualifiedObjectName.getObjectName());
}

public static SchemaTableName toSchemaTableName(String schemaName, String tableName)
Contributor

checkTableName/checkSchemaName at the top of this file do not seem to be used any more

@@ -105,6 +106,11 @@ public String getSuffix()
return Iterables.getLast(parts);
}

public String getOriginalSuffix()
{
return Iterables.getLast(originalParts);
Contributor

nit: static import

*
* This class includes tests for:
* 1. Creating a schema with a lowercase name.
* 2. Creating a schema with a mixed-case name that has the same syllables as an existing schema.
Contributor

Why are we using the term syllables here? Just say it's the same name as the existing schema but mixed case.

{
// Define schema names and their expected stored values in Hive
String schemaNameMixedSyllables = "HiveMixedCaseOn";
String schemaNameMixed = "HiveMixedCase";
Contributor

Why is this and the above separate cases? They are basically the same test case

Contributor

The first test checks the behaviour when a schema with the same name already exists but in a different case; the second one checks the behaviour when we use mixed case.

* It ensures that data is inserted and retrieved correctly regardless of case sensitivity.
*/
@Test(groups = {MIXED_CASE})
public void testInsertDataWithMixedCaseNames()
Contributor

Is this test necessary? We know none of our code is touching the datapath, so mixed case would not affect rows stored in the table

* It ensures that queries return correct results regardless of case sensitivity.
*/
@Test(groups = {MIXED_CASE})
public void testSelectDataWithMixedCaseNames()
Contributor

This test isn't really any different than the above test. I also think it is not necessary to check on the data path.

agrawalreetika (Member, Author)

Yes, the SELECT test looks the same as the INSERT one. So I think we can remove this one altogether and just keep testInsertDataWithMixedCaseNames() as it is. WDYT?

query("ALTER TABLE " + SCHEMA_NAME_UPPER + ".TESTTABLE02 ADD COLUMN num2 REAL");

// Verify the added columns
assertThat(query("DESCRIBE hivemixedcaseon.testtable"))
Contributor

Why is this one not using a variable for the schema?

@hantangwangd (Member) left a comment

Change overall looks good to me, only some little things and nits.

.map(MetadataUtil::toSchemaTableName)
.map(table -> toSchemaTableName(table))
Member

Do we need this change?

.map(MetadataUtil::toSchemaTableName)
.map(view -> toSchemaTableName(view))
Member

The same as above.

Comment on lines +82 to +87
// public static SchemaTableName toSchemaTableName(QualifiedObjectName qualifiedObjectName, Metadata metadata, Session session)
// {
// String schemaName = metadata.normalizeIdentifier(session, qualifiedObjectName.getCatalogName(), qualifiedObjectName.getSchemaName());
// String tableName = metadata.normalizeIdentifier(session, qualifiedObjectName.getCatalogName(), qualifiedObjectName.getObjectName());
// return toSchemaTableName(schemaName, tableName);
// }
Member

nit: remove this?

{
this.analysis = requireNonNull(analysis, "analysis is null");
this.session = requireNonNull(session, "session is null");
this.metadata = metadata;
Member

nit: Seems metadata shouldn't be null, so maybe add requireNonNull check for metadata?

Comment on lines -113 to +116
return getColumnMappings(analysis, mappedBaseColumns);
return getColumnMappings(analysis, mappedBaseColumns, metadata, session);
Member

Seems the metadata and session parameters turn out to be unused in the end, so should we remove them?

@@ -124,7 +127,7 @@ public Map<String, Map<SchemaTableName, String>> getMaterializedViewColumnMappin
*/
public Map<String, Map<SchemaTableName, String>> getMaterializedViewDirectColumnMappings()
{
return getColumnMappings(analysis, directMappedBaseColumns);
return getColumnMappings(analysis, directMappedBaseColumns, metadata, session);
Member

The same as above.

@@ -211,7 +214,7 @@ protected Void visitComparisonExpression(ComparisonExpression node, Materialized
return null;
}

private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis analysis)
private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis analysis, Metadata metadata, Session session)
Member

metadata and session are unused in this method; should we remove them?

@@ -222,15 +225,15 @@ private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis
field -> new TableColumn(toSchemaTableName(field.getOriginTable().get()), field.getOriginColumnName().get(), true)));
}

private static Map<String, Map<SchemaTableName, String>> getColumnMappings(Analysis analysis, Map<TableColumn, Set<TableColumn>> columnMappings)
private static Map<String, Map<SchemaTableName, String>> getColumnMappings(Analysis analysis, Map<TableColumn, Set<TableColumn>> columnMappings, Metadata metadata, Session session)
Member

The same as above.

{
this.session = requireNonNull(session, "session is null");
this.predicates = requireNonNull(predicates, "predicates is null");
this.metadata = metadata;
Member

nit: add requireNonNull check for metadata?

Labels: from:IBM PR from IBM
Projects: None yet
9 participants