Mixed case identifier support #24551
base: master
Conversation
Force-pushed from 6f127fc to e2f23fd
Force-pushed from 1ad263b to bcb3a7c
Force-pushed from 0e0c8be to 8e65855
Force-pushed from 8f48ca4 to 63944ed
@ethanyzhang imported this issue into IBM GitHub Enterprise
Force-pushed from b0a5dd9 to 76cb96e
@ethanyzhang imported this issue into IBM GitHub Enterprise
Force-pushed from 76cb96e to 30d669e
Flush 1 - Finished looking at code. Looking at tests next
if (parts.size() > 2) {
    throw new SemanticException(INVALID_SCHEMA_NAME, node, "Too many parts in schema name: %s", schema.get());
}
if (parts.size() == 2) {
-   catalogName = parts.get(0);
+   catalogName = parts.get(0).toLowerCase(ENGLISH);
Why was toLowerCase added here?
@@ -1478,6 +1521,17 @@ public void addConstraint(Session session, TableHandle tableHandle, TableConstra
    ConnectorMetadata metadata = getMetadataForWrite(session, connectorId);
    metadata.addConstraint(session.toConnectorSession(connectorId), tableHandle.getConnectorHandle(), tableConstraint);
}
@Override |
nit: Add an empty line above
for (ConnectorId connectorId : catalogMetadata.listConnectorIds()) {
    ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);
    ConnectorSession connectorSession = session.toConnectorSession(connectorId);
    metadata.listTables(connectorSession, prefix.getSchemaName()).stream()
            .map(convertFromSchemaTableName(prefix.getCatalogName()))
-           .filter(prefix::matches)
+           .filter(name -> prefix.matches(new QualifiedObjectName(name.getCatalogName(),
This is a bit confusing. prefix is supplied by the user and is NOT normalized. name is the output of listTables and should already be normalized? If so, shouldn't we instead normalize the prefix and then call matches?
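For illustration, the suggested alternative would look roughly like this (a sketch only; it assumes the prefix's schema part is present and that a QualifiedTablePrefix can be rebuilt from its normalized parts):

// Sketch only: normalize the user-supplied prefix once up front, then reuse
// the plain matches(...) filter; names coming out of listTables are assumed
// to already be normalized.
String normalizedSchema = metadata.normalizeIdentifier(connectorSession, prefix.getSchemaName());
QualifiedTablePrefix normalizedPrefix = new QualifiedTablePrefix(prefix.getCatalogName(), normalizedSchema);

metadata.listTables(connectorSession, normalizedSchema).stream()
        .map(convertFromSchemaTableName(prefix.getCatalogName()))
        .filter(normalizedPrefix::matches)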
prefix would have the catalog.schema mapping resulting from listSchemaNames, on which listTables needs to be called. So I think the original value should be passed to the listTables API here.
Overall, I think this looks good, but I have two concerns:

1. I am paranoid about the overhead that normalizeIdentifiers could introduce. There are quite a few layers of function calls to get into a plugin's ConnectorMetadata instance before hitting normalizeIdentifiers, including setting up the ContextClassLoader for each metadata call. If we have a large query, or have to call many metadata functions during query planning, we might see a small performance hit on query analysis times. I don't really know how small the impact would be, but it would be nice to verify that we don't have any regressions.

2. There are a lot of locations where we call normalizeIdentifier in the current approach. However, the identifiers all come from one place: the user query. It seems like with the current solution it is possible to call normalizeIdentifier many times for a single identifier. Do you think it would be possible to move the normalization of identifiers closer to the query analysis phase in order to not have to run it again? Theoretically, we should only need to call it once per identifier in the query, right? (See the sketch below.)
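For example, the analyzer could memoize normalization so each identifier is resolved at most once per query. A sketch only; the cache and helper are hypothetical, not part of this PR:

import java.util.HashMap;
import java.util.Map;

// Hypothetical per-query cache, keyed on catalog + identifier because
// normalization rules are per-connector.
private final Map<String, String> normalizedIdentifierCache = new HashMap<>();

private String normalizeOnce(Session session, String catalogName, String identifier)
{
    return normalizedIdentifierCache.computeIfAbsent(
            catalogName + "." + identifier,
            key -> metadata.normalizeIdentifier(session, catalogName, identifier));
}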
@@ -110,11 +111,14 @@ public ConnectorTransactionHandle getTransactionHandleFor(ConnectorId connectorI

public ConnectorId getConnectorId(Session session, QualifiedObjectName table)
{
-   if (table.getSchemaName().equals(INFORMATION_SCHEMA_NAME)) {
+   if (table.getSchemaName().equalsIgnoreCase(INFORMATION_SCHEMA_NAME)) {
why ignore case now?
Thanks for your review @ZacBlanco
YMMV, but adding a session RuntimeMetric should show the impact pretty well, both in terms of count (times called) and wall-clock time.
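That pattern, roughly as it shows up in the later revision of this change:

long startTime = System.nanoTime();
String normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);
// Records both the call count and the wall-clock time under one session metric.
session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime);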
Force-pushed from c28190d to cd4d50f
@ZacBlanco @aaneja, updated based on your review comments. Please take a look at your convenience.
Force-pushed from cd4d50f to a09f2c8
@Config("case-insensitive-name-matching")
@ConfigDescription("Deprecated: This will be deprecated in future. Use 'case-sensitive-name-matching=true' instead for mysql. " +
nit:
- @ConfigDescription("Deprecated: This will be deprecated in future. Use 'case-sensitive-name-matching=true' instead for mysql. " +
+ @ConfigDescription("Deprecated: This will be removed in future releases. Use 'case-sensitive-name-matching=true' instead for mysql. " +
@ConfigDescription("Enable case-sensitive matching of schema, table names across the connector, " +
        "When disabled, names are matched case-insensitively, using lowercase normalization.")
- @ConfigDescription("Enable case-sensitive matching of schema, table names across the connector, " +
-         "When disabled, names are matched case-insensitively, using lowercase normalization.")
+ @ConfigDescription("Enable case-sensitive matching of schema, table names across the connector. " +
+         "When disabled, names are matched case-insensitively using lowercase normalization.")
names for the connector, When disabled, names are matched
case-insensitively, using lowercase normalization.
- names for the connector, When disabled, names are matched
- case-insensitively, using lowercase normalization.
+ names for the connector. When disabled, names are matched
+ case-insensitively using lowercase normalization.
@@ -49,6 +49,7 @@
import java.util.Map;

+import static com.facebook.presto.SystemSessionProperties.LEGACY_TIMESTAMP;
import static com.facebook.presto.iceberg.IcebergQueryRunner.ICEBERG_CATALOG;
You added the import in the base class, but the subclasses are using it?
This is unused, will remove it. Didn't get a checkstyle failure since it's in test code, I think.
@@ -160,6 +160,7 @@ public void testInvalidSetTablePropertyProcedureCases()

private Table loadTable(String tableName)
{
+   tableName = normalizeIdentifier(tableName, ICEBERG_CATALOG);
You could add a method to the base class like loadTableAndNormalizeName that just calls normalizeIdentifier then loadTable in the subclass. You would just have to update the existing calls to loadTable.
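A minimal sketch of that helper, assuming loadTable stays in the subclass and is visible to the base class:

// Sketch: the name comes from the comment above; the body is an assumption.
protected Table loadTableAndNormalizeName(String tableName)
{
    return loadTable(normalizeIdentifier(tableName, ICEBERG_CATALOG));
}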
ConnectorMetadata metadata = catalogMetadata.get().getMetadataFor(connectorId);
normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);
session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime);
return normalizedString;
you could remove this return and the above session.getRuntimeStats().addMetricValue, since you are just assigning normalizedString
Also just curious how useful this metric might be - are there normalizations that could take considerable time?
you could remove this return and the above session.getRuntimeStats().addMetricValue, since you are just assigning normalizedString

normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);

This is the connector metadata API call, which is why the end-of-metric measurement is added right after it. It is helpful to know how many times the newly added normalizer API is called for a particular query and how much time it spends overall. This is what a sample runtime metric looks like for one of the TPCH queries:

"getIdentifierNormalizationTimeNanos":{"name":"getIdentifierNormalizationTimeNanos",
"unit":"NANO",
"sum":8177043,
"count":2,
"max":7601834,
"min":575209}
You could add a method to the base class like loadTableAndNormalizeName that just calls normalizeIdentifier then loadTable in the subclass. You would just have to update the existing calls to loadTable.

Regarding the loadTable method cleanup in tests, I was thinking I could do that in a different PR since that change is not related? Would that be OK?
It's not a big deal, it's only a few repeated lines and I thought it would be a little cleaner, but either way is fine.
Force-pushed from 7c2f5cc to 48467ab
ConnectorMetadata metadata = catalogMetadata.get().getMetadataFor(connectorId);
normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);
session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime);
return normalizedString;
I still don't see why you need to return here. It looks like your session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime); is the same below, so just assign normalizedString and let it return at the end of the method.
You are right, I overlooked it. We can remove return normalizedString; and the line before it. I will make the changes. Thanks!
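The agreed-on shape would be roughly the following sketch (the fallback branch is an assumption, inferred from the "is the same below" remark):

String normalizedString;
if (catalogMetadata.isPresent()) {
    ConnectorMetadata metadata = catalogMetadata.get().getMetadataFor(connectorId);
    normalizedString = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), identifier);
}
else {
    // Assumed default when the catalog metadata is absent.
    normalizedString = identifier.toLowerCase(ENGLISH);
}
// One metric call and one return at the end of the method.
session.getRuntimeStats().addMetricValue(GET_IDENTIFIER_NORMALIZATION_TIME_NANOS, NANO, System.nanoTime() - startTime);
return normalizedString;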
LGTM
@@ -644,4 +648,15 @@ public static void dropTableIfExists(QueryRunner queryRunner, String catalogName
{
    queryRunner.execute(format("DROP TABLE IF EXISTS %s.%s.%s", catalogName, schemaName, tableName));
}

+protected String normalizeIdentifier(String name, String catalogName)
If this is only used in Iceberg tests, does it need to be in AbstractTestQueryFramework?
Currently yes, it's only used for Iceberg tests since those use mixed-case names. But as I checked, this method seems generic; if needed we could use it with other connectors as well. Do you see any problem in keeping it here?
@@ -197,9 +220,12 @@ public static Optional<TableHandle> getOptionalTableHandle(Session session, Tran
    ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);

    ConnectorTableHandle tableHandle;
+   String schemaName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getSchemaName());
If we already normalize when creating a QualifiedObjectName, why do we have to normalize it again here?
@@ -197,9 +220,12 @@ public static Optional<TableHandle> getOptionalTableHandle(Session session, Tran
    ConnectorMetadata metadata = catalogMetadata.getMetadataFor(connectorId);

    ConnectorTableHandle tableHandle;
+   String schemaName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getSchemaName());
+   String tableName = metadata.normalizeIdentifier(session.toConnectorSession(connectorId), table.getObjectName());
ditto
@@ -90,6 +91,21 @@ public static SchemaTableName toSchemaTableName(QualifiedObjectName qualifiedObj
    return new SchemaTableName(qualifiedObjectName.getSchemaName(), qualifiedObjectName.getObjectName());
}

+public static SchemaTableName toSchemaTableName(String schemaName, String tableName)
checkTableName/checkSchemaName at the top of this file does not seem to be used any more
@@ -105,6 +106,11 @@ public String getSuffix()
    return Iterables.getLast(parts);
}

+public String getOriginalSuffix()
+{
+    return Iterables.getLast(originalParts);
nit: static import
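That is, importing getLast statically instead of qualifying it:

import static com.google.common.collect.Iterables.getLast;

public String getOriginalSuffix()
{
    return getLast(originalParts);
}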
 *
 * This class includes tests for:
 * 1. Creating a schema with a lowercase name.
 * 2. Creating a schema with a mixed-case name that has the same syllables as an existing schema.
Why are we using the term syllables here? Just say it's the same name as the existing schema but mixed case.
{
    // Define schema names and their expected stored values in Hive
    String schemaNameMixedSyllables = "HiveMixedCaseOn";
    String schemaNameMixed = "HiveMixedCase";
Why is this a separate case from the above? They are basically the same test case.
The first test checks the behaviour when a schema with the same name already exists but in a different case; the second one checks the behaviour when we use mixed case.
 * It ensures that data is inserted and retrieved correctly regardless of case sensitivity.
 */
@Test(groups = {MIXED_CASE})
public void testInsertDataWithMixedCaseNames()
Is this test necessary? We know none of our code is touching the datapath, so mixed case would not affect rows stored in the table
 * It ensures that queries return correct results regardless of case sensitivity.
 */
@Test(groups = {MIXED_CASE})
public void testSelectDataWithMixedCaseNames()
This test isn't really any different than the above test. I also think it is not necessary to check on the data path.
Yes, the SELECT looks the same as in the INSERT one. So I think we can remove this one altogether and just keep the testInsertDataWithMixedCaseNames() one as it is. WDYT?
query("ALTER TABLE " + SCHEMA_NAME_UPPER + ".TESTTABLE02 ADD COLUMN num2 REAL");

// Verify the added columns
assertThat(query("DESCRIBE hivemixedcaseon.testtable"))
Why is this one not using a variable for the schema?
Change overall looks good to me, only some little things and nits.
-   .map(MetadataUtil::toSchemaTableName)
+   .map(table -> toSchemaTableName(table))
Do we need this change?
-   .map(MetadataUtil::toSchemaTableName)
+   .map(view -> toSchemaTableName(view))
The same as above.
// public static SchemaTableName toSchemaTableName(QualifiedObjectName qualifiedObjectName, Metadata metadata, Session session)
// {
//     String schemaName = metadata.normalizeIdentifier(session, qualifiedObjectName.getCatalogName(), qualifiedObjectName.getSchemaName());
//     String tableName = metadata.normalizeIdentifier(session, qualifiedObjectName.getCatalogName(), qualifiedObjectName.getObjectName());
//     return toSchemaTableName(schemaName, tableName);
// }
nit: remove this?
{
    this.analysis = requireNonNull(analysis, "analysis is null");
    this.session = requireNonNull(session, "session is null");
    this.metadata = metadata;
nit: Seems metadata shouldn't be null, so maybe add a requireNonNull check for metadata?
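That is, matching the style of the fields above:

this.metadata = requireNonNull(metadata, "metadata is null");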
-   return getColumnMappings(analysis, mappedBaseColumns);
+   return getColumnMappings(analysis, mappedBaseColumns, metadata, session);
Seems the metadata and session parameters turn out to be unused in the end, so should we remove them?
@@ -124,7 +127,7 @@ public Map<String, Map<SchemaTableName, String>> getMaterializedViewColumnMappin
 */
public Map<String, Map<SchemaTableName, String>> getMaterializedViewDirectColumnMappings()
{
-   return getColumnMappings(analysis, directMappedBaseColumns);
+   return getColumnMappings(analysis, directMappedBaseColumns, metadata, session);
The same as above.
@@ -211,7 +214,7 @@ protected Void visitComparisonExpression(ComparisonExpression node, Materialized
    return null;
}

-private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis analysis)
+private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis analysis, Metadata metadata, Session session)
metadata and session are useless in this method, should we remove them?
@@ -222,15 +225,15 @@ private static Map<String, TableColumn> getOriginalColumnsFromAnalysis(Analysis
        field -> new TableColumn(toSchemaTableName(field.getOriginTable().get()), field.getOriginColumnName().get(), true)));
}

-private static Map<String, Map<SchemaTableName, String>> getColumnMappings(Analysis analysis, Map<TableColumn, Set<TableColumn>> columnMappings)
+private static Map<String, Map<SchemaTableName, String>> getColumnMappings(Analysis analysis, Map<TableColumn, Set<TableColumn>> columnMappings, Metadata metadata, Session session)
The same as above.
{
    this.session = requireNonNull(session, "session is null");
    this.predicates = requireNonNull(predicates, "predicates is null");
    this.metadata = metadata;
nit: add a requireNonNull check for metadata?
Description
Improves identifier handling (schema, table) to align with SQL standards for better compatibility with case-sensitive and case-normalizing databases, while minimizing SPI-breaking changes.
Motivation and Context
RFC details - prestodb/rfcs#36
This PR focuses on generalizing schema and table name handling to better align with SQL standards. As per the current API behavior, identifiers are lowercased by default unless explicitly handled (RFC reference).
Column name support and related changes will be addressed in a follow-up PR. Currently, column names are lowercased at the SPI level (ColumnMetadata.java#L45). Removing this generic lowercase conversion will require updates to normalize column names via the metadata API in each connector.
Impact
Improves identifier handling (schema, table) to align with SQL standards for better compatibility with case-sensitive and case-normalizing databases, while minimizing SPI-breaking changes.
Here is how Analysis Time looks on a local Mac with an 8GB JVM for EXPLAIN TPCH queries (tpch-sf100), comparing master with the PR changes:

[Figure: Screenshot 2025-05-12 at 12 52 05 PM - analysis-time comparison]

Test Plan
Contributor checklist
Release Notes
Please follow release notes guidelines and fill in the release notes below.