Describe the bug
to_parquet lowercases column names passed in via dtype, causing the data to be written incorrectly to Parquet and the wrong types to be registered in Glue/Athena (the nested data doesn't get written at all and shows up as null).
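As far as I can tell, the nulls are a side effect of the casing mismatch: Arrow matches struct fields to dict keys case-sensitively, so a lowercased schema never finds the original CamelCase keys. A minimal pyarrow sketch, outside awswrangler, just to illustrate the suspected mechanism:

import pyarrow as pa

records = [{"MetaData": {"CreateTime": "2023-05-31T20:37:17-07:00"}}]

# Schema with the original casing: the value is preserved.
ok_type = pa.struct([("CreateTime", pa.string())])
print(pa.array([r["MetaData"] for r in records], type=ok_type).to_pylist())
# [{'CreateTime': '2023-05-31T20:37:17-07:00'}]

# Schema with lowercased field names: the key is never found, the field is null.
lower_type = pa.struct([("createtime", pa.string())])
print(pa.array([r["MetaData"] for r in records], type=lower_type).to_pylist())
# [{'createtime': None}]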
How to Reproduce
import pandas as pd
import awswrangler as wr
data = [
    {
        "MetaData": {
            "CreateTime": "2023-05-31T20:37:17-07:00",
            "LastUpdatedTime": "2023-05-31T20:37:17-07:00"
        },
        "CurrencyRef": {"value": "USD", "name": "United States Dollar"},
        "Line": [
            {
                "Id": "0",
                "Description": "some description",
                "Amount": 3000.72,
                "DetailType": "JournalEntryLineDetail",
                "JournalEntryLineDetail": {
                    "PostingType": "Debit",
                    "AccountRef": {
                        "value": "1",
                        "name": "Cash and cash equivalents"
                    }
                }
            },
            {
                "Id": "1",
                "Description": "some description",
                "Amount": 5128.72,
                "DetailType": "JournalEntryLineDetail",
                "JournalEntryLineDetail": {
                    "PostingType": "Credit",
                    "Entity": {
                        "Type": "Vendor",
                        "EntityRef": {"value": "1", "name": "Vendor"}
                    },
                    "AccountRef": {
                        "value": "2",
                        "name": "Accounts payable"
                    }
                }
            }
        ]
    }
]
df = pd.DataFrame(data)
dtype = {
    "MetaData": "struct<CreateTime:timestamp, LastUpdatedTime:timestamp>",
    "CurrencyRef": "struct<value:varchar(255), name:varchar(255)>",
    "Line": "array<struct<Id:string, Description:string, Amount:double, DetailType:string, JournalEntryLineDetail:struct<PostingType:string, AccountRef:struct<value:bigint,name:string>, EntityRef:struct<value:string, name:string>>>>"
}
wr.s3.to_parquet(
    df=df,
    dataset=True,
    path='s3://path',
    database='test',
    table='complex_datatypes',
    glue_table_settings=wr.typing.GlueTableSettings(
        table_type='EXTERNAL_TABLE',
    ),
    mode='overwrite',
    use_threads=False,
    catalog_versioning=True,
    dtype=dtype,
)
Debug logs show both the column names and the nested struct field names lowercased:
DEBUG:awswrangler._data_types:athena_columns_types: {'metadata': 'struct<createtime:timestamp,lastupdatedtime:timestamp>', 'currencyref': 'struct<value:varchar(255),name:varchar(255)>', 'line': 'array<struct<id:string,description:string,amount:double,detailtype:string,journalentrylinedetail:struct<postingtype:string,accountref:struct<value:bigint,name:string>,entityref:struct<value:string,name:string>>>>'}
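A possible stopgap (hedged; lowercase_keys below is a hypothetical helper, not an awswrangler API) is to lowercase the keys of the nested records themselves so they match the lowercased schema and the values are at least not written as null, at the cost of losing the original casing. Reusing data and pd from the snippet above:

# Hypothetical workaround: recursively lowercase every key in the nested
# records so the data keys line up with the lowercased dtype/schema.
def lowercase_keys(obj):
    if isinstance(obj, dict):
        return {k.lower(): lowercase_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [lowercase_keys(v) for v in obj]
    return obj

df = pd.DataFrame([lowercase_keys(record) for record in data])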
When I don't provide a dtype to to_parquet, the column names are not lowercased and the data gets written correctly:
wr.s3.to_parquet(
    df=df,
    dataset=True,
    path='s3://path',
    database='test',
    table='complex_datatypes',
    glue_table_settings=wr.typing.GlueTableSettings(
        table_type='EXTERNAL_TABLE',
    ),
    mode='overwrite',
    use_threads=False,
    catalog_versioning=True,
)
Debug logs show the casing preserved for both column names and struct field names:
DEBUG:awswrangler._data_types:columns_types: {'metadata': StructType(struct<CreateTime: string, LastUpdatedTime: string>), 'currencyref': StructType(struct<name: string, value: string>), 'line': ListType(list<item: struct<Amount: double, Description: string, DetailType: string, Id: string, JournalEntryLineDetail: struct<AccountRef: struct<name: string, value: string>, Entity: struct<EntityRef: struct<name: string, value: string>, Type: string>, PostingType: string>>>)}
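To compare what each run actually registered, the catalog types can be dumped after the write (a sketch; assuming wr.catalog.get_table_types returns the Glue catalog's column-to-type mapping for the table):

# Hedged check: print the Athena/Glue types the catalog ended up with.
print(wr.catalog.get_table_types(database='test', table='complex_datatypes'))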
Expected behavior
Column and struct field name casing provided in dtype is preserved.
Your project
No response
Screenshots
No response
OS
Mac
Python version
3.10.x
AWS SDK for pandas version
3.2.0
Additional context
Related issue: #612