to_parquet lowercases column names passed in by dtype #2357

Closed

Reported by @henriblancke

Describe the bug

to_parquet lowercases the column names (and nested struct field names) passed in via dtype, so the data is written to Parquet incorrectly and the types registered in Glue/Athena no longer match. In practice the affected columns are not written at all and show up as null when queried.

How to Reproduce

import pandas as pd
import awswrangler as wr

data = [
    {
      "MetaData": {
        "CreateTime": "2023-05-31T20:37:17-07:00",
        "LastUpdatedTime": "2023-05-31T20:37:17-07:00"
      },
      "CurrencyRef": { "value": "USD", "name": "United States Dollar" },
      "Line": [
        {
          "Id": "0",
          "Description": "some description",
          "Amount": 3000.72,
          "DetailType": "JournalEntryLineDetail",
          "JournalEntryLineDetail": {
            "PostingType": "Debit",
            "AccountRef": {
              "value": "1",
              "name": "Cash and cash equivalents"
            }
          }
        },
        {
          "Id": "1",
          "Description": "some description",
          "Amount": 5128.72,
          "DetailType": "JournalEntryLineDetail",
          "JournalEntryLineDetail": {
            "PostingType": "Credit",
            "Entity": {
              "Type": "Vendor",
              "EntityRef": { "value": "1", "name": "Vendor" }
            },
            "AccountRef": {
              "value": "2",
              "name": "Accounts payable"
            }
          }
        }
      ]
    }
]

df = pd.DataFrame(data)

dtype = {
    "MetaData": "struct<CreateTime:timestamp, LastUpdatedTime:timestamp>",
    "CurrencyRef": "struct<value:varchar(255), name:varchar(255)>",
    "Line": "array<struct<Id:string, Description:string, Amount:double, DetailType:string, JournalEntryLineDetail:struct<PostingType:string, AccountRef:struct<value:bigint,name:string>, EntityRef:struct<value:string, name:string>>>>"
}

wr.s3.to_parquet(
    df=df,
    dataset=True,
    path='s3://path',
    database='test',
    table='complex_datatypes',
    glue_table_settings=wr.typing.GlueTableSettings(
        table_type='EXTERNAL_TABLE',
    ),
    mode='overwrite',
    use_threads=False,
    catalog_versioning=True,
    dtype=dtype,
)

The debug logs show both the top-level column names and the nested struct field names lowercased:

DEBUG:awswrangler._data_types:athena_columns_types: {'metadata': 'struct<createtime:timestamp,lastupdatedtime:timestamp>', 'currencyref': 'struct<value:varchar(255),name:varchar(255)>', 'line': 'array<struct<id:string,description:string,amount:double,detailtype:string,journalentrylinedetail:struct<postingtype:string,accountref:struct<value:bigint,name:string>,entityref:struct<value:string,name:string>>>>'}
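To make the impact concrete: Arrow/Parquet struct field names are case-sensitive, so a schema that declares createtime never matches the CreateTime field actually present in the data, which is consistent with the columns coming back as null. Below is a minimal sketch using plain pyarrow (not awswrangler internals; the local file name is made up) that shows the mismatch:

import pyarrow as pa
import pyarrow.parquet as pq

# Tiny table whose struct field uses the original mixed casing.
table = pa.table({"MetaData": [{"CreateTime": "2023-05-31T20:37:17-07:00"}]})
pq.write_table(table, "casing_demo.parquet")  # hypothetical local file

# Struct field lookups are case-sensitive: the mixed-case name resolves,
# the lowercased one does not.
struct_type = pq.read_schema("casing_demo.parquet").field("MetaData").type
print(struct_type.get_field_index("CreateTime"))  # 0
print(struct_type.get_field_index("createtime"))  # -1 (no such field)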

When I don't provide a dtype to to_parquet, the column names are not lowercased and the data is written correctly:

wr.s3.to_parquet(
    df=df,
    dataset=True,
    path='s3://path',
    database='test',
    table='complex_datatypes',
    glue_table_settings=wr.typing.GlueTableSettings(
        table_type='EXTERNAL_TABLE',
    ),
    mode='overwrite',
    use_threads=False,
    catalog_versioning=True,
)

Debug logs showing that the nested struct field names keep their casing:

DEBUG:awswrangler._data_types:columns_types: {'metadata': StructType(struct<CreateTime: string, LastUpdatedTime: string>), 'currencyref': StructType(struct<name: string, value: string>), 'line': ListType(list<item: struct<Amount: double, Description: string, DetailType: string, Id: string, JournalEntryLineDetail: struct<AccountRef: struct<name: string, value: string>, Entity: struct<EntityRef: struct<name: string, value: string>, Type: string>, PostingType: string>>>)}

Expected behavior

The casing of the column and struct field names given in dtype is preserved.
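For reference, a minimal pyarrow sketch (again not awswrangler code; trimmed to two of the columns from the repro, local file name made up) showing that a schema built with the mixed-case names from dtype round-trips through Parquet with casing intact, which is the behavior expected here:

from datetime import datetime

import pyarrow as pa
import pyarrow.parquet as pq

# Schema mirroring the mixed-case names from the dtype above (trimmed).
schema = pa.schema([
    pa.field("MetaData", pa.struct([
        pa.field("CreateTime", pa.timestamp("ms")),
        pa.field("LastUpdatedTime", pa.timestamp("ms")),
    ])),
    pa.field("CurrencyRef", pa.struct([
        pa.field("value", pa.string()),
        pa.field("name", pa.string()),
    ])),
])

table = pa.table(
    {
        "MetaData": [{
            "CreateTime": datetime(2023, 5, 31, 20, 37, 17),
            "LastUpdatedTime": datetime(2023, 5, 31, 20, 37, 17),
        }],
        "CurrencyRef": [{"value": "USD", "name": "United States Dollar"}],
    },
    schema=schema,
)
pq.write_table(table, "expected_casing.parquet")  # hypothetical local file
print(pq.read_schema("expected_casing.parquet"))  # field names keep their casing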

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.10.x

AWS SDK for pandas version

3.2.0

Additional context

Related issue: #612
