
to_parquet with dtype mutates input DataFrame #669

Closed
@eferm

Description

Describe the bug

Hi! When writing with to_parquet and passing the dtype parameter, my input DataFrame is mutated in place. More specifically, the affected columns are changed in place to the pandas nullable dtypes equivalent to the types specified in dtype.

My expectation is that the dtype casts would only be applied to the serialized data, and would not have side effects on my input DataFrame.

However, if this is expected behavior, perhaps the docs could be clarified?

To Reproduce

My environment:

$ python --version
Python 3.8.9
$ pip freeze | grep awswrangler
awswrangler==2.6.0

Simple repro:

>>> import awswrangler as wr
>>> import pandas as pd
>>> df = pd.DataFrame([1, 2, 3], columns=["a"])
>>> df.dtypes
a    int64
dtype: object
>>> wr.s3.to_parquet(df, "s3://bucket/path/to/file.parquet", dtype={"a": "bigint"})
{'paths': ['s3://bucket/path/to/file.parquet'], 'partitions_values': {}}
>>> df.dtypes
a    Int64
dtype: object
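
Presumably the casts are being applied to the columns of the frame that was passed in, rather than to a copy that is then serialized (just my guess at the mechanism). A pandas-only illustration of the difference, with no awswrangler involved:

>>> import pandas as pd
>>> df1 = pd.DataFrame([1, 2, 3], columns=["a"])
>>> df1["a"] = df1["a"].astype("Int64")  # cast assigned back onto the original frame
>>> df1.dtypes                           # the caller's frame now has the nullable dtype
a    Int64
dtype: object
>>> df2 = pd.DataFrame([1, 2, 3], columns=["a"])
>>> tmp = df2.copy()
>>> tmp["a"] = tmp["a"].astype("Int64")  # cast applied to a copy instead
>>> df2.dtypes                           # the caller's frame is untouched
a    int64
dtype: object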

My work-around is to pass in a copy of the DataFrame:

>>> wr.s3.to_parquet(df.copy(), "s3://bucket/path/to/file.parquet", dtype={"a": "bigint"})
{'paths': ['s3://bucket/path/to/file.parquet'], 'partitions_values': {}}
>>> df.dtypes
a    int64
dtype: object
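
For anyone else running into this, a small wrapper that always copies before writing saves call sites from having to remember .copy(). The function name here is my own, not part of awswrangler:

>>> def to_parquet_copy(df, path, **kwargs):
...     # Hypothetical helper: hand awswrangler a copy so the caller's frame is never mutated.
...     return wr.s3.to_parquet(df.copy(), path, **kwargs)
...
>>> to_parquet_copy(df, "s3://bucket/path/to/file.parquet", dtype={"a": "bigint"})
{'paths': ['s3://bucket/path/to/file.parquet'], 'partitions_values': {}}
>>> df.dtypes
a    int64
dtype: object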

Labels

bug (Something isn't working)