Describe the bug
Hi! When writing with to_parquet, my input DataFrame mutates in place when I pass the dtype parameter. More specifically, the DataFrame's columns are changed in place to the pandas nullable dtypes equivalent to the types given in dtype.
My expectation is that the dtype casts would be applied only to the serialized data, not that they would have side effects on my input DataFrame.
However, if this is expected behavior, perhaps the docs could be clarified?
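For illustration, here is a minimal pandas-only sketch of the behavior I suspect is happening internally (cast_columns is a hypothetical helper, not awswrangler code): assigning a cast column back onto the caller's DataFrame mutates the original object rather than a copy.

```python
import pandas as pd

def cast_columns(df, dtype):
    # Hypothetical helper mirroring what to_parquet appears to do:
    # assigning back into df's columns mutates the caller's DataFrame.
    for col, typ in dtype.items():
        df[col] = df[col].astype(typ)
    return df

df = pd.DataFrame({"a": [1, 2, 3]})          # dtype int64
cast_columns(df, {"a": "Int64"})             # pandas nullable integer
print(df.dtypes["a"])                        # Int64 -- original frame changed
```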
To Reproduce
My environment:
$ python --version
Python 3.8.9
$ pip freeze | grep awswrangler
awswrangler==2.6.0
Simple repro:
>>> import awswrangler as wr
>>> import pandas as pd
>>> df = pd.DataFrame([1, 2, 3], columns=["a"])
>>> df.dtypes
a int64
dtype: object
>>> wr.s3.to_parquet(df, "s3://bucket/path/to/file.parquet", dtype={"a": "bigint"})
{'paths': ['s3://bucket/path/to/file.parquet'], 'partitions_values': {}}
>>> df.dtypes
a Int64
dtype: object
My work-around is to pass in a copy of the DataFrame:
>>> wr.s3.to_parquet(df.copy(), "s3://bucket/path/to/file.parquet", dtype={"a": "bigint"})
{'paths': ['s3://bucket/path/to/file.parquet'], 'partitions_values': {}}
>>> df.dtypes
a int64
dtype: object