Description
Describe the bug
When using the function s3.read_parquet_table(), Data Wrangler first looks up the table's location URI in the Glue catalog and passes it as the path argument to s3.read_parquet(). Relevant code snippet:
```python
res: Dict[str, Any] = client_glue.get_table(**args)
try:
    path: str = res["Table"]["StorageDescriptor"]["Location"]
```
The issue with this piece of code is that the location returned by Glue does not necessarily end with a slash ("/"). When such a path is passed to s3.read_parquet(), the S3 prefix listing expands everything under it, including objects belonging to other tables whose names start with the same string. The issue becomes apparent when you have tables like these in Glue:

- database.prefix
- database.prefix_suffix

Attempting to read from `database.prefix` will also match files under `database.prefix_suffix`, resulting in an error of the form:

```
InvalidArgumentValue: Object s3://<bucket>/<database>/<prefix_suffix>/<file>.parquet is not under the root path (s3://<bucket>/<database>/<prefix>/).
```
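The faulty match can be demonstrated without AWS at all, since S3 listing is a plain string-prefix comparison on object keys. The keys below are hypothetical, just to illustrate the behavior:

```python
# Hypothetical object keys for two tables whose names share a prefix.
keys = [
    "database/prefix/part-0.parquet",
    "database/prefix_suffix/part-0.parquet",
]

# Without a trailing slash, the prefix "database/prefix" matches BOTH
# tables, because S3 compares raw strings, not path components.
assert [k for k in keys if k.startswith("database/prefix")] == keys

# With the trailing slash, only the intended table's objects match.
assert [k for k in keys if k.startswith("database/prefix/")] == [
    "database/prefix/part-0.parquet"
]
```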
A simple fix would be to append the "/" suffix to the path when it is missing:

```python
path: str = res["Table"]["StorageDescriptor"]["Location"]
path = path if path.endswith("/") else f"{path}/"
```
To Reproduce
Tested with Python 3.7.9 and Wrangler 2.6.0, installed via pip.
Steps to reproduce the behavior:
- Create a Glue database "test_database" with two tables "test_table" and "test_table_extra".
- Write some data to these tables so they are populated.
- Use Data Wrangler to read data from "test_table", e.g.:

```python
import awswrangler as wr

df = wr.s3.read_parquet_table(database='test_database', table='test_table')
```
You will get an error indicating that files belonging to `test_table_extra` are not under the root path of `test_table`.