Unable to use s3.read_parquet_table() to read from Glue tables whose name is the prefix of another table. #638

Closed
@vlieven

Describe the bug
When using the function s3.read_parquet_table(), Data Wrangler first looks up the table URI and uses it as the path argument for s3.read_parquet(). Relevant code snippet:

res: Dict[str, Any] = client_glue.get_table(**args)
try:
    path: str = res["Table"]["StorageDescriptor"]["Location"]

The issue with this piece of code is that the location in the storage descriptor returned by Glue does not necessarily end with a slash ("/"). When it is passed to s3.read_parquet() without one, it is treated as a plain key prefix, so it matches every object whose key starts with the table's path, including objects that belong to other tables. The issue becomes apparent when you have tables matching this format in Glue:

database.prefix
database.prefix_suffix

Attempting to read from database.prefix will also match files under database.prefix_suffix, resulting in an error of the form: InvalidArgumentValue: Object s3://<bucket>/<database>/<prefix_suffix>/<file>.parquet is not under the root path (s3://<bucket>/<database>/<prefix>/).
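The collision can be illustrated without touching AWS, since listing by prefix comes down to string matching on object keys. This is a minimal sketch only: the key names are made up for illustration, and `list_keys` is a hypothetical stand-in for an S3 prefix listing, not a Data Wrangler function.

```python
from typing import List

# Hypothetical object keys; the layout is an assumption for illustration.
KEYS: List[str] = [
    "database/prefix/part-0.parquet",
    "database/prefix/part-1.parquet",
    "database/prefix_suffix/part-0.parquet",
]


def list_keys(keys: List[str], prefix: str) -> List[str]:
    """Mimic a prefix listing: return every key that starts with `prefix`."""
    return [key for key in keys if key.startswith(prefix)]


# Without the trailing slash, the sibling table's file is matched as well.
without_slash = list_keys(KEYS, "database/prefix")

# With the trailing slash, only the intended table's files are matched.
with_slash = list_keys(KEYS, "database/prefix/")

print(without_slash)
print(with_slash)
```

The first listing includes the file under database/prefix_suffix, which is the object the error message complains about.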

A simple fix could be to append the "/" suffix to the path when it is not present:

path: str = res["Table"]["StorageDescriptor"]["Location"]
path = path if path.endswith("/") else f"{path}/"
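The same normalization can also be applied on the caller's side. The sketch below is a possible workaround under stated assumptions, not the library's fix: `ensure_trailing_slash` and `read_table_by_location` are hypothetical helper names, and the approach assumes that reading the location directly with `wr.s3.read_parquet(..., dataset=True)` is an acceptable substitute for `s3.read_parquet_table()` in your case (it does not apply any catalog-driven options).

```python
def ensure_trailing_slash(path: str) -> str:
    """Append "/" to `path` unless it already ends with one."""
    return path if path.endswith("/") else f"{path}/"


def read_table_by_location(database: str, table: str):
    """Hypothetical workaround: resolve the table location, normalize it, then read it."""
    # Imported lazily so ensure_trailing_slash() is usable without awswrangler installed.
    import awswrangler as wr

    # Look up the table's S3 location in the Glue catalog.
    location = wr.catalog.get_table_location(database=database, table=table)
    # Read only objects under the normalized "directory" path.
    return wr.s3.read_parquet(path=ensure_trailing_slash(location), dataset=True)
```

Calling `read_table_by_location("test_database", "test_table")` requires AWS credentials and an existing table; the slash helper itself is pure and idempotent.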

To Reproduce
Tested with Python 3.7.9 and Wrangler 2.6.0, installed via pip.

Steps to reproduce the behavior:

  • Create a Glue database "test_database" with two tables "test_table" and "test_table_extra".
  • Write some data to these tables so they are populated.
  • Use Data Wrangler to read data from "test_table", e.g.
import awswrangler as wr
df = wr.s3.read_parquet_table(database='test_database', table='test_table')

You will get an error indicating files belonging to test_table_extra are not under the root path of test_table.


Labels: bug, minor release, ready to release
