(@scale): Reading a large number of small S3 objects is slow and might eventually fail #1982

Closed
@jaidisido

Describe the bug

In distributed mode, reading a large number of small S3 objects (e.g. 20M files) is slow and might eventually fail.

This is caused by the list objects call, which is not currently parallelised and is the bottleneck. With a page size of 1,000 keys, listing 20M objects takes at least 20,000 sequential ListObjectsV2 requests.

As a side note, nothing is surfaced to the user while the list objects call is running, so they have no visibility into why the job appears to hang. Additional logging would improve that.

How to Reproduce

Setup

  • 20 million files with .csv extension
  • each file contains 5 lines with 4 columns

Script

import logging

import awswrangler as wr

# Enable debug output from the library's loggers
logging.getLogger("awswrangler").setLevel(logging.DEBUG)

df = wr.s3.read_csv("s3://test/small-files/input/20m-csv-partition/")
print(df.head())

Logs

2023-02-02 01:00:33,152 - awswrangler._config - DEBUG - Applying default config argument verify with value None.
2023-02-02 01:00:33,155 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2023-02-02 01:00:33,268 - awswrangler.s3._list - DEBUG - args: {'Bucket': 'test', 'Prefix': 'small-files/input/20m-csv-partition/', 'PaginationConfig': {'PageSize': 1000}}
2023-02-02 01:00:33,550 - awswrangler.s3._list - DEBUG - Skipping empty file: s3://test/small-files/input/20m-csv-partition/
2023-02-02 02:00:16,026 - __main__ - INFO -   File "<stdin>", line 8, in <module>
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_read_text.py", line 281, in read_csv
    return _read_text_format(
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_read_text.py", line 91, in _read_text_format
    paths: List[str] = _path2list(
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_list.py", line 31, in _path2list
    paths: List[str] = list_objects(  # type: ignore
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_list.py", line 358, in list_objects
    return [path for paths in result_iterator for path in paths]
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_list.py", line 358, in <listcomp>
    return [path for paths in result_iterator for path in paths]
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_list.py", line 110, in _list_objects
    for page in response_iterator:  # pylint: disable=too-many-nested-blocks
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/botocore/paginate.py", line 269, in __iter__
    response = self._make_request(current_kwargs)
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/botocore/paginate.py", line 357, in _make_request
    return self._method(**current_kwargs)
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ExpiredToken) when calling the ListObjectsV2 operation: The provided token has expired.
2023-02-02 02:00:16,026 - __main__ - WARNING - botocore.exceptions.ClientError: An error occurred (ExpiredToken) when calling the ListObjectsV2 operation: The provided token has expired.

The script eventually fails with an ExpiredToken error roughly one hour after it started, while still listing objects.

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Unix

Python version

3.9

AWS SDK for pandas version

3.0.0rc2

Tasks

  • Create 20M CSV files in S3 bucket (APG)
  • Test with ray.data.read_csv and compare performance
  • Consider delegating path listing to Ray or see if we can replicate the same logic
  • Explore parallelising S3 list objects call

Ray implementation: https://github.com/ray-project/ray/blob/master/python/ray/data/datasource/file_meta_provider.py#L189

Labels

bug: Something isn't working
