Description
Describe the bug
In distributed mode, reading a large number of small S3 objects (e.g. 20M files) is slow and may eventually fail.
This is caused by the list objects call, which is not currently parallelised and represents a bottleneck.
As a side note, no information is surfaced to the user during the list objects call, so they have no visibility into why the job is hanging. Additional logging would improve that.
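To illustrate the logging gap: a minimal sketch of what periodic progress logging inside the pagination loop could look like. `list_with_progress` is a hypothetical helper, not an existing awswrangler function; it assumes the caller already has a botocore page iterator.

```python
import logging

logger = logging.getLogger("awswrangler.s3._list")


def list_with_progress(response_iterator, log_every=1000):
    # Hypothetical helper: collect keys from ListObjectsV2 pages while
    # periodically logging progress so long listings are visible to the user.
    keys = []
    for page_number, page in enumerate(response_iterator, start=1):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if page_number % log_every == 0:
            logger.info(
                "Listed %d pages (%d objects so far)", page_number, len(keys)
            )
    return keys
```

With the default page size of 1000 seen in the debug logs, a 20M-object listing spans ~20,000 pages, so even coarse per-N-pages logging would surface meaningful progress.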
How to Reproduce
Setup
- 20 million files with .csv extension
- each file contains 5 lines with 4 columns
Script
import logging

import awswrangler as wr

logging.getLogger("awswrangler").setLevel(logging.DEBUG)

df = wr.s3.read_csv("s3://test/small-files/input/20m-csv-partition/")
print(df.head())
Logs
2023-02-02 01:00:33,152 - awswrangler._config - DEBUG - Applying default config argument verify with value None.
2023-02-02 01:00:33,155 - botocore.credentials - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2023-02-02 01:00:33,268 - awswrangler.s3._list - DEBUG - args: {'Bucket': 'test', 'Prefix': 'small-files/input/20m-csv-partition/', 'PaginationConfig': {'PageSize': 1000}}
2023-02-02 01:00:33,550 - awswrangler.s3._list - DEBUG - Skipping empty file: s3://test/small-files/input/20m-csv-partition/
2023-02-02 02:00:16,026 - __main__ - INFO - File "<stdin>", line 8, in <module>
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_read_text.py", line 281, in read_csv
return _read_text_format(
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_read_text.py", line 91, in _read_text_format
paths: List[str] = _path2list(
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_list.py", line 31, in _path2list
paths: List[str] = list_objects( # type: ignore
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_list.py", line 358, in list_objects
return [path for paths in result_iterator for path in paths]
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_list.py", line 358, in <listcomp>
return [path for paths in result_iterator for path in paths]
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/awswrangler/s3/_list.py", line 110, in _list_objects
for page in response_iterator: # pylint: disable=too-many-nested-blocks
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/botocore/paginate.py", line 269, in __iter__
response = self._make_request(current_kwargs)
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/botocore/paginate.py", line 357, in _make_request
return self._method(**current_kwargs)
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/botocore/client.py", line 530, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/opt/amazon/python3.9-ray/lib/python3.9/site-packages/botocore/client.py", line 960, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ExpiredToken) when calling the ListObjectsV2 operation: The provided token has expired.
2023-02-02 02:00:16,026 - __main__ - WARNING - botocore.exceptions.ClientError: An error occurred (ExpiredToken) when calling the ListObjectsV2 operation: The provided token has expired.
The script eventually fails with an expired token: per the timestamps above, the single-threaded ListObjectsV2 pagination runs for roughly an hour (01:00 to 02:00) before the credentials expire.
Expected behavior
No response
Your project
No response
Screenshots
No response
OS
Unix
Python version
3.9
AWS SDK for pandas version
3.0.0rc2
Tasks
- Create 20M CSV files in S3 bucket (APG)
- Test with ray.data.read_csv and compare performance
- Consider delegating path listing to Ray, or see if we can replicate the same logic
- Explore parallelising S3 list objects call
  - Ray implementation: https://github.com/ray-project/ray/blob/master/python/ray/data/datasource/file_meta_provider.py#L189
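One possible shape for the parallel-listing task: shard the prefix into narrower prefixes and paginate each shard concurrently. This is a rough sketch, not the awswrangler implementation; `shard_prefixes` and `list_objects_parallel` are hypothetical names, and sharding by a fixed alphabet only helps when the key naming under the prefix actually spans those characters.

```python
import string
from concurrent.futures import ThreadPoolExecutor


def shard_prefixes(prefix, alphabet=string.ascii_lowercase + string.digits):
    # Split one prefix into narrower prefixes so each shard can be
    # listed by a separate worker. Assumes keys start with a character
    # from `alphabet`; keys outside it would be missed.
    return [prefix + ch for ch in alphabet]


def list_objects_parallel(client, bucket, prefix, max_workers=16):
    # Each worker paginates one shard with ListObjectsV2; results are
    # flattened at the end. `client` is a boto3 S3 client (or anything
    # exposing get_paginator, which makes this easy to test).
    def list_shard(shard_prefix):
        keys = []
        paginator = client.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=shard_prefix):
            keys.extend(obj["Key"] for obj in page.get("Contents", []))
        return keys

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(list_shard, shard_prefixes(prefix))
    return [key for shard in results for key in shard]
```

ListObjectsV2 pagination within a single prefix is inherently sequential (each page needs the previous continuation token), so parallelism has to come from splitting the key space as above, as the linked Ray file_meta_provider does.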