Description
Describe the bug
I'm using Data Wrangler to process some Excel spreadsheets stored on S3. The spreadsheets are multi-sheet. I've found that wr.s3.read_excel
reads .xlsx
objects fine. But it fails when reading .xls
objects. My testing indicates that the problem with .xls
is for multi-sheet spreadsheets only. BTW, my testing also indicates that reading my test multi-sheet .xls
files works OK using the basic Pandas pd.read_excel
method.
Environment
asn1crypto==1.4.0
astroid==2.3.0
awswrangler==2.8.0
backcall==0.2.0
beautifulsoup4==4.9.3
boto3==1.17.86
botocore==1.20.45
certifi==2021.5.30
chardet==4.0.0
decorator==5.0.5
Django==2.0.2
et-xmlfile==1.1.0
git-remote-codecommit==1.15.1
idna==2.10
ikp3db==1.4.1
importlib-metadata==3.10.0
ipython==7.22.0
ipython-genutils==0.2.0
isort==4.3.21
jedi==0.18.0
jmespath==0.10.0
lazy-object-proxy==1.6.0
lxml==4.6.3
mccabe==0.6.1
numpy==1.20.3
openpyxl==3.0.7
pandas==1.2.4
parso==0.8.2
pbr==5.5.1
pexpect==4.8.0
pg8000==1.19.5
pickleshare==0.7.5
prompt-toolkit==3.0.18
ptyprocess==0.7.0
pyarrow==4.0.1
Pygments==2.8.1
pylint==2.4.4
pylint-django==2.3.0
pylint-flask==0.6
pylint-plugin-utils==0.6
PyMySQL==1.0.2
python-dateutil==2.8.1
pytz==2021.1
redshift-connector==2.0.881
requests==2.25.1
s3transfer==0.4.2
scramp==1.4.0
six==1.15.0
soupsieve==2.2.1
stevedore==3.3.0
traitlets==5.0.5
typed-ast==1.4.2
typing-extensions==3.7.4.3
urllib3==1.26.4
virtualenv==16.2.0
virtualenv-clone==0.5.4
virtualenvwrapper==4.8.4
wcwidth==0.2.5
wrapt==1.12.1
xlrd==2.0.1
zipp==3.4.1
To Reproduce
- In Excel create a new spreadsheet. Add a second tab to it called 'Sheet2'. Type a small table of rows and columns into that sheet.
- Save the spreadsheet as
demo_excel_multi_sheet.xlsx
. - Save the spreadsheet a second time as
demo_excel_multi_sheet.xls
. - Upload copies of both spreadsheets to a test S3 bucket.
- Run the following program using Python 3,
import awswrangler as wr
import pandas as pd
import boto3
import io
data_bucket = <Name of your bucket>
xlsx_object = 'demo_excel_multi_sheet.xlsx'
xls_object = 'demo_excel_multi_sheet.xls'
# XLSX on S3 using Data Wrangler works.
xlsx_s3_df = wr.s3.read_excel('s3://' + data_bucket + '/' + xlsx_object, sheet_name='Sheet2')
print(xlsx_s3_df.head())
# XLS locally also works.
xls_local_df = pd.read_excel(xls_object, sheet_name='Sheet2')
print(xls_local_df.head())
# XLS on S3 using Boto3 works.
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=data_bucket, Key=xls_object)
data = obj['Body'].read()
df_boto = pd.read_excel(io.BytesIO(data), sheet_name='Sheet2')
print(df_boto.head())
# XLS on S3 using Data Wrangler doesn't work.
xls_s3_df = wr.s3.read_excel('s3://' + data_bucket + '/' + xls_object, sheet_name='Sheet2')
print(xls_s3_df.head())
Notice that the first 3 print(xyz.head())
statements succeed, but the 4th one is never reached (error message on previous line of code).
P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.