wr.s3.read_excel can't read multi-sheet .xls files #739

Closed
@johntelforduk

Describe the bug

I'm using Data Wrangler to process multi-sheet Excel spreadsheets stored on S3. wr.s3.read_excel reads .xlsx objects fine, but it fails when reading .xls objects. My testing indicates that the .xls problem affects multi-sheet spreadsheets only. The same multi-sheet .xls test files read correctly with the plain Pandas pd.read_excel method.

Environment

asn1crypto==1.4.0
astroid==2.3.0
awswrangler==2.8.0
backcall==0.2.0
beautifulsoup4==4.9.3
boto3==1.17.86
botocore==1.20.45
certifi==2021.5.30
chardet==4.0.0
decorator==5.0.5
Django==2.0.2
et-xmlfile==1.1.0
git-remote-codecommit==1.15.1
idna==2.10
ikp3db==1.4.1
importlib-metadata==3.10.0
ipython==7.22.0
ipython-genutils==0.2.0
isort==4.3.21
jedi==0.18.0
jmespath==0.10.0
lazy-object-proxy==1.6.0
lxml==4.6.3
mccabe==0.6.1
numpy==1.20.3
openpyxl==3.0.7
pandas==1.2.4
parso==0.8.2
pbr==5.5.1
pexpect==4.8.0
pg8000==1.19.5
pickleshare==0.7.5
prompt-toolkit==3.0.18
ptyprocess==0.7.0
pyarrow==4.0.1
Pygments==2.8.1
pylint==2.4.4
pylint-django==2.3.0
pylint-flask==0.6
pylint-plugin-utils==0.6
PyMySQL==1.0.2
python-dateutil==2.8.1
pytz==2021.1
redshift-connector==2.0.881
requests==2.25.1
s3transfer==0.4.2
scramp==1.4.0
six==1.15.0
soupsieve==2.2.1
stevedore==3.3.0
traitlets==5.0.5
typed-ast==1.4.2
typing-extensions==3.7.4.3
urllib3==1.26.4
virtualenv==16.2.0
virtualenv-clone==0.5.4
virtualenvwrapper==4.8.4
wcwidth==0.2.5
wrapt==1.12.1
xlrd==2.0.1
zipp==3.4.1
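
Only a few of the packages above are likely to matter for Excel reading: awswrangler itself, pandas, and the engines pandas uses for Excel files (xlrd for .xls, openpyxl for .xlsx). A minimal sketch for reporting just those versions follows. The helper name installed_versions is hypothetical, and it assumes Python 3.8+ for importlib.metadata.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages explicitly instead of failing.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages assumed relevant to reading Excel files through Data Wrangler.
    for name, version in installed_versions(
        ["awswrangler", "pandas", "xlrd", "openpyxl", "boto3"]
    ).items():
        print(f"{name}=={version}")
```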

To Reproduce

  1. In Excel create a new spreadsheet. Add a second tab to it called 'Sheet2'. Type a small table of rows and columns into that sheet.
  2. Save the spreadsheet as demo_excel_multi_sheet.xlsx.
  3. Save the spreadsheet a second time as demo_excel_multi_sheet.xls.
  4. Upload copies of both spreadsheets to a test S3 bucket.
  5. Run the following program using Python 3:
import io

import awswrangler as wr
import boto3
import pandas as pd

data_bucket = '<Name of your bucket>'  # Replace with the name of your test bucket.
xlsx_object = 'demo_excel_multi_sheet.xlsx'
xls_object = 'demo_excel_multi_sheet.xls'

# XLSX on S3 using Data Wrangler works.
xlsx_s3_df = wr.s3.read_excel(f's3://{data_bucket}/{xlsx_object}', sheet_name='Sheet2')
print(xlsx_s3_df.head())

# XLS locally also works (expects a copy of the file in the working directory).
xls_local_df = pd.read_excel(xls_object, sheet_name='Sheet2')
print(xls_local_df.head())

# XLS on S3 using Boto3 works: download the object into memory, then parse it.
s3_client = boto3.client('s3')
obj = s3_client.get_object(Bucket=data_bucket, Key=xls_object)
data = obj['Body'].read()
df_boto = pd.read_excel(io.BytesIO(data), sheet_name='Sheet2')
print(df_boto.head())

# XLS on S3 using Data Wrangler doesn't work.
xls_s3_df = wr.s3.read_excel(f's3://{data_bucket}/{xls_object}', sheet_name='Sheet2')
print(xls_s3_df.head())

Notice that the first 3 print(xyz.head()) statements succeed, but the 4th one is never reached, because the wr.s3.read_excel call on the line before it raises an error.
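
Since the Boto3 path in the reproduction succeeds, it could serve as a stopgap until wr.s3.read_excel handles these files. The sketch below wraps that same approach. The helper names fetch_s3_object and read_excel_sheet_from_s3 are hypothetical and not part of Data Wrangler. It assumes the object fits in memory and that xlrd is installed for .xls parsing.

```python
import io


def fetch_s3_object(s3_client, bucket, key):
    """Download an S3 object fully into memory and return a seekable buffer."""
    response = s3_client.get_object(Bucket=bucket, Key=key)
    return io.BytesIO(response["Body"].read())


def read_excel_sheet_from_s3(s3_client, bucket, key, sheet_name):
    """Read one sheet of an Excel object on S3, bypassing wr.s3.read_excel."""
    # Imported here so that fetch_s3_object stays usable without pandas.
    import pandas as pd

    buffer = fetch_s3_object(s3_client, bucket, key)
    return pd.read_excel(buffer, sheet_name=sheet_name)


if __name__ == "__main__":
    import boto3

    # Placeholder bucket name: replace with your own test bucket.
    df = read_excel_sheet_from_s3(
        boto3.client("s3"),
        "<Name of your bucket>",
        "demo_excel_multi_sheet.xls",
        "Sheet2",
    )
    print(df.head())
```

Passing the client in as an argument keeps credentials and region configuration with the caller, and lets the download step be exercised with a stub client.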


Labels: bug (Something isn't working), minor release (Will be addressed in the next minor release), ready to release
