Skip to content

set encoding to utf-8 when no encoding is specified when reading/writing to s3 #1257

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Apr 12, 2022

Conversation

cnfait
Copy link
Contributor

@cnfait cnfait commented Apr 5, 2022

Feature or Bugfix

  • Bugfix

Detail

  • set encoding to utf-8 when no encoding is specified when reading/writing to s3
  • on Windows, using athena.read_sql_query with ctas_approach=False and column names containing special characters fails with an UnicodeDecodeError. When setting the encoding to utf-8 the error is gone.
  • surfacing pandas_kwargs (or just an encoding argument) all the way up to read_sql_query may not be the best course of action given how utf-8 is widespread nowadays and Athena-own support of other encodings may be lacking and not working as intended. On the other hand, utf-8 can largely be considered the default and this change is unlikely to cause side-effects. I'm happy to discuss this decision though!

Relates

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: 8135053
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@cnfait cnfait self-assigned this Apr 5, 2022
Copy link
Contributor

@jaidisido jaidisido left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the premise that utf-8 could be a sane default here.

Instead of fixing it in every call however, we should probably consider fixing the issue at the source. It seems that utf-8 is already set as a default but at the method parameter level which is an anti-pattern. It should instead be defined within the method itself

@jaidisido
Copy link
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: af76f24
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@jaidisido jaidisido merged commit 28e3b30 into main Apr 12, 2022
@jaidisido jaidisido deleted the windows-unicodedecodeerror branch April 14, 2022 10:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants