Append incremental n to duplicate cols recursively #1124

Merged: 16 commits, Jan 21, 2022

kailukowiak
Contributor

Fixes #1119 (hopefully)

I added:

  • A function, rename_duplicate_columns, which recursively appends _n to duplicated column names.
  • A flag on sanitize_dataframe_columns_names that accepts one of ['warn', 'drop', 'rename']: it either leaves the DataFrame alone and warns, deletes all but the first of each set of duplicated columns, or appends a number to the duplicated columns.
  • Some tests in athena_test.py covering this functionality. I'm not sure this is the right place, but the other column sanitizer tests were there.
  • Exported rename_duplicate_columns
  • Imported warnings.
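
For readers following along, the behaviour described in the list above can be sketched roughly as follows. This is a minimal illustration on a plain list of column names, not the code in this PR: the names rename_duplicates_sketch and handle_duplicates_sketch and the mode argument are made up for the example, and the exact suffix format used in the PR may differ.

```python
import warnings
from collections import Counter
from typing import List


def rename_duplicates_sketch(names: List[str]) -> List[str]:
    """Append _1, _2, ... to repeated names; recurse until all are unique."""
    seen: Counter = Counter()
    renamed = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones get a numeric suffix.
        renamed.append(name if count == 0 else f"{name}_{count}")
        seen[name] += 1
    # A new suffix can collide with an existing name (e.g. "a_1"), so repeat.
    if len(set(renamed)) != len(renamed):
        return rename_duplicates_sketch(renamed)
    return renamed


def handle_duplicates_sketch(names: List[str], mode: str = "warn") -> List[str]:
    """Hypothetical dispatcher for the three modes named in the description."""
    if mode == "warn":
        if len(set(names)) != len(names):
            warnings.warn("Duplicate column names detected", UserWarning)
        return list(names)
    if mode == "drop":
        # Keep only the first occurrence of each name, preserving order.
        return list(dict.fromkeys(names))
    if mode == "rename":
        return rename_duplicates_sketch(names)
    raise ValueError(f"Unknown mode: {mode!r}")
```

An input such as ["a", "a", "a_1"] needs a second pass, because the first pass produces a second "a_1"; that kind of collision is why the renaming has to recurse rather than run once.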

I'm not sure I followed how you handle warnings, as I saw different syntax in other parts of the code, but it should be easy to modify if I got it wrong.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: e3df738
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@kukushking
Contributor

Static checking has failed. Excerpt from the build logs:

ERROR: /codebuild/output/src093804598/src/github.com/awslabs/aws-data-wrangler/awswrangler/catalog/_utils.py Imports are incorrectly sorted and/or formatted.
ERROR: /codebuild/output/src093804598/src/github.com/awslabs/aws-data-wrangler/awswrangler/catalog/__init__.py Imports are incorrectly sorted and/or formatted.

Please run fix.sh (or isort .) to fix

@jaidisido left a comment
Contributor

Thanks, left a few comments.

Note: you can run the ./validate.sh script from the repository root locally to catch static check errors (isort, black, doc8...) before pushing to the repo.

@@ -246,6 +246,12 @@ def test_athena_read_list(glue_database):
    wr.athena.read_sql_query(sql="SELECT ARRAY[1, 2, 3]", database=glue_database, ctas_approach=False)


def test_sanitize_dataframe_column_names():
    assert wr.catalog.sanitize_dataframe_columns_names(df=pd.DataFrame({'A': [1, 2]})).equals(pd.DataFrame({'a': [1, 2]}))  # Unsure how to test for warnings
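
On the open question in the test comment above: one possible way to assert that a warning is raised uses only the standard library's warnings.catch_warnings. The sketch below is illustrative; sanitize_names_sketch is a hypothetical stand-in that works on a list of names, not the real wr.catalog function, and the warning category and message are assumptions.

```python
import warnings
from typing import List


def sanitize_names_sketch(names: List[str]) -> List[str]:
    """Hypothetical stand-in: lower-cases names and warns on duplicates."""
    lowered = [name.lower() for name in names]
    if len(set(lowered)) != len(lowered):
        warnings.warn("Duplicate column names detected", UserWarning)
    return lowered


def test_warns_on_duplicates() -> None:
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones a filter would normally suppress.
        warnings.simplefilter("always")
        result = sanitize_names_sketch(["A", "a"])
    assert result == ["a", "a"]
    assert len(caught) == 1
    assert issubclass(caught[0].category, UserWarning)
```

Since the test suite here runs under pytest, wrapping the call in `with pytest.warns(UserWarning):` would likely be the shorter equivalent.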

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: 80ba39f
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: 4ea3662
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: f7a7265
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: 8a6e177
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: f0ccf2b
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: 656e908
  • Result: FAILED
  • Build Logs (available for 30 days)

@jaidisido
Contributor

AWS CodeBuild CI Report

  • CodeBuild project: GitHubCodeBuild8756EF16-pDO66x4b9gEu
  • Commit ID: 97396ce
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

@jaidisido jaidisido merged commit e1cd200 into aws:main Jan 21, 2022
@kailukowiak kailukowiak deleted the rename-dup-cols branch March 17, 2022 22:42
Successfully merging this pull request may close these issues.

wr.catalog.sanitize_dataframe_columns_names does not sanitize enough
3 participants