Mysql upsert #613

Merged
merged 6 commits into from
Mar 23, 2021

Conversation

mattboyd-aws
Contributor

Issue #, if available:
#608

Description of changes:
Adds basic upsert capabilities to the mysql.to_sql() function.

mode supports 3 new values:

  • upsert_duplicate_key: Performs an upsert using the ON DUPLICATE KEY clause. Requires the table schema to have defined keys; otherwise duplicate records will be inserted.
  • upsert_replace_into: Performs an upsert using the REPLACE INTO clause. Less efficient, and still requires the table schema to have keys; otherwise duplicate records will be inserted.
  • upsert_distinct: Inserts new records, including duplicates, then recreates the table and inserts DISTINCT records from the old table. This is the least efficient approach, but it handles tables that have no keys.
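
For illustration, the two key-based modes map onto MySQL statements roughly as shown here. This is a minimal sketch and not the code in this PR: `build_upsert_sql` is a hypothetical helper, and the backtick quoting and `%s` placeholders are assumptions about how the statement would be parameterized.

```python
from typing import List


def build_upsert_sql(table: str, columns: List[str], mode: str) -> str:
    """Build a parameterized MySQL statement for a key-based upsert mode.

    Hypothetical helper for illustration only; the names and quoting
    style are assumptions, not the library's implementation.
    """
    cols = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    if mode == "upsert_replace_into":
        # REPLACE deletes any row with a conflicting key, then inserts.
        return f"REPLACE INTO `{table}` ({cols}) VALUES ({placeholders})"
    if mode == "upsert_duplicate_key":
        # On a key conflict, update the existing row with the new values.
        updates = ", ".join(f"`{c}`=VALUES(`{c}`)" for c in columns)
        return (
            f"INSERT INTO `{table}` ({cols}) VALUES ({placeholders}) "
            f"ON DUPLICATE KEY UPDATE {updates}"
        )
    raise ValueError(f"unsupported mode: {mode}")
```

Both statements only behave as upserts when the table has a primary or unique key; without one, each call simply inserts new rows.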

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaidisido self-requested a review March 22, 2021 14:24
@jaidisido added the enhancement label (New feature or request) Mar 22, 2021
@jaidisido added this to the 2.7.0 milestone Mar 22, 2021
Contributor

@jaidisido left a comment

Great stuff, thank you for submitting. I left a few comments

upsert_duplicate_key: Performs an upsert using `ON DUPLICATE KEY` clause. Requires table schema to have
defined keys, otherwise duplicate records will be inserted.
upsert_replace_into: Performs upsert using `REPLACE INTO` clause. Less efficient and still requires the
table schema to have keys or else duplicate records will be inserted upsert_distinct: Inserts new records,
Contributor

Could we bring upsert_distinct into a new line for readability?

Comment on lines 337 to 346
if mode.strip().lower() not in [
    "append",
    "overwrite",
    "upsert_replace_into",
    "upsert_duplicate_key",
    "upsert_distinct",
]:
    raise exceptions.InvalidArgumentValue(
        "mode must be one of append, overwrite, upsert_replace_into, upsert_duplicate_key, upsert_distinct"
    )
Contributor

@jaidisido Mar 22, 2021

How about we simplify this to:

    mode = mode.strip().lower()
    modes = [
        "append",
        "overwrite",
        "upsert_replace_into",
        "upsert_duplicate_key",
        "upsert_distinct",
    ]
    if mode not in modes:
        raise exceptions.InvalidArgumentValue(
            f"mode must be one of {', '.join(modes)}"
        )

This way we define the list of modes only once, and we call strip().lower() on mode only once.

_logger.debug("sql: %s", sql)
cursor.executemany(sql, (parameters,))
con.commit()
if mode.lower().strip() == "upsert_distinct":
    temp_table = f"{table}_{''.join(random.choice(string.ascii_lowercase) for i in range(10))}"
Contributor

temp_table = f"{table}_{uuid.uuid4().hex}" is simpler and requires only one library (uuid rather than random and string).
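
To make the upsert_distinct flow concrete, here is a rough sketch of deduplicating a keyless table through a uniquely named temporary table. It runs against an in-memory SQLite database from the standard library as a stand-in for MySQL. `dedupe_table` is a hypothetical helper, and the create/drop/rename sequence is one possible way to "recreate the table", not necessarily the one used in this PR.

```python
import sqlite3
import uuid


def dedupe_table(con: sqlite3.Connection, table: str) -> None:
    """Rebuild `table` so that it holds only DISTINCT rows.

    Hypothetical helper: SQLite stands in for MySQL here, and table names
    are interpolated directly, which is acceptable only for a sketch.
    """
    # uuid4().hex gives a collision-resistant suffix with a single import.
    temp_table = f"{table}_{uuid.uuid4().hex}"
    cur = con.cursor()
    cur.execute(f'CREATE TABLE "{temp_table}" AS SELECT DISTINCT * FROM "{table}"')
    cur.execute(f'DROP TABLE "{table}"')
    cur.execute(f'ALTER TABLE "{temp_table}" RENAME TO "{table}"')
    con.commit()


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (c0 TEXT, c2 INTEGER)")  # no keys defined
rows = [("foo", 1), ("bar", 2)]
con.executemany("INSERT INTO t VALUES (?, ?)", rows)
con.executemany("INSERT INTO t VALUES (?, ?)", rows)  # duplicates inserted

dedupe_table(con, "t")
row_count = con.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

After the rebuild, the table holds each distinct row once, so `row_count` is 2 even though four rows were inserted.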

Comment on lines 213 to 221
wr.mysql.to_sql(
    df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_distinct", use_column_names=True
)
wr.mysql.to_sql(
    df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_distinct", use_column_names=True
)
wr.mysql.to_sql(
    df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_distinct", use_column_names=True
)
Contributor

Can we test a few more edge cases for all methods, please? For instance, what happens if the primary key values change?

    df = pd.DataFrame({"c0": ["foo", "bar"], "c2": [1, 2]})

    wr.mysql.to_sql(
        df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_replace_into", use_column_names=True
    )
    df2 = wr.mysql.read_sql_table(con=mysql_con, schema="test", table=mysql_table)
    assert len(df2) == 2
    wr.mysql.to_sql(
        df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_replace_into", use_column_names=True
    )
    df3 = pd.DataFrame({"c0": ["baz", "bar"], "c2": [3, 2]})
    wr.mysql.to_sql(
        df=df3, con=mysql_con, schema="test", table=mysql_table, mode="upsert_replace_into", use_column_names=True
    )
    df4 = wr.mysql.read_sql_table(con=mysql_con, schema="test", table=mysql_table)
    assert len(df4) == 3
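
The row counts expected above can be checked without a MySQL server. The sketch below uses an in-memory SQLite database, which also accepts `REPLACE INTO`, as a stand-in. It assumes `c0` is the primary key of the test table, which this conversation does not state, and `replace_rows` is a hypothetical helper.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Assumption: c0 is the primary key, so REPLACE has a key to match on.
con.execute("CREATE TABLE t (c0 TEXT PRIMARY KEY, c2 INTEGER)")


def replace_rows(rows):
    """Upsert rows with REPLACE INTO and return the resulting row count."""
    con.executemany("REPLACE INTO t (c0, c2) VALUES (?, ?)", rows)
    con.commit()
    return con.execute("SELECT COUNT(*) FROM t").fetchone()[0]


count_first = replace_rows([("foo", 1), ("bar", 2)])   # two new keys
count_repeat = replace_rows([("foo", 1), ("bar", 2)])  # same keys again
count_changed = replace_rows([("baz", 3), ("bar", 2)])  # one new key, one existing
```

Repeating the same keys leaves the count at 2, and introducing the new key `baz` raises it to 3, matching the assertions in the suggested test.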

Contributor Author

Yes, I will work through more scenarios. Should they be placed into the same test functions with additional assertions or would it be better to have separate functions for each case?

Contributor

I think additional assertions within the same test functions are fine, unless you feel a function is getting too big or the cases do not belong to the same logical test. Thanks!

@jaidisido added the minor release label (Will be addressed in the next minor release) Mar 22, 2021
@senorkrabs

The latest commit should have the requested changes. Please let me know if I missed anything.

Contributor

@jaidisido left a comment

Minor change requested; otherwise looks great, thanks.

@jaidisido merged commit d11ae04 into aws:main Mar 23, 2021
@jaidisido linked an issue Mar 23, 2021 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Support for upserting to MySQL
3 participants