-
Notifications
You must be signed in to change notification settings - Fork 707
Mysql upsert #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mysql upsert #613
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great stuff, thank you for submitting. I left a few comments
awswrangler/mysql.py
Outdated
upsert_duplicate_key: Performs an upsert using `ON DUPLICATE KEY` clause. Requires table schema to have | ||
defined keys, otherwise duplicate records will be inserted. | ||
upsert_replace_into: Performs upsert using `REPLACE INTO` clause. Less efficient and still requires the | ||
table schema to have keys or else duplicate records will be inserted upsert_distinct: Inserts new records, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we bring upsert_distinct
into a new line for readability?
awswrangler/mysql.py
Outdated
if mode.strip().lower() not in [ | ||
"append", | ||
"overwrite", | ||
"upsert_replace_into", | ||
"upsert_duplicate_key", | ||
"upsert_distinct", | ||
]: | ||
raise exceptions.InvalidArgumentValue( | ||
"mode must be one of append, overwrite, upsert_replace_into, upsert_duplicate_key, upsert_distinct" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we simplify this to:
mode = mode.strip().lower()
modes = [
"append",
"overwrite",
"upsert_replace_into",
"upsert_duplicate_key",
"upsert_distinct",
]
if mode not in modes:
raise exceptions.InvalidArgumentValue(
f"mode must be one of {', '.join(modes)}"
)
This way we only define the list of modes once and likewise only have to perform the strip().lower()
operation once on mode
awswrangler/mysql.py
Outdated
_logger.debug("sql: %s", sql) | ||
cursor.executemany(sql, (parameters,)) | ||
con.commit() | ||
if mode.lower().strip() == "upsert_distinct": | ||
temp_table = f"{table}_{''.join(random.choice(string.ascii_lowercase) for i in range(10))}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
temp_table = f"{table}_{uuid.uuid4().hex}"
is simpler and only requires one library
tests/test_mysql.py
Outdated
wr.mysql.to_sql( | ||
df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_distinct", use_column_names=True | ||
) | ||
wr.mysql.to_sql( | ||
df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_distinct", use_column_names=True | ||
) | ||
wr.mysql.to_sql( | ||
df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_distinct", use_column_names=True | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we test a few more edge cases for all methods please? For instance what if the primary key values change.
df = pd.DataFrame({"c0": ["foo", "bar"], "c2": [1, 2]})
wr.mysql.to_sql(
df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_replace_into", use_column_names=True
)
df2 = wr.mysql.read_sql_table(con=mysql_con, schema="test", table=mysql_table)
assert bool(len(df2) == 2)
wr.mysql.to_sql(
df=df, con=mysql_con, schema="test", table=mysql_table, mode="upsert_replace_into", use_column_names=True
)
df3 = pd.DataFrame({"c0": ["baz", "bar"], "c2": [3, 2]})
wr.mysql.to_sql(
df=df3, con=mysql_con, schema="test", table=mysql_table, mode="upsert_replace_into", use_column_names=True
)
df4 = wr.mysql.read_sql_table(con=mysql_con, schema="test", table=mysql_table)
assert bool(len(df4) == 3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I will work through more scenarios. Should they be placed into the same test functions with additional assertions or would it be better to have separate functions for each case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think additional assertions within the same test functions is fine unless you feel it's getting too big or the cases do not belong to the same logical test. Thanks!
The latest commit should have the requested changes. Please let me know if I missed anything. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor change requested otherwise looks great, thanks
Issue #, if available:
#608
Description of changes:
Adding basic upsert capabilities to
mysql.to_sql()
function.mode
supports 3 new values:upsert_duplicate_key
: Performs an upsert usingON DUPLICATE KEY
clause. Requires table schema to have defined keys, otherwise duplicate records will be inserted.upsert_replace_into
: Performs upsert usingREPLACE INTO
clause. Less efficient and still requires the table schema to have keys or else duplicate records will be insertedupsert_distinct
: Inserts new records, including duplicates, then recreates the table and insertsDISTINCT
records from old table. This is the least efficient approach, but handles scenarios where there are no keys on table.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.