-
Notifications
You must be signed in to change notification settings - Fork 500
[SYSTEMDS-3669] Computation of Shapley Values #1946
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9cfcc9d
to
e2974ea
Compare
I was able to run it on 5.000 to 50.000 samples and, as expected, it did scale linearly. And rbind() is still the slowest single operation for very large sample-sizes, but i was unable to directly write to rows of a pre-allocated matrix, which could be benifical in this case.
|
Hi @louislepage Once you want a review, mark it in this PR. As an initial comment, try to avoid including the run times or other CSV files, and modify your Python notebook to be scripts that run instead. Removing the notebook gives us clearer ways of comparing the performance and removes the need to start a notebook. Once you are happy with the behavior of your Shapley operator it would be great if you move it from staging to a builtin function. Thanks! |
28939cd
to
c0343c0
Compare
Thanks @Baunsgaard for your feedback! :) I rewrote the notebook as a python script and removed the CSVs. I will let you know as soon as I am done with this. Anyways, I still have to do some work on this before it can be moved to a builtin-function, so I better get to it! :D |
Thank you @louislepage for this contribution. As we talked about in the last meeting the code looks good so far. |
25d8af5
to
8937054
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1946 +/- ##
=======================================
Coverage ? 68.82%
Complexity ? 40707
=======================================
Files ? 1440
Lines ? 161565
Branches ? 31418
=======================================
Hits ? 111200
Misses ? 41303
Partials ? 9062 ☔ View full report in Codecov by Sentry. |
0bc46f4
to
80d837e
Compare
… to evaluate runtimes
After finishing the thesis, I started moving the explainer was to builtin as scripts/builtin/shapExplainer.dml. For the final merge, all files under scripts/staging/shapley_values will be removed. The experiments are available as part of the reproducibility repository of the thesis. Unit- and component tests were added in test/functions/builtin/part2/BuiltinShapExplainerTest.java which use tests written in DML in test/scripts/functions/builtin/shapExplainerUnit.dml and test/scripts/functions/builtin/shapExplainerComponent.dml. The DML file also stores the expected results for each test, results are compared within the javatest. This may be a bit unconventional, but it reduces the amount of files/code since otherwise all expected results would have to be written in java and each unit test would need to go to its own DML script. I hope this is fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your contribution @louislepage! Overall it looks really good, I also like your reproducibility repository! I only left a few minor comments about formatting and tests.
# S Matrix holding the shapley values along the cols, one row per instance. | ||
# expected Double holding the average prediction of all instances. | ||
# ----------------------------------------------------------------------------- | ||
s_shapExplainer = function(String model_function, list[unknown] model_args, Matrix[Double] x_instances, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please reformat the function declarations so that you use 4 spaces for every parameter that is introduced in a new line and 2 spaces for the return value and move the opening bracket {
to a new line. Please double check for all other functions in this file.
example = function(String x, Double y, ...
Matrix[Double] z)
return Double r
{
}
|
||
# create row indicator vector ctable | ||
perm_mask_rows = seq(1,perm_cols) | ||
#TODO: col-vector and matrix mult? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this TODO still needed? If not please remove it and also double check for the other occurrences in this script.
|
||
@Test | ||
public void testPrepareMaskForPermutation() { | ||
runShapExplainerUnitTest("prepare_mask_for_permutation"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use these tests to test the whole functionality of the scripts (more like your component test). The test cases should cover different input parameters and edge cases. It would be great if you could modify the tests in that regard to keep them consistent with our other tests. Please add additional component tests for the different modes (HYBRID, SPARK).
LGTM - thanks @louislepage for this great work. I will now merge it in, despite the remaining minor comments, in order to move this along and facilitate follow-up work. During the merge, I fixed some rebase conflicts, fixed the formatting of the java tests (tabs over spaces), eliminated remaining warnings, and added a FIXME for the unnecessary padding with zero in the test. Thanks. |
This is a PR for a master thesis about computation of shapley values with systemds at scale.