[SYSTEMDS-3669] Computation of Shapley Values #1946

louislepage · 2023-11-19T19:46:16Z

This PR is for a master's thesis on computing Shapley values at scale with SystemDS.

louislepage · 2023-12-11T07:48:42Z

I rewrote the generic Shapley value computation with sampling and added an example script, as well as a Jupyter notebook in which I compared the results of the official SHAP package with those of my SystemDS implementation.

The results (at least in the case of scaled data) look good. However, I found that the rbind() calls during preparation of the instances matrix are very slow and take the longest. Therefore, I would like to revisit this to make further optimizations, but I think I need some advice on how to make those appends/writes to large matrices faster.

Here is a quick plot of the computed values for the 107 features of the transformencoded adult dataset.
Both implementations used the full ~32000 samples as background data and ran for 10000 iterations for each sample.
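
For readers unfamiliar with sampling-based Shapley estimation, the following is a minimal Python sketch of one common variant (a random walk over feature permutations, starting from a random background row). It is not the DML code from this PR and not the SHAP package's implementation; the function name `shapley_sampling_sketch`, its parameters, and the toy linear model are assumptions made only for illustration.

```python
import random


def shapley_sampling_sketch(model, x, background, n_iter=1000, seed=0):
    """Monte Carlo estimate of Shapley values for a single instance x.

    model      -- callable taking a list of feature values, returning a float
    x          -- the instance to explain (list of floats)
    background -- list of background rows (each a list of floats)
    All names here are hypothetical and chosen for this sketch only.
    """
    rng = random.Random(seed)
    n_features = len(x)
    phi = [0.0] * n_features
    for _ in range(n_iter):
        z = rng.choice(background)        # random background row
        order = list(range(n_features))
        rng.shuffle(order)                # random feature permutation
        current = list(z)                 # start from the background row
        prev = model(current)
        for j in order:
            current[j] = x[j]             # switch feature j to the instance value
            cur = model(current)
            phi[j] += cur - prev          # marginal contribution of feature j
            prev = cur
    return [p / n_iter for p in phi]


# Toy usage with an assumed linear model and a single all-zero background row.
weights = [2.0, -1.0, 0.5]


def linear_model(v):
    return sum(w * a for w, a in zip(weights, v))


phi = shapley_sampling_sketch(linear_model, [1.0, 2.0, 3.0], [[0.0, 0.0, 0.0]], n_iter=50)
```

For a linear model and a single background row, every permutation yields the same contribution `w_j * (x_j - z_j)`, so the estimate does not depend on the number of iterations. For non-linear models the estimate only converges as the iteration count grows.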

louislepage · 2023-12-23T14:45:02Z

I was able to run it on 5,000 to 50,000 samples and, as expected, it scaled linearly.
However, this also showed that the Python implementation has a huge overhead for small sample sizes but scales at virtually the same rate.

rbind() is still the slowest single operation for very large sample sizes, but I was unable to write directly to rows of a pre-allocated matrix, which could be beneficial in this case.

Heavy hitter instructions:
  #  Instruction                    Time(s)  Count
  1  shapley_sampling                56.988      1
  2  shapley_sampling_prepare        29.756    107
  3  rbind                           22.870      2
  4  !                               10.097    107
  5  append                           6.878    219
  6  rand                             4.483    219
  7  leftIndex                        3.574    321
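
The trade-off mentioned above, growing a result by repeated appends versus writing into a pre-allocated result, can be illustrated with a small Python sketch. It only shows the general pattern on plain lists of rows and says nothing about how SystemDS executes rbind or left-indexing internally. Both helper names are hypothetical.

```python
def build_by_append(blocks):
    """Grow the result by concatenation, analogous to repeated appends.

    Each step creates a new list and copies every row collected so far,
    so the total copying work grows quadratically with the number of blocks.
    """
    result = []
    for block in blocks:
        result = result + block
    return result


def build_preallocated(blocks):
    """Allocate the full result once, then write each block into its row range."""
    total_rows = sum(len(block) for block in blocks)
    result = [None] * total_rows
    start = 0
    for block in blocks:
        end = start + len(block)
        result[start:end] = block   # same-length slice assignment, no regrowth
        start = end
    return result


# Four blocks of three rows each; both strategies yield the same 12 rows.
blocks = [[[i, i + 1] for _ in range(3)] for i in range(4)]
appended = build_by_append(blocks)
preallocated = build_preallocated(blocks)
```

The pre-allocated variant needs the final number of rows up front, which is usually known when the number of samples and permutations is fixed before the loop.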

Baunsgaard · 2023-12-28T21:25:13Z

Hi @louislepage

Once you want a review, mark it in this PR.

As an initial comment, try to avoid including the run times or other CSV files, and modify your Python notebook to be scripts that run instead. Removing the notebook gives us clearer ways of comparing the performance and removes the need to start a notebook.

Once you are happy with the behavior of your Shapley operator it would be great if you move it from staging to a builtin function.
If it is unclear how to do this, comment on this PR, and we will help you out.

Thanks!

louislepage · 2024-01-26T08:01:16Z

Thanks @Baunsgaard for your feedback! :)

I rewrote the notebook as a Python script and removed the CSVs.
But I still have to fix an rbind issue in the Shapley computation in SystemDS.

I will let you know as soon as I am done with this.
Also, I think @christinadionysio told me she would look after this PR, if I understood her correctly, as she is supervising my thesis.

Anyways, I still have to do some work on this before it can be moved to a builtin-function, so I better get to it! :D

christinadionysio · 2024-01-26T08:41:12Z

Thank you @louislepage for this contribution.

As we discussed in the last meeting, the code looks good so far.
Please add the license header to the test_runtimes.sh file.
Additionally, I created a Jira issue for your thesis, so you can replace [WIP] with [SYSTEMDS-3669].

codecov · 2024-07-13T16:46:46Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@505f871). Learn more about missing BASE report.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1946   +/-   ##
=======================================
  Coverage        ?   68.82%           
  Complexity      ?    40707           
=======================================
  Files           ?     1440           
  Lines           ?   161565           
  Branches        ?    31418           
=======================================
  Hits            ?   111200           
  Misses          ?    41303           
  Partials        ?     9062


louislepage · 2024-07-25T05:13:52Z

After finishing the thesis, I started moving the explainer to builtin as scripts/builtin/shapExplainer.dml.

For the final merge, all files under scripts/staging/shapley_values will be removed. The experiments are available as part of the reproducibility repository of the thesis.

Unit and component tests were added in test/functions/builtin/part2/BuiltinShapExplainerTest.java, which uses tests written in DML in test/scripts/functions/builtin/shapExplainerUnit.dml and test/scripts/functions/builtin/shapExplainerComponent.dml. The DML files also store the expected results for each test, and the results are compared within the Java test. This may be a bit unconventional, but it reduces the amount of files and code, since otherwise all expected results would have to be written in Java and each unit test would need its own DML script. I hope this is fine.

christinadionysio

Thank you for your contribution @louislepage! Overall it looks really good, I also like your reproducibility repository! I only left a few minor comments about formatting and tests.

christinadionysio · 2024-07-30T08:26:37Z

scripts/builtin/shapExplainer.dml

+# S              Matrix holding the shapley values along the cols, one row per instance.
+# expected       Double holding the average prediction of all instances.
+# -----------------------------------------------------------------------------
+s_shapExplainer = function(String model_function, list[unknown] model_args, Matrix[Double] x_instances,


Please reformat the function declarations so that you use 4 spaces for every parameter that is introduced in a new line and 2 spaces for the return value and move the opening bracket { to a new line. Please double check for all other functions in this file.

example = function(String x, Double y, ...
    Matrix[Double] z)
  return Double r
{
}

christinadionysio · 2024-07-30T08:29:41Z

scripts/builtin/shapExplainer.dml

+
+  # create row indicator vector ctable
+  perm_mask_rows = seq(1,perm_cols)
+  #TODO: col-vector and matrix mult?


Is this TODO still needed? If not please remove it and also double check for the other occurrences in this script.

christinadionysio · 2024-07-30T08:35:01Z

src/test/java/org/apache/sysds/test/functions/builtin/part2/BuiltinShapExplainerTest.java

+
+    @Test
+    public void testPrepareMaskForPermutation() {
+        runShapExplainerUnitTest("prepare_mask_for_permutation");


We use these tests to test the whole functionality of the scripts (more like your component test). The test cases should cover different input parameters and edge cases. It would be great if you could modify the tests in that regard to keep them consistent with our other tests. Please add additional component tests for the different modes (HYBRID, SPARK).

mboehm7 · 2024-10-27T16:25:18Z

LGTM - thanks @louislepage for this great work. I will now merge it in, despite the remaining minor comments, in order to move this along and facilitate follow-up work. During the merge, I fixed some rebase conflicts, fixed the formatting of the java tests (tabs over spaces), eliminated remaining warnings, and added a FIXME for the unnecessary padding with zero in the test. Thanks.

louislepage marked this pull request as ready for review December 11, 2023 07:48

louislepage force-pushed the shap-values branch from 9cfcc9d to e2974ea Compare December 12, 2023 08:09

louislepage force-pushed the shap-values branch from 28939cd to c0343c0 Compare January 26, 2024 07:52

louislepage changed the title ~~[WIP] Computation of Shapley Values~~ [SYSTEMDS-3669] Computation of Shapley Values Feb 7, 2024

j143 added this to the systemds-3.2.0 milestone Feb 8, 2024

louislepage force-pushed the shap-values branch 2 times, most recently from 25d8af5 to 8937054 Compare March 23, 2024 09:01

louislepage force-pushed the shap-values branch from 8937054 to c91aa9b Compare April 2, 2024 09:36

louislepage force-pushed the shap-values branch 2 times, most recently from 0bc46f4 to 80d837e Compare July 25, 2024 05:00

louislepage and others added 14 commits July 25, 2024 07:01

[MA THESIS - SHAPLEY VALUES] add shapley sampling for xgboost models

8dfbe77

[MA THESIS - SHAPLEY VALUES] add prepare and compute to make model agnostic

7fb9269

[MA THESIS - SHAPLEY VALUES] add all in one function for sampling

ecdcd47

[MA THESIS - SHAPLEY VALUES] add scripts and jupyternotebook for evaluation

8c5aaf4

[MA THESIS - SHAPLEY VALUES] new plot for evaluation

ad05b98

[MA THESIS - SHAPLEY VALUES] add sampling with replacement and script to evaluate runtimes

53dfd8a

[MA THESIS - SHAPLEY VALUES] add evaluation results

65c6b8c

[MA THESIS - SHAPLEY VALUES] delete data directory

4d454c3

[MA THESIS - SHAPLEY VALUES] rewrite as python script

49cf4b7

[SYSTEMDS-3669] add license to bash script

aaba643

[SYSTEMDS-3669] use par for and copy direct to result matrix

333031d

[SYSTEMDS-3669] iterative prototype for shapley values by permutation

d1bc8b3

[SYSTEMDS-3669] optimized permutation shap and testsuite

3c0d8c8

[SYSTEMDS-3669] prototype of reuse of maskes for multirow case

170495d

louislepage added 21 commits July 25, 2024 07:01

[SYSTEMDS-3669] add testscripts for permutation runtime

36859d8

[SYSTEMDS-3669] add support for non-varying indices

d59eb5b

[SYSTEMDS-3669] add support for removal of non-varying indices

880639b

[SYSTEMDS-3669] add test for iterative approach

bd89b35

[SYSTEMDS-3669] bug fixes due to new formats

fd404b8

[SYSTEMDS-3669] add partitions to mask prep and fix typo

3ad6f3a

[SYSTEMDS-3669] add partitions support to by-row

3099b51

[SYSTEMDS-3669] finalised partitions for use with explainer

463091c

[SYSTEMDS-3669] add to permutation experiments script

0a46bbd

[SYSTEMDS-3669] turn parfor back on...

bfc7ac1

[SYSTEMDS-3669] add l2svm to experiments

4dc1591

[SYSTEMDS-3669] add l2svm to python experiments

9e45517

[SYSTEMDS-3669] minor informations logging touch ups

2f6e467

[SYSTEMDS-3669] add support for fnn and minor fixes

3e83317

[SYSTEMDS-3669] add final method in its own directory

24c4546

[SYSTEMDS-3669] add infos in final method

918ff22

[SYSTEMDS-3669] add license

d82c058

[SYSTEMDS-3669] move explainer to builtin

e702210

[SYSTEMDS-3669] refactor parameter names

e2b6e8c

[SYSTEMDS-3669] add first unit tests as java tests

055a886

[SYSTEMDS-3669] add unit tests and component test with dummy data

abf1b8a

louislepage force-pushed the shap-values branch from 80d837e to abf1b8a Compare July 25, 2024 05:02

christinadionysio reviewed Jul 30, 2024

mboehm7 closed this in b651db5 Oct 27, 2024
