Feature/ssb benchmark #2280

Open · wants to merge 2,765 commits into base: main
Conversation

@ghafek ghafek commented Jun 29, 2025

No description provided.

mboehm7 and others added 30 commits October 24, 2024 19:54
This patch resolves a remaining FIXME after improved rewrite code
coverage by fixing the expressions and other rewrite configs so the
test actually triggers the existing rewrite.
This patch makes some simple performance improvements in order to
reduce the runtime of the sparse component tests (300+s -> 30s). In
detail the runtime of specific tests improved as follows:

* SparseBlockMerge:         149s -> 14.7s
* SparseBlockIndexRange:    110s -> 13.4s
* SparseBlockGetFirstIndex:  29s ->  1.3s
This patch adds real-data tests for the new adasyn builtin function,
and changes it to a vectorized implementation that extracts
over-sampled rows via a randomized permutation matrix multiply.
On the Diabetes dataset (with moderate class imbalance of 500 vs 268)
ADASYN slightly improves the test accuracy from 78.3 to 78.7%. It is
also noteworthy that the original ADASYN paper from 2008 only achieved
0.6831 and 0.6833 (with ADASYN) on this dataset.
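A minimal sketch of the selection-matrix idea in plain Java (class and method names are illustrative, not the actual builtin implementation; a full permutation matrix would additionally shuffle, here we only select rows):

```java
import java.util.Random;

// Illustrative sketch (not SystemDS internals): over-sampling rows of X by
// multiplying with a randomized 0/1 selection matrix P, so that P %*% X
// yields k sampled rows via a single matrix multiply.
public class SelectionMultiply {
	public static double[][] sampleRows(double[][] X, int k, long seed) {
		int n = X.length, m = X[0].length;
		Random rnd = new Random(seed);
		// P is k x n with exactly one 1 per row (the sampled row index)
		double[][] P = new double[k][n];
		for (int i = 0; i < k; i++)
			P[i][rnd.nextInt(n)] = 1.0;
		// out = P %*% X
		double[][] out = new double[k][m];
		for (int i = 0; i < k; i++)
			for (int j = 0; j < n; j++)
				if (P[i][j] != 0)
					for (int c = 0; c < m; c++)
						out[i][c] += P[i][j] * X[j][c];
		return out;
	}
}
```

Each output row is an exact copy of some input row, which is what makes the extraction amenable to a single (sparse) matrix multiply.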
This generalizes the adasyn test for additional real data set. On the
titantic dataset, adasyn gives a 1.6% improvement of test accuracy
(for a basic logreg model, 0.781 -> 0.797).
This patch fixes endless loops in transformencode when the tfspec
references columns outside the valid column range.
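The kind of guard that prevents such a loop can be sketched as follows (a hypothetical helper, not the actual transformencode code): reject out-of-range tfspec column ids up front instead of iterating past the frame's columns.

```java
// Illustrative guard: fail fast if the tfspec references a column id
// outside [1, ncol], rather than looping endlessly during encoding.
public class TfSpecGuard {
	public static void validateColumns(int[] specCols, int ncol) {
		for (int c : specCols)
			if (c < 1 || c > ncol)
				throw new IllegalArgumentException("tfspec references column "
					+ c + " outside the valid range [1," + ncol + "]");
	}
}
```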
The multi-threaded transpose of ultra-sparse matrices has a couple
of shortcomings (e.g., counting column nnz, block allocation, too-late
fallback to single-threaded execution). On a large 85M x 85M graph with 90M
non-zeros, the transpose did not finish in hours. This patch
introduces a more sophisticated sparse row iterator (with row and column
lower/upper bounds) in order to facilitate a simple and fast transpose
for ultra-sparse matrices. However, this implementation was still much
slower than falling back to single-threaded operations, and thus we now
use the single-threaded transpose for all ultra-sparse matrices instead of
only when nnz < max(rows,cols). Now this operation completes in <9s.
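The dispatch change can be sketched as follows (illustrative code, not SystemDS internals; the ultra-sparse classification is approximated here by a tiny sparsity threshold, which is an assumption):

```java
// Illustrative sketch of the dispatch change: previously the single-threaded
// fallback applied only if nnz < max(rows, cols); the broadened rule sends
// all ultra-sparse matrices down the single-threaded path.
public class TransposeDispatch {
	// old condition: fall back only for extremely few non-zeros
	public static boolean oldFallback(long rows, long cols, long nnz) {
		return nnz < Math.max(rows, cols);
	}
	// broadened condition: classify by sparsity (threshold is an assumption)
	public static boolean newFallback(long rows, long cols, long nnz,
		double ultraSparseThreshold)
	{
		double sparsity = (double) nnz / rows / cols;
		return sparsity < ultraSparseThreshold;
	}
}
```

With the 85M x 85M graph and 90M non-zeros from above, the old condition does not trigger (90M >= max(85M, 85M)), while a sparsity-based check does, since the sparsity is on the order of 1e-8.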
There was a regression where all sparse matrix-vector elementwise
operations were executed single-threaded. This patch fixes
the most important branch for sparse-safe matrix-vector operations;
the remaining cases need to be fixed in subsequent tasks.

When running connected components on the Europe road network, the
individual binary multiply operations improved by 10-20x on a box with
48 vcores. End-to-end the entire components() invocation with 20
iterations improved from 282s (246s for b(*)) to 112s (75s for b(*)).
The 10x improvements do not carry fully through because the output MCSR
is converted to CSR when appending to the buffer pool (57s of 75s).
This patch adds the missing multi-threading for all cases of binary
elementwise operations, except one special case that directly constructs
a CSR output. Furthermore, in safeBinaryMVSparseDenseRow we now avoid
unnecessary allocation of temporary vectors by filling in place
on the first output row of every task.
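The row-partitioned multi-threading pattern described above can be sketched like this (dense elementwise multiply shown; illustrative code, not SystemDS's actual task framework):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch of row-partitioned multi-threading for a binary
// elementwise operation: split the rows into k contiguous blocks,
// process each block in its own task, and wait for all tasks.
public class ParallelElementwise {
	public static double[][] multiply(double[][] A, double[][] B, int k) {
		final int rows = A.length, cols = A[0].length;
		final double[][] C = new double[rows][cols];
		ExecutorService pool = Executors.newFixedThreadPool(k);
		try {
			List<Future<?>> tasks = new ArrayList<>();
			int blk = (rows + k - 1) / k; // rows per task, rounded up
			for (int t = 0; t < k; t++) {
				final int lo = t * blk, hi = Math.min(rows, lo + blk);
				tasks.add(pool.submit(() -> {
					for (int i = lo; i < hi; i++)
						for (int j = 0; j < cols; j++)
							C[i][j] = A[i][j] * B[i][j];
				}));
			}
			for (Future<?> f : tasks) {
				try { f.get(); } // propagate task failures
				catch (Exception e) { throw new RuntimeException(e); }
			}
		}
		finally {
			pool.shutdown();
		}
		return C;
	}
}
```

Tasks write disjoint row ranges of the shared output, so no synchronization is needed beyond joining the futures.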
This patch adds a test that systematically applies the single- and
multi-threaded writers/readers for matrices and frames, all formats,
as well as dense and sparse data.

These tests also revealed bugs in the HDF5 readers/writers, where
incorrect data was read for single-threaded sparse as well as
multi-threaded dense and sparse cases.
Baunsgaard and others added 25 commits May 15, 2025 11:51
This commit adds vectorized kernels for matrix multiplication.

Using the Vector API, single-threaded dense matrix multiplication improves
by ~80% on our AMD box and by ~60% on Intel. These measurements include
the allocation overhead of the output and assume ideal conditions where
the input is cached and JIT compilation is done.

The biggest change for users is that SystemDS now requires
`--add-modules=jdk.incubator.vector` on all execution calls. This commit
modifies all scripts accordingly. However, any calling code that bypasses
bin/systemds and invokes Java directly must be modified as well.

To measure the performance difference on your machine, use the added script:

src/test/scripts/performance/matrixMultiplication.sh

Closes apache#2216
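Concretely, a direct Java invocation needs the incubator module flag added; a sketch, assuming the standard DMLScript entry point (the jar path and script name below are placeholders):

```shell
# Direct Java invocation with the required incubator module enabled
java --add-modules=jdk.incubator.vector \
  -cp target/SystemDS.jar org.apache.sysds.api.DMLScript \
  -f myScript.dml
```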
This patch adds an initial version of the representation optimizer for the Scuro library. It is a two-stage optimization: in the first stage, the best unimodal representation for the given raw modalities is found; in the second stage, the k-best unimodal representations are combined into multimodal representations and evaluated against the target downstream task. Additionally, this patch adds tests for each stage of the optimizer.

Closes apache#2267
This patch downgrades the library versions of Scuro dependencies.

Closes apache#2269
This patch fixes the incorrect size propagation of unique which led
to incorrect results if the dimensions are used in subsequent ops.
Thanks to Chi-Hsin Huang for catching this bug.

Furthermore, this patch also includes minor updates for code quality
(removed unused imports and annotated unused functions).
This patch fixes issues in the test DML scripts in terms of missing
casts from 1-by-1 matrices to scalars. Interestingly, the tests ran
fine in local environments because the parser validation runs
differently there, and these 1-by-1 matrices were automatically
rewritten to scalars.