[SYSTEMDS-3782] Bag-of-words Encoder for CP #2130
This patch adds a new feature transformation, Bag-of-words (bow), to SystemDS' parallel feature transformation framework UPLIFT. The Bag-of-words encoder works similarly to scikit-learn's CountVectorizer ( https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html ).
Currently, the operation is only supported for CP. I had to adapt the framework a little, because the bow encoder behaves differently from other encoders and can create multiple non-zero values from a single input column. In comparison, encoders like Dummycode (dc) create more columns but always result in exactly one non-zero value per dc encoder. So, when a bow encoder is involved, the number of non-zero values (nnz) per row is not known upfront, which is problematic for a parallel apply with CSR matrix output.
I added a new field to the ColumnEncoder which contains the nnz for each row; it is known once the build is completed.
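To illustrate why these per-row nnz counts matter, here is a minimal sketch (plain Java with hypothetical names, not the actual SystemDS classes) of how the per-row counts reported by all encoders could be prefix-summed into CSR row pointers, so that every thread knows its write offsets before the parallel apply:

```java
import java.util.Arrays;

public class CsrOffsetSketch {
    /**
     * Combine the per-row nnz counts reported by each encoder (known after
     * its build phase) into CSR row pointers for the output matrix.
     * rowPtr[r] is the position of the first non-zero of row r, so each
     * thread can write its row partition independently during apply.
     */
    static int[] computeRowPointers(int[][] nnzPerEncoder, int numRows) {
        int[] rowPtr = new int[numRows + 1];
        for (int r = 0; r < numRows; r++) {
            int nnzRow = 0;
            for (int[] nnz : nnzPerEncoder)
                nnzRow += nnz[r]; // e.g., always 1 for recode/dummycode, variable for bow
            rowPtr[r + 1] = rowPtr[r] + nnzRow;
        }
        return rowPtr;
    }

    public static void main(String[] args) {
        // two encoders: a dummycode-like one (always 1 nnz) and a bow-like one (variable nnz)
        int[][] nnz = { {1, 1, 1}, {4, 0, 2} };
        System.out.println(Arrays.toString(computeRowPointers(nnz, 3))); // [0, 5, 6, 9]
    }
}
```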
The encoding again consists of two phases: a build phase, where we find the distinct tokens in the column, and an apply phase, where we count the occurrences of each distinct token in each row.
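The following is a minimal, self-contained sketch of these two phases (illustrative only, with simplified whitespace tokenisation and without the actual SystemDS encoder API):

```java
import java.util.*;

public class BowEncoderSketch {
    private final Map<String, Integer> dict = new HashMap<>(); // token -> output column index
    private final int[] nnzPerRow;                             // per-row nnz, known after build

    BowEncoderSketch(int numRows) { nnzPerRow = new int[numRows]; }

    /** Build phase: collect the distinct tokens of the column and the per-row nnz counts. */
    void build(List<String> column) {
        for (int r = 0; r < column.size(); r++) {
            Set<String> distinctInRow = new HashSet<>(tokenize(column.get(r)));
            for (String t : distinctInRow)
                dict.putIfAbsent(t, dict.size());
            nnzPerRow[r] = distinctInRow.size(); // one non-zero per distinct token in the row
        }
    }

    /** Apply phase: count the occurrences of each dictionary token in one row. */
    int[] apply(String text) {
        int[] counts = new int[dict.size()];
        for (String t : tokenize(text)) {
            Integer col = dict.get(t);
            if (col != null)
                counts[col]++;
        }
        return counts;
    }

    // simplified whitespace tokenisation; the default tokenisation is described further below
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase().split("\\s+"))
            if (!t.isEmpty())
                tokens.add(t);
        return tokens;
    }
}
```

In the actual framework, apply would write the counts directly into the shared CSR output at the offsets derived from the per-row nnz counts shown above, instead of returning a dense row.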
For a parallelised build phase, we need to know the number of distinct tokens upfront to avoid OOM errors caused by many large dictionaries. Similarly to recode, we estimate the number of distinct tokens of the bow dictionary by sampling a subset of rows. In contrast to recode, this estimation is computationally more expensive per row, because each sampled row has to go through the whole tokenisation process. A small experimental evaluation showed that the estimation took significantly longer for bow than for recode, but in the end it did not make up more than 10% of the whole encoding process.
One option to tackle this would be to reduce the number of sampled rows for the bow encoder, but this resulted in a (significantly) higher estimation error.
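For intuition only, the snippet below is a deliberately naive illustration of sample-based distinct-token estimation (not the estimator actually used); it mainly shows that every sampled row already pays the full tokenisation cost, unlike recode where a row is a single key:

```java
import java.util.*;

public class DistinctTokenEstimationSketch {
    /**
     * Naive illustration: tokenize only a sampled subset of rows, count the
     * distinct tokens in the sample, and scale by the sampling fraction.
     * Real sample-based estimators are more sophisticated; the point here is
     * the per-row tokenisation cost inside the sampling loop.
     */
    static long estimateDistinctTokens(List<String> column, double sampleFraction, long seed) {
        Random rnd = new Random(seed);
        Set<String> distinctInSample = new HashSet<>();
        for (String row : column)
            if (rnd.nextDouble() < sampleFraction) // full tokenisation for every sampled row
                distinctInSample.addAll(Arrays.asList(row.toLowerCase().split("\\s+")));
        return Math.round(distinctInSample.size() / sampleFraction);
    }
}
```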
Overall, the parallelisation of a single bow encoder over row partitions showed good runtime improvements (3.4x with 4 physical cores) on the Amazon digital music review dataset (131k reviews, 8M tokens, 127k distinct tokens).
The default tokenisation consists of converting the text to lowercase, replacing all non-alphabetic characters with whitespace, and splitting on whitespace.
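A direct Java rendition of that default tokenisation (illustrative; the actual implementation may differ in details):

```java
import java.util.ArrayList;
import java.util.List;

public class DefaultTokenizeSketch {
    /** Lowercase, replace non-alphabetic characters with whitespace, split on whitespace. */
    static List<String> tokenize(String text) {
        String normalized = text.toLowerCase().replaceAll("[^a-z]", " ");
        List<String> tokens = new ArrayList<>();
        for (String t : normalized.split("\\s+"))
            if (!t.isEmpty())
                tokens.add(t);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Loved it!! Best album of 2023...")); // [loved, it, best, album, of]
    }
}
```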
The current technique makes two passes over the data (build + apply). This could be avoided by storing the per-row token counts from the build phase and just setting the values in the apply phase, but it results in a much higher memory footprint, since we would additionally have to keep a hashmap of token counts for each row.
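A minimal sketch of that single-pass alternative (hypothetical, not part of this patch) makes the memory trade-off explicit: one token-count map per row has to survive from build to apply.

```java
import java.util.*;

public class SinglePassBowSketch {
    // One HashMap per row is retained between build and apply; this is the extra
    // memory footprint mentioned above, proportional to the tokens per row.
    final List<Map<String, Integer>> rowCounts = new ArrayList<>();
    final Map<String, Integer> dict = new HashMap<>();

    void build(List<String> column) {
        for (String row : column) {
            Map<String, Integer> counts = new HashMap<>();
            for (String t : row.toLowerCase().split("\\s+")) { // simplified tokenisation
                if (t.isEmpty()) continue;
                dict.putIfAbsent(t, dict.size());
                counts.merge(t, 1, Integer::sum);
            }
            rowCounts.add(counts); // kept in memory until apply
        }
    }

    /** Apply only copies the stored counts into the output row; no second pass over the text. */
    int[] apply(int row) {
        int[] out = new int[dict.size()];
        for (Map.Entry<String, Integer> e : rowCounts.get(row).entrySet())
            out[dict.get(e.getKey())] = e.getValue();
        return out;
    }
}
```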
A next step could be to estimate the memory consumption of this approach using the metadata from the distinct-token estimation, and to dynamically choose this technique if enough memory is available.
The tests include tests with a single bag-of-words transformation, multiple bag-of-words transformations, and combinations with other encoders (recode).