[SYSTEMDS-3782] Bag-of-words Encoder for CP #2130

Closed
wants to merge 1 commit into from

Conversation

e-strauss
Contributor

@e-strauss e-strauss commented Oct 18, 2024

[SYSTEMDS-3782] Bag-of-words Encoder for CP

This patch adds a new feature transformation, Bag-of-words (bow), to SystemDS's parallel feature transformation framework UPLIFT. The Bag-of-words encoder works similarly to scikit-learn's CountVectorizer ( https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html ).

Currently, the operation is only supported for CP. I had to adapt the framework slightly, because the bow encoder behaves differently from other encoders: it can create multiple non-zero values from a single input column. In comparison, encoders like Dummycode (dc) create more columns, but each dc encoder always produces exactly one non-zero value per row. So, when a bow encoder is involved, the number of non-zero values per row is not known upfront, which is problematic for a parallel apply with CSR matrix output.
I added a new field in ColumnEncoder which holds the non-zero count for each row; these counts are known once the build phase is completed.
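
To illustrate why the per-row nnz differs, here is a minimal sketch (hypothetical names, not the SystemDS implementation): for bow, the nnz of a row equals the number of distinct tokens in that row, while a dc encoder contributes exactly one non-zero per row regardless of the value.

```java
import java.util.HashSet;
import java.util.List;

public class NnzSketch {
    // bag-of-words: nnz of a row equals the number of distinct tokens in it
    public static int bowRowNnz(List<String> rowTokens) {
        return new HashSet<>(rowTokens).size();
    }

    // dummycode: exactly one non-zero per row per encoded column,
    // independent of the actual categorical value
    public static int dcRowNnz() {
        return 1;
    }
}
```

This is exactly why a CSR output cannot be pre-allocated before the bow build phase has seen the data.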

The encoding again consists of two phases: a build phase, where we determine the distinct tokens of the column, and an apply phase, where we count the occurrences of each distinct token in each row.
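
The two phases can be sketched as follows (a simplified illustration with hypothetical names, not the actual SystemDS code): build assigns each distinct token a column index, and apply produces per-row token counts keyed by those indices.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BowSketch {
    // build phase: assign each distinct token a column index
    public static Map<String, Integer> build(List<List<String>> rows) {
        Map<String, Integer> dict = new HashMap<>();
        for (List<String> row : rows)
            for (String tok : row)
                dict.putIfAbsent(tok, dict.size());
        return dict;
    }

    // apply phase: count occurrences per row, keyed by dictionary index
    // (assumes all tokens were seen during build)
    public static List<Map<Integer, Integer>> apply(List<List<String>> rows, Map<String, Integer> dict) {
        List<Map<Integer, Integer>> out = new ArrayList<>();
        for (List<String> row : rows) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (String tok : row)
                counts.merge(dict.get(tok), 1, Integer::sum);
            out.add(counts);
        }
        return out;
    }
}
```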

For a parallelised build phase, we need to know the number of distinct tokens upfront to avoid OOM errors caused by many large dictionaries. Similarly to recode, we estimate the number of distinct tokens of the bow dictionary by sampling from a subset of rows. In contrast to recode, this estimation is computationally more expensive per row, because each sampled row must run through the whole tokenisation process. A small experimental evaluation showed that the estimation took significantly longer for bow than for recode, but in the end it did not make up more than 10% of the whole encoding process.
One option to tackle this would be to reduce the number of samples for the bow encoder, but this resulted in a (significantly) higher estimation error.
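
The sampling step can be sketched roughly as below (hypothetical names; the real SystemDS estimator applies a statistical correction on top of the sample count, which is omitted here): each sampled row is tokenised in full before its tokens enter the distinct set, which is what makes the bow estimation more expensive per sampled row than recode's.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class DistinctEstimatorSketch {
    // crude sketch: tokenize a random sample of rows and count the
    // distinct tokens observed in the sample
    public static int sampleDistinct(List<String> rows, int sampleSize, long seed) {
        Random rnd = new Random(seed);
        Set<String> distinct = new HashSet<>();
        for (int i = 0; i < sampleSize && !rows.isEmpty(); i++) {
            String row = rows.get(rnd.nextInt(rows.size()));
            // full tokenisation per sampled row: the costly part for bow
            for (String tok : row.toLowerCase().replaceAll("[^a-z]", " ").split("\\s+"))
                if (!tok.isEmpty())
                    distinct.add(tok);
        }
        return distinct.size();
    }
}
```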

Overall, the parallelisation of a single bow encoder over row partitions showed good runtime improvements (3.4x with 4 physical cores) on the Amazon digital music review dataset (131k reviews, 8m tokens, 127k distinct tokens):

  • single-threaded encoder: 2.5s
  • multi-threaded encoder: 0.73s
  • scikit-learn's CountVectorizer: 3s

The default tokenisation consists of converting to lowercase, replacing all non-alphabetic characters with whitespace, and splitting on whitespace.
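
The default tokenisation described above can be sketched in a few lines (an illustrative standalone version, not the actual SystemDS tokenizer class):

```java
import java.util.Arrays;
import java.util.List;

public class DefaultTokenizerSketch {
    // lowercase, map non-alphabetic characters to whitespace, split on whitespace
    public static List<String> tokenize(String s) {
        String cleaned = s.toLowerCase().replaceAll("[^a-z]+", " ").trim();
        if (cleaned.isEmpty())
            return List.of();
        return Arrays.asList(cleaned.split("\\s+"));
    }
}
```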

The current technique makes two passes over the data (build + apply). The second pass could be avoided by storing the per-row token counts from the build phase and just setting the values in the apply phase. But this results in a much higher memory footprint, since we would additionally have to store a hashmap of token counts for each row.
A next step could be to estimate the memory consumption of this approach using the metadata from the distinct-token estimation, and dynamically choose this technique if enough memory is available.
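
A back-of-the-envelope version of such a memory estimate (all names and constants here are hypothetical, not from the patch): the extra footprint is roughly the number of rows times the average distinct tokens per row times the per-entry overhead of a hashmap entry.

```java
public class SinglePassMemEstimate {
    // hypothetical estimate for keeping per-row token-count hashmaps:
    // rows * avg distinct tokens per row * bytes per hashmap entry
    public static long estimateBytes(long rows, double avgDistinctPerRow, long bytesPerEntry) {
        return (long) (rows * avgDistinctPerRow * bytesPerEntry);
    }
}
```

For the dataset above (131k rows), assuming e.g. 60 distinct tokens per row on average and ~48 bytes per entry, this lands in the hundreds of megabytes, which illustrates why a dynamic, memory-aware choice is attractive.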

The tests include a single bag-of-words transformation, multiple bag-of-words transformations, and bag-of-words combined with other encoders (recode).

@e-strauss e-strauss changed the title [SYSTEMDS-3782] Bag-of-words Encoder for CP [WIP] [SYSTEMDS-3782] Bag-of-words Encoder for CP Oct 18, 2024

codecov bot commented Oct 18, 2024

Codecov Report

Attention: Patch coverage is 89.05109% with 45 lines in your changes missing coverage. Please review.

Project coverage is 71.21%. Comparing base (d0f0837) to head (3264595).
Report is 12 commits behind head on main.

Files with missing lines Patch % Lines
...s/runtime/transform/encode/MultiColumnEncoder.java 85.40% 12 Missing and 8 partials ⚠️
...time/transform/encode/ColumnEncoderBagOfWords.java 91.00% 3 Missing and 15 partials ⚠️
...ntime/transform/encode/ColumnEncoderDummycode.java 55.55% 4 Missing ⚠️
...ime/transform/encode/ColumnEncoderPassThrough.java 60.00% 2 Missing ⚠️
.../sysds/runtime/transform/encode/ColumnEncoder.java 92.30% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2130      +/-   ##
============================================
+ Coverage     71.12%   71.21%   +0.08%     
- Complexity    43023    43150     +127     
============================================
  Files          1446     1447       +1     
  Lines        164151   164554     +403     
  Branches      32001    32086      +85     
============================================
+ Hits         116759   117189     +430     
+ Misses        38240    38207      -33     
- Partials       9152     9158       +6     


@e-strauss e-strauss changed the title [WIP] [SYSTEMDS-3782] Bag-of-words Encoder for CP [SYSTEMDS-3782] Bag-of-words Encoder for CP Oct 18, 2024
@e-strauss
Contributor Author

Most of the missing code coverage comes from the MCSR code in the different encoders, which is currently disabled.

@mboehm7
Contributor

mboehm7 commented Oct 25, 2024

LGTM - thanks for the new encoder @e-strauss. During the merge I fixed unnecessary imports, removed wildcard imports, removed jetbrains annotations, and fixed some minor formatting issues.

@mboehm7 mboehm7 closed this in 654cea9 Oct 25, 2024
@e-strauss e-strauss deleted the bow_squashed branch November 4, 2024 12:34