[SYSTEMDS-3782] Bag-of-words Encoder for CP #2130
This patch adds a new feature transformation, Bag-of-words (bow), to SystemDS' parallel feature transformation framework UPLIFT. The Bag-of-words encoder works similarly to scikit-learn's CountVectorizer ( https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html ).
Currently, the operation is only supported for CP. I had to adapt the framework a little, because the bow encoder behaves differently from other encoders and can create multiple non-zero values from a single input column. In comparison, encoders like Dummycode (dc) create more columns but always result in exactly one non-zero value per dc encoder. So, when a bow encoder is involved, the number of non-zero values (nnz) per row is not known upfront, which is problematic for a parallel apply with CSR matrix output.
I added a new field to the ColumnEncoder which contains the nnz for each row; it is known once the build is completed.
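To illustrate why these per-row nnz counts matter, here is a minimal sketch (plain Java with hypothetical names, not the actual SystemDS classes) of how the per-row counts reported by all encoders could be prefix-summed into CSR row pointers, so that every thread knows its write offsets before the parallel apply:

```java
import java.util.Arrays;

public class CsrOffsetSketch {
    /**
     * Combine the per-row nnz counts reported by each encoder (known after
     * its build phase) into CSR row pointers for the output matrix.
     * rowPtr[r] is the position of the first non-zero of row r, so each
     * thread can write its row partition independently during apply.
     */
    static int[] computeRowPointers(int[][] nnzPerEncoder, int numRows) {
        int[] rowPtr = new int[numRows + 1];
        for (int r = 0; r < numRows; r++) {
            int nnzRow = 0;
            for (int[] nnz : nnzPerEncoder)
                nnzRow += nnz[r]; // e.g., always 1 for recode/dummycode, variable for bow
            rowPtr[r + 1] = rowPtr[r] + nnzRow;
        }
        return rowPtr;
    }

    public static void main(String[] args) {
        // two encoders: a dummycode-like one (always 1 nnz) and a bow-like one (variable nnz)
        int[][] nnz = { {1, 1, 1}, {4, 0, 2} };
        System.out.println(Arrays.toString(computeRowPointers(nnz, 3))); // [0, 5, 6, 9]
    }
}
```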
The encoding again consists of two phases: a build phase, where we find the distinct tokens in the column, and an apply phase, where we count the occurrences of each distinct token in each row.
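The following is a minimal, self-contained sketch of these two phases (illustrative only, with simplified whitespace tokenisation and without the actual SystemDS encoder API):

```java
import java.util.*;

public class BowEncoderSketch {
    private final Map<String, Integer> dict = new HashMap<>(); // token -> output column index
    private final int[] nnzPerRow;                             // per-row nnz, known after build

    BowEncoderSketch(int numRows) { nnzPerRow = new int[numRows]; }

    /** Build phase: collect the distinct tokens of the column and the per-row nnz counts. */
    void build(List<String> column) {
        for (int r = 0; r < column.size(); r++) {
            Set<String> distinctInRow = new HashSet<>(tokenize(column.get(r)));
            for (String t : distinctInRow)
                dict.putIfAbsent(t, dict.size());
            nnzPerRow[r] = distinctInRow.size(); // one non-zero per distinct token in the row
        }
    }

    /** Apply phase: count the occurrences of each dictionary token in one row. */
    int[] apply(String text) {
        int[] counts = new int[dict.size()];
        for (String t : tokenize(text)) {
            Integer col = dict.get(t);
            if (col != null)
                counts[col]++;
        }
        return counts;
    }

    // simplified whitespace tokenisation; the default tokenisation is described further below
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.toLowerCase().split("\\s+"))
            if (!t.isEmpty())
                tokens.add(t);
        return tokens;
    }
}
```

In the actual framework, apply would write the counts directly into the shared CSR output at the offsets derived from the per-row nnz counts shown above, instead of returning a dense row.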
For a parallelised build phase, we need to know the number of distinct tokens upfront to avoid OOM errors caused by many large dictionaries. Similarly to recode, we estimate the number of distinct tokens of the bow dictionary by sampling a subset of rows. In contrast to recode, this estimation is computationally more expensive per row, because each sampled row has to go through the whole tokenisation process. A small experimental evaluation showed that the estimation took significantly longer for bow than for recode, but in the end it did not make up more than 10% of the whole encoding process.
One option to tackle this would be to reduce the number of sampled rows for the bow encoder, but this resulted in a (significantly) higher estimation error.
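For intuition only, the snippet below is a deliberately naive illustration of sample-based distinct-token estimation (not the estimator actually used); it mainly shows that every sampled row already pays the full tokenisation cost, unlike recode where a row is a single key:

```java
import java.util.*;

public class DistinctTokenEstimationSketch {
    /**
     * Naive illustration: tokenize only a sampled subset of rows, count the
     * distinct tokens in the sample, and scale by the sampling fraction.
     * Real sample-based estimators are more sophisticated; the point here is
     * the per-row tokenisation cost inside the sampling loop.
     */
    static long estimateDistinctTokens(List<String> column, double sampleFraction, long seed) {
        Random rnd = new Random(seed);
        Set<String> distinctInSample = new HashSet<>();
        for (String row : column)
            if (rnd.nextDouble() < sampleFraction) // full tokenisation for every sampled row
                distinctInSample.addAll(Arrays.asList(row.toLowerCase().split("\\s+")));
        return Math.round(distinctInSample.size() / sampleFraction);
    }
}
```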
Overall, the parallelisation of a single bow encoder over row partitions showed good runtime improvements (3.4x with 4 physical cores) on the Amazon digital music review dataset (131k reviews, 8M tokens, 127k distinct tokens).
The default tokenisation consists of converting the text to lowercase, replacing all non-alphabetic characters with whitespace, and splitting on whitespace.
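A direct Java rendition of that default tokenisation (illustrative; the actual implementation may differ in details):

```java
import java.util.ArrayList;
import java.util.List;

public class DefaultTokenizeSketch {
    /** Lowercase, replace non-alphabetic characters with whitespace, split on whitespace. */
    static List<String> tokenize(String text) {
        String normalized = text.toLowerCase().replaceAll("[^a-z]", " ");
        List<String> tokens = new ArrayList<>();
        for (String t : normalized.split("\\s+"))
            if (!t.isEmpty())
                tokens.add(t);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Loved it!! Best album of 2023...")); // [loved, it, best, album, of]
    }
}
```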
The current technique makes two passes over the data (build + apply). This could be avoided by storing the per-row token counts from the build phase and just setting the values in the apply phase, but it results in a much higher memory footprint, since we would additionally have to keep a hashmap of token counts for each row.
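A minimal sketch of that single-pass alternative (hypothetical, not part of this patch) makes the memory trade-off explicit: one token-count map per row has to survive from build to apply.

```java
import java.util.*;

public class SinglePassBowSketch {
    // One HashMap per row is retained between build and apply; this is the extra
    // memory footprint mentioned above, proportional to the tokens per row.
    final List<Map<String, Integer>> rowCounts = new ArrayList<>();
    final Map<String, Integer> dict = new HashMap<>();

    void build(List<String> column) {
        for (String row : column) {
            Map<String, Integer> counts = new HashMap<>();
            for (String t : row.toLowerCase().split("\\s+")) { // simplified tokenisation
                if (t.isEmpty()) continue;
                dict.putIfAbsent(t, dict.size());
                counts.merge(t, 1, Integer::sum);
            }
            rowCounts.add(counts); // kept in memory until apply
        }
    }

    /** Apply only copies the stored counts into the output row; no second pass over the text. */
    int[] apply(int row) {
        int[] out = new int[dict.size()];
        for (Map.Entry<String, Integer> e : rowCounts.get(row).entrySet())
            out[dict.get(e.getKey())] = e.getValue();
        return out;
    }
}
```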
A next step could be to estimate the memory consumption of this approach using the metadata from the distinct-token estimation, and to dynamically choose this technique if enough memory is available.
The tests include tests with a single bag-of-words transformation, multiple bag-of-words transformations, and combinations with other encoders (recode).