Model output prediction limit of top 100 #25

Open
dmhenke opened this issue Apr 7, 2025 · 3 comments


dmhenke commented Apr 7, 2025

At present, I do not see how to output more than the top 100 predicted hits on a model. For example:
```r
library(doc2vec)

model <- paragraph2vec(x = df_d2v, type = "PV-DM")
vocab <- summary(model, type = "vocabulary", which = "docs")

sentences <- "my bag of words"
sentences <- setNames(sentences, sentences)
sentences <- strsplit(sentences, split = " ")

model_predictions <- predict(
  model,
  newdata = sentences,
  type = "nearest", which = "sent2doc", top_n = 100)
```

`dim(model_predictions)` is at most 100 rows.

There appears to be no way to output predictions for a model with more than 100 doc_ids in its vocabulary. Is there a workaround for generating predictions against all available doc_ids?

Thank you,
David
jwijffels (Collaborator) commented Apr 7, 2025

If you use sent2doc, at the C++ side an array of length 100 is created - see https://github.com/bnosac/doc2vec/blob/master/src/rcpp_doc2vec.cpp#L151-L164
This array is fixed size; if you need a bigger array, you would have to rewrite a whole part of the C++ code to make it an extensible array. I remember trying this out 5 years ago but stopping due to the amount of work involved in rewriting the code.
That is also why I've put a stop condition at the R side which checks that top_n is not larger than 100: https://github.com/bnosac/doc2vec/blob/master/R/paragraph2vec.R#L380
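In practice the cap surfaces as an error on the R side. A minimal illustration with the model and sentences from the first comment (the error text below is paraphrased, not copied from the package):

```r
## Asking for more than 100 nearest documents is rejected by the R-side
## check before the C++ routine ever runs.
predict(model, newdata = sentences,
        type = "nearest", which = "sent2doc", top_n = 150)
## Error: top_n should be at most 100 (paraphrased)

## Requests up to 100 go through as expected.
nn <- predict(model, newdata = sentences,
              type = "nearest", which = "sent2doc", top_n = 100)
```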


lkmklsmn commented Apr 9, 2025

I want to calculate and export the similarity between a single sentence and ALL docs in a given model. How do you suggest I go about accomplishing this?

jwijffels (Collaborator) commented

You could remove the line at https://github.com/bnosac/doc2vec/blob/master/R/paragraph2vec.R#L380, extend the array to more than 100 elements at https://github.com/bnosac/doc2vec/blob/master/src/rcpp_doc2vec.cpp#L151-L164, rebuild the package and test it out.
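Alternatively, if rebuilding the package is not an option, one possible workaround is to pull the embeddings out of the model and rank all documents yourself. This is only a sketch, assuming `as.matrix(model, which = "docs")` and `predict(..., type = "embedding", which = "sent2doc")` behave as described in the package manual, with `model` and `sentences` as in the first comment:

```r
library(doc2vec)

## Embedding of the query sentence(s); one row per element of `sentences`.
sentence_emb <- predict(model, newdata = sentences,
                        type = "embedding", which = "sent2doc")

## Embeddings of every document the model was trained on (rownames are doc_ids).
doc_emb <- as.matrix(model, which = "docs")

## Cosine similarity between one sentence and ALL documents.
cosine_to_all <- function(a, B) {
  as.vector(B %*% a) / (sqrt(sum(a^2)) * sqrt(rowSums(B^2)))
}
similarities <- cosine_to_all(sentence_emb[1, ], doc_emb)
names(similarities) <- rownames(doc_emb)

## Full ranking over all doc_ids, not limited to the top 100.
ranking <- sort(similarities, decreasing = TRUE)
head(ranking)
```

Because the ranking is done in R on the embedding matrices, it never touches the fixed-size array in the C++ code.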
