
Make levels return a CategoricalArray #425


Open · wants to merge 2 commits into master

Conversation

nalimilan
Member

Having `levels` preserve the eltype of the input is sometimes useful for writing generic code. This is only slightly breaking, as the result still compares equal to the previous behavior of returning unwrapped values.

Fixes #390.

Having `levels` preserve the eltype of the input is sometimes useful
to write generic code. This is only slightly breaking as the result
still compares equal to the previous behavior returning unwrapped values.
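A minimal sketch of the proposed behavior, assuming this branch of CategoricalArrays.jl is installed (the example data is hypothetical):

```julia
using CategoricalArrays

x = categorical(["a", "b", "a"])

# On this branch, `levels(x)` preserves the input's eltype by returning
# a CategoricalVector rather than a plain Vector{String}. The result
# still compares equal to the unwrapped values returned previously,
# which is why the change is only slightly breaking:
lv = levels(x)
@assert lv == ["a", "b"]
```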
@ablaom

ablaom commented May 29, 2025

Further to my posts at #390, I have run MLJ integration tests against this branch of CategoricalArrays.jl.

Only two models fail, both provided by the package BetaML.jl:

"PegasosClassifier (BetaML)", "PerceptronClassifier (BetaML)"

I will investigate further shortly and report back here. cc @sylvaticus

I have separately tested CategoricalDistributions, MLJModelInterface, and MLJBase against this branch, with no failures.

@sylvaticus

I'll have a look too, but I need a few days.

@nalimilan
Member Author

Thanks. In parallel, let's see whether Nanosoldier can check all direct dependencies:

@nanosoldier `runtests(ALL)`

@nanosoldier

Update on PkgEvalJob 87b50fc vs. 11d43c1: Accepted

@nalimilan
Member Author

Better check against current release:

@nanosoldier `runtests(ALL, vs = "#v0.10.8")`

@nanosoldier

Update on PkgEvalJob 87b50fc vs. 99faa56: Accepted

@nanosoldier

Update on PkgEvalJob 87b50fc vs. 11d43c1: Running

@nanosoldier

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

✖ Packages that failed

14 packages failed only on the current version.

  • Package fails to precompile: 1 package
  • Package has test failures: 3 packages
  • Package tests unexpectedly errored: 9 packages
  • Tests became inactive: 1 package

44 packages failed on the previous version too.

✔ Packages that passed tests

57 packages passed tests on the previous version too.

➖ Packages that were skipped altogether

5 packages were skipped on the previous version too.

@nanosoldier

Update on PkgEvalJob 87b50fc vs. 99faa56: Running

@nanosoldier

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

✖ Packages that failed

17 packages failed only on the current version.

  • Package fails to precompile: 1 package
  • Package has test failures: 3 packages
  • Package tests unexpectedly errored: 10 packages
  • Tests became inactive: 1 package
  • Test duration exceeded the time limit: 2 packages

41 packages failed on the previous version too.

✔ Packages that passed tests

1 package passed tests only on the current version.

  • Other: 1 package

56 packages passed tests on the previous version too.

➖ Packages that were skipped altogether

5 packages were skipped on the previous version too.

@nalimilan
Member Author

OK, so 17 new failures with this PR compared with 0.10.8, and 14 new failures when comparing this PR against current master. According to another run, there are 13 new failures when comparing master with 0.10.8. A significant part seems to be related to MLJ.

So it doesn't look like this PR is really more breaking than the previous ones. About 40 dependent packages still pass their tests. Unfortunately, an equivalent number failed even with 0.10.8 (due to using Julia 1.11?), so we can't tell whether they are affected by the changes or not, but results on other packages probably give a representative picture of the breakage rate.

@nalimilan
Member Author

@sylvaticus Do you know when you'll have the time to look at this?

@sylvaticus

sylvaticus commented Jun 5, 2025

@sylvaticus Do you know when you'll have the time to look at this?

Hello @nalimilan: how can I test a pull request of another package against my package?

Currently, the two MLJ interface functions for the BetaML PerceptronClassifier model are:

function MMI.fit(model::PerceptronClassifier, verbosity, X, y)
    x = MMI.matrix(X)                     # convert table to matrix
    allClasses = levels(y)
    typeof(verbosity) <: Integer || error("Verbosity must be an integer. Current \"steps\" are 0, 1, 2 and 3.")
    verbosity = mljverbosity_to_betaml_verbosity(verbosity)
    fitresult = BetaML.Perceptron.perceptron(x, y; θ=model.initial_coefficients, θ₀=model.initial_constant,
                                             T=model.epochs, nMsgs=0, shuffle=model.shuffle,
                                             force_origin=model.force_origin,
                                             return_mean_hyperplane=model.return_mean_hyperplane,
                                             rng=model.rng, verbosity=verbosity)
    cache = nothing
    report = nothing
    return (fitresult, allClasses), cache, report
end

and

function MMI.predict(model::Union{PerceptronClassifier,PegasosClassifier}, fitresult, Xnew)
    fittedModel      = fitresult[1]
    classes          = fittedModel.classes
    allClasses       = fitresult[2] # as `classes` does not include classes unseen at training time
    nLevels          = length(allClasses)
    nRecords         = MMI.nrows(Xnew)
    modelPredictions = BetaML.Perceptron.predict(MMI.matrix(Xnew), fittedModel.θ, fittedModel.θ₀, fittedModel.classes) # vector of dictionaries y_element => prob
    predMatrix       = zeros(Float64,(nRecords,nLevels))
    # Transform the predictions from a vector of dictionaries to a matrix
    # where the rows are the PMF of each record
    for n in 1:nRecords
        for (c,cl) in enumerate(allClasses)
            predMatrix[n,c] = get(modelPredictions[n],cl,0.0)
        end
    end
    predictions = MMI.UnivariateFinite(allClasses,predMatrix,pool=missing)
    return predictions
end

Perhaps it is enough to change my calls from levels(y) to sort(unique(y))? What do you think?

@nalimilan
Member Author

You can do ]add CategoricalArrays#nl/levels.

With this PR, levels(y) returns the same thing as sort(unique(y)), except that it keeps unused levels (as before). To get back the previous behavior you could do unwrap.(levels(y)). But without knowing the error I can't really help.
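A minimal sketch of both options, assuming this branch is installed via `]add CategoricalArrays#nl/levels` (the example data is hypothetical):

```julia
using CategoricalArrays

y = categorical(["yes", "no", "yes"])

# With this PR, `levels(y)` compares equal to `sort(unique(y))`;
# note that `levels` additionally keeps unused levels, as before:
@assert levels(y) == sort(unique(y))

# To recover the previous behavior of plain unwrapped values:
@assert unwrap.(levels(y)) == ["no", "yes"]
```

Both assertions also hold on released versions, since equality between wrapped and unwrapped values is preserved.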

@sylvaticus
Copy link

Thank you. It should now be solved in BetaML#Master.

@nalimilan
Member Author

OK. Do you know why this is needed? In theory the idea was that having levels return a CategoricalVector shouldn't be a problem in most cases, and useful in others.

@ablaom Is this PR OK for you?

@sylvaticus

sylvaticus commented Jun 5, 2025 via email

@ablaom

ablaom commented Jun 9, 2025

I have not had time to fully investigate and won't have time for a week or so. It's possible MLJ will have to review every package that uses these methods, which will be unpleasant, but I think the change makes sense, so long as it is tagged as breaking. This is definitely breaking behaviour.


Successfully merging this pull request may close these issues.

Make levels return a CategoricalArray
5 participants